[developers] character-based discriminants

Sat Nov 23 14:58:20 CET 2019

hi again, woodley,

dan and i are currently exploring a 'makeover' of ERG input
processing, with the overall goal of increased compatibility with
mainstream assumptions about tokenization.

among other things, we would like to move to the revised (i.e.
non-venerable) PTB (and OntoNotes and UD) tokenization conventions and
avoid subsequent re-arranging of segmentation in token mapping.  this
means we would have to move away from the pseudo-affixation treatment
of punctuation marks to a 'pseudo-cliticization' approach, meaning that
punctuation marks are lexical entries in their own right and attach
via binary constructions (rather than as lexical rules).  the 'clitic'
metaphor, here, is intended to suggest that these lexical entries can
only attach at the bottom of the derivation, i.e. to non-clitic
lexical items immediately to their left (e.g. in the case of a comma)
or to their right (in the case of, say, an opening quote or
parenthesis).

dan is currently visiting oslo, and we would like to use the
opportunity to estimate the cost of moving to such a revised universe.
treebank maintenance is a major concern here, as such a radical change
in the yields of virtually all derivations would render discriminants
invalid when updating to the new forests.  a cute idea has
emerged that, we optimistically believe, might eliminate much of that
concern: character-based discriminant positions, instead of our
venerable way of counting chart vertices.

for the ERG at least, we believe that leaf nodes in all derivations
are reliably annotated with character start and end positions (+FROM
and +TO, as well as the +ID lists on token feature structures).  these
sub-string indices will hardly be affected by the above change to
tokenization (except for cases where our current approach to splitting
at hyphens and slashes first in token mapping leads to overlapping
ranges).  hence if discriminants were anchored over character ranges
instead of chart cells ... i expect the vast majority of them might
just carry over?
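
to make the intuition concrete, here is a minimal sketch in python (not
taken from any existing tool).  the token tuples, the example sentence,
and the helper name char_span are all assumptions for illustration; the
only thing borrowed from the above is that each token carries +FROM and
+TO character positions.  it shows one and the same constituent under
the current and the revised tokenization: the vertex span changes, the
character range does not.

```python
from typing import List, Tuple

# (surface form, +FROM, +TO); a simplified, hypothetical token record
Token = Tuple[str, int, int]


def char_span(tokens: List[Token], start_vertex: int, end_vertex: int) -> Tuple[int, int]:
    """Map a chart-vertex span onto the character range it covers."""
    covered = tokens[start_vertex:end_vertex]
    return (covered[0][1], covered[-1][2])


# input string: "Hello, world."
# current universe: punctuation is part of the neighbouring token
old_tokens = [("Hello,", 0, 6), ("world.", 7, 13)]
# revised universe: punctuation marks are tokens in their own right
new_tokens = [("Hello", 0, 5), (",", 5, 6), ("world", 7, 12), (".", 12, 13)]

# the constituent covering 'Hello,' sits in different chart cells ...
old_cell = (0, 1)
new_cell = (0, 2)
assert old_cell != new_cell

# ... but over the same character range in both universes
assert char_span(old_tokens, *old_cell) == char_span(new_tokens, *new_cell)
```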

we would be grateful if you (and others too, of course) could give the
above idea some critical thought and look for possible obstacles that
dan and i may just be overlooking?  technically, i imagine one would
have to extend FFTB to (optionally) extract discriminant start and end
positions from the sub-string 'coverage' of each constituent, possibly
convert existing treebanks once to character-based indexing, and then
update into the new universe using character-based matching.  does
such an approach seem feasible to you in principle?
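
the steps above could be sketched roughly as follows, again in python
and again purely as an illustration: this is not FFTB code, and the
Node class, the (label, from, to) discriminant signature, and the labels
PHRASE, PUNCT_LEXRULE, and PUNCT_BINARY are all hypothetical.  the
sketch derives the character 'coverage' of each constituent from its
leaves and keeps those old discriminants that find a match in the new
universe; for brevity it matches against a single derivation rather
than a full forest.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Signature = Tuple[str, int, int]  # (label, character start, character end)


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    start: int = -1  # only meaningful on leaves (+FROM)
    end: int = -1    # only meaningful on leaves (+TO)


def coverage(node: Node) -> Tuple[int, int]:
    """Character range of a constituent, computed from its leaves."""
    if not node.children:
        return (node.start, node.end)
    spans = [coverage(child) for child in node.children]
    return (min(s for s, _ in spans), max(e for _, e in spans))


def signatures(node: Node) -> Set[Signature]:
    """Character-anchored signatures of all constituents below node."""
    found = {(node.label, *coverage(node))}
    for child in node.children:
        found |= signatures(child)
    return found


def carry_over(old: List[Signature], new_tree: Node) -> List[Signature]:
    """Old discriminants that still match a constituent in the new tree."""
    available = signatures(new_tree)
    return [sig for sig in old if sig in available]


# "Hello, world." in the revised universe, with punctuation attached
# through a (hypothetical) binary construction
new_tree = Node("PHRASE", [
    Node("PUNCT_BINARY", [Node("Hello", start=0, end=5), Node(",", start=5, end=6)]),
    Node("PUNCT_BINARY", [Node("world", start=7, end=12), Node(".", start=12, end=13)]),
])

# discriminants recorded in the current universe, already converted to
# character positions
old_discriminants = [("PHRASE", 0, 13), ("PUNCT_LEXRULE", 0, 6)]

kept = carry_over(old_discriminants, new_tree)
```

in this toy example the phrasal discriminant survives, whereas the one
naming the old punctuation lexical rule finds no match, even though its
character range is unchanged.  how to treat discriminants of that kind
is left open by the sketch.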

cheers, oe