[developers] character-based discriminants

Woodley Packard sweaglesw at sweaglesw.org
Sat Nov 23 20:29:58 CET 2019


Hi Stephan,

My initial reaction to the notion of character-based discriminants is (1) that it will not solve your immediate problem without a certain amount of custom tooling to convert old discriminants to new ones in a way that is sensitive to how the current punctuation rules work, i.e. a given chart vertex will have to be able to map to several different character positions depending on how much punctuation has been cliticized so far.  The twig-shaped discriminants used by FFTB will in some cases have to be bifurcated into two or more discriminants as well.  Also, (2) this approach loses the (theoretical, if perhaps not recently used) ability to treebank a nonlinear, lattice-shaped input, e.g. from an ASR system.  I could imagine treebanking lattices from other sources as well, perhaps an image caption generator.
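
To make that concrete, here is a minimal sketch (Python, with made-up field names rather than anything FFTB actually uses) of the mapping such a conversion tool would have to compute:

    import string

    PUNCTUATION = set(string.punctuation)

    def vertex_to_char_candidates(old_leaves):
        """old_leaves: dicts with 'form', 'cfrom', 'cto', 'vfrom', 'vto'
        (illustrative names for the surface form, +FROM/+TO character
        offsets, and chart vertex span of a leaf in the old universe).
        Returns, for each old right-hand chart vertex, the set of
        character positions it could correspond to once trailing
        punctuation becomes a separate clitic leaf."""
        candidates = {}
        for leaf in old_leaves:
            positions = {leaf['cto']}         # boundary after the full token
            form, cto = leaf['form'], leaf['cto']
            while form and form[-1] in PUNCTUATION:
                form, cto = form[:-1], cto - 1
                positions.add(cto)            # boundary before each mark
            candidates.setdefault(leaf['vto'], set()).update(positions)
            # a symmetric loop over leading punctuation would be needed
            # for 'vfrom' (opening quotes, parentheses, and the like)
        return candidates

    # e.g. the old leaf {'form': 'oslo,', 'cfrom': 10, 'cto': 15,
    # 'vfrom': 2, 'vto': 3} makes vertex 3 map to {14, 15}: before or
    # after the cliticized comma.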

Given the custom tooling required for updating the discriminants, I’m not sure switching to character-based anchoring would be less painful than having that tool compute the new chart vertex anchoring instead — though I could be wrong.  What other arguments can be made in favor of character-based discriminants?

In terms of support from FFTB, I think there are relatively few places in the code that assume the discriminants’ from/to are interpretable beyond matching the from/to values of the `edge’ relation.  I think I would implement this by (optionally, I suppose, since presumably other grammars won’t want to do this, at least for now) replacing the from/to on edges read from the profile with character positions and more or less pretending that there is a chart vertex for every character position.  Barring unforeseen complications, that wouldn’t be too hard.
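
A rough sketch of the idea (Python rather than the actual FFTB C code, and with invented names rather than the real profile schema):

    from dataclasses import dataclass, field

    @dataclass
    class Leaf:
        cfrom: int            # +FROM character offset (illustrative)
        cto: int              # +TO character offset (illustrative)

    @dataclass
    class Edge:
        frm: int              # chart vertex, or character position
        to: int
        leaves: list = field(default_factory=list)

    def anchor_edges_by_character(edges, character_anchoring=True):
        """Optionally overwrite each edge's chart-vertex from/to with the
        character span covered by its leaves, so that downstream code can
        pretend every character position is a chart vertex."""
        if character_anchoring:
            for edge in edges:
                edge.frm = min(leaf.cfrom for leaf in edge.leaves)
                edge.to = max(leaf.cto for leaf in edge.leaves)
        return edges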

Woodley

> On Nov 23, 2019, at 5:58 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> hi again, woodley,
> 
> dan and i are currently exploring a 'makeover' of ERG input
> processing, with the overall goal of increased compatibility with
> mainstream assumptions about tokenization.
> 
> among other things, we would like to move to the revised (i.e.
> non-venerable) PTB (and OntoNotes and UD) tokenization conventions and
> avoid subsequent re-arranging of segmentation in token mapping.  this
> means we would have to move away from the pseudo-affixation treatment
> of punctuation marks to a 'pseudo-cliticization' approach, meaning that
> punctuation marks are lexical entries in their own right and attach
> via binary constructions (rather than as lexical rules).  the 'clitic'
> metaphor, here, is intended to suggest that these lexical entries can
> only attach at the bottom of the derivation, i.e. to non-clitic
> lexical items immediately to their left (e.g. in the case of a comma)
> or to their right (in the case of, say, an opening quote or
> parenthesis).
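> 
> to make the intended change in yields concrete, a toy illustration
> (not the actual ERG token yields):
> 
>     # current universe: punctuation pseudo-affixed onto word tokens
>     old_yield = ['"oslo,"', 'he', 'said.']
>     # revised universe: PTB/OntoNotes/UD-style tokens, each mark a
>     # lexical entry of its own, attaching low via a binary
>     # construction: the comma and closing quote to the non-clitic
>     # item on their left, the opening quote to the item on its right
>     new_yield = ['"', 'oslo', ',', '"', 'he', 'said', '.']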
> 
> dan is currently visiting oslo, and we would like to use the
> opportunity to estimate the cost of moving to such a revised universe.
> treebank maintenance is a major concern here, as such a radical change
> in the yields of virtually all derivations would render discriminants
> invalid when updating to the new forests.  i believe a cute idea has
> emerged that, we optimistically believe, might eliminate much of that
> concern: character-based discriminant positions, instead of our
> venerable way of counting chart vertices.
> 
> for the ERG at least, we believe that leaf nodes in all derivations
> are reliably annotated with character start and end positions (+FROM
> and +TO, as well as the +ID lists on token feature structures).  these
> sub-string indices will hardly be affected by the above change to
> tokenization (except for cases where our current approach to splitting
> at hyphens and slashes first in token mapping leads to overlapping
> ranges).  hence if discriminants were anchored over character ranges
> instead of chart cells ... i expect the vast majority of them might
> just carry over?
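> 
> as a sketch of what 'carrying over' could look like (python-ish, with
> illustrative attribute names only):
> 
>     def carries_over(discriminant, new_forest_edges):
>         """an old discriminant, re-anchored to the character range of
>         the constituent it was recorded on, still applies if some edge
>         in the new forest covers exactly that range with the same rule
>         or lexical type."""
>         return any(edge.cfrom == discriminant.cfrom and
>                    edge.cto == discriminant.cto and
>                    edge.label == discriminant.label
>                    for edge in new_forest_edges)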
> 
> we would be grateful if you (and others too, of course) could give the
> above idea some critical thought and look for possible obstacles that
> dan and i may just be overlooking?  technically, i imagine one would
> have to extend FFTB to (optionally) extract discriminant start and end
> positions from the sub-string 'coverage' of each constituent, possibly
> do a one-time conversion of existing treebanks to character-based
> indexing, and then
> update into the new universe using character-based matching.  does
> such an approach seem feasible to you in principle?
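> 
> the one-off conversion might then look roughly like this (again with
> illustrative names, not the actual profile schema):
> 
>     def convert_discriminant(disc, old_edges):
>         """rewrite a chart-vertex-anchored discriminant as a
>         character-anchored one, using the old forest, where both
>         anchorings are still available side by side."""
>         for edge in old_edges:
>             if (edge.vfrom == disc.frm and edge.vto == disc.to
>                     and edge.label == disc.label):
>                 disc.cfrom, disc.cto = edge.cfrom, edge.cto
>                 return disc
>         raise LookupError('no matching edge for this discriminant')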
> 
> cheers, oe



