[developers] Extracting surface form of tokens from derivation trees

Woodley Packard sweaglesw at sweaglesw.org
Thu Apr 24 21:01:51 CEST 2014

The MRS CFROM/CTO values come straight from the +FROM and +TO properties of the post-token-mapping tokens dominated by the edge on which each EP is introduced.  Unfortunately, they do not uniquely identify such a token.  For example:

We admired the sky-blue water.

This yields a 'sky-' token and a 'blue' token, both with identical +FROM and +TO, and correspondingly a _sky_n_1_rel EP and a _blue_a_1_rel EP, both with identical CFROM/CTO.  The span "sky-blue" is considered a single token before token mapping (e.g. as the input to TNT), so the answer to your question (b) is yes.  I don't see this as a problem for (a), if what you want is a correspondence between EPs and TNT-level tokens, since in this case the EPs still point to the input token spans.
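
To make the ambiguity in the "sky-blue" example concrete, here is a rough sketch in plain Python.  The token list is written out by hand, and the character offsets are my assumption (0-based, end-exclusive, computed as Python slices of the sentence); this is not the output of any actual tool, and `tokens_by_span` is just a hypothetical helper name.

```python
from collections import defaultdict

sentence = "We admired the sky-blue water."

# Hand-written stand-in for the post-token-mapping tokens, as
# (form, +FROM, +TO).  Offsets are assumed, not taken from a real parse.
erg_tokens = [
    ("We", 0, 2),
    ("admired", 3, 10),
    ("the", 11, 14),
    ("sky-", 15, 23),   # both halves keep the span of the
    ("blue", 15, 23),   # original "sky-blue" input token
    ("water.", 24, 30),
]


def tokens_by_span(tokens):
    """Group token forms by their (+FROM, +TO) span."""
    by_span = defaultdict(list)
    for form, start, end in tokens:
        by_span[(start, end)].append(form)
    return dict(by_span)


spans = tokens_by_span(erg_tokens)

# Spans shared by more than one token cannot be used to pick out a
# single token from an EP's CFROM/CTO alone.
ambiguous = {span: forms for span, forms in spans.items() if len(forms) > 1}
```

Here `ambiguous` ends up holding only the (15, 23) span with both 'sky-' and 'blue', which is exactly the case where CFROM/CTO gets you back to the input span but not to one particular ERG token.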

As for punctuation attachment, "water" and "." are separate tokens for TNT but a single token for the ERG, so the _water_n_1_rel EP comes out with a CFROM/CTO that includes the period.  I don't know of any way around that.
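
If what you need is the set of TNT-level tokens behind each EP, a simple containment test on the character spans might be enough.  The sketch below is plain Python under my own assumptions: the TNT token list and offsets are written by hand (0-based, end-exclusive), and `covered_tokens` is a hypothetical helper, not part of any existing tool.

```python
# Hand-written stand-in for the TNT-level (pre-token-mapping) tokens of
# "We admired the sky-blue water." as (form, from, to); offsets assumed.
tnt_tokens = [
    ("We", 0, 2),
    ("admired", 3, 10),
    ("the", 11, 14),
    ("sky-blue", 15, 23),
    ("water", 24, 29),
    (".", 29, 30),
]


def covered_tokens(cfrom, cto, tokens):
    """Return the forms of all tokens lying entirely inside [cfrom, cto)."""
    return [form for form, start, end in tokens
            if start >= cfrom and end <= cto]


# An EP whose span includes the attached period covers two TNT tokens.
water_ep = covered_tokens(24, 30, tnt_tokens)

# The 'sky-' and 'blue' EPs share one span and map to one TNT token.
sky_blue_ep = covered_tokens(15, 23, tnt_tokens)
```

With these assumed offsets, `water_ep` contains both "water" and ".", so the punctuation would have to be filtered out afterwards by some other criterion, while `sky_blue_ep` contains just "sky-blue".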

I'm not sure that helps much with your problem, but maybe it does.

> a) has anyone done this with the Wikiwoods data?  is it doable?
> b) are there cases where one TNT token corresponds to two ERG tokens?

