[developers] Adjusting LNK values to space-delimited tokens

Mon Jun 26 05:03:17 CEST 2017

Have you considered passing a pre-tokenized string (produced by REPP or otherwise) into ACE?  Character spans will then automatically be produced relative to that string.  Or maybe I misunderstood your goal?

Woodley

> On Jun 25, 2017, at 3:14 PM, Michael Wayne Goodman <goodmami at uw.edu> wrote:
> 
> Hi all,
> 
> A colleague of mine is attempting to use ERG semantic outputs in a system originally created for another representation, and his system requires the semantics to be paired with a tokenized string (e.g., with punctuation separated from the word tokens).
> 
> I can get the space-delimited tokenized string, e.g., from REPP or from ACE with the -E option, but then the CFROM/CTO values in the MRS no longer align to the string. The initial tokens ('p-input' in the 'parse' table of a [incr tsdb()] profile) give the span of each token in the original string, which I could use to compute the adjusted spans. This seems simple enough, but it gets complicated: some separated tokens should still count as a single range (e.g., "could n't", where '_can_v_modal' and 'neg' both select the full span of "could n't"), while others I want separated, like punctuation (but not all punctuation, such as the ' in "The kids' toys are in the closet.").
> 
> Has anyone else thought about this problem and can share some solutions? Or, even better, code to realign EPs to the tokenized string?
> 
> -- 
> Michael Wayne Goodman
> Ph.D. Candidate, UW Linguistics
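
The span adjustment described in the quoted message could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from ACE, REPP, or PyDelphin: it assumes the initial token spans have already been read out of 'p-input' into (start, end, form) tuples, and the function names, the example token list, and the `strippable` punctuation set are all hypothetical. An EP span is mapped to the first and last tokens it covers, so "could n't" stays a single range, and edge tokens are dropped only if they are in `strippable`, so an apostrophe token is kept.

```python
from bisect import bisect_right


def tokenized_string_and_spans(tokens):
    """Join token forms with single spaces and record each token's new span.

    `tokens` is a list of (start, end, form) tuples giving each token's
    character span in the ORIGINAL string, sorted by start (assumed input).
    """
    new_spans = []
    pos = 0
    for _, _, form in tokens:
        new_spans.append((pos, pos + len(form)))
        pos += len(form) + 1  # +1 for the space delimiter
    return " ".join(form for _, _, form in tokens), new_spans


def realign(cfrom, cto, tokens, new_spans, strippable=frozenset(".,;:!?")):
    """Map an original (cfrom, cto) span onto the tokenized string.

    `strippable` is an assumed policy: punctuation forms to exclude when
    they sit at the edge of a span. Anything else (e.g. "'") is kept.
    """
    starts = [start for start, _, _ in tokens]
    # Last token starting at or before cfrom, and last token starting before cto.
    first = max(bisect_right(starts, cfrom) - 1, 0)
    last = max(bisect_right(starts, cto - 1) - 1, first)
    # Trim strippable punctuation from either edge, never emptying the span.
    while first < last and tokens[last][2] in strippable:
        last -= 1
    while first < last and tokens[first][2] in strippable:
        first += 1
    return new_spans[first][0], new_spans[last][1]


# Hand-written token spans for "She couldn't go." (illustrative, not ACE output).
tokens = [(0, 3, "She"), (4, 9, "could"), (9, 12, "n't"), (13, 15, "go"), (15, 16, ".")]
tokenized, new_spans = tokenized_string_and_spans(tokens)

start, end = realign(4, 12, tokens, new_spans)   # span of "couldn't"
print(tokenized[start:end])
start, end = realign(13, 16, tokens, new_spans)  # span of "go."
print(tokenized[start:end])
```

Whether a given punctuation mark should be trimmed is a policy decision that this sketch reduces to a fixed set; a real solution would probably need to consult the grammar's token mapping rules instead.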