[developers] Adjusting LNK values to space-delimited tokens

Mon Jun 26 00:14:01 CEST 2017

Hi all,

A colleague of mine is attempting to use ERG semantic outputs in a system
originally created for another representation, and his system requires the
semantics to be paired with a tokenized string (e.g., with punctuation
separated from the word tokens).

I can get the space-delimited tokenized string, e.g., from repp or from ACE
with the -E option, but then the CFROM/CTO values in the MRS no longer
align to the string. The initial tokens ('p-input' in the 'parse' table of
a [incr tsdb()] profile) can tell me the span of individual tokens in the
original string, which I could use to compute the adjusted spans. This
seems simple enough, but it gets complicated: some separated tokens should
still count as a single range (e.g., "could n't", where '_can_v_modal' and
'neg' both select the full span of "could n't"), while others I want
separated, like punctuation (but not all punctuation; e.g., not the ' in
"The kids' toys are in the closet.").
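
To make the basic span adjustment concrete, here is a minimal sketch. It
assumes the initial tokens have already been read into (form, start, end)
tuples, with start/end indexing the original string; those tuples and the
function names are hypothetical, not part of any DELPH-IN tool. It widens
each CFROM/CTO range to the tokens it overlaps and does not yet decide
which punctuation should be split off.

```python
# Hypothetical input: (form, start, end), where start/end are character
# offsets of the token in the ORIGINAL (untokenized) string.

def tokenized_string(tokens):
    """Join the token forms with single spaces."""
    return " ".join(form for form, _, _ in tokens)


def new_token_spans(tokens):
    """Compute each token's (start, end) in the space-delimited string."""
    spans = []
    pos = 0
    for form, _, _ in tokens:
        spans.append((pos, pos + len(form)))
        pos += len(form) + 1  # skip the single space delimiter
    return spans


def realign_span(cfrom, cto, tokens):
    """Map an original CFROM/CTO range onto the space-delimited string.

    The result runs from the start of the first overlapped token to the
    end of the last one; returns None if no token overlaps the range.
    """
    spans = new_token_spans(tokens)
    covered = [i for i, (_, start, end) in enumerate(tokens)
               if start < cto and end > cfrom]
    if not covered:
        return None
    return (spans[covered[0]][0], spans[covered[-1]][1])


# Original string: "The dog couldn't bark."
tokens = [("The", 0, 3), ("dog", 4, 7), ("could", 8, 13),
          ("n't", 13, 16), ("bark", 17, 21), (".", 21, 22)]

print(tokenized_string(tokens))      # The dog could n't bark .
print(realign_span(8, 16, tokens))   # (8, 17): "could n't" as one range
print(realign_span(17, 22, tokens))  # (18, 24): "bark ." period included
```

The last call shows the open problem: a range that covered "bark." in the
original string still covers the period after realignment, so trimming
punctuation (while keeping cases like "kids'") would need an extra policy
on top of this.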

Has anyone else thought about this problem and found a solution they can
share? Or, even better, code to realign EPs to the tokenized string?

-- 
Michael Wayne Goodman
Ph.D. Candidate, UW Linguistics