<div dir="ltr">Hi all,<div><br></div><div>A colleague of mine is trying to use ERG semantic outputs in a system originally built for another representation, and his system requires the semantics to be paired with a tokenized string (e.g., with punctuation separated from the word tokens).</div><div><br></div><div>I can get the space-delimited tokenized string, e.g., from repp or from ACE with the -E option, but then the CFROM/CTO values in the MRS no longer align to that string. The initial tokens ('p-input' in the 'parse' table of a [incr tsdb()] profile) give the span of each token in the original string, which I could use to compute the adjusted spans. This seems simple enough, but it gets complicated: some separated tokens should still count as a single range (e.g., "could n't", where '_can_v_modal' and 'neg' both select the full span of "could n't"), while others should be kept separate, such as punctuation (but not all punctuation, e.g., the apostrophe in "The kids' toys are in the closet.").<br></div><div><br></div><div>Has anyone else thought about this problem and come up with solutions they can share? Or, even better, does anyone have code to realign EPs to the tokenized string?</div><div><br></div><div>-- <br><div class="gmail_signature"><div dir="ltr">Michael Wayne Goodman<div>Ph.D. Candidate, UW Linguistics</div></div></div>
</div></div>