[developers] Adjusting LNK values to space-delimited tokens

Mon Jun 26 06:54:36 CEST 2017

I am pretty sure Matic has done some work on this problem, ...

On Mon, Jun 26, 2017 at 6:50 AM, Michael Wayne Goodman <goodmami at uw.edu>
wrote:

> Thanks Woodley,
>
> On Sun, Jun 25, 2017 at 8:03 PM, Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>
>> Have you considered passing a pre-tokenized string (produced by REPP or
>> otherwise) into ACE?  Character spans will then automatically be produced
>> relative to that string.  Or maybe I misunderstood your goal?
>
>
> Yes, I have tried this, but (a) I still get things like the final period
> being in the same span as the final word (now with the additional space);
> (b) I'm concerned about *over*-tokenization, if the REPP rules find
> something in the tokenized string to further split up; and (c) while it was
> able to parse "The dog could n't bark .", it fails to parse things like
> "The kids ' toys are in the closet .".
>
> As to my goal, consider again "The dog couldn't bark." The initial
> (post-REPP) tokens are:
>
>     <0:3>      "The"
>     <4:7>      "dog"
>     <8:13>     "could"
>     <13:16>    "n’t"
>     <17:21>    "bark"
>     <21:22>    "."
>
> The internal tokens are:
>
>     <0:3>      "the"
>     <4:7>      "dog"
>     <8:16>     "couldn’t"
>     <17:22>    "bark."
>
> I would like to adjust the latter values to fit the string where the
> initial tokens are all space separated. So the new string is "The dog could
> n't bark .", and the LNK values would be:
>
>     <0:3>      _the_q
>     <4:7>      _dog_n_1
>     <8:17>     _can_v_modal, neg  (CTO + 1 from the internal space)
>     <18:22>    _bark_v_1  (CFROM + 1 from previous adjustment; CTO - 1
>                to get rid of the final period)
>
> My colleague uses these to anonymize named entities, numbers, etc., and
> for this task he says he can be somewhat flexible. But he also uses them
> for an attention layer in his neural setup, in which case he'd need exact
> alignments.
>
>
>> Woodley
>>
>>
>>
>>
>> > On Jun 25, 2017, at 3:14 PM, Michael Wayne Goodman <goodmami at uw.edu>
>> > wrote:
>> >
>> > Hi all,
>> >
>> > A colleague of mine is attempting to use ERG semantic outputs in a
>> > system originally created for another representation, and his system
>> > requires the semantics to be paired with a tokenized string (e.g., with
>> > punctuation separated from the word tokens).
>> >
>> > I can get the space-delimited tokenized string, e.g., from repp or from
>> > ACE with the -E option, but then the CFROM/CTO values in the MRS no
>> > longer align to the string. The initial tokens ('p-input' in the 'parse'
>> > table of a [incr tsdb()] profile) can tell me the span of individual
>> > tokens in the original string, which I could use to compute the adjusted
>> > spans. This seems simple enough, but then it gets complicated as there
>> > are separated tokens that should still count as a single range (e.g.
>> > "could n't", where '_can_v_modal' and 'neg' both select the full span of
>> > "could n't") and also those I want separated, like punctuation (but not
>> > all punctuation, like ' in "The kids' toys are in the closet.").
>> >
>> > Has anyone else thought about this problem and can share some
>> > solutions? Or, even better, code to realign EPs to the tokenized string?
>> >
>> > --
>> > Michael Wayne Goodman
>> > Ph.D. Candidate, UW Linguistics
>>
>
>
>
> --
> Michael Wayne Goodman
> Ph.D. Candidate, UW Linguistics
>
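
The adjustment described in the quoted message can be sketched in a few
lines of Python. This is only a minimal sketch under stated assumptions,
not an existing tool: it assumes the initial tokens are already available
as (from, to, form) triples over the original string, and the names
tokenized_spans, realign, and TRIM are hypothetical. The set of
punctuation trimmed from span edges is also an assumption; the apostrophe
is deliberately left out of it, since the quoted message notes that not
all punctuation should be split off.

```python
from typing import FrozenSet, List, Tuple

Token = Tuple[int, int, str]  # (from, to, form) over the ORIGINAL string
Span = Tuple[int, int]

# Assumption: token forms to trim from the edges of a realigned span.
TRIM: FrozenSet[str] = frozenset({".", ",", "!", "?", ";", ":"})


def tokenized_spans(tokens: List[Token]) -> Tuple[str, List[Span]]:
    """Join token forms with single spaces and return the new string
    together with each token's span in that string."""
    spans, pos = [], 0
    for _, _, form in tokens:
        spans.append((pos, pos + len(form)))
        pos += len(form) + 1  # one space follows each token
    return " ".join(form for _, _, form in tokens), spans


def realign(lnk: Span, tokens: List[Token],
            trim: FrozenSet[str] = TRIM) -> Span:
    """Map a (CFROM, CTO) span over the original string onto the
    space-delimited string built from the initial tokens."""
    cfrom, cto = lnk
    _, new_spans = tokenized_spans(tokens)
    # Indices of the initial tokens that overlap the original span.
    covered = [i for i, (start, end, _) in enumerate(tokens)
               if start < cto and end > cfrom]
    if not covered:
        raise ValueError(f"no initial token overlaps <{cfrom}:{cto}>")
    # Drop punctuation at either edge, but always keep one token.
    while len(covered) > 1 and tokens[covered[-1]][2] in trim:
        covered.pop()
    while len(covered) > 1 and tokens[covered[0]][2] in trim:
        covered.pop(0)
    return new_spans[covered[0]][0], new_spans[covered[-1]][1]


# Initial tokens for "The dog couldn't bark." from the quoted message.
tokens = [(0, 3, "The"), (4, 7, "dog"), (8, 13, "could"),
          (13, 16, "n't"), (17, 21, "bark"), (21, 22, ".")]

print(tokenized_spans(tokens)[0])  # The dog could n't bark .
print(realign((8, 16), tokens))    # (8, 17)
print(realign((17, 22), tokens))   # (18, 22)
```

Because a span is mapped through the set of initial tokens it overlaps,
"could n't" stays one range while the final period is dropped from
"bark.". Whether a given punctuation token should be trimmed or kept is
the part that would need tuning per grammar.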

-- 
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University