[developers] Adjusting LNK values to space-delimited tokens

Matic Horvat matic.horvat at cl.cam.ac.uk
Tue Jun 27 12:41:41 CEST 2017


Hi,

To expand on Ann's email, the problem I needed to solve was to align the
DMRS EPs with a PTB-style tokenized sentence. In addition to punctuation
and whitespace differences, I needed to align null-semantics items, i.e.
tokens that are not associated with any predicate in the ERG. The latter
is done using heuristics. The code is available under the MIT license
here: https://github.com/matichorvat/pydmrs.

The relevant modules are:
General alignment (without null semantics items):
https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/token_align.py
Null semantics item alignment:
https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_align.py
Heuristics for null semantics item alignment:
https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_heuristics.py
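
The basic idea of the general alignment is to intersect each EP's LNK
span with the token spans; predicates that select several tokens (e.g.
'_can_v_modal' over "could n't") then fall out naturally, and the tokens
left unaligned afterwards are the null-semantics items handled by the
heuristics. As a much-simplified illustration (this is not the actual
pydmrs code):

    def align_eps_to_tokens(eps, tokens):
        """eps: (predicate, cfrom, cto) triples; tokens: (token_id,
        cfrom, cto) triples. For each EP, collect the ids of all
        tokens whose character span overlaps the EP's LNK span."""
        return [(pred, [tid for tid, tfrom, tto in tokens
                        if tfrom < cto and cfrom < tto])
                for pred, cfrom, cto in eps]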

I hope that helps!

Best,
Matic




On Tue, Jun 27, 2017 at 9:26 AM, Ann Copestake <aac10 at cl.cam.ac.uk> wrote:

> Matic's thesis indeed has an approach to the version of the problem he had
> to deal with (not quite the same), and he will make code available.  The
> thesis will be generally available once he's done some corrections.  But -
> he's now working in a company so won't be supporting the code, and it was
> anyway far from perfect.
>
> Is the system you're trying to integrate with really simply
> space-tokenized?  People generally use something a little more complex.
>
> All best,
>
> Ann
>
>
> On 26/06/2017 05:54, Francis Bond wrote:
>
> I am pretty sure Matic has done some work on this problem, ...
>
> On Mon, Jun 26, 2017 at 6:50 AM, Michael Wayne Goodman <goodmami at uw.edu> wrote:
>
>> Thanks Woodley,
>>
>> On Sun, Jun 25, 2017 at 8:03 PM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
>>
>>> Have you considered passing a pre-tokenized string (produced by REPP or
>>> otherwise) into ACE?  Character spans will then automatically be produced
>>> relative to that string.  Or maybe I misunderstood your goal?
>>
>>
>> Yes, I have tried this, but (a) I still get things like the final period
>> being in the same span as the final word (now with the additional space);
>> (b) I'm concerned about *over*-tokenization, if the REPP rules find
>> something in the tokenized string to further split up; and (c) while it
>> can parse "The dog could n't bark .", it fails to parse things like
>> "The kids ' toys are in the closet .".
>>
>> As to my goal, consider again "The dog couldn't bark." The initial
>> (post-REPP) tokens are:
>>
>>     <0:3>      "The"
>>     <4:7>      "dog"
>>     <8:13>     "could"
>>     <13:16>    "n't"
>>     <17:21>    "bark"
>>     <21:22>    "."
>>
>> The internal tokens are:
>>
>>     <0:3>      "the"
>>     <4:7>      "dog"
>>     <8:16>     "couldn't"
>>     <17:22>    "bark."
>>
>> I would like to adjust the latter values to fit the string in which the
>> initial tokens are all space-separated. So the new string is "The dog
>> could n't bark .", and the LNK values would be:
>>
>>     <0:3>      _the_q
>>     <4:7>      _dog_n_1
>>     <8:17>     _can_v_modal, neg  (CTO + 1 from the internal space)
>>     <18:22>    _bark_v_1  (CFROM + 1 from previous adjustment; CTO - 1 to
>> get rid of the final period)
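>>
>> To make the arithmetic concrete, here is a rough sketch of how that
>> computation could go (untested illustration code; the function names
>> are mine, not from any existing library). It builds offset maps from
>> the initial token spans and pushes each LNK value through them, with
>> an optional heuristic to trim a trailing punctuation token such as
>> the final period:
>>
>>     import string
>>
>>     def build_offset_maps(tokens):
>>         """tokens: (cfrom, cto, form) triples for the initial
>>         (post-REPP) tokens in surface order. Returns maps from
>>         original CFROM/CTO values to offsets in the string made by
>>         joining the token forms with single spaces."""
>>         from_map, to_map = {}, {}
>>         pos = 0
>>         for cfrom, cto, form in tokens:
>>             from_map[cfrom] = pos
>>             pos += len(form)
>>             to_map[cto] = pos
>>             pos += 1  # account for the space after each token
>>         return from_map, to_map
>>
>>     def adjust_lnk(cfrom, cto, from_map, to_map):
>>         """Map a LNK span whose endpoints fall on token boundaries,
>>         as spans derived from the internal tokens do."""
>>         return from_map[cfrom], to_map[cto]
>>
>>     def trim_trailing_punct(cfrom, cto, new_tokens):
>>         """Optional heuristic: if the mapped span ends on a token
>>         that is pure punctuation (e.g. the final period), pull CTO
>>         back to the end of the preceding token. new_tokens holds
>>         (start, end, form) spans in the space-joined string."""
>>         for i, (start, end, form) in enumerate(new_tokens):
>>             if (end == cto and i > 0 and start > cfrom
>>                     and all(c in string.punctuation for c in form)):
>>                 return cfrom, new_tokens[i - 1][1]
>>         return cfrom, cto
>>
>>     tokens = [(0, 3, 'The'), (4, 7, 'dog'), (8, 13, 'could'),
>>               (13, 16, "n't"), (17, 21, 'bark'), (21, 22, '.')]
>>     from_map, to_map = build_offset_maps(tokens)
>>     new_tokens, pos = [], 0
>>     for _, _, form in tokens:
>>         new_tokens.append((pos, pos + len(form), form))
>>         pos += len(form) + 1
>>     adjust_lnk(8, 16, from_map, to_map)    # (8, 17) for _can_v_modal
>>     trim_trailing_punct(*adjust_lnk(17, 22, from_map, to_map),
>>                         new_tokens)        # (18, 22) for _bark_v_1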
>>
>> My colleague uses these to anonymize named entities, numbers, etc., and
>> for this task he says he can be somewhat flexible. But he also uses them
>> for an attention layer in his neural setup, in which case he'd need exact
>> alignments.
>>
>>
>>> Woodley
>>>
>>>
>>>
>>>
>>> > On Jun 25, 2017, at 3:14 PM, Michael Wayne Goodman <goodmami at uw.edu> wrote:
>>> >
>>> > Hi all,
>>> >
>>> > A colleague of mine is attempting to use ERG semantic outputs in a
>>> > system originally created for another representation, and his system
>>> > requires the semantics to be paired with a tokenized string (e.g., with
>>> > punctuation separated from the word tokens).
>>> >
>>> > I can get the space-delimited tokenized string, e.g., from REPP or
>>> > from ACE with the -E option, but then the CFROM/CTO values in the MRS
>>> > no longer align to the string. The initial tokens ('p-input' in the
>>> > 'parse' table of a [incr tsdb()] profile) can tell me the span of
>>> > individual tokens in the original string, which I could use to compute
>>> > the adjusted spans (a rough sketch of reading those spans follows
>>> > below). This seems simple enough, but then it gets complicated: some
>>> > separated tokens should still count as a single range (e.g. "could
>>> > n't", where '_can_v_modal' and 'neg' both select the full span of
>>> > "could n't"), while others should be separated, like punctuation (but
>>> > not all punctuation, e.g. the ' in "The kids' toys are in the
>>> > closet.").
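>>> >
>>> > As a rough sketch of that first step (my own untested code, assuming
>>> > 'p-input' holds the usual YY token serialization):
>>> >
>>> >     import re
>>> >
>>> >     # e.g. (1, 0, 1, <0:3>, 1, "The", 0, "null") (2, 1, 2, ...)
>>> >     YY_SPAN = re.compile(r'<(\d+):(\d+)>[^"]*"((?:[^"\\]|\\.)*)"')
>>> >
>>> >     def p_input_spans(p_input):
>>> >         """Extract (cfrom, cto, form) triples from a p-input field."""
>>> >         return [(int(m.group(1)), int(m.group(2)), m.group(3))
>>> >                 for m in YY_SPAN.finditer(p_input)]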
>>> >
>>> > Has anyone else thought about this problem and can share some
>>> > solutions? Or, even better, code to realign EPs to the tokenized string?
>>> >
>>> > --
>>> > Michael Wayne Goodman
>>> > Ph.D. Candidate, UW Linguistics
>>>
>>
>>
>>
>> --
>> Michael Wayne Goodman
>> Ph.D. Candidate, UW Linguistics
>>
>
>
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
>