[developers] Adjusting LNK values to space-delimited tokens

Michael Wayne Goodman goodmami at uw.edu
Mon Jul 3 21:12:58 CEST 2017


Thanks Francis, Ann, and Matic,

Matic, I'll look over the code you wrote. It sounds pretty close to what I
was after. Thanks for sharing!

And Ann, to (attempt to) answer your question: I think the tokenization
requirement is currently for preprocessing (e.g., my colleague runs a
named-entity recognizer over a tokenized string, then uses the results to
anonymize NEs in the DMRS graph). I think his seq-to-seq neural system also
uses tokens (as opposed to characters or other sub-word units), but I don't
think retokenization is currently necessary for training/decoding. You can
see the code and links to the paper here:
https://github.com/sinantie/NeuralAmr

On Tue, Jun 27, 2017 at 3:41 AM, Matic Horvat <matic.horvat at cl.cam.ac.uk>
wrote:

> Hi,
>
> To expand on Ann's email, the problem I needed to solve was aligning the
> DMRS EPs with a PTB-style tokenized sentence. In addition to punctuation
> and whitespace differences, I needed to align null semantics items, which
> are not associated with any predicate in the ERG; the latter is done using
> heuristics. The code is available under the MIT license here:
> https://github.com/matichorvat/pydmrs (see the sketch after the module
> list below for the basic idea).
>
> The relevant modules are:
> General alignment (without null semantics items):
> https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/token_align.py
> Null semantics item alignment:
> https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_align.py
> Heuristics for null semantics item alignment:
> https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_heuristics.py
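>
> The starting point for this kind of alignment is character-span overlap
> between an EP's <cfrom:cto> and the token spans. A toy sketch of that
> idea in Python (illustrative only, with made-up names; not the actual
> pydmrs code):
>
>     def align_eps(eps, tokens):
>         """For each EP, collect the ids of the tokens whose character
>         spans overlap the EP's <cfrom:cto> span.
>         eps: [(pred, cfrom, cto)]; tokens: [(tid, start, end)]."""
>         return [(pred, [tid for tid, start, end in tokens
>                         if start < cto and end > cfrom])
>                 for pred, cfrom, cto in eps]
>
> Tokens that end up aligned to no EP are the null semantics items that the
> heuristics then have to handle.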
>
> I hope that helps!
>
> Best,
> Matic
>
>
>
>
> On Tue, Jun 27, 2017 at 9:26 AM, Ann Copestake <aac10 at cl.cam.ac.uk> wrote:
>
>> Matic's thesis indeed has an approach to the version of the problem he
>> had to deal with (not quite the same as yours), and he will make the code
>> available. The thesis itself will be generally available once he's done
>> some corrections. But he's now working at a company, so he won't be
>> supporting the code, and it was in any case far from perfect.
>>
>> Is the system you're trying to integrate with really just
>> space-tokenized? People generally use something a little more complex.
>>
>> All best,
>>
>> Ann
>>
>>
>> On 26/06/2017 05:54, Francis Bond wrote:
>>
>> I am pretty sure Matic has done some work on this problem, ...
>>
>> On Mon, Jun 26, 2017 at 6:50 AM, Michael Wayne Goodman <goodmami at uw.edu>
>> wrote:
>>
>>> Thanks Woodley,
>>>
>>> On Sun, Jun 25, 2017 at 8:03 PM, Woodley Packard
>>> <sweaglesw at sweaglesw.org> wrote:
>>>
>>>> Have you considered passing a pre-tokenized string (produced by REPP or
>>>> otherwise) into ACE?  Character spans will then automatically be produced
>>>> relative to that string.  Or maybe I misunderstood your goal?
>>>
>>>
>>> Yes, I have tried this, but (a) I still get things like the final period
>>> sharing a span with the final word (now with the additional space);
>>> (b) I'm concerned about *over*-tokenization, in case the REPP rules find
>>> something in the already-tokenized string to split up further; and (c)
>>> while it was able to parse "The dog could n't bark .", it failed on
>>> things like "The kids ' toys are in the closet .".
>>>
>>> As to my goal, consider again "The dog couldn't bark." The initial
>>> (post-REPP) tokens are:
>>>
>>>     <0:3>      "The"
>>>     <4:7>      "dog"
>>>     <8:13>     "could"
>>>     <13:16>    "n't"
>>>     <17:21>    "bark"
>>>     <21:22>    "."
>>>
>>> The internal tokens are:
>>>
>>>     <0:3>      "the"
>>>     <4:7>      "dog"
>>>     <8:16>     "couldn't"
>>>     <17:22>    "bark."
>>>
>>> I would like to adjust the latter values to fit the string in which the
>>> initial tokens are all space-separated. The new string is then "The dog
>>> could n't bark .", and the LNK values would be:
>>>
>>>     <0:3>      _the_q
>>>     <4:7>      _dog_n_1
>>>     <8:17>     _can_v_modal, neg  (CTO + 1 from the internal space)
>>>     <18:22>    _bark_v_1  (CFROM + 1 from previous adjustment; CTO - 1
>>> to get rid of the final period)
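>>>
>>> For concreteness, here is a minimal sketch of the remapping I have in
>>> mind (Python, with made-up helper names): join the initial tokens with
>>> single spaces, record each token's span in the joined string, and move
>>> each EP's <cfrom:cto> onto the tokens it overlaps, dropping trailing
>>> punctuation-only tokens:
>>>
>>>     import string
>>>
>>>     def retokenize(tokens):
>>>         """Join initial (start, end, form) tokens with spaces; return
>>>         the new string and each token's span within it."""
>>>         forms = [form for _, _, form in tokens]
>>>         spans, pos = [], 0
>>>         for form in forms:
>>>             spans.append((pos, pos + len(form)))
>>>             pos += len(form) + 1  # +1 for the separating space
>>>         return ' '.join(forms), spans
>>>
>>>     def remap(cfrom, cto, tokens, new_spans, strip_punct=True):
>>>         """Map an EP's <cfrom:cto> over the original string onto the
>>>         space-joined string via the initial tokens it overlaps."""
>>>         idx = [i for i, (start, end, _) in enumerate(tokens)
>>>                if start < cto and end > cfrom]
>>>         if strip_punct:
>>>             # drop trailing punctuation-only tokens, e.g. the final "."
>>>             while len(idx) > 1 and all(
>>>                     c in string.punctuation for c in tokens[idx[-1]][2]):
>>>                 idx.pop()
>>>         return new_spans[idx[0]][0], new_spans[idx[-1]][1]
>>>
>>> With the initial tokens above, remap(8, 16, ...) yields <8:17> for
>>> _can_v_modal and neg, and remap(17, 22, ...) yields <18:22> for
>>> _bark_v_1, matching the values I listed.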
>>>
>>> My colleague uses these to anonymize named entities, numbers, etc., and
>>> for this task he says he can be somewhat flexible. But he also uses them
>>> for an attention layer in his neural setup, in which case he'd need exact
>>> alignments.
>>>
>>>
>>>> Woodley
>>>>
>>>>
>>>>
>>>>
>>>> > On Jun 25, 2017, at 3:14 PM, Michael Wayne Goodman
>>>> > <goodmami at uw.edu> wrote:
>>>> >
>>>> > Hi all,
>>>> >
>>>> > A colleague of mine is attempting to use ERG semantic outputs in a
>>>> > system originally created for another representation, and his system
>>>> > requires the semantics to be paired with a tokenized string (e.g.,
>>>> > with punctuation separated from the word tokens).
>>>> >
>>>> > I can get the space-delimited tokenized string, e.g., from REPP or
>>>> > from ACE with the -E option, but then the CFROM/CTO values in the MRS
>>>> > no longer align to the string. The initial tokens ('p-input' in the
>>>> > 'parse' table of a [incr tsdb()] profile) give me the span of each
>>>> > token in the original string, which I could use to compute the
>>>> > adjusted spans. That seems simple enough, but it gets complicated:
>>>> > some separated tokens should still count as a single range (e.g.,
>>>> > "could n't", where '_can_v_modal' and 'neg' both select the full span
>>>> > of "could n't"), while others should be split off, like punctuation
>>>> > (though not all punctuation, e.g. the ' in "The kids' toys are in the
>>>> > closet.").
>>>> >
>>>> > Has anyone else thought about this problem and can share some
>>>> > solutions? Or, even better, code to realign EPs to the tokenized
>>>> > string?
>>>> >
>>>> > --
>>>> > Michael Wayne Goodman
>>>> > Ph.D. Candidate, UW Linguistics
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Wayne Goodman
>>> Ph.D. Candidate, UW Linguistics
>>>
>>
>>
>>
>> --
>> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>> Division of Linguistics and Multilingual Studies
>> Nanyang Technological University
>>
>>
>


-- 
Michael Wayne Goodman
Ph.D. Candidate, UW Linguistics