[developers] Adjusting LNK values to space-delimited tokens

Ann Copestake aac10 at cl.cam.ac.uk
Tue Jun 27 10:26:38 CEST 2017


Matic's thesis does describe an approach to the version of the problem he 
had to deal with (not quite the same as yours), and he will make the code 
available.  The thesis will be generally available once he's done some 
corrections.  However, he's now working in a company, so he won't be 
supporting the code, and it was far from perfect anyway.

Is the system you're trying to integrate with really simply 
space-tokenized?  People generally use something a little more complex.

All best,


Ann

On 26/06/2017 05:54, Francis Bond wrote:
> I am pretty sure Matic has done some work on this problem, ...
>
> On Mon, Jun 26, 2017 at 6:50 AM, Michael Wayne Goodman 
> <goodmami at uw.edu> wrote:
>
>     Thanks Woodley,
>
>     On Sun, Jun 25, 2017 at 8:03 PM, Woodley Packard
>     <sweaglesw at sweaglesw.org> wrote:
>
>         Have you considered passing a pre-tokenized string (produced
>         by REPP or otherwise) into ACE? Character spans will then
>         automatically be produced relative to that string.  Or maybe I
>         misunderstood your goal?
>
>
>     Yes, I have tried this, but (a) I still get things like the final
>     period being in the same span as the final word (now with the
>     additional space); (b) I'm concerned about *over*-tokenization, if
>     the REPP rules find something in the tokenized string to further
>     split up; and (c) while it was able to parse "The dog could n't
>     bark .", it fails to parse things like "The kids ' toys are in the
>     closet .".
>
>     As to my goal, consider again "The dog couldn't bark." The initial
>     (post-REPP) tokens are:
>
>         <0:3>  "The"
>         <4:7>  "dog"
>         <8:13> "could"
>         <13:16>  "n’t"
>         <17:21>  "bark"
>         <21:22>  "."
>
>     The internal tokens are:
>
>         <0:3>  "the"
>         <4:7>  "dog"
>         <8:16> "couldn’t"
>         <17:22>  "bark."
>
>     I would like to adjust the latter values to fit the string where
>     the initial tokens are all space separated. So the new string is
>     "The dog could n't bark .", and the LNK values would be:
>
>         <0:3>  _the_q
>         <4:7>  _dog_n_1
>         <8:17> _can_v_modal, neg  (CTO + 1 from the internal space)
>         <18:22>  _bark_v_1  (CFROM + 1 from previous adjustment;
>                  CTO - 1 to get rid of the final period)
>
>     My colleague uses these to anonymize named entities, numbers,
>     etc., and for this task he says he can be somewhat flexible. But
>     he also uses them for an attention layer in his neural setup, in
>     which case he'd need exact alignments.
>
>
>         Woodley
>
>
>
>
>         > On Jun 25, 2017, at 3:14 PM, Michael Wayne Goodman
>         <goodmami at uw.edu> wrote:
>         >
>         > Hi all,
>         >
>         > A colleague of mine is attempting to use ERG semantic
>         outputs in a system originally created for another
>         representation, and his system requires the semantics to be
>         paired with a tokenized string (e.g., with punctuation
>         separated from the word tokens).
>         >
>         > I can get the space-delimited tokenized string, e.g., from
>         repp or from ACE with the -E option, but then the CFROM/CTO
>         values in the MRS no longer align to the string. The initial
>         tokens ('p-input' in the 'parse' table of a [incr tsdb()]
>         profile) can tell me the span of individual tokens in the
>         original string, which I could use to compute the adjusted
>         spans. This seems simple enough, but then it gets complicated
>         as there are separated tokens that should still count as a
>         single range (e.g. "could n't", where '_can_v_modal' and 'neg'
>         both select the full span of "could n't") and also those I
>         want separated, like punctuation (but not all punctuation,
>         like ' in "The kids' toys are in the closet.").
>         >
>         > Has anyone else thought about this problem and can share
>         some solutions? Or, even better, code to realign EPs to the
>         tokenized string?
>         >
>         > --
>         > Michael Wayne Goodman
>         > Ph.D. Candidate, UW Linguistics
>
>
>
>
>     -- 
>     Michael Wayne Goodman
>     Ph.D. Candidate, UW Linguistics
>
>
>
>
> -- 
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
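
For what it's worth, the span adjustment described in the quoted messages
could be sketched roughly as below, in Python.  This is only an
illustration under assumptions; it is not Matic's approach, and it is not
something ACE or PyDelphin provides.  The function names, the
(cfrom, cto, form) tuple layout for the initial tokens, and the policy of
detaching only tokens made up entirely of . , ; : ! ? are all
hypothetical.  The idea is to find the initial tokens that overlap an
internal span, drop detachable punctuation from the edges, and report the
span those tokens occupy in the space-joined string.

```python
# Illustrative sketch only: the names and the punctuation policy are assumptions.
STRIP = set(".,;:!?")  # assumed set of punctuation to exclude from spans

# Initial (post-REPP) tokens for "The dog couldn't bark." as (cfrom, cto, form)
TOKENS = [(0, 3, "The"), (4, 7, "dog"), (8, 13, "could"),
          (13, 16, "n't"), (17, 21, "bark"), (21, 22, ".")]


def token_offsets(tokens):
    """Return each token's (start, end) in the space-joined token string."""
    spans, pos = [], 0
    for _, _, form in tokens:
        spans.append((pos, pos + len(form)))
        pos += len(form) + 1  # account for the single separating space
    return spans


def realign(span, tokens, strip=STRIP):
    """Map one (cfrom, cto) span onto the space-delimited string."""
    cfrom, cto = span
    new = token_offsets(tokens)
    # Indices of the initial tokens that overlap the original span.
    idx = [i for i, (s, e, _) in enumerate(tokens) if s < cto and e > cfrom]
    if not idx:
        return None
    # Drop edge tokens that consist only of detachable punctuation.
    while len(idx) > 1 and set(tokens[idx[-1]][2]) <= strip:
        idx.pop()
    while len(idx) > 1 and set(tokens[idx[0]][2]) <= strip:
        idx.pop(0)
    return (new[idx[0]][0], new[idx[-1]][1])


print(" ".join(form for _, _, form in TOKENS))  # The dog could n't bark .
print(realign((8, 16), TOKENS))   # (8, 17)
print(realign((17, 22), TOKENS))  # (18, 22)
```

Because the apostrophe is not in the assumed STRIP set, a possessive
marker that was split off as its own token would stay inside the span of
the word it belongs to; a different policy can be tried by passing
another set as the strip argument.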
