[developers] Extracting surface form of tokens from derivation trees

Ned Letcher nletcher at gmail.com
Thu Feb 13 08:16:30 CET 2014


Thanks, everyone, for the suggestions and input; it's been most helpful. I
think that in my current setup it would be easier if I didn't have to
invoke a separate tool, so it sounds like Angelina's approach of using the
internal tokens and then comparing them with the orthography of other
tokens in the chart that occupy the same position -- while still somewhat
fiddly -- might be the way to go.
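
To make sure I've understood the approach, here's roughly what I'm
picturing (a Python sketch only; it assumes the :p-tokens entries have
already been parsed into (id, start, end, form) tuples, and recover_case
is just a name I made up):

def recover_case(leaf_form, start, end, chart_tokens):
    # leaf_form: the (typically downcased) form used for lexical lookup;
    # start/end: its chart vertices; chart_tokens: every token in the
    # chart as an (id, start, end, form) tuple.
    for tok_id, tok_start, tok_end, form in chart_tokens:
        # a token in the same cell whose orthography differs only in
        # case is taken to preserve the original capitalization
        if ((tok_start, tok_end) == (start, end)
                and form.lower() == leaf_form.lower()
                and form != leaf_form):
            return form
    return leaf_form

So recover_case("artificial", 0, 1, chart_tokens) would hand back
"Artificial" if an un-downcased token spans the same vertices, and
otherwise just return the leaf form unchanged.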

Ned


On Thu, Feb 6, 2014 at 8:41 PM, Stephan Oepen <oe at ifi.uio.no> wrote:

> indeed, following PTB conventions, the splitting of 'constraint-based' or
> '1/2' happens in token mapping, and hence i believe working off internal
> tokens (where there is a one-to-one correspondence to leaf nodes of the
> derivation) should be less fiddly.  recovering capitalization, i think, is
> the only challenge at this level, and looking in the same chart cell for
> another token that was not downcased (as angelina does) seems relatively
> straightforward to me.
>
> ned, i forgot: you could also just use the dependency converter to obtain
> a token sequence corresponding to the derivation leaves. a little
> roundabout, but probably easy to pull off.  interested in instructions?
>
> oe
> On Feb 6, 2014 9:33 AM, "Woodley Packard" <sweaglesw at sweaglesw.org> wrote:
>
>> You're right, Bec, that I overlooked the question of where to put spaces.
>> My suggestion also doesn't work for cases where the p-input tokens get
>> split in half by token mapping (e.g. "the blue-colored dog").  I believe
>> both problems are solvable, but it starts to get fiddly.  Probably best to
>> scratch that idea.
>>
>> Woodley
>>
>> On Feb 6, 2014, at 12:10 AM, Bec Dridan <bec.dridan at gmail.com> wrote:
>>
>> Concatenating the p-input tokens will mostly get you what you want. I
>> think you might run into issues with leaves containing spaces, though, if
>> you always concatenate. You'll need some extra checking of the span
>> between tokens, possibly just the immediately adjacent characters.  I can
>> still imagine some combinations of wiki markup, punctuation and
>> words-with-spaces that will cause problems, but I believe they would be
>> rare.
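>>
>> For the spacing, the check I have in mind is roughly the following (a
>> sketch only; it assumes you still have the original string plus each
>> token's character span, in surface order):
>>
>> def detokenize(tokens, text):
>>     # tokens: (cfrom, cto, form) triples in surface order;
>>     # text: the original input string.  A space is inserted only when
>>     # the characters between two adjacent spans contain whitespace.
>>     pieces = []
>>     prev_cto = None
>>     for cfrom, cto, form in tokens:
>>         gap = text[prev_cto:cfrom] if prev_cto is not None else ""
>>         if any(c.isspace() for c in gap):
>>             pieces.append(" ")
>>         pieces.append(form)
>>         prev_cto = cto
>>     return "".join(pieces)
>>
>> Looking only at the immediately adjacent character instead of the whole
>> gap would be a cheaper variant of the same check.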
>>
>> I still think it could be useful to not downcase at the end of token
>> mapping, but just at the time of lexicon lookup. It wouldn't solve all
>> these problems, but it would retain useful information in a more accessible
>> way.
>>
>> bec
>>
>>
>> On Thu, Feb 6, 2014 at 6:56 AM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
>>
>>> Hi Ned and Stephan,
>>>
>>> Actually, I think you may want to look at the p-input field of the parse
>>> relation.  These are the tokens that come out of REPP, i.e. the input to
>>> token mapping.  There is no ambiguity at this point: the bad characters are
>>> already removed, and case is preserved.  What I would suggest is to
>>> concatenate the strings of all tokens contained in the character span you
>>> got from the derivation tree.
>>>
>>> In the case of the example you referenced, the p-input field contains
>>> (among other tokens) the following:
>>>
>>> (21, 20, 21, <137:144>, 1, "control", 0, "null", "NN" 1.0000)
>>> (22, 21, 22, <146:147>, 1, ",", 0, "null", "," 1.0000)
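>>>
>>> In code, the selection could look something like this (a sketch only;
>>> the regular expression just pulls the <from:to> span and the quoted form
>>> out of lines shaped like the two above, and joining with a single space
>>> is of course a simplification):
>>>
>>> import re
>>>
>>> SPAN_AND_FORM = re.compile(r'<(\d+):(\d+)>, \d+, "([^"]*)"')
>>>
>>> def leaf_string(p_input_lines, leaf_from, leaf_to):
>>>     # collect the forms of all p-input tokens whose character span
>>>     # falls inside the span recorded on the derivation leaf
>>>     forms = []
>>>     for line in p_input_lines:
>>>         m = SPAN_AND_FORM.search(line)
>>>         if not m:
>>>             continue
>>>         cfrom, cto = int(m.group(1)), int(m.group(2))
>>>         if leaf_from <= cfrom and cto <= leaf_to:
>>>             forms.append(m.group(3))
>>>     return " ".join(forms)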
>>>
>>> All this headache is brought about by the extra wiki markup embedded in
>>> the input string, which IMHO is not English.  If you put English in, taking
>>> the substring directly out of the input string will give you something more
>>> worth looking at :-)
>>>
>>> -Woodley
>>>
>>> On Feb 5, 2014, at 9:21 PM, Ned Letcher <nletcher at gmail.com> wrote:
>>>
>>> What Woodley described is, I think, what's going on. It looks like the
>>> START/END offsets returned by the lkb:repp call are different from the
>>> +FROM/+TO offsets found in the derivation. The derivation I'm using is
>>> from the gold tree in the logon repository: i-id 10032820 in
>>> $LOGONROOT/lingo/erg/tsdb/gold/ws01/result.gz. (also, for some reason I
>>> just get NIL when I try to evaluate that repp function call in my lisp
>>> buffer after loading logon)
>>>
>>> From comparing derivations and the relevant portions of the p-tokens
>>> field, it looks to me like using the +FORM feature of the token has the
>>> same effect as extracting the string from the p-tokens field for the token
>>> that was ultimately used? But as you say, this still leaves the issue of
>>> the correct casing. Angelina's workaround is a good suggestion, but it
>>> definitely feels like a hack. It seems like it would be desirable to keep
>>> track of the value of tokens after REPP normalization but before downcasing
>>> for lexicon lookup. I was talking about this with Bec also, and while her
>>> problem was slightly different in that she only needed features for
>>> ubertagging rather than the original surface form, she said she was also
>>> struggling with this limitation.
>>>
>>> Ned
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Feb 6, 2014 at 11:26 AM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
>>>
>>>> I believe what happened in this particular case is that "control" and
>>>> the following "," punctuation token got combined, resulting in a contiguous
>>>> CFROM/CTO span of 137 to 147, which includes not only "control" and "," but
>>>> also the deleted text in the middle.
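>>>>
>>>> You can see the effect directly by slicing the markup-bearing input
>>>> string at that combined span (Python here purely for illustration):
>>>>
>>>> s = ("Artificial intelligence has successfully been used in a wide"
>>>>      " range of fields including [[medical diagnosis]], [[stock"
>>>>      " trading]], [[robot control]], [[law]], scientific discovery"
>>>>      " and toys.")
>>>> print(s[137:144])  # control     (the span REPP gives the token)
>>>> print(s[137:147])  # control]],  (the stripped "]]" comes along)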
>>>>
>>>> -Woodley
>>>>
>>>> On Feb 5, 2014, at 2:24 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>>
>>>> > hi ned,
>>>> >
>>>> > the practical challenge you are facing is deeply interesting.  i
>>>> > believe both angelina (in conversion to bi-lexical dependencies)
>>>> > and bec (working on ubertagging in PET) have looked at this.
>>>> >
>>>> > the derivation trees include the identifiers of internal tokens
>>>> > (an integer immediately preceding the token feature structure),
>>>> > and these tokens you can retrieve from the :p-tokens field in
>>>> > reasonably up-to-date [incr tsdb()] profiles.  this will give you
>>>> > the strings that were used for lexical lookup.  capitalization is
>>>> > lost at this point, more often than not, hence one needs to do
>>>> > something approximative in addition to finding the token.  for
>>>> > all i recall, angelina compares the actual token to others that
>>>> > have the same position in the chart (by start and end vertex,
>>>> > as recorded in the :p-tokens format); if she finds one whose
>>>> > orthography differs from the downcased string, she uses that
>>>> > token instead.  bec, on the other hand, i think
>>>> > consults the +CASE feature in the token feature structure.
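>>>> >
>>>> > in code, that check might look roughly like the sketch below; i am
>>>> > not certain of the exact +CASE values off the top of my head, so the
>>>> > 'capitalized' test is an assumption to verify against the actual
>>>> > token feature structures rather than a recipe:
>>>> >
>>>> > import re
>>>> >
>>>> > # pull the +CASE value out of the token feature structure string on
>>>> > # the derivation leaf; matching on 'capitalized' is an assumption
>>>> > CASE_RE = re.compile(r'\+CASE\s+(\S+)')
>>>> >
>>>> > def restore_case(form, token_fs):
>>>> >     m = CASE_RE.search(token_fs)
>>>> >     if m and m.group(1).lower().startswith("capitalized"):
>>>> >         return form[:1].upper() + form[1:]
>>>> >     return form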
>>>> >
>>>> > underlying all this, i suspect there is a question of what the
>>>> > characterization of initial tokens really should be, e.g. when
>>>> > we strip wiki markup at the REPP level.  but i seem unable
>>>> > to reproduce the particular example you give:
>>>> >
>>>> > TSNLP(88): (setf string
>>>> >             "Artificial intelligence has successfully been used in a
>>>> > wide range of fields including [[medical diagnosis]], [[stock
>>>> > trading]], [[robot control]], [[law]], scientific discovery and
>>>> > toys.")
>>>> > "Artificial intelligence has successfully been used in a wide range of
>>>> > fields including [[medical diagnosis]], [[stock trading]], [[robot
>>>> > control]], [[law]], scientific discovery and toys."
>>>> >
>>>> > TSNLP(89): (pprint (lkb::repp string :calls '(:xml :wiki :lgt :ascii
>>>> > :quotes) :format :raw))
>>>> > ...
>>>> > #S(LKB::TOKEN :ID 20 :FORM "control" :STEM NIL :FROM 20 :TO 21 :START
>>>> > 137 :END 144 :TAGS NIL :ERSATZ NIL)
>>>> > ...
>>>> >
>>>> > TSNLP(90): (subseq string 137 144)
>>>> > "control"
>>>> >
>>>> > i don't doubt the problem is real, but out of curiosity: how did
>>>> > you produce your derivations?
>>>> >
>>>> > all best, oe
>>>> >
>>>> > On Wed, Feb 5, 2014 at 5:02 AM, Ned Letcher <nletcher at gmail.com> wrote:
>>>> >> Hi all,
>>>> >>
>>>> >> I'm trying to export DELPH-IN derivation trees for use in the Fangorn
>>>> >> treebank querying tool (which uses PTB style trees for importing) and
>>>> >> have run into a hiccup extracting the string to use for the leaves of
>>>> >> the trees. Fangorn does not support storing the original input string
>>>> >> alongside the derivation; the string used to display the original
>>>> >> sentence is instead reconstructed by concatenating the leaves of the
>>>> >> tree together.
>>>> >>
>>>> >> I've been populating the leaves of the exported PTB tree by extracting
>>>> >> the relevant slice of the i-input string using the +FROM/+TO offsets
>>>> >> in the token information (if token mapping was used). One case I've
>>>> >> found where this doesn't work so well (and there may be more) is where
>>>> >> characters that have been stripped by REPP occur within a token, so
>>>> >> these characters are then included in the slice. Wikipedia markup, for
>>>> >> instance, results in these artefacts:
>>>> >>
>>>> >> "Artificial intelligence has successfully been used in a wide range of
>>>> >> fields including medical diagnosis]], stock trading]], robot
>>>> >> control]], law]], scientific discovery and toys."
>>>> >>
>>>> >> I also tried using the value of the +FORM feature, but it seems that
>>>> >> this doesn't always preserve the casing of the original input string.
>>>> >>
>>>> >> Does anyone have any ideas for combating this problem?
>>>> >>
>>>> >> Ned
>>>> >>
>>>> >> --
>>>> >> nedned.net
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> >
>>>> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> > +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
>>>> > +++    --- oe at ifi.uio.no; stephan at oepen.net; http://www.emmtee.net/oe/ ---
>>>> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>>
>>>
>>>
>>> --
>>> nedned.net
>>>
>>>
>>>
>>
>>


-- 
nedned.net