[developers] Extracting surface form of tokens from derivation trees

Mon Apr 21 18:17:22 CEST 2014

I wound up putting the project I needed this for on hold for a little
while, but have recently been trying to get this working -- in particular,
the approach of checking the chart for p-tokens with the same vertices as
the candidate token, and then using that p-token instead if its
orthography differs from that of the candidate token.

I've discovered two gotchas with this approach. The first is that the
candidate token is not always downcased, so just checking for a difference
in orthography can end up matching the downcased p-token rather than the
cased one. (Apparently some tokens in the derivation are not downcased for
whatever reason -- the second word in a two-word, all-capitalized proper
name is one example I noticed.) I solved this by downcasing the candidate
token myself before doing the comparison.
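
A minimal Python sketch of that comparison, with the downcasing fix
applied. The ChartToken record and the surface_form function are names
made up for illustration, not part of any DELPH-IN tool, and the sketch
assumes the p-tokens have already been read into (id, start vertex, end
vertex, form) records. The check that the other token is the same word
ignoring case is an extra guard assumed here, not something described in
this thread.

```python
from typing import Iterable, NamedTuple


class ChartToken(NamedTuple):
    """Hypothetical record for one entry of the p-tokens field."""
    id: int
    start: int  # start vertex in the chart
    end: int    # end vertex in the chart
    form: str


def surface_form(candidate: ChartToken, p_tokens: Iterable[ChartToken]) -> str:
    """Return a cased variant of the candidate's form if the chart has one."""
    # Downcase the candidate first: some derivation tokens are not downcased.
    lowered = candidate.form.lower()
    for tok in p_tokens:
        same_cell = (tok.start, tok.end) == (candidate.start, candidate.end)
        # Accept only a token that is the same word with different casing.
        if same_cell and tok.form != lowered and tok.form.lower() == lowered:
            return tok.form
    # No cased variant in this chart cell: keep the candidate's own form.
    return candidate.form
```

If the candidate itself is the cased token, it is found among the
p-tokens for its own cell and returned unchanged.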

The second is that, unfortunately, capitalization is not always preserved
in the p-tokens. Some sentence-initial tokens in particular seem to occur
only in all-lowercase form. Here is one from DeepBank's 20003004 (wsj00a),
for instance, where the i-input began with 'Although':

(407, 0, 1, <0:8>, 1, "although", 0, "null")
(324, 1, 2, <9:20>, 1, "preliminary", 0, "null")
(391, 1, 2, <9:20>, 1, "preliminary", 0, "null", "JJ" 1.0000)
(326, 2, 3, <21:29>, 1, "findings", 0, "null")
(392, 2, 3, <21:29>, 1, "findings", 0, "null", "NNS" 1.0000)
(328, 3, 4, <30:34>, 1, "were", 0, "null")
(393, 3, 4, <30:34>, 1, "were", 0, "null", "VBD" 1.0000)
(365, 4, 5, <35:43>, 1, "reported", 0, "null")
(382, 4, 5, <35:43>, 1, "reported", 0, "null", "VBN" 0.9416) .......

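For anyone wanting to reproduce this, entries shaped like the ones quoted
above can be pulled apart with a regular expression. This is only a
sketch: PToken and parse_p_tokens are hypothetical names, the pattern
covers just the leading fields visible in this example (id, start vertex,
end vertex, character span, one path number, quoted form), and it may
well miss other variants of the format. Escaped characters inside the
form are left as they appear in the field.

```python
import re
from typing import List, NamedTuple


class PToken(NamedTuple):
    """Hypothetical record holding the leading fields of one p-token."""
    id: int
    start: int   # start vertex
    end: int     # end vertex
    cfrom: int   # character span start
    cto: int     # character span end
    form: str


TOKEN_RE = re.compile(
    r'\(\s*(\d+),\s*(\d+),\s*(\d+),'  # id, start vertex, end vertex
    r'\s*<(\d+):(\d+)>,'              # character span <from:to>
    r'\s*\d+,'                        # path number (ignored here)
    r'\s*"((?:[^"\\]|\\.)*)"'         # quoted form
)


def parse_p_tokens(field: str) -> List[PToken]:
    """Extract every entry of the p-tokens field that matches TOKEN_RE."""
    tokens = []
    for m in TOKEN_RE.finditer(field):
        ident, start, end, cfrom, cto = (int(g) for g in m.groups()[:5])
        tokens.append(PToken(ident, start, end, cfrom, cto, m.group(6)))
    return tokens
```

The pattern tolerates line breaks between fields, so a field that has
been re-wrapped by a mail client still parses.
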
Unless there are more gotchas I haven't noticed, this is pretty close so
maybe close enough is good enough (or I could just fudge it and uppercase
the first token). But if anyone has any further ideas regarding improving
this approach, they would be most welcome.
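
One way the "fudge" could be made a little safer than blindly uppercasing
the first token is sketched below. It assumes the token's character span
(the <from:to> field) indexes directly into the i-input string, and it
takes the casing from the i-input only when that slice matches the form
apart from case, so slices polluted by stripped markup are ignored. The
function name and its parameters are hypothetical.

```python
def restore_case_from_input(form: str, cfrom: int, cto: int, i_input: str) -> str:
    """Use the i-input slice when it differs from the form only in casing."""
    original = i_input[cfrom:cto]
    if original != form and original.lower() == form.lower():
        return original
    # Slice is identical, or differs by more than case: keep the form.
    return form
```

For the token above, the slice <0:8> of an i-input beginning with
'Although' is 'Although', which matches "although" apart from case and so
would be used in its place.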

Cheers,
Ned

On Thu, Feb 13, 2014 at 6:16 PM, Ned Letcher <nletcher at gmail.com> wrote:

> Thanks everyone for the suggestions and input; it's been most helpful. I
> think that in my current setup it would be easier if I didn't have to
> invoke a separate tool, so it sounds like Angelina's approach of using the
> internal tokens and then comparing with the orthography of the tokens in
> the chart that occupy the same position --  while still somewhat fiddly --
> might be the way to go.
>
> Ned
>
>
> On Thu, Feb 6, 2014 at 8:41 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> indeed, following PTB conventions, the splitting of 'constraint-based' or
>> '1/2' happens in token mapping, and hence i believe working off internal
>> tokens (where there is a one-to-one correspondence to leaf nodes of the
>> derivation) should be less fiddly.  recovering capitalization, i think is
>> the only challenge at this level, and looking in the same chart cell for
>> another token that was not downcased (as angelina does) seems relatively
>> straightforward to me.
>>
>> ned, i forgot: you could also just use the dependency converter, to
>> obtain a token sequence corresponding to the derivation leaves. a little
>> round-about but probably easy to pull off.  interested in instructions?
>>
>> oe
>> On Feb 6, 2014 9:33 AM, "Woodley Packard" <sweaglesw at sweaglesw.org>
>> wrote:
>>
>>> You're right Bec that I overlooked the question of where to put spaces.
>>>  My suggestion also doesn't work for cases where the p-input tokens get
>>> split in half by token mapping (e.g. "the blue-colored dog").  I believe
>>> both problems are solvable, but it starts to get fiddly.  Probably best to
>>> scratch that idea.
>>>
>>> Woodley
>>>
>>> On Feb 6, 2014, at 12:10 AM, Bec Dridan <bec.dridan at gmail.com> wrote:
>>>
>>> Concatenating the p-input tokens will mostly get you what you want. I
>>> think you might run into issues with leaves with spaces though, if you
>>> always concatenate. You'll need some extra checking of the span between
>>> tokens, possibly just the immediately adjacent characters.  I can still
>>> imagine some combinations of wiki markup, punctuation and
>>> words-with-spaces that will cause problems, but I believe they would be
>>> rare.
>>>
>>> I still think it could be useful to not downcase at the end of token
>>> mapping, but just at the time of lexicon lookup. It wouldn't solve all
>>> these problems, but it would retain useful information in a more accessible
>>> way.
>>>
>>> bec
>>>
>>>
>>> On Thu, Feb 6, 2014 at 6:56 AM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
>>>
>>>> Hi Ned and Stephan,
>>>>
>>>> Actually, I think you may want to look at the p-input field of the
>>>> parse relation.  These are the tokens that come out of REPP, i.e. the input
>>>> to token mapping.  There is no ambiguity at this point, the bad characters
>>>> are already removed, and case is preserved.  What I would suggest is to
>>>> concatenate the strings of all tokens contained in the character offset you
>>>> got from the derivation tree.
>>>>
>>>> In the case of the example you referenced, the p-input field contains
>>>> (among other tokens) the following:
>>>>
>>>> (21, 20, 21, <137:144>, 1, "control", 0, "null", "NN" 1.0000)
>>>> (22, 21, 22, <146:147>, 1, ",", 0, "null", "," 1.0000)
>>>>
>>>> All this headache is brought about by the extra wiki markup embedded in
>>>> the input string, which IMHO is not English.  If you put English in, taking
>>>> the substring directly out of the input string will give you something more
>>>> worth looking at :-)
>>>>
>>>> -Woodley
>>>>
>>>> On Feb 5, 2014, at 9:21 PM, Ned Letcher <nletcher at gmail.com> wrote:
>>>>
>>>> What Woodley described is I think what's going on. It looks like the
>>>> START/END offsets returned by the lkb:repp call are different to that of
>>>> the +FROM/+TO offsets found in the derivation. The derivation I'm using is
>>>> from the gold tree in the logon repository: i-id 10032820 in
>>>> $LOGONROOT/lingo/erg/tsdb/gold/ws01/result.gz. (also, for some reason I
>>>> just get NIL when I try to evaluate that repp function call in my lisp
>>>> buffer after loading logon)
>>>>
>>>> From comparing derivations and the relevant portions of the p-tokens
>>>> field, it looks to me like using the +FORM feature of the token has the same
>>>> effect as extracting the string from the p-tokens field for the token that
>>>> was ultimately used? But as you say, this still leaves the issue of the
>>>> correct casing. Angelina's workaround is a good suggestion, but definitely
>>>> feels like a hack. It seems like it would be desirable to be keeping track
>>>> of the value of tokens after REPP normalization but before downcasing for
>>>> lexicon lookup. I was talking about this with Bec also, and while her
>>>> problem was slightly different in that she only needed features for
>>>> ubertagging rather than the original surface form, she said she was also
>>>> struggling with this limitation.
>>>>
>>>> Ned
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 11:26 AM, Woodley Packard <
>>>> sweaglesw at sweaglesw.org> wrote:
>>>>
>>>>> I believe what happened in this particular case is that "control" and
>>>>> the following "," punctuation token got combined, resulting in a contiguous
>>>>> CFROM/CTO span of 137 to 147, which includes not only "control" and "," but
>>>>> also the deleted text in the middle.
>>>>>
>>>>> -Woodley
>>>>>
>>>>> On Feb 5, 2014, at 2:24 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>>>
>>>>> > hi ned,
>>>>> >
>>>>> > the practical challenge you are facing is deeply interesting.  i
>>>>> > believe both angelina (in conversion to bi-lexical dependencies)
>>>>> > and bec (working on ubertagging in PET) have looked at this.
>>>>> >
>>>>> > the derivation trees include the identifiers of internal tokens
>>>>> > (an integer immediately preceding the token feature structure),
>>>>> > and these tokens you can retrieve from the :p-tokens field in
>>>>> > reasonably up-to-date [incr tsdb()] profiles.  this will give you
>>>>> > the strings that were used for lexical lookup.  capitalization is
>>>>> > lost at this point, more often than not, hence one needs to do
>>>>> > something approximative in addition to finding the token.  for
>>>>> > all i recall, angelina compares the actual token to others that
>>>>> > have the same position in the chart (by start and end vertex,
>>>>> > as recorded in the :p-tokens format); in case she finds one
>>>>> > whose orthography differs from the downcased string, then
>>>>> > she uses that token instead.  bec, on the other hand, i think
>>>>> > consults the +CASE feature in the token feature structure.
>>>>> >
>>>>> > underlying all this, i suspect there is a question of what the
>>>>> > characterization of initial tokens really should be, e.g. when
>>>>> > we strip wiki markup at the REPP level.  but i seem unable
>>>>> > to reproduce the particular example you give:
>>>>> >
>>>>> > TSNLP(88): (setf string
>>>>> >             "Artificial intelligence has successfully been used in a
>>>>> > wide range of fields including [[medical diagnosis]], [[stock
>>>>> > trading]], [[robot control]], [[law]], scientific discovery and
>>>>> > toys.")
>>>>> > "Artificial intelligence has successfully been used in a wide range of
>>>>> > fields including [[medical diagnosis]], [[stock trading]], [[robot
>>>>> > control]], [[law]], scientific discovery and toys."
>>>>> >
>>>>> > TSNLP(89): (pprint (lkb::repp string :calls '(:xml :wiki :lgt :ascii
>>>>> > :quotes) :format :raw))
>>>>> > ...
>>>>> > #S(LKB::TOKEN :ID 20 :FORM "control" :STEM NIL :FROM 20 :TO 21 :START
>>>>> > 137 :END 144 :TAGS NIL :ERSATZ NIL)
>>>>> > ...
>>>>> >
>>>>> > TSNLP(90): (subseq string 137 144)
>>>>> > "control"
>>>>> >
>>>>> > i don't doubt the problem is real, but out of curiosity: how did
>>>>> > you produce your derivations?
>>>>> >
>>>>> > all best, oe
>>>>> >
>>>>> > On Wed, Feb 5, 2014 at 5:02 AM, Ned Letcher <nletcher at gmail.com> wrote:
>>>>> >> Hi all,
>>>>> >>
>>>>> >> I'm trying to export DELPH-IN derivation trees for use in the Fangorn
>>>>> >> treebank querying tool (which uses PTB style trees for importing) and
>>>>> >> have run into a hiccup extracting the string to use for the leaves of
>>>>> >> the trees. Fangorn does not support storing the original input string
>>>>> >> alongside the derivation, with the string used for displaying the
>>>>> >> original sentence being reconstructed by concatenating the leaves of
>>>>> >> the tree together.
>>>>> >>
>>>>> >> I've been populating the leaves of the exported PTB tree by extracting
>>>>> >> the relevant slice of the i-input string using the +FROM +TO offsets
>>>>> >> in the token information (if token mapping was used). One case I've
>>>>> >> found where this doesn't work so well (and there may be more), is
>>>>> >> where characters which have been stripped by REPP occur within a
>>>>> >> token, so these characters are then included in the slice. Wikipedia
>>>>> >> markup, for instance, results in these artefacts:
>>>>> >>
>>>>> >> "Artificial intelligence has successfully been used in a wide range of
>>>>> >> fields including medical diagnosis]], stock trading]], robot control]],
>>>>> >> law]], scientific discovery and toys."
>>>>> >>
>>>>> >> I also tried using the value of the +FORM feature, but it seems that
>>>>> >> this doesn't always preserve the casing of the original input string.
>>>>> >>
>>>>> >> Does anyone have any ideas for combating this problem?
>>>>> >>
>>>>> >> Ned
>>>>> >>
>>>>> >> --
>>>>> >> nedned.net
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> >
>>>>> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> > +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
>>>>> > +++    --- oe at ifi.uio.no; stephan at oepen.net; http://www.emmtee.net/oe/ ---
>>>>> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> nedned.net
>>>>
>>>>
>>>>
>>>
>>>
>
>
> --
> nedned.net
>

-- 
nedned.net