[developers] Extracting surface form of tokens from derivation trees

Thu Feb 6 06:21:33 CET 2014

I think what Woodley described is what's going on. It looks like the
START/END offsets returned by the lkb::repp call differ from the
+FROM/+TO offsets found in the derivation. The derivation I'm using is
from the gold tree in the logon repository: i-id 10032820 in
$LOGONROOT/lingo/erg/tsdb/gold/ws01/result.gz. (Also, for some reason I
just get NIL when I try to evaluate that repp function call in my Lisp
buffer after loading logon.)
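
To make the mismatch concrete, here is a small Python sketch (plain string
slicing, not part of any DELPH-IN tool). It takes the 137 to 144 span from
Stephan's REPP output and the 137 to 147 span Woodley mentions, both quoted
below, and slices the raw input with each. The helper name slice_span is
just something I made up for illustration.

```python
# Raw input string from the example, with the wiki markup still in place.
text = ("Artificial intelligence has successfully been used in a wide range of "
        "fields including [[medical diagnosis]], [[stock trading]], "
        "[[robot control]], [[law]], scientific discovery and toys.")

# START/END reported by lkb::repp for the token "control".
repp_span = (137, 144)
# Contiguous span reported once "control" and the following "," are combined.
merged_span = (137, 147)


def slice_span(s, span):
    """Return the substring of s covered by a (start, end) character span."""
    start, end = span
    return s[start:end]


print(slice_span(text, repp_span))    # control
print(slice_span(text, merged_span))  # control]],
```

The second slice picks up the "]]" that REPP deleted between the two
tokens, which is exactly the artefact showing up in my exported leaves.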

From comparing derivations and the relevant portions of the p-tokens field,
it looks to me like using the +FORM feature of the token has the same effect
as extracting the string from the p-tokens field for the token that was
ultimately used. But as you say, this still leaves the issue of the correct
casing. Angelina's workaround is a good suggestion, but definitely feels
like a hack. It seems desirable to keep track of the value of tokens after
REPP normalization but before downcasing for lexicon lookup. I was also
talking about this with Bec, and while her problem was slightly different,
in that she only needed features for ubertagging rather than the original
surface form, she said she was also struggling with this limitation.
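
For reference, this is roughly how I understand the workaround Stephan
attributes to Angelina (quoted below), sketched in Python. The Token record
and the sample tokens are hypothetical stand-ins, not the real :p-tokens
format, and the extra check that the two forms match when lowercased is my
own assumption rather than something taken from her code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical, simplified stand-in for one entry of the p-tokens field."""
    id: int
    start: int  # start vertex in the token chart
    end: int    # end vertex in the token chart
    form: str


def recover_cased_form(token, all_tokens):
    """Look for a token in the same chart cell whose form differs only in case.

    Falls back to the token's own form if no such token exists.
    """
    for other in all_tokens:
        same_cell = (other.start, other.end) == (token.start, token.end)
        differs = other.form != token.form
        same_letters = other.form.lower() == token.form.lower()
        if same_cell and differs and same_letters:
            return other.form
    return token.form


# Invented sample data: token 42 is the downcased variant used for lookup.
tokens = [
    Token(1, 0, 1, "Artificial"),
    Token(42, 0, 1, "artificial"),
    Token(2, 1, 2, "intelligence"),
]

print(recover_cased_form(tokens[1], tokens))  # Artificial
print(recover_cased_form(tokens[2], tokens))  # intelligence
```

This only helps when a differently cased token survives in the same cell,
which is part of why it feels like a hack to me.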

Ned

On Thu, Feb 6, 2014 at 11:26 AM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:

> I believe what happened in this particular case is that "control" and the
> following "," punctuation token got combined, resulting in a contiguous
> CFROM/CTO span of 137 to 147, which includes not only "control" and "," but
> also the deleted text in the middle.
>
> -Woodley
>
> On Feb 5, 2014, at 2:24 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> > hi ned,
> >
> > the practical challenge you are facing is deeply interesting.  i
> > believe both angelina (in conversion to bi-lexical dependencies)
> > and bec (working on ubertagging in PET) have looked at this.
> >
> > the derivation trees include the identifiers of internal tokens
> > (an integer immediately preceding the token feature structure),
> > and these tokens you can retrieve from the :p-tokens field in
> > reasonably up-to-date [incr tsdb()] profiles.  this will give you
> > the strings that were used for lexical lookup.  capitalization is
> > lost at this point, more often than not, hence one needs to do
> > something approximative in addition to finding the token.  for
> > all i recall, angelina compares the actual token to others that
> > have the same position in the chart (by start and end vertex,
> > as recorded in the :p-tokens format); in case she finds one
> > whose orthography differs from the downcased string, then
> > she uses that token instead.  bec, on the other hand, i think
> > consults the +CASE feature in the token feature structure.
> >
> > underlying all this, i suspect there is a question of what the
> > characterization of initial tokens really should be, e.g. when
> > we strip wiki markup at the REPP level.  but i seem unable
> > to reproduce the particular example you give:
> >
> > TSNLP(88): (setf string
> >             "Artificial intelligence has successfully been used in a
> > wide range of fields including [[medical diagnosis]], [[stock
> > trading]], [[robot control]], [[law]], scientific discovery and
> > toys.")
> > "Artificial intelligence has successfully been used in a wide range of
> > fields including [[medical diagnosis]], [[stock trading]], [[robot
> > control]], [[law]], scientific discovery and toys."
> >
> > TSNLP(89): (pprint (lkb::repp string :calls '(:xml :wiki :lgt :ascii
> > :quotes) :format :raw))
> > ...
> > #S(LKB::TOKEN :ID 20 :FORM "control" :STEM NIL :FROM 20 :TO 21 :START
> > 137 :END 144 :TAGS NIL :ERSATZ NIL)
> > ...
> >
> > TSNLP(90): (subseq string 137 144)
> > "control"
> >
> > i don't doubt the problem is real, but out of curiosity: how did
> > you produce your derivations?
> >
> > all best, oe
> >
> > On Wed, Feb 5, 2014 at 5:02 AM, Ned Letcher <nletcher at gmail.com> wrote:
> >> Hi all,
> >>
> >> I'm trying to export DELPH-IN derivation trees for use in the Fangorn
> >> treebank querying tool (which uses PTB style trees for importing) and
> >> have run into a hiccup extracting the string to use for the leaves of
> >> the trees. Fangorn does not support storing the original input string
> >> alongside the derivation, with the string used for displaying the
> >> original sentence being reconstructed by concatenating the leaves of
> >> the tree together.
> >>
> >> I've been populating the leaves of the exported PTB tree by extracting
> >> the relevant slice of the i-input string using the +FROM +TO offsets in
> >> the token information (if token mapping was used). One case I've found
> >> where this doesn't work so well (and there may be more), is where
> >> characters which have been stripped by REPP occur within a token, so
> >> these characters are then included in the slice. Wikipedia markup, for
> >> instance, results in these artefacts:
> >>
> >> "Artificial intelligence has successfully been used in a wide range of
> >> fields including medical diagnosis]], stock trading]], robot control]],
> >> law]], scientific discovery and toys."
> >>
> >> I also tried using the value of the +FORM feature, but it seems that
> >> this doesn't always preserve the casing of the original input string.
> >>
> >> Does anyone have any ideas for combating this problem?
> >>
> >> Ned
> >>
> >> --
> >> nedned.net
> >
> >
> >
> > --
> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> > +++    --- oe at ifi.uio.no; stephan at oepen.net; http://www.emmtee.net/oe/ ---
> > +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>

-- 
nedned.net