[developers] Extracting surface form of tokens from derivation trees

Thu Feb 6 01:26:42 CET 2014

I believe what happened in this particular case is that "control" and the following "," punctuation token got combined, resulting in a contiguous CFROM/CTO span of 137 to 147, which includes not only "control" and "," but also the deleted text in the middle.
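
To make the arithmetic concrete, here is a small Python sketch. It is only an illustration, not something taken from the parser: the span for "control" is the one REPP reports in Stephan's output quoted below, the span for the comma is what I count in the same raw string, and `merged_span` is a hypothetical helper standing in for however the two tokens ended up combined.

```python
# Raw input string from Ned's example, wiki markup intact.
text = ("Artificial intelligence has successfully been used in a wide range of "
        "fields including [[medical diagnosis]], [[stock trading]], [[robot "
        "control]], [[law]], scientific discovery and toys.")

# Character spans (start, end) of two adjacent initial tokens.
control = (137, 144)  # as reported by REPP for "control"
comma = (146, 147)    # assumed: the "," that follows the stripped "]]"

def merged_span(a, b):
    """Hypothetical helper: the contiguous span covering both tokens."""
    return (min(a[0], b[0]), max(a[1], b[1]))

start, end = merged_span(control, comma)

# Slicing the raw string over the merged span drags in the deleted markup.
print(text[start:end])  # control]],

# Slicing each token on its own and joining the pieces leaves it out.
print("".join(text[s:e] for s, e in (control, comma)))  # control,
```

So if the spans of the individual tokens are still available, slicing them separately should avoid the "]]" artefacts; whether they are depends on what the derivation records.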

-Woodley

On Feb 5, 2014, at 2:24 PM, Stephan Oepen <oe at ifi.uio.no> wrote:

> hi ned,
> 
> the practical challenge you are facing is deeply interesting.  i
> believe both angelina (in conversion to bi-lexical dependencies)
> and bec (working on ubertagging in PET) have looked at this.
> 
> the derivation trees include the identifiers of internal tokens
> (an integer immediately preceding the token feature structure),
> and these tokens you can retrieve from the :p-tokens field in
> reasonably up-to-date [incr tsdb()] profiles.  this will give you
> the strings that were used for lexical lookup.  capitalization is
> lost at this point, more often than not, hence one needs to do
> something approximative in addition to finding the token.  for
> all i recall, angelina compares the actual token to others that
> have the same position in the chart (by start and end vertex,
> as recorded in the :p-tokens format); in case she finds one
> whose orthography differs from the downcased string, then
> she uses that token instead.  bec, on the other hand, i think
> consults the +CASE feature in the token feature structure.
> 
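
The chart-position heuristic described above could be sketched roughly as follows in Python. This is only one reading of the description, not angelina's actual code: the `Token` record and its field names are made up for illustration and do not mirror the real :p-tokens format, and the extra check that the two forms match once case is ignored is my own assumption.

```python
from collections import namedtuple

# Hypothetical, simplified token record; real :p-tokens entries carry more.
Token = namedtuple("Token", "id start_vertex end_vertex form")

def surface_form(token, all_tokens):
    """Prefer a token in the same chart cell whose form keeps the original case."""
    lowered = token.form.lower()
    cell = (token.start_vertex, token.end_vertex)
    for other in all_tokens:
        same_cell = (other.start_vertex, other.end_vertex) == cell
        # Same word when case is ignored, but not the downcased string itself.
        if same_cell and other.form.lower() == lowered and other.form != lowered:
            return other.form
    # No better candidate in this cell: fall back to the token's own form.
    return token.form

tokens = [
    Token(1, 0, 1, "Artificial"),    # initial token, original casing
    Token(42, 0, 1, "artificial"),   # internal token used for lexical lookup
    Token(2, 1, 2, "intelligence"),
]
print(surface_form(tokens[1], tokens))
print(surface_form(tokens[2], tokens))
```
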
> underlying all this, i suspect there is a question of what the
> characterization of initial tokens really should be, e.g. when
> we strip wiki markup at the REPP level.  but i seem unable
> to reproduce the particular example you give:
> 
> TSNLP(88): (setf string
>             "Artificial intelligence has successfully been used in a
> wide range of fields including [[medical diagnosis]], [[stock
> trading]], [[robot control]], [[law]], scientific discovery and
> toys.")
> "Artificial intelligence has successfully been used in a wide range of
> fields including [[medical diagnosis]], [[stock trading]], [[robot
> control]], [[law]], scientific discovery and toys."
> 
> TSNLP(89): (pprint (lkb::repp string :calls '(:xml :wiki :lgt :ascii :quotes) :format :raw))
> ...
> #S(LKB::TOKEN :ID 20 :FORM "control" :STEM NIL :FROM 20 :TO 21 :START 137 :END 144 :TAGS NIL :ERSATZ NIL)
> ...
> 
> TSNLP(90): (subseq string 137 144)
> "control"
> 
> i don't doubt the problem is real, but out of curiosity: how did
> you produce your derivations?
> 
> all best, oe
> 
> On Wed, Feb 5, 2014 at 5:02 AM, Ned Letcher <nletcher at gmail.com> wrote:
>> Hi all,
>> 
>> I'm trying to export DELPH-IN derivation trees for use in the Fangorn
>> treebank querying tool (which uses PTB style trees for importing) and have
>> run into a hiccup extracting the string to use for the leaves of the trees.
>> Fangorn does not support storing the original input string alongside the
>> derivation, with the string used for displaying the original sentence being
>> reconstructed by concatenating the leaves of the tree together.
>> 
>> I've been populating the leaves of the exported PTB tree by extracting the
>> relevant slice of the i-input string using the +FROM +TO offsets in the
>> token information (if token mapping was used). One case I've found where
>> this doesn't work so well (and there may be more), is where characters which
>> have been stripped by REPP occur within a token, so these characters are
>> then included in the slice. Wikipedia markup, for instance, results in these
>> artefacts:
>> 
>> "Artificial intelligence has successfully been used in a wide range of
>> fields including medical diagnosis]], stock trading]], robot control]],
>> law]], scientific discovery and toys."
>> 
>> I also tried using the value of the +FORM feature, but it seems that this
>> doesn't always preserve the casing of the original input string.
>> 
>> Does anyone have any ideas for combating this problem?
>> 
>> Ned
>> 
>> --
>> nedned.net