[developers] Extracting surface form of tokens from derivation trees

Ned Letcher nletcher at gmail.com
Wed Feb 5 05:02:46 CET 2014

Hi all,

I'm trying to export DELPH-IN derivation trees for use in the Fangorn
treebank querying tool (which imports PTB-style trees) and have run into
a hiccup extracting the string to use for the leaves of the trees.
Fangorn does not support storing the original input string alongside the
derivation; instead, the string used to display the original sentence is
reconstructed by concatenating the leaves of the tree.

I've been populating the leaves of the exported PTB tree by extracting the
relevant slice of the i-input string using the +FROM and +TO offsets in the
token information (if token mapping was used). One case I've found where
this doesn't work so well (and there may be more) is where characters
that have been stripped by REPP fall inside a token's span, so these
characters are then included in the slice. Wikipedia markup, for instance,
results in these artefacts:

"Artificial intelligence has successfully been used in a wide range of
fields including medical diagnosis]], stock trading]], robot control]],
law]], scientific discovery and toys."
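
To make the problem concrete, here is a minimal Python sketch of the
slicing approach. The raw string, the character offsets, and the helper
name leaves_from_offsets are all made up for illustration (they are not
taken from real parser output); the sketch assumes the comma ends up in
the same token as "diagnosis", so that token's span runs over the
stripped "]]".

```python
from typing import List, Tuple


def leaves_from_offsets(raw: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Return one leaf string per token by slicing the raw input.

    Each span is a (start, end) pair of character offsets into `raw`,
    standing in for a token's +FROM and +TO values.
    """
    return [raw[start:end] for start, end in spans]


# Hypothetical raw input containing Wikipedia link markup.
raw = "[[medical diagnosis]], [[stock trading]]"

# Hypothetical token spans. The second token is assumed to cover
# "diagnosis" plus the trailing comma, so its span also covers the "]]"
# that sits between them in the raw string.
spans = [(2, 9), (10, 22), (25, 30), (31, 38)]

leaves = leaves_from_offsets(raw, spans)
sentence = " ".join(leaves)
print(leaves)
print(sentence)
```

Under these assumptions the second leaf comes out as "diagnosis]]," rather
than "diagnosis,", which is the kind of artefact shown in the sentence
above.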

I also tried using the value of the +FORM feature, but it seems that this
doesn't always preserve the casing of the original input string.

Does anyone have any ideas for combating this problem?

