[developers] Extracting surface form of tokens from derivation trees

Stephan Oepen oe at ifi.uio.no
Thu Apr 24 21:31:11 CEST 2014

there's some documentation on initial and internal tokenization on the wiki:


adding to the comments by woodley, the token feature structures recorded
with each derivation also contain a list of (initial) token identifiers,
and both initial and internal tokens are stored in the profiles as well as
in the exports.  hence, relating EPs to sets of corresponding initial
tokens should be relatively straightforward.
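
as a rough illustration of that last step, here is a minimal python sketch.
it assumes the relevant information has already been read out of a profile
or export into two plain dictionaries; the names ep_to_internal,
internal_to_initial, and ep_token_spans are hypothetical and do not
correspond to any existing tool or file format.  each EP is mapped to the
inclusive range of initial token identifiers it covers, i.e. something like
a (tfrom, tto) pair.

```python
from typing import Dict, List, Tuple


def ep_token_spans(
    ep_to_internal: Dict[str, List[int]],
    internal_to_initial: Dict[int, List[int]],
) -> Dict[str, Tuple[int, int]]:
    """Map each EP to an inclusive (tfrom, tto) range of initial token ids.

    Both input dictionaries are assumed, illustrative structures:
    ep_to_internal lists the internal tokens behind each EP, and
    internal_to_initial lists the initial token ids recorded on each
    internal token.
    """
    spans = {}
    for ep, internal_ids in ep_to_internal.items():
        # Collect every initial token id reachable from this EP.
        initial_ids = sorted(
            {i for t in internal_ids for i in internal_to_initial.get(t, [])}
        )
        if initial_ids:
            # Smallest and largest id; any gap in between is also covered.
            spans[ep] = (initial_ids[0], initial_ids[-1])
    return spans


# Toy data for "Kim barked." with initial tokens 0=Kim, 1=barked, 2=.
# Internal token 1 stands for "barked." with the punctuation attached.
internal_to_initial = {0: [0], 1: [1, 2]}
ep_to_internal = {"named": [0], "_bark_v_1": [1]}
spans = ep_token_spans(ep_to_internal, internal_to_initial)
```

in the toy data, the internal token with attached punctuation corresponds
to two initial tokens, so its EP ends up with a range spanning both.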

the one reusable tool to post-process ERG exports into PTB-style (aka
initial) tokenization is the DTM converter by angelina.  i believe trying
DM bi-lexical dependency graphs (i.e. what we used in the SemEval 2014
context) for SMT could be very interesting, and i would be happy to assist.

all best, oe
On Apr 24, 2014 8:44 PM, "Ann Copestake" <Ann.Copestake at cl.cam.ac.uk> wrote:

> Hi All,
> So we're (myself and Matic) looking at a somewhat related issue, which is
> mapping the MRS to tokens in an SMT system - more details about the MRS/SMT
> approach soonish.  The issue is that the ERG tokenisation doesn't correspond
> to the sort of tokenisation the SMT system would expect - we can use
> different tokenisers in the SMT approach, but the attachment of punctuation
> to the token would be problematic if we used the ERG notion of a token.  The
> tentative solution is to map the MRS EPs to the TNT tokens (or whatever the
> dumb tokeniser is).  We'd then end up with a tfrom, tto annotation of some
> description.
> a) has anyone done this with the Wikiwoods data?  is it doable?
> b) are there cases where one TNT token corresponds to two ERG tokens?
> Sorry if this is a bit cryptic - I'm in the process of downloading
> 1212/export0.tar and will give a specific example when I've done that if
> that's helpful.
> Ann
