[developers] Extracting surface form of tokens from derivation trees

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Thu Apr 24 20:43:48 CEST 2014

Hi All,

Matic and I are looking at a somewhat related issue: mapping the MRS to 
tokens in an SMT system (more details about the MRS/SMT approach soonish).  
The issue is that the ERG tokenisation doesn't correspond to the sort of 
tokenisation the SMT system would expect.  We can use different tokenisers 
in the SMT approach, but the attachment of punctuation to the token would be 
problematic if we used the ERG notion of a token.  The tentative 
solution is to map the MRS EPs to the TNT tokens (or whatever the dumb 
tokeniser is).  We'd then end up with a tfrom, tto annotation of some 

a) Has anyone done this with the WikiWoods data?  Is it doable?

b) Are there cases where one TNT token corresponds to two ERG tokens?

Sorry if this is a bit cryptic - I'm in the process of downloading 
1212/export0.tar and will give a specific example when I've done that if 
that's helpful.
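
In the meantime, here is a rough sketch of the kind of mapping meant above. It 
assumes each EP carries a character span (cfrom, cto) into the original string 
and that the simple tokeniser can report character offsets too. The regex 
tokeniser is only a stand-in for TNT or whatever tokeniser is actually used, 
and the predicate names, spans and function names are made up for 
illustration rather than taken from a real ERG analysis.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def simple_tokenise(text: str) -> List[Span]:
    """Stand-in 'dumb' tokeniser: words and single punctuation marks,
    returned as character spans. Not the real TNT tokenisation."""
    return [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(span: Span, tokens: List[Span]) -> Optional[Span]:
    """Map a character span to (tfrom, tto): the inclusive indices of the
    first and last token overlapping it, or None if no token overlaps."""
    cfrom, cto = span
    hits = [i for i, (start, end) in enumerate(tokens)
            if start < cto and end > cfrom]
    return (hits[0], hits[-1]) if hits else None


text = "The dog barked, loudly."
tokens = simple_tokenise(text)

# Hypothetical EPs with ERG-style spans, where punctuation is attached
# to the preceding word ("barked," and "loudly.").
eps = [
    ("_the_q", (0, 3)),
    ("_dog_n_1", (4, 7)),
    ("_bark_v_1", (8, 15)),
    ("_loud_a_1", (16, 23)),
]

# Each EP gets a (tfrom, tto) pair over the simple tokeniser's tokens.
mapped = {pred: char_span_to_token_span(span, tokens) for pred, span in eps}
for pred, tspan in mapped.items():
    print(pred, tspan)
```

With spans like these, an EP whose ERG token includes attached punctuation 
ends up covering two of the simple tokens, so tfrom and tto differ. The 
reverse case in (b) would show up as two EPs' source tokens overlapping the 
same simple token.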


