[developers] Extracting surface form of tokens from derivation trees
Ann.Copestake at cl.cam.ac.uk
Thu Apr 24 20:43:48 CEST 2014
So we're (myself and Matic) looking at a somewhat related issue, which is
mapping the MRS to tokens in an SMT system - more details about the MRS/SMT
approach soonish. The issue is that the ERG tokenisation doesn't correspond
to the sort of tokenisation the SMT system would expect - we can use different
tokenisers in the SMT approach, but the attachment of punctuation to the token
would be problematic if we used the ERG notion of a token. The tentative
solution is to map the MRS EPs (elementary predications) to the TNT tokens
(or whatever the dumb tokeniser is), rather than to ERG tokens. We'd then end
up with a tfrom, tto annotation of some
a) Has anyone done this with the Wikiwoods data? Is it doable?
b) Are there cases where one TNT token corresponds to two ERG tokens?
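
To make the intended mapping concrete, here is a minimal sketch. It assumes each EP carries a character span (cfrom, cto) into the raw string and that the dumb tokeniser can report character spans too. The regex tokeniser below is only a stand-in for TNT's, and all function names and example spans are made up for illustration; none of them come from the ERG or the Wikiwoods export.

```python
import re


def dumb_tokens(text):
    """Stand-in tokeniser: words and single punctuation marks, as (start, end) spans."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def to_token_span(cfrom, cto, token_spans):
    """Map the character span [cfrom, cto) to (tfrom, tto): the indices of the
    first and last token overlapping it, or None if no token overlaps."""
    hits = [i for i, (s, e) in enumerate(token_spans) if s < cto and cfrom < e]
    return (hits[0], hits[-1]) if hits else None


def shared_tokens(erg_spans, token_spans):
    """Indices of dumb tokens overlapped by more than one ERG-style span
    (the situation asked about in question b)."""
    shared = []
    for i, (s, e) in enumerate(token_spans):
        overlapping = [1 for (cf, ct) in erg_spans if s < ct and cf < e]
        if len(overlapping) > 1:
            shared.append(i)
    return shared


text = "The dog barked, loudly."
tokens = dumb_tokens(text)

# Hypothetical EP span covering "barked," with the comma attached to the word.
print(to_token_span(8, 15, tokens))   # (2, 3): "barked" and ","

# Made-up spans: one dumb token (0, 5) covered by two ERG-style spans.
print(shared_tokens([(0, 2), (2, 5)], [(0, 5)]))   # [0]
```

An EP whose span includes attached punctuation simply gets a tto one token further along, so the punctuation attachment stops being a problem on the SMT side. A non-empty result from `shared_tokens` would flag the sentences that need a closer look.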
Sorry if this is a bit cryptic - I'm in the process of downloading
1212/export0.tar and will give a specific example when I've done that if