<div dir="ltr"><div><div><div><div><div>Hi Ann,<br><br></div>I have spent a fair bit of time playing with some of these issues. I have code that aligns the leaves of the derivation tree with the tokens produced by REPP (which was designed to mirror PTB tokenisation, used in other systems) that you'd be welcome to use, although it may not fit your use case exactly. I wrote it for the external supertagger training. There's still a couple of issues though: <br>

* first, to answer your b): yes. The primary example is hyphenated words, which are split in the ERG but form a single token in TnT/REPP/PTB. That requires some mapping, but it is possible, and I already do this in my code, roughly along the lines of the sketch below.
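
The core of that mapping is just span containment over character offsets, and it also gives you the kind of tfrom/tto token indices you mention below, since the EPs carry cfrom/cto offsets you can look up. A minimal sketch, not my actual code - the (form, cfrom, cto) token triples and all the names here are made up for illustration:

    def align_tokens(erg_tokens, tnt_tokens):
        """Map each fine-grained ERG token to the index of the coarser
        TnT/PTB-style token that covers it, by span containment. Several
        ERG tokens (e.g. the pieces of a hyphenated word) may share one
        index. Both inputs are lists of (form, cfrom, cto) triples."""
        mapping = []
        for form, cfrom, cto in erg_tokens:
            for i, (_, tfrom, tto) in enumerate(tnt_tokens):
                if tfrom <= cfrom and cto <= tto:
                    mapping.append((form, i))
                    break
            else:
                mapping.append((form, None))  # no covering TnT token
        return mapping

    # 'blue-green' is one TnT/PTB token but (roughly) two ERG tokens:
    erg = [("blue-", 0, 5), ("green", 5, 10)]
    tnt = [("blue-green", 0, 10)]
    print(align_tokens(erg, tnt))  # [('blue-', 0), ('green', 0)]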

* the second issue, given you are looking at WikiWoods, could be wiki markup, depending on what input you are starting from. Suppose you have an item like this one (from WeScience):

ws01 10032700 Such tests have been termed [[subject matter expert Turing test]]s.

REPP will give you 'tests' and '.' as the final tokens, and the derivation tree will tell you the two token IDs of its final token 'tests.', but the character offsets will say that 'tests.' is an 8-character token (i.e. 'test]]s.').
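
To make the mismatch concrete (the offsets here are hand-computed over the raw string of this example, not taken from an actual export, where they live in the token lattice):

    raw = "Such tests have been termed [[subject matter expert Turing test]]s."

    surface = "tests."    # final token of the derivation tree
    cfrom, cto = 59, 67   # its character span over the raw markup

    span = raw[cfrom:cto]
    print(repr(span), cto - cfrom)  # 'test]]s.' 8
    assert span != surface          # the span still includes the ']]'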

Whether that is an issue for you will depend on your setup.

Bec

On Thu, Apr 24, 2014 at 8:43 PM, Ann Copestake <Ann.Copestake@cl.cam.ac.uk> wrote:
> Hi All,
>
> So we (Matic and I) are looking at a somewhat related issue, which is
> mapping the MRS to tokens in an SMT system - more details about the MRS/SMT
> approach soonish. The issue is that the ERG tokenisation doesn't correspond
> to the sort of tokenisation the SMT system would expect. We can use different
> tokenisers in the SMT approach, but the attachment of punctuation to the token
> would be problematic if we used the ERG notion of a token. The tentative
> solution is to map the MRS EPs to the TnT tokens (or whatever the dumb
> tokeniser is). We'd then end up with a tfrom, tto annotation of some
> description.
>
> a) Has anyone done this with the WikiWoods data? Is it doable?
>
> b) Are there cases where one TnT token corresponds to two ERG tokens?
>
> Sorry if this is a bit cryptic - I'm in the process of downloading
> 1212/export0.tar and will give a specific example when I've done that, if
> that's helpful.
>
> Ann