[developers] Extracting surface form of tokens from derivation trees

Bec Dridan bec.dridan at gmail.com
Thu Apr 24 21:23:11 CEST 2014


Hi Ann,

I have spent a fair bit of time playing with some of these issues. I have
code that aligns the leaves of the derivation tree with the tokens produced
by REPP (which was designed to mirror PTB tokenisation, as used in other
systems). You'd be welcome to use it, although it may not fit your use
case exactly; I wrote it for the external supertagger training. There are
still a couple of issues, though:

* first, to answer your b) - yes. The primary example is hyphenated words,
which are split in the ERG but form a single token in TnT/REPP/PTB. That
requires some mapping, but it is possible, and I already do this in my code.

* the second issue, given you are looking at WikiWoods, could be wiki
markup, depending on what input you are starting from. If you have an item
like (from WeScience)

ws01 10032700 Such tests have been termed [[subject matter expert Turing
test]]s.

REPP will give you 'tests' and '.' as the final tokens, and the derivation
tree will tell you the two token IDs of its final token 'tests.', but the
character offsets will say that 'tests.' is an 8-character token (i.e.
'test]]s.', markup included).
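
To make the mismatch concrete, here is a minimal Python sketch. It is not
REPP and not the alignment code mentioned above; it assumes the item is a
single raw line, that the leaf's character span is measured on that raw
line, and that a hypothetical helper (strip_wiki_brackets) is enough to
recover the surface form for this one example.

```python
# The WeScience item above, joined onto one line (an assumption about the raw input).
raw = "Such tests have been termed [[subject matter expert Turing test]]s."

# Assumed character span of the final derivation leaf, measured on the raw string.
start = raw.rindex("test]]s.")
end = len(raw)
raw_slice = raw[start:end]


def strip_wiki_brackets(text: str) -> str:
    """Drop '[[' and ']]' link brackets (hypothetical helper, not part of REPP)."""
    return text.replace("[[", "").replace("]]", "")


surface = strip_wiki_brackets(raw_slice)
print(repr(raw_slice), len(raw_slice))
print(repr(surface), len(surface))
```

The raw slice is the 8-character 'test]]s.', while the stripped surface form
is the 6-character 'tests.', so offsets taken from one cannot be applied
directly to the other.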

Whether that is an issue for you will depend on your setup.
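
For the general mapping between derivation leaves and REPP/PTB-style
tokens, one possible approach is to align by overlapping character spans.
The sketch below is only an illustration under stated assumptions: both
sides carry (start, end) offsets into the same string, the spans are made
up for the example, and the names align_by_span and overlaps are
hypothetical. It is not the supertagger alignment code described above.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def align_by_span(leaves: List[Span], tokens: List[Span]) -> List[List[int]]:
    """For each leaf, list the indices of the tokens it overlaps."""
    return [
        [j for j, tok in enumerate(tokens) if overlaps(leaf, tok)]
        for leaf in leaves
    ]


text = "A well-known test."

# Illustrative spans only; real ones would come from the derivation and from REPP.
leaves = [(0, 1), (2, 7), (7, 12), (13, 18)]   # "A", "well-", "known", "test."
tokens = [(0, 1), (2, 12), (13, 17), (17, 18)]  # "A", "well-known", "test", "."

alignment = align_by_span(leaves, tokens)
for leaf, token_ids in zip(leaves, alignment):
    print(repr(text[leaf[0]:leaf[1]]), "->", token_ids)
```

Here two leaves map to the single token "well-known", and the one leaf
"test." maps to two tokens, which covers both directions of mismatch. The
approach breaks down when the offsets refer to different strings, as in the
wiki markup case.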

Bec


On Thu, Apr 24, 2014 at 8:43 PM, Ann Copestake
<Ann.Copestake at cl.cam.ac.uk> wrote:

>
> Hi All,
>
> So we're (myself and Matic) looking at a somewhat related issue, which is
> mapping the MRS to tokens in an SMT system - more details about the MRS/SMT
> approach soonish.  The issue is that the ERG tokenisation doesn't correspond
> to the sort of tokenisation the SMT system would expect - we can use
> different tokenisers in the SMT approach, but the attachment of punctuation
> to the token would be problematic if we used the ERG notion of a token.  The
> tentative solution is to map the MRS EPs to the TNT tokens (or whatever the
> dumb tokeniser is).  We'd then end up with a tfrom, tto annotation of some
> description.
>
> a) has anyone done this with the Wikiwoods data?  is it doable?
>
> b) are there cases where one TNT token corresponds to two ERG tokens?
>
> Sorry if this is a bit cryptic - I'm in the process of downloading
> 1212/export0.tar and will give a specific example when I've done that if
> that's helpful.
>
> Ann