[developers] Non-deterministic tokenisation with REPP

Sat Mar 12 19:28:15 CET 2011

hi berthold,

REPP was designed to simplify the initial layer of tokenization,
aiming for compatibility with PTB conventions.  hence, token-level
ambiguity is reserved for the token mapping phase, which, i am
afraid, to date remains available only in PET.

bad news from the HaG point of view, i realize.  in principle, for the  
LKB, you could go back to SPPP (which had some support for  
tokenization ambiguity); but that is officially unsupported and should  
eventually be purged from the code base.  or you could
‘micro-tokenize’ and simulate multi-token combination effects in phrase
structure rules; but i suspect that might be inadequate for your  
orthographemic needs?
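
to make the kind of tokenization ambiguity at stake concrete, here is a
minimal sketch in Python of a token lattice that keeps every original
token and adds a merged alternative wherever a token is followed by a
putative affix.  this is not REPP, SPPP, or PET chart mapping syntax;
the function names and the placeholder forms ("stem", "pro", "other")
are hypothetical and only illustrate the data structure a parser would
have to accept.

```python
from typing import List, Set, Tuple

# An edge spans two lattice vertices and carries a surface form.
Edge = Tuple[int, int, str]


def token_lattice(tokens: List[str], affixes: Set[str]) -> List[Edge]:
    """Build a lattice with one edge per original token, plus one merged
    edge for each token that is directly followed by a putative affix.
    The affix inventory is supplied by the caller (hypothetical here)."""
    edges: List[Edge] = [(i, i + 1, tok) for i, tok in enumerate(tokens)]
    for i in range(len(tokens) - 1):
        if tokens[i + 1] in affixes:
            # Alternative analysis: host and affix as a single token.
            edges.append((i, i + 2, tokens[i] + tokens[i + 1]))
    return edges


def tokenizations(edges: List[Edge], start: int, end: int) -> List[List[str]]:
    """Enumerate every path through the lattice from start to end."""
    if start == end:
        return [[]]
    results: List[List[str]] = []
    for source, target, form in edges:
        if source == start:
            for rest in tokenizations(edges, target, end):
                results.append([form] + rest)
    return results


if __name__ == "__main__":
    tokens = ["stem", "pro", "other"]  # placeholder forms, not real data
    lattice = token_lattice(tokens, affixes={"pro"})
    for path in tokenizations(lattice, 0, len(tokens)):
        print(path)
```

with the placeholder input, the lattice contains both the original
three-token path and a two-token path in which "stem" and "pro" are
joined, so the grammar (rather than the tokenizer) decides between them.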

from my point of view, we should look for someone to implement chart  
mapping in the LKB, e.g. an MSc student.  students at UiO actually  
know Lisp (and some even the LKB), but this year none of them was  
interested in this specific project.  in case there were possible  
candidates elsewhere, it would seem reverse engineering the chart  
mapping formalism is doable, in principle, without insider access.  i  
have heard rumours about no less than two proprietary implementations  
of the DELPH-IN formalism which appear to have done it.

best, oe

On 11 March 2011, at 01:56, Berthold Crysmann
<crysmann at ifk.uni-bonn.de> wrote:

> Hi all,
>
> Is it currently possible to create alternate tokenisations with REPP?
> With PET chart mapping this is possible, so what I am looking for is an
> LKB solution for the following problem: I need to combine adjacent
> tokens into one but preserve the original tokenisation as well, in case
> I am dealing with unrelated items.
>
> Here's a concrete example: Hausa orthography separates off pronominal
> affixes of verbs but not of nouns. To arrive at a more sound treatment
> of pronominal affixes, I'd like to join putative pronominal affixes with
> the words preceding them and let the grammar sort out the rest. But
> unfortunately, I do also have to preserve the original tokenisation for
> homographs...
>
> I vaguely remember that something along these lines was possible at some
> point earlier, so I'd be happy about any pointers.
>
> BTW: what is the current status of CM in LKB?
>
> Cheers,
>
> Berthold