[developers] Non-deterministic tokenisation with REPP
Stephan Oepen
oe at ifi.uio.no
Sat Mar 12 19:28:15 CET 2011
hi berthold,
REPP was designed to simplify the initial layer of tokenization,
aiming for compatibility with PTB conventions. hence, token-level
ambiguity is reserved for the token mapping phase, which, i am
afraid, to date remains available only in PET.
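to make the idea concrete: token mapping keeps both the original
tokens and any combined token as alternative paths through a token
lattice, and leaves the choice to the grammar. the following Python
toy is only an illustration of that lattice idea — it is not LKB or
PET code, and the affix forms and example tokens are hypothetical
placeholders:

```python
# illustrative sketch (not DELPH-IN code): a token lattice of
# (start, end, form) edges. a combining rule adds a joined edge over
# two adjacent tokens while keeping the originals, so both
# tokenizations remain available downstream.

AFFIXES = {"shi", "ta"}  # hypothetical pronominal-affix forms

def add_combined_tokens(lattice):
    """Return the lattice with joined edges added alongside the originals."""
    combined = []
    for (s1, e1, f1) in lattice:
        for (s2, e2, f2) in lattice:
            # adjacent edges where the second looks like an affix
            if e1 == s2 and f2 in AFFIXES:
                combined.append((s1, e2, f1 + f2))
    return lattice + combined

tokens = [(0, 1, "naa"), (1, 2, "gan"), (2, 3, "shi")]
lattice = add_combined_tokens(tokens)
# the original three edges survive; a combined edge spanning the
# second and third cells is added as an alternative path
```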
bad news from the HaG point of view, i realize. in principle, for the
LKB, you could go back to SPPP (which had some support for
tokenization ambiguity); but that is officially unsupported and should
eventually be purged from the code base. or you could ‘micro-tokenize’
and simulate multi-token combination effects in phrase structure
rules; but i suspect that might be inadequate for your
orthographemic needs?
from my point of view, we should look for someone to implement chart
mapping in the LKB, e.g. an MSc student. students at UiO actually
know Lisp (and some even the LKB), but this year none of them was
interested in this specific project. should there be possible
candidates elsewhere, reverse engineering the chart mapping formalism
seems doable, in principle, without insider access. i have heard
rumours about no fewer than two proprietary implementations of the
DELPH-IN formalism which appear to have done just that.
best, oe
On 11 March 2011, at 01.56, Berthold Crysmann <crysmann at ifk.uni-bonn.de> wrote:
> Hi all,
>
> is it currently possible to create alternate tokenisations with REPP?
> With PET chart mapping this is possible, so what I am looking for is
> an LKB solution to the following problem: I need to combine adjacent
> tokens into one but preserve the original tokenisation as well, in
> case I am dealing with unrelated items.
>
> Here's a concrete example: Hausa orthography separates off pronominal
> affixes of verbs but not of nouns. To arrive at a sounder treatment
> of pronominal affixes, I'd like to join putative pronominal affixes
> with the words preceding them and let the grammar sort out the rest.
> Unfortunately, I also have to preserve the original tokenisation for
> homographs...
>
> I vaguely remember that something along these lines was possible at
> some earlier point, so I'd be grateful for any pointers.
>
> BTW: what is the current status of chart mapping in the LKB?
>
> Cheers,
>
> Berthold
More information about the developers mailing list