[developers] Non-deterministic tokenisation with REPP
Berthold Crysmann
crysmann at ifk.uni-bonn.de
Mon Mar 14 13:15:54 CET 2011
On Sun, 2011-03-13 at 12:52 +0100, Berthold Crysmann wrote:
> Hi Stephan
>
> thanks for your reply.
>
> On Sat, 2011-03-12 at 19:28 +0100, Stephan Oepen wrote:
> > hi berthold,
> >
> > REPP was designed to simplify the initial layer of tokenization,
> > aiming for compatibility with PTB conventions. hence, token-level
> > ambiguity is reserved for the token mapping phase. which, i am
> > afraid, to date remains available only in PET.
> >
>
> I feared as much.
>
> > bad news from the HaG point of view, i realize. in principle, for the
> > LKB, you could go back to SPPP (which had some support for
> > tokenization ambiguity); but that is officially unsupported and should
> > eventually be purged from the code base.
>
> Ok. Would it be possible not to purge that code unless we have a
> replacement, i.e. chart-mapping in LKB?
>
> That'll give me a chance to make sure that the development and runtime
> platforms are able to process roughly the same kind of input.
>
It looks like SPPP depends on an external tokenizer, right?
I wonder now what the state of affairs is with FSPP? Trying to load an
old version of GG, I do not get any chart entries for any input. The
token chart is also empty.
The rules appear to load, but do not seem to work as expected. Am I
missing something?
Cheers,
B
> > or you could ‘micro-tokenize’ and simulate multi-token combination
> > effects in phrase structure rules; but i suspect that might be
> > inadequate for your orthographemic needs?
> >
>
> It is indeed. To my mind, the chart-mapping formalism we have now bears
> the potential to develop grammars which are much closer to one's
> linguistic theory and at the same time avoid clumsy helper features just
> to control competition between morphological and pseudo-morphological
> expression.
>
> > from my point of view, we should look for someone to implement chart
> > mapping in the LKB, e.g. an MSc student. students at UiO actually
> > know Lisp (and some even the LKB), but this year none of them was
> > interested in this specific project. in case there were possible
> > candidates elsewhere, it would seem reverse engineering the chart
> > mapping formalism is doable, in principle, without insider access.
>
> That would certainly be the best, since it would enable me to get rid of
> the tone rules which I just keep for the LKB's purposes and there only in
> parsing. In PET, conversion of diacritics to autosegmental
> representation is entirely done by means of the CM mechanism. With
> generation, I have some functions that map internal tonal
> representations directly to diacritics in the output.
>
> Cheers,
>
> Berthold
>
> > i have heard rumours about no less than two proprietary
> > implementations of the DELPH-IN formalism which appear to have done it.
> >
>
>
> > best, oe
> >
> >
> > On 11 March 2011, at 01:56, Berthold Crysmann
> > <crysmann at ifk.uni-bonn.de> wrote:
> >
> > > Hi all,
> > >
> > > is it currently possible to create alternate tokenisations with REPP?
> > > With PET chart mapping this is possible, so what I am looking for is
> > > an LKB solution for the following problem: I need to combine adjacent
> > > tokens into one but preserve the original tokenisation as well, in
> > > case I am dealing with unrelated items.
> > >
> > > Here's a concrete example: Hausa orthography separates off pronominal
> > > affixes of verbs but not of nouns. To arrive at a more sound treatment
> > > of pronominal affixes, I'd like to join putative pronominal affixes
> > > with the words preceding them and let the grammar sort out the rest.
> > > But unfortunately, I do also have to preserve the original
> > > tokenisation for homographs...
> > >
> > > I vaguely remember that something along these lines was possible at
> > > some point earlier, so I'd be happy about any pointers.
> > >
> > > BTW: what is the current status of CM in LKB?
> > >
> > > Cheers,
> > >
> > > Berthold
> > >
> > >
> > >
> > >
>
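
To make the request in the original message at the bottom of this thread
concrete, here is a minimal Python sketch of the kind of token lattice meant
there. It is only an illustration and is not REPP, SPPP or PET code: the
function name token_lattice, the (start, end, form) edge format and the
placeholder strings "host", "aff" and "other" are all assumptions made up for
this example, and none of them is real Hausa data.

```python
from typing import List, Set, Tuple

# A chart edge: (start vertex, end vertex, surface form).
Edge = Tuple[int, int, str]


def token_lattice(tokens: List[str], affixes: Set[str]) -> List[Edge]:
    """Return edges for the original tokens plus merged alternatives.

    Hypothetical helper: whenever a token is followed by a putative
    affix, an extra edge spanning both is added, so the grammar can
    choose between the joined and the separate reading.
    """
    edges: List[Edge] = []
    for i, tok in enumerate(tokens):
        # Always keep the original tokenisation.
        edges.append((i, i + 1, tok))
        # Additionally join the token with a following putative affix.
        if i + 1 < len(tokens) and tokens[i + 1] in affixes:
            edges.append((i, i + 2, tok + tokens[i + 1]))
    return edges


if __name__ == "__main__":
    # Placeholder input, not actual Hausa.
    for edge in token_lattice(["host", "aff", "other"], {"aff"}):
        print(edge)
```

Both the joined edge and the two original edges end up in the lattice, so a
homograph of the affix can still be parsed as an independent word.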