[developers] Non-deterministic tokenisation with REPP

Sun Mar 13 12:52:42 CET 2011

Hi Stephan

thanks for your reply.

On Sat, 2011-03-12 at 19:28 +0100, Stephan Oepen wrote:
> hi berthold,
> 
> REPP was designed to simplify the initial layer of tokenization,  
> aiming for compatibility with PTB conventions.  hence, token-level  
> ambiguity is reserved for the token mapping phase.  which, i am  
> afraid, to date remains available only in PET.
> 

I feared as much. 

> bad news from the HaG point of view, i realize.  in principle, for the  
> LKB, you could go back to SPPP (which had some support for  
> tokenization ambiguity); but that is officially unsupported and should  
> eventually be purged from the code base.  

OK. Would it be possible not to purge that code until we have a
replacement, i.e. chart mapping in the LKB?

That would give me a chance to make sure that the development and
runtime platforms can process roughly the same kind of input.

> or you could ‘micro-tokenize’ and simulate multi-token combination
> effects in phrase structure rules; but i suspect that might be
> inadequate for your orthographemic needs?
> 

It is indeed. To my mind, the chart-mapping formalism we have now has
the potential to support grammars that are much closer to one's
linguistic theory and that, at the same time, avoid clumsy helper
features whose only job is to control competition between morphological
and pseudo-morphological expression.

> from my point of view, we should look for someone to implement chart  
> mapping in the LKB, e.g. an MSc student.  students at UiO actually  
> know Lisp (and some even the LKB), but this year none of them was  
> interested in this specific project.  in case there were possible  
> candidates elsewhere, it would seem reverse engineering the chart  
> mapping formalism is doable, in principle, without insider access.  

That would certainly be best, since it would enable me to get rid of
the tone rules, which I keep only for the LKB's purposes, and there only
for parsing. In PET, conversion of diacritics to autosegmental
representation is done entirely by means of the CM mechanism. For
generation, I have some functions that map internal tonal
representations directly to diacritics in the output.
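
To make the requirement from my original mail concrete, here is a
minimal sketch in Python of the token lattice I have in mind: every
whitespace token is kept, and a joined edge is added whenever a token
looks like a pronominal affix. The names `Edge` and `token_lattice`, the
one-item affix set and the sample string are purely illustrative
assumptions; this is neither REPP nor chart-mapping code.

```python
from typing import List, NamedTuple, Set


class Edge(NamedTuple):
    start: int  # index of the left chart vertex
    end: int    # index of the right chart vertex
    form: str   # surface form covering that span


def token_lattice(text: str, affixes: Set[str]) -> List[Edge]:
    """Return the original tokens plus joined alternatives as chart edges."""
    tokens = text.split()
    # Always keep the original tokenisation (needed for homographs).
    edges = [Edge(i, i + 1, tok) for i, tok in enumerate(tokens)]
    # Add an alternative edge that joins a putative affix to the word before it.
    for i in range(1, len(tokens)):
        if tokens[i] in affixes:
            edges.append(Edge(i - 1, i + 1, tokens[i - 1] + tokens[i]))
    return edges


# Hypothetical affix inventory and input, for illustration only.
lattice = token_lattice("ya kama shi", {"shi"})
for edge in lattice:
    print(edge)
```

For this input the lattice holds the three original tokens plus one
extra edge spanning `kama` and `shi`, so the grammar can choose either
analysis.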

Cheers, 

Berthold

> i  
> have heard rumours about no less than two proprietary implementations  
> of the DELPH-IN formalism which appear to have done it.
> 

> best, oe
> 
> 
> On 11. mars 2011, at 01.56, Berthold Crysmann
> <crysmann at ifk.uni-bonn.de> wrote:
> 
> > Hi all,
> >
> > is it currently possible to create alternate tokenisations with REPP?
> > With Pet chart mapping this is possible, so what I am looking for is an
> > LKB solution for the following problem: I need to combine adjacent
> > tokens into one but preserve the original tokenisation as well, in case
> > I am dealing with unrelated items.
> >
> > Here's a concrete example: Hausa orthography separates off pronominal
> > affixes of verbs but not of nouns. To arrive at a more sound treatment
> > of pronominal affixes, I'd like to join putative pronominal affixes with
> > the words preceding them and let the grammar sort out the rest. But
> > unfortunately, I do also have to preserve the original tokenisation for
> > homographs...
> >
> > I vaguely remember that something along these lines was possible at some
> > point earlier, so I'd be happy about any pointers.
> >
> > BTW: what is the current status of CM in LKB?
> >
> > Cheers,
> >
> > Berthold