[developers] preprocessing functionality in pet

Bernd Kiefer kiefer at dfki.de
Thu Feb 8 14:46:55 CET 2007


Hi all,

The last message by Tim Baldwin, together with several requests from
other quarters, made me aware of a general problem I'm continuously
facing when doing implementation or maintenance work on PET.

Besides the basic constraint-based grammar processing, there are now
a number of preprocessing steps, so to speak, already included or
planned that also operate on input strings (or external information).

These are:
1.) preprocessing / tokenization / (lightweight?) NE recognition
    At the moment there is a Lisp/ecl module based on fspp which seems
    to be prohibitively slow. Since there is a C library, it could in
    principle be replaced by a C(++) version, which would only require
    reimplementing the relatively small amount of custom code. But
    that would require agreement on a) whether this formalism is what
    we really want and b) what exactly the functionality should be,
    e.g., with respect to the production of ambiguous output.

    Connected to this is the question of a) encodings (for example,
    conversion between real and isomorphic umlauts, i.e., "ä" vs.
    "ae") and b) character case distinctions, also in relation to what
    is stored in the binary grammar file. What, when, and how should
    case be normalized? (A rough sketch of such a normalization step
    follows below.)
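
    To make the design question concrete, here is a minimal C++
    sketch of what such a normalization step might look like. This is
    only an illustration, not the actual fspp code; all names are
    invented, and it assumes UTF-8 input:

        #include <iostream>
        #include <string>
        #include <utility>

        // Replace every occurrence of `from` in `s` with `to`.
        static void replace_all(std::string &s, const std::string &from,
                                const std::string &to) {
          for (std::string::size_type pos = s.find(from);
               pos != std::string::npos;
               pos = s.find(from, pos + to.size()))
            s.replace(pos, from.size(), to);
        }

        // Convert "real" (UTF-8) umlauts into their isomorphic ASCII
        // spellings; the reverse direction needs a similar table.
        std::string to_isomorphic(std::string s) {
          static const std::pair<const char *, const char *> table[] = {
            {"\xc3\xa4", "ae"}, {"\xc3\xb6", "oe"}, {"\xc3\xbc", "ue"},
            {"\xc3\x84", "Ae"}, {"\xc3\x96", "Oe"}, {"\xc3\x9c", "Ue"},
            {"\xc3\x9f", "ss"}  // sharp s
          };
          for (const auto &p : table)
            replace_all(s, p.first, p.second);
          return s;
        }

        int main() {
          std::cout << to_isomorphic("B\xc3\xa4ume") << "\n"; // Baeume
        }

    Whether something like this runs before or after lexicon access,
    and whether the binary grammar file stores the normalized or the
    original forms, is exactly the open question.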

2.) spelling/morphological rule processing
    This is roughly implemented along the lines of the LKB, although
    the more elaborate rule filter found there is still missing, and
    its absence seems to be responsible for some of the inefficiency
    with the German grammar (a sketch of the idea follows below).
    Here, the question is where and when this process is applied (see
    the mail I'm referring to and one of my last mails on the PET
    list).
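
    The missing filter amounts to a table, precomputed when the
    grammar is loaded, that records which rule's output can feed
    which other rule, so that most incompatible pairs are rejected
    with a single lookup instead of a unification attempt. A
    hypothetical sketch with invented names and a placeholder
    compatibility test, not the actual LKB or PET data structures:

        #include <cstddef>
        #include <vector>

        struct Rule { /* dag, affixation pattern, ... */ };

        class RuleFilter {
          std::vector<std::vector<bool>> feeds_;
        public:
          // Built once at grammar load time by testing all pairs.
          explicit RuleFilter(const std::vector<Rule> &rules) {
            std::size_t n = rules.size();
            feeds_.assign(n, std::vector<bool>(n, false));
            for (std::size_t i = 0; i < n; ++i)
              for (std::size_t j = 0; j < n; ++j)
                feeds_[i][j] = may_unify(rules[i], rules[j]);
          }

          // Cheap O(1) check, consulted before any real unification.
          bool feeds(std::size_t producer, std::size_t consumer) const {
            return feeds_[producer][consumer];
          }

        private:
          // Placeholder: a real implementation would test whether
          // the output description of `out` unifies into the input
          // of `in`.
          static bool may_unify(const Rule &, const Rule &) {
            return true;
          }
        };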

3.) lexical gap discovery / POS tagging / generic entry processing
    The current implementation at least needs to be described properly
    (there is not enough space for that here). This is also related
    to 2): we decided to apply the spelling rules to the surface forms
    associated with the generic entries and thus to treat the generic
    entry types like normal lexical types (see the sketch below). So
    far, there has been no resistance to that.
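
    In code, that decision amounts to something like the following.
    Again, this is purely illustrative; the tag-to-generic-type
    mapping, the type names, and the helper functions are all made up:

        #include <iostream>
        #include <map>
        #include <string>
        #include <vector>

        struct LexEntry { std::string type, stem; };

        // Toy stand-ins for the lexicon and spelling-rule machinery.
        std::vector<LexEntry> lexicon_lookup(const std::string &form) {
          if (form == "dog") return {{"n_count_le", "dog"}};
          return {};                            // lexical gap
        }
        std::vector<LexEntry> apply_spelling_rules(const LexEntry &e) {
          return {e};                           // identity, for the demo
        }

        // POS tag -> generic lexical type (grammar configuration).
        const std::map<std::string, std::string> generic_map = {
          {"NN", "generic_noun"}, {"VB", "generic_verb"}
        };

        std::vector<LexEntry> lexical_items(const std::string &form,
                                            const std::string &pos) {
          std::vector<LexEntry> entries = lexicon_lookup(form);
          if (entries.empty()) {                // gap: fall back to a
            auto it = generic_map.find(pos);    // POS-selected generic
            if (it != generic_map.end())
              entries.push_back({it->second, form});
          }
          // Generic and regular entries get the same treatment: both
          // go through the spelling rules with their surface form.
          std::vector<LexEntry> result;
          for (const auto &e : entries)
            for (const auto &a : apply_spelling_rules(e))
              result.push_back(a);
          return result;
        }

        int main() {
          for (const auto &e : lexical_items("blarghs", "NN"))
            std::cout << e.type << " / " << e.stem << "\n";
        }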

4.) lexicon access, i.e., the mapping of strings to lexical types,
    especially with respect to case normalization (or not!); one
    possible policy is sketched below.
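
    Just to have something concrete to argue about, a toy policy with
    an invented lexicon and type names: try the surface form verbatim
    first and only fall back to a downcased variant, so that a
    sentence-initial "The" still finds "the" while proper names can
    keep case-sensitive entries:

        #include <algorithm>
        #include <cctype>
        #include <iostream>
        #include <map>
        #include <string>

        // ASCII-only downcasing; real code would first have to
        // settle the encoding question from point 1.
        static std::string downcase(std::string s) {
          std::transform(s.begin(), s.end(), s.begin(),
                         [](unsigned char c) {
                           return static_cast<char>(std::tolower(c));
                         });
          return s;
        }

        int main() {
          const std::multimap<std::string, std::string> lexicon = {
            {"the", "det_le"}, {"Browne", "proper_name_le"}
          };
          for (std::string form : {"The", "Browne", "browne"}) {
            auto range = lexicon.equal_range(form);  // exact case first
            if (range.first == range.second)         // then normalized
              range = lexicon.equal_range(downcase(form));
            std::cout << form << " ->";
            for (auto it = range.first; it != range.second; ++it)
              std::cout << " " << it->second;
            std::cout << "\n";
          }
        }

    ("browne" finds nothing here, which may or may not be what we
    want; that is the kind of decision to be written down.)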

I think these are issues that we should definitely settle (and write
down our decisions), because they also have an impact on the input
format if external modules are to be used instead of the internal
functionality.

I don't want to make these decisions here, because I'd like to do
some real work/improvement on PET and I'm tired of the constantly
recurring questions/change requests concerning all these
functionalities. Besides, we might be able to come up with a new
input format that would really fulfill the needs of (at least almost)
all PET users.

With the hope of a lively discussion,

best,
        Bernd

-- 
In a world without walls and fences, who needs Windows or Gates?

**********************************************************************
Bernd Kiefer                                            Am Blauberg 16
kiefer at dfki.de                                      66119 Saarbruecken
+49-681/302-5301 (office)                      +49-681/3904507  (home)
**********************************************************************



