[developers] preprocessing functionality in pet
Bernd Kiefer
kiefer at dfki.de
Thu Feb 8 14:46:55 CET 2007
Hi all,
the last message by Tim Baldwin, together with various requests from
other sides, made me aware of a general problem I'm continually facing
when doing implementation or maintenance work on PET.
Besides the basic constraint-based grammar processing, there are now a
number of, so to speak, preprocessing steps already included or planned
that also operate on input strings (or external information).
These are:
1.) preprocessing / tokenization / (lightweight?) NE recognition
    At the moment there is a Lisp/ecl module based on fspp, which seems
    to be prohibitively slow. It could in principle be replaced by a
    C(++) version, since there is a C library; that would only require
    reimplementing the relatively small amount of custom code. But it
    would also require agreement on a) whether this formalism is what
    we really want and b) what exactly the functionality is, e.g., with
    respect to the production of ambiguous output.
    Connected to this are the questions of a) encodings (for example,
    conversion between real umlauts and their isomorphic two-character
    equivalents) and b) character-case distinctions, also in relation
    to what is stored in the binary grammar file. What, when, and how
    should case be normalized? (One possible normalization step is
    sketched after this list.)
2.) spelling / morphological rule processing
    This is roughly implemented along the lines of the LKB, although
    the more elaborate rule filter there is still missing and seems to
    be responsible for some of the inefficiency with the German grammar
    (a sketch of such a filter follows the list). Here the question is
    what this process is applied to, and when (see the mail I'm
    referring to and one of my recent mails on the pet list).
3.) lexical gap discovery / POS tagging / generic entry processing
    The current implementation at least needs to be described properly
    (there is not enough space for that here). This is also related
    to 2): we decided to apply the spelling rules to the surface forms
    associated with the generic entries and thus to treat the generic
    entry types like normal lexical types. So far there has been no
    resistance to that.
4.) lexicon access, i.e., the mapping of strings to lexical types,
    especially with respect to case normalization (or not!). (A
    combined lookup sketch for 3.) and 4.) also follows the list.)
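
Since the case/encoding questions in 1.) keep coming up, here is a
minimal sketch of what one normalization step could look like, assuming
Latin-1 input. The names (expand_umlauts, normalize_case) are invented
for illustration and are not part of PET:

    #include <cctype>
    #include <string>

    // Map real Latin-1 umlauts to their two-character equivalents;
    // one direction of the conversion mentioned under 1.)
    std::string expand_umlauts(const std::string &in) {
        std::string out;
        for (unsigned char c : in) {
            switch (c) {
                case 0xE4: out += "ae"; break; // ä
                case 0xF6: out += "oe"; break; // ö
                case 0xFC: out += "ue"; break; // ü
                case 0xC4: out += "Ae"; break; // Ä
                case 0xD6: out += "Oe"; break; // Ö
                case 0xDC: out += "Ue"; break; // Ü
                case 0xDF: out += "ss"; break; // ß
                default:   out += static_cast<char>(c);
            }
        }
        return out;
    }

    // Downcase ASCII letters; whether, when, and on which side (input,
    // binary grammar file, or both) this should happen is exactly the
    // open question.
    std::string normalize_case(const std::string &in) {
        std::string out(in);
        for (char &c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    }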
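
For 2.), the more elaborate rule filter of the LKB essentially
precomputes, for every pair of spelling/morphological rules, whether
the output of one can serve as input to the other, so that impossible
rule chains are pruned with a table lookup instead of a unification
attempt. A rough sketch of such a structure (names invented, not PET
internals):

    #include <cstddef>
    #include <vector>

    class RuleFilter {
    public:
        explicit RuleFilter(std::size_t n_rules)
            : n_(n_rules), feeds_(n_rules * n_rules, false) {}

        // Offline precomputation: mark (feeder, consumer) as compatible
        // if the consumer's input description unifies with the feeder's
        // output description.
        void set_feeds(std::size_t feeder, std::size_t consumer) {
            feeds_[feeder * n_ + consumer] = true;
        }

        // Runtime check: constant-time lookup, no unification.
        bool may_feed(std::size_t feeder, std::size_t consumer) const {
            return feeds_[feeder * n_ + consumer];
        }

    private:
        std::size_t n_;
        std::vector<bool> feeds_;
    };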
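
And for 3.) and 4.) together, one conceivable lookup order: try the
surface form first, a case-normalized variant second, and fall back to
generic entries selected by POS tag last. Again, all names and the
interface are invented for illustration; this is not the actual PET
lexicon code:

    #include <cctype>
    #include <map>
    #include <string>
    #include <vector>

    using LexTypes = std::vector<std::string>;

    static std::string lower_ascii(const std::string &s) {
        std::string out(s);
        for (char &c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    }

    struct Lexicon {
        std::map<std::string, LexTypes> entries;        // surface form -> lexical types
        std::map<std::string, LexTypes> generic_by_pos; // POS tag -> generic entry types

        LexTypes lookup(const std::string &form, const std::string &pos) const {
            // 1. exact match on the surface form
            auto it = entries.find(form);
            if (it != entries.end()) return it->second;

            // 2. case-normalized match; whether this is wanted at all
            //    is the open question in 4.)
            it = entries.find(lower_ascii(form));
            if (it != entries.end()) return it->second;

            // 3. lexical gap: generic entries for the POS tag; spelling
            //    rules then apply to the surface form just as for normal
            //    lexical entries (the decision mentioned in 3.)
            auto git = generic_by_pos.find(pos);
            if (git != generic_by_pos.end()) return git->second;
            return LexTypes();
        }
    };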
I think these are issues we should definitely settle (and write down
our decisions), because they also have an impact on the input format
if external modules are to be used instead of the internal
functionality. I don't want to make these decisions here, because I'd
like to do some real work/improvement on PET, and I'm tired of the
constantly recurring questions/change requests concerning all these
functionalities. Besides, we might even come up with a new input
format that would really fulfill the needs of (at least almost) all
PET users.
In the hope of a lively discussion,
best,
Bernd
--
In a world without walls and fences, who needs Windows or Gates?
**********************************************************************
Bernd Kiefer                       Am Blauberg 16
kiefer at dfki.de                   66119 Saarbruecken
+49-681/302-5301 (office)          +49-681/3904507 (home)
**********************************************************************