[developers] preprocessing functionality in pet

Thu Feb 8 17:57:30 CET 2007

Maybe instead of spending the time having a lively discussion, we
should all start off by just sitting down and writing documentation?
Our discussions tend to end with the decision that we need
documentation and then it doesn't happen.  I'm somewhat guilty of this
with the morphology stuff, I admit, although I have emailed a fairly
detailed account.  I'd welcome specific questions if people don't
understand it.  I didn't think there was an urgent need to document
the details of how the morphological rule filter is applied, though,
because the behaviour is declarative and gives the same results as if
the filter were not there (except much faster).  There is an unusually
large number of comments in that part of the LKB code, btw.
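
To illustrate what "declarative, same results, only faster" means here,
the following is a minimal sketch in Python, not the LKB code.  The
rules, categories, lexicon and all names (Rule, FEEDS, analyses) are
invented for the example.  It assumes that a rule chain is valid only if
each rule's output category matches the input category of the rule
applied on top of it; the filter is just that same constraint
precomputed for pairs of rules and checked early.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    name: str
    suffix: str  # suffix the rule adds to its input form
    takes: str   # category the rule applies to (hypothetical)
    gives: str   # category the rule produces (hypothetical)

# Toy rule set and lexicon, invented for this sketch.
RULES = [
    Rule("plur_noun", "s", "noun_stem", "noun"),
    Rule("3sg_verb", "s", "verb_stem", "verb"),
    Rule("er_nominal", "er", "verb_stem", "noun_stem"),
]
LEXICON = {"walk": "verb_stem", "dog": "noun_stem"}

# The filter: (inner, outer) pairs where inner's output can feed outer.
FEEDS = {(a.name, b.name) for a in RULES for b in RULES if a.gives == b.takes}

def valid(stem, chain):
    """Full check. chain lists the outermost rule first."""
    cat = LEXICON.get(stem)
    if cat is None:
        return False
    for rule in reversed(chain):  # innermost rule applies to the stem first
        if rule.takes != cat:
            return False
        cat = rule.gives
    return True

def analyses(form, use_filter):
    """Strip suffixes recursively; return all (stem, rule chain) analyses."""
    results = set()

    def strip(rest, chain):
        if valid(rest, chain):
            results.add((rest, tuple(r.name for r in chain)))
        for rule in RULES:
            if not rest.endswith(rule.suffix) or len(rest) <= len(rule.suffix):
                continue
            # With the filter on, drop hypotheses whose newly stripped rule
            # could never feed the rule stripped just before it.
            if use_filter and chain and (rule.name, chain[-1].name) not in FEEDS:
                continue
            strip(rest[: -len(rule.suffix)], chain + [rule])

    strip(form, [])
    return results

# Same analyses with and without the filter.
assert analyses("walkers", True) == analyses("walkers", False)
```

Because the filter only discards chains that the full check would reject
anyway, switching it off changes how many hypotheses are explored but
not the set of analyses returned.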

My belief about case is that, in the long term, the systems should not
be normalising case, except as defined by a grammar-specific
preprocessor.  Wasn't this a conclusion from Jerez?  I still intend to
take all the case conversion out of the LKB.
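
As a rough sketch of that division of labour (in Python, purely
illustrative; it is neither LKB nor PET code): the core lookup below
does an exact string match with no case conversion, and any case
handling lives in a preprocessor supplied by the grammar.  The lexicon
entries, the type names and the sentence-initial policy in
preprocess_en are assumptions made up for the example.

```python
from typing import Callable, Dict, List

# Toy lexicon with hypothetical lexical type names; keys are case-sensitive.
LEXICON: Dict[str, List[str]] = {
    "bill": ["n_bill_le"],
    "Bill": ["n_proper_bill_le"],
}

def lookup(form: str) -> List[str]:
    """Core lexicon access: exact match, no case normalisation."""
    return LEXICON.get(form, [])

def preprocess_en(tokens: List[str]) -> List[List[str]]:
    """Hypothetical grammar-specific policy: offer a lower-cased
    alternative for the sentence-initial token only."""
    out = []
    for i, tok in enumerate(tokens):
        alts = [tok]
        if i == 0 and tok.lower() != tok:
            alts.append(tok.lower())
        out.append(alts)
    return out

def lexical_types(
    tokens: List[str],
    preprocess: Callable[[List[str]], List[List[str]]],
) -> List[List[str]]:
    """Look up every alternative the grammar's preprocessor proposes."""
    return [[t for form in alts for t in lookup(form)]
            for alts in preprocess(tokens)]

print(lexical_types(["Bill", "paid", "Bill"], preprocess_en))
```

A grammar that wants no case handling at all would simply pass a
preprocessor that returns each token unchanged.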

I would like to know where/why the ECL preprocessor is so slow - I
hadn't heard this.  Is it because it's writing out a full PET input
chart or something?  I would be surprised if we couldn't make the
speed acceptable in Lisp unless ECL itself is very inefficient, but
then the MRS stuff runs reasonably, doesn't it?

Ann

> the last message by Tim Baldwin, together with different requests from
> other sides, made me aware of a general problem i'm continuously facing
> when doing implementation or maintenance for pet.
> 
> Besides the basic constraint-based grammar processing, there are now a
> number of, so to say, preprocessing steps already included or planned
> that also operate on input strings (or external information).
> 
> Those are:
> 1.) preprocessing / tokenization / (lightweight?) NE-recognition 
>     at the moment there is a Lisp/ecl module based on fspp which seems
>     to be prohibitively slow and could in principle be replaced by a
>     C(++) version: there is a C library, so only the relatively small
>     amount of custom code would have to be reimplemented. But that
>     would require an agreement on a) whether this formalism is what
>     we really want and b) what exactly the functionality is, e.g.,
>     with respect to the production of ambiguous output.
> 
>     Connected is the question of a) encodings (for example conversion
>     between real and isomorphic umlauts) and b) character case
>     distinction, also in relation to what is in the binary grammar
>     file. What, when and how should case be normalized?
> 
> 2.) spelling/morphological rule processing. 
>     this is roughly implemented along the lines of the LKB, although
>     the more elaborate rule filter there is still missing, which seems
>     to be responsible for some of the inefficiency with the German
>     grammar. Here, the question is when and to what this process is
>     applied (see the mail i'm referring to and one of my last mails
>     on the pet list).
> 
> 3.) lexical gap discovery / POS tagging / generic entry processing
>     the current implementation at least needs to be described properly
>     (there is not enough space for that here). This is also related 
>     to 2). We decided to apply the spelling rules to the surface forms
>     associated with the generic entries and thus treat the generic
>     entry types like normal lexical types. Up to now, there was no
>     resistance against that.
> 
> 4.) lexicon access, i.e., the mapping of strings to lexical types,
>     especially with respect to case normalization (or not!)
> 
> I think these are issues that we definitely should settle (and write
> down our decisions), because they also have an impact on the input
> format if external modules are to be used instead of internal
> functionality.
> 
> I don't want to make these decisions here because i'd like to do some
> real work/improvement on pet and i'm tired of the constantly recurring
> questions/change requests concerning all this functionality. And,
> besides, we could maybe come up with a new input format that would
> really fulfill the needs of (at least almost) all pet users.
> 
> With the hope of a lively discussion,
> 
> best,
>         Bernd
> 
> -- 
> In a world without walls and fences, who needs Windows or Gates?
> 
> **********************************************************************
> Bernd Kiefer                                            Am Blauberg 16
> kiefer at dfki.de                                      66119 Saarbruecken
> +49-681/302-5301 (office)                      +49-681/3904507  (home)
> **********************************************************************
>