[developers] [erg] New ERG with improved tokenization/preprocessing for PET

Wed May 20 23:02:55 CEST 2009

It's a pity to have an approach that allows for predicates which are
not part of the lexicon and then not to use it.

The mapping rules from tags to letters are there for RASP-RMRS.  I
don't actually see where the extra cost would come from, but it would
be OK to underspecify the POS letters if necessary.

The existing LKB-style analysis rules can be used without a base
lexicon - one doesn't need morpha.  (I've run them in the past like
this - I wouldn't recommend it for an arbitrary language, but for
English it should still work OK.)  This would be better because it
would avoid the discrepancies between base forms that I sometimes get
with ERG-RMRS and RASP-RMRS.  To experiment, try calling
one-step-morph-analysis on an upper-case string.  Multiple results will
only matter when the candidate stems differ, which will mostly be
things like `BANNED' -> `BANN'/`BAN', I imagine.  It would be
interesting to see how many of these we're getting with the current
corpora.
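
To make the `BANNED' case concrete, here is a rough Python sketch of
lexicon-free suffix stripping.  It is not the LKB rule set and not
one-step-morph-analysis itself: the function names and the two rules
(strip -ED, then undo a doubled final consonant) are assumptions made up
for illustration, and other patterns, such as restoring a final E, are
deliberately left out.

```python
# Toy illustration only: hypothetical rules, not the LKB/ERG analysis rules.
VOWELS = set("AEIOU")


def candidate_stems(form):
    """Return candidate stems for an upper-case form, using no lexicon."""
    stems = []
    if form.endswith("ED") and len(form) > 3:
        base = form[:-2]                      # BANNED -> BANN
        stems.append(base)
        # Undo possible consonant doubling: BANN -> BAN
        if len(base) >= 3 and base[-1] == base[-2] and base[-1] not in VOWELS:
            stems.append(base[:-1])
    if not stems:
        stems.append(form)                    # no rule applied
    return stems


def ambiguous_forms(tokens):
    """Return the distinct forms that yield more than one candidate stem."""
    return sorted({t for t in tokens if len(candidate_stems(t)) > 1})


print(candidate_stems("BANNED"))   # -> ['BANN', 'BAN']
print(candidate_stems("WALKED"))   # -> ['WALK']
print(ambiguous_forms(["BANNED", "WALKED", "CALLED", "BANNED"]))
# -> ['BANNED', 'CALLED']
```

Running something like ambiguous_forms over the tokens of a corpus would
give a first count of how often the stems actually differ, though the
real figure depends on the actual rule set.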

Ann