[developers] [erg] New ERG with improved tokenization/preprocessing for PET

Stephan Oepen oe at ifi.uio.no
Thu May 21 08:08:56 CEST 2009


> The existing LKB-style analysis rules can be used without a base
> lexicon - one doesn't need morpha.  (I've run them in the past like
> this - I wouldn't recommend it for an arbitrary language, but for
> English it should still work OK.)  This would be better because it
> would avoid the discrepancies between base forms that I sometimes get
> with ERG-RMRS and RASP-RMRS.  To experiment, try calling
> one-step-morph-analysis on an upper-case string.  Multiple results are
> only going to matter in the case where the stem is different, which
> will mostly be things like `BANNED' -> `BANN'/`BAN', I imagine.

did you try this yourself?  you underestimate the complexity of this
issue: for `banned', the ERG orthographemic rules hypothesize three
candidate stems: `ban', `bann', and `banne'.  without a lexicon, i
believe all three are justified.
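
to make the ambiguity concrete, here is a minimal sketch in python.  it
is not the ERG rule set and not LKB code: the function name
candidate_stems and the three simplified `-ed' patterns are assumptions
made up for illustration.  it only shows why lexicon-free suffix
stripping cannot choose among the candidates.

```python
VOWELS = set("aeiou")


def candidate_stems(form):
    """Return every stem that some simplified `-ed' rule could justify.

    Hypothetical illustration only; these three patterns are a rough
    stand-in for real orthographemic rules, not a copy of them.
    """
    form = form.lower()
    stems = set()
    if form.endswith("ed") and len(form) > 3:
        base = form[:-2]
        # plain suffixation, as in walk -> walked
        stems.add(base)
        # stem already ends in `e', as in bake -> baked
        stems.add(form[:-1])
        # consonant doubling, as in ban -> banned
        if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in VOWELS:
            stems.add(base[:-1])
    return sorted(stems)


if __name__ == "__main__":
    print(candidate_stems("BANNED"))
```

each pattern is licensed by some real english verb, so nothing short of
a lexicon lookup tells us which of the hypothesized stems exists.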

we do not want this (silly) ambiguity in parsing, nor would i be keen
on putting an approximate (procedural) solution (restricted to english)
into the parser (PET, in our case).  thus i still think post-processing
is the best we can do (as long as parser inputs are not lemmatized).

                                                        cheers  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


