[developers] [erg] New ERG with improved tokenization/preprocessing for PET
oe at ifi.uio.no
Thu May 21 08:08:56 CEST 2009
> The existing LKB-style analysis rules can be used without a base
> lexicon - one doesn't need morpha. (I've run them in the past like
> this - I wouldn't recommend it for an arbitrary language, but for
> English it should still work OK.) This would be better because it
> would avoid the discrepancies between base forms that I sometimes get
> with ERG-RMRS and RASP-RMRS. To experiment, try calling
> one-step-morph-analysis on an upper-case string. Multiple results are
> only going to matter in the case where the stem is different, which
> will mostly be things like `BANNED' -> `BANN'/`BAN', I imagine.
did you try it yourself? i think you underestimate the complexity of
this issue: for `BANNED', the ERG orthographemic rules hypothesize three
candidate stems: `ban', `bann', and `banne'. without a lexicon to filter
them, i believe all three are justified. we do not want this (silly)
ambiguity in parsing, nor would i be keen on putting an approximate
(procedural) solution (restricted to english) into the parser (PET, in
our case). thus i still think post-processing is the best we can do
(as long as parser inputs are not lemmatized).
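
to make the ambiguity concrete, here is a minimal python sketch of
lexicon-free suffix stripping for `-ed' forms. it is not the ERG rule
set and not LKB or PET code: the three patterns (plain `ed', e-final
stem, doubled final consonant) are a simplified assumption, and the
names candidate_stems and filter_with_lexicon are hypothetical.

```python
VOWELS = set("aeiou")


def candidate_stems(form):
    """Return every stem that a simplified '-ed' rule set could justify.

    Assumed patterns (an approximation, not the actual ERG rules):
      1. stem + 'ed'                 (bann  -> banned)
      2. stem ending in 'e' + 'd'    (banne -> banned)
      3. vowel + consonant stem with the final consonant doubled
         before 'ed'                 (ban   -> banned)
    """
    form = form.lower()
    stems = []
    if form.endswith("ed") and len(form) > 2:
        base = form[:-2]
        stems.append(base)         # pattern 1
        stems.append(base + "e")   # pattern 2
        # pattern 3: undo the doubling if base ends in V C C with equal Cs
        if (len(base) >= 3
                and base[-1] == base[-2]
                and base[-1] not in VOWELS
                and base[-3] in VOWELS):
            stems.append(base[:-1])
    return stems


def filter_with_lexicon(stems, lexicon):
    """Keep only the hypothesized stems that the lexicon actually lists."""
    return [stem for stem in stems if stem in lexicon]


if __name__ == "__main__":
    stems = candidate_stems("BANNED")
    print(stems)
    print(filter_with_lexicon(stems, {"ban", "walk"}))
```

without the lexicon filter nothing in the patterns themselves prefers
one candidate over another, so a lexicon-free analyzer has to pass all
of them on to the parser.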
cheers - oe
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---