[developers] [erg] New ERG with improved tokenization/preprocessing for PET

Stephan Oepen oe at ifi.uio.no
Wed May 20 22:24:57 CEST 2009


thanks for the quick comments, ann,

> just briefly - it would be better from my point of view if the unknown
> word predicates could conform to the agreed conventions when they are
> constructed - i.e., _bazed_v_vbd_rel or better _baze_v_vbd_rel with
> the surface form recorded (and ending up in the appropriate slot in
> the MRS XML).  This would be in line with what I've been doing for
> RASP RMRS (actually I would use _baze_v_rel there).

well, in our setup input to parsing is PoS-tagged but /not/ lemmatized.
so `bazed' and `VBD' really are the only pieces of information that we
have available when synthesizing the initial PRED values for generics.
even classifying the various tags into _v_, _n_, and _a_ would require
multiplying out token mapping rules, i.e. it would come at a small
extra cost.

hence our thinking was to preserve the available information but make
it clear that these are not yet regular predicates.  i now think they
should maybe look /more/ abnormal, actually: "_bazed__vbd_rel", say?
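
to make the two naming schemes concrete, here is a minimal python sketch.
it is only an illustration: the function names and the prefix-based tag
classification are assumptions made for this example, and none of it is
the actual token mapping machinery in the ERG or PET.

```python
# Illustrative only: hypothetical helpers, not the ERG token mapping rules.

def generic_pred(form, tag):
    """Build a deliberately 'abnormal' PRED from surface form and PoS tag.

    Only the two pieces of information available before parsing are used;
    the double underscore marks the result as not yet a regular predicate.
    """
    return "_{}__{}_rel".format(form, tag.lower())


def coarse_pos(tag):
    """Map a PTB tag to _v_, _n_, or _a_ (assumed prefix-based grouping)."""
    if tag.startswith("VB"):
        return "v"
    if tag.startswith("NN"):
        return "n"
    if tag.startswith("JJ"):
        return "a"
    return None  # tag outside the three classes sketched here


def classified_pred(form, tag):
    """Variant that also records the coarse class, as in _bazed_v_vbd_rel."""
    pos = coarse_pos(tag)
    if pos is None:
        return generic_pred(form, tag)
    return "_{}_{}_{}_rel".format(form, pos, tag.lower())


if __name__ == "__main__":
    print(generic_pred("bazed", "VBD"))     # _bazed__vbd_rel
    print(classified_pred("bazed", "VBD"))  # _bazed_v_vbd_rel
```

in a rule-based token mapping setup the classification step would have
to be spelled out as additional rules per tag group, which is the small
extra cost mentioned above.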

> If the problem is that the tagger isn't explicitly providing
> lemmatizations, then I guess we can do it via post-processing, though
> we could just as well do it when the predicates are constructed.

i think an actual solution would require a tool like morpha (which is
part of pre-processing in RASP, i believe), adapted for PTB tags and
american english.  one could argue this /should/ be part of our input
pre-processing prior to parsing, but that is not an option right now.

and in principle there can be lemmatization ambiguity, which (for the
cases discussed here) has no bearing on parsing; thus it is desirable
to defer such ambiguity until as late as possible (late commitment),
much like i would expect to do (the bulk of) WSD /after/ parsing.
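
as a rough illustration of late commitment, the sketch below expands a
generic predicate into candidate lemmatized predicates in a step after
parsing.  the suffix stripping is a naive stand-in written for this
example, not morpha, and the names `candidate_lemmas' and
`expand_generic_pred' as well as the "_bazed__vbd_rel" input format are
assumptions.

```python
import re

# Assumed shape of a generic predicate: _<form>__<tag>_rel
GENERIC_PRED = re.compile(r"^_(?P<form>.+)__(?P<tag>[a-z$]+)_rel$")


def candidate_lemmas(form, tag):
    """Return all lemmas a naive analysis allows (toy rules, past tense only)."""
    candidates = set()
    if tag == "vbd" and form.endswith("ed") and len(form) > 3:
        candidates.add(form[:-2])  # baz + ed
        candidates.add(form[:-1])  # baze + d
    # Fall back to the surface form when no rule applies.
    return candidates or {form}


def expand_generic_pred(pred):
    """Map a generic predicate to a sorted list of candidate lemmatized ones."""
    match = GENERIC_PRED.match(pred)
    if match is None:
        return [pred]  # already a regular predicate; leave untouched
    form, tag = match.group("form"), match.group("tag")
    if tag != "vbd":
        return [pred]  # only the verbal past-tense case is sketched here
    return sorted("_{}_v_rel".format(lemma)
                  for lemma in candidate_lemmas(form, tag))


if __name__ == "__main__":
    # Both readings survive parsing; a later component picks one.
    print(expand_generic_pred("_bazed__vbd_rel"))
```

the parser sees a single generic predicate throughout, and the choice
between the candidate lemmas is only made (or left open) afterwards.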

                                                        best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


