[erg] [developers] New ERG with improved tokenization/preprocessing for PET

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Fri May 22 02:28:54 CEST 2009


I wrote the ERG orthographemic rules, in fact, so I don't think I'm
underestimating the complexity, but I obviously didn't explain at
sufficient length in my previous message.  I've now talked to Dan and
have a better idea of what the issues are with the current
implementation -
although of course I could be mis-stating something - apologies in
advance if so.

A.  Format of the relation name.  I think there are very good reasons
to keep to the _pred_mrstag_anything format.  Dan points out that
existing transfer rules presuppose this.  The MRS-XML will not conform
to the DTD if we can't split into the components.  Dan and I would
also like to keep open the option of moving to a version of the
grammar where we actually split up those components in the grammar (I
can explain why in a separate message if necessary), so even if some
of the work is done at a post-processing stage, I think the predicate
names used in the grammar should conform structurally.
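
Just to make the structural constraint concrete, here is the kind of
check I have in mind - a rough Python sketch only, with a made-up tag
inventory and made-up pred names, not the actual ERG conventions:

    import re

    # Illustrative only: verify that a pred name splits into the
    # expected _lemma_tag(_anything) components.
    KNOWN_TAGS = {"n", "v", "a", "p", "q", "u"}   # not the real set

    PRED_RE = re.compile(r"^_([^_]+)_([^_]+)(?:_(.+))?$")

    def split_pred(pred):
        """Return (lemma, tag, rest) or None if ill-formed."""
        m = PRED_RE.match(pred)
        if m is None or m.group(2) not in KNOWN_TAGS:
            return None
        return m.group(1), m.group(2), m.group(3)

    split_pred("_dog_n_unknown")   # ('dog', 'n', 'unknown')
    split_pred("dog+NN")           # None - wouldn't survive the split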

B. The mapping from POS-TAG -> MRS tag.  I assumed this was
deterministic (it is for RASP-RMRS) and very simple from the unknown
word perspective, since we're only dealing with open-class cases.  I
understand from what Dan says that there was a problem with the
mapping of adjectives and adverbs in the regex machinery.  If that's
driving the
proposed approaches, it would be useful to know more about what the
issue actually is.  I think we need to keep the set of tags to a small
class, for much the same reasons as in A, but we could add some tags
to allow for further underspecification if needed.
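
For what it's worth, the sort of deterministic mapping I have in mind
is no more than a table lookup - this is a sketch with PTB tags and an
illustrative MRS tag set, and it deliberately collapses adverbs with
adjectives, which may be exactly the point at issue:

    # Illustrative PTB-tag -> MRS-tag table; the real inventory and
    # the treatment of adverbs would need to be agreed on.
    POS_TO_MRSTAG = {
        "NN": "n", "NNS": "n", "NNP": "n", "NNPS": "n",
        "VB": "v", "VBD": "v", "VBG": "v", "VBN": "v",
        "VBP": "v", "VBZ": "v",
        "JJ": "a", "JJR": "a", "JJS": "a",
        "RB": "a", "RBR": "a", "RBS": "a",
    }

    def mrs_tag(pos_tag):
        # "u" as a further-underspecified fallback tag (an assumption)
        return POS_TO_MRSTAG.get(pos_tag, "u")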

C. Lemmatization.  The best option, I think, is to run the LKB-style
rules (since they are already part of the grammar), check results
against a word list to filter candidates and also filter according to
the POS tag, and use heuristics to select a single result in the case
where no entries are in the word list.  (I realise you need to use
PET, and I don't know what the PET implementation looks like, but this
part of the LKB morphology code is so trivial that I would assume it's
easy to use the rules in this way in PET too.)  It's possible to get
reasonable lemmatization results without a full word list, in fact,
just using heuristics and a partial word list.  That's what I've done
previously, but that was before large word lists were easily
available.
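
Schematically, the filtering I'm describing is something like the
following - the candidate-stem function here is only a stand-in for
the orthographemic rules, and the final heuristic is purely
illustrative:

    def candidate_stems(form):
        """Stand-in for the rules: 'banned' -> 'banne', 'bann', 'ban'."""
        stems = []
        if form.endswith("d"):
            stems.append(form[:-1])
        if form.endswith("ed"):
            stems.append(form[:-2])
            if len(form) > 4 and form[-3] == form[-4]:
                stems.append(form[:-3])   # undo consonant doubling
        return stems or [form]

    def lemmatize(form, pos, word_list):
        cands = candidate_stems(form.lower())
        # prefer candidates attested in the word list for this POS
        attested = [c for c in cands if (c, pos) in word_list]
        if attested:
            return attested[0]
        return min(cands, key=len)        # heuristic fallback

    lemmatize("BANNED", "v", {("ban", "v")})   # -> 'ban'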

What I meant about testing was to try the rules on the actual unknown
words you're getting in the test sets.  I think it will work better
for unknown words than for all open-class words of English, because
there will be fewer verbs.

If necessary, we could do this after parsing, but I think it should
happen before the MRSs are output.  In any case, the actual string
should be available in the SURFACE attribute in the XML.  Using the
pred name as the place to store the unanalysed form is sub-optimal,
because of characters like underscore, and because case can be
important.  Repair to replace the postulated lemma could be necessary
for unknown words, but note that this may also be required even when
the ERG has recognised a word in its lexicon.  For instance, if
"route" was in as a verb and "rout" was not, it would be possible to
repair a misanalysis of "routed".  (I'm sorry I can't think of an
actual example offhand - usually in the cases of mismatches between
RASP-RMRS and ERG-RMRS the RASP-RMRS is wrong, but in some cases it's
the ERG.)
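
If repair does turn out to be needed, it should only have to touch the
lemma component of the pred and leave SURFACE untouched - a rough
sketch, reusing split_pred from the sketch under A:

    def repair_pred(pred, new_lemma):
        """Swap the lemma component of a _lemma_tag... pred name."""
        parts = split_pred(pred)
        if parts is None:
            return pred
        _, tag, rest = parts
        return "_" + new_lemma + "_" + tag + ("_" + rest if rest else "")

    repair_pred("_route_v_unknown", "rout")   # -> '_rout_v_unknown'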

In any case, if we use a combination of morph-rules and a word list,
then it will be reasonably easy to fix cases which go wrong for a
particular domain, even without adding new full lexical entries. 
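
One cheap way of doing that would be a per-domain exception list
consulted before the general rules - again just a sketch, and the
entry shown is hypothetical:

    DOMAIN_EXCEPTIONS = {("routed", "v"): "rout"}

    def lemmatize_with_exceptions(form, pos, word_list):
        override = DOMAIN_EXCEPTIONS.get((form.lower(), pos))
        return override or lemmatize(form, pos, word_list)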

Getting this stuff right is important.  The lack of proper unknown
word handling has been a very big barrier to use of the ERG, and since
you've done so much work to get it right, it would be a shame if this
part weren't thought through properly - this is what people will
actually see.

Ann



> > The existing LKB-style analysis rules can be used without a base
> > lexicon - one doesn't need morpha.  (I've run them in the past like
> > this - I wouldn't recommend it for an arbitrary language, but for
> > English it should still work OK.)  This would be better because it
> > would avoid the discrepancies between base forms that I sometimes get
> > with ERG-RMRS and RASP-RMRS.  To experiment, try calling
> > one-step-morph-analysis on an upper-case string.  Multiple results are
> > only going to matter in the case where the stem is different, which
> > will mostly be things like `BANNED' -> `BANN'/`BAN', I imagine.
> 
> did you try yourself?  you underestimate the complexity of this issue:
> the ERG orthographemic rules hypothesize three candidate stems: `ban',
> `bann', and `banne'.  without a lexicon, i believe all are justified.
> 
> we do not want this (silly) ambiguity in parsing, nor would i be keen
> on putting an approximate (procedural) solution (restricted to english)
> into the parser (PET, in our case).  thus i still think post-processing
> is the best we can do (as long as parser inputs are not lemmatized).
> 
>                                                         cheers  -  oe
> 
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



