[developers] New ERG with improved tokenization/preprocessing for PET

Francis Bond fcbond at gmail.com
Fri May 22 16:50:10 CEST 2009


jumping in late, sorry if there is some overlap.


> in my view, it is an error to expect the (current) parsing outputs to
> always be valid generator inputs.  `named_unk_rel' should not be used
> in inputs to generation (a provider of an input semantics should need
> no knowledge of which names exist in the ERG lexicon).

In my view the goal should be that the parser should produce valid
semantics for as wide a range of input as possible, and the generator
should generate from as wide a range of semantics as possible.  Of
course, I realize that it isn't always possible :-).

> with the revised treatment of NEs and unknown words in parsing, there
> are many more inputs that parse, but the semantics assigned to unknown
> words is often `incomplete' (or `internal', or `not quite right'), and
> hence i think one would have to add an MRS post-processing step before
> trying to feed these MRSs back into the generator.

The semantics assigned to unknown words is potentially input for MT,
Sciborg, QA, and ontology extraction, as well as for generation.

> names are probably not the most interesting example, as one might ask
> why `Frodo' should end up with a different predicate than `Abrams' in
> parsing.  at present, the `named_unk_rel' is an attempt at marking the
> fact that `Frodo' was parsed as an unknown name in the semantics.  for
> all i see, the underlying generic lexical entry could just as well use
> `named_rel' instead.

I do so ask:  does anyone have any special motivation for using a
different predicate in the semantics here?

> more interesting, however, are unknown nouns, verbs, etc.  looking at
> the example towards the bottom of
>  http://wiki.delph-in.net/moin/PetInput
> the current goal is for an unknown verb like `bazed' (as recognized by
> the parser by virtue of its PoS tag: VBD) to introduce a predicate like
> "_bazed_vbd_rel", i.e. just the concatenation of the token surface form
> and the PoS tag.  the reasoning is that the grammar has no knowledge of
> the actual stem (which could be `baz' or `baze' in this example; with a
> doubled consonant, there would be three alternatives), and therefore we
> `short-circuit' morphology: most PoS-based generic lexical entries are
> already inflected, i.e. words rather than lexemes.  this is really all
> the parser can do at this point (without introducing silly ambiguity).
> using the predicate "_bazed_vbd_rel" preserves all information provided
> by the tagger and grammar for downstream processing.  my expectation is
> that this `incomplete' MRS should be post-processed after parsing, such
> that the predicate can be rewritten to "_baze_v_?_rel", or whatever one
> deems appropriate in terms of the external interface.
> for the paraphrasing setup, such rewriting can be part of the transfer
> grammar that is invoked after parsing and prior to generation.  i will
> aim to at least provide a first shot at predicate normalization in the
> forthcoming 0907 ERG release.

Adding an extra transfer grammar is feasible for paraphrasing, but may
be a considerable burden for other tasks.  It also has the potential
to separate part of the unknown word handling from the grammar
itself, which would then make it prone to boundary friction as one
component changes but the other does not.
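To make the kind of post-parsing rewrite under discussion concrete,
here is a rough sketch in Python.  The tag table and the decision to
keep the surface form (rather than attempt lemmatization) are my own
illustration, not anything the ERG or PET actually does:

```python
import re

# Hypothetical mapping from (lowercased) Penn Treebank tags to
# RMRS-style coarse classes; only an illustrative subset.
PTB_CLASS = {
    "vb": "v", "vbd": "v", "vbg": "v", "vbn": "v", "vbp": "v", "vbz": "v",
    "nn": "n", "nns": "n", "nnp": "n",
    "jj": "a", "jjr": "a", "jjs": "a",
}

def normalize_pred(pred):
    """Rewrite e.g. '_bazed_vbd_rel' to '_bazed_v_?_rel'.

    Lemmatization (bazed -> baze) is deliberately left out: as the
    thread notes, the stem cannot be recovered reliably without a
    lexicon or a tool like morpha.
    """
    m = re.match(r"^_([^_]+)_([a-z]+)_rel$", pred)
    if not m:
        return pred  # not a surface+tag predicate; leave untouched
    form, tag = m.groups()
    cls = PTB_CLASS.get(tag)
    if cls is None:
        return pred  # unrecognized tag; leave untouched
    return "_%s_%s_?_rel" % (form, cls)

print(normalize_pred("_bazed_vbd_rel"))  # -> _bazed_v_?_rel
```

A real implementation would live in the transfer grammar (or whatever
post-processing step is agreed on), but the shape of the mapping is
the same.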

>> Did unknown word generation not make into the mainstream?  If so, is
>> there a branch that has it?
> i believe your joint experimentation with dan on generating `unknown'
> nouns and verbs was after 0902 but is reflected in the current `trunk'.
> but before making that functionality part of the forthcoming release, i
> was hoping to have a little more discussion about what we actually want
> in terms of generation with generic lexical entries.

And your wish is being granted.

> i guess there is consensus on the various NE classes listed above, i.e.
> where there is a CARG corresponding to surface form we will continue to
> support that (names, numbers, dates, and such).

Good.  The new chart mapping stuff also makes some classes work that
had been problematic, e.g. some of the date ersatz (like 1967).

> as for generating nouns or verbs that are not in the lexicon, i see no
> point in trying to support "_bazed_vbd_rel" as a valid generator input.

I agree.  I think "_bazed_vbd_rel" should not be the representation
for unknown words.

> we might be able, however, to support "_baze_v_?_rel", but then i think
> we need to say a little more about how we can map unknown predicates to
> generic lexical entries; and on how to relate the predicate and surface
> form to be generated.  in generation, i think, inflection is determined
> by variable properties; hence, generic entries for generation will need
> to be different from those used in parsing.

I don't understand the last sentence.

>  further, if we assume that
> the grammar can provide generics (in generation) for a limited sub-set
> of argument frames for verbs (intransitive and simple transitive, say),
> nouns (mass or count, non-relational), and adjectives (intersective,
> non-relational), then the generator should check the complete EP giving
> rise to a generic for compatibility.  for example, an input MRS with an
> instantiated ARG1 on an unknown noun should be rejected, in my view.

I think it is reasonable to expect that the unknown word handling will
not correctly handle all cases, and may reject some. For Jacy, and I
think for the ERG as well, our policy is to add idiosyncratic words
(in inflection, syntax or semantics) to the grammar/lexicon as far as
we can.  I think we should do this even for relatively rare words.
The expectation can then be that the remaining unknown words are
fairly regular in their behavior.

> an extra layer of `generics' trigger rules would likely be adequate to
> capture the correspondence relation between `unknown' EPs and generics
> available to the generator.  it would still have to be combined with a
> convention of how to determine the surface form: /^_([^_]+).*_rel$ -->
> \1 would seem like a plausible start, i think.
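The convention just quoted is easy to prototype; a minimal sketch
(Python, my own illustration of the proposed regex, not existing
generator code):

```python
import re

def surface_from_pred(pred):
    """Extract the surface/stem from a predicate name using the
    convention proposed above: the first underscore-delimited field
    of a '_..._rel' predicate."""
    m = re.match(r"^_([^_]+).*_rel$", pred)
    return m.group(1) if m else None

print(surface_from_pred("_baze_v_?_rel"))  # -> baze
```

Grammar-internal predicates without a leading underscore simply fail
to match, which seems like the right behavior for such a rule.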


> --- i would be grateful for comments on these thoughts, especially from
> ann, dan, and francis.  i believe i could implement a first shot at the
> setup sketched here for the 0907 ERG release.  however, to decide where
> to use my time, it would also be helpful to know who actually makes use
> of the ERG paraphrasing setup in current projects?

I am experimenting with it, aided by an intern from Georgia Tech
(Darren).  The MT group at NICT has asked me to make the paraphraser
available for them, as they think it will help in the SMT
competitions.  We have also had a request for the paraphraser from a
group in India.

Eric and I, and more recently Sanghoun, also want to use the unknown
word handling in our MT systems.  Better parsing and generation of the
most common classes of unknown words will make an immediate difference
to the quality of our systems.

Now I will try and cover some of the later discussion:

I agree completely with Ann that we should keep to the agreed-upon
RMRS standard for predicate names.  I can also think of no potential
user of the grammar/parser who would want the surface form rather
than the base form here.  Any link to an external resource
(dictionary, ontology, transfer, generator) will need the base form.

> well, in our setup input to parsing is PoS-tagged but /not/ lemmatized.
> so `bazed' and `VBD' really are the only pieces of information that we
> have available when synthesizing the initial PRED values for generics.
> even classifying the various tags into _v_, _n_, and _a_ would require
> multiplying out token mapping rules, i.e. come at a small extra cost.

I realize that determining the base form is non-trivial.

My preference would be to get it from the morphological analyzer.  In
the current setup TnT does not provide the lemmatized form, but other
setups are possible.  We could (a) add a wrapper with morpha or
equivalent, or (b) switch to a tagger that does provide the base form.
The taggers for Japanese and Spanish do provide the base form.  In the
actual parse, each entry is associated with a lexical type --- I
assumed that this gives enough information to decide _v_, _n_, or _a_.
I would be very impressed to find that the ERG has super-types for
these classes used in parsing.  But perhaps I am missing something?
Could you (or Dan) give some examples where classifying the various
tags into _v_, _n_, and _a_ would require multiplying out token
mapping rules?  If it is just a few rules, then I think this cost
would be dwarfed by the cost of a separate post-processing transfer
grammar.

> i think an actual solution would require a tool like morpha (which is
> part of pre-processing in RASP, i believe), adapted for PTB tags and
> american english.  one could argue this /should/ be part of our input
> pre-processing prior to parsing, but that is not an option right now.

Why is it not an option?  I thought Tim had something like this
already.  If not, we could make one, or maybe see if the RASP project
has one squirreled away somewhere.  If we can't find anything better,
I volunteer to make one: under the assumption that irregular cases
should go in the ERG proper, I believe it would be reasonably cheap to
build.
> and in principle there can be lemmatization ambiguity, which (for the
> cases discussed here) has no bearing on parsing; thus it is desirable
> to defer such ambiguity until as late as possible (late commitment),
> much like i would expect to do (the bulk of) WSD /after/ parsing.

There is a cost to adding lemmatization ambiguity.  Leaving it
unspecified is a possibility, but I think we should keep in mind that
almost any potential user of the grammar is going to have to
disambiguate at some stage; deferring the ambiguity just moves the
cost elsewhere, it does not eliminate it.

I think there are other reasonably principled alternatives: one is to
say that unknown words are relatively rare, and we can afford to add
in some ambiguity.  Another is to say that unknown words are likely to
be regular and just choose a best guess.  If it turns out that we are
wrong and a word is important, then it gets added to the lexicon.
Neither of these is a perfect solution, but I consider both of them
better than doing nothing.

> did you try yourself?  you underestimate the complexity of this issue:
> the ERG orthographemic rules hypothesize three candidate stems: `ban',
> `bann', and `banne'.  without a lexicon, i believe all are justified.

My solution would be to add as many verbs that end in a doubled
consonant to the ERG as we can, and then only run the regular rule for
the remaining unknown words.
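For concreteness, the ambiguity being discussed can be reproduced with
a toy reverse of the regular -ed spelling rules.  This is my own
sketch of the three hypotheses, not the ERG's actual orthographemic
machinery:

```python
def candidate_stems(form):
    """Reverse the regular -ed spelling rules naively, yielding every
    stem hypothesis that is justified without a lexicon."""
    stems = []
    if form.endswith("ed"):
        base = form[:-2]
        stems.append(base)          # banned -> bann (doubling kept)
        stems.append(base + "e")    # banned -> banne (undo e-deletion)
        if len(base) >= 2 and base[-1] == base[-2]:
            stems.append(base[:-1]) # banned -> ban (undo doubling)
    return stems

print(candidate_stems("banned"))  # -> ['bann', 'banne', 'ban']
print(candidate_stems("bazed"))   # -> ['baz', 'baze']
```

Only a lexicon (or a frequency-based guess) can choose among these,
which is exactly why putting the doubled-consonant verbs into the ERG
shrinks the problem.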

> we do not want this (silly) ambiguity in parsing, nor would i be keen
> on putting an approximate (procedural) solution (restricted to english)
> into the parser (PET, in our case).  thus i still think post-processing
> is the best we can do (as long as parser inputs are not lemmatized).

I see the attraction of keeping a language-specific approximate
solution out of the parser.  I think it should therefore go into the
pre-processor: i.e. we should lemmatize the parser inputs.  If this
isn't really feasible, then an approach that reuses the grammar's
existing morphological rules (like Ann outlined) would go some way to
solving the problem of language-specificity. The grammar writer then
becomes responsible for either (a) mapping the tags to inflectional
rules and providing a word list or (b) providing a tagger that
lemmatizes.    Whatever is decided for English, I would really like
(b) to be available for Japanese, as I would prefer not to have to
redo what ChaSen (or Juman) give me already.

Anyway, I am really glad that you brought this up so we could discuss
it before things get set in stone.  Can I encourage other people with
a potential interest to chip in?  Berthold?  Montse?  Antonio?
Ulrich --- I would think unknown word predicate naming is important
for the HoG too :-)


Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group
