[developers] [erg] New ERG with improved tokenization/preprocessing for PET

Wed May 20 18:58:55 CEST 2009

just briefly - it would be better from my point of view if the unknown
word predicates could conform to the agreed conventions when they are
constructed - i.e., _bazed_v_vbd_rel or better _baze_v_vbd_rel with
the surface form recorded (and ending up in the appropriate slot in
the MRS XML).  This would be in line with what I've been doing for
RASP RMRS (actually I would use _baze_v_rel there).  The point of
introducing that convention was that it was possible to build
consistent names automatically, after all.  If the problem is that the
tagger isn't explicitly providing lemmatizations, then I guess we can
do it via post-processing, though we could just as well do it when the
predicates are constructed.

Ann

> 
> g'day,
> 
> > It appears that this grammar will not generate unknown words, although
> > I for one had been hoping it would.
> 
> the 0902 release of the ERG (for all i know) has preserved the support
> for generation of unknown names (`named_rel'), numbers (`card_rel' and
> `ord_rel'), and some dates (`dofw_rel' et al.).  for all these cases,
> the surface form is constructed from the CARG (plus inflection, where
> appropriate), much as it has been in recent years.
> 
> > [22:09:43] translate(): error `invalid predicates: |named_unk_rel("Frodo")|'.
> > 
> > This is with terg+tnt (with option -mrs) as the interactive cpu, and
> > terg running as the top level grammar.  Batch processing has the same
> > problem.
> 
> in my view, it is an error to expect the (current) parsing outputs to
> always be valid generator inputs.  `named_unk_rel' should not be used
> in inputs to generation (a provider of an input semantics should need
> no knowledge of which names exist in the ERG lexicon).
> 
> with the revised treatment of NEs and unknown words in parsing, there
> are many more inputs that parse, but the semantics assigned to unknown
> words is often `incomplete' (or `internal', or `not quite right'), and
> hence i think one would have to add an MRS post-processing step before
> trying to feed these MRSs back into the generator.
> 
> names are probably not the most interesting example, as one might ask
> why `Frodo' should end up with a different predicate than `Abrams' in
> parsing.  at present, the `named_unk_rel' is an attempt at marking the
> fact that `Frodo' was parsed as an unknown name in the semantics.  for
> all i see, the underlying generic lexical entry could just as well use
> `named_rel' instead.
> 
> more interesting, however, are unknown nouns, verbs, etc.  looking at
> the example towards the bottom of
> 
>   http://wiki.delph-in.net/moin/PetInput
> 
> the current goal is for an unknown verb like `bazed' (as recognized by
> the parser by virtue of its PoS tag: VBD) to introduce a predicate like
> "_bazed_vbd_rel", i.e. just the concatenation of the token surface form
> and the PoS tag.  the reasoning is that the grammar has no knowledge of
> the actual stem (which could be `baz' or `baze' in this example; with a
> doubled consonant, there would be three alternatives), and therefore we
> `short-circuit' morphology: most PoS-based generic lexical entries are 
> already inflected, i.e. words rather than lexemes.  this is really all
> the parser can do at this point (without introducing silly ambiguity).
> 
> using the predicate "_bazed_vbd_rel" preserves all information provided
> by the tagger and grammar for downstream processing.  my expectation is
> that this `incomplete' MRS should be post-processed after parsing, such
> that the predicate can be rewritten to "_baze_v_?_rel", or whatever one
> deems appropriate in terms of the external interface.
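
[A minimal sketch of the post-processing step described above, in Python.
It is not part of the ERG or of any existing transfer grammar.  The tag
table, the function names, and the naive `-ed' stripping are assumptions
made for illustration; a real setup would need a proper lemmatizer to
choose among the candidate stems.]

```python
import re

# Assumed mapping from a few Penn Treebank tags to the coarse PoS letter
# used in predicate names; illustrative only, not taken from the ERG.
PTB_TO_POS = {"VB": "v", "VBD": "v", "VBN": "v", "NN": "n", "NNS": "n", "JJ": "a"}

# Parser-internal unknown-word predicate: _<surface form>_<tag>_rel
UNKNOWN_PRED = re.compile(r"^_(?P<form>[^_]+)_(?P<tag>[a-z]+)_rel$")


def candidate_stems(form, tag):
    """Enumerate possible stems for a past-tense form (naive heuristic)."""
    if tag in ("VBD", "VBN") and form.endswith("ed") and len(form) > 3:
        base = form[:-2]
        stems = [base, base + "e"]          # `baz' or `baze'
        # A doubled final consonant adds a third alternative.
        if len(base) > 2 and base[-1] == base[-2] and base[-1] not in "aeiou":
            stems.append(base[:-1])
        return stems
    return [form]


def normalized_candidates(pred, sense=None):
    """Rewrite e.g. _bazed_vbd_rel into candidate _<stem>_<pos>[_<sense>]_rel."""
    m = UNKNOWN_PRED.match(pred)
    if not m:
        return [pred]                       # not an unknown-word predicate
    tag = m.group("tag").upper()
    pos = PTB_TO_POS.get(tag)
    if pos is None:
        return [pred]                       # tag not covered by this sketch
    suffix = f"_{sense}" if sense else ""
    return [f"_{stem}_{pos}{suffix}_rel"
            for stem in candidate_stems(m.group("form"), tag)]
```

[The function returns every candidate rather than picking one, since, as
noted above, the stem cannot be determined from the surface form alone.]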
> 
> for the paraphrasing setup, such rewriting can be part of the transfer
> grammar that is invoked after parsing and prior to generation.  i will
> aim to at least provide a first shot at predicate normalization in the  
> forthcoming 0907 ERG release.
> 
> > Did unknown word generation not make it into the mainstream?  If so, is
> > there a branch that has it?
> 
> i believe your joint experimentation with dan on generating `unknown'
> nouns and verbs was after 0902 but is reflected in the current `trunk'.
> but before making that functionality part of the forthcoming release, i
> was hoping to have a little more discussion about what we actually want
> in terms of generation with generic lexical entries.
> 
> i guess there is consensus on the various NE classes listed above, i.e.
> where there is a CARG corresponding to surface form we will continue to
> support that (names, numbers, dates, and such).
> 
> as for generating nouns or verbs that are not in the lexicon, i see no
> point in trying to support "_bazed_vbd_rel" as a valid generator input.
> 
> we might be able, however, to support "_baze_v_?_rel", but then i think
> we need to say a little more about how we can map unknown predicates to
> generic lexical entries; and on how to relate the predicate and surface
> form to be generated.  in generation, i think, inflection is determined
> by variable properties; hence, generic entries for generation will need
> to be different from those used in parsing.  further, if we assume that
> the grammar can provide generics (in generation) for a limited sub-set
> of argument frames for verbs (intransitive and simple transitive, say),
> nouns (mass or count, non-relational), and adjectives (intersective,
> non-relational), then the generator should check the complete EP giving
> rise to a generic for compatibility.  for example, an input MRS with an
> instantiated ARG1 on an unknown noun should be rejected, in my view.
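
[A rough Python sketch of the compatibility check proposed above.  The EP
class, the role inventory per PoS, and the predicate pattern are
assumptions for illustration; they do not reflect the actual generator or
its data structures.]

```python
import re
from dataclasses import dataclass, field


@dataclass
class EP:
    """Hypothetical elementary predication: predicate plus role -> variable."""
    pred: str
    args: dict = field(default_factory=dict)


# Assumed argument frames for which generics would exist in generation.
ALLOWED_ROLES = {
    "v": [{"ARG0", "ARG1"}, {"ARG0", "ARG1", "ARG2"}],  # intransitive, simple transitive
    "n": [{"ARG0"}],                                    # non-relational noun
    "a": [{"ARG0", "ARG1"}],                            # intersective adjective
}

# _<lemma>_<pos>[_<sense>]_rel with pos restricted to n, v, a
NORMALIZED_PRED = re.compile(r"^_[^_]+_([nva])(?:_[^_]+)?_rel$")


def compatible_with_generic(ep):
    """True if the instantiated roles of the EP match one supported frame."""
    m = NORMALIZED_PRED.match(ep.pred)
    if not m:
        return False
    instantiated = {role for role, var in ep.args.items() if var is not None}
    return instantiated in ALLOWED_ROLES[m.group(1)]
```

[Under these assumptions, an unknown noun EP carrying an ARG1 is rejected,
while the same EP with only an ARG0 is accepted.]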
> 
> an extra layer of `generics' trigger rules would likely be adequate to
> capture the correspondence relation between `unknown' EPs and generics
> available to the generator.  it would still have to be combined with a
> convention of how to determine the surface form: /^_([^_]+).*_rel$/ -->
> \1 would seem like a plausible start, i think.
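
[The surface-form convention above can be tried out directly in Python.
The function name is made up for this sketch; only the pattern itself
comes from the message.]

```python
import re

# Pattern from the message: first field after the leading underscore.
SURFACE = re.compile(r"^_([^_]+).*_rel$")


def surface_form(pred):
    """Return the first predicate field, or None if the pattern fails."""
    m = SURFACE.match(pred)
    return m.group(1) if m else None
```

[Note that this yields the lemma field of the predicate (e.g. `baze' for
"_baze_v_rel"); the inflected form would still have to be derived from
variable properties, as discussed above.  Predicates without a leading
underscore, such as `named_rel', do not match.]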
> 
> --- i would be grateful for comments on these thoughts, especially from
> ann, dan, and francis.  i believe i could implement a first shot at the
> setup sketched here for the 0907 ERG release.  however, to decide where
> to use my time, it would also be helpful to know who actually makes use
> of the ERG paraphrasing setup in current projects.
> 
>                                                       all best  -  oe
> 
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++