[developers] New ERG with improved tokenization/preprocessing for PET

Wed May 20 14:12:27 CEST 2009

g'day,

> It appears that this grammar will not generate unknown words, although
> I for one had been hoping it would.

the 0902 release of the ERG (for all i know) has preserved the support
for generation of unknown names (`named_rel'), numbers (`card_rel' and
`ord_rel'), and some dates (`dofw_rel' et al.).  for all these cases,
the surface form is constructed from the CARG (plus inflection, where
appropriate), much as it has been in recent years.

> [22:09:43] translate(): error `invalid predicates: |named_unk_rel("Frodo")|'.
> 
> This is with terg+tnt (with option -mrs) as the interactive cpu, and
> terg running as the top level grammar.  Batch processing has the same
> problem.

in my view, it is an error to expect the (current) parsing outputs to
always be valid generator inputs.  `named_unk_rel' should not be used
in inputs to generation (a provider of an input semantics should need
no knowledge of which names exist in the ERG lexicon).

with the revised treatment of NEs and unknown words in parsing, there
are many more inputs that parse, but the semantics assigned to unknown
words is often `incomplete' (or `internal', or `not quite right'), and
hence i think one would have to add an MRS post-processing step before
trying to feed these MRSs back into the generator.

names are probably not the most interesting example, as one might ask
why `Frodo' should end up with a different predicate than `Abrams' in
parsing.  at present, the `named_unk_rel' is an attempt at marking in
the semantics the fact that `Frodo' was parsed as an unknown name.  as
far as i can see, the underlying generic lexical entry could just as
well use `named_rel' instead.

more interesting, however, are unknown nouns, verbs, etc.  looking at
the example towards the bottom of

  http://wiki.delph-in.net/moin/PetInput

the current goal is for an unknown verb like `bazed' (as recognized by
the parser by virtue of its PoS tag: VBD) to introduce a predicate like
"_bazed_vbd_rel", i.e. just the concatenation of the token surface form
and the PoS tag.  the reasoning is that the grammar has no knowledge of
the actual stem (which could be `baz' or `baze' in this example; with a
doubled consonant, there would be three alternatives), and therefore we
`short-circuit' morphology: most PoS-based generic lexical entries are 
already inflected, i.e. words rather than lexemes.  this is really all
the parser can do at this point (without introducing silly ambiguity).
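
as a minimal sketch of the above (in python; the function names are
hypothetical and not part of PET, the LKB, or the ERG), the predicate is
just the surface form glued to the lower-cased PoS tag, while the stem
stays ambiguous.  the stem enumeration only covers `-ed' forms and is an
illustration of the ambiguity, not a morphological analyzer:

```python
def unknown_predicate(form, tag):
    """Concatenate token surface form and PoS tag, e.g. bazed + VBD."""
    return "_{}_{}_rel".format(form, tag.lower())


def candidate_stems(form):
    """Enumerate plausible stems for a past-tense form ending in -ed."""
    stems = []
    if form.endswith("ed") and len(form) > 3:
        base = form[:-2]
        stems.append(base)        # baz + ed
        stems.append(base + "e")  # baze + d
        # doubled final consonant: bazzed could also be baz + (z) + ed
        if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiou":
            stems.append(base[:-1])
    return stems


print(unknown_predicate("bazed", "VBD"))
print(candidate_stems("bazed"))
print(candidate_stems("bazzed"))
```

the two calls to candidate_stems() show why the parser does not commit
to a stem: two candidates for `bazed', three for `bazzed'.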

using the predicate "_bazed_vbd_rel" preserves all information provided
by the tagger and grammar for downstream processing.  my expectation is
that this `incomplete' MRS should be post-processed after parsing, such
that the predicate can be rewritten to "_baze_v_?_rel", or whatever one
deems appropriate in terms of the external interface.
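
a rough sketch of such a post-processing step, in python and outside
of any actual transfer grammar: the tag table and the choose_stem
callback are assumptions made up for illustration, and the literal `?'
is simply carried over from the example predicate above.

```python
import re

# hypothetical mapping from (lower-cased) PoS tags to coarse categories
TAG_TO_POS = {
    "vb": "v", "vbd": "v", "vbg": "v", "vbn": "v", "vbp": "v", "vbz": "v",
    "nn": "n", "nns": "n",
    "jj": "a", "jjr": "a", "jjs": "a",
}

# matches only the form-plus-tag shape, e.g. _bazed_vbd_rel
UNKNOWN = re.compile(r"^_([^_]+)_([a-z]+)_rel$")


def normalize(predicate, choose_stem):
    """Rewrite _FORM_TAG_rel to _STEM_POS_?_rel; leave anything else alone."""
    match = UNKNOWN.match(predicate)
    if match is None:
        return predicate
    form, tag = match.groups()
    pos = TAG_TO_POS.get(tag)
    if pos is None:
        return predicate
    return "_{}_{}_?_rel".format(choose_stem(form, tag), pos)


# the caller decides on the stem; here we simply pick `baze'
print(normalize("_bazed_vbd_rel", lambda form, tag: "baze"))
```

predicates that do not look like form-plus-tag (e.g. `named_rel') pass
through unchanged, so the step can be applied to every EP in an MRS.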

for the paraphrasing setup, such rewriting can be part of the transfer
grammar that is invoked after parsing and prior to generation.  i will
aim to at least provide a first shot at predicate normalization in the  
forthcoming 0907 ERG release.

> Did unknown word generation not make into the mainstream?  If so, is
> there a branch that has it?

i believe your joint experimentation with dan on generating `unknown'
nouns and verbs was after 0902 but is reflected in the current `trunk'.
but before making that functionality part of the forthcoming release, i
was hoping to have a little more discussion about what we actually want
in terms of generation with generic lexical entries.

i guess there is consensus on the various NE classes listed above, i.e.
where there is a CARG corresponding to surface form we will continue to
support that (names, numbers, dates, and such).

as for generating nouns or verbs that are not in the lexicon, i see no
point in trying to support "_bazed_vbd_rel" as a valid generator input.

we might be able, however, to support "_baze_v_?_rel", but then i think
we need to say a little more about how we can map unknown predicates to
generic lexical entries; and on how to relate the predicate and surface
form to be generated.  in generation, i think, inflection is determined
by variable properties; hence, generic entries for generation will need
to be different from those used in parsing.  further, if we assume that
the grammar can provide generics (in generation) for a limited sub-set
of argument frames for verbs (intransitive and simple transitive, say),
nouns (mass or count, non-relational), and adjectives (intersective,
non-relational), then the generator should check the complete EP giving
rise to a generic for compatibility.  for example, an input MRS with an
instantiated ARG1 on an unknown noun should be rejected, in my view.
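
to make the compatibility check concrete, here is a small python
sketch.  the role inventory per frame is my assumption for the sake of
the example (only ARGn roles are considered, not LBL), and an EP is
reduced to its predicate plus a dictionary of instantiated roles:

```python
# hypothetical inventory: the role sets each generic frame may instantiate
GENERIC_FRAMES = {
    "v": [{"ARG0", "ARG1"}, {"ARG0", "ARG1", "ARG2"}],  # intransitive, transitive
    "n": [{"ARG0"}],                                    # non-relational noun
    "a": [{"ARG0", "ARG1"}],                            # intersective adjective
}


def is_compatible(predicate, roles):
    """Check an `unknown' EP against the frames the generics can cover."""
    parts = predicate.split("_")
    # expect ['', stem, pos, ..., 'rel'], as in _baze_v_?_rel
    if len(parts) < 4 or parts[0] != "" or parts[-1] != "rel":
        return False
    frames = GENERIC_FRAMES.get(parts[2], [])
    return set(roles) in frames


print(is_compatible("_baze_v_?_rel", {"ARG0": "e2", "ARG1": "x4"}))
print(is_compatible("_wug_n_?_rel", {"ARG0": "x4", "ARG1": "x6"}))
```

the second call corresponds to the unknown noun with an instantiated
ARG1, which this check rejects.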

an extra layer of `generics' trigger rules would likely be adequate to
capture the correspondence relation between `unknown' EPs and generics
available to the generator.  it would still have to be combined with a
convention of how to determine the surface form: /^_([^_]+).*_rel$/ -->
\1 would seem like a plausible start, i think.
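
spelled out in python (the function name is hypothetical; the pattern is
the one given above), the convention amounts to taking everything
between the leading underscore and the next one:

```python
import re

SURFACE = re.compile(r"^_([^_]+).*_rel$")


def surface_form(predicate):
    """Return the first underscore-delimited field, or None if no match."""
    match = SURFACE.match(predicate)
    return match.group(1) if match else None


print(surface_form("_baze_v_?_rel"))
print(surface_form("named_rel"))
```

predicates without a leading underscore, like `named_rel', do not match;
for those the surface form would continue to come from the CARG.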

--- i would be grateful for comments on these thoughts, especially from
ann, dan, and francis.  i believe i could implement a first shot at the
setup sketched here for the 0907 ERG release.  however, to decide where
to use my time, it would also be helpful to know who actually makes use
of the ERG paraphrasing setup in current projects.

                                                      all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++