[developers] [erg] on generation failure with the Barcelona release and later

Wed May 5 09:58:57 CEST 2010

Hi Xuchen Yao -

We have made some progress on generation with unknown words since last summer, and even though we have not yet arrived at an ideal solution, I believe that the most recent (1004) version should work pretty well.  Here is what I just confirmed this morning:

1. Load current LKB (I'm running from the LOGON repository, which might matter)
2. Load ERG (1004)
3. Index for generation (LKB Top -- Generate -- Index) and start the generator server (LKB Top -- Generate -- Start server)
4. Using the `erg+tnt' CPU definition in $LOGONROOT/dot.tsdbrc, call PET to parse a sentence containing unknown words (I parsed `The glimpy glump arrived.')
5. Identify the MRS for the intended analysis, and generate:
   - Since I'm using [incr tsdb()], I just clicked `Annotate' for this one-sentence profile, then (left-) clicked on the analysis I wanted (where `glimpy' is an unknown adjective), and clicked `Rephrase' which generated the same sentence successfully.

You should know that the generator is currently expecting a very specific format for the predicate names, following this template:
_surface-orthography/POS_u_unknown_rel
where POS is one of the tags you'll find in the generic lexical entries in erg/gle.tdl.  For example, the predicate for `glimpy' is _glimpy/jj_u_unknown_rel.  Likewise, for unknown nouns, the predicate name must be of the following form: _glump/nn_u_unknown_rel, where the first field in the predicate name again consists of the surface orthography followed by a slash followed by the POS tag.
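
As a rough illustration of the template above, here is a minimal sketch in Python. It is not code from the LKB, PET, or the ERG: the helper names make_unknown_pred and parse_unknown_pred are hypothetical, and lowercasing both fields is an assumption based only on the examples in this message.

```python
import re

# Hypothetical helpers for illustration only; not part of the LKB, PET, or ERG.
# Template from the message: _surface-orthography/POS_u_unknown_rel
_UNKNOWN_PRED = re.compile(r"^_(?P<orth>.+)/(?P<pos>[^/_]+)_u_unknown_rel$")


def make_unknown_pred(orth, pos):
    """Build a predicate name following the template.

    Assumption: both fields are lowercased, as in the examples
    _glimpy/jj_u_unknown_rel and _glump/nn_u_unknown_rel.
    """
    return "_{}/{}_u_unknown_rel".format(orth.lower(), pos.lower())


def parse_unknown_pred(pred):
    """Return (orth, pos) if pred follows the template, otherwise None."""
    match = _UNKNOWN_PRED.match(pred)
    if match is None:
        return None
    return match.group("orth"), match.group("pos")
```

A check like parse_unknown_pred("_iconic_jj_rel") returns None, since that name has no slash-separated POS field, which is one quick way to spot predicates that do not follow the expected format before handing an MRS to the generator.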

Unknown proper names are simpler: PET simply creates an ordinary `named_rel' EP with the new proper name as the CARG value in that EP.  I confirmed that this works by parsing the following sentence with PET: `We hired Grundy.' and then generating from the MRS for the single analysis that PET returns. 

Similarly for years like `1884', we just use the ordinary predicate `yofc_rel', and provide the year as the CARG value.  I confirmed this with the sentence `We arrived in 1884.', which generates fine.
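
To summarize the three cases described so far, here is a small Python sketch. The dict layout and the function name sketch_ep are hypothetical and only meant to show which predicate and CARG go together; this is not how PET or the LKB represent EPs internally, and the caller is assumed to already know which kind of token it has.

```python
def sketch_ep(kind, surface, pos=None):
    """Return a toy EP (a plain dict) for one of the three cases above.

    kind is one of "proper_name", "year", or "unknown". This is an
    illustrative stand-in, not an LKB or PET data structure.
    """
    if kind == "proper_name":
        # Ordinary named_rel, with the new name as the CARG value.
        return {"pred": "named_rel", "carg": surface}
    if kind == "year":
        # Ordinary yofc_rel, with the year as the CARG value.
        return {"pred": "yofc_rel", "carg": surface}
    if kind == "unknown":
        if pos is None:
            raise ValueError("unknown words need a POS tag")
        # Template: _surface-orthography/POS_u_unknown_rel (no CARG).
        # Lowercasing is an assumption based on the examples given.
        pred = "_{}/{}_u_unknown_rel".format(surface.lower(), pos.lower())
        return {"pred": pred}
    raise ValueError("unsupported kind: {!r}".format(kind))
```

For instance, sketch_ep("proper_name", "Grundy") pairs named_rel with the CARG `Grundy', while sketch_ep("unknown", "glimpy", "jj") yields only a predicate name, with the surface form encoded in the name itself.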

The main flaw in our current approach is that we have no good way of determining on the fly the lemma of an unknown word, so the unknown noun `glumps' gives rise to the predicate name _glumps/nns_u_unknown_rel rather than one built on the lemma `glump', which is of course not ideal.  We'll work on this further, but in the meantime I would be glad to hear whether you can reproduce the behavior I describe above with the 1004 version of the ERG.

Best,

 Dan

----- Original Message -----
From: "Xuchen Yao" <xuchen at coli.uni-saarland.de>
To: developers at delph-in.net, erg at delph-in.net
Sent: Tuesday, May 4, 2010 12:14:50 PM
Subject: [erg] on generation failure with the Barcelona release and later

Hi,

I noticed there was some intensive discussion of generation failure with
unknown words on the mailing list last year. People then agreed to
continue the discussion at last year's meeting, but I didn't find any
memo on the delph-in website. It looks like the Barcelona (0907) ERG
release was intended to address this issue, so I switched from the current
stable version (0902) to 0907 and even to the newest in the trunk (1004),
hoping to get a better handle on unknown words (or the "invalid
predicates" error). Unfortunately it didn't work out. Here is a short
summary of what I observed in my experiments:

The basic idea is to follow what Stephan said:

"hence i think one would have to add an MRS post-processing step before
trying to feed these MRSs back into the generator." from
http://lists.delph-in.net/archive/developers/2009/001217.html

1. For unknown NNP, I changed `named_unk_rel' to `named_rel', which works
for 0902. (If I remember correctly, this change doesn't work for 0907
and 1004.)

2. For errors like invalid predicates: |basic_yofc_rel("1998"), from the
sentence "He left in 1998." I changed basic_yofc_rel to number_q_rel =q
card_rel as a shortcut to avoid a generation failure. This works under
0902.

3. For errors like invalid predicates: |"_iconic_jj_rel"|, from "This is
an iconic place." I tried changing the *_jj_rel to generic_unk_adj_rel
with "iconic" as the CARG value, but this didn't work under either 0902
or 0907.

I didn't observe generation failure on unknown verbs, but did have some
cases of failure on nouns, such as: invalid predicates:
|"_wreckage_nn_rel"|, |"_oscillation_nn_rel"|, |"_axiom_nn_rel"|.

For the generation task, my naive thought is that if cheap can parse a
sentence, then the LKB should be able to generate from cheap's MRS
output. To parse successfully, I used the chart-mapping branch of cheap
to support pre-processed (POS-tagged) sentences, but the problem of
generation failure due to invalid predicates still exists. Since there
was a discussion on this at last year's meeting and the ERG release is
rolling forward, it looks to me as if this issue has already been solved
(since the 0907 release) and I am simply using the wrong method. I'd
appreciate it very much if somebody could help me out. Thanks.

With kind regards,

Xuchen Yao