[developers] Generation with unknown words

Sun Feb 7 11:38:46 CET 2016

for realization at least, isn't it adequate to use a lemma list 
extracted from (say) WordNet to support predicate normalisation?
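
A minimal sketch of what I have in mind, assuming the normaliser proposes several candidate stems and the lemma list is available as a plain set. The function name, the candidate stems, and the tiny lemma set are all hypothetical and not taken from any DELPH-IN tool:

```python
def pick_lemma(candidates, lemma_list):
    """Return the first candidate stem that is attested in lemma_list, else None."""
    for stem in candidates:
        if stem in lemma_list:
            return stem
    return None


# Hypothetical stand-in for a lemma list extracted from a resource such as WordNet.
LEMMAS = {"combust", "like", "porcelain"}

# Hypothetical candidate stems for the surface form "combusted".
print(pick_lemma(["combuste", "combust"], LEMMAS))   # combust
# No candidate is attested, so the caller has to fall back on something else.
print(pick_lemma(["zanne", "zann", "zan"], LEMMAS))  # None
```

A list like this only helps for words it happens to contain; a genuinely novel stem still needs a fallback.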

But the application that Alex is interested in is a form of 
regeneration.  So I think that as long as the generator accepts what the 
parser outputs for unknown words, it really doesn't matter whether or 
not it's normalised.  I don't know whether or not anyone is using the 
realiser for applications which are broad-coverage (hence need unknown 
words) and where the *MRS is constructed from scratch (hence need to use 
lemmas for the predicates).  Excluding MT, of course.

All best,

Ann

On 07/02/2016 10:15, Stephan Oepen wrote:
> there actually are two separate mechanisms to discuss: (a) lexical 
> instantiation for unknown predicates (in realization) and (b) 
> predicate normalization for unknown words (in parsing).
>
> as for (a), i find the current LKB mechanism about as generic as i can 
> imagine (and consider appropriate).  the grammar provides an inventory 
> of generic lexical entries for realization (these are in part distinct 
> from the parsing ones, in the ERG, because the strategies for dealing 
> with inflection are different).  for each such entry, the 
> grammar declares which MRS predicate activates it and how to determine 
> its orthography.  the former is accomplished via a regular expression, 
> e.g. something like /^named$/ or /^_([^_]+)/.  the latter either comes 
> from the (unique) parameter of the relation with the unknown predicate 
> (CARG in the ERG) or from the part of the predicate matched as the 
> above capture group (the lemma field).  there is no provision for 
> generic lexical entries with decomposed semantics (in realization).
>
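
To make (a) concrete, here is a rough sketch of trigger matching, not the LKB or ERG code. It assumes a small table of generic entries, each with a predicate pattern and a note on where the orthography comes from. The two patterns are the ones quoted above; the entry names and the `instantiate` function are made up for illustration:

```python
import re

# Hypothetical trigger table: (entry name, predicate pattern, orthography source).
GENERIC_ENTRIES = [
    ("generic_proper_ne", re.compile(r"^named$"), "carg"),
    ("generic_unknown", re.compile(r"^_([^_]+)"), "lemma"),
]


def instantiate(predicate, carg=None):
    """Return (entry name, orthography) for the first matching trigger, else None."""
    for name, pattern, source in GENERIC_ENTRIES:
        match = pattern.search(predicate)
        if match is None:
            continue
        if source == "carg":
            # Orthography comes from the relation's parameter (CARG in the ERG).
            return name, carg
        # Orthography comes from the capture group, i.e. the lemma field.
        return name, match.group(1)
    return None


print(instantiate("named", carg="Kim"))
print(instantiate("_porcelain_n_unknown_rel"))
print(instantiate("_porcelain/NN_u_unknown_rel"))
```

With these patterns the normalised predicate yields the orthography "porcelain", while the unnormalised one yields "porcelain/NN", which is one way to see why the generator cares about normalisation.
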
> regarding (b), the ERG in parsing outputs predicates like the 
> ones alex had noticed.  these are not fully normalized because there 
> is no reliable lemmatization facility for unknown words inside the 
> parser (and, thus, generic entries for parsing predominantly are full 
> forms).  what is recorded in the ‘lemma’ field is the actual surface 
> form, concatenated with the PoS that activated the generic entry.  the 
> ERG provides a mechanism for post-parsing normalization, again in 
> mostly declarative and general form: triggered by regular expressions 
> looking for PTB PoS tags in predicate names, an orthographemic rule of 
> the grammar can (optionally) be invoked on the remainder of the 
> ‘lemma’ field.  if i recall correctly, we ‘disambiguate’ 
> lemmatization naïvely and take the first output from the set of 
> matches of that rule.  the resulting string is injected into a 
> predicate template, e.g. something like "_~a_n_unknown_rel".
>
> i believe, at the time, i did not want to enable predicate 
> normalization as part of the standard parsing set-up because of its 
> heuristic (naïve disambiguation) nature.  for an input of, say, ‘they 
> zanned’, our current parsers have no knowledge beyond the surface form 
> and its tag VBD; hence, we provide what we know as 
> ‘_zanned/VBD_u_unknown’.  the past tense orthographemic rule of the 
> ERG will hypothesize three candidate stems (‘zanne’, ‘zann’, 
> or ‘zan’).  it would require more information than is in the grammar 
> to do a better job of lemmatization than my current heuristic.
>
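
The post-parsing normalisation described in (b) could be sketched roughly as follows. This is a toy illustration, not the ERG mechanism: the predicate pattern, the verb template (by analogy with the noun template quoted above), and the `past_stems` rule are all assumptions, and the real orthographemic rules live in the grammar:

```python
import re

# Assumed shape of parser predicates for unknown words:
#   _<surface form>/<PTB tag>_u_unknown, optionally followed by _rel
UNKNOWN = re.compile(r"^_(?P<form>[^/_]+)/(?P<tag>[A-Z]+\$?)_u_unknown(?:_rel)?$")

# "~a" marks the slot for the stem. The verb template is an assumption.
TEMPLATES = {
    "NN": "_~a_n_unknown_rel",
    "VBD": "_~a_v_unknown_rel",
    "VBN": "_~a_v_unknown_rel",
}


def past_stems(form):
    """Toy stand-in for a past-tense orthographemic rule: list candidate stems."""
    if not form.endswith("ed"):
        return [form]
    base = form[:-2]
    stems = [form[:-1], base]          # strip "d", then strip "ed"
    if len(base) >= 2 and base[-1] == base[-2]:
        stems.append(base[:-1])        # undo a doubled final letter
    return stems


RULES = {"VBD": past_stems, "VBN": past_stems}


def normalize(predicate):
    """Rewrite an unknown-word predicate; leave anything else untouched."""
    match = UNKNOWN.match(predicate)
    if match is None or match.group("tag") not in TEMPLATES:
        return predicate
    form, tag = match.group("form"), match.group("tag")
    rule = RULES.get(tag)
    # Naive disambiguation: take the first candidate the rule proposes.
    stem = rule(form)[0] if rule else form
    return TEMPLATES[tag].replace("~a", stem)


print(past_stems("zanned"))
print(normalize("_porcelain/NN_u_unknown_rel"))
print(normalize("_combusted/VBN_u_unknown_rel"))
```

Here `past_stems("zanned")` proposes 'zanne', 'zann', and 'zan', and taking the first candidate turns "_combusted/VBN_u_unknown_rel" into "_combuste_v_unknown_rel", which shows how easily the first-match heuristic picks the wrong stem.
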
> —having refreshed my memory of the issues, i retract my suggestion to 
> enable predicate normalization (in its current form) in MRS 
> construction after parsing.  i wish someone would work on providing a 
> broader-coverage solution to this problem.  but we have added an input 
> fix-up transfer step to realization in the meantime, and that would 
> seem like a good place for heuristic predicate normalization, for the 
> time being.  it would enable round-trip parsing and generation, yet 
> preserve exact information in parser outputs for someone to put a 
> better normalization module there.
>
> best wishes, oe
>
>
> On Sunday, February 7, 2016, Woodley Packard <sweaglesw at sweaglesw.org> 
> wrote:
>
>     Hello Alex,
>
>     This is a corner of the generation game that is not yet
>     implemented in ACE.  It’s been on the ToDo list for years but
>     nobody has bugged me about it so it has been sitting at low
>     priority.  As Stephan mentioned, the mechanism to make it work in
>     the LKB is both somewhat fiddly and covered in a few cobwebs, so I
>     had somewhat aloofly hoped that over the years someone would have
>     straightened things out to where generation from unknown
>     predicates had a canonical approach (e.g. implemented for multiple
>     grammars or multiple platforms).  I would be interested to hear
>     whether Glenn Slayden (who is on this list) has implemented this
>     in the Agree generator?
>
>     I’m willing to put the hour or two it would take to make this
>     work, but wonder if other DELPH-IN developers/grammarians have
>     ideas about ways in which the current setup (as implemented in the
>     ERG’s custom lisp code that patches into the LKB, if memory
>     serves) could be improved upon in the process?
>
>     Regards,
>     -Woodley
>
>>     On Feb 6, 2016, at 2:48 AM, Alexander Kuhnle <aok25 at cam.ac.uk> wrote:
>>
>>     Dear all,
>>     We came across the problem of generating from MRS involving
>>     unknown words, for instance, in the sentence “I like porcelain.”
>>     (parsing gives "_porcelain/NN_u_unknown_rel"). Is there an option
>>     for ACE so that these cases can be handled?
>>     Moreover, we came across the example “The phosphorus
>>     self-combusts.” vs ?“The phosphorus is self-combusted.” Where the
>>     first doesn’t parse, the second does, but doesn’t generate (again
>>     presumably because of "_combusted/VBN_u_unknown_rel"). It seems
>>     to not recognise verbs with a “self-” prefix, but does for past
>>     participles.
>>     Many thanks,
>>     Alex
>
