[developers] Generation with unknown words

Sun Feb 7 11:15:16 CET 2016

there actually are two separate mechanisms to discuss: (a) lexical
instantiation for unknown predicates (in realization) and (b) predicate
normalization for unknown words (in parsing).

as for (a), i find the current LKB mechanism about as generic as i can
imagine (and consider appropriate).  the grammar provides an inventory of
generic lexical entries for realization (these are in part distinct from
the parsing ones, in the ERG, because the strategies for dealing with
inflection are different).  for each such entry, the grammar declares which
MRS predicate activates it and how to determine its orthography.  the
former is accomplished via a regular expression, e.g. something
like /^named$/ or /^_([^_]+)/.  the latter either comes from the
(unique) parameter of the relation with the unknown predicate (CARG in the
ERG) or from the part of the predicate matched as the above capture group
(the lemma field).  there is no provision for generic lexical entries with
decomposed semantics (in realization).
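
to make the shape of that declaration concrete, here is a minimal python
sketch of the lookup described above.  it is not the LKB code: the entry
names, the table layout, and the function instantiate() are hypothetical,
and only the two regular expressions are taken from the text.

```python
import re

# hypothetical inventory of generic entries for realization:
# (trigger regex, where the orthography comes from, entry name)
GENERIC_ENTRIES = [
    (re.compile(r"^named$"), "carg", "generic_proper_noun"),
    (re.compile(r"^_([^_]+)"), "lemma", "generic_open_class"),
]

def instantiate(predicate, carg=None):
    """Return (entry name, orthography) for the first matching entry, or None."""
    for pattern, source, entry in GENERIC_ENTRIES:
        match = pattern.search(predicate)
        if match is None:
            continue
        if source == "carg":
            # orthography is the (unique) parameter of the relation
            if carg is None:
                continue
            return entry, carg
        # orthography is the part of the predicate matched by the capture group
        return entry, match.group(1)
    return None
```

under these assumptions, instantiate("named", carg="Kim") yields the string
"Kim" as orthography, while an un-normalized parser predicate such as
"_porcelain/NN_u_unknown_rel" yields "porcelain/NN", i.e. the tag is still
attached to the surface form.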

regarding (b), the ERG in parsing outputs predicates like the ones alex had
noticed.  these are not fully normalized because there is no reliable
lemmatization facility for unknown words inside the parser (and, thus,
generic entries for parsing predominantly are full forms).  what is
recorded in the ‘lemma’ field is the actual surface form, concatenated with
the PoS that activated the generic entry.  the ERG provides a mechanism for
post-parsing normalization, again in mostly declarative and general form:
triggered by regular expressions looking for PTB PoS tags in predicate
names, an orthographemic rule of the grammar can (optionally) be invoked on
the remainder of the ‘lemma’ field.  if i recall correctly, we
‘disambiguate’ lemmatization naïvely and take the first output from the set
of matches of that rule.  the resulting string is injected into a
predicate template, e.g. something like "_~a_n_unknown_rel".
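
as a rough illustration of these steps (tag-triggered match, optional
lemmatization, first candidate wins, template injection), consider the
python sketch below.  the regular expression, the RULES table, and the toy
plural rule are assumptions made for the example; python's "{}" stands in
for the lisp "~a" of the template, and the real orthographemic rules live
in the grammar.

```python
import re

# assumed shape of an un-normalized predicate:
#   _<surface form>/<PTB tag>_u_unknown, with an optional _rel suffix
UNKNOWN = re.compile(r"^_(?P<form>.+)/(?P<tag>[A-Z$]+)_u_unknown(?:_rel)?$")

def plural_stems(form):
    """Toy stand-in for an orthographemic rule: candidate singular stems."""
    return [form[:-1]] if form.endswith("s") else [form]

# hypothetical table: PTB tag -> (lemmatizer, predicate template)
RULES = {
    "NN": (lambda form: [form], "_{}_n_unknown_rel"),
    "NNS": (plural_stems, "_{}_n_unknown_rel"),
}

def normalize(predicate):
    """Rewrite an unknown-word predicate; leave anything else untouched."""
    match = UNKNOWN.match(predicate)
    if match is None or match.group("tag") not in RULES:
        return predicate
    lemmatize, template = RULES[match.group("tag")]
    candidates = lemmatize(match.group("form"))
    # naive disambiguation: take the first candidate stem
    return template.format(candidates[0])
```

with this table, normalize("_porcelain/NN_u_unknown_rel") gives
"_porcelain_n_unknown_rel", and predicates without a PoS tag pass through
unchanged.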

i believe, at the time, i did not want to enable predicate normalization as
part of the standard parsing set-up because of its heuristic (naïve
disambiguation) nature.  for an input of, say, ‘they zanned’, our current
parsers have no knowledge beyond the surface form and its tag VBD; hence,
we provide what we know as ‘_zanned/VBD_u_unknown’.  the past tense
orthographemic rule of the ERG will hypothesize three candidate stems
(‘zanne’, ‘zann’, or ‘zan’).  it would require more information than is in
the grammar to do a better job of lemmatization than my current heuristic.
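
the ambiguity can be reproduced with a few lines of python.  this is only a
toy inversion of english past-tense spelling, not the ERG rule itself; the
function name and the order in which candidates come out are assumptions,
and patterns such as ‘-ied’ are left out.

```python
def past_tense_stems(form):
    """Hypothesize candidate stems for a surface form ending in 'ed'."""
    stems = []
    if form.endswith("ed"):
        # stem ended in 'e', only 'd' was added: zanned -> zanne
        stems.append(form[:-1])
        # plain 'ed' suffix: zanned -> zann
        base = form[:-2]
        stems.append(base)
        # final consonant was doubled before 'ed': zanned -> zan
        if len(base) >= 2 and base[-1] == base[-2] and base[-1] not in "aeiou":
            stems.append(base[:-1])
    return stems
```

for ‘zanned’ this returns the three candidates named above; nothing in the
string itself says which one is right, so taking the first is a guess.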

—having refreshed my memory of the issues, i retract my suggestion to
enable predicate normalization (in its current form) in MRS construction
after parsing.  i wish someone would work on providing a
broader-coverage solution to this problem.  but we have added an input
fix-up transfer step to realization in the meantime, and that would seem
like a good place for heuristic predicate normalization, for the time
being.  it would enable round-trip parsing and generation, yet preserve
exact information in parser outputs for someone to put a better
normalization module there.

best wishes, oe

On Sunday, February 7, 2016, Woodley Packard <sweaglesw at sweaglesw.org>
wrote:

> Hello Alex,
>
> This is a corner of the generation game that is not yet implemented in
> ACE.  It’s been on the ToDo list for years but nobody has bugged me about
> it so it has been sitting at low priority.  As Stephan mentioned, the
> mechanism to make it work in the LKB is both somewhat fiddly and covered in
> a few cobwebs, so I had somewhat aloofly hoped that over the years someone
> would have straightened things out to where generation from unknown
> predicates had a canonical approach (e.g. implemented for multiple grammars
> or multiple platforms).  I would be interested to hear whether Glenn
> Slayden (who is on this list) has implemented this in the Agree generator?
>
> I’m willing to put in the hour or two it would take to make this work, but
> wonder if other DELPH-IN developers/grammarians have ideas about ways in
> which the current setup (as implemented in the ERG’s custom lisp code that
> patches into the LKB, if memory serves) could be improved upon in the
> process?
>
> Regards,
> -Woodley
>
> On Feb 6, 2016, at 2:48 AM, Alexander Kuhnle <aok25 at cam.ac.uk> wrote:
>
> Dear all,
>
> We came across the problem of generating from MRS involving unknown words,
> for instance, in the sentence “I like porcelain.” (parsing gives
> "_porcelain/NN_u_unknown_rel"). Is there an option for ACE so that these
> cases can be handled?
> Moreover, we came across the example “The phosphorus self-combusts.” vs
> ?“The phosphorus is self-combusted.” Where the first doesn’t parse, the
> second does, but doesn’t generate (again presumably because of
> "_combusted/VBN_u_unknown_rel"). It seems to not recognise verbs with a
> “self-” prefix, but does for past participles.
>
> Many thanks,
> Alex
>
>
>