[developers] Generation with unknown words
aac10 at cam.ac.uk
Mon Feb 8 19:28:39 CET 2016
That could be helpful - thanks!
I would like to see a decoupling of the predicate normalisation from the
question of what ACE or the LKB or whatever does. I think predicate
normalisation can perfectly well be treated as a *MRS <-> *MRS
conversion which could be provided by an external tool. The realiser
itself should work with whatever predicates come out of the unknown word
mechanism. In fact, I think that for what Alex is doing, not
normalising would be perfectly fine.
That said, it would be good to provide a method for predicate
normalisation using the rules the grammar writer defines already in
conjunction with a large stem list, defaulting to no stemming for words
which aren't on the list. My reasoning is:
- we want a solution which works for languages other than English
- as far as possible, we want this to be under grammar writer control.
e.g., in the case where the grammar writer finds a particular stem which
is treated incorrectly, they can add it to the list appropriately (or
the irregs file).
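[A minimal sketch of what such a stem-list-driven normaliser might look like. The stem list, irregs table, candidate-generation rule, and predicate template below are purely illustrative, not any grammar's actual machinery; the `_form/TAG_u_unknown` predicate shape follows the ERG convention discussed later in this thread.]

```python
import re

# Hypothetical grammar-writer-supplied resources; entries are illustrative.
STEMS = {"zan"}
IRREGS = {"took": "take"}

# ERG-style unknown-word predicates look like _zanned/VBD_u_unknown
UNKNOWN_RE = re.compile(r"^_([^/]+)/([A-Z]+)_u_unknown$")

def candidate_stems(form, tag):
    """Naively undo past-tense orthography: zanned -> zanne, zann, zan."""
    if tag in ("VBD", "VBN") and form.endswith("ed"):
        base = form[:-2]
        cands = [base + "e", base]
        if len(base) >= 2 and base[-1] == base[-2]:
            cands.append(base[:-1])  # undo consonant doubling
        return cands
    return [form]

def normalise(pred):
    """Rewrite an unknown-word predicate to a stem-based one,
    defaulting to no stemming when no candidate is on the list."""
    m = UNKNOWN_RE.match(pred)
    if not m:
        return pred
    form, tag = m.groups()
    if form in IRREGS:
        return "_%s_u_unknown" % IRREGS[form]
    for stem in candidate_stems(form, tag):
        if stem in STEMS:
            return "_%s_u_unknown" % stem
    return pred  # not on the list: leave untouched

print(normalise("_zanned/VBD_u_unknown"))  # -> _zan_u_unknown
print(normalise("_qwert/NN_u_unknown"))    # -> _qwert/NN_u_unknown
```

Since this only rewrites predicate strings, it can sit entirely outside the parser and realiser, as a *MRS &lt;-&gt; *MRS step.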
The main applications I can see for normalisation are cases where there
is some external resource of some description which needs stem-based
predicates - e.g., automatically created transfer rules. I think it's
only needed for regeneration in the situation where one needs to change
tense or plurality, etc and where the predicate normalisation is part of
the mechanism for telling the morphology generation what the stem is.
I may well be missing something, though. It's one of those cases where
I remember being involved in a discussion, possibly in Jerez, but not
the content of the discussion.
On 08/02/2016 16:14, John Carroll wrote:
> Would the morpha and morphg tools at
> <http://users.sussex.ac.uk/%7Ejohnca/morph.html> be appropriate for
> predicate normalisation for parsing and generation? They are inverses
> of each other, i.e.
> $ echo "zanned_VBD" | ./morpha.ix86_darwin -actf verbstem.list
> $ echo zann+ed_VBD | ./morphg.ix86_darwin -ctf verbstem.list
> On 7 Feb 2016, at 10:38, Ann Copestake wrote:
>> for realization at least, isn't it adequate to use a lemma list
>> extracted from (say) WordNet to support predicate normalisation?
>> But, the application that Alex is interested in is a form of
>> regeneration. So I think that as long as the generator accepts what
>> the parser outputs for unknown words, it really doesn't matter
>> whether or not it's normalised. I don't know whether or not anyone
>> is using the realiser for applications which are broad-coverage
>> (hence need unknown words) and where the *MRS is constructed from
>> scratch (hence need to use lemmas for the predicates). Excluding MT,
>> of course.
>> All best,
>> On 07/02/2016 10:15, Stephan Oepen wrote:
>>> there actually are two separate mechanisms to discuss: (a) lexical
>>> instantiation for unknown predicates (in realization) and (b)
>>> predicate normalization for unknown words (in parsing).
>>> as for (a), i find the current LKB mechanism about as generic as i
>>> can imagine (and consider appropriate). the grammar provides an
>>> inventory of generic lexical entries for realization (these are in
>>> part distinct from the parsing ones, in the ERG, because the
>>> strategies for dealing with inflection are different). for each
>>> such entry, the grammar declares which MRS predicate activates it
>>> and how to determine its orthography. the former is accomplished
>>> via a regular expression, e.g. something like /^named$/ or
>>> /^_([^_]+)/. the latter either comes from the (unique) parameter of
>>> the relation with the unknown predicate (CARG in the ERG) or from
>>> the part of the predicate matched as the above capture group (the
>>> lemma field). there is no provision for generic lexical entries
>>> with decomposed semantics (in realization).
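[The trigger/orthography logic oe describes for mechanism (a) can be sketched roughly as follows. Entry names, the entry table, and the function are made up for illustration; only the two trigger regexes and the CARG-vs-capture-group distinction come from the message above.]

```python
import re

# Illustrative generic lexical entries for realization: each declares
# a trigger regex over the predicate and how to obtain the orthography.
# "carg" takes the relation's parameter; "lemma" takes capture group 1.
GENERIC_ENTRIES = [
    ("proper_name_entry", re.compile(r"^named$"), "carg"),
    ("generic_word_entry", re.compile(r"^_([^_]+)"), "lemma"),
]

def instantiate(pred, carg=None):
    """Return (entry, orthography) for an unknown predicate, or None."""
    for entry, trigger, source in GENERIC_ENTRIES:
        m = trigger.search(pred)
        if not m:
            continue
        if source == "carg":
            return entry, carg
        return entry, m.group(1)  # the 'lemma' field, from the capture group
    return None

print(instantiate("named", carg="Kim"))
print(instantiate("_zanned/VBD_u_unknown"))
```

Note that for an un-normalised parsing predicate, the extracted 'lemma' is still the surface form plus PoS tag ("zanned/VBD"), which is exactly why normalisation matters for realization.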
>>> regarding (b), the ERG in parsing outputs predicates like the
>>> ones alex had noticed. these are not fully normalized because there
>>> is no reliable lemmatization facility for unknown words inside the
>>> parser (and, thus, generic entries for parsing predominantly are
>>> full forms). what is recorded in the ‘lemma’ field is the actual
>>> surface form, concatenated with the PoS that activated the generic
>>> entry. the ERG provides a mechanism for post-parsing normalization,
>>> again in mostly declarative and general form: triggered by regular
>>> expressions looking for PTB PoS tags in predicate names, an
>>> orthographemic rule of the grammar can (optionally) be invoked on
>>> the remainder of the ‘lemma’ field. if i recall correctly, we
>>> ‘disambiguate’ lemmatization naïvely and take the first output from
>>> the set of matches of that rule. the resulting string is injected
>>> into a predicate template, e.g. something like "_~a_n_unknown_rel".
>>> i believe, at the time, i did not want to enable predicate
>>> normalization as part of the standard parsing set-up because of its
>>> heuristic (naïve disambiguation) nature. for an input of, say,
>>> ‘they zanned’, our current parsers have no knowledge beyond the
>>> surface form and its tag VBD; hence, we provide what we know as
>>> ‘_zanned/VBD_u_unknown’. the past tense orthographemic rule of the
>>> ERG will hypothesize three candidate stems (‘zanne’, ‘zann’,
>>> or ‘zan’). it would require more information than is in the grammar
>>> to do a better job of lemmatization than my current heuristic.
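[The naïve disambiguation oe describes might look roughly like this; the candidate-generation rule, the choice of "first candidate", and the verb template are simplifications for illustration, not the ERG's actual orthographemic machinery.]

```python
import re

# Trigger: a PTB past-tense/participle tag embedded in the predicate name.
PRED_RE = re.compile(r"^_([^/]+)/(VBD|VBN)_u_unknown$")

def hypothesize_stems(form):
    """Undo the past-tense rule: zanned -> zanne, zann, zan."""
    base = form[:-2] if form.endswith("ed") else form
    cands = [base + "e", base]
    if len(base) >= 2 and base[-1] == base[-2]:
        cands.append(base[:-1])  # undo consonant doubling
    return cands

def normalize_naively(pred):
    m = PRED_RE.match(pred)
    if not m:
        return pred
    stems = hypothesize_stems(m.group(1))
    # naive 'disambiguation': take the first hypothesized stem,
    # then inject it into a predicate template
    return "_%s_v_unknown_rel" % stems[0]

print(normalize_naively("_zanned/VBD_u_unknown"))  # -> _zanne_v_unknown_rel
```

The example makes the problem concrete: the first hypothesis ('zanne') is wrong for 'zan', and nothing in the grammar can tell the three candidates apart, hence the suggestion to keep this heuristic out of the standard parsing set-up.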
>>> —having refreshed my memory of the issues, i retract my suggestion
>>> to enable predicate normalization (in its current form) in MRS
>>> construction after parsing. i wish someone would work on providing
>>> a broader-coverage solution to this problem. but we have added an
>>> input fix-up transfer step to realization in the meantime, and that
>>> would seem like a good place for heuristic predicate normalization,
>>> for the time being. it would enable round-trip parsing and
>>> generation, yet preserve exact information in parser outputs for
>>> someone to put a better normalization module there.
>>> best wishes, oe
>>> On Sunday, February 7, 2016, Woodley Packard
>>> <sweaglesw at sweaglesw.org> wrote:
>>> Hello Alex,
>>> This is a corner of the generation game that is not yet
>>> implemented in ACE. It’s been on the ToDo list for years but
>>> nobody has bugged me about it so it has been sitting at low
>>> priority. As Stephan mentioned, the mechanism to make it work
>>> in the LKB is both somewhat fiddly and covered in a few cobwebs,
>>> so I had somewhat aloofly hoped that over the years someone
>>> would have straightened things out to where generation from
>>> unknown predicates had a canonical approach (e.g. implemented
>>> for multiple grammars or multiple platforms). I would be
>>> interested to hear whether Glenn Slayden (who is on this list)
>>> has implemented this in the Agree generator.
>>> I’m willing to put in the hour or two it would take to make this
>>> work, but wonder if other DELPH-IN developers/grammarians have
>>> ideas about ways in which the current setup (as implemented in
>>> the ERG’s custom lisp code that patches into the LKB, if memory
>>> serves) could be improved upon in the process?
>>>> On Feb 6, 2016, at 2:48 AM, Alexander Kuhnle <aok25 at cam.ac.uk> wrote:
>>>> Dear all,
>>>> We came across the problem of generating from MRS involving
>>>> unknown words, for instance, in the sentence “I like
>>>> porcelain.” (parsing gives "_porcelain/NN_u_unknown_rel"). Is
>>>> there an option for ACE so that these cases can be handled?
>>>> Moreover, we came across the example “The phosphorus
>>>> self-combusts.” vs ?“The phosphorus is self-combusted.”, where
>>>> the first doesn’t parse and the second parses but doesn’t generate
>>>> (again presumably because of "_combusted/VBN_u_unknown_rel").
>>>> It seems not to recognise verbs with a “self-” prefix, but does
>>>> for past participles.
>>>> Many thanks,