[developers] potentially important `bug fix' in LKB generator

Thu Feb 14 17:52:22 CET 2008

dear all,

i just made a change in the LKB generator that, in my view, is just a
bug fix.  but it may cost some grammars generation coverage, hence let
me elaborate.

abstractly, the goal of the generator is to enumerate all derivations
(as licensed by the grammar) such that their semantics is subsumed by
the input semantics to the generator: within certain limits, we allow
the generator to return results with a more specific semantics.  this
is desirable for example where the input is underspecified, e.g. using
`temp_loc_rel' in the input, even though realizations use prepositions
whose actual predicates are subsumed by `temp_loc_rel'.  mostly, we do
not allow realizations whose semantics is less specific than the input,
i.e. failing to verbalize part of the input semantics.

in the traditional LKB generator, there is one exception to this rule.
less specific realizations are returned in case they only lack some of
the input semantics expressed as variable properties.  for example, if the
input requires an event to be [ SF prop ], but the grammar constructs
a derivation whose semantics is [ SF prop-or-ques ] (and otherwise is
subsumed by the input semantics), then that derivation is included as
part of the generator results.
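
to make the rule and its exception concrete, here is a minimal python
sketch (not LKB code).  the type hierarchy, the predicate names below
`temp_loc_rel', and the `lenient' flag are all assumptions made up for
illustration; the real generator compares full MRSs, not single values.

```python
# toy type hierarchy, child -> parent (hypothetical, for illustration only)
PARENT = {
    "prop": "prop-or-ques",
    "ques": "prop-or-ques",
    "_in_p_temp_rel": "temp_loc_rel",
    "_on_p_temp_rel": "temp_loc_rel",
}


def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    t = specific
    while t is not None:
        if t == general:
            return True
        t = PARENT.get(t)
    return False


def acceptable(input_value, result_value, lenient=False):
    """Decide whether a realization value is compatible with the input.

    strict:  the result must be at least as specific as the input.
    lenient: additionally accept a result that is less specific than
             the input (the traditional exception for variable properties).
    """
    if subsumes(input_value, result_value):
        return True
    return lenient and subsumes(result_value, input_value)


# underspecified input, more specific realization: always fine
print(acceptable("temp_loc_rel", "_in_p_temp_rel"))
# [ SF prop ] in the input, [ SF prop-or-ques ] in the derivation
print(acceptable("prop", "prop-or-ques", lenient=True))
print(acceptable("prop", "prop-or-ques", lenient=False))
```

under these assumptions, the last two calls differ only in the flag: the
lenient test lets the less specific derivation through, the strict one
does not.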

i suspect the above used to be the case for technical reasons: during
lexical lookup, the generator specializes lexical entries as they are 
activated, i.e. variable properties from input EPs are copied into the
AVMs of lexical entries (and rules), as activated by those EPs.  this
specialization prior to chart generation makes things more efficient; a
related effect is that generator derivations look different from parse
results: their trees do not show applications of lexical rules.  in a
sense, these derivations have `hidden' daughters, since a full recipe
for rebuilding their tree obviously has to include the lexical rules.

there used to be specialized code for generator edges in various parts
of the LKB and [incr tsdb()], recovering those hidden daughters.  for
example, to compute the MaxEnt score of a generator derivation, these
daughters contribute to the total score, hence the code needs to treat
generator derivations differently from parser derivations.

i have long felt irritated with this property of the code (which is all
my fault in the first place); i know LKB users are often confused about
the missing nodes in browsing generator trees (there is no special code
in the tree browser to show the additional daughters).

now getting to the point: i changed the generator internals to include
daughters corresponding to lexical rule applications in the usual way
in the `edge' structure (i.e. the `children' slot is a list of edges).
this makes the tree display look as expected, and the specialized code
for generator edges in various places becomes obsolete.
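
as a rough picture of why uniform edges make the special cases go away,
consider the following python sketch.  it is not the LKB `edge'
structure: the `Edge' class, the rule names, and the weights are all
hypothetical, and a real MaxEnt model uses richer features than one
weight per label.  the point is only that one plain traversal of the
`children' list serves both tree display and scoring.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Edge:
    label: str                                    # rule or lexical entry name
    children: List["Edge"] = field(default_factory=list)


def bracketed(edge: Edge) -> str:
    """Render a derivation as a labelled bracketing."""
    if not edge.children:
        return edge.label
    parts = [edge.label] + [bracketed(child) for child in edge.children]
    return "(" + " ".join(parts) + ")"


def score(edge: Edge, weights: Dict[str, float]) -> float:
    """Sum one (made-up) weight per node; no generator-specific case needed."""
    own = weights.get(edge.label, 0.0)
    return own + sum(score(child, weights) for child in edge.children)


# the lexical rule is an ordinary unary edge above the lexical entry
kim = Edge("proper-np", [Edge("kim")])
ate = Edge("past-verb", [Edge("eat")])
tree = Edge("subj-head", [kim, ate])

print(bracketed(tree))
print(score(tree, {"subj-head": 1.0, "past-verb": 0.5}))
```

with the lexical rule present as a daughter, its contribution to the
score is picked up by the same recursion that handles every other node.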

however, there is an additional benefit to this: when chart packing is
enabled, the final realization is constructed from re-unifying AVMs of
lexical entries and rules, as prescribed by the derivation.  in my new
setup, this includes lexical rules and the original lexical entry, and
thus annuls the intermediate effects of specialization.  so, in our earlier
example: the semantics on the realization is just [ SF prop-or-ques ],
as that is (we assume) what the grammar makes it to be.  i know dan at
least tends to consider such cases bugs in the grammar.  e.g. looking
at `http://erg.emmtee.net/', try parsing `Kim ate also.'  the result is
marked `prop-or-ques' (in contrast to, say, `Kim ate.') and generating
from it yields a number of surprising paraphrases.  conversely, using a
fully specified input semantics (parse `Kim also ate.' instead), there
is no generator output `Kim ate also.'  this is the result of the code
change discussed above: in the new setup, `Kim ate also.' lacks a piece
of input semantics, viz. the more specific [ SF prop ].
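
the interaction between specialization and rebuilding can be mimicked
in a few lines of python.  this is a sketch under strong simplifying
assumptions: an AVM is reduced to a dictionary of variable properties,
the hierarchy of SF values is a toy one, and all function names are made
up for illustration rather than taken from the LKB.

```python
# toy hierarchy of SF values, child -> parent (illustration only)
PARENT = {"prop": "prop-or-ques", "ques": "prop-or-ques"}


def ancestors(t):
    """t itself plus everything above it in the toy hierarchy."""
    chain = [t]
    while t in PARENT:
        t = PARENT[t]
        chain.append(t)
    return chain


def unify(a, b):
    """More specific of two compatible types, or None (tree-shaped hierarchy)."""
    if a in ancestors(b):
        return b
    if b in ancestors(a):
        return a
    return None


def specialize(entry, input_props):
    """Lexical lookup: copy input variable properties into a copy of the entry."""
    out = dict(entry)
    for prop, value in input_props.items():
        merged = unify(out[prop], value) if prop in out else value
        if merged is None:
            return None          # incompatible: entry would not be activated
        out[prop] = merged
    return out


def covers_input(avm, input_props):
    """Every input property must be met by an equal or more specific value."""
    return all(value in ancestors(avm.get(prop))
               for prop, value in input_props.items())


entry = {"SF": "prop-or-ques"}    # what the grammar builds for `Kim ate also.'
input_props = {"SF": "prop"}      # fully specified input semantics

chart_avm = specialize(entry, input_props)   # specialized copy used in the chart
rebuilt_avm = dict(entry)                    # re-unified from the original entry

print(covers_input(chart_avm, input_props))
print(covers_input(rebuilt_avm, input_props))
```

in this toy model the specialized copy passes the test while the rebuilt
one fails it, which mirrors why `Kim ate also.' drops out of the results
once the realization is rebuilt from unspecialized entries.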

in conclusion, i present this change as a bug fix because it brings the
generator in compliance with the abstract definition offered above.  it
also has the practical value of making the tree display more accessible
and may help diagnose unwanted underspecification in grammars.  i would
like to release this change to the LKB sources in the next few weeks.

the downside is that grammars (sloppily) built to take advantage of the
traditional LKB behavior could lose some paraphrases in generation.  it
would be great if those of you actively using the generator could give
all this some thought, and hopefully arrive at the conclusion that you
want this improved debugging tool.  currently at least, the full effect
of the above only kicks in when chart packing is on, but in principle i
would like to make the new behavior apply in non-packing mode too.

                                phew, a long message!  all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++