[developers] boring though important: generalizing characterization

Tue Apr 3 01:38:13 CEST 2007

hi again,

in various applications we have had to provide a link from elements
of the semantics (EPs) to surface elements of the original input to the
parser.  in Deep-Thought and ever since, we have used characterization
(CFROM and CTO character ranges, referring to an input sub-string) for
this purpose.  however, some applications lack the notion of character
ranges, e.g. parsing an ASR word graph or the token lattice provided by
an existing preprocessing pipeline.  hence, i think we need to be able
to accommodate more than one way of linking EPs to surface elements.

i just checked in parts of a more general solution; see the comments in
`lkb/src/mrs/lnk.lsp':

  an attempt at generalizing over various ways of linking to the underlying
  input to the parser, be it by character or vertex ranges (as used at times
  in HoG et al.) or token identifiers (originally at YY and now in LOGON).
  currently, there are four distinct value formats:

    <0:4>    character range (i.e. a sub-string of an assumed flat input);
    <0#2>    chart vertex range (traditional in PET to some degree);
    <0 1 3>  token identifiers, i.e. links to basic input units;
    <@42>    edge identifier (used internally in generation)

  of these, the first is maybe most widely supported across DELPH-IN tools,
  while the second (in my view) should be deprecated.  the third resembles
  what was used in VerbMobil, YY, and now LOGON; given that the input to a
  `deep' parser can always be viewed as a token lattice, this is probably the
  most general mode, and we should aim to establish it over time: first, the
  underlying input may not have been string-shaped (but come from the lattice
  of a speech recognizer), and second even with one underlying string there
  could be token-level ambiguity, so identifying the actual token used in an
  analysis preserves more information.  properties like the sub-string range,
  prosodic information (VerbMobil), or pointers to KB nodes (YY) can all be
  associated with the individual tokens sent into the parser.  finally, the
  fourth mode is used in generation, where surface linking actually is a two-
  stage process (see comments in `generate.lsp').              (4-dec-06; oe)
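
to illustrate the claim that token identifiers preserve more information than character ranges, here is a minimal python sketch (not the LKB code).  the `Token' class, its field names, the `character_range' helper, and the example lattice are all hypothetical; the only assumption is that each token sent into the parser can carry its own sub-string range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One basic input unit; class and field names are hypothetical."""
    id: int
    start: int  # character offset where the token begins
    end: int    # character offset just past the token


def character_range(token_ids, lattice):
    """Derive a character range from a list of token identifiers."""
    tokens = [lattice[i] for i in token_ids]
    return (min(t.start for t in tokens), max(t.end for t in tokens))


# invented lattice for the string "Kim walked", with two competing
# tokenizations of the second word
lattice = {
    0: Token(0, 0, 3),    # Kim
    1: Token(1, 4, 10),   # walked
    2: Token(2, 4, 8),    # walk
    3: Token(3, 8, 10),   # ed
}

print(character_range([1], lattice))     # (4, 10)
print(character_range([2, 3], lattice))  # (4, 10)
```

both links map to the same character range, so <4:10> alone cannot tell the two tokenizations apart, whereas <1> and <2 3> can; the character range stays recoverable from the tokens.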

so far, i have implemented the following:

  - a new slot `lnk' on the MRS `rel' structure; values can be objects
    of any of the following kinds:

    (:characters 0 4)
    (:vertices 0 2)
    (:tokens 0 1 3)
    (:id 42)

  - functions output-lnk() and read-lnk() to map between the abstract
    (internal) form and the concrete (surface) syntax.

  - a new global mrs:*lnkp* to control which form, if any, is active;
    the current default is :characters.

  - modifications to quite a few (though not all) MRS input and output
    routines to (un-)serialize LNK information appropriately.

  - support for characterization on generator outputs (requires a few 
    changes in the grammar; see the ERG).
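
the mapping between the four surface forms and the internal forms listed above can be sketched as follows.  this is a python illustration under assumptions, not the actual Lisp read-lnk() and output-lnk(): the function names merely mirror them, the tuple encoding (a kind string followed by integers) stands in for the keyword lists, and the error handling is invented.

```python
import re


def read_lnk(text):
    """Parse a surface form such as '<0:4>' into ('characters', 0, 4)."""
    body = text.strip()
    if not (body.startswith("<") and body.endswith(">")):
        raise ValueError(f"not a LNK value: {text!r}")
    body = body[1:-1].strip()
    m = re.fullmatch(r"(\d+):(\d+)", body)
    if m:  # character range
        return ("characters", int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"(\d+)#(\d+)", body)
    if m:  # chart vertex range
        return ("vertices", int(m.group(1)), int(m.group(2)))
    m = re.fullmatch(r"@(\d+)", body)
    if m:  # edge identifier
        return ("id", int(m.group(1)))
    if re.fullmatch(r"\d+(\s+\d+)*", body):  # token identifiers
        return ("tokens", *map(int, body.split()))
    raise ValueError(f"unrecognized LNK value: {text!r}")


def output_lnk(lnk):
    """Render an internal form such as ('tokens', 0, 1, 3) as '<0 1 3>'."""
    kind, *values = lnk
    if kind == "characters":
        return "<{}:{}>".format(*values)
    if kind == "vertices":
        return "<{}#{}>".format(*values)
    if kind == "tokens":
        return "<" + " ".join(str(v) for v in values) + ">"
    if kind == "id":
        return "<@{}>".format(*values)
    raise ValueError(f"unknown LNK kind: {kind!r}")


for surface in ["<0:4>", "<0#2>", "<0 1 3>", "<@42>"]:
    assert output_lnk(read_lnk(surface)) == surface
```

the loop at the end checks that each of the four documented formats survives a round trip through both functions.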

for grammars with functional characterization, the only visible effect
so far should be that the `simple' and `indexed' views of MRSs, as well
as the elementary dependencies, now show characterization.  in joint
work with dan, looking for a space-efficient display, we came up with:

  h7:named_rel<0:3>(x5, "Kim"),
  h8:_walk_v_1_rel<4:10>(e2, x5),

seeing that the `simple' MRS reader is the default in [incr tsdb()], it 
should now be possible to write and read MRSs with characterization (or
other forms of LNKs, for that matter).
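
as a rough illustration of reading the compact display back in, the python sketch below pulls the label, predicate, character range, and arguments out of one such line.  the regular expression and the `parse_ep' helper are hypothetical and only cover the two example lines shown above; they are not the [incr tsdb()] or LKB reader, and the naive comma split would break on quoted arguments that contain commas.

```python
import re

# label ':' predicate, an optional '<...>' LNK, then '(' arguments ')'
EP = re.compile(
    r"(?P<label>h\d+):(?P<pred>[^<(\s]+)(?:<(?P<lnk>[^>]*)>)?\((?P<args>.*)\)"
)


def parse_ep(line):
    """Split one compact EP line into (label, predicate, span, arguments)."""
    m = EP.match(line.strip())
    if m is None:
        raise ValueError(f"not an EP line: {line!r}")
    lnk = m.group("lnk")
    span = None
    if lnk is not None and ":" in lnk:  # only character ranges handled here
        start, end = lnk.split(":")
        span = (int(start), int(end))
    args = [a.strip() for a in m.group("args").split(",")]
    return m.group("label"), m.group("pred"), span, args


print(parse_ep('h7:named_rel<0:3>(x5, "Kim"),'))
print(parse_ep('h8:_walk_v_1_rel<4:10>(e2, x5),'))
```

given the original input string, the span can then be used to slice out the surface sub-string that each EP links to.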

i propose the following next steps:

  - developers who have worked on characterization (ann and ben) take a
    look at this proposal (and code changes so far) and give comments.

  - i would like to eliminate the `char-rel' structure; the distinction 
    at the structure level has been a nuisance in the past, and the new
    `lnk' on the `rel' structure subsumes the `char-rel' functionality.
    i have inspected the relatively few occurrences of `char-rel' and it
    seems straightforward to make this change.  please object by 9-apr
    in case this suggestion makes you nervous.

  - someone better at XML than me should define adequate LNK renderings
    for MRSs and RMRSs in XML.  ideally, at some point, the XML printers
    and readers (and DTDs, obviously) would support this.  at some even
    later point, it might be tempting to deprecate the old attributes.

so far, all of this only affects the MRS layer.  the way CFROM and CTO
are treated in the parser and AVM universe remains unchanged.  once LNK
has stabilized more, we can consider revisiting that part.

                                                       all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; oepen at idi.ntnu.no ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++