[developers] relating various layers of information in [incr tsdb()] profiles

Stephan Oepen oe at ifi.uio.no
Tue Jul 14 21:41:47 CEST 2020


hi jan,

yesterday (during the summit plenary), you inquired about a tighter
linking of MRS predications to the underlying syntactic analysis than
the default character ranges.  that is actually a fine example of
existing functionality, though likely nobody but me knows about it,
because there is no documentation available (it is buried in
'lingo/lkb/src/mrs/lnk.lisp' and the LOGON 'redwoods' script).

venturing temporarily back into the LOGON environment, i just tried
the following:

$LOGONROOT/redwoods --erg \
  --export/id/blind input,derivation,mrs,eds \
  --condition "i-id == 21" --target /tmp mrs

here, the '/blind' modifier means to ignore any MRS (or labeled tree)
that may be recorded in the profile, which makes [incr tsdb()]
re-create the complete feature structure (and then the MRS) from the
recorded derivation tree; the '/id' modifier calls for MRS linking to
use identifiers into the derivation tree (rather than character
ranges).  for example, the export file contains:

[...]

(ROOT_STRICT
 (141 SB-HD_MC_C -0.207561 0 2
  (138 HDN_BNP-PN_C 0.0930572 0 1
   (137 N_SG_ILR 0.135806 0 1
    (31 abrams@n_-_pn_le 0 0 1

[...]

[ TOP: h1
   INDEX: e3 [ e SF: PROP TENSE: PAST MOOD: INDICATIVE PROG: - PERF: - ]
   RELS: <
          [ proper_q<@138>
            LBL: h4
            ARG0: x6 [ x PERS: 3 NUM: SG IND: + ]
            RSTR: h5
            BODY: h7 ]
          [ named<@31>
            LBL: h8
            ARG0: x6
            CARG: "Abrams" ]

[...]

in the above, <@138> and <@31> refer to the corresponding node
identifiers in the derivation tree, i.e. the unary rule that adds the
quantifier and the lexical entry for Abrams, respectively.  from what
i recall, these links are injected into (the AVM description of) each
MRS predication during the bottom-up reconstruction of the derivation
tree, i.e. as tokens, lexical entries, and constructions are being put
back together deterministically by [incr tsdb()].
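
to make that concrete, here is a minimal sketch (in python, using
pyDelphin, which comes up again below; the toy strings abbreviate the
excerpts above, and the exact pyDelphin calls are my assumption, not
something the export format prescribes) of resolving the <@id> links
against the derivation tree:

from delphin import derivation
from delphin.lnk import Lnk
from delphin.codecs import simplemrs

# toy versions of the derivation and MRS excerpts above
udf = ('(root_strict'
       ' (141 sb-hd_mc_c -0.207561 0 2'
       '  (138 hdn_bnp-pn_c 0.0930572 0 1'
       '   (137 n_sg_ilr 0.135806 0 1'
       '    (31 abrams@n_-_pn_le 0 0 1 ("abrams"))))))')
mrs = simplemrs.decode(
    '[ TOP: h1 INDEX: e3'
    '  RELS: < [ proper_q<@138> LBL: h4 ARG0: x6 RSTR: h5 BODY: h7 ]'
    '         [ named<@31> LBL: h8 ARG0: x6 CARG: "Abrams" ] >'
    '  HCONS: < h5 qeq h8 > ]')

# index derivation nodes by their identifiers
nodes = {}
stack = [derivation.from_string(udf)]
while stack:
    node = stack.pop()
    nodes[node.id] = node
    stack.extend(daughter for daughter in node.daughters
                 if not isinstance(daughter, derivation.UDFTerminal))

# resolve each <@id> link to the node that introduced the predication
for ep in mrs.rels:
    if ep.lnk.type == Lnk.EDGE:
        print(ep.predicate, '->', nodes[ep.lnk.data].entity)
# proper_q -> hdn_bnp-pn_c
# named -> abrams@n_-_pn_le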

looking further into the export file, there are both the initial
(REPP output) and internal (after chart mapping) tokenizations, in YY
token serialization (a small parsing sketch follows the two lattices):

<
  (1, 0, 1, <0:6>, 1, "Abrams", 0, "null")
  (2, 1, 2, <7:13>, 1, "barked", 0, "null")
  (3, 2, 3, <13:14>, 1, ".", 0, "null")
>

<
  (26, 0, 1, <0:6>, 1, "abrams", 0, "null")
  (28, 0, 1, <0:6>, 1, "abrams", 0, "null")
  (25, 1, 2, <7:14>, 1, "barked.", 0, "null")
  (27, 1, 2, <7:14>, 1, "barked.", 0, "null")
>
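
as an aside, pulling the fields out of such YY token lines is easy
enough; a small python sketch, where the field names are merely my
own labels for the positions shown above (identifier, start and end
chart vertex, character range, lattice path, form, and trailing
fields that the sketch ignores):

import re

YY_TOKEN = re.compile(
    r'\((\d+), (\d+), (\d+), <(\d+):(\d+)>, (\d+), "([^"]*)", (\d+), "([^"]*)"\)')

def parse_yy(line):
    match = YY_TOKEN.match(line.strip())
    if match is None:
        raise ValueError('not a YY token line: ' + line)
    id_, start, end, cfrom, cto, path, form, ipos, lrule = match.groups()
    return {'id': int(id_), 'start': int(start), 'end': int(end),
            'from': int(cfrom), 'to': int(cto), 'form': form}

print(parse_yy('(26, 0, 1, <0:6>, 1, "abrams", 0, "null")'))
# {'id': 26, 'start': 0, 'end': 1, 'from': 0, 'to': 6, 'form': 'abrams'}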

toward the bottom of the derivation tree, each lexical entry is
related to a list of (internal) token identifiers and corresponding
token feature structures (a reading sketch follows the excerpt), e.g.

[...]

    (38 bark_v1@v_-_le 0 1 2
     ("barked." 25

[...]
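
reading off that relation programmatically could look as follows (a
minimal sketch with pyDelphin, on a toy version of the excerpt; the
stub token feature structure and the pyDelphin calls are assumptions
of mine):

from delphin import derivation

udf = '(38 bark_v1@v_-_le 0 1 2 ("barked." 25 "token [ +FORM barked ]"))'
node = derivation.from_string(udf)

# map each preterminal (lexical entry) to its internal token identifiers
for preterminal in node.preterminals():
    ids = [token.id for terminal in preterminal.daughters
           for token in terminal.tokens]
    print(preterminal.entity, '->', ids)
# bark_v1@v_-_le -> [25]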

so far, so good (and quite straightforward).  at this point, the
relation between internal and initial tokens becomes a little more
complex, as one initial token can be split into multiple internal
tokens (as would be the case e.g. in 'New York-based', with initial
tokens 'New' and 'York-based' vs. internal tokens 'New', 'York-', and
'based'); likewise, multiple initial tokens are frequently glued
together (e.g. initial #2 and #3 to form internal #25 or #27).  hence,
one has to resort to character ranges (plus the knowledge that the
initial tokens form a simple sequence) to sort out these
correspondences.
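
a minimal sketch (plain python) of that character-range bookkeeping,
using the token identifiers and ranges from the example above: two
tokens are taken to correspond when their <from:to> ranges overlap,
which covers both the splitting and the gluing case:

def overlaps(a, b):
    # half-open (from, to) ranges overlap iff each starts before the other ends
    return a[0] < b[1] and b[0] < a[1]

initial = {1: (0, 6), 2: (7, 13), 3: (13, 14)}
internal = {26: (0, 6), 28: (0, 6), 25: (7, 14), 27: (7, 14)}

for iid, irange in sorted(internal.items()):
    sources = [tid for tid, trange in sorted(initial.items())
               if overlaps(trange, irange)]
    print(iid, '<-', sources)
# 25 <- [2, 3]   (initial #2 and #3 glued into internal #25)
# 26 <- [1]
# 27 <- [2, 3]
# 28 <- [1]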

i ended up going through this example because this kind of exact
accounting through all analysis layers has at times been important to
me, and i do believe ERG profiles contain the complete information
needed to piece things back together.  but, as demonstrated above,
this process requires looking at both layers of tokenization, the
derivation tree, and the identifier-linked MRS in tandem.  this kind
of holistic interpretation, i suspect, remains out of scope for
pyDelphin for now, in part because it requires the ability to
reconstruct derivations using the grammar.

i attach the complete export file, in case you wanted to look at this
example more closely.

best wishes, oe

ps: from the available 'documentation' on alternate ways of anchoring
MRS predications in corresponding input elements:

;;;
;;; an attempt at generalizing over various ways of linking to the underlying
;;; input to the parser, be it by character or vertex ranges (as used at times
;;; in HoG et al.) or token identifiers (originally at YY and now in LOGON).
;;; currently, there are four distinct value formats:
;;;
;;;   <0:4>    character range (i.e. a sub-string of an assumed flat input);
;;;   <0#2>    chart vertex range (traditional in PET to some degree);
;;;   <0 1 3>  token identifiers, i.e. links to basic input units;
;;;   <@42>    edge identifier (used internally in generation)
;;;
;;; of these, the first is maybe most widely supported across DELPH-IN tools,
;;; while the second (in my view) should be deprecated.  the third resembles
;;; what was used in VerbMobil, YY, and now LOGON; given that the input to a
;;; `deep' parser can always be viewed as a token lattice, this is probably the
;;; most general mode, and we should aim to establish it over time: first, the
;;; underlying input may not have been string-shaped (but come from the lattice
;;; of a speech recognizer), and second even with one underlying string there
;;; could be token-level ambiguity, so identifying the actual token used in an
;;; analysis preserves more information.  properties like the sub-string range,
;;; prosodic information (VerbMobil), or pointers to KB nodes (YY) can all be
;;; associated with the individual tokens sent into the parser.  finally, the
;;; fourth mode is used in generation, where surface linking actually is a two-
;;; stage process (see comments in `generate.lsp').              (4-dec-06; oe)
;;;
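
telling these four value formats apart mechanically is
straightforward; a small python sketch (pyDelphin's delphin.lnk
module, if i am not mistaken, makes the same distinctions natively):

import re

def parse_lnk(s):
    match = re.fullmatch(r'<(-?\d+):(-?\d+)>', s)
    if match:
        return ('characters', int(match.group(1)), int(match.group(2)))
    match = re.fullmatch(r'<(\d+)#(\d+)>', s)
    if match:
        return ('vertices', int(match.group(1)), int(match.group(2)))
    match = re.fullmatch(r'<@(\d+)>', s)
    if match:
        return ('edge', int(match.group(1)))
    match = re.fullmatch(r'<(\d+(?: \d+)*)>', s)
    if match:
        return ('tokens', [int(i) for i in match.group(1).split()])
    raise ValueError('unrecognized lnk value: ' + s)

for s in ('<0:4>', '<0#2>', '<0 1 3>', '<@42>'):
    print(s, '->', parse_lnk(s))
# <0:4> -> ('characters', 0, 4)
# <0#2> -> ('vertices', 0, 2)
# <0 1 3> -> ('tokens', [0, 1, 3])
# <@42> -> ('edge', 42)
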
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 21.gz
Type: application/gzip
Size: 1022 bytes
Desc: not available
URL: <http://lists.delph-in.net/archives/developers/attachments/20200714/5606a6db/attachment.bin>

