[developers] MRSs in [incr tsdb()] and unknown words

Thu Dec 4 21:36:44 CET 2008

hi francisco,

the behavior you describe is what i would expect.  [incr tsdb()] only
records a `recipe' for building full analyses (feature structures and
associated MRSs), i.e. the derivation trees.  in treebanking, there is
a step of re-building each derivation, i.e. unifying together all the
pieces, before an FS or MRS can be displayed.

the fsmod facility (and a few other procedural mechanisms in PET which
augment analyses `behind the scene') goes beyond the information that
is recorded in the [incr tsdb()] derivation.  thus, there is no way to
re-build the complete analyses, as part of the information was lost.
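
to make the point concrete, here is a toy sketch (not how PET or [incr tsdb()]
actually implement any of this): nested dicts stand in for feature structures,
the derivation is a bare recipe of rule and entry names, and a hypothetical
parse-time hook adds a CARG value that the recipe never records.  all names in
the code (`generic_noun', `bare_np', and so on) are made up for illustration.

```python
# toy illustration only: nested dicts stand in for typed feature structures,
# and every rule or entry name here is hypothetical.

def unify(a, b):
    """Unify two toy feature structures; raise ValueError on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = unify(out[key], value) if key in out else value
        return out
    if a == b:
        return a
    raise ValueError(f"unification failure: {a!r} vs {b!r}")

# what the grammar itself contributes for each entry and rule
LEXICON = {"generic_noun": {"SYNSEM": {"CAT": "n", "PRED": "named_rel"}}}
RULES = {"bare_np": {"SYNSEM": {"CAT": "n"}}}

def rebuild(node):
    """Re-build an analysis from a derivation by unification alone.

    Gross simplification: the mother is simply unified with each daughter.
    """
    label, children = node
    if isinstance(children, str):      # leaf: (entry name, surface form)
        return LEXICON[label]
    fs = RULES[label]
    for child in children:
        fs = unify(fs, rebuild(child))
    return fs

# the derivation is all that gets stored: names plus the surface string
derivation = ("bare_np", [("generic_noun", "oslo")])

# at parse time, a procedural hook (outside the grammar) adds extra information
original = unify(rebuild(derivation), {"SYNSEM": {"CARG": "oslo"}})

# later, only the recipe is available, so the added value cannot come back
rebuilt = rebuild(derivation)

print(original["SYNSEM"])
print(rebuilt["SYNSEM"])
```

the re-built structure has everything the grammar put there, but no CARG,
because nothing in the stored derivation says where that value came from.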

short-term, i believe the only work-around would be to record MRSs in 
the profile while parsing.  giving PET the option `-mrs=simple' should
turn on MRS output, though i cannot say whether this is supported with
`-tsdbdump' mode.  storing MRSs in profiles will make the files quite
a bit bigger, and it might be that treebanking becomes sluggish as an
effect of that.  a lot will depend on how many readings you record in
total, and of course on how many items you have per profile.

sometime early next year, i hope, the situation will get better: in the
new chart mapping universe in PET, the design is to input a lattice of
feature structures to PET (which is not new), and instead of fsmod-like
facilities there is an explicit layer of token-level rewriting (as part
of the grammar).  further, lexical instantiation (both for generics and
native LEs) unifies the list of token FSs into a designated path inside
each lexical entry.  thus, the lexical entry has full control over what
information to pick up among the token properties.  additionally, it is
possible to put constraints on token properties into the LE, such that,
for example, generics can be made to `select' a specific PoS tag on the
input token(s).  our hope is that this facility obsoletes fsmod et al.,
giving the grammar full control over the interface between token-level
and sign-level information (e.g. manufacturing CARG or PRED values for
generic entries or recording characterization or token id information).
so far, for the ERG at least, this seems to work out well.
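
a rough sketch of the idea, again with nested dicts as toy feature structures
and with made-up names throughout (the TOKENS path, the entry, and the `links'
list are assumptions for illustration, not the actual PET or ERG design): the
token FS is unified into a designated path of the lexical entry, the entry
constrains the PoS tag it will accept, and a crude stand-in for reentrancy
copies the token form into CARG.  a single token is used instead of a list.

```python
# toy illustration only: all feature names, paths, and entries are hypothetical.

def unify(a, b):
    """Unify two toy feature structures; raise ValueError on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for key, value in b.items():
            out[key] = unify(out[key], value) if key in out else value
        return out
    if a == b:
        return a
    raise ValueError(f"unification failure: {a!r} vs {b!r}")

def get_path(fs, path):
    """Follow a sequence of features into a toy feature structure."""
    for feature in path:
        fs = fs[feature]
    return fs

def at_path(path, value):
    """Wrap a value so that it sits at the given path."""
    for feature in reversed(path):
        value = {feature: value}
    return value

# a generic entry that `selects' proper-noun tokens via a PoS constraint
GENERIC_PROPER_NAME = {
    "TOKENS": {"POS": "NNP"},
    "SYNSEM": {"PRED": "named_rel"},
}
# stand-in for a reentrancy: CARG is identified with the token form
LINKS = [(("TOKENS", "FORM"), ("SYNSEM", "CARG"))]

def instantiate(entry, token, links=()):
    """Unify the token FS into the entry, then enforce the path links."""
    fs = unify(entry, {"TOKENS": token})
    for source, target in links:
        fs = unify(fs, at_path(target, get_path(fs, source)))
    return fs

proper = instantiate(
    GENERIC_PROPER_NAME, {"FORM": "oslo", "POS": "NNP", "ID": 3}, LINKS
)
print(proper["SYNSEM"])

try:
    instantiate(GENERIC_PROPER_NAME, {"FORM": "runs", "POS": "VBZ", "ID": 4}, LINKS)
except ValueError as clash:
    print("rejected:", clash)
```

since everything here happens by unification, the token FS plus the entry name
is enough to reproduce the instantiated entry later on.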

once the chart mapping and extended lexical instantiation are stable in
PET, i plan to extend the [incr tsdb()] derivation format.  in addition
to the information recorded currently, it will then store the token FSs
at the yield of the derivation.  and because `original analyses' in PET
are now put together exclusively by unification, all the information
needed to exactly re-build analyses will be available to [incr tsdb()].

i hope we will be ready to release an extended PET and [incr tsdb()] in
early 2009.  it would then be great to see whether your use pattern can
be accommodated in the new universe, ideally without recourse to any of
the procedural `FS augmentation' machinery in use currently ...

                                                      all best  -  oe

nb: Adolphs et al. (2008) gives an overview of the chart mapping idea;
that was a paper in LREC earlier this year.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++