[developers] a late follow-up from lisboa

Wed Oct 26 19:21:52 CEST 2005

dear all,

several people expressed some concern about growing divergences between
DELPH-IN components during the lisboa meeting.  i believe i volunteered
to summarize some of the specific issues that were mentioned (i started
this message weeks ago and have been adding piecemeal :-):

  - syntax for orthographemic rules: the recent LKB changes resulted in
    cleaning up the syntax for %letter-set, %suffix, et al. annotations
    on lexical rules.  only #\!, #\?, #\*, and #\) need escaping in the
    new universe, and #\\ is the escape character.  i believe only the
    ERG makes use of funny characters in orthographemics currently, but
    still i think we should update PET to reflect the above.  i all but
    promised dan to do this, hence will hopefully look into it soon.
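
    as a rough illustration of the escaping convention just described
    (a sketch only, not the LKB or PET code): the function names below
    are hypothetical, and treating a literal backslash as something that
    itself needs escaping is an assumption the message does not state.

```python
# characters that need escaping according to the message: ! ? * )
SPECIAL = set("!?*)")
ESCAPE = "\\"  # the escape character


def escape(text):
    """Prefix every special character with the escape character.

    Assumption: a literal backslash is escaped too, so that
    unescape(escape(s)) == s for any input string.
    """
    out = []
    for ch in text:
        if ch in SPECIAL or ch == ESCAPE:
            out.append(ESCAPE)
        out.append(ch)
    return "".join(out)


def unescape(text):
    """Drop each escape character and keep the character after it."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == ESCAPE:
            # take the next character literally; a trailing lone
            # escape character is kept unchanged
            ch = next(chars, ESCAPE)
        out.append(ch)
    return "".join(out)
```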

  - chart dependencies: re-working the processing of lexical rules in
    the LKB resulted in a change to the chart dependencies mechanism,
    where in the old set-up chart dependencies were checked after all
    lexical rules had been applied, and in the current universe they
    are tested among lexical entries only (i.e. prior to application of
    lexical rules).  PET (main), on the other hand, i believe has moved
    to a set-up where parsing is divided into two phases, viz. one with
    lexical rules only, the second with non-lexical rules only.  there
    is an underlying difference between the systems here, where the LKB
    notion of `lexical' rules only entails that a rule can be annotated
    with an orthographemic effect; otherwise, the LKB will happily feed
    the output of a non-lexical rule into a lexical rule.  grammarians
    tend to not be aware of this, as they maintain the word vs. phrase
    distinction in the type system already (in my view, the LKB use of
    the term `lexical' rule is potentially mis-leading; but documented).

    i (still) have yet to re-read the earlier discussion on this on the
    `developers' list, but currently at least berthold seems affected
    by the two systems not doing the same thing (and the LKB no longer
    doing what it used to).  so, it seems a written specification for
    the mechanism is needed, and then hopefully we can aim at getting
    all processors to implement it alike.

  - lexical rule processing: at some point, interleaving lexical rules
    without orthographemic effect prior to orthographemic rules (e.g.
    dative shift prior to passivization) had broken in the PET `main'
    branch; i remember fixing it in the `oe' branch, but am not quite
    sure about the state of play in recent `main' versions (i believe
    the patch propagated).  generally, it would seem tempting to have a
    few testing grammars (e.g. the `toy' and `polymorphan' grammars in
    the LKB source tree), so as to mechanically (and regularly) confirm
    that all systems output the same results on these.

  - TDL syntax extensions: ann (i believe) added a comment facility on
    types some time ago, and emily last year added a `:+' operator for
    `overlay' type definitions, i.e. monotonic extensions to an earlier
    defined type.  the latter is already in use in the Matrix, i think,
    and the former we were thinking to use for adding documentation to
    grammars, specifically to annotate types that should be part of the
    SEM-I for a grammar (i.e. exported).  for all i know, PET does not
    yet have support for either operator, but probably should.

    conversely, bernd (i believe) extended PET at some point to allow a
    variant rule notation, viz.

      [ ] --> [ ], ..., [ ].

    which should be equivalent to embedding the RHS of the definition
    as a list below some designated path (e.g. `ARGS') in the LHS; to
    make this compatible with the (mostly unused) LKB option of having
    a separate feature embedding the LHS (which, in a sense, would be
    cleaner in terms of which parts of the FS correspond to what), it
    could become necessary to also allow this in PET, but while no-one
    is using this option that seems hardly a priority.  the arrow, on
    the other hand, i quite like and would love to get into the LKB.
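
    to make the intended equivalence concrete, here is a minimal sketch
    that models feature structures as plain Python dicts.  the function
    name is hypothetical, the RHS is kept as a Python list rather than a
    proper list-valued feature structure, and `ARGS' is just the example
    path from above; none of this reflects how PET or the LKB actually
    read rule definitions.

```python
import copy


def embed_rhs(lhs, rhs, path=("ARGS",)):
    """Return a copy of `lhs` with the RHS daughters stored under `path`.

    `lhs` is a dict standing in for the mother feature structure and
    `rhs` is a sequence of dicts standing in for the daughters.
    """
    rule = copy.deepcopy(lhs)
    node = rule
    # walk (and create, where missing) the features leading to the list
    for feature in path[:-1]:
        node = node.setdefault(feature, {})
    node[path[-1]] = copy.deepcopy(list(rhs))
    return rule
```

    under these assumptions, `[ A ] --> [ B ], [ C ].' corresponds to
    calling embed_rhs(A, [B, C]), and a longer `path' would stand in for
    a differently named or more deeply embedded daughters feature.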

  - input chart description: from what i gather we have at least three
    and a half ways of describing partly processed input for parsing, 
    viz. (a) the original YY mode (`PetInput' on the wiki), (b) SPPP of
    the early Deep-Thought days (`LkbSppp'), (c) the XML input chart in
    PET (`PetInput'), and (d) emerging MAF support.  of these, (a) and
    (c) are available in PET ((c) only in the `main' branch), while (b)
    and (d) are in the LKB.  i personally believe in plurality, and at
    least (a) -- (c) currently have active users, but if nothing else
    we should try to document the various options better (and work out
    what their strong and weak points are for candidate users).

  - MaxEnt features: in recent work (jointly with erik velldal at UiO),
    we have extended the range of MaxEnt features in [incr tsdb()] and
    changed their textual representation in a non-backwards compatible
    way.  the new code generates `[1 2 imper hcomp vc_prd_be_le "be"]',
    where the second integer is new.  the immediate effect is that PET
    will still read `.mem' files created with the new code, but utterly 
    mis-interpret features, i.e. effectively ignore the model.  we hope
    to wrap up our re-design in the MaxEnt space fairly soon and then
    put out documentation on the feature templates.  at that point, PET
    will need an update to minimally recognize the new format properly, 
    and ideally make use of more of these new features (grandparenting,
    lexicalization, et al.).
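
    until the feature templates are documented, a reader of `.mem' files
    could at least tell the two layouts apart by counting the leading
    integers.  the sketch below only splits a feature string such as the
    one quoted above into its leading integers and the remaining tokens;
    the function name is hypothetical, no meaning is assigned to either
    integer, and reading a single leading integer as the older layout is
    an inference from `the second integer is new', not a documented fact.

```python
import re

# a token is either a double-quoted string or a run of non-whitespace
TOKEN = re.compile(r'"[^"]*"|\S+')


def parse_feature(text):
    """Split '[1 2 imper hcomp ... "be"]' into (integers, other tokens)."""
    body = text.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("expected a bracketed feature: %r" % text)
    tokens = TOKEN.findall(body[1:-1])
    # collect the run of integers at the start of the feature
    count = 0
    while count < len(tokens) and tokens[count].isdigit():
        count += 1
    integers = [int(token) for token in tokens[:count]]
    return integers, tokens[count:]
```

    a caller could then check len(integers) before interpreting a
    feature, instead of silently mis-reading the new layout.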

  - ERG and HoG: the current use of the ERG in HoG is, say, sub-optimal
    in terms of interfaces and results.  two core issues, i think, are
    (a) divergence in tokenization assumptions (e.g. |we haven't slept|
    fails to parse due to the contracted auxiliary) and (b) the need to
    further fine-tune the interactions with pre-processing steps (e.g.
    |Kim arrived at 2:00am.|, email addresses, URLs, et al. do not quite
    get the analysis they could).  these problems would get far worse
    with recent versions of the ERG, where most punctuation is treated
    as affixation now.  dan, ulli schaefer, and i have talked about the
    issue in some depth and have concluded that a general solution will
    be somewhat challenging to build (though interesting): in principle
    it would have to allow for multiple views on tokenization (taggers
    have good reasons to consider punctuation separate tokens, the ERG
    has its reasons to consider punctuation marks as affixes), and then
    annotations contributed (to tokens) by one component might refer to
    only part of a token from the point of view of another component.

    when we last talked, ulli and i resolved to further investigate the
    general solution but also look for a more pragmatic solution to the
    current issues with using the ERG in the HoG.  i believe the recent
    development in the ERG has actually simplified things, as we mostly
    now expect token boundaries at whitespace (plus at a small number
    of contracted forms, e.g. |'s|, |'ve|, et al.).  our proposal would
    be to make tokenization rules semi-explicit in the form of a token
    test suite (i.e. a collection of relevant examples plus associated
    tokenizer output).  given that, ulli believes he should be able to
    re-organize the input to PET, so as to collapse token boundaries in
    several cases.  we were hoping to look into this more when both dan
    and i are at DFKI in mid-november.

    differences in lexical rule processing might be another issue here,
    of course, as the ERG for the time being is developed against the
    `oe' branch of PET (see above and below). 

  - PET branches: dan and i still use what is effectively a version of
    PET as of sometime in 2003 (with moderate patches here and there).
    the main reason for this, i believe, is the flop(1) slow-down from
    moving to using Boost (over LEDA).  some remaining uncertainty about
    which of the various patches have migrated to the main branch, our
    overall conservative natures, and difficulties compiling the main
    branch when i last tried add to our inertia.  however, maintaining two
    distinct branches is not a good thing, and once flop(1) performance
    with Boost is resolved, some testing of the ERG on `main' should
    fairly quickly allow merging of the two branches, i hope.

i suspect there may well be additional issues that have been mentioned
over time.  some of the above, i believe, were reflected in the agenda
francis proposed for a developer meeting next year.  but those who see
more `candidate cross-platform issues' (or whatever these are), please
add them to this list!  even if we fail to resolve many of them, in
my view it is still useful to keep track of more of the loose ends.

                                                     all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (ILN); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at hf.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++