[developers] a late follow-up from lisboa
Stephan Oepen
oe at csli.Stanford.EDU
Wed Oct 26 19:21:52 CEST 2005
dear all,
several people expressed some concern about growing divergences between
DELPH-IN components during the lisboa meeting. i believe i volunteered
to summarize some of the specific issues that were mentioned (i started
this message weeks ago and have been adding piecemeal :-):
- syntax for orthographemic rules: the recent LKB changes resulted in
cleaning up the syntax for %letter-set, %suffix, et al. annotations
on lexical rules. only #\!, #\?, #\*, and #\) need escaping in the
new universe, and #\\ is the escape character. i believe only the
ERG makes use of funny characters in orthographemics currently, but
still i think we should update PET to reflect the above. i all but
promised dan to do this, hence will hopefully look into it soon.
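as a minimal sketch of the escaping convention described above (not the LKB or PET code): the four special characters and the escape character are taken from this message, while the function names and the choice to write a literal backslash as a doubled backslash are assumptions made for illustration.

```python
SPECIAL = set("!?*)")  # the four characters said to need escaping
ESCAPE = "\\"          # the escape character


def escape(text):
    """Backslash-escape special characters.

    Assumption: a literal backslash is itself written doubled.
    """
    return "".join(ESCAPE + c if c in SPECIAL or c == ESCAPE else c
                   for c in text)


def tokenize(pattern):
    """Split a pattern into (char, escaped) pairs.

    `escaped` is True when the character was preceded by the escape
    character and hence should be read literally.
    """
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == ESCAPE:
            if i + 1 >= len(pattern):
                raise ValueError("dangling escape character")
            out.append((pattern[i + 1], True))
            i += 2
        else:
            out.append((c, False))
            i += 1
    return out


def unescape(pattern):
    """Drop escape characters, keeping the characters they protect."""
    return "".join(c for c, _ in tokenize(pattern))
```

a reader in another processor would only need the `escaped` flag from `tokenize` to decide whether, say, `!` is a literal character or part of the rule notation.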
- chart dependencies: re-working the processing of lexical rules in
the LKB resulted in a change to the chart dependencies mechanism,
where in the old set-up chart dependencies were checked after all
lexical rules had been applied, and in the current universe they
are tested among lexical entries only (i.e. prior to application of
lexical rules). PET (main), on the other hand, i believe has moved
to a set-up where parsing is divided into two phases, viz. one with
lexical rules only, the second with non-lexical rules only. there
is an underlying difference between the systems here, where the LKB
notion of `lexical' rules only entails that a rule can be annotated
with an orthographemic effect; otherwise, the LKB will happily feed
the output of a non-lexical rule into a lexical rule. grammarians
tend to not be aware of this, as they maintain the word vs. phrase
distinction in the type system already (in my view, the LKB use of
the term `lexical' rule is potentially mis-leading; but documented).
i (still) have yet to re-read the earlier discussion on this on the
`developers' list, but currently at least berthold seems affected
by the two systems not doing the same (and the LKB no longer doing
what it used to). so, it seems as if a written specification for
the mechanism is needed, and then hopefully we can aim at getting
all processors to implement it alike.
- lexical rule processing: at some point, interleaving lexical rules
without orthographemic effect prior to orthographemic rules (e.g.
dative shift prior to passivization) had broken in the PET `main'
branch; i remember fixing it in the `oe' branch, but am not quite
sure about the state of play in recent `main' versions (i believe
the patch propagated). generally, it would seem tempting to have a
few testing grammars (e.g. the `toy' and `polymorphan' grammars in
the LKB source tree), so as to mechanically (and regularly) confirm
all systems output the same results on these.
- TDL syntax extensions: ann (i believe) added a comment facility on
types some time ago, and emily last year added a `:+' operator for
`overlay' type definitions, i.e. monotonic extensions to an earlier
defined type. the latter is already in use in the Matrix, i think,
and the former we were thinking to use for adding documentation to
grammars, specifically to annotate types that should be part of the
SEM-I for a grammar (i.e. exported). for all i know, PET does not
yet have support for either operator, but probably should.
conversely, bernd (i believe) extended PET at some point to allow a
variant rule notation, viz.
[ ] --> [ ], ..., [ ].
which should be equivalent to embedding the RHS of the definition
as a list below some designated path (e.g. `ARGS') in the LHS; to
make this compatible with the (mostly unused) LKB option of having
a separate feature embedding the LHS (which, in a sense, would be
cleaner in terms of which parts of the FS correspond to what), it
could become necessary to also allow this in PET, but while no-one
is using this option that seems hardly a priority. the arrow, on
the other hand, i quite like and would love to get into the LKB.
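the equivalence claimed for the arrow notation can be sketched as follows. this is a toy model only: feature structures are plain nested dicts without types or reentrancies, the daughters are embedded as a python list rather than in a grammar's actual list encoding, and `desugar_rule` and its `path` parameter are hypothetical names.

```python
import copy


def desugar_rule(lhs, rhs, path=("ARGS",)):
    """Embed the RHS daughters as a list below `path` in a copy of the LHS.

    `path` is a sequence of feature names; the default mirrors the
    `ARGS' example, a longer path models a designated embedding.
    """
    result = copy.deepcopy(lhs)
    node = result
    for feature in path[:-1]:
        node = node.setdefault(feature, {})
    if path[-1] in node:
        raise ValueError("LHS already has a value at %r" % (path,))
    node[path[-1]] = copy.deepcopy(list(rhs))
    return result


mother = {"SYNSEM": {"CAT": "s"}}
daughters = [{"SYNSEM": {"CAT": "np"}}, {"SYNSEM": {"CAT": "vp"}}]
rule = desugar_rule(mother, daughters)
```

here `rule` carries the two daughters under `ARGS`, while `mother` itself is left unchanged.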
- input chart description: from what i gather we have at least three
and a half ways of describing partly processed input for parsing,
viz. (a) the original YY mode (`PetInput' on the wiki), (b) SPPP of
the early Deep-Thought days (`LkbSppp'), (c) the XML input chart in
PET (`PetInput'), and (d) emerging MAF support. of these, (a) and
(c) are available in PET ((c) only in the `main' branch), while (b)
and (d) are in the LKB. i personally believe in plurality, and at
least (a) -- (c) currently have active users, but if nothing else
we should try to document the various options better (and work out
what their strong and weak points are for candidate users).
- MaxEnt features: in recent work (jointly with erik velldal at UiO),
we have extended the range of MaxEnt features in [incr tsdb()] and
changed their textual representation in a non-backwards compatible
way. the new code generates `[1 2 imper hcomp vc_prd_be_le "be"]',
where the second integer is new. the immediate effect is that PET
will still read `.mem' files created with the new code, but utterly
mis-interpret features, i.e. effectively ignore the model. we hope
to wrap up our re-redesign in the MaxEnt space fairly soon and then
put out documentation on the feature templates. at that point, PET
will need an update to minimally recognize the new format properly,
and ideally make use of more of these new features (grandparenting,
lexicalization, et al.).
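one way for a reader of `.mem' files to fail loudly instead of silently mis-interpreting features is to count the leading integers. the sketch below is not PET or [incr tsdb()] code; it assumes that the old layout had exactly one leading integer (inferred from `the second integer is new'), it does not interpret what the integers mean, and `parse_feature` is a hypothetical name.

```python
import re

# a double-quoted string (group 1) or a run of non-whitespace (group 2)
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def parse_feature(text):
    """Parse `[1 2 imper hcomp vc_prd_be_le "be"]` style features.

    Returns (layout, integers, symbols), where layout is "old" for one
    leading integer (assumed) and "new" for two.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("feature must be enclosed in [ ]")
    tokens = []
    for match in TOKEN.finditer(text[1:-1]):
        if match.group(1) is not None:
            tokens.append(("string", match.group(1)))
        else:
            tokens.append(("symbol", match.group(2)))
    integers = []
    while tokens and tokens[0][0] == "symbol" and tokens[0][1].isdigit():
        integers.append(int(tokens.pop(0)[1]))
    layout = {1: "old", 2: "new"}.get(len(integers))
    if layout is None:
        raise ValueError("unexpected number of leading integers")
    return layout, integers, [value for _, value in tokens]


new_style = parse_feature('[1 2 imper hcomp vc_prd_be_le "be"]')
```

with a check like this, a reader that only understands the old layout could refuse a new-style model rather than load it and effectively ignore it.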
- ERG and HoG: the current use of the ERG in HoG is, say, sub-optimal
in terms of interfaces and results. two core issues, i think, are
(a) divergence in tokenization assumptions (e.g. |we haven't slept|
fails to parse due to the contracted auxiliary) and (b) the need to
further fine-tune the interactions with pre-processing steps (e.g.
|Kim arrived at 2:00am.|, email addresses, URLs, et al. all get not
quite the analysis they could). these problems would get far worse
with recent versions of the ERG, where most punctuation is treated
as affixation now. dan, ulli schaefer, and i have talked about the
issue in some depth and have concluded that a general solution will
be somewhat challenging to build (though interesting): in principle
it would have to allow for multiple views on tokenization (taggers
have good reasons to consider punctuation separate tokens, the ERG
has its reasons to consider punctuation marks as affixes), and then
annotations contributed (to tokens) by one component might refer to
only part of a token from the point of view of another component.
when we last talked, ulli and i resolved to further investigate the
general solution but also look for a more pragmatic solution to the
current issues with using the ERG in the HoG. i believe the recent
development in the ERG has actually simplified things, as we mostly
now expect token boundaries at whitespace (plus at a small number
of contracted forms, e.g. |'s|, |'ve|, et al.). our proposal would
be to make tokenization rules semi-explicit in the form of a token
test suite (i.e. a collection of relevant examples plus associated
tokenizer output). given that, ulli believes he should be able to
re-organize the input to PET, so as to collapse token boundaries in
several cases. we were hoping to look into this more when both dan
and i are at DFKI in mid-november.
differences in lexical rule processing might be another issue here,
of course, as the ERG for the time being is developed against the
`oe' branch of PET (see above and below).
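a token test suite of the kind proposed above could be as simple as pairs of input strings and expected token lists, plus a runner that reports mismatches. the sketch below is illustrative only: the tokenizer splits at whitespace and at the two contracted forms named in this message, and the expected outputs in `SUITE` are invented examples, not the ERG's actual tokenization.

```python
CLITICS = ("'s", "'ve")  # only the contracted forms named in the message


def tokenize(text):
    """Split at whitespace, then split off a trailing clitic if present.

    Punctuation stays attached to its token (treated as affixation).
    """
    tokens = []
    for chunk in text.split():
        for clitic in CLITICS:
            if chunk.endswith(clitic) and len(chunk) > len(clitic):
                tokens.extend([chunk[:-len(clitic)], clitic])
                break
        else:
            tokens.append(chunk)
    return tokens


# hypothetical suite entries: (input, expected tokenizer output)
SUITE = [
    ("Kim arrived at 2:00am.", ["Kim", "arrived", "at", "2:00am."]),
    ("we've slept", ["we", "'ve", "slept"]),
    ("Kim's dog barks", ["Kim", "'s", "dog", "barks"]),
]


def run_suite(tokenizer, suite):
    """Return (input, expected, actual) for every mismatching entry."""
    failures = []
    for text, expected in suite:
        actual = tokenizer(text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures
```

running the same suite against each component's tokenizer (e.g. `run_suite(str.split, SUITE)` for a pure whitespace splitter) would show exactly where token boundaries need to be collapsed or introduced.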
- PET branches: dan and i still use what is effectively a version of
PET as of sometime in 2003 (with moderate patches here and there).
the main reason for this, i believe, is the flop(1) slow-down from
moving to using Boost (over LEDA). some remaining uncertainty about
which of the various patches have migrated to the main branch, our
overall conservative natures, and difficulties compiling the main
branch last i tried add to our inertia. however, maintaining two
distinct branches is not a good thing, and once flop(1) performance
with Boost is resolved, some testing of the ERG on `main' should
fairly quickly allow merging of the two branches, i hope.
i suspect there may well be additional issues that have been mentioned
over time. some of the above, i believe, were reflected in the agenda
francis proposed for a developer meeting next year. but those who see
more `candidate cross-platform issues' (or whatever these are), please
add them to this list! even if we failed to resolve many of them, in
my view it is still useful to keep track of more of the loose ends.
all best - oe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (ILN); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at csli.stanford.edu; oe at hf.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++