[developers] boring though important: generalizing characterization

Tue Apr 3 19:30:17 CEST 2007

hi again, thanks for the quick feedback,

> this proposal appears to be missing the generalised notion of
> character ranges that we require for XML input - i.e., the
> combination of xpath with character position.  

good, so actually more reason to generalize than i knew about.  how
could i learn more about that scheme?  more centralized LNK handling
should make it easy, in principle, to add another scheme.  however,
more is not always better, of course.  i would hope to avoid legacy
support for a wider range of schemes than is really needed.

but i take it there is no LKB code yet for the scheme you suggest?  as
far as i could see, existing code assumes CFROM and CTO are integers.
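
to make the idea of centralized LNK handling a bit more concrete, here
is a minimal sketch in python (not LKB code; every class and function
name below is hypothetical, and the printed format is purely
illustrative).  it assumes the XML scheme boils down to an xpath plus a
character offset for each end point, which is only my reading of the
remark quoted above.  the point is merely that, once all schemes go
through one type and one rendering function, adding a scheme touches a
single place:

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class CharSpan:
    """Plain character range, i.e. integer CFROM and CTO."""
    cfrom: int
    cto: int


@dataclass(frozen=True)
class XPathPoint:
    """Hypothetical: an xpath to a node plus a character offset inside it."""
    xpath: str
    offset: int


@dataclass(frozen=True)
class XPathSpan:
    """Hypothetical: a range between two xpath-plus-offset points."""
    start: XPathPoint
    end: XPathPoint


# every link value is exactly one of the supported schemes
Link = Union[CharSpan, XPathSpan]


def render(link: Link) -> str:
    """The one place that knows how each scheme is printed."""
    if isinstance(link, CharSpan):
        return f"<{link.cfrom}:{link.cto}>"
    if isinstance(link, XPathSpan):
        s, e = link.start, link.end
        return f"<{s.xpath}@{s.offset}:{e.xpath}@{e.offset}>"
    raise TypeError(f"unknown link scheme: {type(link).__name__}")
```

code that currently assumes integers would then call something like
render() instead of inspecting CFROM and CTO directly.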

> The token lattice notion is _not_ adequate for our purposes, given
> that we require a general notion of standoff annotation that can work
> with multiple tokenisations (and indeed, for processing that is not
> token based).

i would have thought the token-based view was the most general.  if we
defined the deep parser as a process mapping from a token lattice into
analyses (trees, MRSs, etc.), then the creation of the input lattice
(tokenization, tagging, spell correction, NE recognition, you name it)
is external to the parser.  with multiple tokenizations, each will use
unique token identifiers, and someone prior to parsing has to decide
which tokens are provided as input to the parser.  all parser analyses,
ultimately, are composed of those input elements, and as long as there
is an unambiguous reference from (parts of) analyses to such elements,
then all available information is preserved.  on this view, there could
either be indirect reference (from an EP to its input tokens, and from
each input token to its position in the preprocessing pipeline), or one
could re-map (or augment) analyses as a post-processing step outside of
the parser, i.e. as part of embedding it into a larger context.  
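
the indirect-reference variant could be sketched roughly as follows
(python again, all names hypothetical, nothing here reflects existing
LKB or PET data structures).  the parser only records token
identifiers on each EP; a post-processing step outside of the parser
resolves them against whatever indexing scheme the pipeline uses, here
assumed to be plain character offsets:

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Token:
    ident: str   # unique across all tokenizations of the input
    form: str
    cfrom: int   # pipeline-specific position; the parser never looks at it
    cto: int


@dataclass(frozen=True)
class EP:
    predicate: str
    token_ids: Tuple[str, ...]   # indirect reference into the input lattice


def remap(eps: Iterable[EP], tokens: Iterable[Token]) -> List[Tuple[str, int, int]]:
    """Post-processing outside the parser: resolve token ids to positions."""
    table = {token.ident: token for token in tokens}
    result = []
    for ep in eps:
        covered = [table[ident] for ident in ep.token_ids]
        start = min(token.cfrom for token in covered)
        end = max(token.cto for token in covered)
        result.append((ep.predicate, start, end))
    return result


tokens = [Token("t1", "the", 0, 3), Token("t2", "dog", 4, 7),
          Token("t3", "barks", 8, 13)]
eps = [EP("_the_q", ("t1",)), EP("_dog_n_1", ("t2",)),
       EP("_bark_v_1", ("t3",))]
spans = remap(eps, tokens)
# spans == [("_the_q", 0, 3), ("_dog_n_1", 4, 7), ("_bark_v_1", 8, 13)]
```

swapping in another indexing scheme would change Token and remap()
only; the EPs coming out of the parser stay as they are.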

i imagine, if only for practical reasons, various pipelines (contexts)
will have varying indexing schemes (we have seen quite a few already).
ideally, the interface into and out of the parser should be unaffected
by such external variation.  well, maybe a topic for the Summit ...

                                                         best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; oepen at idi.ntnu.no ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++