[developers] preprocessor implementations

Thu Nov 27 21:50:57 CET 2008

hi richard,

> The reason I'm asking is because I'm preparing to run some
> ERG-compatible tokenization on a really large scale, and I'm
> wondering if there are implementations out there that are more
> efficient than the lisp implementation.

in addition to tokenizing your data, do you also plan on parsing it?
if so, efficiency of the tokenization step may not matter that much.

but i believe you are right, the current FSPP (aka `-tok=fsr') support
in PET is compiled in via ECL, which, at least last i checked, means
that it lacks Unicode support.  hence its utility is further limited.

just another heads-up on ongoing developments in this area: we are in
the process of re-working tokenization assumptions in the ERG, coupled
with a full re-design of the token lattice interface into PET.  there
was a fairly general discussion of the motivation for this work at LREC
2008 (by Peter Adolphs et al.), and in the meantime the chart mapping
implementation in PET and revisions in the ERG have matured to a point
where we anticipate a first release early next year.  in a nutshell,
most of the current FSPP rules are re-cast as token mapping rules, and
then (in this configuration) parsing starts from a token lattice that
follows `common' PTB-style tokenization conventions.  FSPP is replaced
by a simplified successor module (dubbed REPP), which reduces
grammar-external pre-processing to string-level substitution and
tokenization, in a sense a subset of the original FSPP design.  however,
to be able
to properly tokenize PTB-style, we found it necessary to implement two
parts of the original FSPP design that were not implemented yet, viz.
optional, named `modules' (sets of rules), and an iteration operator.
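
to make these two additions concrete, here is a minimal sketch in python.
it is not REPP syntax, and it is not the LKB or PET code: rules are plain
regular-expression substitutions, grouped into named modules that a
caller can switch on or off, and each active module is re-applied until
the string stops changing, which is just one possible reading of an
iteration operator.  the module names, the rules themselves, and the
functions apply_rules, iterate, and tokenize are all made up for
illustration.

```python
import re

# Hypothetical rule sets for illustration only; these are NOT ERG or REPP rules.
MODULES = {
    # Peel one punctuation mark off a token edge per pass.
    "punct": [
        (re.compile(r"(\S)([.,;:!?)\]])(\s|$)"), r"\1 \2\3"),
        (re.compile(r"(^|\s)([(\[])(\S)"), r"\1\2 \3"),
    ],
    # PTB-style split of negated contractions: "don't" -> "do n't".
    "contractions": [
        (re.compile(r"(\w)(n't)\b"), r"\1 \2"),
    ],
}


def apply_rules(text, rules):
    """Apply each (pattern, replacement) rule once, in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def iterate(text, rules, limit=100):
    """Re-apply a rule set until the string no longer changes.

    The limit is a safety net against rule sets that never converge.
    """
    for _ in range(limit):
        rewritten = apply_rules(text, rules)
        if rewritten == text:
            break
        text = rewritten
    return text


def tokenize(text, active=("punct",)):
    """Run the named modules in the given order, then split on whitespace."""
    for name in active:
        text = iterate(text, MODULES[name])
    return text.split()


if __name__ == "__main__":
    print(tokenize("(really?!)"))
    print(tokenize("I don't know.", active=("contractions", "punct")))
```

the point of the iteration is visible on a string like "(really?!)": a
single pass of the punctuation rules only detaches the outermost marks
and leaves "really?!" in one piece, whereas repeating the same rules
until nothing changes separates every mark.  leaving "contractions" out
of the active list keeps "don't" as one token, which is the sense in
which a module is optional here.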

in this sense, REPP is a simplification of FSPP that adds some extra
functionality.  i did a trial implementation on top of the LKB, and we
are now planning a native implementation (using the Boost RE library)
in PET.  i expect all of this will stabilize early next year, and then
i would encourage the transition to the new tokenization universe and
REPP.  if what you are preparing to do heavily depends on the specific
tokenization assumptions made in the current ERG, it may or may not be
advisable to see whether you could postpone that project for a couple
of months ...

                       more on all this in due time; all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++