[developers] extension to the REPP sub-formalism

Mon Aug 3 18:58:47 CEST 2020

hi woodley,

> It looks from the file you referenced like the proposed new operation is '=' rather than '+'?

yes, sorry, my typo in the email!

> I guess you will be limited to using this facility in cases where the designation as named entity is sufficiently unambiguous based on the RE alone.  It is tempting to contemplate ways in which REPP could offer ambiguous tokenization output here, but so far my imagination is too limited to come up with the scenario where it would be useful.

indeed, the intended use for masking would be for (near-)certain
patterns; in principle, one could then further split and ambiguate in
token mapping.  in the REPP predecessor, there was some contemplation
of string-level rewriting over a token lattice, but with the
introduction of token mapping we more than happily purged that
complexity from the initial tokenizer.  i have grown fond of the
current division of labor, with a simple, sequence-to-sequence initial
step (which should be limited to straightforward string-level
processing), the ability to call out to external processors (like a
PoS tagger) with that simple sequence, and deferring lattice
processing to the second stage of preprocessing, where we can
manipulate structured token objects ...

glad to hear you expect REPP masking should not be hard to implement;
i have yet to find out whether i share that optimistic expectation on
the LKB side :-).

oe