[developers] extension to the REPP sub-formalism

Sun Aug 2 14:44:21 CEST 2020

dear bec, mike, and woodley:

during the summit you may have noticed dan mentioning a 'war zone'
around NE-related token mapping rules in the current ERG trunk.  with
our move to modern, OntoNotes-style tokenization, the initial REPP
segmentation now breaks at dashes (including hyphens) and slashes.
but these will, of course, occur frequently in named entities like
email and web addresses, where they should preferably not be
segmented.  the current unhappy state of affairs is that initial
tokenization over-segments, with dan then heroically seeking to
re-unite at least the most common patterns of 'multi-token' named
entities in token mapping, where any number of token boundaries may
have been introduced at hyphens and slashes.

to rationalize this state of affairs (and, thus, work toward a peace
treaty in token mapping), i believe we will need to extend the REPP
language with a new facility: masking sub-strings according to NE-like
patterns prior to core REPP processing, and exempting masked regions
from all subsequent rewriting (i.e. making sure they remain intact).
i have added an example of this new facility (introducing the '+'
operator) to the ERG trunk; please see:

http://svn.delph-in.net/erg/trunk/rpp/ne.rpp
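
as a rough illustration of the masking step (not the actual contents
of ne.rpp), here is a minimal python sketch that marks NE-like
sub-strings with one boolean flag per character.  the two patterns and
the names MASK_PATTERNS and compute_mask are assumptions made up for
this example; the real rules use the '+' operator in REPP itself.

```python
import re

# Hypothetical NE-like patterns, for illustration only; the actual
# masking rules live in ne.rpp and may look quite different.
MASK_PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # email-like strings
    re.compile(r"https?://\S+"),                  # URL-like strings
]


def compute_mask(text, patterns=MASK_PATTERNS):
    """Return one boolean per character; True marks a masked position."""
    mask = [False] * len(text)
    for pattern in patterns:
        for match in pattern.finditer(text):
            for i in range(match.start(), match.end()):
                mask[i] = True
    return mask
```

the resulting list is aligned with the input string, so later
processing can look up whether a given character position is exempt
from rewriting.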

at present, these rules are only loaded into the LKB (where i am in
the process of adding masking to the REPP implementation), hence they
should not cause trouble in the other engines (i hope).  i would like
to invite you (as the developers of REPP processors in PET, pyDelphin,
and ACE, respectively) to look over this proposal and share any
comments you might have.  assuming we can agree on the need for
extending the REPP language along the above lines, i am hoping you
might have a chance to add support for the masking operator in your
REPP implementations.

from my ongoing work in the LKB, masking support appears relatively
straightforward once an engine implements the step-wise accounting for
character positions sketched by Dridan & Oepen (2012; ACL).  the
masking patterns merely set a boolean flag for the matched character
positions, and subsequent rewriting must block rule applications that
destructively change one or more masked character positions.  output
of capture groups (copying from the left-hand side verbatim), on the
other hand, must be allowed over masked regions.  because the LKB
implementation predates the 2012 paper, however, i will first have to
implement the precise accounting mechanism to validate the above
expectation regarding how to realize masking.
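
the blocking behaviour described above might be sketched in python as
follows.  this is a simplified illustration under stated assumptions,
not the LKB code: the function name rewrite and the list-based
replacement template are hypothetical, and the sketch treats a masked
character as preserved only if a capture group that is copied to the
output covers it.

```python
import re


def rewrite(text, mask, pattern, template):
    """Apply one rewrite rule, skipping matches that would destroy
    masked characters.

    `template` is a list of literal strings and integer group numbers
    (a hypothetical stand-in for a REPP replacement string).
    Returns the new text together with its re-aligned mask.
    """
    out_text, out_mask = [], []
    pos = 0
    for m in re.finditer(pattern, text):
        # Positions that survive because a capture group copies them.
        copied = set()
        for item in template:
            if isinstance(item, int) and m.start(item) >= 0:
                copied.update(range(m.start(item), m.end(item)))
        # Block the rule if any masked position would not be copied.
        blocked = any(mask[i] and i not in copied
                      for i in range(m.start(), m.end()))
        # Text between matches passes through unchanged.
        out_text.append(text[pos:m.start()])
        out_mask.extend(mask[pos:m.start()])
        if blocked:
            out_text.append(m.group(0))
            out_mask.extend(mask[m.start():m.end()])
        else:
            for item in template:
                if isinstance(item, int):
                    if m.start(item) >= 0:
                        # Copied characters keep their mask flags.
                        out_text.append(m.group(item))
                        out_mask.extend(mask[m.start(item):m.end(item)])
                else:
                    # Newly inserted literals are never masked.
                    out_text.append(item)
                    out_mask.extend([False] * len(item))
        pos = m.end()
    out_text.append(text[pos:])
    out_mask.extend(mask[pos:])
    return "".join(out_text), out_mask


text = "well-known oe@my-site.no"
start = text.index("oe@")
mask = [False] * start + [True] * (len(text) - start)
new_text, new_mask = rewrite(text, mask, r"-", [" - "])
print(new_text)  # well - known oe@my-site.no
```

the hyphen outside the masked region is split off, while the one
inside the email-like string is left alone.  the sketch does not
settle whether inserting a literal between two copied, masked
characters should count as a destructive change; that would need to be
decided as part of the proposal.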

what do you make of the above proposal?  oe