[developers] extension to the REPP sub-formalism

Mon Aug 3 09:26:21 CEST 2020

Hi Stephan,

From the file you referenced, it looks like the proposed new operator is '=' rather than '+'?

This seems like a plausible and modest addition to me, and should not be hard to implement.  I guess you will be limited to using this facility in cases where the designation as a named entity is sufficiently unambiguous based on the regular expression alone.  It is tempting to contemplate ways in which REPP could offer ambiguous tokenization output here, but so far my imagination is too limited to come up with a scenario where that would be useful.

Woodley

> On Aug 2, 2020, at 5:44 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> dear bec, mike, and woodley:
> 
> during the summit you may have noticed dan mentioning a 'war zone'
> around NE-related token mapping rules in the current ERG trunk.  with
> our move to modern, OntoNotes-style tokenization, the initial REPP
> segmentation now breaks at dashes (including hyphens) and slashes.
> but these will, of course, occur frequently in named entities like
> email and web addresses, where they should preferably not be
> segmented.  the current unhappy state of affairs is that initial
> tokenization over-segments, with dan then heroically seeking to
> re-unite at least the most common patterns of 'multi-token' named
> entities in token mapping, where any number of token boundaries may
> have been introduced at hyphens and slashes.
> 
> to rationalize this state of affairs (and, thus, work toward a peace
> treaty in token mapping), i believe we will need to extend the REPP
> language with a new facility: masking sub-strings according to NE-like
> patterns prior to core REPP processing, and exempting masked regions
> from all subsequent rewriting (i.e. making sure they remain intact).
> i have added an example of this new facility (introducing the '+'
> operator) to the ERG trunk; please see:
> 
> http://svn.delph-in.net/erg/trunk/rpp/ne.rpp
> 
> at present, these rules are only loaded into the LKB (where i am in
> the process of adding masking to the REPP implementation), hence they
> should not cause trouble in the other engines (i hope).  i would like
> to invite you (as the developers of REPP processors in PET, pyDelphin,
> and ACE, respectively) to look over this proposal and share any
> comments you might have.  assuming we can agree on the need for
> extending the REPP language along the above lines, i am hoping you
> might have a chance to add support for the masking operator in your
> REPP implementations?
> 
> from my ongoing work in the LKB, masking support appears relatively
> straightforward once an engine implements the step-wise accounting for
> character position sketched by Dridan & Oepen (2012; ACL).  the
> masking patterns merely set a boolean flag for the matched character
> positions, and subsequent rewriting must block rule applications that
> destructively change one or more masked character positions.  output
> of capture groups (copying from the left-hand side verbatim), on the
> other hand, must be allowed over masked regions.  because the LKB
> implementation predates the 2012 paper, however, i will first have to
> implement the precise accounting mechanism to validate the above
> expectation regarding how to realize masking.
> 
> what do you make of the above proposal?  oe
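
For illustration, the masking mechanism described in the quoted message (a boolean flag per character position, blocking of rewrites that would destructively change a masked position, and verbatim capture-group copies still allowed) might look roughly like the Python sketch below.  This is not the LKB, PET, pyDelphin, or ACE implementation and it does not model actual REPP rule syntax or the '=' / '+' operator; the function names, the list-based encoding of a rule's right-hand side, and the example patterns are all assumptions made up for this sketch.  It also simplifies matching: a blocked match is simply skipped rather than retried at the next character position.

```python
import re


def mask_positions(text, mask_patterns):
    """Return one boolean per character; True marks a masked position.

    `mask_patterns` stands in for NE-like masking rules (hypothetical).
    """
    mask = [False] * len(text)
    for pattern in mask_patterns:
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                mask[i] = True
    return mask


def apply_rule(text, mask, pattern, parts):
    """Apply one rewrite rule left to right, honouring the mask.

    `parts` encodes the right-hand side (an assumption of this sketch):
    a str is literal output, an int n copies capture group n verbatim.
    Returns the rewritten text and its updated mask.
    """
    out_text, out_mask = [], []
    pos = 0
    for m in re.finditer(pattern, text):
        # Positions that the right-hand side copies verbatim from a group.
        copied = set()
        for p in parts:
            if isinstance(p, int) and m.start(p) >= 0:
                copied.update(range(m.start(p), m.end(p)))
        # Block the rule if it would change or drop any masked position.
        if any(mask[i] and i not in copied
               for i in range(m.start(), m.end())):
            continue
        # Keep the text between the previous rewrite and this match.
        out_text.append(text[pos:m.start()])
        out_mask.extend(mask[pos:m.start()])
        for p in parts:
            if isinstance(p, int):
                if m.start(p) >= 0:
                    # Copied characters carry their mask flags along.
                    out_text.append(m.group(p))
                    out_mask.extend(mask[m.start(p):m.end(p)])
            else:
                # Newly introduced literal characters are unmasked.
                out_text.append(p)
                out_mask.extend([False] * len(p))
        pos = m.end()
    out_text.append(text[pos:])
    out_mask.extend(mask[pos:])
    return "".join(out_text), out_mask


text = "mail oe@uio-x.no or re-use it"
mask = mask_positions(text, [r"\S+@\S+"])  # toy email pattern
# Toy segmentation rule: put spaces around a hyphen between word characters.
split_text, split_mask = apply_rule(text, mask, r"(\w)-(\w)", [1, " - ", 2])
```

In this toy run the hyphen inside the masked address stays intact, while the one in "re-use" is split.  A rule whose right-hand side only copies the masked region from a capture group, for example wrapping the address in brackets, would still apply.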