[developers] extension to the REPP sub-formalism

Wed Aug 5 13:24:39 CEST 2020

It's a _loong_ time since I looked at that code (or used svn...). I've been
refreshing my memory of the code, and I think I can see how that works. As
a mechanism, it sounds reasonable, but it's going to be a long time before
I'd have time to sit down and try to make the change. More than happy for
anyone else to take up the challenge :)

Bec

On Sun, Aug 2, 2020 at 10:44 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> dear bec, mike, and woodley:
>
> during the summit you may have noticed dan mentioning a 'war zone'
> around NE-related token mapping rules in the current ERG trunk.  with
> our move to modern, OntoNotes-style tokenization, the initial REPP
> segmentation now breaks at dashes (including hyphens) and slashes.
> but these will, of course, occur frequently in named entities like
> email and web addresses, where they should preferably not be
> segmented.  the current unhappy state of affairs is that initial
> tokenization over-segments, with dan then heroically seeking to
> re-unite at least the most common patterns of 'multi-token' named
> entities in token mapping, where any number of token boundaries may
> have been introduced at hyphens and slashes.
>
> to rationalize this state of affairs (and, thus, work toward a peace
> treaty in token mapping), i believe we will need to extend the REPP
> language with a new facility: masking sub-strings according to NE-like
> patterns prior to core REPP processing, and exempting masked regions
> from all subsequent rewriting (i.e. making sure they remain intact).
> i have added an example of this new facility (introducing the '+'
> operator) to the ERG trunk; please see:
>
> http://svn.delph-in.net/erg/trunk/rpp/ne.rpp
>
> at present, these rules are only loaded into the LKB (where i am in
> the process of adding masking to the REPP implementation), hence they
> should not cause trouble in the other engines (i hope).  i would like
> to invite you (as the developers of REPP processors in PET, pyDelphin,
> and ACE, respectively) to look over this proposal and share any
> comments you might have.  assuming we can agree on the need for
> extending the REPP language along the above lines, i am hoping you
> might have a chance to add support for the masking operator in your
> REPP implementations?
>
> from my ongoing work in the LKB, masking support appears relatively
> straightforward once an engine implements the step-wise accounting for
> character position sketched by Dridan & Oepen (2012; ACL).  the
> masking patterns merely set a boolean flag for the matched character
> positions, and subsequent rewriting must block rule applications that
> destructively change one or more masked character positions.  output
> of capture groups (copying from the left-hand side verbatim), on the
> other hand, must be allowed over masked regions.  because the LKB
> implementation predates the 2012 paper, however, i will first have to
> implement the precise accounting mechanism to validate the above
> expectation regarding how to realize masking.
>
> what do you make of the above proposal?  oe
>
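
For reference, the masking behaviour described in the quoted message could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the LKB, PET, pyDelphin, or ACE implementation: the function names `mask_spans` and `rewrite`, the email-like pattern, and the hyphen rule are hypothetical, and the rules use Python `re` syntax rather than the actual REPP rule syntax in ne.rpp. The sketch keeps one boolean per character, blocks any rule application that would change a masked character, and allows masked characters to pass through when they are copied verbatim by a capture group.

```python
import re

def mask_spans(text, mask_patterns):
    """Return one boolean per character; True marks a masked position."""
    mask = [False] * len(text)
    for pat in mask_patterns:
        for m in re.finditer(pat, text):
            mask[m.start():m.end()] = [True] * (m.end() - m.start())
    return mask

def rewrite(text, mask, pattern, template):
    """Apply one rewrite rule, skipping matches that would alter masked text.

    `template` may contain single-digit group references such as \\1
    (hypothetical, simplified rule format). Returns the new text and its mask.
    """
    # Split the template into literals (even indices) and group numbers (odd).
    parts = re.split(r"\\(\d)", template)
    groups = [int(g) for g in parts[1::2]]
    out, out_mask, pos = [], [], 0
    for m in re.finditer(pattern, text):
        # Positions inside a referenced capture group are copied verbatim.
        copied = set()
        for g in groups:
            if m.start(g) >= 0:
                copied.update(range(m.start(g), m.end(g)))
        # Block the match if any masked position would be changed.
        if any(mask[i] and i not in copied
               for i in range(m.start(), m.end())):
            continue
        out.append(text[pos:m.start()])
        out_mask.extend(mask[pos:m.start()])
        for k, part in enumerate(parts):
            if k % 2 == 0:
                # Literal output from the rule is never masked.
                out.append(part)
                out_mask.extend([False] * len(part))
            elif m.start(int(part)) >= 0:
                # Copied group text keeps its original mask flags.
                g = int(part)
                out.append(m.group(g))
                out_mask.extend(mask[m.start(g):m.end(g)])
        pos = m.end()
    out.append(text[pos:])
    out_mask.extend(mask[pos:])
    return "".join(out), out_mask

text = "see oe-x@ifi.uio.no for pre-release"
mask = mask_spans(text, [r"\S+@\S+"])            # assumed NE-like pattern
new_text, new_mask = rewrite(text, mask, r"-", " - ")
print(new_text)
```

In this sketch the hyphen inside the masked address is left alone while the one in "pre-release" is split. Carrying the mask along with the rewritten string is a much-simplified stand-in for the step-wise character position accounting mentioned in the message.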