[developers] extension to the REPP sub-formalism

Woodley Packard sweaglesw at sweaglesw.org
Sun Aug 9 08:53:49 CEST 2020


Hi again,

> On Aug 3, 2020, at 9:58 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
> 
> glad to hear you expect REPP masking should not be hard to implement;
> i have yet to find out whether i share that optimistic expectation on
> the LKB side :-).

I got far enough to play around a bit with rules, anyway.  I can mask email addresses, but as far as I can tell, no subsequent rule ever even tries to do anything inside of them.  Is this actually a good test case?  I get a single, identical token for the email address in the example below, both before and after implementing the masking idea:

$ ace -g erg.dat -E
I sent <oe@csli.stanford.edu> an e-mail.
EXECUTING MASK pattern...
MASKING <oe@csli.stanford.edu>
I<0:1> sent<2:6> <oe@csli.stanford.edu><7:29> an<30:32> e<33:34> -<34:35> mail<35:39> .<39:40>

> On Aug 3, 2020, at 12:35 AM, goodman.m.w at gmail.com wrote:
> 
> As an aside, that email regex is needlessly complicated. Since, in a unicode-aware regex engine, the word-character class \w is equivalent to the L and N unicode properties with the underscore ([\p{L}\p{N}_]), and since the TLD part of the domain must have only ascii characters, it can be simplified as follows:
> 
>     <?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?

Besides looking prettier, Mike's regex has the advantage of working in Boost's POSIX regex interface, whereas Stephan's does not.  Boost regex can be called in multiple ways, and for whatever reason, the POSIX way does not support the \p{} syntax.  I am not particularly eager to change to a different regex API.
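
As a quick sanity check of the simplified pattern, here is a minimal sketch using Python's re module (not Boost's POSIX interface, so it says nothing about Boost's behavior).  It assumes the " at " in the archived example stands for "@" in the actual input; the names EMAIL and sample are illustrative only.

```python
import re

# Mike's simplified pattern, copied verbatim from the quoted message.
# In Python 3, \w on str patterns is Unicode-aware by default.
EMAIL = re.compile(r"<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?")

# Assumed form of the test sentence, with "@" restored.
sample = "I sent <oe@csli.stanford.edu> an e-mail."

match = EMAIL.search(sample)
print(match.group(0), match.span())
# <oe@csli.stanford.edu> (7, 29)

# A non-ASCII local part still matches, since \w covers Unicode letters.
print(EMAIL.fullmatch("müller@example.org") is not None)
```

The span (7, 29) agrees with the <7:29> characterization that ace prints for the masked token.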

I ended up using the BIO-encoded representation of masked material that Mike proposed, so I can mask two adjacent spans and still insert material between them, while blocking changes to material inside the masked regions.  In my implementation, material copied by a capture group is OK, but material rewritten literally on the RHS of a replace currently fails: that material ends up marked as unmasked, whereas the check requires identical content, characterization, and mask tags for everything in a masked area.  As you both noted, shifting the entire mask left or right is fine.
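
To make the check concrete, here is a rough Python sketch of the idea.  It is not the ACE code; the names Cell, make_cells, masked_regions and edit_allowed are hypothetical, and per-character tagging is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Cell:
    char: str   # one character of the current string
    start: int  # characterization: offsets into the original input
    end: int
    tag: str    # "B" begins a masked span, "I" continues it, "O" is unmasked

def make_cells(text: str, masked: List[Tuple[int, int]]) -> List[Cell]:
    """Tag each character of text, given half-open masked spans."""
    tags = ["O"] * len(text)
    for lo, hi in masked:
        tags[lo] = "B"
        for i in range(lo + 1, hi):
            tags[i] = "I"
    return [Cell(c, i, i + 1, t) for i, (c, t) in enumerate(zip(text, tags))]

def masked_regions(cells: List[Cell]) -> List[Tuple[Cell, ...]]:
    """Collect each masked region as the run of cells from a B through its I's."""
    regions: List[List[Cell]] = []
    current = None
    for cell in cells:
        if cell.tag == "B":
            current = [cell]
            regions.append(current)
        elif cell.tag == "I" and current is not None:
            current.append(cell)
        else:
            current = None
    return [tuple(r) for r in regions]

def edit_allowed(before: List[Cell], after: List[Cell]) -> bool:
    """Allow a rewrite only if every masked region survives with identical
    content, characterization, and mask tags."""
    return masked_regions(before) == masked_regions(after)

# Two adjacent one-character masked spans: B B (plain in/out tags could not
# tell these apart from a single two-character span).
before = make_cells("ab", [(0, 1), (1, 2)])

# Inserting unmasked material between the two spans is fine.
inserted = [before[0], Cell("-", 1, 1, "O"), before[1]]
# Shifting the masked material right by prepending text is fine.
shifted = [Cell(" ", 0, 0, "O")] + before
# Literal RHS material comes out tagged "O", so the region is lost.
rewritten = [Cell("a", 0, 1, "O"), before[1]]

print(edit_allowed(before, inserted), edit_allowed(before, shifted),
      edit_allowed(before, rewritten))
```

In this sketch the literal rewrite is rejected even though the character is the same, which mirrors the current limitation described above: only cells carried over unchanged (as a capture-group copy would do) keep their mask tags.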

Regards,
-Woodley

