[developers] extension to the REPP sub-formalism

Sun Aug 9 23:47:48 CEST 2020

hi again, woodley:

> I got to the point of being able to play around a bit with rules, anyway.  I can mask email addresses, but as far as I can tell, no subsequent rules are ever even trying to do anything inside of them.  Is this actually a good test case?  I get a single identical token for the email address in the below example, before and after implementing the masking idea:

i am happy to hear you were able to confirm your optimistic
expectation that masking would not be too difficult to implement :-).

i shall add a few more masking rules to the ERG trunk this coming
week, but i would think the following could be a useful test case to
explore the interaction of masking and rewriting (i would expect
eleven tokens):

stephan, oe@yy.com, oe@ellingsen-oepen.net, or привет@радио-москва.рф, called.
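
to make the count of eleven concrete, here is a minimal python sketch that shields email addresses before splitting off punctuation, with the addresses written out with @.  it is not REPP and not the ERG rule set: the EMAIL pattern and the tokenize() helper are hypothetical stand-ins, and real masking operates on rewrite rules rather than a single pass.

```python
import re

# Hypothetical, deliberately simple email pattern; \w is Unicode-aware
# for str patterns in Python 3, so the Cyrillic address is covered too.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

# Everything outside a masked span: runs of word characters, or single
# punctuation marks.
PIECE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Keep each email address whole; split the rest on punctuation."""
    tokens = []
    pos = 0
    for match in EMAIL.finditer(text):
        tokens += PIECE.findall(text[pos:match.start()])
        tokens.append(match.group())  # masked span: never split further
        pos = match.end()
    tokens += PIECE.findall(text[pos:])
    return tokens


SENTENCE = ("stephan, oe@yy.com, oe@ellingsen-oepen.net, "
            "or привет@радио-москва.рф, called.")
TOKENS = tokenize(SENTENCE)
```

without the shielding step, the punctuation rule alone would break each address at its dots, hyphens, and @ sign, which is the interaction the test sentence is meant to expose.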

> Besides looking prettier, Mike's regex has the advantage of working in Boost's POSIX regex interface, whereas Stephan's does not.  I am not particularly eager to change to a different regex API.  Boost regex has multiple ways to call it, and for whatever reason, the POSIX way does not support the \p{} syntax.

i would suggest we leave aesthetic judgments to the maintainers of the
REPP rules, but in this case i put in unicode properties for a reason:
i am eager to start using the \p{} syntax because (unlike classic
character ranges or shorthands like \w) it is unambiguously defined
across engines, independent of locales.  more importantly, i expect
unicode properties will afford a cleaner and more general solution to
normalization of punctuation, e.g. different types of whitespace and
various conventions for opening and closing quote marks; unicode
properties may also help in dealing with interspersed foreign content.
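
as a rough illustration of what property-based normalization could look like, the sketch below uses unicode general categories from python's standard unicodedata module (the built-in re module does not accept \p{}).  the categories Zs, Pi, and Pf correspond to \p{Zs}, \p{Pi}, and \p{Pf}; the normalize() helper and the choice of target characters are assumptions made for this example, not ERG rules.

```python
import unicodedata


def normalize(text):
    """Toy normalization driven by Unicode general categories."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Zs":
            # \p{Zs}: any space separator (no-break space, thin space, ...)
            out.append(" ")
        elif category == "Pi":
            # \p{Pi}: initial quote punctuation, e.g. U+00AB or U+2018
            out.append("\u201c")
        elif category == "Pf":
            # \p{Pf}: final quote punctuation, e.g. U+00BB or U+2019
            out.append("\u201d")
        else:
            out.append(ch)
    return "".join(out)


SAMPLE = "\u00abhi\u00a0there\u2009now\u00bb"
RESULT = normalize(SAMPLE)
```

note that this toy also rewrites U+2019 when it is used as an apostrophe, so a real rule set would need more context than the property alone.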

it appears Boost regex offers full unicode support when combined with
ICU, which i would guess ACE already uses?  if so, i am hoping that
full unicode support in regular expressions (in REPP and chart
mapping) might become available with relatively minor adjustments to
how you call into the Boost regex engine.

https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/unicode.html

> I ended up using the BIO-encoded representation of what's masked that Mike proposed, so I can mask two adjacent spans and then still insert material between them, but block changing material inside of the masked regions.  In my implementation, material copied by capture group is OK but material rewritten literally on the RHS of a replace fails currently, because that material ends up being marked as unmasked, whereas the check requires identical content, characterization, and mask tags for everything in a masked area.

that all sounds compatible with my intuitions about how i would like
the masking to behave.  in general, i am hoping to discourage literal
rewriting, as it has the potential to weaken characterization
accounting.
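
for concreteness, the behavior described above might be sketched as follows.  this is only my reading of the BIO idea, not the ACE implementation: mask(), regions(), and rewrite_allowed() are hypothetical names, and the sketch compares content and mask tags only, leaving characterization out.

```python
def mask(text, spans):
    """Pair each character with a B/I/O tag; spans are (start, end) offsets."""
    tags = ["O"] * len(text)
    for start, end in spans:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return list(zip(text, tags))


def regions(cells):
    """Collect the masked regions, in order, as strings."""
    found, current = [], None
    for ch, tag in cells:
        if tag == "B":
            if current is not None:
                found.append(current)  # adjacent span: close the previous one
            current = ch
        elif tag == "I" and current is not None:
            current += ch
        else:
            if current is not None:
                found.append(current)
            current = None
    if current is not None:
        found.append(current)
    return found


def rewrite_allowed(before, after):
    """A rewrite passes only if every masked region survives unchanged."""
    return regions(before) == regions(after)


BEFORE = mask("oe@yy.com,x", [(0, 9)])
# Inserting unmasked material next to the masked span is fine.
INSERTED = BEFORE[:9] + [(" ", "O")] + BEFORE[9:]
# Literal RHS text arrives tagged O, so the masked region goes missing.
LITERAL = [(ch, "O") for ch in "oe@yy.com"] + BEFORE[9:]
# Two adjacent masked spans stay distinct thanks to the B tags.
ADJACENT = mask("ab", [(0, 1), (1, 2)])
SPLIT = ADJACENT[:1] + [(" ", "O")] + ADJACENT[1:]
```

here rewrite_allowed(BEFORE, INSERTED) and rewrite_allowed(ADJACENT, SPLIT) hold, while rewrite_allowed(BEFORE, LITERAL) does not, even though the literal text is character-for-character identical.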

many thanks for working on this!  oe