[developers] extension to the REPP sub-formalism

goodman.m.w at gmail.com
Sat Sep 11 05:58:41 UTC 2021


Hi all, I'm now returning to REPP masking support in PyDelphin and I have
some questions.

Firstly, Woodley, do you have your implementation of REPP somewhere
accessible? The latest packaged tarball on sweaglesw.org (0.2.2) was last
touched in 2011.

Secondly, I got PyDelphin to support masking as we discussed, but my
implementation turned up some strange edge cases where regex captures are
concerned. Part of this may be due to how, unlike Woodley's, Bec's, and
Glenn's REPP libraries, mine does not execute a regular expression anew for
each match (you may recall from another thread that this is a source of
differences when material is deleted at the start of the string). Instead,
it finds all matches in the string before rewriting. As a consequence, I
need to determine if masked material has been altered within a single
match, leading to the following conditions:

1. If the first mask value (in the BIO-tagging scheme) in the matched
substring is "I", then the match begins in the middle of some masked
substring. The result must not shift the position of this mask.
2. Similarly, if the first mask value *after* the match is "I", then the
match ends in the middle of a masked substring. This mask also should not
move.
3. All other masked substrings in a match are entirely covered and may
shift around.
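
The first two conditions can be sketched in a few lines of Python. This is
only an illustration under assumptions, not PyDelphin's actual code: the
mask is taken to be a list with one "B", "I", or "O" tag per character, a
match is a half-open span [start, end), and the name mask_pins is
hypothetical.

```python
def mask_pins(mask, start, end):
    """Return (left_pinned, right_pinned) for a match over [start, end).

    left_pinned:  the match begins inside a masked substring (condition 1).
    right_pinned: the match ends inside a masked substring (condition 2).
    Masks that are neither pinned case are fully covered (condition 3).
    """
    left_pinned = start < end and mask[start] == "I"
    right_pinned = end < len(mask) and mask[end] == "I"
    return left_pinned, right_pinned


# Build a per-character BIO mask for one masked substring (assumed layout).
text = "(<oe at yy.com>)"
masked = "<oe at yy.com>"
i = text.index(masked)
mask = ["O"] * len(text)
mask[i] = "B"
for j in range(i + 1, i + len(masked)):
    mask[j] = "I"

print(mask_pins(mask, 0, 2))            # a match over "(<"
print(mask_pins(mask, 14, 16))          # a match over ">)"
print(mask_pins(mask, 0, len(text)))    # a match over the whole string
```

A match over "(<" ends inside the mask, a match over ">)" begins inside it,
and a match over the whole string covers the mask entirely.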

As an example, imagine we have masked the substring "<oe at yy.com>" from the
full string "(<oe at yy.com>)" and we have rewrite rules that insert a space
between an opening/closing parenthesis and the captured punctuation next to
it (here < and >):

    !\((\p{P})    ( \1
    !(\p{P})\)    \1 )

Each rule captures only part of the masked substring, so in the rule's
output the captured part must stay at the right (first rule) or left
(second rule) boundary of the match.
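
The effect of the two rules on the example string can be sketched in
Python. This is only an illustration: the standard-library re module does
not support \p{P}, so the character class [<>] stands in for it here, and
the variable names are made up for the example.

```python
import re

text = "(<oe at yy.com>)"
masked = "<oe at yy.com>"   # the substring assumed to be masked

# Stand-ins for the two rules; [<>] replaces \p{P}, which stdlib re lacks.
step1 = re.sub(r"\(([<>])", r"( \1", text)
step2 = re.sub(r"([<>])\)", r"\1 )", step1)

print(step2)

# The masked substring is still present unchanged; only its offset differs.
before = text.index(masked)
after = step2.index(masked)
print(before, after)
```

In both substitutions the captured character is copied to the edge of the
replacement that touches the rest of the mask, so the masked content stays
contiguous and is only shifted.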

One issue, however unlikely, is whether a masked substring can be
duplicated (e.g., the backreference is repeated in the RHS of a rewrite
rule). Forgetting about characterization constraints for a minute, (1) and
(2) above should not allow duplication as it would copy a partial mask and
therefore alter it. (3) could possibly allow for duplicating masked
substrings (but see below).

Another issue, also unlikely, and also requiring us to set aside
characterization constraints, is whether masked substrings can swap
positions. E.g., if we capture two masked email addresses and have "\2; \1"
on the RHS of the rewrite rule.

Now, if we consider changes in characterization (aside from simple shifts
to the left or right) within a mask to be something that can block a
rewrite rule, then we will disallow both duplication and swapping of masked
substrings. This is because accurate characterization stops when the
backreferences are not monotonically increasing from 1 without gaps, so
both "\1 \1" and "\2 \1" will break characterization, and thus block the
rewrite rule even if the masked content was not otherwise altered.
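
As a rough sketch of that check (the function name is hypothetical, and it
assumes backreferences on the RHS are written as a backslash followed by
digits and that only the RHS needs inspecting):

```python
import re


def backrefs_are_sequential(rhs):
    """True if the backreferences in rhs read 1, 2, 3, ... with no gaps,
    repeats, or reordering."""
    refs = [int(n) for n in re.findall(r"\\(\d+)", rhs)]
    return refs == list(range(1, len(refs) + 1))


for rhs in (r"( \1", r"\1 )", r"\1 \2", r"\1 \1", r"\2 \1", r"\2; \1"):
    print(repr(rhs), backrefs_are_sequential(rhs))
```

Under these assumptions the two rules from the example pass, while the
duplicating and swapping templates are rejected.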

I guess my question is: does this fit with your expectations of how masks
should behave in REPP? That is, no duplication and no swapping?

On Sun, Aug 9, 2020 at 2:48 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> hi again, woodley:
>
> > I got to the point of being able to play around a bit with rules,
> > anyway.  I can mask email addresses, but as far as I can tell, no
> > subsequent rules are ever even trying to do anything inside of them.  Is
> > this actually a good test case?  I get a single identical token for the
> > email address in the below example, before and after implementing the
> > masking idea:
>
> i am happy to hear you were able to confirm your optimistic
> expectation that masking would not be too difficult to implement :-).
>
> i shall add a few more masking rules to the ERG trunk this coming
> week, but i would think the following could be a useful test case to
> explore the interaction of masking and rewriting (i would expect
> eleven tokens):
>
> stephan, oe at yy.com, oe at ellingsen-oepen.net, or привет@радио-москва.рф,
> called.
>
> > Besides looking prettier, Mike's regex has the advantage of working in
> > Boost's POSIX regex interface, whereas Stephan's does not.  I am not
> > particularly eager to change to a different regex API.  Boost regex has
> > multiple ways to call it, and for whatever reason, the POSIX way does not
> > support the \p{} syntax.
>
> i would suggest we leave aesthetic judgments to the maintainers of the
> REPP rules, but in this case i put in unicode properties for a reason:
> i am eager to take into use the \p{} syntax because (unlike classic
> character ranges or shorthands like \w) it is unambiguously defined
> across engines, independent of locales.  more importantly, i expect
> unicode properties will afford a cleaner and more general solution to
> normalization of punctuation, e.g. different types of whitespace and
> various conventions for opening and closing quote marks; unicode
> properties may also help in dealing with interspersed foreign content.
>
> it appears Boost regex offers full unicode support when combined with
> ICU, which i would guess ACE is using from before?  so, i am hoping
> that full unicode support in regular expressions (in REPP and chart
> mapping) might become available with relatively minor adjustments of
> how you call into the Boost regex engine?
>
>
> https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/unicode.html
>
> > I ended up using the BIO-encoded representation of what's masked that
> > Mike proposed, so I can mask two adjacent spans and then still insert
> > material between them, but block changing material inside of the masked
> > regions.  In my implementation, material copied by capture group is OK but
> > material rewritten literally on the RHS of a replace fails currently,
> > because that material ends up being marked as unmasked, whereas the check
> > requires identical content, characterization, and mask tags for everything
> > in a masked area.
>
> that all sounds compatible with my intuitions about how i would like
> the masking to behave.  in general, i am hoping to discourage literal
> rewriting, as it has the potential to weaken characterization
> accounting.
>
> many thanks for working on this!  oe
>


-- 
-Michael Wayne Goodman