[developers] extension to the REPP sub-formalism

Fri Sep 17 06:39:07 UTC 2021

Hi Mike,

It's been over a year since I last thought about this, so it is far from fresh in my mind.  My REPP implementation is not in a public-facing version control system; however, since you expressed interest, I have put a copy of the initial work I did on this last year here:

http://sweaglesw.org/linguistics/repp-mask-trial.tgz

The regex Unicode property syntax problem is still unresolved.  I'm afraid I also can't offer any expectation or intuition about how masks should behave with respect to your questions, other than agreeing with your assessment that a rule that intentionally duplicates masked content seems far-fetched.

-Woodley

> On Sep 10, 2021, at 10:58 PM, goodman.m.w at gmail.com wrote:
> 
> Hi all, I'm now returning to REPP masking support in PyDelphin and I have some questions.
> 
> Firstly, Woodley, do you have your implementation of REPP somewhere accessible? The latest packaged tarball on sweaglesw.org (0.2.2) was last touched in 2011.
> 
> Secondly, I got PyDelphin to support masking as we discussed, but my implementation turned up some strange edge cases where regex captures are concerned. Part of this may be due to how, unlike Woodley's, Bec's, and Glenn's REPP libraries, mine does not execute a regular expression anew for each match (you may recall from another thread that this is a source of differences when material is deleted at the start of the string). Instead, it finds all matches in the string before rewriting. As a consequence, I need to determine if masked material has been altered within a single match, leading to the following conditions:
> 
> 1. If the first mask value (in the BIO-tagging scheme) in the matched substring is "I", then the match begins in the middle of some masked substring. The result must not shift the position of this mask.
> 2. Similarly, if the first mask value *after* the match is "I", then the match ends in the middle of a masked substring. This mask also should not move.
> 3. All other masked substrings in a match are entirely covered and may shift around.
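The difference between the two matching strategies described above can be sketched roughly as follows. This is only an illustration using Python's standard `re` module; the function names are hypothetical, and the match-by-match loop is an assumption about how an implementation that re-executes the regex for each match might behave, not a description of any of the libraries mentioned.

```python
import re


def rewrite_all_at_once(pattern: str, repl: str, text: str) -> str:
    """Find every match in the original string, then rewrite them all."""
    return re.sub(pattern, repl, text)


def rewrite_match_by_match(pattern: str, repl: str, text: str) -> str:
    """Re-run the search on the remaining text after each match (assumed behavior)."""
    out = []
    rest = text
    while rest:
        m = re.search(pattern, rest)
        if m is None or m.end() == 0:
            break  # no match, or an empty match at the start: stop to avoid looping
        out.append(rest[:m.start()])   # untouched material before the match
        out.append(m.expand(repl))     # the rewritten match
        rest = rest[m.end():]          # the next search starts from here
    out.append(rest)
    return "".join(out)


# A rule that deletes material anchored at the start of the string.
once = rewrite_all_at_once(r"^a", "", "aaa")
stepwise = rewrite_match_by_match(r"^a", "", "aaa")
```

With this anchored rule, the first strategy deletes one character and the second deletes all three, which is the kind of divergence at the start of the string mentioned above.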
> 
> As an example, imagine we have masked the substring "<oe@yy.com>" from the full string "(<oe@yy.com>)" and we have rewrite rules that insert a space between an opening or closing parenthesis and some captured punctuation (here < and >):
> 
>     !\((\p{P})    ( \1
>     !(\p{P})\)    \1 )
> 
> Each of these rules captures only part of the masked substring, so the mask must not move from the right (first rule) or left (second rule) boundary of the match in the rule's output.
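Conditions (1) and (2) can be sketched as a small check over per-character BIO tags. This is a minimal sketch, not PyDelphin's actual code: the function name `mask_cuts`, the representation of tags as a string with one letter per character, and the hard-coded match spans are all assumptions made for illustration.

```python
from typing import Tuple


def mask_cuts(tags: str, start: int, end: int) -> Tuple[bool, bool]:
    """Report whether the match [start, end) begins or ends inside a mask.

    `tags` holds one BIO tag ('B', 'I', or 'O') per character of the string.
    """
    # Condition 1: the first tag inside the match is 'I'.
    cut_left = start < len(tags) and tags[start] == "I"
    # Condition 2: the first tag after the match is 'I'.
    cut_right = end < len(tags) and tags[end] == "I"
    return cut_left, cut_right


text = "(<oe@yy.com>)"
# The mask covers "<oe@yy.com>": 'B' on "<" and 'I' on the remaining ten characters.
tags = "O" + "B" + "I" * 10 + "O"

first_rule = mask_cuts(tags, 0, 2)     # span of "(<", as matched by the first rule
second_rule = mask_cuts(tags, 11, 13)  # span of ">)", as matched by the second rule
```

The first span is cut on its right edge and the second on its left edge, so in each case the mask has to stay put at that boundary. Any masked span lying wholly between `start` and `end` falls under condition (3).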
> 
> One issue, however unlikely, is whether a masked substring can be duplicated (e.g., the backreference is repeated in the RHS of a rewrite rule). Forgetting about characterization constraints for a minute, (1) and (2) above should not allow duplication as it would copy a partial mask and therefore alter it. (3) could possibly allow for duplicating masked substrings (but see below).
> 
> Another issue, also unlikely, and also requiring us to set aside characterization constraints, is whether masked substrings can swap positions. For example, we might capture two masked email addresses and have "\2; \1" on the RHS of the rewrite rule.
> 
> Now, if we consider changes in characterization (aside from simple shifts to the left or right) within a mask to be something that can block a rewrite rule, then we will disallow both duplication and swapping of masked substrings. This is because accurate characterization stops when the backreferences are not monotonically increasing from 1 without gaps, so both "\1 \1" and "\2 \1" will break characterization, and thus block the rewrite rule even if the masked content was not otherwise altered.
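The constraint on backreferences described above (increasing from 1 without gaps) is easy to test on the RHS of a rule. A minimal sketch, assuming backreferences are written as a backslash followed by digits and ignoring escaped backslashes; the function name is hypothetical:

```python
import re


def backrefs_in_order(rhs: str) -> bool:
    """True if the backreferences in `rhs` are exactly \\1, \\2, ... in that order."""
    refs = [int(n) for n in re.findall(r"\\(\d+)", rhs)]
    return refs == list(range(1, len(refs) + 1))


duplicated = backrefs_in_order(r"\1 \1")   # same group used twice
swapped = backrefs_in_order(r"\2; \1")     # groups out of order
plain = backrefs_in_order(r"( \1")         # RHS of the first example rule
```

Under this check both duplication and swapping fail, while the two parenthesis rules from the example pass.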
> 
> I guess my question is: does this fit with your expectations of how masks should behave in REPP? That is, no duplication and no swapping?
> 
> On Sun, Aug 9, 2020 at 2:48 PM Stephan Oepen <oe at ifi.uio.no> wrote:
> hi again, woodley:
> 
> > I got to the point of being able to play around a bit with rules, anyway.  I can mask email addresses, but as far as I can tell, no subsequent rules are ever even trying to do anything inside of them.  Is this actually a good test case?  I get a single identical token for the email address in the below example, before and after implementing the masking idea:
> 
> i am happy to hear you were able to confirm your optimistic
> expectation that masking would not be too difficult to implement :-).
> 
> i shall add a few more masking rules to the ERG trunk this coming
> week, but i would think the following could be a useful test case to
> explore the interaction of masking and rewriting (i would expect
> eleven tokens):
> 
> stephan, oe@yy.com, oe@ellingsen-oepen.net, or привет@радио-москва.рф, called.
> 
> > Besides looking prettier, Mike's regex has the advantage of working in Boost's POSIX regex interface, whereas Stephan's does not.  I am not particularly eager to change to a different regex API.  Boost regex has multiple ways to call it, and for whatever reason, the POSIX way does not support the \p{} syntax.
> 
> i would suggest we leave aesthetic judgments to the maintainers of the
> REPP rules, but in this case i put in unicode properties for a reason:
> i am eager to take into use the \p{} syntax because (unlike classic
> character ranges or shorthands like \w) it is unambiguously defined
> across engines, independent of locales.  more importantly, i expect
> unicode properties will afford a cleaner and more general solution to
> normalization of punctuation, e.g. different types of whitespace and
> various conventions for opening and closing quote marks; unicode
> properties may also help in dealing with interspersed foreign content.
> 
> it appears Boost regex offers full unicode support when combined with
> ICU, which i would guess ACE is using from before?  so, i am hoping
> that full unicode support in regular expressions (in REPP and chart
> mapping) might become available with relatively minor adjustments of
> how you call into the Boost regex engine?
> 
> https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/unicode.html
> 
> > I ended up using the BIO-encoded representation of what's masked that Mike proposed, so I can mask two adjacent spans and then still insert material between them, but block changing material inside of the masked regions.  In my implementation, material copied by capture group is OK but material rewritten literally on the RHS of a replace fails currently, because that material ends up being marked as unmasked, whereas the check requires identical content, characterization, and mask tags for everything in a masked area.
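The check described above (identical content, characterization, and mask tags for everything in a masked area) might look something like the following. This is a simplified sketch, not the actual ACE/REPP code: the `Cell` record, the use of a single original index to stand in for characterization, and the function names are all assumptions.

```python
from typing import List, NamedTuple, Optional


class Cell(NamedTuple):
    char: str
    origin: Optional[int]  # index in the original input; None for literal RHS material
    tag: str               # BIO mask tag: 'B', 'I', or 'O'


def cells_from(text: str, tags: str) -> List[Cell]:
    """Build one cell per input character."""
    return [Cell(ch, i, t) for i, (ch, t) in enumerate(zip(text, tags))]


def masked_cells(cells: List[Cell]) -> List[Cell]:
    return [c for c in cells if c.tag != "O"]


def rewrite_respects_masks(before: List[Cell], after: List[Cell]) -> bool:
    """Masked cells must survive with the same content, origin, and tag."""
    return masked_cells(before) == masked_cells(after)


before = cells_from("(<a>)", "OBIIO")
space = Cell(" ", None, "O")  # inserted material is unmasked and has no origin

# "<" copied through a capture group keeps its original cell.
copied = [before[0], space] + before[1:]
# "<" written literally on the RHS becomes a fresh, unmasked cell.
literal = [before[0], space, Cell("<", None, "O")] + before[2:]

copy_ok = rewrite_respects_masks(before, copied)
literal_ok = rewrite_respects_masks(before, literal)
```

The copied version passes and the literal version fails, even though both produce the same surface string.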
> 
> that all sounds compatible with my intuitions about how i would like
> the masking to behave.  in general, i am hoping to discourage literal
> rewriting, as it has the potential to weaken characterization
> accounting.
> 
> many thanks for working on this!  oe
> 
> 
> -- 
> -Michael Wayne Goodman
