[developers] extension to the REPP sub-formalism

goodman.m.w at gmail.com goodman.m.w at gmail.com
Mon Sep 20 07:10:26 UTC 2021


Thanks, Woodley,

I looked over the code you sent. As I suspected, your implementation can
check the mask after the effects of a rewrite rule are resolved (it has the
full string and characterization), unlike mine, which only has the context
of a single match as it's being rewritten, plus characterization deltas.
Given that, looping over the input and output strings to check for masks is
reasonable. But because the loop breaks early when a matching mask is found,
it doesn't block duplication of masked capture groups, as this test shows:

    $ cat tst/mask2.rpp
    =a
    !(.)    \1\1
    :[\t ]
    $ echo "ab " | repp/repp tst/mask2.rpp -y
    LOADED a MASK expression: a
    EXECUTING MASK pattern...
    MASKING a
    (0, 0, 1, <0:2>, 1, "aabb", 0, "null")

Here I'd expect "abb" as the output, because the second occurrence of "a"
would have a different characterization from the "a" in the input string.
It also doesn't seem to block the swapping of capture groups, but I'm not
yet sure why.
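
A minimal sketch of the difference, in Python. This is not taken from the
code under discussion: the cell representation (character, source offset,
BIO tag) and the names masked_spans, loose_check, and strict_check are all
hypothetical, and the duplicated "a" is assumed to keep its source offset.
It only illustrates why a check that stops once each mask has been found
somewhere lets duplication and swapping through, while comparing the ordered
list of masked spans does not.

```python
from typing import List, Tuple

# Hypothetical representation: one cell per character, holding the
# character, the offset it came from in the original input (its
# characterization), and its BIO mask tag.
Cell = Tuple[str, int, str]
Span = Tuple[Tuple[str, int], ...]


def masked_spans(cells: List[Cell]) -> List[Span]:
    """Collect each masked run (a "B" followed by "I"s) in order."""
    spans: List[Span] = []
    current = None
    for char, src, tag in cells:
        if tag == "B":
            if current is not None:
                spans.append(tuple(current))
            current = [(char, src)]
        elif tag == "I" and current is not None:
            current.append((char, src))
        else:
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


def loose_check(before: List[Cell], after: List[Cell]) -> bool:
    """Pass as soon as each input mask is found somewhere in the output."""
    found = masked_spans(after)
    return all(span in found for span in masked_spans(before))


def strict_check(before: List[Cell], after: List[Cell]) -> bool:
    """Require the same masked spans, the same number of times, in order."""
    return masked_spans(before) == masked_spans(after)


# "ab " with "a" masked, then every character doubled as in tst/mask2.rpp.
before = [("a", 0, "B"), ("b", 1, "O"), (" ", 2, "O")]
doubled = [cell for cell in before for _ in range(2)]
print(loose_check(before, doubled), strict_check(before, doubled))
```

With these assumptions the loose check accepts the doubled string and the
strict one rejects it. The same holds if two masked spans trade places.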


On Thu, Sep 16, 2021 at 11:39 PM Woodley Packard <sweaglesw at sweaglesw.org>
wrote:

> Hi Mike,
>
> It's been over a year since I've thought about this, so it is far from
> fresh in my mind.  My repp implementation is not in a public-facing version
> control system; however, I put a copy of the initial work I did on this
> last year here since you expressed interest:
>
> http://sweaglesw.org/linguistics/repp-mask-trial.tgz
>
> The regex unicode property syntax problem is unresolved.  I'm afraid I
> also can't offer any kind of expectation or intuition as to how masks
> should behave relative to your questions, though, other than agreeing with
> your assessment that a rule that intentionally duplicates masked content
> seems far-fetched.
>
> -Woodley
>
>
> On Sep 10, 2021, at 10:58 PM, goodman.m.w at gmail.com wrote:
>
> Hi all, I'm now returning to REPP masking support in PyDelphin and I have
> some questions.
>
> Firstly, Woodley, do you have your implementation of REPP somewhere
> accessible? The latest packaged tarball on sweaglesw.org (0.2.2) was last
> touched in 2011.
>
> Secondly, I got PyDelphin to support masking as we discussed, but my
> implementation turned up some strange edge cases where regex captures are
> concerned. Part of this may be due to how, unlike Woodley's, Bec's, and
> Glenn's REPP libraries, mine does not execute a regular expression anew for
> each match (you may recall from another thread that this is a source of
> differences when material is deleted at the start of the string). Instead,
> it finds all matches in the string before rewriting. As a consequence, I
> need to determine if masked material has been altered within a single
> match, leading to the following conditions:
>
> 1. If the first mask value (in the BIO-tagging scheme) in the matched
> substring is "I", then the match begins in the middle of some masked
> substring. The result must not shift the position of this mask.
> 2. Similarly, if the first mask value *after* the match is "I", then the
> match ends in the middle of a masked substring. This mask also should not
> move.
> 3. All other masked substrings in a match are entirely covered and may
> shift around.
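
The three conditions above can be sketched roughly as follows. This is an
illustration only, not PyDelphin's code: bio_tags and classify_match are
hypothetical names, masks are assumed to be given as (start, end) character
spans, and the sample string "(<ab>)" is made up for the example.

```python
from typing import List, Tuple


def bio_tags(length: int, masks: List[Tuple[int, int]]) -> List[str]:
    """Tag each character position "B", "I", or "O" from (start, end) spans."""
    tags = ["O"] * length
    for start, end in masks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def classify_match(tags: List[str], start: int, end: int) -> Tuple[bool, bool]:
    """Return (starts_inside_mask, ends_inside_mask) for a match span.

    starts_inside_mask: the first tag in the match is "I" (condition 1).
    ends_inside_mask: the first tag after the match is "I" (condition 2).
    If both are False, every mask in the match is fully covered (condition 3).
    """
    starts_inside = start < len(tags) and tags[start] == "I"
    ends_inside = end < len(tags) and tags[end] == "I"
    return starts_inside, ends_inside


# Mask "<ab>" inside "(<ab>)": tags are O B I I I O.
tags = bio_tags(6, [(1, 5)])
print(tags)
print(classify_match(tags, 0, 2))  # match "(<" ends inside the mask
print(classify_match(tags, 4, 6))  # match ">)" starts inside the mask
print(classify_match(tags, 0, 6))  # match covers the whole mask
```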
>
> As an example, imagine we have masked the substring "<oe at yy.com>" in
> the full string "(<oe at yy.com>)" and we have rewrite rules that insert a
> space between an opening or closing parenthesis and an adjacent captured
> punctuation character (here < and >):
>
>     !\((\p{P})    ( \1
>     !(\p{P})\)    \1 )
>
> Each of these rules captures only part of the masked substring, so the
> captured character must stay at the right (first rule) or left (second
> rule) boundary of the match in the output of the rule.
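
A rough Python sketch of that boundary check for the example. Several things
here are assumptions: Python's standard re module has no \p{...} support, so
the class [<>] stands in for \p{P} (in Unicode, < and > are in category Sm
rather than P, so the stand-in also sidesteps that detail); the address is
written with "@"; masked_edges and keeps_edges are hypothetical names; and
only string content is compared, not characterization.

```python
import re
from typing import List, Tuple


def bio_tags(length: int, masks: List[Tuple[int, int]]) -> List[str]:
    """Tag each character position "B", "I", or "O" from (start, end) spans."""
    tags = ["O"] * length
    for start, end in masks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def masked_edges(tags: List[str], start: int, end: int) -> Tuple[int, int]:
    """Lengths of partially covered masks at the left and right match edges."""
    prefix = 0
    if start < len(tags) and tags[start] == "I":
        while start + prefix < end and tags[start + prefix] == "I":
            prefix += 1
    suffix = 0
    if end < len(tags) and tags[end] == "I":
        i = end
        while i > start and tags[i - 1] == "I":
            i -= 1
        if i > start and tags[i - 1] == "B":
            i -= 1
        suffix = end - i
    return prefix, suffix


def keeps_edges(tags: List[str], match: "re.Match", template: str) -> bool:
    """True if partially masked material stays at the edges of the output."""
    prefix, suffix = masked_edges(tags, *match.span())
    matched = match.group(0)
    out = match.expand(template)
    left_ok = out.startswith(matched[:prefix])
    right_ok = suffix == 0 or out.endswith(matched[-suffix:])
    return left_ok and right_ok


text = "(<oe@yy.com>)"
tags = bio_tags(len(text), [(1, 12)])  # mask "<oe@yy.com>"
rules = [(r"\(([<>])", r"( \1"), (r"([<>])\)", r"\1 )")]
for pattern, template in rules:
    for m in re.finditer(pattern, text):
        print(m.group(0), "->", m.expand(template), keeps_edges(tags, m, template))
```

Both rules keep the captured character at the mask-side edge of the match. A
right-hand side such as "\1 (" for the first rule would move "<" away from
the right edge and fail the check.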
>
> One issue, however unlikely, is whether a masked substring can be
> duplicated (e.g., the backreference is repeated in the RHS of a rewrite
> rule). Forgetting about characterization constraints for a minute, (1) and
> (2) above should not allow duplication as it would copy a partial mask and
> therefore alter it. (3) could possibly allow for duplicating masked
> substrings (but see below).
>
> Another issue, also unlikely, and also requiring us to set aside
> characterization constraints, is whether masked substrings can swap
> positions. E.g., if we capture two masked email addresses and have "\2; \1"
> on the RHS of the rewrite rule.
>
> Now, if we consider changes in characterization (aside from simple shifts
> to the left or right) within a mask to be something that can block a
> rewrite rule, then we will disallow both duplication and swapping of masked
> substrings. This is because accurate characterization stops when the
> backreferences are not monotonically increasing from 1 without gaps, so
> both "\1 \1" and "\2 \1" will break characterization, and thus block the
> rewrite rule even if the masked content was not otherwise altered.
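
The backreference condition can be checked on the right-hand side of a rule
alone. A small sketch, assuming backreferences are written \N and ignoring
escaped backslashes; backrefs and characterization_safe are hypothetical
names, not part of any REPP implementation:

```python
import re
from typing import List


def backrefs(template: str) -> List[int]:
    """Return the backreference numbers in a rule's RHS, in order."""
    return [int(n) for n in re.findall(r"\\(\d+)", template)]


def characterization_safe(template: str) -> bool:
    """True if the backreferences are exactly 1, 2, 3, ... with no gaps,
    repeats, or reordering."""
    refs = backrefs(template)
    return refs == list(range(1, len(refs) + 1))


for rhs in [r"( \1", r"\1; \2", r"\1 \1", r"\2 \1", r"\1\3"]:
    print(repr(rhs), characterization_safe(rhs))
```

Under this test "\1 \1" (duplication) and "\2 \1" (swapping) both fail, as
does the "\1\1" right-hand side of a rule that doubles every character.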
>
> I guess my question is: does this fit with your expectations of how masks
> should behave in REPP? That is, no duplication and no swapping?
>
> On Sun, Aug 9, 2020 at 2:48 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> hi again, woodley:
>>
>> > I got to the point of being able to play around a bit with rules,
>> > anyway.  I can mask email addresses, but as far as I can tell, no
>> > subsequent rules are ever even trying to do anything inside of them.  Is
>> > this actually a good test case?  I get a single identical token for the
>> > email address in the below example, before and after implementing the
>> > masking idea:
>>
>> i am happy to hear you were able to confirm your optimistic
>> expectation that masking would not be too difficult to implement :-).
>>
>> i shall add a few more masking rules to the ERG trunk this coming
>> week, but i would think the following could be a useful test case to
>> explore the interaction of masking and rewriting (i would expect
>> eleven tokens):
>>
>> stephan, oe at yy.com, oe at ellingsen-oepen.net, or привет@радио-москва.рф
>> <привет@xn----7sbbhiyppqcpt.xn--p1ai>, called.
>>
>> > Besides looking prettier, Mike's regex has the advantage of working in
>> > Boost's POSIX regex interface, whereas Stephan's does not.  I am not
>> > particularly eager to change to a different regex API.  Boost regex has
>> > multiple ways to call it, and for whatever reason, the POSIX way does not
>> > support the \p{} syntax.
>>
>> i would suggest we leave aesthetic judgments to the maintainers of the
>> REPP rules, but in this case i put in unicode properties for a reason:
>> i am eager to take into use the \p{} syntax because (unlike classic
>> character ranges or shorthands like \w) it is unambiguously defined
>> across engines, independent of locales.  more importantly, i expect
>> unicode properties will afford a cleaner and more general solution to
>> normalization of punctuation, e.g. different types of whitespace and
>> various conventions for opening and closing quote marks; unicode
>> properties may also help in dealing with interspersed foreign content.
>>
>> it appears Boost regex offers full unicode support when combined with
>> ICU, which i would guess ACE is using from before?  so, i am hoping
>> that full unicode support in regular expressions (in REPP and chart
>> mapping) might become available with relatively minor adjustments of
>> how you call into the Boost regex engine?
>>
>>
>> https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/unicode.html
>>
>> > I ended up using the BIO-encoded representation of what's masked that
>> > Mike proposed, so I can mask two adjacent spans and then still insert
>> > material between them, but block changing material inside of the masked
>> > regions.  In my implementation, material copied by capture group is OK but
>> > material rewritten literally on the RHS of a replace fails currently,
>> > because that material ends up being marked as unmasked, whereas the check
>> > requires identical content, characterization, and mask tags for everything
>> > in a masked area.
>>
>> that all sounds compatible with my intuitions about how i would like
>> the masking to behave.  in general, i am hoping to discourage literal
>> rewriting, as it has the potential to weaken characterization
>> accounting.
>>
>> many thanks for working on this!  oe
>>
>
>
> --
> -Michael Wayne Goodman
>
>
>

-- 
-Michael Wayne Goodman