[developers] extension to the REPP sub-formalism

Thu Sep 30 23:36:09 UTC 2021

I have now released PyDelphin 1.6.0 with REPP masking support as described
above. I invite you all to test it out and let me know what you think.
Here's an example (using the same mask2.rpp file as above):

    $ delphin repp -m mask2.rpp <<< "ab"
    (0, 0, 1, <0:2>, 1, "abb", 0, "null")
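
For anyone who wants to poke at the idea without the CLI, below is a minimal
Python sketch of the behavior on this one example. It is not PyDelphin's
actual code: the names bio_tags and rewrite_with_mask are made up for
illustration, it uses only the standard re module, it ignores
characterization entirely, and it simply skips any match that touches masked
material, which is stricter than the conditions discussed in this thread.

```python
import re


def bio_tags(text, mask_pattern):
    """Tag each character as B (mask start), I (inside mask), or O (unmasked)."""
    tags = ["O"] * len(text)
    for m in re.finditer(mask_pattern, text):
        if m.start() == m.end():
            continue  # ignore empty mask matches
        tags[m.start()] = "B"
        for i in range(m.start() + 1, m.end()):
            tags[i] = "I"
    return tags


def rewrite_with_mask(text, mask_pattern, rule_pattern, replacement):
    """Apply one rewrite rule, leaving alone any match that touches a mask.

    Simplifying assumption (hypothetical, for illustration only): a match
    is blocked outright if any character in it is masked.
    """
    tags = bio_tags(text, mask_pattern)

    def repl(m):
        touches_mask = any(t != "O" for t in tags[m.start():m.end()])
        return m.group(0) if touches_mask else m.expand(replacement)

    return re.sub(rule_pattern, repl, text)


# Mirrors mask2.rpp: mask "a", then duplicate every character.
result = rewrite_with_mask("ab", r"a", r"(.)", r"\1\1")
print(result)  # abb
```

Like PyDelphin, this finds all matches against the original string (via
re.sub) rather than re-running the regular expression after each rewrite.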

On Mon, Sep 20, 2021 at 12:10 AM goodman.m.w at gmail.com <
goodman.m.w at gmail.com> wrote:

> Thanks, Woodley,
>
> I looked over the code you sent. It looks like, as I suspected, your
> implementation can check the mask after the effects of a rewrite rule are
> resolved (the full string and characterization), unlike mine which only has
> the context of a single match as it's being rewritten and characterization
> deltas. Given that, looping over the input and output strings to check for
> masks is reasonable, but as it breaks early when a matching mask is found,
> it doesn't block duplication of masked capture groups, tested as follows:
>
>     $ cat tst/mask2.rpp
>     =a
>     !(.)    \1\1
>     :[\t ]
>     $ echo "ab " | repp/repp tst/mask2.rpp -y
>     LOADED a MASK expression: a
>     EXECUTING MASK pattern...
>     MASKING a
>     (0, 0, 1, <0:2>, 1, "aabb", 0, "null")
>
> Here I'd expect "abb" as the output because the second occurrence of "a"
> would have a different characterization than in the input string. It also
> doesn't seem to block the swapping of capture groups, but I'm not yet sure why.
>
>
> On Thu, Sep 16, 2021 at 11:39 PM Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>
>> Hi Mike,
>>
>> It's been over a year since I've thought about this, so it is far from
>> fresh in my mind.  My repp implementation is not in a public-facing version
>> control system; however, I put a copy of the initial work I did on this
>> last year here since you expressed interest:
>>
>> http://sweaglesw.org/linguistics/repp-mask-trial.tgz
>>
>> The regex unicode property syntax problem is unresolved.  I'm afraid I
>> also can't offer any kind of expectation or intuition as to how masks
>> should behave relative to your questions, though, other than agreeing with
>> your assessment that a rule that intentionally duplicates masked content
>> seems far-fetched.
>>
>> -Woodley
>>
>>
>> On Sep 10, 2021, at 10:58 PM, goodman.m.w at gmail.com wrote:
>>
>> Hi all, I'm now returning to REPP masking support in PyDelphin and I have
>> some questions.
>>
>> Firstly, Woodley, do you have your implementation of REPP somewhere
>> accessible? The latest packaged tarball on sweaglesw.org (0.2.2) was
>> last touched in 2011.
>>
>> Secondly, I got PyDelphin to support masking as we discussed, but my
>> implementation turned up some strange edge cases where regex captures are
>> concerned. Part of this may be due to how, unlike Woodley's, Bec's, and
>> Glenn's REPP libraries, mine does not execute a regular expression anew for
>> each match (you may recall from another thread that this is a source of
>> differences when material is deleted at the start of the string). Instead,
>> it finds all matches in the string before rewriting. As a consequence, I
>> need to determine if masked material has been altered within a single
>> match, leading to the following conditions:
>>
>> 1. If the first mask value (in the BIO-tagging scheme) in the matched
>> substring is "I", then the match begins in the middle of some masked
>> substring. The result must not shift the position of this mask.
>> 2. Similarly, if the first mask value *after* the match is "I", then the
>> match ends in the middle of a masked substring. This mask also should not
>> move.
>> 3. All other masked substrings in a match are entirely covered and may
>> shift around.
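
The three conditions above can be sketched as a small Python function. This
is only an illustration under stated assumptions, not the PyDelphin
implementation: classify_masks is a hypothetical name, the tags are a plain
list of "B"/"I"/"O" strings with one entry per character, and the match is
the half-open span [start, end). The example string "(<a@b>)" is a short
stand-in for the email example discussed in this thread.

```python
def classify_masks(tags, start, end):
    """Classify masked material relative to a match covering tags[start:end].

    Returns (left_pinned, right_pinned, free) where:
      left_pinned  -- the match begins in the middle of a mask (condition 1)
      right_pinned -- the match ends in the middle of a mask (condition 2)
      free         -- (start, end) spans of masks entirely covered by the
                      match (condition 3)
    """
    left_pinned = start < end and tags[start] == "I"
    right_pinned = end < len(tags) and tags[end] == "I"

    free = []
    i = start
    while i < end:
        if tags[i] == "B":
            # Walk to the end of this mask, even if it runs past the match.
            j = i + 1
            while j < len(tags) and tags[j] == "I":
                j += 1
            if j <= end:
                free.append((i, j))
            i = j
        else:
            i += 1
    return left_pinned, right_pinned, free


# "(<a@b>)" with "<a@b>" masked: O B I I I I O
tags = ["O", "B", "I", "I", "I", "I", "O"]

open_paren = classify_masks(tags, 0, 2)   # match "(<"
close_paren = classify_masks(tags, 5, 7)  # match ">)"
whole = classify_masks(tags, 0, 7)        # match covers everything
```

Here the "(<" match ends inside the mask, the ">)" match begins inside it,
and only a match covering the whole string contains a mask that is free to
shift.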
>>
>> As an example, imagine we have masked the substring "<oe at yy.com>" from
>> the full string "(<oe at yy.com>)" and we have rewrite rules that insert a
>> space between opening/closing parentheses and some captured punctuation
>> (here being < and >):
>>
>>     !\((\p{P})    ( \1
>>     !(\p{P})\)    \1 )
>>
>> These rules partially capture the masked substring, so the captured part
>> of the mask must not move from the right/left boundary of the match in the
>> output of the rule.
>>
>> One issue, however unlikely, is whether a masked substring can be
>> duplicated (e.g., the backreference is repeated in the RHS of a rewrite
>> rule). Forgetting about characterization constraints for a minute, (1) and
>> (2) above should not allow duplication as it would copy a partial mask and
>> therefore alter it. (3) could possibly allow for duplicating masked
>> substrings (but see below).
>>
>> Another issue, also unlikely, and also requiring us to set aside
>> characterization constraints, is whether masked substrings can swap
>> positions. E.g., if we capture two masked email addresses and have "\2; \1"
>> on the RHS of the rewrite rule.
>>
>> Now, if we consider changes in characterization (aside from simple shifts
>> to the left or right) within a mask to be something that can block a
>> rewrite rule, then we will disallow both duplication and swapping of masked
>> substrings. This is because accurate characterization stops when the
>> backreferences are not monotonically increasing from 1 without gaps, so
>> both "\1 \1" and "\2 \1" will break characterization, and thus block the
>> rewrite rule even if the masked content was not otherwise altered.
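
As a rough illustration of the monotonicity condition, here is a minimal
Python sketch. The function name backrefs_are_monotonic is hypothetical, and
it assumes backreferences are written as a backslash followed by digits with
no escaped backslashes in the right-hand side; it is not taken from any of
the REPP implementations.

```python
import re


def backrefs_are_monotonic(rhs):
    """Return True if the backreferences in rhs are exactly \\1, \\2, ..., \\n
    in that order, i.e. increasing from 1 without gaps or repeats."""
    refs = [int(n) for n in re.findall(r"\\(\d+)", rhs)]
    return refs == list(range(1, len(refs) + 1))


duplicated = backrefs_are_monotonic(r"\1 \1")  # False: \1 is repeated
swapped = backrefs_are_monotonic(r"\2 \1")     # False: order is reversed
simple = backrefs_are_monotonic(r"( \1")       # True: a single \1
```

Under this check, both the duplicating and the swapping right-hand sides are
flagged, while the parenthesis-spacing rule from the example is not.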
>>
>> I guess my question is: does this fit with your expectations of how masks
>> should behave in REPP? That is, no duplication and no swapping?
>>
>> On Sun, Aug 9, 2020 at 2:48 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>>
>>> hi again, woodley:
>>>
>>> > I got to the point of being able to play around a bit with rules,
>>> > anyway.  I can mask email addresses, but as far as I can tell, no
>>> > subsequent rules are ever even trying to do anything inside of them.  Is
>>> > this actually a good test case?  I get a single identical token for the
>>> > email address in the below example, before and after implementing the
>>> > masking idea:
>>>
>>> i am happy to hear you were able to confirm your optimistic
>>> expectation that masking would not be too difficult to implement :-).
>>>
>>> i shall add a few more masking rules to the ERG trunk this coming
>>> week, but i would think the following could be a useful test case to
>>> explore the interaction of masking and rewriting (i would expect
>>> eleven tokens):
>>>
>>> stephan, oe at yy.com, oe at ellingsen-oepen.net, or привет@радио-москва.рф
>>> <привет@xn----7sbbhiyppqcpt.xn--p1ai>, called.
>>>
>>> > Besides looking prettier, Mike's regex has the advantage of working in
>>> > Boost's POSIX regex interface, whereas Stephan's does not.  I am not
>>> > particularly eager to change to a different regex API.  Boost regex has
>>> > multiple ways to call it, and for whatever reason, the POSIX way does not
>>> > support the \p{} syntax.
>>>
>>> i would suggest we leave aesthetic judgments to the maintainers of the
>>> REPP rules, but in this case i put in unicode properties for a reason:
>>> i am eager to take into use the \p{} syntax because (unlike classic
>>> character ranges or shorthands like \w) it is unambiguously defined
>>> across engines, independent of locales.  more importantly, i expect
>>> unicode properties will afford a cleaner and more general solution to
>>> normalization of punctuation, e.g. different types of whitespace and
>>> various conventions for opening and closing quote marks; unicode
>>> properties may also help in dealing with interspersed foreign content.
>>>
>>> it appears Boost regex offers full unicode support when combined with
>>> ICU, which i would guess ACE is using from before?  so, i am hoping
>>> that full unicode support in regular expressions (in REPP and chart
>>> mapping) might become available with relatively minor adjustments of
>>> how you call into the Boost regex engine?
>>>
>>>
>>> https://www.boost.org/doc/libs/1_73_0/libs/regex/doc/html/boost_regex/unicode.html
>>>
>>> > I ended up using the BIO-encoded representation of what's masked that
>>> > Mike proposed, so I can mask two adjacent spans and then still insert
>>> > material between them, but block changing material inside of the masked
>>> > regions.  In my implementation, material copied by capture group is OK but
>>> > material rewritten literally on the RHS of a replace fails currently,
>>> > because that material ends up being marked as unmasked, whereas the check
>>> > requires identical content, characterization, and mask tags for everything
>>> > in a masked area.
>>>
>>> that all sounds compatible with my intuitions about how i would like
>>> the masking to behave.  in general, i am hoping to discourage literal
>>> rewriting, as it has the potential to weaken characterization
>>> accounting.
>>>
>>> many thanks for working on this!  oe
>>>
>>
>>
>> --
>> -Michael Wayne Goodman
>>
>>
>>
>
> --
> -Michael Wayne Goodman
>

-- 
-Michael Wayne Goodman