[developers] extension to the REPP sub-formalism

goodman.m.w at gmail.com
Sun Oct 31 17:37:26 UTC 2021


Hi Stephan, all,

Thanks for coming back to the thread.

On Sun, Oct 31, 2021 at 12:47 AM Stephan Oepen <oe at ifi.uio.no> wrote:

> [...]
> > Here I'd expect "abb" as the output because the second occurrence of "a"
> > should have different characterization as the input string.
>
> why exactly do you think duplicating the masked capture group should be
> blocked?  if i read your rules right, a full masked region is matched by
> the replacement rule here.  so both copies in the output remain ‘intact’,
> in the sense of preserving the masked content and its characterization, i
> would think?
>
> similarly, i do not immediately see on which grounds we should block
> swapping around two complete masked regions?  i wonder whether your current
> code actually operationalizes sufficient notions of preserving masked
> sub-strings?
>

For both of these cases, it is because I understood masking as working
within the framework we have for REPP characterization and not preempting
or co-opting the backreferences in rewrite rules. From Footnote 9 in Dridan
and Oepen, 2012, in reference to characterization of capture groups:

    9 If capture group references are used out-of-order, however,
    the per-group linkage is no longer well-defined, and we resort
    to the maximum-span ‘union’ of boundary points (see below).

I recall that, during the development of PyDelphin's repp module, I
confirmed this statement with tests against the C++ REPP implementation
(unfortunately I'm having trouble compiling it at the moment so I'm going
from memory). So the following would have fully traceable characterization
for all backreferences on the RHS:

    !(ab)(cd)(ef)    \1 \2 \3
    !(ab)(cd)(ef)    \1 \2
    !(ab)(cd)(ef)    \1

But the following would not:

    !(ab)(cd)(ef)    \1 \3 \2  (out of order)
    !(ab)(cd)(ef)    \1 \1 \2  (not strictly increasing)
    !(ab)(cd)(ef)    \1 \3  (gap before \3)
    !(ab)(cd)(ef)    \2 \3  (gap before \2)

For instance:

    $ cat test.rpp
    !(ab)(cd)(ef) \1 \3 \2
    $ delphin repp -m test.rpp -f triple <<<"abcdef"
    (0, 2, ab)
    (2, 6, ef)
    (2, 6, cd)

Note that the CFROM:CTO values for ef and cd are *not* 4:6 and 2:4,
respectively; both get the union span 2:6 instead. So if we attempt to
duplicate a mask or swap two mask positions, the rewrite rule would not
preserve characterization for at least the second backreference and, thus,
the rule would be blocked.
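
To make the fallback concrete, here is a rough Python sketch that reproduces
the triples above. It is not the PyDelphin code: `characterize` is a made-up
name, and it assumes that once a backreference is out of place, it and
everything after it get the span from the end of the last traceable group to
the end of the match. That assumption fits this example, but I have not
checked it against other cases.

```python
import re


def characterize(pattern, rhs_refs, text):
    """Return (cfrom, cto, form) triples for the backreferences in rhs_refs.

    Hypothetical helper for illustration only. rhs_refs is the list of
    group numbers in the order they appear on the RHS, e.g. [1, 3, 2].
    """
    m = re.match(pattern, text)
    if m is None:
        return []

    # Count the leading references that appear as \1, \2, ... in order.
    traced = 0
    for position, ref in enumerate(rhs_refs, start=1):
        if ref != position:
            break
        traced = position

    # Assumed fallback: from the end of the last traced group (or the start
    # of the match if nothing was traced) to the end of the whole match.
    fallback_start = m.end(traced) if traced else m.start()
    fallback = (fallback_start, m.end())

    triples = []
    for position, ref in enumerate(rhs_refs, start=1):
        start, end = m.span(ref) if position <= traced else fallback
        triples.append((start, end, m.group(ref)))
    return triples


print(characterize(r'(ab)(cd)(ef)', [1, 3, 2], 'abcdef'))
# [(0, 2, 'ab'), (2, 6, 'ef'), (2, 6, 'cd')]
```

With the in-order RHS `[1, 2, 3]` the same function returns the exact
per-group spans 0:2, 2:4, and 4:6.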

Side note: since, normally, the question of which backreferences in a
rule's RHS are fully traceable is not sensitive to the input of the rewrite
rule, my implementation analyzes the RHS at model-load time and not
rule-application time. If we declare that masks don't play by the normal
rules, then I could no longer perform this optimization as it's unknown a
priori whether a rewrite rule will be applied to masked or unmasked content.
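
The point that this analysis needs only the RHS can be sketched as follows.
Again, this is an illustration and not the actual implementation:
`traceable_refs` is a hypothetical name, the criterion (references must
appear as \1, \2, ... with no gaps, repeats, or reordering) is taken from the
example lists above, and the regex ignores details such as escaped
backslashes.

```python
import re


def traceable_refs(rhs):
    """For each backreference in rhs, say whether it is still traceable.

    A reference counts as traceable only if it and every reference before
    it appear in the positions \\1, \\2, \\3, ... (assumed criterion).
    """
    refs = [int(n) for n in re.findall(r'\\(\d+)', rhs)]
    flags = []
    still_in_order = True
    for position, ref in enumerate(refs, start=1):
        still_in_order = still_in_order and ref == position
        flags.append(still_in_order)
    return flags


# No input string is needed, so this can run once when the rule is loaded.
for rhs in (r'\1 \2 \3', r'\1 \3 \2', r'\1 \1 \2', r'\2 \3'):
    print(rhs, traceable_refs(rhs))
```

If masks were exempt from the normal rules, a table like this could not be
computed ahead of time, because the answer would depend on whether the
matched input happened to be masked.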

Mainly for the logical reason above (concerning characterization and
rule-blocking), secondarily because of the implementation issue, and also
because it seems doubtful that masks will ever be duplicated or swapped in
practice, I'm quite happy to simply state that such operations are blocked.
Do you have a strong case for exempting masked regions from the normal
characterization behavior?

Finally, I think there might be a regex hack that allows for duplicated
masked regions, at least. The following captures the same span twice:

    !((abc))    \1 \2

This doesn't work in my current implementation (possibly due to a bug), so
for now it's just theoretical.
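
At the plain regex level, at least, the nested groups do capture the same
span. A quick check with Python's `re` module (this says nothing about how
any REPP implementation characterizes the result):

```python
import re

# Two nested groups around the same text: group 1 is the outer pair of
# parentheses, group 2 the inner pair.
m = re.fullmatch(r'((abc))', 'abc')
outer, inner = m.span(1), m.span(2)
print(outer, inner)  # (0, 3) (0, 3)
```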

-- 
-Michael Wayne Goodman