[developers] extension to the REPP sub-formalism

Tue Aug 4 04:11:25 CEST 2020

On Tue, Aug 4, 2020 at 12:52 AM Stephan Oepen <oe at ifi.uio.no> wrote:

> hi again, mike, and many thanks for the quick response!
>
> > Ok, so if I understood correctly, masking is not sequential like rewrite
> > rules, and happens before the rewrite rules regardless of where the mask
> > pattern appears in the file (just as the tokenization pattern is applied
> > after the rewrite rules), and the order of application of the mask patterns
> > doesn't matter.
>
> that is in fact not what i had intended.  i would like the masking
> rules to follow the standard sequential flow of control in REPP, i.e.
> they get invoked when the processor gets to that point in the rule
> sequence.  for full generality, i imagine one might want to allow some
> string-level normalization prior to mask invocation.  the effects of a
> successful mask matching will be valid from that point in the
> processing sequence onwards.
>

Sorry, I misinterpreted what you meant by "masking sub-strings [...] prior
to core REPP processing" in the original email.

> on this view, i believe your clarification questions (1) and (2) do
> not apply, right?
>

Correct, although my related questions (in the GitHub issue) still stand.
We can deal with those later.

[...]
>
> > Finally, do we want to block rewrite rules where a capture group starts
> > or ends within a mask? I can imagine multiple capture groups that
> > collectively copy the entire masked region without alteration. I think this
> > situation wouldn't be too bad if we just check that the before and after
> > masked substrings have the same contents *and* the characterization is
> > constant (the same offset for the whole mask).
>
> i am not quite sure what exactly you have in mind here regarding
> constant characterization (masked sub-strings can be shifted to the
> left or the right, but their length and content must not change)?

By "the same offset for the whole mask" I am referring to the start and end
positions that are tracked for each character. The offset itself may change
(indicating the masked region shifting left or right), but all start and
end offsets within a masked region must be the same. If they differ, the
length has changed or content has been replaced.
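
As a rough illustration of the uniformity condition, here is a minimal
Python sketch. It assumes (my assumption here, not something fixed by the
REPP specification) that each character of the current string carries a
start shift and an end shift, defined as original position minus current
position. The function name and the two-list representation are
hypothetical.

```python
def mask_offsets_uniform(start_shifts, end_shifts, mask_start, mask_end):
    """Return True if all characters in [mask_start, mask_end) share one shift.

    start_shifts / end_shifts: per-character shifts (hypothetical layout),
    each computed as original position minus current position.
    """
    shifts = set(start_shifts[mask_start:mask_end])
    shifts |= set(end_shifts[mask_start:mask_end])
    # Zero or one distinct value means the region moved as a single block.
    return len(shifts) <= 1


# "xabc" -> "abc": the leading "x" was deleted, so every character of the
# mask "abc" now sits one position to the left of where it started.
shifted_ok = mask_offsets_uniform([1, 1, 1], [1, 1, 1], 0, 3)

# Mixed shifts inside the mask: something was inserted, removed, or
# replaced within the region.
changed = mask_offsets_uniform([0, 1, 1], [0, 1, 1], 0, 3)
```

Here shifted_ok is True and changed is False: the mask may move as a
whole, but not unevenly.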

>   my
> original assumption was to just disallow rewriting without capture
> groups inside (or overlapping with) a masked region.  this feels like
> a simple and clear constraint to me.  on this view, two adjacent
> capture groups that cover (at least) the complete masked region would
> be fine, but even single-character identity rewriting (as in your '@'
> example) should be blocked.  i fail to see a compelling need for that
> kind of rewriting in the first place, and i would like to not
> complicate masking support too much.  i imagine it might be relatively
> straightforward to evaluate rewriting conditions while synthesizing
> the output (i.e. while processing the right-hand side of a rule),
> interleaved with the character-level accounting.
>

I agree that these cases are extremely unlikely, but I think that being too
permissive with these seemingly trivial decisions can lead to unexpected
bugs later. For instance, if we allow multiple capture groups to piece
together the original masked string and we oversee the rewriting to ensure
it hasn't changed, rules like these might cause problems, depending on the
implementation:

    ; mask "abc"
    =abc
    ; full mask is captured and rewritten contiguously, but string and
    ; offsets change
    !(a)(b)(c)    \2\1\3
    ; full mask is captured, only part is written
    !(a(b)(c))    \2\3
    ; full mask is captured and rewritten contiguously, but 'b' is
    ; duplicated
    !(a(b))(c)    \1\2\3

I feel that analyzing the regex on the left and the template on the
right to ensure that the full masked substring is recreated contiguously,
completely, and in order is an overly complicated solution. Perhaps when I
write this code I'll see something that makes it easy to compute. But
barring that, I proposed using post-rule-application checks that the
start/end offsets in each mask are uniform and that the contents of those
substrings are identical to the corresponding pre-rule-application
substrings. These checks alone would not block the '@' example, but only as
a side effect: replacing a single non-captured character does not
break the uniformity of the offsets (and in this case the string didn't
change, either). When 2 or more non-captured characters are replaced, the
offsets become non-uniform, even if the replaced characters are identical
to the input. The '@' example could probably be blocked with a third check
that no non-captured material is inserted in a mask; at least, this sounds
much simpler than tracking the captured groups.
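
To show what the content check would catch, here is a small Python sketch
that runs the three problem rules from above (plus one harmless rule that
only shifts the mask) through Python's re.sub. Note the assumptions: Python's
regex engine stands in for a REPP processor, the helper apply_and_check is
hypothetical, and the test "mask in output" is a simplification that ignores
where the mask ends up and assumes it occurs only once.

```python
import re

MASK = "abc"

# (pattern, template) pairs corresponding to the REPP rules above, written
# for Python's re module; the last one is an extra rule for comparison.
RULES = [
    (r"(a)(b)(c)", r"\2\1\3"),  # reorders the mask contents
    (r"(a(b)(c))", r"\2\3"),    # writes back only part of the mask
    (r"(a(b))(c)", r"\1\2\3"),  # duplicates 'b'
    (r"x(abc)", r"\1"),         # deletes 'x'; the mask only shifts left
]


def apply_and_check(text, pattern, template, mask=MASK):
    """Apply one rewrite and report whether the masked content survived."""
    output = re.sub(pattern, template, text)
    # Simplified content check: the masked string must still be present.
    return output, mask in output


results = [apply_and_check("xabc", p, t) for p, t in RULES]
for (pattern, template), (output, ok) in zip(RULES, results):
    print(f"{pattern!r} -> {template!r}: {output!r} mask intact: {ok}")
```

The first three rules yield "xbac", "xbc", and "xabbc", so the content check
rejects all of them without any analysis of the capture groups, while the
last rule yields "abc" and passes.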

The alternative where a rewrite rule is blocked if capture groups begin or
end within a mask sounds like a special case that would be confusing for a
grammar developer not familiar with the full REPP specification.

> [...]
>

-- 
-Michael Wayne Goodman