<div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Aug 4, 2020 at 12:52 AM Stephan Oepen <<a href="mailto:oe@ifi.uio.no" target="_blank">oe@ifi.uio.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">hi again, mike, and many thanks for the quick response!<br> <br> > Ok, so if I understood correctly, masking is not sequential like rewrite rules, and happens before the rewrite rules regardless of where the mask pattern appears in the file (just as the tokenization pattern is applied after the rewrite rules), and the order of application of the mask patterns doesn't matter.<br> <br> that is in fact not what i had intended. i would like the masking<br> rules to follow the standard sequential flow of control in REPP, i.e.<br> they get invoked when the processor gets to that point in the rule<br> sequence. for full generality, i imagine one might want to allow some<br> string-level normalization prior to mask invocation. the effects of a<br> successful mask matching will be valid from that point in the<br> processing sequence onwards.<br></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Sorry, I misinterpreted what you meant by "masking sub-strings [...] 
prior to core REPP processing" in the original email.</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> on this view, i believe your clarification questions (1) and (2) do<br> not apply, right?<br></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Correct, although my related questions (in the GitHub issue) still stand. We can deal with those later.</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <span class="gmail_default" style="font-family:arial,helvetica,sans-serif">[...]</span><br> <br> > Finally, do we want to block rewrite rules where a capture group starts or ends within a mask? I can imagine multiple capture groups that collectively copy the entire masked region without alteration. I think this situation wouldn't be too bad if we just check that the before and after masked substrings have the same contents *and* the characterization is constant (the same offset for the whole mask).<br> <br> i am not quite sure what exactly you have in mind here regarding<br> constant characterization (masked sub-strings can be shifted to the<br> left or the right, but their length and content must not change)?</blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">By "the same offset for the whole mask" I am referring to the start and end positions that are tracked for each character. 
The offset itself may change (indicating that the masked region has shifted left or right), but all start and end offsets within a masked region must have the same offset; otherwise, the mismatch indicates that the length has changed or that content has been replaced.<br></div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> my<br> original assumption was to just disallow rewriting without capture<br> groups inside (or overlapping with) a masked region. this feels like<br> a simple and clear constraint to me. on this view, two adjacent<br> capture groups that cover (at least) the complete masked region would<br> be fine, but even single-character identity rewriting (as in your '@'<br> example) should be blocked. i fail to see a compelling need for that<br> kind of rewriting in the first place, and i would like to not<br> complicate masking support too much. i imagine it might be relatively<br> straightforward to evaluate rewriting conditions while synthesizing<br> the output (i.e. while processing the right-hand side of a rule),<br> interleaved with the character-level accounting.<br></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">I agree that these cases are extremely unlikely. Still, I think that being too permissive with these seemingly trivial decisions can lead to unexpected bugs later. 
For instance, if we allow multiple capture groups to piece together the original masked string and we check the rewriting to ensure the string hasn't changed, the following rules might cause problems, depending on implementation:<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> ; mask "abc"<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> =abc</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> ; full mask is captured and rewritten contiguously, but string and offsets change<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> !(a)(b)(c) \2\1\3</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> ; full mask is captured, only part is written<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> !(a(b)(c)) \2\3</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> ; full mask is captured and rewritten contiguously, but 'b' is duplicated<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> !(a(b))(c) \1\2\3<br></div></div><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">I feel that analyzing the regex on the left and the template on the right to ensure that the full masked substring is recreated contiguously, completely, and in order is an overly complicated solution. Perhaps when I write this code I'll see something that makes it easy to compute. But barring that, I proposed post-rule-application checks that the start/end offsets in each mask are uniform and that the contents of those substrings are identical to the corresponding pre-rule-application substrings. 
These checks alone would not block the '@' example, since they catch identity rewrites only as a side effect: replacing a single non-captured character does not break the uniformity of the offsets (and in this case the string didn't change, either). When 2 or more non-captured characters are replaced, the offsets become non-uniform, even if the replaced characters are identical to the input. The '@' example could probably be blocked with a third check that no non-captured material is inserted in a mask; at least, this sounds much simpler than tracking the captured groups.<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">The alternative, where a rewrite rule is blocked if a capture group begins or ends within a mask, sounds like a special case that would be confusing for a grammar developer not familiar with the full REPP specification.<br></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> <span class="gmail_default" style="font-family:arial,helvetica,sans-serif">[...]</span><br></blockquote></div><br>-- <br><div dir="ltr">-Michael Wayne Goodman</div></div>