[developers] extension to the REPP sub-formalism

Mon Aug 3 18:52:04 CEST 2020

hi again, mike, and many thanks for the quick response!

> Ok, so if I understood correctly, masking is not sequential like rewrite rules, and happens before the rewrite rules regardless of where the mask pattern appears in the file (just as the tokenization pattern is applied after the rewrite rules), and the order of application of the mask patterns doesn't matter.

that is in fact not what i had intended.  i would like the masking
rules to follow the standard sequential flow of control in REPP, i.e.
they get invoked when the processor gets to that point in the rule
sequence.  for full generality, i imagine one might want to allow some
string-level normalization prior to mask invocation.  the effects of a
successful mask match will hold from that point in the processing
sequence onwards.

on this view, i believe your clarification questions (1) and (2) do
not apply, right?

> That makes sense, but we may need a different mechanism than just boolean flags because of the possibility of immediately adjacent masked regions looking like one solid region when we should allow material to be inserted between them. Instead, an IOB scheme (like in chunking) or similar would be better.

indeed, that is a good point (that i had not yet considered).  yes,
destructive rewriting in between two adjacent masked regions must be
allowed.
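
to make the difference concrete, here is a minimal sketch in python of
why an IOB vector keeps two adjacent masks apart where boolean flags
would not.  the function names and the half-open (start, end) span
convention are assumptions made for illustration only; nothing here is
taken from an existing REPP implementation, and the spans are assumed
not to overlap.

```python
def iob_tags(length, spans):
    """One tag per character: 'B' begins a mask, 'I' continues it, 'O' is unmasked.

    spans are half-open (start, end) character ranges, assumed non-overlapping.
    """
    tags = ["O"] * length
    for start, end in spans:
        for i in range(start, end):
            tags[i] = "B" if i == start else "I"
    return tags


def insertion_allowed(tags, pos):
    """True if material may be inserted immediately before character pos.

    Insertion is blocked only when the next character continues a mask,
    i.e. when pos lies strictly inside a masked region.
    """
    return pos >= len(tags) or tags[pos] != "I"


# two masks that touch at position 3, followed by two unmasked characters
tags = iob_tags(8, [(0, 3), (3, 6)])
print("".join(tags))               # BIIBIIOO
print(insertion_allowed(tags, 3))  # True: the boundary between the two masks
print(insertion_allowed(tags, 1))  # False: strictly inside the first mask
```

with plain boolean flags, positions 0 to 5 would all read as masked and
the boundary at position 3 would be lost.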

> There's also the question of overlapping masks (viz., when a mask pattern matches a sequence that is already part of another mask). The IOB vector would not accommodate these as separate, overlapping masks, so we could (1) ignore overlapping matches, (2) union them (and update the IOB values accordingly), or (3) use a different data structure such as a list of mask start-positions and run-lengths. Currently I like option (2).

yes, your option (2) sounds like the most straightforward solution,
both in terms of specifying the expected behavior and implementing it.
the alternative would be not to allow overlapping mask matching, but
to me too it seems conceptually simplest (for REPP users and
implementers alike) to not restrict mask matching and union
overlapping matches.
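
as a rough sketch of option (2), again in python and with hypothetical
names, overlapping mask matches could be unioned as below.  one
assumption worth flagging: spans that merely touch are kept separate,
so that rewriting in between adjacent masks remains possible.

```python
def union_masks(spans):
    """Union overlapping half-open (start, end) spans.

    Spans that only touch (one ends exactly where the next begins) are
    left as separate masks; only a strict overlap triggers a merge.
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # strict overlap with the previous mask: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# (0, 4) and (2, 6) overlap and are unioned; (6, 9) is only adjacent
print(union_masks([(0, 4), (2, 6), (6, 9)]))  # [(0, 6), (6, 9)]
```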

> Finally, do we want to block rewrite rules where a capture group starts or ends within a mask? I can imagine multiple capture groups that collectively copy the entire masked region without alteration. I think this situation wouldn't be too bad if we just check that the before and after masked substrings have the same contents *and* the characterization is constant (the same offset for the whole mask).

i am not quite sure what exactly you have in mind here regarding
constant characterization (masked sub-strings can be shifted to the
left or the right, but their length and content must not change)?  my
original assumption was to just disallow any rewriting inside (or
overlapping with) a masked region that does not go through capture
groups.  this feels like a simple and clear constraint to me.  on this
view, two adjacent
capture groups that cover (at least) the complete masked region would
be fine, but even single-character identity rewriting (as in your '@'
example) should be blocked.  i fail to see a compelling need for that
kind of rewriting in the first place, and i would like to not
complicate masking support too much.  i imagine it might be relatively
straightforward to evaluate rewriting conditions while synthesizing
the output (i.e. while processing the right-hand side of a rule),
interleaved with the character-level accounting.
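
one possible reading of this constraint, sketched in python with the
standard re module: a rule match is acceptable only if every mask it
overlaps is completely covered by the capture groups of the match.  the
function name is hypothetical, masks are half-open (start, end) spans,
and the sketch assumes that the right-hand side copies all capture
groups back unchanged and in order, which a real processor would have
to verify while synthesizing the output.

```python
import re


def rewrite_allowed(match, masks):
    """Check a regex match against a list of half-open mask spans.

    Returns False if the match overlaps a mask that is not entirely
    covered by the (participating) capture groups of the match.
    """
    groups = sorted(
        match.span(i)
        for i in range(1, match.re.groups + 1)
        if match.span(i) != (-1, -1)  # skip groups that did not participate
    )
    for m_start, m_end in masks:
        if not (m_start < match.end() and match.start() < m_end):
            continue  # match does not overlap this mask
        pos = m_start
        for g_start, g_end in groups:
            if g_start <= pos < g_end:
                pos = g_end  # this group extends the covered stretch
        if pos < m_end:
            return False  # part of the mask lies outside all capture groups
    return True


text = "mail me@x.org now"
masks = [(5, 13)]  # the substring "me@x.org"

print(rewrite_allowed(re.search(r"(me)(@x\.org)", text), masks))  # True
print(rewrite_allowed(re.search(r"@", text), masks))              # False
print(rewrite_allowed(re.search(r"(@)", text), masks))            # False
print(rewrite_allowed(re.search(r"now", text), masks))            # True
```

two adjacent groups that jointly span the mask pass, whereas the
single-character '@' rule is blocked with or without a capture group,
because it does not cover the complete masked region.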

i have started to extend ReppTop on the wiki with a section on
masking, though some of the fine points of this thread have yet to be
(decided and) written down.  thanks, once more, for pushing towards
more specificity!

> Also, generally speaking, I can see this functionality having potential to reduce the need for special casing of things beyond named entities. Currently the ERG has 12 lexical entries for "email" ("e-mail", "e - mail", "e mail", nouns and verbs) and some of the orthographic variation seems to account for tokenization effects. Is there any reason it should not be used in these cases?

well, yes, i too wonder at times whether accommodation of typographic
variation could be reduced in the ERG lexicon :-).  this is a tricky
game, i fear.  in part because what is in the lexicon (in some cases)
seeks to cover both common conventions and common deviations, in part
because there have been some usage scenarios for the ERG without going
through the REPP layer (i.e. when parsing pre-tokenized or otherwise
externally tokenized inputs).  for the above example, i imagine (at
least if assuming REPP tokenization) one could hope to make do without
the three-token |e - mail| lexical entry (by masking |e-mail|),
whereas the other variants likely are required.  but such masking
could be said to duplicate specific lexical information in the REPP
rules, so maybe one would rather want to not require the |e-mail|
entry?

best wishes, oe