[developers] extension to the REPP sub-formalism

Stephan Oepen oe at ifi.uio.no
Sun Oct 31 07:47:00 UTC 2021


hi mike, all:

with apologies for coming back to this thread late!  and, as so often, many thanks for pushing us further toward a definition of the REPP sub-formalism, including corner cases that i did not anticipate from my work with the ERG tokenization rules :-).

> Here I'd expect "abb" as the output because the second occurrence of "a" should have different characterization as the input string.

why exactly do you think duplicating the masked capture group should be blocked?  if i read your rules right, a full masked region is matched by the replacement rule here.  so both copies in the output remain ‘intact’, in the sense of preserving the masked content and its characterization, i would think?

similarly, i do not immediately see on which grounds we should block swapping around two complete masked regions?  i wonder whether your current code actually operationalizes sufficient notions of preserving masked sub-strings?
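
to make the intuition above concrete, here is a minimal python sketch.  it is a toy model only, not actual REPP code and not the rules from this thread: the helper names `characterize` and `rewrite`, the list-based replacement template, and the input "ab" are all assumptions for illustration.  each character carries its original (start, end) offsets, and any copy of a capture group simply carries those offsets along, so duplicating or swapping complete groups leaves their content and characterization untouched.

```python
import re


def characterize(text):
    """pair each character with its (start, end) offsets in the input."""
    return [(ch, i, i + 1) for i, ch in enumerate(text)]


def rewrite(chars, pattern, template):
    """apply one toy rewrite rule to a characterized string.

    template is a list whose items are either an int (a capture group
    number) or a str (literal replacement text).  characters copied from a
    capture group keep their original offsets; literal characters are
    assigned the span of the whole match (an assumption of this sketch).
    """
    text = "".join(ch for ch, _, _ in chars)
    out, pos = [], 0
    for m in re.finditer(pattern, text):
        out.extend(chars[pos:m.start()])  # unmatched material, unchanged
        for part in template:
            if isinstance(part, int):
                # copy the group together with its characterization
                out.extend(chars[m.start(part):m.end(part)])
            else:
                out.extend((ch, m.start(), m.end()) for ch in part)
        pos = m.end()
    out.extend(chars[pos:])
    return out


chars = characterize("ab")  # hypothetical input

# duplicate a complete group: both copies of "a" keep the span (0, 1)
duplicated = rewrite(chars, r"(a)", [1, 1])

# swap two complete groups: each keeps its own original span
swapped = rewrite(chars, r"(a)(b)", [2, 1])
```

in this model, nothing distinguishes the second copy of "a" from the first, which is the sense in which i would call both copies ‘intact’; a stricter notion of preservation would need to be stated separately.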

all best, oe



