[developers] extension to the REPP sub-formalism

Mon Aug 3 09:35:06 CEST 2020

Hi Stephan,

This sounds like a good solution. I have some questions/comments below.

On Sun, Aug 2, 2020 at 8:44 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> [...]
> to rationalize this state of affairs (and, thus, work toward a peace
> treaty in token mapping), i believe we will need to extend the REPP
> language with a new facility: masking sub-strings according to NE-like
> patterns prior to core REPP processing, and exempting masked regions
> from all subsequent rewriting (i.e. making sure they remain intact).
>

Ok, so if I understood correctly, masking is not sequential like the
rewrite rules: it happens before any rewriting regardless of where the mask
pattern appears in the file (just as the tokenization pattern is applied
after the rewrite rules), and the order in which the mask patterns are
applied doesn't matter.

I first wish to discuss mask pattern discovery, and this cross-cuts with
some other unclear areas of the REPP specification. To recap, REPP has
sequential operators ('!' rewrite rule, '<' file include, and '>' group
call) which apply in order during processing, and non-sequential operators
('#' iterative group definition, ':' tokenizer pattern, '@' meta-info
declaration) which do not apply except in certain circumstances (iterative
groups when they are called, tokenization after all rewrite rules have
applied). Non-sequential operators also have these two properties:

1. They may only be defined once in a REPP (once per identifier for
iterative groups)
2. They are local to a REPP instance (an iterative group or tokenizer
pattern in an external module is not available to other modules)

(These are partially guesses; I've raised an issue for PyDelphin to resolve
related questions so they don't distract from the current topic:
https://github.com/delph-in/pydelphin/issues/308)

The masking rules are non-sequential, but (1) clearly doesn't apply, and
(2) doesn't seem to apply in your proposal since ne.rpp is a submodule. At
first my reaction was to vote for starting simple and using masks defined
in the top-level module only (like the tokenizer), but I can see the value
in having them spread across submodules: a submodule may define rewrite
rules that require additional masks that are only needed when the module is
active.

So if we allow submodules to define these global masks, I guess we need to
collect any mask pattern found by crawling active submodules. The
non-sequential but global nature raises an issue: what if a submodule
containing a mask is active (e.g., set in *repp-calls* in the LKB) but is
not actually called with a group-call (i.e., if `>ne` did not appear in
tokenizer.rpp)?
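
To make the question concrete, here is a rough Python sketch of the two
possible collection policies. The Module class, its fields, and
collect_masks() are hypothetical names made up for this example; they are
not part of the LKB or PyDelphin.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for one loaded .rpp file."""
    name: str
    masks: list = field(default_factory=list)       # '+' mask patterns
    calls: list = field(default_factory=list)       # names used in '>' calls
    submodules: dict = field(default_factory=dict)  # name -> Module


def collect_masks(module, active, only_if_called=False):
    """Gather mask patterns from a module and its active submodules.

    With only_if_called=True, an active submodule contributes masks only
    if the parent module also calls it with '>'.
    """
    masks = list(module.masks)
    for name, sub in module.submodules.items():
        if name not in active:
            continue
        if only_if_called and name not in module.calls:
            continue
        masks.extend(collect_masks(sub, active, only_if_called))
    return masks


# 'ne' is active but never called with '>ne' in the top-level module
ne = Module('ne', masks=['EMAIL-PATTERN'])
top = Module('tokenizer', submodules={'ne': ne})

masks_if_active = collect_masks(top, {'ne'})
masks_if_called = collect_masks(top, {'ne'}, only_if_called=True)
```

Under the first policy the mask from ne.rpp applies even though none of its
rewrite rules ever run; under the second it is silently dropped. Which of
the two is intended is exactly what I am asking.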

> i have added an example of this new facility (introducing the '+'
> operator) to the ERG trunk; please see:
>
> http://svn.delph-in.net/erg/trunk/rpp/ne.rpp
>

As an aside, that email regex is needlessly complicated. In a Unicode-aware
regex engine, the word-character class \w is roughly equivalent to the L
and N Unicode properties plus the underscore ([\p{L}\p{N}_]), and the TLD
part of the domain must have only ASCII characters, so it can be simplified
as follows:

    <?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?
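
A quick way to try the simplified pattern is with Python's re module, whose
\w is Unicode-aware for str patterns. The sample text below is invented for
illustration, and Python's notion of \w may differ slightly from that of
the engine the LKB uses.

```python
import re

# The simplified pattern from above, unchanged
EMAIL = re.compile(r'<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?')

text = 'Write to <jörg.müller@exämple.co.uk> or bob@example.com today.'
matches = [m.group(0) for m in EMAIL.finditer(text)]

# The final domain label must be ASCII, so this address is not matched
non_ascii_tld = EMAIL.search('a@b.рф')
```

Here matches holds the two addresses, including the angle brackets around
the first one, and non_ascii_tld is None.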

Either way it's not RFC 5322 compliant, but I imagine that in running text
you want to match addresses that may be displayed with Unicode code points.

> [...] the masking patterns merely set a boolean flag for the matched
> character positions, and subsequent rewriting must block rule
> applications that destructively change one or more masked character
> positions.  output of capture groups (copying from the left-hand side
> verbatim), on the other hand, must be allowed over masked regions.

That makes sense, but boolean flags alone may not be enough: two
immediately adjacent masked regions would look like one solid region, even
though we should allow material to be inserted between them. An IOB scheme
(as in chunking) or similar would be better.
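
As a minimal sketch of what I mean (iob_vector() is a made-up helper, and
the two toy patterns just stand in for real mask patterns), each masked
region gets a 'B' on its first character and 'I' on the rest. To keep the
sketch short, a match that collides with an existing mask is skipped.

```python
import re


def iob_vector(s, mask_patterns):
    """Tag each character of s as B (mask start), I (inside), or O."""
    tags = ['O'] * len(s)
    for pattern in mask_patterns:
        for m in re.finditer(pattern, s):
            start, end = m.span()
            if start == end:
                continue  # empty match: nothing to mask
            if any(tag != 'O' for tag in tags[start:end]):
                continue  # collides with an existing mask: skip it
            tags[start] = 'B'
            tags[start + 1:end] = ['I'] * (end - start - 1)
    return ''.join(tags)


# "ab" and "12" are adjacent masks; boolean flags would give 1111 here
tags = iob_vector('ab12 x', [r'\d+', r'[a-z]+'])
```

The result is 'BIBIOB': the second 'B' marks the boundary between "ab" and
"12", so a rule inserting material at that point can be told apart from one
that splits a mask.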

There's also the question of overlapping masks (viz., when a mask pattern
matches a sequence that is already part of another mask). The IOB vector
would not accommodate these as separate, overlapping masks, so we could (1)
ignore overlapping matches, (2) union them (and update the IOB values
accordingly), or (3) use a different data structure such as a list of mask
start-positions and run-lengths. Currently I like option (2).
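
Option (2) could look something like the following sketch. The function
names are invented for this example, and I am assuming that spans which
merely touch should stay separate so that material can still be inserted
between them.

```python
def union_spans(spans):
    """Merge overlapping (start, end) spans; touching spans stay separate."""
    merged = []
    for start, end in sorted(sp for sp in spans if sp[0] < sp[1]):
        if merged and start < merged[-1][1]:
            # overlaps the previous span: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def to_iob(length, spans):
    """Build the IOB string for a text of the given length."""
    tags = ['O'] * length
    for start, end in union_spans(spans):
        tags[start:end] = ['B'] + ['I'] * (end - start - 1)
    return ''.join(tags)


# (0, 5) and (3, 8) overlap and are unioned; (8, 10) only touches
merged = union_spans([(3, 8), (0, 5), (8, 10)])
tags = to_iob(11, [(3, 8), (0, 5), (8, 10)])
```

This gives [(0, 8), (8, 10)] and 'BIIIIIIIBIO'.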

Finally, do we want to block rewrite rules where a capture group starts or
ends within a mask? I can imagine multiple capture groups that collectively
copy the entire masked region without alteration. I think this situation
wouldn't be too bad if we just check that the before and after masked
substrings have the same contents *and* the characterization is constant
(the same offset for the whole mask). This means the following would pass
because reinserting a single non-captured character doesn't change the
characterization:

    !(<?[\w.-]+)@([\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?)        \1@\2

But the following would change the characterization at the end and would
thus be blocked:

    !(<?[\w.-]+@[\w-]+(?:\.[\w-]+)*)\.com(>?)        \1.com\2
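
Here is a rough Python sketch of that check. It rests on a simplified model
of characterization that I am assuming purely for illustration: a captured
character keeps its own one-character source span, while every character of
a literal in the replacement is attributed to the whole stretch of
uncaptured input it replaces. apply_once() and mask_preserved() are made-up
names, and the sketch only handles templates that use each group once, in
increasing order, with all groups participating in the match.

```python
import re


def apply_once(pattern, template, s):
    """Rewrite the first match; return (output, per-char source spans)."""
    m = re.search(pattern, s)
    if m is None:
        return s, [(i, i + 1) for i in range(len(s))]
    out = list(s[:m.start()])
    spans = [(i, i + 1) for i in range(m.start())]
    parts = re.split(r'\\(\d)', template)  # literal, group, literal, ...
    pos = m.start()                        # input consumed so far
    for k in range(0, len(parts), 2):
        literal = parts[k]
        group = int(parts[k + 1]) if k + 1 < len(parts) else None
        gap_end = m.start(group) if group is not None else m.end()
        # the literal stands in for the uncaptured input s[pos:gap_end]
        out.extend(literal)
        spans.extend([(pos, gap_end)] * len(literal))
        if group is not None:
            out.extend(m.group(group))
            spans.extend((i, i + 1) for i in range(*m.span(group)))
            pos = m.end(group)
    out.extend(s[m.end():])
    spans.extend((i, i + 1) for i in range(m.end(), len(s)))
    return ''.join(out), spans


def mask_preserved(s, mask, out, spans):
    """True if the mask survives verbatim at one constant offset."""
    ms, me = mask
    hits = [j for j, (a, b) in enumerate(spans)
            if ms <= a and b <= me and b - a == 1]
    if not hits or len(hits) != me - ms:
        return False
    offset = hits[0] - ms
    return all(spans[j] == (ms + n, ms + n + 1)
               and j == ms + n + offset
               and out[j] == s[ms + n]
               for n, j in enumerate(hits))


s = 'mail me at <foo.bar@example.com> now'
mask = re.search(r'<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?', s).span()

rule1 = (r'(<?[\w.-]+)@([\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?)', r'\1@\2')
rule2 = (r'(<?[\w.-]+@[\w-]+(?:\.[\w-]+)*)\.com(>?)', r'\1.com\2')

out1, spans1 = apply_once(*rule1, s)
out2, spans2 = apply_once(*rule2, s)
ok1 = mask_preserved(s, mask, out1, spans1)
ok2 = mask_preserved(s, mask, out2, spans2)
```

Both rules reproduce the input string, but under this model only the first
passes (ok1 is True, ok2 is False): the reinserted "@" still maps to a
single source character, whereas each character of the literal ".com" maps
to the whole four-character stretch.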

Also, generally speaking, I can see this functionality having potential to
reduce the need for special casing of things beyond named entities.
Currently the ERG has 12 lexical entries for "email" ("e-mail", "e - mail",
"e mail", nouns and verbs) and some of the orthographic variation seems to
account for tokenization effects. Is there any reason it should not be used
in these cases?

-- 
-Michael Wayne Goodman