[developers] extension to the REPP sub-formalism

Woodley Packard sweaglesw at sweaglesw.org
Mon Aug 3 18:51:26 CEST 2020


Mike, what makes you think the masking operator should float to the top of the execution order instead of applying in the position where it is written?  I expect it may be useful to apply some degree of normalization before masking applies, and if the desire is for masking to happen first, the author can always put it first.

Woodley

> On Aug 3, 2020, at 12:35 AM, "goodman.m.w at gmail.com" <goodman.m.w at gmail.com> wrote:
> 
> 
> Hi Stephan,
> 
> This sounds like a good solution. I have some questions/comments below.
> 
>> On Sun, Aug 2, 2020 at 8:44 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>> [...]
>> to rationalize this state of affairs (and, thus, work toward a peace
>> treaty in token mapping), i believe we will need to extend the REPP
>> language with a new facility: masking sub-strings according to NE-like
>> patterns prior to core REPP processing, and exempting masked regions
>> from all subsequent rewriting (i.e. making sure they remain intact).
> 
> Ok, so if I understood correctly, masking is not sequential like rewrite rules, and happens before the rewrite rules regardless of where the mask pattern appears in the file (just as the tokenization pattern is applied after the rewrite rules), and the order of application of the mask patterns doesn't matter.
> 
> I first wish to discuss mask pattern discovery, and this cross-cuts with some other unclear areas of the REPP specification. To recap, REPP has sequential operators ('!' rewrite rule, '<' file include, and '>' group call) which apply in order during processing, and non-sequential operators ('#' iterative group definition, ':' tokenizer pattern, '@' meta-info declaration) which do not apply except in certain circumstances (iterative groups when they are called, tokenization after all rewrite rules have applied). Non-sequential operators also have these two properties:
> 
> 1. They may only be defined once in a REPP (once per identifier for iterative groups)
> 2. They are local to a REPP instance (an iterative group or tokenizer pattern in an external module is not available to other modules)
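
The operator recap above can be written down as a small lookup. This is only a sketch for discussion: the function name `classify` is hypothetical, treating ';' as the comment marker is an assumption, and '+' is listed as the proposed mask operator whose place in the ordering is exactly what this thread is debating.

```python
# Sketch only: operator classes as described in the recap above.
SEQUENTIAL = {
    "!": "rewrite rule",
    "<": "file include",
    ">": "group call",
}
NON_SEQUENTIAL = {
    "#": "iterative group definition",
    ":": "tokenizer pattern",
    "@": "meta-info declaration",
}
# Proposed in this thread; whether it is sequential is still open.
PROPOSED = {"+": "mask pattern"}


def classify(line):
    """Return (kind, description) for one REPP line, or None if it is skipped."""
    line = line.strip()
    # Assumption: blank lines and lines starting with ';' carry no operator.
    if not line or line.startswith(";"):
        return None
    op = line[0]
    if op in SEQUENTIAL:
        return ("sequential", SEQUENTIAL[op])
    if op in NON_SEQUENTIAL:
        return ("non-sequential", NON_SEQUENTIAL[op])
    if op in PROPOSED:
        return ("proposed", PROPOSED[op])
    return ("unknown", op)
```

For example, `classify(">ne")` reports a sequential group call, while `classify(":[ \t]+")` reports the non-sequential tokenizer pattern.
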
> 
> (These are partially guesses; I've raised an issue for PyDelphin to resolve related questions so they don't distract from the current topic: https://github.com/delph-in/pydelphin/issues/308)
> 
> The masking rules are non-sequential, but (1) clearly doesn't apply, and (2) doesn't seem to apply in your proposal since ne.rpp is a submodule. At first my reaction was to vote for starting simple and using masks defined in the top-level module only (like the tokenizer), but I can see the value in having them spread across submodules: a submodule may define rewrite rules that require additional masks that are only needed when the module is active.
> 
> So if we allow submodules to define these global masks, I guess we need to collect any mask pattern found by crawling active submodules. The non-sequential but global nature raises an issue: what if a submodule containing a mask is active (e.g., set in *repp-calls* in the LKB) but is not actually called with a group-call (i.e., if `>ne` did not appear in tokenizer.rpp)?
>  
>> i have added an example of this new facility (introducing the '+'
>> operator) to the ERG trunk; please see:
>> 
>> http://svn.delph-in.net/erg/trunk/rpp/ne.rpp
> 
> As an aside, that email regex is needlessly complicated. Since, in a Unicode-aware regex engine, the word-character class \w is equivalent to the L and N Unicode properties plus the underscore ([\p{L}\p{N}_]), and since the TLD part of the domain must contain only ASCII characters, it can be simplified as follows:
> 
>     <?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?
> 
> Either way it's not RFC 5322 compliant, but I imagine in running text you want to match addresses that may be displayed with Unicode code points.
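
The simplified pattern quoted above can be tried out in Python, whose `re` module treats `\w` as Unicode-aware for `str` patterns. This is a quick sketch, not part of the ERG or of any REPP implementation; the helper name `find_emails` and the sample addresses are made up for illustration.

```python
import re

# The simplified pattern from the message above, unchanged.
EMAIL = re.compile(r"<?[\w.-]+@[\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?")


def find_emails(text):
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in EMAIL.finditer(text)]


# Angle brackets are optional and are included in the match when present.
print(find_emails("Write to <kim@mail.example.org> today."))
# Non-ASCII word characters are accepted outside the TLD.
print(find_emails("Mail müller@beispiel.example."))
```

Note that a sentence-final period is left outside the match, because the last `\.` must be followed by at least one ASCII letter or digit.
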
>  
>> [...] the masking patterns merely set a boolean flag for the matched character
>> positions, and subsequent rewriting must block rule applications that
>> destructively change one or more masked character positions.  output
>> of capture groups (copying from the left-hand side verbatim), on the
>> other hand, must be allowed over masked regions. 
> 
> That makes sense, but we may need a different mechanism than just boolean flags because of the possibility of immediately adjacent masked regions looking like one solid region when we should allow material to be inserted between them. Instead, an IOB scheme (like in chunking) or similar would be better.
> 
> There's also the question of overlapping masks (viz., when a mask pattern matches a sequence that is already part of another mask). The IOB vector would not accommodate these as separate, overlapping masks, so we could (1) ignore overlapping matches, (2) union them (and update the IOB values accordingly), or (3) use a different data structure such as a list of mask start-positions and run-lengths. Currently I like option (2).
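
To make the IOB idea and option (2) concrete, here is a minimal sketch. It assumes masks are plain regex matches over the input string; the function name `iob_mask` is hypothetical, and nothing here reflects an existing REPP implementation. Overlapping matches are unioned, while matches that merely touch stay separate, so material could still be inserted between them.

```python
import re


def iob_mask(text, patterns):
    """Return one tag per character: B (mask start), I (inside), O (outside)."""
    # Collect all non-empty match spans from all mask patterns.
    spans = sorted(
        m.span()
        for pattern in patterns
        for m in re.finditer(pattern, text)
        if m.end() > m.start()
    )
    # Union spans that truly overlap; adjacent spans are kept apart.
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    tags = ["O"] * len(text)
    for start, end in merged:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return "".join(tags)
```

With this sketch, `iob_mask("ab12cd", [r"[a-z]+", r"\d+"])` yields `BIBIBI` (three adjacent but distinct masks, which a boolean vector could not tell apart), and `iob_mask("abcdef", ["abcd", "cdef"])` yields `BIIIII` (two overlapping matches unioned into one mask).
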
> 
> Finally, do we want to block rewrite rules where a capture group starts or ends within a mask? I can imagine multiple capture groups that collectively copy the entire masked region without alteration. I think this situation wouldn't be too bad if we just check that the before and after masked substrings have the same contents *and* the characterization is constant (the same offset for the whole mask). This means the following would pass because reinserting a single non-captured character doesn't change the characterization:
> 
>     !(<?[\w.-]+)@([\w-]+(?:\.[\w-]+)*\.[a-zA-Z0-9]+>?)        \1@\2
> 
> But the following would change the characterization at the end and would thus be blocked:
> 
>     !(<?[\w.-]+@[\w-]+(?:\.[\w-]+)*)\.com(>?)        \1.com\2
> 
> Also, generally speaking, I can see this functionality having potential to reduce the need for special casing of things beyond named entities. Currently the ERG has 12 lexical entries for "email" ("e-mail", "e - mail", "e mail", nouns and verbs) and some of the orthographic variation seems to account for tokenization effects. Is there any reason it should not be used in these cases?
> 
> -- 
> -Michael Wayne Goodman