<div dir="ltr">It's been a _loong_ time since I looked at that code (or used svn...). I've been refreshing my memory of the code, and I think I can see how that works. As a mechanism, it sounds reasonable, but it's going to be a long time before I have time to sit down and try to make the change. More than happy for anyone else to take up the challenge :)<div> </div><div>Bec</div></div> <div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Aug 2, 2020 at 10:44 PM Stephan Oepen &lt;<a href="mailto:oe@ifi.uio.no">oe@ifi.uio.no</a>&gt; wrote: </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">dear bec, mike, and woodley:<br><br>during the summit you may have noticed dan mentioning a 'war zone' around NE-related token mapping rules in the current ERG trunk. with our move to modern, OntoNotes-style tokenization, the initial REPP segmentation now breaks at dashes (including hyphens) and slashes. but these will, of course, occur frequently in named entities like email and web addresses, where they should preferably not be segmented. the current unhappy state of affairs is that initial tokenization over-segments, with dan then heroically seeking to re-unite at least the most common patterns of 'multi-token' named entities in token mapping, where any number of token boundaries may have been introduced at hyphens and slashes.<br><br>to rationalize this state of affairs (and, thus, work toward a peace treaty in token mapping), i believe we will need to extend the REPP language with a new facility: masking sub-strings according to NE-like patterns prior to core REPP processing, and exempting masked regions from all subsequent rewriting (i.e. making sure they remain intact).<br><br>
i have added an example of this new facility (introducing the '+' operator) to the ERG trunk; please see:<br><br><a href="http://svn.delph-in.net/erg/trunk/rpp/ne.rpp" rel="noreferrer" target="_blank">http://svn.delph-in.net/erg/trunk/rpp/ne.rpp</a><br><br>at present, these rules are only loaded into the LKB (where i am in the process of adding masking to the REPP implementation), hence they should not cause trouble in the other engines (i hope).<br><br>i would like to invite you (as the developers of REPP processors in PET, pyDelphin, and ACE, respectively) to look over this proposal and share any comments you might have. assuming we can agree on the need for extending the REPP language along the above lines, i am hoping you might have a chance to add support for the masking operator in your REPP implementations?<br><br>from my ongoing work in the LKB, masking support appears relatively straightforward once an engine implements the step-wise accounting for character position sketched by Dridan &amp; Oepen (2012; ACL). the masking patterns merely set a boolean flag for the matched character positions, and subsequent rewriting must block rule applications that destructively change one or more masked character positions. output of capture groups (copying from the left-hand side verbatim), on the other hand, must be allowed over masked regions. because the LKB implementation predates the 2012 paper, however, i will first have to implement the precise accounting mechanism to validate the above expectation regarding how to realize masking.<br><br>what do you make of the above proposal?<br><br>oe </blockquote></div>
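<div>For anyone picking this up, here is a minimal Python sketch of the masking idea as described in the quoted message: mask patterns set a boolean flag per character position, a rewrite rule is blocked when it would change or delete a masked character, and copying masked characters through a capture group is allowed. This is not the LKB, PET, pyDelphin, or ACE implementation, and it does not use REPP rule syntax. The function names (<code>mask_positions</code>, <code>apply_rule</code>), the list-based replacement template, and the example patterns are all assumptions made up for illustration.</div>

```python
import re


def mask_positions(text, patterns):
    """Return one boolean per character: True where any mask pattern matched."""
    masked = [False] * len(text)
    for pattern in patterns:
        for m in re.finditer(pattern, text):
            for i in range(m.start(), m.end()):
                masked[i] = True
    return masked


def apply_rule(text, masked, pattern, template):
    """Apply one rewrite rule and carry the mask flags over to the output.

    `template` is a hypothetical stand-in for a REPP right-hand side:
    a list whose items are literal strings or integer capture-group numbers.
    """
    out_text, out_mask, pos = [], [], 0
    for m in re.finditer(pattern, text):
        # Input positions that the template copies verbatim via capture groups.
        copied = set()
        for part in template:
            if isinstance(part, int):
                copied.update(range(m.start(part), m.end(part)))
        # Block this application if it would alter or drop a masked character.
        if any(masked[i] and i not in copied
               for i in range(m.start(), m.end())):
            continue
        # Untouched text before the match keeps its flags.
        out_text.append(text[pos:m.start()])
        out_mask.extend(masked[pos:m.start()])
        for part in template:
            if isinstance(part, int):
                start, end = m.span(part)
                out_text.append(text[start:end])
                out_mask.extend(masked[start:end])  # copied chars stay masked
            else:
                out_text.append(part)
                out_mask.extend([False] * len(part))  # new chars are unmasked
        pos = m.end()
    out_text.append(text[pos:])
    out_mask.extend(masked[pos:])
    return "".join(out_text), out_mask


if __name__ == "__main__":
    text = "see svn.delph-in.net for well-known bugs"
    # Assumed NE-like pattern: dotted host names.
    masked = mask_positions(text, [r"[\w-]+(?:\.[\w-]+)+"])
    # Assumed segmentation rule: put spaces around word-internal hyphens.
    new_text, new_mask = apply_rule(text, masked, r"(\w)-(\w)", [1, " - ", 2])
    print(new_text)
```

<div>In this sketch the hyphen inside the masked host name is left alone while the hyphen in "well-known" is split. Mask flags are carried through each rewrite so that later rules see them too; a real engine would tie this to the step-wise character position accounting mentioned above rather than rebuilding lists on every rule.</div>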