[developers] Comparison of REPP implementations

Fri Nov 29 03:37:21 CET 2019

Returning to the "troubling" issue, I looked at ACE's REPP implementation
and saw that my second guess about the cause was correct: ACE is not
calling groups iteratively (except internal, iterative groups, as it
should), but it performs global substitution with multiple regex calls. If
it finds a match, it performs the substitution and then continues matching
from the next position in the transformed string. If the substitution
replaces the match with nothing, it matches again from the same position.
Normally this is fine, but because each attempt is a separate regex call,
anchors like ^ can match again on the successive calls. ACE's
implementation uses Boost.Regex, as does Bec's C++ implementation, and I
suspect they are doing the same thing in these situations.
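
To make the difference concrete, here is a minimal Python sketch. It is not
ACE's actual code: sub_successive_calls is a hypothetical emulation of the
behavior described above (each match comes from a fresh regex call on the
remaining, already transformed text), and sub_single_call is plain re.sub,
which finds all matches with one call to the engine.

```python
import re


def sub_single_call(pattern, repl, s):
    # One call to the engine: ^ can only match at the true start of s.
    return re.sub(pattern, repl, s)


def sub_successive_calls(pattern, repl, s):
    # Hypothetical emulation, not taken from ACE: every search is a separate
    # call on the rest of the transformed string, so ^ matches at the start
    # of each call.
    regex = re.compile(pattern)
    pos = 0
    while pos <= len(s):
        m = regex.search(s[pos:])  # slicing hides the text before pos
        if m is None:
            break
        start, end = pos + m.start(), pos + m.end()
        replacement = m.expand(repl)
        s = s[:start] + replacement + s[end:]
        # Continue after the replacement; if the replacement is empty this
        # is the same position the match started at.
        pos = start + len(replacement)
        if start == end:
            pos += 1  # step past an empty match to avoid looping forever
    return s


print(repr(sub_single_call(r'^ *[:*#]+', '', ' # # foo')))
print(repr(sub_successive_calls(r'^ *[:*#]+', '', ' # # foo')))
```

With the input from my earlier message, the first call removes only the
leading ' #', while the emulation removes both, as if the ^ anchor were not
there.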

A fix would be to continue matching on the original string while
constructing a separate transformed string. After the rule has been fully
applied, the original string can be replaced with the fully transformed one.
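
A minimal Python sketch of that fix, assuming Python's re module, where the
pos argument of search() moves the starting offset without letting ^ match
there. The function name sub_on_original is made up for illustration; the
real fix would of course be in the Boost.Regex-based code.

```python
import re


def sub_on_original(pattern, repl, s):
    # Match against the untouched original string and build the transformed
    # string separately; s itself is never modified while matching.
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(s):
        # With pos given, ^ still refers to the real start of s, not to pos.
        m = regex.search(s, pos)
        if m is None:
            break
        out.append(s[pos:m.start()])  # text between matches, copied as-is
        out.append(m.expand(repl))    # the substitution for this match
        if m.end() == m.start():
            # Empty match: copy one character and step forward.
            out.append(s[m.end():m.end() + 1])
            pos = m.end() + 1
        else:
            pos = m.end()
    out.append(s[pos:])  # whatever is left after the last match
    return ''.join(out)


print(repr(sub_on_original(r'^ *[:*#]+', '', ' # # foo')))
print(repr(sub_on_original(r' *[:*#]+', '', ' # # foo')))
```

Here the anchored pattern matches only once, and the unanchored pattern
still removes every occurrence.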

Even though the fix would alter our underlying tokenization, it seems to
affect only one sentence in the ERG's gold profiles.

On Tue, Nov 26, 2019 at 12:24 PM goodman.m.w at gmail.com <
goodman.m.w at gmail.com> wrote:

> It turns out my second issue regarding group-local inline flags (e.g.,
> (?i)) was not as trivial as I expected. By using a 3rd party regex library
> with better support for these advanced features I was able to easily
> resolve this issue, but it introduced another one: unescaped brackets in
> character classes are treated as nested sets. In fact I'd already
> encountered this situation and created an issue for the ERG a few months
> ago: https://github.com/delph-in/erg/issues/17.
>
> For now I've sidestepped the issue by backing off to the old behavior if
> there's an error that is plausibly caused by unescaped brackets, and now I
> only get a diff for 1 item compared to the REPP standalone tool. The
> remaining error is the one described in the previous message. But, as you
> said, PyDelphin might be doing the right thing for that case.
>
> On Mon, Nov 25, 2019 at 11:59 PM goodman.m.w at gmail.com <
> goodman.m.w at gmail.com> wrote:
>
>>
>> Hi Stephan and all,
>>
>> On Mon, Nov 25, 2019 at 9:13 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>>
>>> also, i am happy to confirm that leaving out the LKB from this
>>> comparison is justified, as its implementation of recovering character
>>> ranges after all rule applications are complete predates the work by bec
>>> (and myself) and is known to give invalid results in some cases.
>>>
>>
>> Thanks for confirming. I have compared the LKB's implementation in the
>> past, but this time I was short on time and/or being lazy.
>>
>>
>>>
>>> in sum, i would look to the REPP standalone tool as the closest we
>>> currently have to a reference implementation.  it uses the approach of
>>> explicitly keeping track of sub-string correspondences for each rule
>>> application, as described here:
>>>
>>> https://www.aclweb.org/anthology/P12-2074/
>>>
>>
>> Yes, I referred heavily to this paper as well as ReppTop and to the
>> outputs of the various systems when I wrote my version. I took some notes
>> along the way which I've been meaning to migrate to the wiki, but for now
>> I've just put them up here:
>> https://gist.github.com/goodmami/16d907bc2a0e4408456ff596e1e263e6. These
>> notes include details of characterization as well as comparisons of 4
>> systems (LKB, ACE, PET/standalone, and PyDelphin; I could not easily test
>> agree but I welcome any information to fill in my comparison).
>>
>>
>>
>>>
>>>
>>> > The third one is more troubling, because it appears that ACE and REPP
>>> both apply external group calls iteratively even though the ReppTop wiki is
>>> clear that they should not be iterative. If someone can confirm that
>>> the wiki is incorrect, [...]
>>>
>>> this observation is surprising and potentially troubling to me!  the
>>> wiki page is in fact correct: unlike internal group calls, external calls
>>> should not in and of themselves be iterative; that would unnecessarily lump
>>> together two aspects of the REPP specification and potentially constrain
>>> modularity, viz. in a scenario where one would like to split out a rule
>>> group into an external module (e.g. to be able to parametrically turn it on
>>> or off) but does not want iteration over these rules.  if one in fact wants
>>> both, it is trivial to wrap an internal iterative group around the external
>>> module.
>>>
>>
>> Ok, we agree on the ideal behavior here. I just tried creating a MWE
>> (minimal working example) with nothing but the external with a single rule
>> and the tokenization pattern (see comments here:
>> https://github.com/delph-in/pydelphin/issues/254).  Based on my tests
>> with this setup, it now appears that the difference is *not* iterative
>> application but rather a difference in how PyDelphin and the others do
>> global substitutions. The pattern /^ *[:*#]+/ has the ^ anchor, so it
>> should only match the first instance. Perl (whose regex engine PCRE is
>> modeled on) shows this is the case as well:
>>
>> With ^ anchor:
>> $ echo ' # # foo' | perl -e 'while (<>) { s/^ *[:*#]+//g; print }'
>>  # foo
>>
>> Without ^ anchor:
>> $ echo ' # # foo' | perl -e 'while (<>) { s/ *[:*#]+//g; print }'
>>  foo
>>
>> So now I'm wondering how REPP and ACE do global substitutions. PyDelphin
>> gets all matches on a line with a single call to the regex engine, so ^ is
>> only matched once, but if it were to instead get a single match at a time
>> (matching successively starting at the end-position of the previous
>> substitution), then it could match ^ on those successive calls. Perhaps
>> this is what the other tools are doing?
>>
>> --
>> -Michael Wayne Goodman
>>
>
>
> --
> -Michael Wayne Goodman
>

-- 
-Michael Wayne Goodman