[developers] Comparison of REPP implementations

Tue Nov 26 05:24:45 CET 2019

It turns out my second issue, regarding group-local inline flags (e.g.,
(?i)), was not as trivial as I expected. Switching to a third-party regex
library with better support for these advanced features resolved it easily,
but introduced another problem: unescaped brackets in character classes are
treated as nested sets. In fact I'd already encountered this situation and
created an issue for the ERG a few months ago:
https://github.com/delph-in/erg/issues/17.

For now I've sidestepped the issue by backing off to the old behavior if
there's an error that is plausibly caused by unescaped brackets, and now I
only get a diff for 1 item compared to the REPP standalone tool. The
remaining error is the one described in the previous message. But, as you
said, PyDelphin might be doing the right thing for that case.
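
The back-off described above could look roughly like the following Python
sketch. It is not the actual PyDelphin code: compile_with_fallback and
strict_compile are hypothetical names, strict_compile only simulates a
stricter engine that reads an extra '[' as the start of a nested set, and the
"plausibly caused by unescaped brackets" test is reduced to a crude check for
'[' in the pattern.

```python
import re


def strict_compile(pattern):
    """Hypothetical stand-in for a stricter regex engine.

    It simulates the nested-set problem by rejecting any pattern with
    more '[' than ']' characters, then defers to the stdlib engine.
    """
    if pattern.count('[') > pattern.count(']'):
        raise re.error('simulated: unterminated nested set')
    return re.compile(pattern)


def compile_with_fallback(pattern, primary, fallback=re.compile):
    """Try the primary compiler; back off to the old one on bracket errors."""
    try:
        return primary(pattern)
    except re.error:
        # Crude heuristic (an assumption): only back off when the pattern
        # contains a bracket that could have been read as a nested set.
        if '[' not in pattern:
            raise
        return fallback(pattern)


# '[a[]+' has an unescaped '[' inside the character class.
rx = compile_with_fallback(r'[a[]+', strict_compile)
print(rx.fullmatch('a[[a') is not None)
```

Errors that do not involve brackets are re-raised, so the back-off does not
hide unrelated mistakes in a rule.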

On Mon, Nov 25, 2019 at 11:59 PM goodman.m.w at gmail.com <
goodman.m.w at gmail.com> wrote:

>
> Hi Stephan and all,
>
> On Mon, Nov 25, 2019 at 9:13 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> also, i am happy to confirm that leaving out the LKB from this comparison
>> is justified, as its implementation of recovering character ranges after
>> all rule applications are complete predates the work by bec (and myself)
>> and is known to give invalid results in some cases.
>>
>
> Thanks for confirming. I have compared the LKB's implementation in the
> past, but this time I was short on time and/or being lazy.
>
>
>>
>> in sum, i would look to the REPP standalone tool as the closest we
>> currently have to a reference implementation.  it uses the approach of
>> explicitly keeping track of sub-string correspondences for each rule
>> application, as described here:
>>
>> https://www.aclweb.org/anthology/P12-2074/
>>
>
> Yes, I referred heavily to this paper as well as ReppTop and to the
> outputs of the various systems when I wrote my version. I took some notes
> along the way which I've been meaning to migrate to the wiki, but for now
> I've just put them up here:
> https://gist.github.com/goodmami/16d907bc2a0e4408456ff596e1e263e6. These
> notes include details of characterization as well as comparisons of 4
> systems (LKB, ACE, PET/standalone, and PyDelphin; I could not easily test
> agree but I welcome any information to fill in my comparison).
>
>
>
>>
>>
>> > The third one is more troubling, because it appears that ACE and REPP
>> > both apply external group calls iteratively even though the ReppTop wiki is
>> > clear that they should not be iterative. If someone can confirm that
>> > the wiki is incorrect, [...]
>>
>> this observation is surprising and potentially troubling to me!  the wiki
>> page is in fact correct: unlike internal group calls, external calls should
>> not in and of themselves be iterative; that would unnecessarily lump
>> together two aspects of the REPP specification and potentially constrain
>> modularity, viz. in a scenario where one would like to split out a rule
>> group into an external module (e.g. to be able to parametrically turn it on
>> or off) but does not want iteration over these rules.  if one in fact wants
>> both, it is trivial to wrap an internal iterative group around the external
>> module.
>>
>
> Ok, we agree on the ideal behavior here. I just tried creating a MWE
> (minimal working example) containing nothing but the external group with
> a single rule and the tokenization pattern (see comments here:
> https://github.com/delph-in/pydelphin/issues/254).  Based on my tests
> with this setup, it now appears that the difference is *not* iterative
> application but rather a difference in how PyDelphin and the others do
> global substitutions. The pattern /^ *[:*#]+/ has the ^ anchor, so it
> should only match the first instance. Perl (whose regex syntax PCRE is
> modeled on) shows this is the case as well:
>
> With ^ anchor:
> $ echo ' # # foo' | perl -e 'while (<>) { s/^ *[:*#]+//g; print }'
>  # foo
>
> Without ^ anchor:
> $ echo ' # # foo' | perl -e 'while (<>) { s/ *[:*#]+//g; print }'
>  foo
>
> So now I'm wondering how REPP and ACE do global substitutions. PyDelphin
> gets all matches on a line with a single call to the regex engine, so ^ is
> only matched once, but if it were to instead get a single match at a time
> (matching successively starting at the end-position of the previous
> substitution), then it could match ^ on those successive calls. Perhaps
> this is what the other tools are doing?
>
> --
> -Michael Wayne Goodman
>
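
The two strategies contrasted at the end of the quoted message can be
sketched in Python with the standard re module. This only illustrates the
hypothesis; it makes no claim about how REPP or ACE are actually implemented.
The function names sub_single_pass and sub_successive are made up for the
example.

```python
import re

PATTERN = re.compile(r'^ *[:*#]+')


def sub_single_pass(pattern, repl, text):
    """Find all matches with one call to the engine.

    Without MULTILINE, ^ can only match at the start of the string,
    so at most one substitution happens.
    """
    return pattern.sub(repl, text)


def sub_successive(pattern, repl, text):
    """Match one substitution at a time against the remaining text.

    Each search starts on the remainder after the previous match, so ^
    matches again at the start of every remainder.
    """
    out = []
    rest = text
    while True:
        m = pattern.search(rest)
        # Stop when nothing matches, or on an empty match at position 0,
        # which would otherwise loop forever.
        if m is None or m.end() == 0:
            break
        out.append(rest[:m.start()])
        out.append(repl)
        rest = rest[m.end():]
    out.append(rest)
    return ''.join(out)


print(repr(sub_single_pass(PATTERN, '', ' # # foo')))  # ' # foo'
print(repr(sub_successive(PATTERN, '', ' # # foo')))   # ' foo'
```

On the input ' # # foo' the single-pass version reproduces the Perl output
with the ^ anchor, while the successive version gives the same result as the
Perl substitution without the anchor.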

-- 
-Michael Wayne Goodman