[developers] Comparison of REPP implementations

Mon Nov 25 16:59:32 CET 2019

Hi Stephan and all,

On Mon, Nov 25, 2019 at 9:13 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> also, i am happy to confirm that leaving out the LKB from this comparison
> is justified, as its implementation of recovering character ranges after
> all rule applications are complete predates the work by bec (and myself)
> and is known to give invalid results in some cases.
>

Thanks for confirming. I have compared the LKB's implementation in the
past, but this time I was short on time and/or being lazy.

>
> in sum, i would look to the REPP standalone tool as the closest we
> currently have to a reference implementation.  it uses the approach of
> explicitly keeping track of sub-string correspondences for each rule
> application, as described here:
>
> https://www.aclweb.org/anthology/P12-2074/
>

Yes, I referred heavily to this paper as well as ReppTop and to the outputs
of the various systems when I wrote my version. I took some notes along the
way which I've been meaning to migrate to the wiki, but for now I've just
put them up here:
https://gist.github.com/goodmami/16d907bc2a0e4408456ff596e1e263e6. These
notes include details of characterization as well as comparisons of 4
systems (LKB, ACE, PET/standalone, and PyDelphin; I could not easily test
agree, but I welcome any information that would fill in my comparison).

>
>
> > The third one is more troubling, because it appears that ACE and REPP
> > both apply external group calls iteratively even though the ReppTop wiki is
> > clear that they should not be iterative. If someone can confirm that
> > the wiki is incorrect, [...]
>
> this observation is surprising and potentially troubling to me!  the wiki
> page is in fact correct: unlike internal group calls, external calls should
> not in and of themselves be iterative; that would unnecessarily lump
> together two aspects of the REPP specification and potentially constrain
> modularity, viz. in a scenario where one would like to split out a rule
> group into an external module (e.g. to be able to parametrically turn it on
> or off) but does not want iteration over these rules.  if one in fact wants
> both, it is trivial to wrap an internal iterative group around the external
> module.
>

Ok, we agree on the ideal behavior here. I just tried creating an MWE
(minimal working example) with nothing but the external group with a single rule
and the tokenization pattern (see comments here:
https://github.com/delph-in/pydelphin/issues/254).  Based on my tests with
this setup, it now appears that the difference is *not* iterative
application but rather a difference in how PyDelphin and the others do
global substitutions. The pattern /^ *[:*#]+/ has the ^ anchor, so it
should only match the first instance. Perl (on whose regex syntax PCRE is
modeled) shows this is the case as well:

With ^ anchor:
$ echo ' # # foo' | perl -e 'while (<>) { s/^ *[:*#]+//g; print }'
 # foo

Without ^ anchor:
$ echo ' # # foo' | perl -e 'while (<>) { s/ *[:*#]+//g; print }'
 foo

So now I'm wondering how REPP and ACE do global substitutions. PyDelphin
gets all matches on a line with a single call to the regex engine, so ^ is
only matched once, but if it were to instead get a single match at a time
(matching successively starting at the end-position of the previous
substitution), then it could match ^ on those successive calls. Perhaps
this is what the other tools are doing?
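
To make the two strategies concrete, here is a minimal Python sketch. It is
only an illustration of the hypothesis above, not the actual code of
PyDelphin, ACE, or the REPP standalone tool; the function names
sub_single_pass and sub_restarting are made up for this example, and the
replacement is treated as a literal string.

```python
import re

# The anchored pattern from the MWE discussed above.
PATTERN = re.compile(r'^ *[:*#]+')


def sub_single_pass(pattern, repl, s):
    """Find all matches in one scan of the whole string.

    Here ^ can only match at the real start of the string, so an
    anchored pattern is replaced at most once.
    """
    return pattern.sub(repl, s)


def sub_restarting(pattern, repl, s):
    """Hypothetical alternative: match once, then restart on the remainder.

    Each round hands the engine only the text after the previous match,
    as if it were a fresh string, so ^ can match again.
    """
    out = []
    rest = s
    while rest:
        m = pattern.search(rest)
        if m is None or m.end() == 0:
            # No match, or an empty match at the start: stop to avoid looping.
            break
        out.append(rest[:m.start()])
        out.append(repl)  # literal replacement, no group references
        rest = rest[m.end():]
    out.append(rest)
    return ''.join(out)


if __name__ == '__main__':
    line = ' # # foo'
    print(repr(sub_single_pass(PATTERN, '', line)))
    print(repr(sub_restarting(PATTERN, '', line)))
```

On the input ' # # foo', the single-pass version should give the same
result as the anchored Perl substitution, and the restarting version the
same result as the unanchored one. Note that in Python, passing a start
position to Pattern.search() does not by itself make ^ match at that
position; the sketch slices the string to get that effect. Whether ACE or
the REPP tool do something equivalent is an open question.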

-- 
-Michael Wayne Goodman