<div dir="ltr"><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div dir="ltr"><div class="gmail_default" style="font-family:arial,helvetica,sans-serif">Hi Stephan and all,</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 25, 2019 at 9:13 PM Stephan Oepen <<a href="mailto:oe@ifi.uio.no" target="_blank">oe@ifi.uio.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> also, i am happy to confirm that leaving out the LKB from this comparison is justified, as its implementation of recovering character ranges after all rule applications are complete predates the work by bec (and myself) and is known to give invalid results in some cases.<br></div></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Thanks for confirming. I have compared the LKB's implementation in the past, but this time I was short on time and/or being lazy.<br></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> <br> in sum, i would look to the REPP standalone tool as the closest we currently have to a reference implementation. it uses the approach of explicitly keeping track of sub-string correspondences for each rule application, as described here:<br> <br> <a href="https://www.aclweb.org/anthology/P12-2074/" rel="noreferrer" target="_blank">https://www.aclweb.org/anthology/P12-2074/</a></div></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Yes, I referred heavily to this paper as well as ReppTop and to the outputs of the various systems when I wrote my version. 
I took some notes along the way which I've been meaning to migrate to the wiki, but for now I've just put them up here: <a href="https://gist.github.com/goodmami/16d907bc2a0e4408456ff596e1e263e6">https://gist.github.com/goodmami/16d907bc2a0e4408456ff596e1e263e6</a>. These notes include details of characterization as well as comparisons of 4 systems (LKB, ACE, PET/standalone, and PyDelphin; I could not easily test agree but I welcome any information to fill in my comparison).<br></div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><br> <br> > The third one is more troubling, because it appears that ACE and REPP both apply external group calls iteratively even though the ReppTop wiki is clear that they should not be iterative. If someone can confirm that the wiki is incorrect, [...]<br> <br> this observation is surprising and potentially troubling to me! the wiki page is in fact correct: unlike internal group calls, external calls should not in and of themselves be iterative; that would unnecessarily lump together two aspects of the REPP specification and potentially constrain modularity, viz. in a scenario where one would like to split out a rule group into an external module (e.g. to be able to parametrically turn it on or off) but does not want iteration over these rules. if one in fact wants both, it is trivial to wrap an internal iterative group around the external module.<br></div></blockquote><div><br></div><div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Ok, we agree on the ideal behavior here. I just tried creating an MWE (minimal working example) with nothing but the external with a single rule and the tokenization pattern (see comments here: <a href="https://github.com/delph-in/pydelphin/issues/254">https://github.com/delph-in/pydelphin/issues/254</a>). 
Based on my tests with this setup, it now appears that the difference is *not* iterative application but rather a difference in how PyDelphin and the others do global substitutions. The pattern /^ *[:*#]+/ has the ^ anchor, so even under a global substitution it should match only once, at the start of the line. Perl (whose regex syntax PCRE is modeled on) shows the same behavior:</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">With ^ anchor:</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">$ echo ' # # foo' | perl -e 'while (<>) { s/^ *[:*#]+//g; print }'<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> # foo</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">Without ^ anchor:<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">$ echo ' # # foo' | perl -e 'while (<>) { s/ *[:*#]+//g; print }'<br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"> foo</div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default"><br></div><div style="font-family:arial,helvetica,sans-serif" class="gmail_default">So now I'm wondering how REPP and ACE do global substitutions. PyDelphin gets all matches on a line with a single call to the regex engine, so ^ is only matched once. If a tool instead found one match at a time, restarting the regex engine on the text that begins at the end position of the previous substitution, then ^ could match again on each of those successive calls. Perhaps this is what the other tools are doing?<br></div></div></div><br>-- <br><div dir="ltr">-Michael Wayne Goodman</div></div>
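<div dir="ltr"><br>P.S. A minimal Python sketch of the two global-substitution strategies discussed above. The function names sub_single_pass and sub_restarting are made up for illustration, and the second function is only a guess at what a tool might be doing; it is not taken from the REPP or ACE sources.</div>

```python
import re

# The pattern from the minimal example: anchored at the start of the string.
PATTERN = re.compile(r'^ *[:*#]+')


def sub_single_pass(pattern, repl, s):
    """One call to the regex engine for the whole string.

    Without re.MULTILINE, ^ can only match at position 0, so an
    anchored pattern is replaced at most once.
    """
    return pattern.sub(repl, s)


def sub_restarting(pattern, repl, s):
    """Hypothetical strategy: replace one match at a time and restart
    the engine on the remaining text.

    The remainder is sliced off into a new string, so ^ matches again
    at its beginning. (Passing a start offset to pattern.search would
    not have this effect in Python: ^ still refers to the real start
    of the string.)
    """
    out = []
    rest = s
    while rest:
        m = pattern.search(rest)
        # Stop when nothing matches; also stop on an empty match,
        # which would otherwise loop forever.
        if m is None or m.start() == m.end():
            break
        out.append(rest[:m.start()])
        out.append(repl)
        rest = rest[m.end():]
    out.append(rest)
    return ''.join(out)


print(repr(sub_single_pass(PATTERN, '', ' # # foo')))  # ' # foo'
print(repr(sub_restarting(PATTERN, '', ' # # foo')))   # ' foo'
```

<div dir="ltr">On this input the single-pass version agrees with the anchored Perl one-liner, while the restarting version gives the same result as the unanchored one.</div>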