[developers] EDM implementations

Mon Jan 27 18:42:15 CET 2020

hi mike,

belatedly, thanks (once again) for pushing forward standardization!
and also my apologies for returning to this thread a little late!

regarding EDM, until recently i thought of the Common-Lisp implementation
(which it appears i produced in early 2012, i.e. more recently than
the Perl version by bec) as the reference.  last year, comparing its
scores to my re-implementation in Python as part of mtool also turned
up the two questions you raised, viz. the treatment of the TOP property
and how to score parameterized predicates.

regarding the first, this appears to be one of the better-kept secrets
in meaning representation comparison: in my view, it is a semantically
highly relevant property (marking the contrast between e.g. 'all
fierce dogs bark' vs. 'all barking dogs are fierce'), but neither the
original EDM paper nor its derivative in the AMR world (Cai & Knight,
2013) discusses it.  yet, both the Lisp implementation of EDM and SMATCH
seem to always have scored the TOP node as an additional tuple
(counted among the 'argument' tuples for EDM, while considered among
the 'attribute' tuples in SMATCH).  the Perl implementation of EDM, on
the other hand, worked off my 'ltriples' export format for EDS, which
appears to not include a separate TOP tuple.

i confirmed the nature of those triples by reminding myself of what
became of the 'export' script mentioned in the original EDM wiki notes
you had found.  it was folded into the LOGON 'redwoods' script, so
something like the following actually works today to prepare the input
for the Perl implementation of EDM:

  $LOGONROOT/redwoods --erg --export ltriples --target /tmp mrs

i attach the output for item #21 from the MRS test suite, for
reference.  so, i agree with the conclusion bec and you have already
reached: the original Perl implementation of EDM did not consider TOP
tuples.  the Lisp implementation, on the other hand, appears to have
had TOP tuples from its very beginning.
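
to make the effect of the TOP tuple concrete, here is a minimal sketch
in Python.  it is not taken from any of the implementations discussed;
the tuple layout, the predicate names, and the helper prf() are
assumptions made purely for illustration.

```python
def prf(gold, system):
    """Return (precision, recall, F1) over two sets of tuples."""
    matched = len(gold & system)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# simplified, hand-written tuples for 'all fierce dogs bark'; nodes are
# identified by character spans, and this layout is an assumption
shared = {
    ("<4:10>", "NAME", "_fierce_a_1"),
    ("<11:15>", "NAME", "_dog_n_1"),
    ("<16:20>", "NAME", "_bark_v_1"),
    ("<4:10>", "ARG1", "<11:15>"),
    ("<16:20>", "ARG1", "<11:15>"),
}
gold_top = {("<16:20>", "TOP")}   # gold: 'bark' is the top node
system_top = {("<4:10>", "TOP")}  # system: wrongly picks 'fierce'

# without a TOP tuple the two graphs are indistinguishable
without_top = prf(shared, shared)
# with a TOP tuple, 5 of 6 tuples match on either side
with_top = prf(shared | gold_top, shared | system_top)

print(without_top)
print(with_top)
```

without the TOP tuple the wrong choice of top node goes unpunished
(all three scores are 1.0); with it, precision, recall, and F1 each
drop to roughly 0.83.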

regarding the second design choice you raise, parameterized relations
(involving one or more constant arguments), it appears that both the
Lisp and Perl implementations of EDM do the same thing, viz. assume
that there can be at most one constant argument in a relation and
'inline' its value (if present) with the predicate itself, e.g.
internally using node label shorthands like 'named(Abrams)'.  in this
regard, i suspect bec and you actually may have arrived at the wrong
conclusion about historic behavior; thus, personally, i see no reason
for pyDelphin to provide a special-cased version of EDM that wholly
ignores constant arguments.

looking at this particular design choice today, however, it seems too
limiting an assumption, and it meshes together two things that arguably
should be considered separate.  even though ERG versions for the past
15 or more years have not used predicates with multiple (constant)
parameters, there would be nothing wrong with representing, say, the
fraction '2/3' as involving two constant arguments, e.g. something
like fraction [ CARG1 "2", CARG2 "3" ].  this is, for example, what
AMR does for complex proper names.

thus, even though our two historic EDM implementations appear to agree
on the 'inlining' treatment of constant arguments, i would be prepared
to argue that CARG et al. values should rather be treated as separate
node properties, i.e. for the above example the 'named' predicate and
the 'CARG' == 'Abrams' value should be treated as two distinct tuples.
in part for cross-framework compatibility, this is what we ended up
doing in mtool, including in its re-implementation of EDM, see:

  http://mrp.nlpl.eu/index.php?page=5
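
the contrast between the two treatments can be sketched as follows.
this is a hypothetical illustration in Python, not code from mtool or
from the Lisp or Perl implementations; the function node_tuples(), its
parameters, and the tuple layout are all assumptions.

```python
def node_tuples(span, predicate, constants, inline=False):
    """Return the set of tuples contributed by one node.

    constants maps role names to constant values, e.g. {"CARG": "Abrams"}.
    With inline=True the (single) constant is folded into the node label;
    otherwise each constant becomes a node property tuple of its own.
    """
    if inline:
        if len(constants) > 1:
            # the historic treatment assumes at most one constant argument
            raise ValueError("inlining supports at most one constant")
        label = predicate
        for value in constants.values():
            label = f"{predicate}({value})"
        return {(span, "NAME", label)}
    tuples = {(span, "NAME", predicate)}
    tuples |= {(span, role, value) for role, value in constants.items()}
    return tuples


# the system finds the right predicate but the wrong constant value
gold_inline = node_tuples("<0:6>", "named", {"CARG": "Abrams"}, inline=True)
system_inline = node_tuples("<0:6>", "named", {"CARG": "Browne"}, inline=True)
gold_separate = node_tuples("<0:6>", "named", {"CARG": "Abrams"})
system_separate = node_tuples("<0:6>", "named", {"CARG": "Browne"})

print(len(gold_inline & system_inline))      # no partial credit when inlined
print(len(gold_separate & system_separate))  # the 'named' tuple still matches

# two constant arguments are unproblematic as separate properties
fraction = node_tuples("<0:3>", "fraction", {"CARG1": "2", "CARG2": "3"})
print(len(fraction))
```

with inlining, a wrong constant value makes the whole node label
mismatch; with separate tuples, the predicate and each constant are
scored independently, and relations with more than one constant
argument need no special treatment.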

in summary, it sounds as if your EDM re-implementation, mike, had
arrived at the same conclusions: TOP tuples should be scored, and
constant arguments considered as separate properties.  i would expect
your implementation and mtool to then come to the exact same
results (on EDSs stripped of MRS variable properties, which the
current mtool EDS reader deliberately discards; see below)?  seeing as
we have identified two ways in which this way of computing EDM differs
from the original publication and the two earlier implementations (in
Perl and Lisp), i would like to suggest we formally coin this
refinement of the metric EDM 2.0.

regarding how to deal with missing graphs on either the gold or system
side of the comparison: it appears the Lisp implementation of EDM
provides a toggle *redwoods-score-all-p*, which selects between two
modes of computing EDM over two sets of corresponding items, either on
the intersection of items only; or on their union, treating gaps on
either side of the comparison as empty graphs (thus, incurring recall
or precision penalties).  in practice, i believe we used to
near-exclusively compute EDM over sets of items for which there was
both a gold and a system graph.  but that can of course only give
comparable results when fixing that very set of items.  thus, the
setup of scoring 'all' items seems more general, robust to attempts at
gaming, and in my view should be considered the default.
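
the two modes can be sketched as follows.  again, this is a minimal
illustration in Python rather than a rendering of the Lisp code: the
function edm(), its score_all parameter, and the toy tuples are
assumptions, as is the pooling of counts over all items before
computing precision and recall.

```python
def edm(gold, system, score_all=True):
    """Score two dicts mapping item identifiers to sets of tuples.

    score_all=True compares the union of items, treating a missing graph
    on either side as empty; score_all=False compares only items that
    have both a gold and a system graph.
    """
    if score_all:
        items = gold.keys() | system.keys()
    else:
        items = gold.keys() & system.keys()
    matched = n_gold = n_system = 0
    for item in items:
        g = gold.get(item, set())
        s = system.get(item, set())
        matched += len(g & s)
        n_gold += len(g)
        n_system += len(s)
    precision = matched / n_system if n_system else 0.0
    recall = matched / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# toy data: the system parses item 1 perfectly but fails on item 2
gold = {1: {"a", "b", "c"}, 2: {"d", "e", "f"}}
system = {1: {"a", "b", "c"}}

intersection_scores = edm(gold, system, score_all=False)
union_scores = edm(gold, system, score_all=True)
print(intersection_scores)
print(union_scores)
```

on the intersection the parse failure is invisible (all scores are
1.0); on the union it costs recall, which drops to 0.5 while precision
stays at 1.0.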

finally, regarding variable properties in mtool: for the 2019 CoNLL
shared task on meaning representation parsing (MRP 2019), we had
agreed with other framework developers to keep morpho-semantic
decorations out of the comparison.  hence, the MRP 2019 graphs did not
include tense, aspect, or number information from the full ERSs.  but
technically, i would consider that a property of the EDS used in MRP
2019, not a design decision in mtool.  for the re-run of the MRP task
at CoNLL 2020, we are currently preparing to throw these properties
back into the mix (also in other frameworks, where annotations are
available), which means the EDS reader in mtool in the near future
will no longer discard (underlying) variable properties by default.

best wishes, oe

On Mon, Jan 20, 2020 at 2:15 AM goodman.m.w at gmail.com
<goodman.m.w at gmail.com> wrote:
>
> Thanks again, Bec.
>
> I just want to make sure my implementation gets the same scores for the same inputs under the same assumptions as the original implementation. For this to work, its behavior concerning the points I've sought clarification for should be intentional. In light of your responses, I've separated the CARG triples from other properties and have given them their own weight. Thus I should be able to get the same scores as your code by setting the weights of CARGs (but not properties) and graph-tops to zero. Similarly, I'll add an option to ignore missing test items and otherwise treat them as mismatches.
>
> On Fri, Jan 17, 2020 at 6:14 PM Bec Dridan <bec.dridan at gmail.com> wrote:
>>
>>
>>
>> On Fri, Jan 17, 2020 at 5:39 PM goodman.m.w at gmail.com <goodman.m.w at gmail.com> wrote:
>>>
>>>
>>> One more detail is what to do when the two sides (gold and test) have different numbers of items. Currently my code stops as soon as either a gold or test item is missing, which is what smatch (the similar metric made for AMR) does, but I think that may be wrong because parsing profiles are likely to have missing or extra (overgeneration) items in the middle. So the question is whether we ignore it or count it as a full mismatch.
>>
>>
>> If you are asking what is 'correct', I guess that depends on why you are evaluating. The perl implementation wouldn't have noticed missing gold parses, because it used the gold set as the definition of the set. A missing test item, on the other hand, by default counts as a full mismatch, but there is a command line option to ignore any gold parse with no corresponding test parse. The ignore option is useful when the purpose of the evaluation is assessing the system you are working on (and you consider coverage separately). For comparing across systems, I imagine you probably want to count parse failure as a full mismatch. It was useful for me to have both options.
>>
>> Bec
>>
>>>
>>>
>>> On Thu, Jan 16, 2020 at 6:33 PM Bec Dridan <bec.dridan at gmail.com> wrote:
>>>>
>>>> Wow, that is some old code... From memory, export was a wrapper around `parse --export`, where I could add :ltriples to the tsdb::*redwoods-export-values* set.
>>>>
>>>> I don't know the mtool code at all, but re-reading the paper and looking at the perl code, I don't think the original implementation evaluated CARG at all. We only checked that the correct character span had a pred name of `named`.
>>>>
>>>> I think you are right that the triple export at the time did not produce a triple for TOP and it hence would not have been counted.
>>>>
>>>> That match your memory Stephan?
>>>>
>>>> Bec
>>>>
>>>>
>>>> On Thu, Jan 16, 2020 at 8:34 PM goodman.m.w at gmail.com <goodman.m.w at gmail.com> wrote:
>>>>>
>>>>> Hello developers,
>>>>>
>>>>> Recently I wanted to try out Elementary Dependency Match (EDM) but I did not find an easy way to do it. I saw lisp code in the LKB's repository and Bec's Perl code, but I'm not sure how to call the former from the command line and the latter seems outdated (I don't see the "export" command required by its instructions).
>>>>>
>>>>> The Dridan & Oepen, 2011 algorithm was simple enough so I thought I'd implement it on top of PyDelphin. The result is here: https://github.com/delph-in/delphin.edm. It requires the latest version of PyDelphin (v1.2.0). It works with MRS, EDS, and DMRS, and it reads text files or [incr tsdb()] profiles.
>>>>>
>>>>> When I nearly had my version working I found that Stephan et al.'s mtool (https://github.com/cfmrp/mtool) also had an implementation of EDM, so I used that to compare with my outputs (as I couldn't get the previous implementations to work). In this process I think I found some differences from Dridan & Oepen, 2011's description, and this email is to confirm those findings. Namely, that mtool's (and now my) implementation do the following:
>>>>>
>>>>> * CARGs are treated as property triples ("class 3 information"). Previously they were combined with the predicate name. This change means that predicates like 'named' will match even if their CARGs don't and the CARGs are a separate thing that needs to be matched.
>>>>>
>>>>> * The identification of the graph's TOP counts as a triple.
>>>>>
>>>>> One difference between mtool and delphin.edm is that mtool does not count "variable" properties from EDS, but that's just because its EDS parser does not yet handle them while PyDelphin's does.
>>>>>
>>>>> Can anyone familiar with EDM confirm the above? Or can anyone explain how to call the Perl or LKB code so I can compare?
>>>>>
>>>>> --
>>>>> -Michael Wayne Goodman
>>>
>>>
>>>
>>> --
>>> -Michael Wayne Goodman
>
>
>
> --
> -Michael Wayne Goodman
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 21.gz
Type: application/gzip
Size: 271 bytes
Desc: not available
URL: <http://lists.delph-in.net/archives/developers/attachments/20200127/41fb96a1/attachment.bin>