[developers] Inter-annotator agreement for syntax

Arne Skjærholt arnskj at ifi.uio.no
Wed Oct 23 20:49:23 CEST 2013

Apologies for starting a new thread on this. I was told about the
original thread and subscribed to the list after it started, so I
don't have the original mails to reply to.

Anyways, I've been working on chance-corrected measures of
inter-annotator agreement for syntax recently, and the Tanaka et al.
paper is one of the best papers on IAA for syntactic annotation I've
read so far. But I agree with Emily that their "kappa" likely
underestimates expected agreement: it is closer to the related metric
S, which assigns all outcomes equal probability, than to kappa proper.

Measuring agreement on the discriminants is one option. In addition to
the approach in de Castro's thesis, which Antonio Branco mentioned,
I'd also consider Rebecca Passonneau's MASI[1], taking, for example,
the set of discriminants entailed by the selected parse as each
annotator's set. Of course, that still leaves the question of how to
deal with the "reject all trees" option.
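
A minimal sketch of MASI as a similarity between two annotators'
discriminant sets: the Jaccard overlap scaled by a monotonicity weight
(1 for identical sets, 2/3 when one set contains the other, 1/3 for
partial overlap, 0 for disjoint sets). The discriminant names (d1, d2,
...) are made up for illustration, and treating two empty sets as
identical is an assumption of this sketch.

```python
def masi(a: frozenset, b: frozenset) -> float:
    """MASI similarity: Jaccard overlap times a monotonicity weight."""
    if not a and not b:
        # Assumption: two empty sets count as full agreement.
        return 1.0
    inter = a & b
    jaccard = len(inter) / len(a | b)
    if a == b:
        weight = 1.0        # identical sets
    elif a <= b or b <= a:
        weight = 2 / 3      # one set subsumes the other
    elif inter:
        weight = 1 / 3      # overlap, but neither subsumes the other
    else:
        weight = 0.0        # disjoint sets
    return jaccard * weight


# Hypothetical discriminant sets entailed by each annotator's chosen parse.
annotator_1 = frozenset({"d1", "d2", "d3"})
annotator_2 = frozenset({"d1", "d2"})

similarity = masi(annotator_1, annotator_2)
distance = 1.0 - similarity   # usable as a distance between the two choices
```

Here the second set is a subset of the first, so the similarity is
(2/3) * (2/3) = 4/9.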

One option is to treat the reject option as a missing data point when
computing agreement on the syntactic structures, and compute rejection
agreement separately. If you cast reject as a binary choice (either
you select a tree or you reject all of them), you can use a standard
metric such as kappa or pi (it seems that pi is better than kappa,
from what I've read).
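
To make the difference between the coefficients concrete, here is a
small sketch for two annotators and the binary select/reject decision.
All three coefficients share the form (A_o - A_e) / (1 - A_e) and
differ only in how expected agreement A_e is estimated: S uses a
uniform distribution over the categories, pi pools both annotators'
label counts, and kappa uses each annotator's own distribution. The
data and the function name are invented for illustration.

```python
from collections import Counter


def agreement_coefficients(labels_a, labels_b, categories=("select", "reject")):
    """Return S, pi and kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)

    # S: every category is assumed equally likely.
    expected_s = 1 / len(categories)
    # pi: one distribution, pooled over both annotators.
    expected_pi = sum(((count_a[c] + count_b[c]) / (2 * n)) ** 2 for c in categories)
    # kappa: a separate distribution per annotator.
    expected_kappa = sum((count_a[c] / n) * (count_b[c] / n) for c in categories)

    def corrected(expected):
        # Undefined when expected agreement is already 1.
        if expected == 1:
            return float("nan")
        return (observed - expected) / (1 - expected)

    return {
        "S": corrected(expected_s),
        "pi": corrected(expected_pi),
        "kappa": corrected(expected_kappa),
    }


# Invented data: each annotator rejects 2 of 10 items, agreeing on 8.
ann_a = ["select"] * 3 + ["reject", "select", "reject"] + ["select"] * 4
ann_b = ["select"] * 2 + ["reject", "reject"] + ["select"] * 6

scores = agreement_coefficients(ann_a, ann_b)
```

With this data observed agreement is 0.8, and S comes out at 0.6 while
pi and kappa both come out at 0.375: because rejections are rare, the
uniform assumption behind S gives a lower expected agreement (0.5
rather than 0.68) and hence a higher score.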

In my own work, which has mostly been focused on dependency grammar,
I've made a metric that works directly on the syntactic
representations (by necessity, since discriminants aren't an option in
non-grammar-driven treebanking =). My metric works by using a distance
function between trees to compute agreement. It can probably be used
for Redwoods-style treebanks as well, with one caveat: The metric
can't be used on the AVMs themselves for reasons of computational
tractability: the distance function I'm using is O(n^2) for trees
whose nodes are ordered, but if the nodes are unordered the
problem is NP-hard. Thus, I expect the DAG case to be at least as

Therefore there are two options for Redwoods-style treebanks: either
we use a tree representation of the full structure, or we use an
approximate distance function (which might not exist for DAGs; I
haven't looked).
For the former option, I've thought of either using the derivation
trees, or one of the dependency conversions Angelina Ivanova and
others studied last year[2]. There are likely other options as well,
but I'm not familiar enough with the representations (especially the
MRSes) to know what's possible.
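
One generic way to turn a distance function into a chance-corrected
coefficient is Krippendorff's alpha, 1 - D_o / D_e, where D_o is the
mean distance between annotations of the same sentence and D_e the
mean distance between all pairs of annotations in the pooled data. The
sketch below assumes that formulation; it is not necessarily the
metric described above. The toy distance (number of tokens whose head
differs, plus the length difference) is only a stand-in for a real
tree distance, and trees are encoded as tuples of head indices purely
for illustration.

```python
from itertools import combinations


def toy_distance(tree_x, tree_y):
    """Stand-in distance: differing heads plus difference in length."""
    mismatches = sum(hx != hy for hx, hy in zip(tree_x, tree_y))
    return mismatches + abs(len(tree_x) - len(tree_y))


def alpha(units, distance):
    """Krippendorff-style alpha; units holds one list of annotations per item.

    Assumes the distance function is symmetric.
    """
    units = [u for u in units if len(u) >= 2]   # singly annotated items carry no information
    pooled = [value for u in units for value in u]
    n = len(pooled)
    # Observed disagreement: pairwise distances within each item.
    d_observed = sum(
        2 * sum(distance(x, y) for x, y in combinations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: pairwise distances over the pooled annotations.
    d_expected = 2 * sum(distance(x, y) for x, y in combinations(pooled, 2)) / (n * (n - 1))
    if d_expected == 0:
        return float("nan")   # undefined when all annotations are identical
    return 1.0 - d_observed / d_expected


# Two sentences, two annotators each; each tuple gives every token's head.
annotations = [
    [(2, 0, 2), (2, 0, 2)],   # annotators agree
    [(0, 1, 1), (0, 1, 2)],   # annotators differ on one head
]
score = alpha(annotations, toy_distance)
```

Any symmetric distance can be passed in place of toy_distance, for
example a tree edit distance over derivation trees or over one of the
dependency conversions.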

In a context based on distance functions we also have a second option
for handling "reject all trees": simply define a distance between a
selected tree and a rejection. Either consider a rejection an empty
tree, or say that the distance between a rejection and an actual tree
is some fixed distance. Both choices have potential drawbacks, I
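
Both ways of handling a rejection can be expressed as a thin wrapper
around an existing tree distance. This is only a sketch: representing
a rejection as None, the wrapper name, and the toy base distance are
all assumptions made for illustration.

```python
REJECT = None   # hypothetical marker for "reject all trees"


def toy_distance(tree_x, tree_y):
    """Stand-in distance: differing heads plus difference in length."""
    mismatches = sum(hx != hy for hx, hy in zip(tree_x, tree_y))
    return mismatches + abs(len(tree_x) - len(tree_y))


def with_rejection(distance, fixed=None, empty_tree=()):
    """Extend a tree distance so that either argument may be a rejection.

    If fixed is given, a rejection is at that fixed distance from any tree;
    otherwise the rejection is treated as the empty tree.
    """
    def wrapped(x, y):
        if x is REJECT and y is REJECT:
            return 0                      # two rejections agree
        if x is REJECT or y is REJECT:
            if fixed is not None:
                return fixed              # option 2: fixed distance
            tree = y if x is REJECT else x
            return distance(empty_tree, tree)   # option 1: empty tree
        return distance(x, y)
    return wrapped


as_empty_tree = with_rejection(toy_distance)
as_fixed_cost = with_rejection(toy_distance, fixed=5)
```

With the empty-tree option the cost of a rejection grows with the size
of the other annotator's tree, whereas with the fixed option it is the
same for every sentence regardless of length.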


1: Rebecca Passonneau (2006): "Measuring Agreement on Set-valued Items
(MASI) for Semantic and Pragmatic Annotation" LREC '06.

2: Angelina Ivanova, Stephan Oepen, Lilja Øvrelid, Dan Flickinger
(2012): "Who did what to whom? A contrastive study of
syntacto-semantic dependencies" LAW '12.
