[developers] Inter-annotator agreement for Redwoods treebanks

Mon Oct 21 18:51:33 CEST 2013

A comment on productivity and reliability instead of inter-annotator 
agreement per se...

Some of the references imply that user-selected discriminants are not
distinguished from implied discriminants, which sacrifices too much
information.  We want to know every user selection and whether or not
it is productive and reliable.

We find that semantic discriminants are the most productive and
reliable in disambiguation, and we keep track of the specific choices
made by each annotator (we call them disambiguators, since
"annotation" seems to be a misnomer in the discriminant approach).  In
our experience, disambiguation of nominal vs. verbal or adjectival and
prepositional attachments is most significant, for example.

Indeed, we suppress some if not most discriminants (such as those
concerning quantification and generic entities) because they are
unreliable (i.e., error-prone).  By default, we also suppress lexical
and syntactic discriminants and, in the case of MRS, features of
arguments of predications.  This suppression has an insignificant
impact on the number of selections required for disambiguation, and it
improves ease of use by reducing perplexity and the linguistic
expertise required.

When syntactic discriminants are used, bottom-up disambiguation of noun
phrases is more reliable than presenting arbitrarily deep syntactic
structures.  Syntactic disambiguation is less reliable than semantic
disambiguation, however.  This shows up within each annotator's work
(rather than across annotators) as "resetting", so we recommend that
all choices made during disambiguation, not just the final results, be
tracked.
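
To illustrate what tracking every choice might look like, here is a
minimal sketch in Python.  The record layout, the action names
("select", "reject", "reset") and the function name are assumptions
made for illustration, not a description of any existing tool; in
particular, it assumes a reset is logged as its own event against the
discriminant whose choice was undone.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    """One logged event; all field names are hypothetical."""
    annotator: str     # who made the choice
    item: str          # utterance identifier
    discriminant: str  # discriminant the event refers to
    kind: str          # e.g. "semantic", "syntactic", "lexical"
    action: str        # "select", "reject", or "reset"


def reset_rate(log):
    """Share of logged events that are resets, per (annotator, kind)."""
    totals = defaultdict(int)
    resets = defaultdict(int)
    for c in log:
        key = (c.annotator, c.kind)
        totals[key] += 1
        if c.action == "reset":
            resets[key] += 1
    return {key: resets[key] / totals[key] for key in totals}


# Toy log: one annotator, one utterance, one syntactic choice undone.
log = [
    Choice("a1", "s1", "d1", "semantic", "select"),
    Choice("a1", "s1", "d2", "syntactic", "select"),
    Choice("a1", "s1", "d2", "syntactic", "reset"),
    Choice("a1", "s1", "d3", "syntactic", "reject"),
    Choice("a1", "s1", "d4", "semantic", "select"),
]
rates = reset_rate(log)
```

Because the log keeps intermediate events rather than only the final
state, the reset on d2 stays visible even if the final result is
correct.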

We are currently working on metrics for the reliability of
disambiguation that are a function of discriminant "risk" and utterance
complexity.  Our goal in using these metrics, in addition to agreement
across redundant disambiguation, is to focus crowd disambiguation where
it adds the most information to automated parsing.  The application
area is obtaining appropriately precise logical semantics from
textbooks and other authoritative publications for so-called "adaptive
learning" and deep QA in educational applications.
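
For the agreement side, one standard chance-corrected measure that
could be computed over discriminants is Cohen's kappa for two
disambiguators.  The sketch below assumes each discriminant receives
one label per disambiguator ("yes", "no", or "unset"); that label set
is an assumption for illustration, and how to treat the "reject all
trees" option is left open here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of discriminants labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each disambiguator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Toy data: decisions on six discriminants of one utterance.
a = ["yes", "yes", "no", "unset", "no", "yes"]
b = ["yes", "no", "no", "unset", "no", "yes"]
kappa = cohens_kappa(a, b)
```

Here the two disambiguators agree on five of six discriminants, and
chance agreement is 13/36, so kappa comes out at 17/23.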

On 10/15/2013 12:56 PM, Emily M. Bender wrote:
> Hi all,
>
> Is there an accepted, chance-corrected measure for inter-annotator
> agreement with Redwoods treebanks?  It seems to me that measuring
> chance agreement over discriminants would make more sense than
> measuring over trees, and I'm not quite sure how to conceptualize the
> chance agreement for the "reject all trees" option...
>
> Emily
>