[developers] Inter-annotator agreement for Redwoods treebanks
Paul Haley
paul at haleyai.com
Mon Oct 21 18:51:33 CEST 2013
A comment on productivity and reliability instead of inter-annotator
agreement per se...
Some of the references imply that user-selected discriminants are not
distinguished from implied discriminants, which sacrifices too much
information. We want to know every user selection and whether or not it
is productive and reliable.
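
As a rough illustration of that kind of bookkeeping, here is a minimal
Python sketch. The record layout, the reading of "productivity" as the
fraction of remaining readings a selection eliminates, and every name
in it are assumptions made for this example, not a description of any
existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Selection:
    """One discriminant decision (hypothetical record layout)."""
    annotator: str
    kind: str            # e.g. "semantic", "syntactic", "lexical"
    user_selected: bool  # False when the decision was only implied
    readings_before: int
    readings_after: int


def productivity_by_kind(selections):
    """Mean fraction of readings eliminated per user selection, by kind.

    Implied discriminants are skipped so that they cannot inflate the
    figures for choices the user actually made.
    """
    fractions = defaultdict(list)
    for s in selections:
        if not s.user_selected or s.readings_before <= 0:
            continue
        eliminated = s.readings_before - s.readings_after
        fractions[s.kind].append(eliminated / s.readings_before)
    return {kind: sum(v) / len(v) for kind, v in fractions.items()}


# Illustrative data only, not measurements.
log = [
    Selection("a1", "semantic", True, 40, 10),
    Selection("a1", "semantic", False, 10, 10),  # implied, ignored
    Selection("a1", "syntactic", True, 10, 5),
    Selection("a2", "semantic", True, 40, 20),
]
print(productivity_by_kind(log))
```

Keeping the user_selected flag on every record is what preserves the
distinction between user-selected and implied discriminants.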
We find that the semantics is most productive and reliable in
disambiguation, and we keep track of the specific choices made per
annotator (we call them disambiguators, since "annotation" seems to be
a misnomer in the discriminant approach). In our experience, for
example, disambiguating nominal vs. verbal or adjectival readings and
prepositional attachments is most significant.
Indeed, we suppress some if not most discriminants (such as those
concerning quantification and generic entities) because they are
unreliable (i.e., error-prone). By default, we also suppress lexical
and syntactic discriminants and, in the case of MRS, features of
arguments of predications. This suppression has an insignificant impact
on the number of selections required for disambiguation, and it improves
ease of use by reducing perplexity and linguistic expertise requirements.
When syntactic discriminants are used, bottom-up disambiguation of noun
phrases is more reliable than presenting arbitrarily deep syntactic
structures. Syntactic disambiguation is less reliable than semantic,
however. This shows up per annotator (rather than across them) in
"resetting", so we recommend that all choices made during
disambiguation, not just its final results, be considered (i.e., tracked).
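
To make the point about "resetting" concrete, here is a minimal Python
sketch that derives a per-annotator reset rate for each discriminant
type from a full event log rather than from final results. The event
vocabulary ("select", "reset"), the tuple layout, and the function name
are all hypothetical.

```python
from collections import Counter


def reset_rates(events):
    """Return {(annotator, kind): resets / selections}.

    `events` is an iterable of (annotator, kind, action) tuples, where
    action is "select" or "reset" (an assumed event vocabulary).
    """
    selects = Counter()
    resets = Counter()
    for annotator, kind, action in events:
        if action == "select":
            selects[(annotator, kind)] += 1
        elif action == "reset":
            resets[(annotator, kind)] += 1
    # A Counter yields 0 for keys that never saw a reset.
    return {key: resets[key] / n for key, n in selects.items()}


# Illustrative data only, not measurements.
events = [
    ("a1", "syntactic", "select"),
    ("a1", "syntactic", "reset"),
    ("a1", "syntactic", "select"),
    ("a1", "semantic", "select"),
]
rates = reset_rates(events)
print(rates)
```

A log that kept only the final state would show two surviving choices
here and hide the syntactic reset entirely.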
We are currently working on metrics related to reliability of
disambiguation that are a function of discriminant "risk" and utterance
complexity. Our goal in using these metrics, in addition to agreement
across redundant disambiguation, is of course to focus crowd
disambiguation where it adds the most information to automated
parsing. The application
area is obtaining appropriately precise logical semantics from textbooks
and other authoritative publications for so-called "adaptive learning"
and deep QA in educational applications.
On 10/15/2013 12:56 PM, Emily M. Bender wrote:
> Hi all,
>
> Is there an accepted, chance-corrected measure for inter-annotator
> agreement with Redwoods treebanks? It seems to me that measuring
> chance agreement over discriminants would make more sense than
> measuring over trees, and I'm not quite sure how to conceptualize the
> chance agreement for the "reject all trees" option...
>
> Emily
>
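
For reference, one way to compute chance-corrected agreement over
discriminants, as raised in the quoted question, is Cohen's kappa. The
sketch below assumes two annotators who each label the same list of
discriminants "yes", "no", or "unset". The label set and function name
are assumptions for illustration, and the sketch does not address how
to treat the "reject all trees" option.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of discriminants labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only, not real annotation data.
a = ["yes", "yes", "no", "no"]
b = ["yes", "no", "no", "no"]
print(cohen_kappa(a, b))
```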