[matrix] Lexicon building
Dan Flickinger
danf at csli.Stanford.EDU
Tue Jan 23 19:41:31 CET 2007
Hi Petter -
My recent experience with enlarging the lexicon for the ERG indicates
that manual addition of open-class lexical entries (nouns, verbs, adjectives,
and adverbs in English) averages out to about 50 entries per hour, under the
following conditions:
(1) The target lexical types have been defined and documented
(2) The lexicographer has been trained to understand the distinctions drawn
by these lexical types in the grammar
(3) The goal is to add all of the lexical entries for each word (including
subcategorization variants), avoiding lexical gaps where a word that
can be used, say, as both a noun and a verb is represented only by the noun
lexical entry in the lexicon. (These gaps are the most expensive ones
to repair later, since it is often difficult to identify this source
of a failed analysis by the grammar.)
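To make condition (3) concrete, here is a minimal sketch of a gap check. It assumes a hypothetical reference table of the categories each word is attested in (for example, drawn from a tagged corpus) and a lexicon flattened to (word, category) pairs. The function name and both data layouts are illustrative assumptions, not part of the ERG tooling.

```python
def find_lexical_gaps(lexicon, attested):
    """Return {word: missing categories} for words already in the lexicon.

    lexicon  -- iterable of (word, category) pairs (hypothetical layout)
    attested -- dict mapping word -> set of categories it is known to occur in
                (hypothetical reference data, e.g. from a tagged corpus)
    """
    covered = {}
    for word, category in lexicon:
        covered.setdefault(word, set()).add(category)

    gaps = {}
    for word, categories in attested.items():
        if word not in covered:
            # A wholly unknown word is easy to spot when parsing fails;
            # the costly case is a known word with a missing entry.
            continue
        missing = set(categories) - covered[word]
        if missing:
            gaps[word] = missing
    return gaps


# Toy data, invented for illustration only.
lexicon = [("walk", "noun"), ("walk", "verb"), ("bank", "noun")]
attested = {"walk": {"noun", "verb"}, "bank": {"noun", "verb"}, "zebra": {"noun"}}
gaps = find_lexical_gaps(lexicon, attested)
```

With this toy data, only "bank" is reported, since it has a noun entry but no verb entry.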
For the ERG lexicon I had the help of two graduate students and two
undergraduates, and found that over time this rate of 50 entries an hour
was sustainable and relatively consistent. At this rate, a lexicon of
30,000 entries would take 600 person-hours, or 15 person-weeks. Of course,
some portion of that lexicon will consist of proper names, and if the
list of names already exists or can be easily extracted from e.g. a
gazetteer, these entries can safely be constructed automatically, after
checking as always for lexical ambiguity to avoid lexical gaps.
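The arithmetic behind this estimate can be sketched as follows. The 40-hour working week is an assumption inferred from the 600-hour and 15-week figures, and the `auto_entries` parameter is a hypothetical way to subtract entries (such as proper names) that are constructed automatically.

```python
def estimate_effort(total_entries, entries_per_hour=50, hours_per_week=40,
                    auto_entries=0):
    """Return (person_hours, person_weeks) for manual lexicon entry.

    entries_per_hour -- the sustained rate reported above (50)
    hours_per_week   -- assumed 40, inferred from 600 hours = 15 weeks
    auto_entries     -- hypothetical count of entries built automatically
    """
    manual_entries = total_entries - auto_entries
    person_hours = manual_entries / entries_per_hour
    person_weeks = person_hours / hours_per_week
    return person_hours, person_weeks


hours, weeks = estimate_effort(30_000)
print(f"{hours:.0f} person-hours, {weeks:.0f} person-weeks")
```

Passing a nonzero `auto_entries` shows how much the manual effort shrinks once an existing name list covers part of the lexicon.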
Regards,
Dan