[matrix] Lexicon building

Dan Flickinger danf at csli.Stanford.EDU
Tue Jan 23 19:41:31 CET 2007


Hi Petter -

My recent experience with enlarging the lexicon for the ERG indicates
that manual addition of open-class lexical entries (nouns, verbs, adjectives,
and adverbs in English) averages out to about 50 entries per hour, under the 
following conditions:

(1) The target lexical types have been defined and documented
(2) The lexicographer has been trained to understand the distinctions drawn
    by these lexical types in the grammar
(3) The goal is to add all of the lexical entries for each word (including
    subcategorization variants), avoiding lexical gaps where a word which 
    can be used, say, as a noun and a verb is only represented by the noun 
    lexical entry in the lexicon.  (These gaps are the most expensive ones 
    to repair later, since it is often difficult to identify this source 
    of a failed analysis by the grammar.)
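
As a rough illustration of the gap described in (3), here is a minimal sketch
that flags words already in the lexicon but missing a category they are known
to take. The data layout (a plain word-to-categories mapping), the category
labels, and the function name are all assumptions made for illustration; this
is not the ERG lexicon format or any existing tool.

```python
from typing import Dict, Set


def find_lexical_gaps(
    lexicon: Dict[str, Set[str]],
    attested: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return categories attested for a word but absent from its lexicon entries.

    Hypothetical inputs:
      lexicon  - word -> categories that already have a lexical entry
      attested - word -> categories seen in some external source
                 (e.g. a tagged corpus or a dictionary)
    Only words that are already in the lexicon are reported, since a word
    that is missing entirely is a different (and easier to spot) problem.
    """
    gaps: Dict[str, Set[str]] = {}
    for word, categories in attested.items():
        if word not in lexicon:
            continue
        missing = categories - lexicon[word]
        if missing:
            gaps[word] = missing
    return gaps


# Toy data, invented for this example.
lexicon = {"walk": {"noun"}, "table": {"noun", "verb"}}
attested = {"walk": {"noun", "verb"}, "table": {"noun"}, "zebra": {"noun"}}

gaps = find_lexical_gaps(lexicon, attested)
print(gaps)
```

With the toy data above, only "walk" is reported, because its verb use has no
entry; "zebra" is skipped because it is not in the lexicon at all.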

For the ERG lexicon I had the help of two graduate students and two 
undergraduates, and found that over time this rate of 50 entries an hour
was sustainable and relatively consistent.  At this rate, a lexicon of
30,000 entries would take 600 person-hours, or 15 person-weeks at 40 hours
per week.  Of course,
some portion of that lexicon will consist of proper names, and if the
list of names already exists or can be easily extracted from e.g. a
gazetteer, these entries can safely be constructed automatically, after
checking as always for lexical ambiguity to avoid lexical gaps.
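
The arithmetic behind that estimate can be written out as a small sketch.
The rate of 50 entries per hour and the 30,000-entry target come from the
figures above; the 40-hour week is inferred from 600 hours equalling 15
weeks, and the function name and the auto_entries parameter (entries built
automatically, such as proper names) are assumptions added for illustration.

```python
def estimate_effort(entries, entries_per_hour=50, hours_per_week=40, auto_entries=0):
    """Return (person_hours, person_weeks) for manually adding lexical entries.

    auto_entries is a hypothetical count of entries (e.g. proper names)
    assumed to be constructed automatically and so excluded from manual work.
    """
    manual_entries = entries - auto_entries
    person_hours = manual_entries / entries_per_hour
    person_weeks = person_hours / hours_per_week
    return person_hours, person_weeks


# The case from the message: 30,000 entries, all added by hand.
hours, weeks = estimate_effort(30_000)
print(f"{hours:.0f} person-hours, {weeks:.0f} person-weeks")

# Purely hypothetical variation: 5,000 of the entries are names built automatically.
hours_with_names, weeks_with_names = estimate_effort(30_000, auto_entries=5_000)
print(f"{hours_with_names:.0f} person-hours, {weeks_with_names:.1f} person-weeks")
```

The first call reproduces the 600 person-hours and 15 person-weeks quoted
above; the second only shows how the total shrinks if some share of the
lexicon does not need manual entry, and its 5,000 figure is invented.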

Regards,

 Dan



