[developers] Increasing parse coverage

Woodley Packard sweaglesw at sweaglesw.org
Sun Apr 5 01:41:08 CEST 2015


Hi Guy,

There are probably still some tricks you can employ, but I think 86.4% is in the ballpark of what you should expect to parse, given that many of the inputs are not intended to be sentences in the traditional sense.  The default parsing configuration does not permit the "informal" root symbols, which accounts for the lower (64.4%) coverage.

You could try using TNT instead of ACE's built-in tagger.  Something like --tnt-model=${DELPHINROOT}/coli/tnt/models/wsj should do that for you; I'd be surprised if it helped much, though.  I think Ann's suggestions are more likely to bear fruit -- e.g. helping the grammar recognize that Take Care of My Cat is a proper name.  I also notice that "cinema" is not available as a mass noun in the ERG.  If I were you, I would grab a sample of 20 or so items that don't parse and figure out exactly why each of them doesn't (this is fairly easy to do by substituting bits of the sentence with something simpler until it does parse).  You may well discover some frequent problems that are easy to correct.
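
For concreteness, combining that flag with your second invocation would look something like this (untested, and the model path assumes the usual LOGON-style tree under ${DELPHINROOT}):

    ace -g erg.dat -1Tq \
        --tnt-model=${DELPHINROOT}/coli/tnt/models/wsj \
        -r "root_strict root_frag root_informal root_inffrag" filename

To pull out that sample of unparsed items, one crude approach is to parse one item at a time and keep the ones that come back empty.  This assumes an item that fails to parse produces nothing on stdout, and it reloads the grammar on every item, so it's slow -- but fine for grabbing 20 examples:

    # collect the first 20 items that get no analysis (slow: one ACE run per item)
    while IFS= read -r sent; do
      if [ -z "$(printf '%s\n' "$sent" | ace -g erg.dat -1Tq \
          -r "root_strict root_frag root_informal root_inffrag" 2>/dev/null)" ]; then
        printf '%s\n' "$sent"
      fi
    done < sentibank.txt | head -20 > unparsed-sample.txt

And if Ann's hunch about the bizarre "..." tokens is right, a quick pre-pass along the lines of

    sed 's/\.\{3,\}//g' sentibank.txt > cleaned.txt

would let you measure how much of the coverage gap they account for.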

-Woodley

On Apr 4, 2015, at 2:38 PM, Ann Copestake <aac10 at cam.ac.uk> wrote:

> I hope people more knowledgeable than I am will comment, but:
> 
> - titles without quotes won't help - I don't know if there's any mileage in trying to insert the quotes via a list of movie names.
> 
> - there are many very bizarre uses of "..." - any idea where they came from?  It doesn't look like normal punctuation use to me.  I would be tempted to just try removing all of them ...
> 
> Ann
> 
> On 2015-04-04 17:01, Guy Emerson wrote:
>> I'm trying to use the ERG to produce DMRSs for a sentiment analysis
>> task.  However, I'm getting relatively low coverage at the moment.
>> I have run ACE with a freshly downloaded pre-compiled ERG as follows:
>> ace -g erg.dat -1Tq filename
>> ace -g erg.dat -1Tq -r "root_strict root_frag root_informal root_inffrag" filename
>> In the first case, I got 64.4% coverage, and in the second, 86.4%.
>> Are there any further tricks I could use to improve coverage?  I'm
>> using the Stanford Sentiment Treebank, and I've put a
>> 'sentence'-segmented version of the text here:
>> https://raw.githubusercontent.com/guyemerson/Sentimantics/master/data/sentibank.txt
>> Many lines are noun phrases or adjective phrases.  There are also a
>> lot of make-it-up-as-you-go hyphenated tokens.
>> Best,
>> Guy.