[developers] Increasing parse coverage
Dan Flickinger
danf at stanford.edu
Sun Apr 5 10:03:35 CEST 2015
Hi Guy -
In addition to using TnT, I'd suggest increasing the resource limits, which are by default relatively conservative. (You can also simplify the list of roots that you specify, since the stricter ones are subsumed by the "informal" ones, though this won't give you more coverage, just less clutter in the invocation of ACE.) I experimented this afternoon with also adding `root_robust', but it seems to produce about equal amounts of noise and benefit (both modest), so it's not worth the trouble. Along the way, I found and fixed some erroneous lexical entries that showed up in nonparsed Sentibank items, which should help improve coverage somewhat. So I'd recommend updating your `trunk' ERG, then recompiling erg.dat as follows:
cd erg; ace -G erg.dat -g ace/config.tdl
and then calling ACE with something like this:
ace -g erg.dat -1Tq --max-chart-megabytes 7000 --max-unpack-megabytes 8000 --timeout 60 -r "root_informal root_inffrag" --tnt-model=${DELPHINROOT}/coli/tnt/models/wsj <filename>
I saw a little over 92% coverage on the Sentibank items this evening with this configuration, so if you try it and get a lot less, I'd be glad to know. But I agree with Woodley that even in the best case, you're going to lose some noticeable portion of the corpus when parsing with the standard ERG configuration.
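For a quick sanity check on those coverage numbers, you can tally ACE's stdout yourself after redirecting it to a file (parses.txt below is just a placeholder name). This is only a rough sketch: it assumes that with -1Tq each input item comes out on stdout as a block terminated by a blank line, with an empty block when the item gets no parse, so anything else your ACE version echoes to stdout will skew the count:
ace -g erg.dat -1Tq [same options as above] <filename> > parses.txt
awk '/^$/ { total++; if (seen) parsed++; seen = 0; next } { seen = 1 } END { printf "%d/%d parsed (%.1f%%)\n", parsed, total, (total ? 100 * parsed / total : 0) }' parses.txt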
I'll experiment tomorrow with also adding in the bridging machinery that we talked about in Tomar, to see if it can be useful for your task. It can in principle fill in almost all of the missing items, and it got pretty close recently with the items in the Sherlock Holmes story The Speckled Band, but it's still highly experimental, and may not do so well in parse selection on the sentiment text.
Cheers,
Dan
----- Original Message -----
From: "Francis Bond" <bond at ieee.org>
To: "Woodley Packard" <sweaglesw at sweaglesw.org>
Cc: "Ann Copestake" <aac10 at cam.ac.uk>, developers at delph-in.net, "Guy Emerson" <gete2 at cam.ac.uk>
Sent: Saturday, April 4, 2015 5:02:37 PM
Subject: Re: [developers] Increasing parse coverage
We got a pretty big boost from using TNT --- without it we didn't seem
to be getting any unknown word handling. Maybe there's a setting we
missed?
On Sun, Apr 5, 2015 at 7:41 AM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
> Hi Guy,
>
> There are probably still some tricks you can employ, but I think 86.4% is in the ballpark of what you should expect to parse, given that lots of the inputs are not intended to be sentences in the traditional sense. The default parsing configuration does not permit the "informal" root symbols, leading to the lower (64.4%) coverage.
>
> You could try using TNT instead of ACE's built-in tagger. Something like --tnt-model=${DELPHINROOT}/coli/tnt/models/wsj should do that for you; I'd be surprised if it helps much, though. I think Ann's suggestions are more likely to bear fruit -- e.g. helping the grammar recognize that Take Care of My Cat is a proper name. I also notice that "cinema" is not available as a mass noun in the ERG. If I were you, I would grab a sample of 20 or so items that don't parse and figure out exactly why each of them doesn't (this is fairly easy to do by substituting bits of the sentence with something simpler until it does parse). You may well discover some frequent problems that are easy to correct.
>
> -Woodley
>
> On Apr 4, 2015, at 2:38 PM, Ann Copestake <aac10 at cam.ac.uk> wrote:
>
>> I hope people more knowledgeable than I am will comment, but:
>>
>> - titles without quotes won't help - I don't know if there's any mileage in trying to insert these via a list of movie names.
>>
>> - there are many very bizarre uses of ... - any idea where they came from? It doesn't look like normal punctuation use to me. I would be tempted to just try removing all of them ...
>>
>> Ann
>>
>> On 2015-04-04 17:01, Guy Emerson wrote:
>>> I'm trying to use the ERG to produce DMRSs for a sentiment analysis
>>> task. However, I'm getting relatively low coverage at the moment.
>>> I have run ACE with a freshly downloaded pre-compiled ERG as follows:
>>> ace -g erg.dat -1Tq filename
>>> ace -g erg.dat -1Tq -r "root_strict root_frag root_informal
>>> root_inffrag" filename
>>> In the first case, I got 64.4% coverage, and in the second, 86.4%.
>>> Are there any further tricks I could use to improve coverage? I'm
>>> using the Stanford Sentiment Treebank, and I've put a
>>> 'sentence'-segmented version of the text here:
>>> https://raw.githubusercontent.com/guyemerson/Sentimantics/master/data/sentibank.txt
>>> Many lines are noun phrases or adjective phrases. There are also a
>>> lot of make-it-up-as-you-go hyphenated tokens.
>>> Best,
>>> Guy.
>
>
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University