[developers] Fwd: a couple of DeepBank questions

Fri Feb 1 10:10:42 CET 2013

Begin forwarded message:

> From: Yi Zhang <yizhang at dfki.de>
> Subject: Re: [developers] a couple of DeepBank questions
> Date: February 1, 2013 10:09:04 AM GMT+01:00
> To: Megan Schneider <caelum at gmail.com>
> Cc: developers at delph-in.net
> 
> hi Megan,
> 
>> 1) How do the DeepBank sentence identifiers map to the Penn Treebank? (20020005 appears to map to the 5th sentence in RAW/parsed/mrg/wsj/00/wsj_0020.mrg and 20001001 appears to map to the first sentence in 00/wsj_0001.mrg from looking at where the sentences in question exist)
>> 
> yes, your observation is right. the sentence identifiers in DeepBank are 8-digit integers: a leading "2", then 4 digits giving the file number in the PTB (e.g. 0234 is from the file 02/wsj_0234.mrg), then 3 digits giving the sentence number in that file (starting from 1).
> 
> 
>> 2) Does anyone have a version of the Penn Treebank which is limited to only those parses/sentences also contained in DeepBank?
>> 
>> 
> you can find a simple perl script at the following link; it selects and prints the subset of the PTB (in the original .mrg format) that matches a list of sentence ids.
> http://www.coli.uni-saarland.de/~yzhang/files/select-ptb-with-iid.pl
> 
> a simple way of getting the id list is the following command line (assuming you are doing it on the DeepBank release 0.9, which contains thinned tsdb profiles):
>  
>  $ for i in deepbank-0.9/tsdb/*.1; do zcat "$i/result.gz" | cut -f 1 -d@; done > id-list.txt
> 
> afterwards, run the perl script:
>  $ perl select-ptb-with-iid.pl id-list.txt penntreebank3/parsed/mrg/wsj/ > ptb-deepbank-0.9.mrg
> 
> best,
> yi
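
For reference, the identifier scheme described above can be sketched as a small shell function. This is only an illustration: the name deepbank_id_to_ptb is hypothetical (it is not part of DeepBank or of select-ptb-with-iid.pl), and taking the section directory from the first two digits of the file number is an assumption inferred from the examples in this thread (0020 in 00/, 0234 in 02/).

```shell
# Hypothetical helper: split a DeepBank sentence id into the PTB file
# (relative to parsed/mrg/wsj/) and the 1-based sentence number.
deepbank_id_to_ptb() {
    id=$1
    # Accept only 8-digit ids with a leading "2".
    case $id in
        2[0-9][0-9][0-9][0-9][0-9][0-9][0-9]) ;;
        *) echo "not a DeepBank id: $id" >&2; return 1 ;;
    esac
    file=$(printf '%s' "$id" | cut -c2-5)       # 4-digit file number, e.g. 0020
    section=$(printf '%s' "$file" | cut -c1-2)  # assumed: section = first two digits
    sent=$(printf '%s' "$id" | cut -c6-8)       # 3-digit sentence number
    # Strip up to two leading zeros so that 005 becomes 5.
    sent=${sent#0}
    sent=${sent#0}
    printf '%s/wsj_%s.mrg %s\n' "$section" "$file" "$sent"
}
```

With the ids from Megan's question, "deepbank_id_to_ptb 20020005" prints "00/wsj_0020.mrg 5" and "deepbank_id_to_ptb 20001001" prints "00/wsj_0001.mrg 1".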
