[itsdb] parse big corpus with itsdb

David Martinez davidm at csse.unimelb.edu.au
Sat Nov 11 08:44:20 CET 2006


 	Hi Stephan and all,

 	thanks a lot. I downloaded the pre-compiled 64-bit build and 
it's working fine with 4 CPUs.
 	However, I have a problem using the "parse" wrapper directly: it 
looks for cheap in the linux.x86.32 directory and I couldn't change 
that. I worked around the problem with a Perl script that does basically 
the same thing and works fine with the new versions (thanks Francis). The 
"parse" wrapper gives me this error, which I'll try to investigate further:

/home/them/davidm/tmp/build-on/logon/bin/cheap: line 47: /home/them/davidm/tmp/build-on/logon/uio/bin/linux.x86.32/cheap: No such file or directory
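
 	For reference, here is a minimal sketch of the kind of workaround I 
have in mind: choosing the binary directory from the machine type instead 
of hard-coding linux.x86.32. The uio/bin layout and "linux.x86.32" come 
from the error message; the name "linux.x86.64" and the helper functions 
arch_dir and cheap_path are assumptions of mine, not something I have 
checked against the actual wrapper.

```shell
#!/bin/sh
# Hypothetical helper: map a machine type (as printed by `uname -m`)
# to a binary directory name. "linux.x86.32" appears in the error
# message; "linux.x86.64" is an ASSUMED name for the 64-bit counterpart.
arch_dir() {
  case "$1" in
    x86_64) echo "linux.x86.64" ;;
    *)      echo "linux.x86.32" ;;
  esac
}

# Hypothetical helper: build the path to cheap for a given root
# directory and machine type, following the layout in the error message.
cheap_path() {
  echo "$1/uio/bin/$(arch_dir "$2")/cheap"
}

# Print the path that would be used on this machine.
cheap_path "${LOGONROOT:-$PWD}" "$(uname -m)"
```

If the printed path exists, the wrapper could call that binary; if not, 
the directory name assumption is wrong for this build.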

 	Regarding Yi's question about the corpus, I'm parsing a sample of 
the BNC now. Our plan is to parse other corpora as well: Mainichi for 
Japanese, Tiger for German, and also Spanish and Norwegian corpora.
 	Thanks,
 	david

On Fri, 10 Nov 2006, Stephan Oepen wrote:

> hi david,
>
>> thanks to all for the information on the scoring model. My problem
>> now is that I can't find the model "vm6p.mem" in erg. I looked in the
>> versions Jan-2006 and Jun-2006, and I found "vm.mem", but not "vm6p".
>
>  [...]
>
>> I have been playing with the parameters "nsolutions", "results", and
>> "packing" and I can't get the output I need: a single export tree for
>> each parsed sentence.
>
> given a flurry of recent activity, i think you should get latest and
> greatest versions of everything: the ERG, LKB, and [incr tsdb()] (in
> the LinGO CVS repository); plus PET (from SVN).  the ERG that dan put
> out yesterday includes a new MEM file `jhpstg.mem', which is active by
> default.  it was trained on an up-to-date treebank of 7000 sentences
> of hiking text (the LOGON corpus) and uses three-level grandparenting;
> thanks to zhang yi (of CoLi), the latest PET can take advantage of the
> enriched feature set in this model, though only in selective unpacking
> mode.
>
> if you want everything quickly, you can try a pre-compiled LOGON build:
>
>  http://www.emmtee.net/ftp/builds/2006-11-10/
>
> there is some rudimentary documentation under LogonTop in the wiki.
>
> using that tree, you will also find the batch scripts that francis had
> suggested in a previous email.  i believe the following should about do
> the right thing for you:
>
>  LOGONROOT=`pwd` \
>    ./parse --binary --erg --best 1 --count 4 \
>      --export --ascii /tmp/rondane.txt
>
> the above is a wrapper around [incr tsdb()] and PET.  it will cause the
> input file (one sentence per line) to be imported into a profile, hence
> it may still be wise to restrict files to a few thousand items (though
> if you are actually only computing one tree, 100,000 inputs might still
> be okay).  my value to --count assumes you have that many cpus to parse
> in parallel; --best enables selective unpacking in PET (and you should
> never see more than one tree); and --export requests an [incr tsdb()]
> export of everything upon completion (into a directory named after the
> profile, right below your home directory).  obviously, you will need to
> adapt the options (and maybe parameters in the `parse' script) to your
> needs, but i would like to think the above could work out for you :-).
>
>                    please let me know how things go.  all best  -  oe
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
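
On the advice above about keeping input files to a few thousand items: a 
minimal sketch of cutting a one-sentence-per-line file into chunks with 
split(1) before handing each chunk to the wrapper. The chunk size, the 
file names, and the helper split_corpus are placeholders of mine; the 
commented parse call is copied from the quoted message, and I have not 
checked how the wrapper names the resulting profiles when it is run once 
per chunk.

```shell
#!/bin/sh
# Hypothetical helper: split a one-sentence-per-line file ($1) into
# chunks of at most $2 lines, written as <prefix>aa, <prefix>ab, ...
# (the default naming scheme of split).
split_corpus() {
  input=$1
  size=$2
  prefix=$3
  split -l "$size" "$input" "$prefix"
}

# Example usage (file names and chunk size are placeholders):
#   split_corpus bnc-sample.txt 5000 /tmp/bnc-chunk.
#   for f in /tmp/bnc-chunk.*; do
#     LOGONROOT=`pwd` ./parse --binary --erg --best 1 --count 4 \
#       --export --ascii "$f"
#   done
```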


