[itsdb] parse big corpus with itsdb
David Martinez
davidm at csse.unimelb.edu.au
Sat Nov 11 08:44:20 CET 2006
Hi Stephan and all,
Thanks a lot. I downloaded the pre-compiled 64-bit version
and it's working fine with 4 CPUs.
However, I have a problem using the "parse" wrapper directly: it
looks for cheap in the linux.x86.32 directory, and I couldn't change
that. I worked around the problem with a Perl script that does basically
the same thing and works fine with the new versions (thanks, Francis). The
"parse" wrapper gives me the following error, which I'll try to investigate:
/home/them/davidm/tmp/build-on/logon/bin/cheap: line 47:
/home/them/davidm/tmp/build-on/logon/uio/bin/linux.x86.32/cheap: No such file or directory
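The error above suggests that cheap is looked up under a fixed
linux.x86.32 directory. A minimal sketch of choosing the directory from
the machine type instead is given below. The uio/bin/<platform>/cheap
layout is taken from the error message; the name "linux.x86.64" for the
64-bit directory and the function names platform_dir and cheap_path are
assumptions for illustration, not part of the actual wrapper.

```shell
#!/bin/sh
# Map a machine name (as printed by `uname -m`) to a binary directory name.
# "linux.x86.32" appears in the error message; "linux.x86.64" is an ASSUMED
# name for the 64-bit directory and may differ in a real build.
platform_dir() {
  case "$1" in
    x86_64|amd64) echo "linux.x86.64" ;;
    i?86)         echo "linux.x86.32" ;;
    *)            echo "unknown"; return 1 ;;
  esac
}

# Build the path to cheap below a given root directory.
# The second argument is optional and defaults to the local machine type.
cheap_path() {
  root=$1
  machine=${2:-$(uname -m)}
  dir=$(platform_dir "$machine") || return 1
  echo "$root/uio/bin/$dir/cheap"
}
```

Calling cheap_path "$LOGONROOT" would then print the candidate path, which
can be checked with test -x before it is used.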
Regarding Yi's question about the corpus, I'm parsing a sample of
the BNC now. Our plan is to parse other corpora as well: Mainichi for
Japanese, Tiger for German, and also Spanish and Norwegian corpora.
Thanks,
david
On Fri, 10 Nov 2006, Stephan Oepen wrote:
> hi david,
>
>> thanks to all for the information on the scoring model. My problem
>> now is that I can't find the model "vm6p.mem" in erg. I looked in the
>> versions Jan-2006 and Jun-2006, and I found "vm.mem", but not "vm6p".
>
> [...]
>
>> I have been playing with the parameters "nsolutions", "results", and
>> "packing" and I can't get the output I need: a single export tree for
>> each parsed sentence.
>
> given a flurry of recent activity, i think you should get the latest and
> greatest versions of everything: the ERG, LKB, and [incr tsdb()] (in
> the LinGO CVS repository); plus PET (from SVN). the ERG that dan put
> out yesterday includes a new MEM file `jhpstg.mem', which is active by
> default. it was trained on an up-to-date treebank of 7000 sentences
> of hiking text (the LOGON corpus) and uses three-level grandparenting;
> thanks to zhang yi (of CoLi), the latest PET can take advantage of the
> enriched feature set in this model, though only in selective unpacking
> mode.
>
> if you want everything quickly, you can try a pre-compiled LOGON build:
>
> http://www.emmtee.net/ftp/builds/2006-11-10/
>
> there is some rudimentary documentation under LogonTop in the wiki.
>
> using that tree, you will also find the batch scripts that francis had
> suggested in a previous email. i believe the following should about do
> the right thing for you:
>
> LOGONROOT=`pwd` \
> ./parse --binary --erg --best 1 --count 4 \
> --export --ascii /tmp/rondane.txt
>
> the above is a wrapper around [incr tsdb()] and PET. it will cause the
> input file (one sentence per line) to be imported into a profile, hence
> it may still be wise to restrict files to a few thousand items (though
> if you are actually only computing one tree, 100,000 inputs might still
> be okay). my value to --count assumes you have that many cpus to parse
> in parallel; --best enables selective unpacking in PET (and you should
> never see more than one tree); and --export requests an [incr tsdb()]
> export of everything upon completion (into a directory named after the
> profile, right below your home directory). obviously, you will need to
> adapt the options (and maybe parameters in the `parse' script) to your
> needs, but i would like to think the above could work out for you :-).
>
> please let me know how things go. all best - oe
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++ --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
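The advice quoted above, to restrict input files to a few thousand items,
could be scripted roughly as follows. This is only a sketch: the options
passed to the wrapper are copied from the quoted command, while the
function names, the chunk.* file naming, and the PARSE variable are
hypothetical. How the wrapper names the profile for each chunk has not
been checked here.

```shell
#!/bin/sh
# Split a one-sentence-per-line file ($1) into chunks of at most $2 lines.
# Chunks are written as <outdir>/chunk.aa, <outdir>/chunk.ab, ...
split_corpus() {
  input=$1
  lines=$2
  outdir=$3
  mkdir -p "$outdir" && split -l "$lines" "$input" "$outdir/chunk."
}

# Run the wrapper on every chunk in turn and stop at the first failure.
# PARSE is a hypothetical variable naming the wrapper (default: ./parse);
# the options are the ones from the quoted command.
parse_chunks() {
  outdir=$1
  for chunk in "$outdir"/chunk.*; do
    [ -f "$chunk" ] || continue
    "${PARSE:-./parse}" --binary --erg --best 1 --count 4 \
      --export --ascii "$chunk" || return 1
  done
}
```

For example, split_corpus bnc-sample.txt 5000 /tmp/chunks followed by
parse_chunks /tmp/chunks, run from the LOGON root with LOGONROOT set as in
the quoted command (bnc-sample.txt is a placeholder file name).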