[itsdb] parse big corpus with itsdb

Fri Nov 10 16:19:51 CET 2006

hi david,

> thanks to all for the information on the scoring model. My problem
> now is that I can't find the model "vm6p.mem" in erg. I looked in the
> versions Jan-2006 and Jun-2006, and I found "vm.mem", but not "vm6p".

  [...]

> I have been playing with the parameters "nsolutions", "results", and
> "packing" and I can't get the output I need: a single export tree for
> each parsed sentence.

given a flurry of recent activity, i think you should get the latest and
greatest versions of everything: the ERG, LKB, and [incr tsdb()] (in
the LinGO CVS repository); plus PET (from SVN).  the ERG that dan put
out yesterday includes a new MEM file `jhpstg.mem', which is active by
default.  it was trained on an up-to-date treebank of 7000 sentences
of hiking text (the LOGON corpus) and uses three-level grandparenting;
thanks to zhang yi (of CoLi), the latest PET can take advantage of the
enriched feature set in this model, though only in selective unpacking
mode.

if you want everything quickly, you can try a pre-compiled LOGON build:

  http://www.emmtee.net/ftp/builds/2006-11-10/

there is some rudimentary documentation under LogonTop in the wiki.

using that tree, you will also find the batch scripts that francis had
suggested in a previous email.  i believe the following should about do
the right thing for you:

  LOGONROOT=`pwd` \
    ./parse --binary --erg --best 1 --count 4 \
      --export --ascii /tmp/rondane.txt

the above is a wrapper around [incr tsdb()] and PET.  it will cause the
input file (one sentence per line) to be imported into a profile, hence
it may still be wise to restrict files to a few thousand items (though
if you are actually only computing one tree, 100,000 inputs might still
be okay).  my value to --count assumes you have that many cpus to parse
in parallel; --best enables selective unpacking in PET (and you should
never see more than one tree); and --export requests an [incr tsdb()]
export of everything upon completion (into a directory named after the
profile, right below your home directory).  obviously, you will need to
adapt the options (and maybe parameters in the `parse' script) to your
needs, but i would like to think the above could work out for you :-).
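
to keep each input file down to a few thousand items, something like the
following sketch might do.  note that `chunk_input' is a hypothetical
helper (not part of the LOGON tree), and the chunk size of 5000 and the
/tmp/chunks directory are arbitrary assumptions; it relies only on the
standard `split' utility.

```shell
# hypothetical helper, not part of the LOGON tree: split a file with one
# sentence per line into chunks of at most SIZE lines each.
# usage: chunk_input INPUT [SIZE] [OUTDIR]
chunk_input() {
  input=$1
  size=${2:-5000}          # assumed chunk size; adjust to taste
  outdir=${3:-/tmp/chunks} # assumed scratch directory
  mkdir -p "$outdir"
  # split names the pieces part.aa, part.ab, ... inside OUTDIR
  split -l "$size" "$input" "$outdir/part."
  # print the resulting chunk files, one per line
  ls "$outdir"/part.*
}

# each chunk could then be handed to the wrapper in turn, for example:
#
#   for f in `chunk_input /tmp/rondane.txt 5000`; do
#     LOGONROOT=`pwd` ./parse --binary --erg --best 1 --count 4 \
#       --export --ascii "$f"
#   done
```

i have not checked how the wrapper names the profile for each chunk, so
it is worth confirming on two small files that one export does not
overwrite another before launching a large run.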

                    please let me know how things go.  all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++