[itsdb] parse big corpus with itsdb

Mon Oct 23 08:47:35 CEST 2006

G'day,

>         We have recently started using the itsdb interface to process
> corpora with different grammars in different languages. We had no
> problems processing small files, but now we want to parse a corpus of
> 5M sentences (10k examples per file), and we have not found a way to
> select all the target files, process all items, and extract trees in
> batch mode using the tsdb interface.
>         We have been looking at ways to drive the command-line
> interface with tsdb-do-process, but my Lisp is almost non-existent, and I
> don't know which parameters to use in the function calls.
>         Could you give me pointers on how to do this? I would like to
> create a function that parses and exports trees for each of the files in
> turn. Any help will be appreciated.

I put some notes up at http://wiki.delph-in.net/moin/ItsdbBatch.  I
think there should now be enough information there to do what you
want.  If you write an elegant script to parse and export (or to call
the existing scripts), please put it up on the wiki (^_^).
The tricky part of batch parsing, I found, was actually creating the
profiles from the skeletons: you have to use the name and directory
that itsdb expects, and these weren't always what I was expecting.
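
For the "each file in turn" part, a driver loop could look roughly like
the Python sketch below.  It is only an outline and does not call itsdb
itself: plan_batch, run_batch, the "*.txt" pattern and the "batch/"
profile prefix are all made-up names for illustration, and the process
callback is where you would plug in whatever actually creates the
profile, parses, and exports (for example, a call to the existing
scripts).  Adjust the profile naming to whatever itsdb expects for your
skeletons.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def plan_batch(corpus_dir: str, pattern: str = "*.txt",
               prefix: str = "batch") -> List[Tuple[Path, str]]:
    """Pair each corpus file with a profile name derived from its stem.

    The "<prefix>/<stem>" naming is an assumption, not an itsdb rule.
    """
    files = sorted(Path(corpus_dir).glob(pattern))
    return [(f, f"{prefix}/{f.stem}") for f in files]


def run_batch(plan: Iterable[Tuple[Path, str]],
              process: Callable[[Path, str], None]) -> List[str]:
    """Call process(file, profile) for every pair, in order.

    A failure on one file is reported and recorded, and the loop moves
    on, so one bad file does not stop a run over hundreds of files.
    Returns the profile names that failed.
    """
    failed = []
    for path, profile in plan:
        try:
            process(path, profile)
        except Exception as err:
            print(f"{profile}: {err}")
            failed.append(profile)
    return failed
```

Keeping the plan separate from the run makes it easy to print the
file/profile pairs first and check the names before committing to a
long parse.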

Good luck,

-- 
Francis Bond  <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Natural Language Research Group