[itsdb] parse big corpus with itsdb

David Martinez davidm at csse.unimelb.edu.au
Wed Nov 8 00:49:38 CET 2006


 	Hello all,

 	Thank you very much, Francis; your answer was really helpful for 
our task.
 	To parse the corpus, I looked at the format of the profiles and 
created them directly from my corpus with a Perl script.
 	Then I listed all the commands that I needed: set the parameters, 
load lkb, load tsdb, load erg (for export), load the pet CPUs, process the 
profiles, and export the trees. I make a single system call with this list 
of commands for every 10,000 sentences, and it works fine.
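
The batching step described above can be sketched roughly as follows. This is only an illustration in Python (the original used a Perl script that is not shown here). The function names, the output file names, and the one-sentence-per-line input format are all assumptions. The sketch only splits the corpus into groups of 10,000 sentences; it does not create the profiles or run any of the lkb/tsdb commands.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def batch_sentences(sentences: Iterable[str], size: int = 10_000) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` sentences."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(sentences)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def write_batches(corpus_path: str, out_dir: str, size: int = 10_000) -> List[Path]:
    """Split a one-sentence-per-line corpus into numbered batch files.

    The file naming scheme (batch00001.txt, ...) is hypothetical.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    with open(corpus_path, encoding="utf-8") as corpus:
        # Skip empty lines and strip the trailing newline from each sentence.
        lines = (line.rstrip("\n") for line in corpus if line.strip())
        for n, batch in enumerate(batch_sentences(lines, size), start=1):
            path = out / f"batch{n:05d}.txt"
            path.write_text("\n".join(batch) + "\n", encoding="utf-8")
            written.append(path)
    return written
```

Each batch file would then be converted into one profile and handled by one system call, as in the workflow above.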
 	However, I have another question that maybe someone can help me 
with. I would like to keep only one analysis per sentence, so I limited the 
number of answers to 1. But I don't know whether this is the way to get the 
best possible analysis. Is there some other switch that I could use?
 	Thanks in advance,

 	David

On Mon, 23 Oct 2006, David Martinez wrote:

>
> 	Dear list members,
>
> 	We have recently started to use the itsdb interface to process 
> corpora with different grammars in different languages. We had no 
> problems processing small files, but now we want to parse a corpus of 5M 
> sentences (10k examples per file), and we have not found a way to select all 
> the target files, process all items, and extract trees in batch mode using 
> the tsdb interface.
> 	We have been looking at ways to interact with the command-line 
> interface with tsdb-do-process, but my Lisp is almost non-existent, and I 
> don't know which parameters to use in the function calls.
> 	Could you give me pointers on how to do this? I would like to create 
> a function that parses and exports trees for each of the files in turn. Any 
> help will be appreciated.
>
> 	Best,
> 	David
>


