[developers] Question about using PET and the Pet Input Chart with tsdb++

Thu Jul 17 16:53:26 CEST 2008

Hello all,

At the moment I'm trying to parse a corpus with tsdb. For that I'd like 
to use PET as the parsing engine, and to use the Pet Input Chart format 
for input (I'm following http://wiki.delph-in.net/moin/PetInput).

I'm using the following cpu definitions in my ~/.tsdbrc:

(setf *pvm-cpus*
  (list
   (make-cpu
    :host "localhost.localdomain"
    :class '(:cheap)
    :spawn "/home/fcosta/logon/bin/cheap"
    :options '("-tsdb" "/home/fcosta/portuguese/portuguese.grm"))
   (make-cpu
    :host "localhost.localdomain"
    :class '(:cheappreprocessed)
    :spawn "/home/fcosta/logon/bin/cheap"
    :options '("-tsdb" "-tok=xml_counts" "-default-les"
               "/home/fcosta/portuguese/portuguese.grm"))))

I'm also setting tsdb::*tsdb-preprocessing-hook* to
"lkb::preprocess-for-pet". This function returns a string of PIC XML
that ends with two newlines (as pointed out on the PIC page of the
wiki). The function calls an external executable,
"preprocess-for-pet.sh", which actually computes the PIC XML.
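
A minimal sketch of how the trailing blank line can be checked from a 
shell. The helper name ends_with_blank_line is hypothetical (it is not 
part of PET or tsdb), and the printf line only stands in for the real 
preprocessor output:

```shell
# Hypothetical helper: succeeds only if stdin ends with two newline characters.
ends_with_blank_line() {
    # Dump the last two bytes as hex; two newlines give "0a0a"
    # once od's spacing and line break are stripped.
    [ "$(tail -c 2 | od -An -tx1 | tr -d ' \n')" = "0a0a" ]
}

# Stand-in input (not real PIC XML): one line followed by an empty line.
printf '<placeholder/>\n\n' | ends_with_blank_line && echo "ok: ends with two newlines"
```

With the real script, the same check would presumably be 
preprocess-for-pet.sh "sentence" | ends_with_blank_line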

Calling
(lkb::preprocess-for-pet "sentence")
from the Lisp interpreter in Emacs produces what I think is the desired 
effect (I get a string with PIC XML).

Feeding PIC XML to cheap from a shell also works correctly. I'm using:
preprocess-for-pet.sh "sentence" | \
    cheap -mrs -tok=xml_counts -default-les portuguese.grm
and I get MRS representations for "sentence".

However, I'm having problems when I try to parse a corpus from within 
tsdb. If I use the first cpu, by evaluating

(tsdb::tsdb :cpu :cheap :file t)

at the Lisp prompt, tsdb does call the preprocessor, but cheap then 
tokenizes the PIC XML as if it were natural language (e.g. I see 
`error: no lexicon entries for: "encoding=utf"'). This is expected, 
since that cpu does not use the `-tok=xml_counts' option.
But when I activate the second cpu, where that option is used, tsdb 
freezes before any sentence is parsed.

Any ideas on what I'm doing wrong?

At some point, while playing with different configurations, I would get 
a SAX error message about missing files. The names of the missing files 
were the sentences that I was trying to parse, and they were being 
looked for in my home directory. This only happened with the 
`-tok=xml_counts' option. I can no longer reproduce the configuration 
that caused this. (When I created files named after the sentences to 
parse and containing PIC XML, I could parse them with tsdb. However, 
this is not a viable solution, because the corpus that I want to parse 
contains long sentences that would generate files with names longer 
than the filename length limit imposed by my OS.)

Any help is really welcome.

Thank you in advance,

Francisco Costa