[developers] Question about using PET and the Pet Input Chart with tsdb++

Fri Jul 18 17:01:57 CEST 2008

Thank you for the tip.
I'm running cheap outside tsdb++ with the -tsdbdump option now, and I 
like the results.

Just one more question. At the moment I'm invoking cheap once for each 
sentence, which is time consuming even when automated. Is it possible to 
give a batch of sentences to cheap? What would be the sentence 
delimiter? Or can I use the -server option? What signals the end of an 
item in that case? I've tried running cheap with the -server and 
-tok=xml_counts options and netcatting PIC XML to it, but it doesn't 
seem to do anything. Does it wait for an EOF in this case as well?
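
For concreteness, a per-sentence loop of the kind described above might 
look like the sketch below. It is only an illustration under stated 
assumptions: the corpus is a plain file with one sentence per line, and 
the cheap options are the ones from the shell pipeline quoted further 
down in this thread (not -tsdbdump). The names PREPROCESS, CHEAP, 
GRAMMAR and parse_corpus are made up for the example.

```shell
#!/bin/sh
# Sketch only: run one cheap process per sentence.
# PREPROCESS, CHEAP and GRAMMAR are hypothetical variable names; their
# defaults are the commands and grammar file quoted in this thread.
PREPROCESS=${PREPROCESS:-preprocess-for-pet.sh}
CHEAP=${CHEAP:-cheap}
GRAMMAR=${GRAMMAR:-portuguese.grm}

# parse_corpus CORPUS OUTDIR
#   CORPUS: text file, one sentence per line (assumed layout)
#   OUTDIR: directory receiving N.mrs (stdout) and N.log (stderr)
# Prints the number of sentences processed.
parse_corpus() {
    corpus=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    n=0
    while IFS= read -r sentence || [ -n "$sentence" ]; do
        [ -n "$sentence" ] || continue    # skip blank lines
        n=$((n + 1))
        # The preprocessor gets /dev/null on stdin so that it cannot
        # swallow the rest of the corpus. Each cheap process receives
        # exactly one PIC document and sees EOF when the pipe closes.
        "$PREPROCESS" "$sentence" < /dev/null |
            "$CHEAP" -mrs -tok=xml_counts -default-les "$GRAMMAR" \
            > "$outdir/$n.mrs" 2> "$outdir/$n.log"
    done < "$corpus"
    echo "$n"
}
```

Calling `parse_corpus sentences.txt out` would leave out/1.mrs, 
out/2.mrs, and so on, and print the number of sentences processed. The 
grammar is still loaded once per sentence, so this only automates the 
workaround; it does not answer the batch question.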

Thanks again in advance,

Francisco

Rebecca Dridan wrote:
> Hi Francisco,
> 
> I eventually gave up trying to parse PIC input from within [incr 
> tsdb()].  From memory, I think the problem was that cheap expected an 
> EOF at the end of each item, and there didn't seem to be a way of 
> sending that. It's possible there have been changes to PET since (I 
> was trying this in March 2007), but I ended up running cheap outside 
> [incr tsdb()] with the -tsdbdump option and copying the results into 
> place to view with [incr tsdb()].  It's not quite as convenient, but 
> it works.
> 
> Rebecca
> 
> Francisco Costa wrote:
> 
>> Hello all,
>>
>> At the moment I'm trying to parse a corpus with tsdb. For that I'd 
>> like to use PET as the parsing engine, and to use the Pet Input Chart 
>> format for input (I'm following http://wiki.delph-in.net/moin/PetInput).
>>
>> I'm using the following cpu definitions in my ~/.tsdbrc:
>>
>>
>> (setf *pvm-cpus*
>>   (list
>>    (make-cpu
>>     :host "localhost.localdomain"
>>     :class '(:cheap)
>>     :spawn "/home/fcosta/logon/bin/cheap"
>>     :options '("-tsdb" "/home/fcosta/portuguese/portuguese.grm"))
>>    (make-cpu
>>     :host "localhost.localdomain"
>>     :class '(:cheappreprocessed)
>>     :spawn "/home/fcosta/logon/bin/cheap"
>>     :options '("-tsdb" "-tok=xml_counts" "-default-les"
>>                "/home/fcosta/portuguese/portuguese.grm"))))
>>
>> I'm also setting tsdb::*tsdb-preprocessing-hook* to the value 
>> "lkb::preprocess-for-pet". This is a function that returns a string 
>> of PIC XML. The string ends with two newlines (as pointed out on the 
>> PIC page of the wiki). The function calls an external executable 
>> ("preprocess-for-pet.sh") that actually computes the PIC XML.
>>
>> Calling
>> (lkb::preprocess-for-pet "sentence")
>> from the Lisp interpreter in Emacs produces what I think is the 
>> desired effect (I get a string with PIC XML).
>>
>> Feeding PIC XML to cheap from a shell also works correctly. I'm using:
>>   preprocess-for-pet.sh "sentence" |
>>     cheap -mrs -tok=xml_counts -default-les portuguese.grm
>> and I get MRS representations for "sentence".
>>
>> However, I'm having problems when I try parsing a corpus from within 
>> tsdb. If I use the first cpu, by doing
>>
>> (tsdb::tsdb :cpu :cheap :file t)
>>
>> at the Lisp prompt, tsdb indeed calls the preprocessor, but it also 
>> tokenizes the PIC XML as if it were natural language (e.g. I see 
>> `error: no lexicon entries for: "encoding=utf"'). This is expected, 
>> since I'm not using the `-tok=xml_counts' option.
>> But when I activate the second cpu, where that option is used, tsdb 
>> freezes before any sentence is parsed.
>>
>> Any ideas on what I'm doing wrong?
>> At some point, while playing with different configurations, I got a 
>> SAX error message about missing files. The names of the missing files 
>> were the sentences that I was trying to parse, being looked for in my 
>> home directory. This only happened with the `-tok=xml_counts' option. 
>> I can no longer reproduce the configuration that caused this. (When I 
>> created files named after the sentences to parse and containing PIC 
>> XML, I could parse them with tsdb. However, this is not a viable 
>> solution because the corpus that I want to parse contains long 
>> sentences that would generate files with names longer than the size 
>> limit for filenames imposed by my OS.)
>>
>> Any help is really welcome.
>>
>> Thank you in advance,
>>
>> Francisco Costa