[developers] Question about using PET and the Pet Input Chart with tsdb++

Thu Jul 17 18:54:51 CEST 2008

Hi Francisco,

I eventually gave up trying to parse PIC input from within [incr 
tsdb()].  From memory, I think the problem was that cheap expected an 
EOF at the end of each item, and there didn't seem to be a way of 
sending one from [incr tsdb()].  It's possible there have been changes 
to PET since (I was trying this in March 07), but I ended up running 
cheap outside [incr tsdb()] with the -tsdbdump option and copying the 
results into place to view with [incr tsdb()].  It's not quite as 
convenient, but it works.
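
A minimal sketch of that workaround for a single item, reusing the 
command line from your message below.  It assumes that -tsdbdump takes 
the target directory as its argument (worth checking against your PET 
version); parse_item and the CHEAP, PREPROCESS and GRAMMAR variables are 
just illustrative names, not part of PET:

```shell
#!/bin/sh
# Run cheap outside [incr tsdb()] and dump the results as profile files.
# Assumption: "-tsdbdump DIR" writes the [incr tsdb()] files into DIR.

CHEAP=${CHEAP:-cheap}
PREPROCESS=${PREPROCESS:-preprocess-for-pet.sh}
GRAMMAR=${GRAMMAR:-portuguese.grm}

# parse_item SENTENCE DUMPDIR
# Converts one sentence to PIC XML and pipes it to cheap.  When the
# preprocessor exits, the pipe closes, which gives cheap the EOF it
# waits for at the end of the item.
parse_item() {
    mkdir -p "$2" || return 1
    "$PREPROCESS" "$1" |
        "$CHEAP" -tsdbdump "$2" -tok=xml_counts -default-les "$GRAMMAR"
}
```

Calling parse_item "sentence" /tmp/dump would then leave the dumped 
files in /tmp/dump, ready to be copied into a profile directory.  How 
best to feed a whole corpus through one cheap process is left open 
here.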

Rebecca

Francisco Costa wrote:
> Hello all,
>
> At the moment I'm trying to parse a corpus with tsdb. For that I'd 
> like to use PET as the parsing engine, and to use the Pet Input Chart 
> format for input (I'm following http://wiki.delph-in.net/moin/PetInput).
>
> I'm using the following cpu definitions in my ~/.tsdbrc:
>
>
> (setf *pvm-cpus*
>   (list
>    (make-cpu
>     :host "localhost.localdomain"
>     :class '(:cheap)
>     :spawn "/home/fcosta/logon/bin/cheap"
>     :options '("-tsdb" "/home/fcosta/portuguese/portuguese.grm")
>    )
>    (make-cpu
>     :host "localhost.localdomain"
>     :class '(:cheappreprocessed)
>     :spawn "/home/fcosta/logon/bin/cheap"
>     :options '("-tsdb" "-tok=xml_counts" "-default-les"
>                "/home/fcosta/portuguese/portuguese.grm")
>    )
>   )
> )
>
> I'm also setting tsdb::*tsdb-preprocessing-hook* to the value 
> "lkb::preprocess-for-pet". This is a function returning a string with 
> PIC XML. The string ends with two newlines (as pointed out on the 
> PIC page of the wiki). The function calls an external executable 
> that actually computes the PIC XML ("preprocess-for-pet.sh").
>
> Calling
> (lkb::preprocess-for-pet "sentence")
> from the lisp interpreter in emacs produces what I think is the 
> desired effect (I get a string with PIC XML).
>
> Feeding PIC XML to cheap from a shell also works correctly. I'm using:
> preprocess-for-pet.sh "sentence" | cheap -mrs -tok=xml_counts \
>     -default-les portuguese.grm
> and I get MRS representations for "sentence".
>
> However, I'm having problems when I try parsing a corpus from within 
> tsdb. If I use the first cpu, by doing
>
> (tsdb::tsdb :cpu :cheap :file t)
>
> in the lisp prompt, tsdb indeed calls the preprocessor, but it also 
> tokenizes the PIC XML as if it were natural language (e.g. I see 
> `error: no lexicon entries for: "encoding=utf"'). This is expected, 
> since I'm not using the `-tok=xml_counts' option.
> But when I activate the second cpu, where that option is used, tsdb 
> freezes before any sentence is parsed.
>
> Any ideas on what I'm doing wrong?
> At some point while playing with different configurations, I got a 
> SAX error message about missing files. The names of the missing files 
> were the sentences that I was trying to parse, being looked for in my 
> home directory. This only happened with the `-tok=xml_counts' option. 
> I can no longer reproduce the configuration causing this. (When I 
> created files named after the sentences to parse and containing XML 
> PIC, I could parse them with tsdb. However this is not a viable 
> solution because the corpus that I want to parse contains long 
> sentences that would generate files with names longer than the size 
> limit for filenames imposed by my OS).
>
> Any help is really welcome.
>
> Thank you in advance,
>
> Francisco Costa