[developers] Question about using PET and the Pet Input Chart with tsdb++
Rebecca Dridan
bec.dridan at gmail.com
Thu Jul 17 18:54:51 CEST 2008
Hi Francisco,
I eventually gave up trying to parse PIC input from within [incr
tsdb()]. From memory, I think the problem was that cheap expected an EOF at
the end of each item, and there didn't seem to be a way of sending one.
It's possible there have been changes to PET since (I was trying this
in March 07), but I ended up running cheap outside [incr tsdb()] with the
-tsdbdump option and copying the results in place to view with [incr
tsdb()]. It's not quite as convenient, but it works.
Rebecca
Francisco Costa wrote:
> Hello all,
>
> At the moment I'm trying to parse a corpus with tsdb. For that I'd
> like to use PET as the parsing engine, and to use the Pet Input Chart
> format for input (I'm following http://wiki.delph-in.net/moin/PetInput).
>
> I'm using the following cpu definitions in my ~/.tsdbrc:
>
>
> (setf *pvm-cpus*
>       (list
>        (make-cpu
>         :host "localhost.localdomain"
>         :class '(:cheap)
>         :spawn "/home/fcosta/logon/bin/cheap"
>         :options '("-tsdb" "/home/fcosta/portuguese/portuguese.grm"))
>        (make-cpu
>         :host "localhost.localdomain"
>         :class '(:cheappreprocessed)
>         :spawn "/home/fcosta/logon/bin/cheap"
>         :options '("-tsdb" "-tok=xml_counts" "-default-les"
>                    "/home/fcosta/portuguese/portuguese.grm"))))
>
> I'm also setting tsdb::*tsdb-preprocessing-hook* to the value
> "lkb::preprocess-for-pet". This is a function returning a string of
> PIC XML. The string ends with two newlines (as pointed out on the
> PIC page of the wiki). The function calls an external executable
> that actually computes the PIC XML ("preprocess-for-pet.sh").
>
> Calling
> (lkb::preprocess-for-pet "sentence")
> from the lisp interpreter in emacs produces what I think is the
> desired effect (I get a string with PIC XML).
>
> Feeding PIC XML to cheap from a shell also works correctly. I'm using:
> preprocess-for-pet.sh "sentence" | cheap -mrs -tok=xml_counts
> -default-les portuguese.grm
> and I get MRS representations for "sentence".
>
> However, I'm having problems when I try parsing a corpus from within
> tsdb. If I use the first cpu, by doing
>
> (tsdb::tsdb :cpu :cheap :file t)
>
> in the lisp prompt, tsdb indeed calls the preprocessor, but it also
> tokenizes the PIC XML as if it were natural language (e.g. I see
> `error: no lexicon entries for: "encoding=utf"'). This is expected,
> since I'm not using the `-tok=xml_counts' option.
> But when I activate the second cpu, where that option is used, tsdb
> freezes before any sentence is parsed.
>
> Any ideas on what I'm doing wrong?
> At some point while playing with different configurations, I would get a
> SAX error message about missing files. The names of the missing files
> were the sentences that I was trying to parse, being looked for in my
> home directory. This only happened with the `-tok=xml_counts' option.
> I can no longer reproduce the configuration causing this. (When I
> created files named after the sentences to parse and containing XML
> PIC, I could parse them with tsdb. However this is not a viable
> solution because the corpus that I want to parse contains long
> sentences that would generate files with names longer than the size
> limit for filenames imposed by my OS).
>
> Any help is really welcome.
>
> Thank you in advance,
>
> Francisco Costa
>
>
>
>