[developers] Question about using PET and the Pet Input Chart with tsdb++
Francisco Costa
fcosta at di.fc.ul.pt
Thu Jul 17 16:53:26 CEST 2008
Hello all,
At the moment I'm trying to parse a corpus with tsdb. For that I'd like
to use PET as the parsing engine, and to use the Pet Input Chart format
for input (I'm following http://wiki.delph-in.net/moin/PetInput).
I'm using the following cpu definitions in my ~/.tsdbrc:
(setf *pvm-cpus*
  (list
    (make-cpu
      :host "localhost.localdomain"
      :class '(:cheap)
      :spawn "/home/fcosta/logon/bin/cheap"
      :options '("-tsdb" "/home/fcosta/portuguese/portuguese.grm"))
    (make-cpu
      :host "localhost.localdomain"
      :class '(:cheappreprocessed)
      :spawn "/home/fcosta/logon/bin/cheap"
      :options '("-tsdb" "-tok=xml_counts" "-default-les"
                 "/home/fcosta/portuguese/portuguese.grm"))))
I'm also setting tsdb::*tsdb-preprocessing-hook* to
"lkb::preprocess-for-pet". This is a function that returns a string of
PIC XML ending with two newlines (as pointed out on the PIC page of the
wiki). The function calls an external executable,
"preprocess-for-pet.sh", which actually computes the PIC XML.
Calling
(lkb::preprocess-for-pet "sentence")
from the Lisp interpreter in Emacs produces what I think is the desired
effect (I get a string with PIC XML).
Feeding PIC XML to cheap from a shell also works correctly. I'm using:
preprocess-for-pet.sh "sentence" | cheap -mrs -tok=xml_counts -default-les portuguese.grm
and I get MRS representations for "sentence".
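The two-newline requirement mentioned above can also be checked from the shell. The following is only a sketch: `ends_with_two_newlines` is a hypothetical helper name, not part of PET or tsdb, and it assumes a POSIX shell with `tail`, `od` and `tr` available.

```shell
# Succeeds (exit status 0) if the data on stdin ends with two newline bytes.
ends_with_two_newlines() {
    # Take the last two bytes, print them as hex, and strip all whitespace.
    [ "$(tail -c 2 | od -An -tx1 | tr -d ' \n')" = "0a0a" ]
}

# Possible usage with the preprocessor script from this message:
#   preprocess-for-pet.sh "sentence" | ends_with_two_newlines && echo "ok"
```

This only inspects the final two bytes of the output, so it says nothing about whether the PIC XML itself is well formed.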
However, I'm having problems when I try parsing a corpus from within
tsdb. If I use the first cpu, by doing
(tsdb::tsdb :cpu :cheap :file t)
at the Lisp prompt, tsdb indeed calls the preprocessor, but cheap then
tokenizes the PIC XML as if it were natural language (e.g. I see `error:
no lexicon entries for: "encoding=utf"'). This is expected, since that
cpu does not use the `-tok=xml_counts' option.
But when I activate the second cpu, where that option is used, tsdb
freezes before any sentence is parsed.
Any ideas on what I'm doing wrong?
At some point, while playing with different configurations, I would get a
SAX error message about missing files. The names of the missing files
were the sentences that I was trying to parse, and they were being looked
for in my home directory. This only happened with the `-tok=xml_counts'
option. I can no longer reproduce the configuration causing this. (When I
created files named after the sentences to parse and containing PIC XML,
I could parse them with tsdb. However, this is not a viable solution,
because the corpus that I want to parse contains long sentences that
would require file names longer than the limit imposed by my OS.)
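To illustrate the file name limit, here is a rough sketch of how one could test whether a given sentence would fit as a file name. `fits_as_filename` is a hypothetical helper name; the sketch assumes a POSIX shell where `getconf NAME_MAX` is available for the directory in question.

```shell
# Succeeds if the string in $1 is short enough to be a file name in the
# directory given as $2 (defaults to the current directory).
fits_as_filename() {
    # Maximum file name length, in bytes, for that directory's file system.
    limit=$(getconf NAME_MAX "${2:-.}")
    # Count bytes rather than characters, since the limit applies to the
    # encoded name; the arithmetic expansion strips any padding from wc.
    length=$(($(printf '%s' "$1" | wc -c)))
    [ "$length" -le "$limit" ]
}

# Possible usage:
#   fits_as_filename "some long sentence from the corpus" || echo "too long"
```

A sentence containing a `/' character could not be used as a file name either, which this check does not cover.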
Any help is really welcome.
Thank you in advance,
Francisco Costa