[developers] batch processing with and without itsdb

Francis Bond bond at ieee.org
Thu Jul 7 11:19:34 CEST 2011


I'm trying to follow the DELPH-IN Summit recommendation of asking
questions that may be of general interest to the list.

Recently we have been trying to do some batch processing using the
parse script supplied with logon (logon/parse) and have found problems
with it randomly crashing.  We were parsing with two grammars
(--erg+tnt for English and --jacy for Japanese). I attach the error
messages after my signature (for the ERG it seemed to be a pvm issue;
for Jacy it would just stop, and we have no idea why).  Please consider
this a bug report (we are happy to give more detail on request).
It is hard to reproduce, as it happens after successfully parsing
anywhere between 8 and 86 profiles, with no apparent pattern.

It has, however, led us to a renewed interest in using cheap with the
-tsdbdump option.   Although this gives us slightly out-of-date
profiles, at least for Japanese, we can get everything we need
(basically the MRS, with characterisation), just by passing in plain
(segmented) text.  Unfortunately, we don't get characterisation for
the ERG, and we are almost certainly losing some coverage due to the
lack of pre-processing and unknown word handling.

Does anyone know
(i) if there is a way of getting characterisation working for the ERG
when it is called from cheap, with text input?
(ii) if not, will there be some way of doing this with the new REPP code?

Thanks in advance,

Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University

-batch command--------------------------------------------------
for i in {1..200}; do
   ./logon/parse --binary --jacy --limit 5 --suffix .train${i}
   ./logon/parse --binary --text --limit 5 --erg+tnt --mrs --suffix .train${i} ./logon/train/train${i}/bitext/object
done
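
Since the failures show up after an unpredictable number of profiles,
one way to narrow them down might be to record the exit status of every
run.  The following is only a sketch: run_logged and parse-status.tsv are
hypothetical names (not part of logon), and it assumes that logon/parse
returns a non-zero exit status when it crashes, which we have not
verified.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of logon): run one command and append
# "label <TAB> exit status <TAB> timestamp" to a log file, so that the
# failing runs can be picked out afterwards.
run_logged() {
  local log=$1 label=$2
  shift 2
  local status=0
  # Run the remaining arguments as the command; capture a failure
  # instead of letting it abort the surrounding loop.
  "$@" || status=$?
  printf '%s\t%s\t%s\n' "$label" "$status" "$(date +%FT%T)" >> "$log"
  return "$status"
}

# Example use, commented out because it needs a logon installation:
# for i in {1..200}; do
#   run_logged parse-status.tsv "jacy.train${i}" \
#     ./logon/parse --binary --jacy --limit 5 --suffix ".train${i}" || true
# done
# awk -F'\t' '$2 != 0' parse-status.tsv   # list the runs that failed
```

If a run hangs rather than exits, the command could be prefixed with
coreutils timeout (which exits with status 124 when the limit is
reached), so that a stalled run is logged like any other failure.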

-ERG log --------------------------------------------------

retrieve(): found 1500 items (0 output specifications).
create-cache(): write-through mode for `erg/1010/object/11-06-18/pet.train18'.
largest-run-id(): largest `run-id' is 0.
largest-parse-id(): largest `parse-id' (for `run' 1) is 0.
[t40002] libpvm [t40002] pvm_tc_conreq() bind: Address already in use
libpvm [t40001] pvm_tc_conreq() CONREQ from t40002 but state=4 ?

- Jacy log---------------------------------------------------

Building rule filter
[22:10:05] gc-after-hook(): {L#39 N=48m O=23m E=52%} [S=845m R=306m].
24293264 bytes have been tenured, next gc will be global.
See the documentation for variable EXCL:*GLOBAL-GC-BEHAVIOR* for more

Building lr connections table
Constructing lr table for non-morphological rules
Grammar input complete
TSNLP(12): 100000
