[developers] batch processing with and without itsdb
rdrid at dridan.com
Thu Jul 7 12:23:50 CEST 2011
On 07/07/2011 07:19 PM, Francis Bond wrote:
> I'm trying to follow the DELPH-IN Summit recommendation of asking
> questions that may be of general interest to the list.
> Recently we have been trying to do some batch processing using the
> parse script supplied with logon (logon/parse) and have found problems
> with it randomly crashing. We were parsing with two grammars
> (--erg+tnt for English and --jacy for Japanese). I attach the error
> messages after my signature (for the ERG it seemed to be a pvm issue,
> for jacy it would just stop, we have no idea why). Please consider
> this a bug report (and we are happy to give more detail on request).
> It is hard to reproduce, as it happens after successfully parsing
> anywhere between 8 and 86 profiles, with no apparent pattern.
Were you running more than one instance of the parse script at the same
time? (Say one with ERG and one with Jacy?) I've seen the same messages
before when running parallel instances caused, I believe, by the
(tsdb:tsdb :cpu :kill) command which kills all pvms by the same user,
not just those coming from the calling script. The workaround suggested
to me was to run separate instances on different machines, or under
different user accounts.
> It has, however, led us to a renewed interest in using cheap with the
> -tsdbdump option. Although this gives us slightly out-of-date
> profiles, at least for Japanese, we can get everything we need
> (basically the MRS, with characterisation), just by passing in plain
> (segmented) text. Unfortunately, we don't get characterisation for
> the ERG, and almost certainly are losing some coverage due to the lack
> pre-processing and unknown word handling.
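The batch setup described above can be sketched as a small driver that feeds pre-segmented text to cheap per profile. This is a hypothetical sketch: the `-tsdbdump` spelling is taken from the message itself, but the grammar image name and the exact way your PET build expects its arguments are assumptions to check against your installation.

```python
import subprocess


def cheap_command(grammar, dump_dir):
    # -tsdbdump spelling as quoted in the message above; the grammar
    # image filename is an assumption for your own installation.
    return ["cheap", "-tsdbdump=" + dump_dir, grammar]


def run_cheap(grammar, sentences, dump_dir):
    # Feed pre-segmented sentences, one per line, on stdin, and let
    # cheap write the [incr tsdb()]-style profile into dump_dir.
    proc = subprocess.run(cheap_command(grammar, dump_dir),
                          input="\n".join(sentences) + "\n",
                          text=True)
    return proc.returncode
```

For example, `run_cheap("jacy.grm", segmented_lines, "out/jacy-batch")` would cover the Japanese side of the batch run, with one invocation per profile.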
> Does anyone know
> (i) if there is a way of getting characterisation working for the ERG
> called for from cheap, with text input?
> (ii) if not, will there be some way of doing this with the new REPP code?
To the best of my knowledge, parsing with the ERG now requires REPP
tokenisation. Any method of getting the correct tokenisation into cheap
should also be able to give you the characterisation. My standalone REPP
tokenisation preprocessor certainly can (in YY or FSC format), and the
PET integration is just awaiting review before we push it to the main PET
branch. If you want to beta test, let me know and I can give you access
to PET source that has REPP (and TnT tagging) in-built.
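To make the characterisation point concrete: in YY input mode each token carries its character span as a `<from:to>` field, which is exactly what cheap needs to emit characterised MRS. The sketch below builds such token lines from whitespace-segmented text; the field layout (id, start vertex, end vertex, `<from:to>`, path, surface form, ipos, lrule) is my reading of PET's YY input documentation, so double-check it against the PetInput description for your PET version.

```python
def yy_tokens(sentence):
    """Turn a whitespace-segmented sentence into YY token lines.

    Assumed field layout per token:
      (id, start, end, <from:to>, path, "surface", ipos, lrule)
    The <from:to> span is the characterisation carried through to
    the MRS. Verify the layout against your PET's YY input docs.
    """
    tokens = []
    pos = 0
    for i, form in enumerate(sentence.split()):
        start = sentence.index(form, pos)   # character offset of token
        end = start + len(form)
        pos = end
        tokens.append('(%d, %d, %d, <%d:%d>, 1, "%s", 0, "null")'
                      % (i + 1, i, i + 1, start, end, form))
    return tokens


for line in yy_tokens("The dog barks."):
    print(line)
```

A real REPP-based preprocessor would additionally normalise punctuation and could attach TnT POS tags to each token, but the character spans above are the part that restores characterisation.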