[developers] batch processing with and without itsdb
rdrid at dridan.com
Thu Jul 7 12:23:50 CEST 2011
On 07/07/2011 07:19 PM, Francis Bond wrote:
> I'm trying to follow the DELPH-IN Summit recommendation of asking
> questions that may be of general interest to the list.
> Recently we have been trying to do some batch processing using the
> parse script supplied with logon (logon/parse) and have found problems
> with it randomly crashing. We were parsing with two grammars
> (--erg+tnt for English and --jacy for Japanese). I attach the error
> messages after my signature (for the ERG it seemed to be a pvm issue,
> for jacy it would just stop, we have no idea why). Please consider
> this a bug report (and we are happy to give more detail on request).
> It is hard to reproduce, as it happens after successfully parsing
> anywhere between 8 and 86 profiles, with no apparent pattern.
Were you running more than one instance of the parse script at the same
time? (Say one with ERG and one with Jacy?) I've seen the same messages
before when running parallel instances caused, I believe, by the
(tsdb:tsdb :cpu :kill) command which kills all pvms by the same user,
not just those coming from the calling script. The workaround suggested
to me was to run separate instances on different machines, or under
different user accounts.
> It has, however, led us to a renewed interest in using cheap with the
> -tsdbdump option. Although this gives us slightly out-of-date
> profiles, at least for Japanese, we can get everything we need
> (basically the MRS, with characterisation), just by passing in plain
> (segmented) text. Unfortunately, we don't get characterisation for
> the ERG, and almost certainly are losing some coverage due to the lack
> pre-processing and unknown word handling.
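The batch setup described above can be sketched as a small driver that feeds pre-segmented text to cheap per profile. This is a hypothetical sketch: the `-tsdbdump` spelling is taken from the message itself, but the grammar image name and the exact way your PET build expects its arguments are assumptions to check against your installation.

```python
import subprocess


def cheap_command(grammar, dump_dir):
    # -tsdbdump spelling as quoted in the message above; the grammar
    # image filename is an assumption for your own installation.
    return ["cheap", "-tsdbdump=" + dump_dir, grammar]


def run_cheap(grammar, sentences, dump_dir):
    # Feed pre-segmented sentences, one per line, on stdin, and let
    # cheap write the [incr tsdb()]-style profile into dump_dir.
    proc = subprocess.run(cheap_command(grammar, dump_dir),
                          input="\n".join(sentences) + "\n",
                          text=True)
    return proc.returncode
```

For example, `run_cheap("jacy.grm", segmented_lines, "out/jacy-batch")` would cover the Japanese side of the batch run, with one invocation per profile.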
> Does anyone know
> (i) if there is a way of getting characterisation working for the ERG
> called for from cheap, with text input?
> (ii) if not, will there be some way of doing this with the new REPP code?
To the best of my knowledge, parsing with the ERG now requires REPP
tokenisation. Any method of getting the correct tokenisation into cheap
should also be able to give you the characterisation. My standalone REPP
tokenisation preprocessor certainly can (in YY or FSC format), and the
PET integration is just awaiting review before we push it to the main PET
branch. If you want to beta test, let me know and I can give you access
to PET source that has REPP (and TnT tagging) in-built.
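To make the characterisation point concrete: in YY input mode each token carries its character span as a `<from:to>` field, which is exactly what cheap needs to emit characterised MRS. The sketch below builds such token lines from whitespace-segmented text; the field layout (id, start vertex, end vertex, `<from:to>`, path, surface form, ipos, lrule) is my reading of PET's YY input documentation, so double-check it against the PetInput description for your PET version.

```python
def yy_tokens(sentence):
    """Turn a whitespace-segmented sentence into YY token lines.

    Assumed field layout per token:
      (id, start, end, <from:to>, path, "surface", ipos, lrule)
    The <from:to> span is the characterisation carried through to
    the MRS. Verify the layout against your PET's YY input docs.
    """
    tokens = []
    pos = 0
    for i, form in enumerate(sentence.split()):
        start = sentence.index(form, pos)   # character offset of token
        end = start + len(form)
        pos = end
        tokens.append('(%d, %d, %d, <%d:%d>, 1, "%s", 0, "null")'
                      % (i + 1, i, i + 1, start, end, form))
    return tokens


for line in yy_tokens("The dog barks."):
    print(line)
```

A real REPP-based preprocessor would additionally normalise punctuation and could attach TnT POS tags to each token, but the character spans above are the part that restores characterisation.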