[developers] What errors can cause a grid parse reranking process to return an empty scores file?
Stephan Oepen
oe at ifi.uio.no
Fri Feb 27 17:57:13 CET 2009
hi again, bill,
> I am running many different grid* files in parallel using the Condor
> distributed computing system. On my latest round of jobs, all of my
> Condor jobs completed, but when I look at the directories created
> under ~/logon/lingo/redwoods/tsdb/home, I find that most of them have
> empty score files. I take the empty score files to be a sign that
> something didn't work.
yes, i would say they indicate a failed experiment.  there are quite a
few ways in which individual experiments can fail without the complete
job failing as a whole.  Lisp may run out of memory at some point but
`recover' from that and carry on; for example, when reading the profile
data to start an experiment, a fresh Lisp process will almost certainly
need to grow substantially.  with limited RAM and swap space, that growth
may fail, and the resulting `out of memory' error may be caught by the
caller.  i can say for sure that [incr tsdb()] frequently catches errors,
but i am less confident (off the top of my head) about how they are
handled in the context of feature caching and ME grid searches.  our
`approach' has typically been a lazy one, namely to avoid errors of this
kind in the first place, hence it is quite likely that they are not
handled in a very meaningful way: experiments might end up being skipped,
or even executed with incomplete data ...
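for a quick overview of which experiments came out empty, a check along
these lines should do (i am assuming the per-profile score files are
plain files called `score' under the tsdb home; adjust names and paths
to your set-up):

  # profiles whose score file exists but is empty
  find ~/logon/lingo/redwoods/tsdb/home -name score -size 0 -print
  # if your set-up compresses score files, an empty member still carries
  # a gzip header, so test the uncompressed line count instead
  for f in ~/logon/lingo/redwoods/tsdb/home/*/score.gz; do
    [ -f "$f" ] || continue
    [ "$(zcat "$f" | wc -l)" -eq 0 ] && echo "$f"
  done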
in a similar spirit, the parameter searches invoke `tadm' and `evaluate'
many times, and either one could crash (for instance on insufficient
memory, or insufficient disk space in `/tmp'); again, i cannot really say
how that would be handled.  i am afraid my best recommendation is to (a)
inspect the log files created by the `load' script and (b) try to create
an environment for such jobs where you are pretty confident you have some
headroom left.  from my experience, i would think that means a minimum of
16 gbytes of RAM, generous swap space (on top of RAM), and at least
several gigabytes of disk space in `/tmp'.
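before submitting a new round of jobs, a quick sanity check along these
lines may save some grief (the `load' log location below is only a
placeholder; point the grep at wherever your logs actually end up):

  # available RAM and swap, in mbytes, plus free space in `/tmp'
  free -m
  df -h /tmp
  # scan the logs written by the `load' script for signs of trouble
  grep -i -E 'error|memory|storage' /path/to/load/logs/*.log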
as regards your earlier (related) question about resource usage:
> What was your memory high water mark during test (as opposed to
> training)?
memory consumption will depend primarily on two factors: the total number
of results (i.e. distinct trees: `zcat result.gz | wc -l'), and how many
feature templates are active (e.g. levels of grandparenting, n-grams,
active edges, constituent weight).  i have started to run experiments
again myself, and i notice that we have become sloppy about memory use
(the process holds on to data longer than it needs to, and the specifics
of Lisp-internal memory management may be sub-optimal too).
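to get a feel for the first of those two factors across your profiles, a
loop along these lines will do (again assuming the directory layout from
your message; the glob may need adjusting):

  # total number of results (distinct trees) per profile, largest first
  for p in ~/logon/lingo/redwoods/tsdb/home/*; do
    [ -f "$p/result.gz" ] || continue
    printf '%8d  %s\n' "$(zcat "$p/result.gz" | wc -l)" "$p"
  done | sort -rn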
i am currently making liberal changes to the LOGON `trunk', so i would
suggest you stick to the HandOn release version until everything has
stabilized again (hopefully sometime next week, or so).
all best - oe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++