[developers] What errors can cause a grid parse reranking process to return an empty scores file?
Stephan Oepen
oe at ifi.uio.no
Fri Feb 27 17:57:13 CET 2009
hi again, bill,
> I am running many different grid* files in parallel using the Condor
> distributed computing system. On my latest round of jobs, all of my
> Condor jobs completed, but when I look at the directories created
> under ~/logon/lingo/redwoods/tsdb/home, I find that most of them have
> empty score files. I take the empty score files to be a sign that
> something didn't work.
yes, i would say they indicate a failed experiment.  there are quite a
few ways in which individual experiments can fail without the complete
job failing as a whole.  Lisp may run out of memory at some point but
`recover' from that and carry on; for example, when reading the profile
data to start an experiment, a fresh Lisp process will almost certainly
need to grow substantially.  with limited RAM and swap space, that growth
may fail, and the resulting `out of memory' error may be caught by the
caller.  i can say for sure that [incr tsdb()] frequently catches errors,
but i am less confident (off the top of my head) about how they are
handled in the context of feature caching and ME grid searches.  our
`approach' has typically been a lazy one, namely to avoid errors of this
kind in the first place, hence it is quite likely that they are not
handled in a very meaningful way: experiments might end up being skipped,
or even executed with incomplete data ...
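for a quick overview of which experiments came out empty, a check along
these lines should do (i am assuming the per-profile score files are
plain files called `score' under the tsdb home; adjust names and paths
to your set-up):

  # profiles whose score file exists but is empty
  find ~/logon/lingo/redwoods/tsdb/home -name score -size 0 -print
  # if your set-up compresses score files, an empty member still carries
  # a gzip header, so test the uncompressed line count instead
  for f in ~/logon/lingo/redwoods/tsdb/home/*/score.gz; do
    [ -f "$f" ] || continue
    [ "$(zcat "$f" | wc -l)" -eq 0 ] && echo "$f"
  done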
in a similar spirit, the parameter searches invoke `tadm' and `evaluate'
many times, and either one could crash (for instance on insufficient
memory, or insufficient disk space in `/tmp'); again, i cannot really say
how that would be handled.  i am afraid my best recommendation is to (a)
inspect the log files created by the `load' script and (b) try to create
an environment for such jobs where you are pretty confident you have some
headroom left.  from my experience, i would think that means a minimum of
16 gbytes of RAM, generous swap space (on top of RAM), and at least
several gigabytes of disk space in `/tmp'.
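before submitting a new round of jobs, a quick sanity check along these
lines may save some grief (the `load' log location below is only a
placeholder; point the grep at wherever your logs actually end up):

  # available RAM and swap, in mbytes, plus free space in `/tmp'
  free -m
  df -h /tmp
  # scan the logs written by the `load' script for signs of trouble
  grep -i -E 'error|memory|storage' /path/to/load/logs/*.log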
as regards your earlier (related) question about resource usage:
> What was your memory high water mark during test (as opposed to
> training)?
memory consumption will depend primarily on two factors: the total number
of results (i.e. distinct trees: `zcat result.gz | wc -l'), and how many
feature templates are active (e.g. levels of grandparenting, n-grams,
active edges, constituent weight).  i have started to run experiments
again myself, and i notice that we have become sloppy about memory use
(the process holds on to data longer than it needs to, and the specifics
of Lisp-internal memory management may be sub-optimal too).
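to get a feel for the first of those two factors across your profiles, a
loop along these lines will do (again assuming the directory layout from
your message; the glob may need adjusting):

  # total number of results (distinct trees) per profile, largest first
  for p in ~/logon/lingo/redwoods/tsdb/home/*; do
    [ -f "$p/result.gz" ] || continue
    printf '%8d  %s\n' "$(zcat "$p/result.gz" | wc -l)" "$p"
  done | sort -rn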
i am currently making liberal changes to the LOGON `trunk', so i would
suggest you stick to the HandOn release version until everything has
stabilized again (hopefully sometime next week, or so).
all best - oe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++