[pet] PET: managing resource limits

Stephan Oepen oe at ifi.uio.no
Mon Aug 29 14:32:55 CEST 2011


fellow PET developers (and users),

i have a suggestion for house cleaning in the code: checking, enforcing, and
reporting resource limits.  as background, last week i noticed that with very
long sentences, token mapping can take several minutes (i wonder whether
there is low-hanging fruit in terms of making that phase more efficient, but
that is not my current concern).  thus i added a check for resource limits in
token mapping, see:

  http://pet.opendfki.de/changeset/819

now i wonder how to further generalize my new test_resource_limits(), so as
to encapsulate all testing for resource exhaustion in this one function (a
sketch of what that could look like follows the phase inventory below).  this
would seem to call for a review of current assumptions and policies, some of
which today are implicit or somewhat undefined, i believe.

as far as i can see, one could abstractly break processing into eight phases:

  (1) preprocessing: tokenization, tagging, NER;
  (2) token mapping: rewriting the initial token lattice;
  (3) morphological hypothesis generation;
  (4) lexical instantiation and parsing;
  (5) lexical filtering: ditching unwanted entries from the lexical chart;
  (6) syntactic forest construction;
  (6') robust syntactic forest augmentation (to be defined);
  (7) (selective) unpacking;
  (8) result post-processing, e.g. MRS read-out.

some of these are optional, and of course one could make more fine-grained or
coarser distinctions between phases.  with the addition of (e.g. PCFG-based)
robustness measures, for example, the interface between phases (6) and (6')
may need revisions.
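
to make this concrete, here is a minimal sketch of how the phases above could
be enumerated, so that a single test_resource_limits() entry point can be
called from each of them.  this is not the actual PET code; the enum, the
timer variable, and the simplified time check are all hypothetical:

  #include <ctime>

  // hypothetical inventory of the phases listed above
  enum parse_phase {
    PREPROCESSING,          // (1) tokenization, tagging, NER
    TOKEN_MAPPING,          // (2) rewriting the initial token lattice
    MORPHOLOGY,             // (3) morphological hypothesis generation
    LEXICAL_INSTANTIATION,  // (4) lexical instantiation and parsing
    LEXICAL_FILTERING,      // (5) pruning the lexical chart
    FOREST_CONSTRUCTION,    // (6) syntactic forest construction
    ROBUST_AUGMENTATION,    // (6') robust forest augmentation
    UNPACKING,              // (7) (selective) unpacking
    POSTPROCESSING          // (8) e.g. MRS read-out
  };

  // absolute deadline, set when parsing starts (a stand-in for the
  // actual timer machinery)
  static std::time_t parse_deadline;

  // true when the current phase has run out of resources; per-phase
  // budgets, memory limits, etc. would be consulted here, keyed on
  // the phase argument
  bool test_resource_limits(parse_phase phase) {
    (void)phase;  // only the global timeout in this sketch
    return std::time(0) >= parse_deadline;
  }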

in my view, running into a resource limit in phases (1) -- (5) should make
the parse fail immediately, i.e. there is no way of recovering gracefully,
and parsing should deliver no results.

for the purpose of the current discussion, i am inclined to treat phases (6)
and (6') as one phase.  when robust processing is enabled, there will be some
triggering condition (say, there were no results in forest construction), and
likewise there may be some mechanism of sub-allotting available resources
between these two phases (e.g. as part of the definition of (6'), one might
say that it can borrow up to n% of the resources originally available to
(6)).  if you agree with that point of view, let us ignore (6') for now.
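
to pin down the arithmetic of such a sub-allotment (the function and
parameter names are made up for illustration):

  // resources (6') may borrow, as n% of what (6) originally received;
  // e.g. robust_budget(40, 25) == 10 (seconds, say)
  int robust_budget(int forest_budget, int robust_allotment) {
    return forest_budget * robust_allotment / 100;
  }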

typically, we expect resource exhaustion in phase (6).  when no results have
been found at that point, parsing fails.  however, in case the forest
contains one or more candidate complete analyses at the point of resource
exhaustion, it would seem plausible to use a tiny amount of extra resources
on unpacking.  with a bit of luck, that way the parser can at least deliver
some result(s), although we must then be careful to flag clearly that these
may be sub-optimal, seeing that the search was ended prematurely.

to conceptualize the division of resources between (6) and (7), right now i
would again be tempted to apply a notion of sub-allotment, i.e. let whatever
resources are allowed to go into (7) be a fraction of what was originally
available to (6).

more concretely, let us assume a global timeout of 60 seconds.  by the time
we enter phase (6), let us further assume that 50 seconds remain available.
i would like to introduce a new parameter, say 'unpacking allotment', and
assume a value of, for example, 10%.  under these conditions, after 45
seconds in phase (6), the main parser loop should check whether there are
candidate results already.  if so, it should abort phase (6) and move into
phase (7), leaving up to 5 seconds for unpacking.  conversely, if no results
have been found in forest construction yet, it would seem appropriate to
continue phase (6) until the first result is found (or the timer expires),
i.e. essentially introduce a notion of 'overtime' forest construction,
'borrowing' resources allotted to unpacking while there is nothing to unpack.
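
in code, the check at the end of the main loop in phase (6) might then look
roughly like this.  again a sketch under the assumptions of the example;
'unpacking_allotment' and the other names are hypothetical, not existing PET
parameters:

  #include <ctime>

  static std::time_t forest_start;      // set on entry to phase (6)
  static int forest_budget = 50;        // seconds available to (6)
  static int unpacking_allotment = 10;  // per cent reserved for (7)

  // true when phase (6) should stop and hand over to unpacking
  bool should_start_unpacking(int n_candidate_results) {
    int elapsed = (int)(std::time(0) - forest_start);
    int reserve = forest_budget * unpacking_allotment / 100;  // 5 seconds
    // once inside the reserved window (after 45 seconds here), stop as
    // soon as at least one candidate analysis exists; with none, keep
    // constructing the forest in 'overtime', borrowing the unpacking
    // reserve while there is nothing to unpack (up to the global timeout)
    return elapsed >= forest_budget - reserve && n_candidate_results > 0;
  }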

finally, for the time being, i doubt we need to worry about resource overrun
in the final phase (8).

before we think about making actual changes to the code, i would like to ask for
your thoughts on the above scheme, specifically maybe from bart and yi (seeing
you have recently experimented with increased robustness).  does the general
scheme make sense to you?  is it sufficiently fine-grained and general?

in terms of next steps, maybe we could try a top-down development model, for
once :-).  i.e. write up what we would like to have (somewhere on the wiki), and
then see whether we can adapt the code (on top of the cleaned-up NG branch,
i would suggest).

best, oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++    --- oe at ifi.uio.no; stephan at oepen.net; http://www.emmtee.net/oe/ ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++



