[developers] results after edge limit reached?

Stephan Oepen oe at ifi.uio.no
Wed Aug 19 19:56:59 CEST 2009


g'day, with apologies for a late follow-up,

> I had hoped so. I use -tsdbdump because I want fsc input - I believe 
> that is currently the only way to input fsc?

the `i-input' field in [incr tsdb()] can be an arbitrary string, hence
it should in principle be possible to create profiles where the input
to the parser is in FSC format.  i believe we have failed to get this
to work with PiC in the past because the PiC XML reader requires two
linebreaks following the XML, and by default [incr tsdb()] normalizes
input strings (replacing sequences of whitespace with a single space)
during import (i.e. when creating profiles from text files).  it should
however be possible to patch the `i-input' field in the `item' relation
with two linefeeds (escaped as `\n\n' in tsdb(1) data files) after the
fact, and then i imagine even PiC might work (i have not tested this).
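
for concreteness, a minimal python sketch of what such after-the-fact
patching could look like; it assumes an uncompressed `item' file and the
conventional column position of `i-input' (check the profile's `relations'
file for the authoritative field order), and like the suggestion above it
is untested:

  # hypothetical sketch: append two escaped linefeeds to the `i-input' field
  # of an [incr tsdb()] `item' file.  the 0-based field position below assumes
  # the conventional item schema; adjust it to the profile's `relations' file.
  import sys

  I_INPUT = 6                       # position of `i-input' (an assumption)

  def patch_item_file(path):
      with open(path) as stream:
          records = [line.rstrip("\n").split("@") for line in stream]
      for fields in records:
          # tsdb(1) data files escape literal linefeeds as `\n'; append two
          if not fields[I_INPUT].endswith("\\n\\n"):
              fields[I_INPUT] += "\\n\\n"
      with open(path, "w") as stream:
          for fields in records:
              stream.write("@".join(fields) + "\n")

  if __name__ == "__main__":
      patch_item_file(sys.argv[1])

(run it over a copy of the profile, and gunzip `item.gz' first in case the
profile is stored compressed.)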

whether or not FSC would require similar magic will depend on how its
XML reader is configured, specifically how it detects the end of input
when reading from a string (rather than from a stream).  but i am sure
this could be made to work.

  [...]

> Actually, I think
> 
> (time spent on all items, including failed and errored)/
> (number of items that get an analysis)
> 
> is the most meaningful figure in batch processing for an application, 
> but over all input items is also reasonable. The current [incr tsdb()] 
> uses "number of input items - items with lexical gaps" as denominator (I 
> think), which is neither here nor there. For the moment, I'm going to 
> pick a definition and stick with it, but I think if people are going to 
> be reporting times and memory use from [incr tsdb()], we should decide 
> what those figures measure.

the above seems like a weird `average' to me, because i would expect
the denominator to always be the cardinality of the set over which one
has summed (whatever property) for the numerator.  your proposal seems
to mix aspects of coverage and efficiency, where i believe we need to
keep three basic measures separate: (a) coverage, as in what percentage
of inputs receive one or more analyses; (b) efficiency, average time or
other resource usage per input; and (c) accuracy, as in some measure of
how `good' analyses actually are.  as a PET consumer, i believe i would
ask (a) how often can i expect to get a result?  (b) when i parse 1,000
sentences, how long will it take?  and (c) to what degree can i rely
on the results i get?
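
to make the separation concrete, here is a toy python sketch over per-item
records with `readings' and `total' (milliseconds) fields; the names echo
the [incr tsdb()] parse relation, but take the whole thing as an
illustration rather than actual [incr tsdb()] code.  accuracy (c) is left
out, as it requires comparison against a gold standard:

  def summarize(items):
      # (a) coverage: share of inputs with one or more analyses
      coverage = sum(1 for i in items if i["readings"] > 0) / float(len(items))
      # (b) efficiency: average time over all inputs, analyzed or not
      msec = sum(i["total"] for i in items) / float(len(items))
      return coverage, msec

  items = [{"readings": 2, "total": 340},
           {"readings": 0, "total": 120}]    # no analysis, e.g. a lexical gap
  print(summarize(items))                    # (0.5, 230.0)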

in [incr tsdb()], readings = -1 was originally used to flag inputs for
which the processor encountered an internal error (Lisp running out of
memory, say), in which case no reliable statistics could be reported.
therefore, PET (and other processing clients) should presumably report
items with lexical gaps as readings = 0 and return a non-empty `error'
value.  so, in other words, [incr tsdb()] is of course doing the right
thing (potentially excluding some items from some statistics), only we
need to change PET to reserve readings = -1 for errors that prevent it
from reporting accurate resource consumption statistics.  this may just
be an issue of re-ordering a bit of the statistics code in the parser,
but it is too intricate a change for me to apply in a hurry.  it would
make our average parse times look smaller, of course, so maybe there is
enough motivation to make this change relatively soon :-).  would anyone
oppose such a move?
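
in code, the convention i am proposing might look roughly like the
following sketch (hypothetical record and field names, not actual PET
internals):

  from collections import namedtuple

  # hypothetical summary of one parse attempt
  Outcome = namedtuple("Outcome",
                       "internal_error lexical_gap analyses message msec")

  def report(o):
      if o.internal_error:          # e.g. running out of memory: resource
          return {"readings": -1, "error": o.message}   # statistics unreliable
      if o.lexical_gap:             # input was processed, but unknown words
          return {"readings": 0, "error": o.message, "total": o.msec}
      return {"readings": len(o.analyses), "error": "", "total": o.msec}

  print(report(Outcome(False, True, [], "lexical gap: `foo'", 57)))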

                                                      all best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


