[developers] [itsdb] parse big corpus with itsdb

Stephan Oepen oe at csli.Stanford.EDU
Thu Nov 9 20:26:21 CET 2006

hi again,

> I also think the use of `-nsolutions' is particularly vague at the
> moment. I believe this is partly due to the split of the parsing
> phases. To PET developers, should the option be split for
> particular phases of parsing?

i had to check the code to convince myself the above was true :-).  i think
in packing mode, `-nsolutions' should only affect the second phase, and
we should always compute the full forest.  i was so sure of this point 
of view that i just checked in the code changes to make it so.  here is
what i put into the ChangeLog:

  - ignore nsolutions limit in forest construction phase when packing
    is on; the rationale here is that (a) forest construction is cheap
    and (b) we need to have the full forest available for selective
    unpacking to compute the correct sequence of n-best results.
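
point (b) can be illustrated with a toy sketch in python.  this is not
PET code: a flat list of scored readings stands in for the packed forest,
and the names build_forest(), n_best(), and the scores are hypothetical,
chosen only to show why cutting off the first phase early can yield the
wrong n-best sequence.

```python
import heapq

def build_forest(candidates, limit=None):
    """Phase 1 (toy): collect readings in discovery order.

    limit=None models building the full forest; an integer limit models
    stopping forest construction after that many readings.
    """
    return list(candidates) if limit is None else list(candidates[:limit])

def n_best(forest, n):
    """Phase 2 (toy): return the n highest-scoring readings, best first."""
    return heapq.nlargest(n, forest, key=lambda reading: reading[1])

# Hypothetical readings as (label, score), listed in discovery order,
# which need not coincide with score order.
candidates = [("A", 0.2), ("B", 0.9), ("C", 0.5), ("D", 0.7)]

full = n_best(build_forest(candidates), 2)
truncated = n_best(build_forest(candidates, limit=2), 2)

print(full)       # n-best over the full forest
print(truncated)  # n-best when phase 1 was cut off after two readings
```

in the truncated run, reading D is never built, so the second-best result
is wrong even though the limit equals the number of results requested.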

in fact, what i say about selective unpacking here is equally true for
the exhaustive unpacking mode (which should soon be deprecated, as it
remains restricted to local features).  while i write this, i realize
that forest construction may be more expensive in GG, hence my change
might cause berthold a loss in efficiency?  a small price for greater
precision, i would hope!  berthold, if not, i volunteer to add another
switch, just as zhang yi had suggested.

while making this change, i checked in a few more minor updates, viz:

  - allow selective unpacking by default when `-packing' is on, i.e. it
    is no longer required to say `-packing=15' (but still `-nsolutions'
    greater than 0 is needed to actually get selective unpacking);
  - fix an error in the YY tokenizer to make it robust to tokens coming
    in out of surface order;
  - complete spring cleaning of identity2() along the lines of my email
    of 31-oct (bernd, i could not test jxchg output, but i am optimistic
    i did the right thing);
  - make the MEM reader robust to various value formats in the global
    parameter section;
  - ditch the (deprecated) *maxent-grandparenting* parameter; its name
    is canonically *feature-grandparenting*, and [incr tsdb()] will use
    that name in generating MEM files.

zhang yi and bernd, i hope you will find all of the above agreeable!

                                                           best  -  oe

+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
