[developers] [itsdb] parse big corpus with itsdb
Stephan Oepen
oe at csli.Stanford.EDU
Thu Nov 9 20:26:21 CET 2006
hi again,
> I also think the use of `-nsolutions' is particularly vague at the
> moment. I believe this is partly due to the split of the parsing
> phases. To PET developers, should the option be split for
> particular phases of parsing?
i had to check the code to convince myself the above was true :-). i think
in packing mode, `-nsolutions' should only affect the second phase, and
we should always compute the full forest. i was so sure of this point
of view that i just checked in the code changes to make it so. here is
what i put into the ChangeLog:
- ignore nsolutions limit in forest construction phase when packing
is on; the rationale here is that (a) forest construction is cheap
and (b) we need to have the full forest available for selective
unpacking to compute the correct sequence of n-best results.
in fact, what i say about selective unpacking here is equally true for
the exhaustive unpacking mode (which should soon be deprecated, as it
remains restricted to local features). while i write this, i realize
that forest construction may be more expensive in GG, hence my change
might cause berthold a loss in efficiency? a small price for greater
precision, i would hope! berthold, if not, i volunteer to add another
switch, just as zhang yi had suggested.
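
to illustrate point (b), here is a toy sketch in python. it is not PET code: the reading names, the scores, and all function names are made up for illustration, and a flat list of readings stands in for the packed forest. the assumption is that the parser discovers readings in an order that need not match the ranking model.

```python
import heapq

# hypothetical toy data: (reading, model score) pairs, listed in the order
# the parser happens to find them, which is not the order of the ranking model
readings_in_discovery_order = [("r1", 0.2), ("r2", 0.9), ("r3", 0.5), ("r4", 0.95)]


def n_best(readings, n):
    """return the n highest-scoring readings, best first."""
    return heapq.nlargest(n, readings, key=lambda r: r[1])


def truncated_then_ranked(readings, n):
    """old behaviour (sketch): stop construction after n readings, then rank."""
    return n_best(readings[:n], n)


def full_then_ranked(readings, n):
    """new behaviour (sketch): build everything, apply the limit only when ranking."""
    return n_best(readings, n)


if __name__ == "__main__":
    print(truncated_then_ranked(readings_in_discovery_order, 2))
    print(full_then_ranked(readings_in_discovery_order, 2))
```

with a limit of 2, the truncated variant returns r2 and r1, while the full variant returns r4 and r2, i.e. cutting off construction early can lose the top-ranked result.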
while making this change, i checked in a few more minor updates, viz:
- allow selective unpacking by default when `-packing' is on, i.e. it
is no longer required to say `-packing=15' (but still `-nsolutions'
greater than 0 is needed to actually get selective unpacking);
- fix an error in the YY tokenizer to make it robust to tokens coming
in out of surface order;
- complete spring cleaning of identity2() along the lines of my email
of 31-oct (bernd, i could not test jxchg output, but i am optimistic
i did the right thing);
- make the MEM reader robust to various value formats in the global
parameter section;
- ditch the (deprecated) *maxent-grandparenting* parameter; its name
is canonically *feature-grandparenting*, and [incr tsdb()] will use
that name in generating MEM files.
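
the interaction of `-packing' and `-nsolutions' after these changes can be summarized in a small python sketch. this is not the actual PET option handling: the function names are hypothetical, and treating a value of 0 as `no limit' when packing is off is an assumption on my part.

```python
from typing import Optional


def forest_construction_limit(packing: bool, nsolutions: int) -> Optional[int]:
    """limit on readings in the first phase; None means no limit."""
    if packing:
        # with packing on, the limit is ignored and the full forest is built
        return None
    # assumption: without packing, 0 means unlimited, otherwise the limit applies
    return nsolutions if nsolutions > 0 else None


def selective_unpacking(packing: bool, nsolutions: int) -> bool:
    """selective unpacking needs packing on and a positive nsolutions."""
    return packing and nsolutions > 0


if __name__ == "__main__":
    for packing, n in [(True, 0), (True, 5), (False, 5)]:
        print(packing, n, forest_construction_limit(packing, n),
              selective_unpacking(packing, n))
```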
zhang yi and bernd, i hope you will find all of the above agreeable!
best - oe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++