[developers] [itsdb] parse big corpus with itsdb
crysmann at dfki.de
Fri Nov 10 13:22:12 CET 2006
On Fri, 2006-11-10 at 12:24 +0100, Yi Zhang wrote:
> Hi Stephan and all,
> I do find the changes appropriate :-) thanks for the work. It is true
> that the forest creation is relatively inexpensive. However, Valia and
> I are still a little concerned about the potential efficiency loss on
> the German Grammar. Berthold, could you estimate how large the
> efficiency loss will be? Is an extra option necessary?
I can do a test run next week. But if restricting the number of
solutions during forest creation can result in losing the optimal parse,
I do not think that sounds too attractive as a performance measure.
Probably, this can only be decided experimentally.
As I said in another mail to Stephan, attacking the discontinuity issue
is probably a more appropriate place to solve most of the remaining
efficiency problems with German.
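The worry about losing the optimal parse can be made concrete with a toy example (invented names and scores, not PET code): if forest creation prunes by a cheap phase-one heuristic, the analysis that the final statistical model would rank first may never make it into the forest at all.

```python
# Toy illustration (invented scores, not PET internals): pruning during
# forest creation by a cheap heuristic can drop the globally optimal parse.

def k_best(analyses, k, score):
    """Return the k highest-scoring analyses under `score`."""
    return sorted(analyses, key=score, reverse=True)[:k]

# each analysis: (id, construction-phase heuristic score, final model score)
analyses = [
    ("a", 0.9, 0.2),
    ("b", 0.8, 0.3),
    ("c", 0.1, 0.9),  # heuristic ranks it last; the final model ranks it first
]

heuristic = lambda a: a[1]
final = lambda a: a[2]

# cap at 2 solutions during forest creation, then rank with the final model
pruned = k_best(analyses, 2, heuristic)
best_after_pruning = k_best(pruned, 1, final)[0][0]

# keep the full forest, then rank
best_full = k_best(analyses, 1, final)[0][0]

print(best_after_pruning, best_full)  # prints "b c": pruning dropped "c"
```

This is exactly why the full forest is needed before the final ranking phase runs.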
> Theoretically, this might lead to a discussion of the necessity of
> selective (k-best) forest creation. But an extra option for forest
> creation would be an easy (though non-optimal) solution.
> Another use of such an option I can think of is in coverage testing,
> where only the parsability of the sentence is of interest.
> In such cases, the creation of the entire parse forest does not seem
> necessary.
> Stephan, Berthold and Bernd, what do you think?
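The coverage-testing point can be sketched as follows (an assumed lazy-parser interface, not PET's actual API): when only parsability matters, it suffices to pull a single analysis and never materialize the rest of the forest.

```python
# Sketch with an assumed interface (not PET's API): for a pure coverage
# test, only the existence of one analysis matters, so we stop as soon
# as the first reading is produced instead of building the whole forest.

from itertools import islice

def analyses(sentence):
    """Stand-in for a parser that lazily yields analyses one at a time."""
    for i in range(1000000):   # pretend there are very many readings
        yield f"reading-{i}"

def parsable(sentence):
    # consume at most one analysis; the remaining readings are never computed
    return next(islice(analyses(sentence), 1), None) is not None

print(parsable("Kim sleeps"))  # prints True after producing a single reading
```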
> On 11/9/06, Stephan Oepen <oe at csli.stanford.edu> wrote:
> hi again,
> > I also think the use of `-nsolutions' is particularly vague at the
> > moment. I believe this is partly due to the split of the phases.
> > To PET developers: should the option be split for particular
> > phases of parsing?
> i had to check the code to convince me the above was true :-). i think
> in packing mode, `-nsolutions' should only affect the second phase, and
> we should always compute the full forest. i was so sure of this point
> of view that i just checked in the code changes to make it so. here is
> what i put into the ChangeLog:
> - ignore nsolutions limit in forest construction phase when packing
>   is on; the rationale here is that (a) forest construction is cheap
>   and (b) we need to have the full forest available for selective
>   unpacking to compute the correct sequence of n-best results.
> in fact, what i say about selective unpacking here is equally true for
> the exhaustive unpacking mode (which should soon be deprecated, as it
> remains restricted to local features). while i write this, i realize
> that forest construction may be more expensive in GG, hence my change
> might cause berthold a loss in efficiency? a small price for
> precision, i would hope! berthold, if not, i volunteer to add a
> switch, just as zhang yi had suggested.
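The n-best unpacking that this rationale appeals to can be sketched over a toy packed forest (an invented data structure with invented scores, not PET internals). A production implementation enumerates derivations lazily from a priority queue; the pruned dynamic program below shows the key idea, namely that only the top n derivations per node ever need to be kept.

```python
# Minimal sketch of n-best unpacking over a packed forest (invented data
# structure and scores, not PET internals): each node packs alternative
# edges, and we enumerate only the n best derivations per node.

from itertools import product
from heapq import nlargest

# node -> list of edges; an edge is (local score, list of child nodes)
forest = {
    "S":  [(10, ["NP", "VP"]), (5, ["NP", "VP"])],
    "NP": [(7, []), (2, [])],
    "VP": [(9, []), (4, [])],
}

def n_best(node, n):
    """Top-n derivation scores for `node` (assumes an acyclic forest)."""
    candidates = []
    for local, children in forest[node]:
        # keeping only each child's top n suffices, since scores add up
        child_lists = [n_best(c, n) for c in children]
        for combo in product(*child_lists):
            candidates.append(local + sum(combo))
    return nlargest(n, candidates)

print(n_best("S", 3))  # prints [26, 21, 21]
```

Because the sums are monotone in the child scores, pruning each child to its own top n cannot lose any of the parent's top n derivations, which is why the full forest plus selective unpacking yields the correct n-best sequence.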
> while making this change, i checked in a few more minor updates, viz:
> - allow selective unpacking by default when `-packing' is on, i.e. it
>   is no longer required to say `-packing=15' (but still a value
>   greater than 0 is needed to actually get selective unpacking);
> - fix an error in the YY tokenizer to make it robust to tokens coming
>   in out of surface order;
> - complete spring cleaning of identity2() along the lines of my email
>   of 31-oct (bernd, i could not test jxchg output, but i am confident
>   i did the right thing);
> - make the MEM reader robust to various value formats in the
>   parameter section;
> - ditch the (deprecated) *maxent-grandparenting* parameter; its name
>   is canonically *feature-grandparenting*, and [incr tsdb()] will use
>   that name in generating MEM files.
> zhang yi and bernd, i hope you will approve of all of the above.
> best - oe
>
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++ --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---