[developers] [itsdb] parse big corpus with itsdb

Fri Nov 10 12:24:26 CET 2006

Hi Stephan and all,

I do find the changes appropriate :-) thanks for the work. It is true that
the forest creation is relatively inexpensive. However, Valia and I are
still a little concerned about the potential efficiency loss on the German
Grammar. Berthold, could you estimate how large the efficiency loss will be?
Is an extra option necessary?

Theoretically, this might lead to a discussion of whether selective
(k-best) forest creation is necessary. But an extra option for forest
creation would be an easy (though non-optimal) solution.
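
To make the trade-off concrete, here is a toy sketch in Python (not PET
code). It assumes readings are plain (name, score) pairs listed in the order
the parser would find them, and `limit_forest` is a hypothetical name for the
extra option. A real packed forest is of course not a flat list; the sketch
only illustrates why cutting off forest creation early can change which
reading comes out as the best one.

```python
import heapq

def n_best(readings, n, limit_forest=False):
    """Return the n highest-scoring readings.

    readings: list of (name, score) pairs in discovery order (toy stand-in
    for a parse forest). limit_forest is a hypothetical switch: when set,
    "forest creation" stops after the first n readings are found.
    """
    pool = readings[:n] if limit_forest else readings
    return heapq.nlargest(n, pool, key=lambda r: r[1])

# The best-scoring reading happens to be discovered last.
readings = [("r1", 0.2), ("r2", 0.5), ("r3", 0.9)]

full = n_best(readings, 1)                           # full forest, then rank
truncated = n_best(readings, 1, limit_forest=True)   # limit applied too early
```

With the full pool the top reading is r3; with the early limit only r1 is
ever seen, so the "1-best" result is no longer the true best.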

Another use for such an option that I can think of is in coverage testing,
where only the parsability of the sentence is of interest. In such cases,
creating the entire parse forest does not seem necessary.
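
As a rough illustration of the coverage-test case, the following Python
sketch uses a toy agenda-driven chart parser over a binary context-free
grammar. This is only a stand-in: PET works with typed feature structures,
and the names `parse` and `stop_at_first` are made up for this example. With
`stop_at_first` set, the loop ends as soon as one edge with the start
category spans the whole input, instead of exhausting the agenda.

```python
from collections import deque

def parse(tokens, lexicon, rules, start="S", stop_at_first=False):
    """Toy bottom-up chart parser; returns (parsable, edges_built).

    lexicon: token -> tuple of categories
    rules:   (left_cat, right_cat) -> tuple of parent categories
    """
    n = len(tokens)
    # Seed the agenda with lexical edges (start, end, category).
    agenda = deque((i, i + 1, cat)
                   for i, tok in enumerate(tokens)
                   for cat in lexicon.get(tok, ()))
    chart = []
    parsable = False
    while agenda:
        edge = agenda.popleft()
        if edge in chart:
            continue
        chart.append(edge)
        i, j, cat = edge
        if (i, j, cat) == (0, n, start):
            parsable = True
            if stop_at_first:
                break  # coverage test: one spanning analysis is enough
        # Combine the new edge with adjacent edges already in the chart.
        for k, l, other in list(chart):
            if l == i:  # other edge ends where the new edge starts
                for parent in rules.get((other, cat), ()):
                    agenda.append((k, j, parent))
            if k == j:  # other edge starts where the new edge ends
                for parent in rules.get((cat, other), ()):
                    agenda.append((i, l, parent))
    return parsable, len(chart)

lexicon = {"dogs": ("NP",), "bark": ("VP",), "loudly": ("ADV",)}
rules = {("NP", "VP"): ("S",), ("VP", "ADV"): ("VP",)}
```

Calling `parse("dogs bark loudly".split(), lexicon, rules, stop_at_first=True)`
answers the parsability question without necessarily building every edge;
how much is saved depends on the grammar and on the agenda order.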

Stephan, Berthold and Bernd, what do you think?

Best,
yi

On 11/9/06, Stephan Oepen <oe at csli.stanford.edu> wrote:
>
> hi again,
>
> > I also think the use of `-nsolutions' is particularly vague at the
> > moment. I believe this is partly due to the split of the parsing
> > phases. To the PET developers: should the option be split for
> > particular phases of parsing?
>
> i had to check the code to convince myself the above was true :-).  i think
> in packing mode, `-nsolutions' should only affect the second phase, and
> we should always compute the full forest.  i was so sure of this point
> of view that i just checked in the code changes to make it so.  here is
> what i put into the ChangeLog:
>
>   - ignore nsolutions limit in forest construction phase when packing
>     is on; the rationale here is that (a) forest construction is cheap
>     and (b) we need to have the full forest available for selective
>     unpacking to compute the correct sequence of n-best results.
>
> in fact, what i say about selective unpacking here is equally true for
> the exhaustive unpacking mode (which should soon be deprecated, as it
> remains restricted to local features).  while i write this, i realize
> that forest construction may be more expensive in GG, hence my change
> might cause berthold a loss in efficiency?  a small price for greater
> precision, i would hope!  berthold, if not, i volunteer to add another
> switch, just as zhang yi had suggested.
>
> while making this change, i checked in a few more minor updates, viz:
>
>   - allow selective unpacking by default when `-packing' is on, i.e. it
>     is no longer required to say `-packing=15' (but still `-nsolutions'
>     greater than 0 is needed to actually get selective unpacking);
>   - fix an error in the YY tokenizer to make it robust to tokens coming
>     in out of surface order;
>   - complete spring cleaning of identity2() along the lines of my email
>     of 31-oct (bernd i could not test jxchg output, but i am optimistic
>     i did the right thing);
>   - make the MEM reader robust to various value formats in the global
>     parameter section;
>   - ditch the (deprecated) *maxent-grandparenting* parameter; its name
>     is canonically *feature-grandparenting*, and [incr tsdb()] will use
>     that name in generating MEM files.
>
> zhang yi and bernd, i hope you will find all of the above agreeable!
>
>                                                            best  -  oe
>
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>