[developers] [itsdb] parse big corpus with itsdb

Berthold Crysmann crysmann at dfki.de
Fri Nov 10 13:22:12 CET 2006

On Fri, 2006-11-10 at 12:24 +0100, Yi Zhang wrote:

> Hi Stephan and all,
> I do find the changes appropriate :-) thanks for the work. It is true
> that the forest creation is relatively inexpensive. However, Valia and
> I are still a little concerned about the potential efficiency loss on
> the German Grammar. Berthold, could you estimate how large the
> efficiency loss will be? Is an extra option necessary? 

I can do a test run next week. But if restricting the number of
solutions during forest creation may result in losing the optimal parse,
I do not think that sounds too attractive as a performance measure.
Probably, this can only be decided experimentally.
As I said in another mail to Stephan, attacking the discontinuity issue
is probably a more appropriate way to solve most of the remaining
efficiency problems with German.
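
To make the concern concrete, here is a toy sketch in Python (not PET
code; every name in it is made up for illustration, and the flat list of
scored readings is only a stand-in for a real packed forest). It assumes
that readings are found in an order unrelated to their final score, and
shows how a limit applied while collecting readings can drop the best
one before ranking ever sees it.

```python
# Toy illustration only: all names are hypothetical and do not mirror PET.
from itertools import islice

# Readings in the order a (toy) parser happens to find them, each paired
# with the score a ranking model would assign afterwards.
ANALYSES_IN_DISCOVERY_ORDER = [
    ("reading-a", 0.20),
    ("reading-b", 0.35),
    ("reading-c", 0.90),  # best reading, but discovered last
]


def n_best(analyses, n):
    """Return the n highest-scoring analyses, best first."""
    return sorted(analyses, key=lambda a: a[1], reverse=True)[:n]


def limit_during_construction(analyses, limit):
    """Stop collecting analyses once `limit` of them have been found."""
    return list(islice(analyses, limit))


# Rank over everything that was found.
full = n_best(ANALYSES_IN_DISCOVERY_ORDER, 1)

# Apply the limit first, then rank what is left.
truncated = n_best(limit_during_construction(ANALYSES_IN_DISCOVERY_ORDER, 2), 1)

print("full:", full)
print("truncated:", truncated)
```

Ranking over the full set returns reading-c; with a limit of two applied
first, the best remaining candidate is reading-b.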


> Theoretically, this might lead to a discussion of the necessity of
> selective (k-best) forest creation. But an extra option for the forest
> creation will be an easy (though non-optimal) solution.
> Another use of such an option I can think of is in the coverage test,
> where only the parsability of the sentence is of interest.
> In such cases, the creation of the entire parse forest does not seem
> necessary.
> Stephan, Berthold and Bernd, what do you think?
> Best,
> yi
> On 11/9/06, Stephan Oepen <oe at csli.stanford.edu> wrote:
>         hi again,
>         > I also think the use of `-nsolutions' is particularly vague at the
>         > moment. I believe this is partly due to the split of the parsing
>         > phases. To PET developers, should the option be split for
>         > particular phases of parsing?
>         i had to check the code to convince myself the above was true :-).  i think
>         in packing mode, `-nsolutions' should only affect the second phase, and
>         we should always compute the full forest.  i was so sure of this point
>         of view that i just checked in the code changes to make it so.  here is
>         what i put into the ChangeLog:
>           - ignore nsolutions limit in forest construction phase when packing
>             is on; the rationale here is that (a) forest construction is cheap
>             and (b) we need to have the full forest available for selective
>             unpacking to compute the correct sequence of n-best results.
>         in fact, what i say about selective unpacking here is equally true for
>         the exhaustive unpacking mode (which should soon be deprecated, as it
>         remains restricted to local features).  while i write this, i realize
>         that forest construction may be more expensive in GG, hence my change
>         might cause berthold a loss in efficiency?  a small price for greater
>         precision, i would hope!  berthold, if not, i volunteer to add another
>         switch, just as zhang yi had suggested.
>         while making this change, i checked in a few more minor updates, viz:
>           - allow selective unpacking by default when `-packing' is on, i.e. it
>             is no longer required to say `-packing=15' (but still `-nsolutions'
>             greater than 0 is needed to actually get selective unpacking);
>           - fix an error in the YY tokenizer to make it robust to tokens coming
>             in out of surface order;
>           - complete spring cleaning of identity2() along the lines of my email
>             of 31-oct (bernd i could not test jxchg output, but i am optimistic
>             i did the right thing);
>           - make the MEM reader robust to various value formats in the global
>             parameter section;
>           - ditch the (deprecated) *maxent-grandparenting* parameter; its name
>             is canonically *feature-grandparenting*, and [incr tsdb()] will use
>             that name in generating MEM files.
>         zhang yi and bernd, i hope you will find all of the above agreeable!
>         best  -  oe
>         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>         +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
>         +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
>         +++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
>         +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++