[itsdb] parse big corpus with itsdb

Wed Nov 8 10:57:47 CET 2006

Hi,

> Here is my understanding of the options:

> -results sits in the output routine and stops it printing all the
> results.  They are still all calculated.

I think that's right.

> -nsolutions asks cheap to only produce the top "n" parses.

Because of ambiguity packing, parsing is split into two
phases: i) packed parse forest creation; ii) unpacking the readings.
`-nsolutions' can have an effect in both phases.

In the first phase, if `-nsolutions' is set to a non-zero value, forest
creation stops as soon as the `first' n packed trees are found (with a kind
of beam search, I think). If `-nsolutions' is not set, or is set to zero,
the entire packed parse forest is created.

In the unpacking phase, the effect depends on the unpacking mechanism used:
- if `packing=7' (which is the default exhaustive unpacking) is used, all
the readings are unpacked (with lots of unification operations replayed)
and sorted according to the scoring model. `-nsolutions' has no effect on
this phase, so you may end up with more readings than `-nsolutions' asked
for.

- if `packing=15' (selective unpacking) is used, only the best n readings
are unpacked from the parse forest. Note that `-nsolutions' must be set to
>0; otherwise the parser falls back to exhaustive unpacking, as with
`-packing=7'. The current implementation supports the basic branching and
grandparenting (with an arbitrary number of levels) features in the scoring
model.
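
To make the interaction easier to see, here is a toy Python sketch of the
decision logic as I described it above. It is only a summary of my
understanding, not code from PET: the names `Plan' and `describe_run' are
made up for illustration, and only the two packing values discussed here (7
and 15) are handled.

```python
from dataclasses import dataclass
from typing import Optional

EXHAUSTIVE = 7   # `packing=7' in this message
SELECTIVE = 15   # `packing=15' in this message


@dataclass
class Plan:
    """Hypothetical summary of what each parsing phase would do."""
    forest: str              # "first-n" or "full"
    unpacking: str           # "exhaustive" or "selective"
    limit: Optional[int]     # cap on unpacked readings, None if uncapped


def describe_run(packing: int, nsolutions: int = 0) -> Plan:
    """Model how packing and nsolutions interact, per the description above."""
    if packing not in (EXHAUSTIVE, SELECTIVE):
        raise ValueError("this sketch only covers packing values 7 and 15")

    # Phase i: forest creation stops early only when nsolutions is non-zero.
    forest = "first-n" if nsolutions > 0 else "full"

    # Phase ii: selective unpacking needs nsolutions > 0; otherwise the
    # behaviour falls back to exhaustive unpacking, which ignores nsolutions.
    if packing == SELECTIVE and nsolutions > 0:
        return Plan(forest, "selective", nsolutions)
    return Plan(forest, "exhaustive", None)
```

For example, `describe_run(15, 0)' gives a full forest with exhaustive
unpacking, which is the fallback case mentioned above.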

I also think the meaning of `-nsolutions' is rather vague at the moment,
partly, I believe, because parsing is split into phases. To the PET
developers: should the option be split into separate options for the
individual phases of parsing?

Stephan and Bernd, please correct me if I am wrong :-)

Best,
yi