[developers] [incr tsdb()]/LKB memory allocation error
aac10 at cl.cam.ac.uk
Sat Apr 9 16:31:43 CEST 2016
Sorry for not replying - did you manage to find a workaround?
On 24/03/2016 21:55, Olga Zamaraeva wrote:
> Looking at the token chart, I see that I in fact have many lexical
> rules for the same orthography, and that results in too many parses (a
> snippet of the output is attached). For example, for this "a-tis-e"
> input that I am trying, there are dozens of lexical rules for a- and
> dozens for -e, and so there are too many combinations. The reason for
> this is that the rules were inferred automatically by a clustering
> algorithm, and I asked for many clusters (there is a reason for asking
> for many clusters also: I am trying to compare these results with
> another algorithm which happened to infer many position classes, so I
> want my clustering to come up with the same number and then to compare
> two grammars).
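To make the blow-up concrete, here is a minimal sketch (the rule counts are hypothetical, not taken from the actual grammar) of how analyses multiply when each affix slot has many competing lexical rules:

```python
from math import prod

# Hypothetical counts of competing lexical rules per affix slot
# for an input like "a-tis-e": prefix slot, stem, suffix slot.
rules_per_slot = [30, 1, 30]  # "dozens" of rules for each affix

# Each combination of rule choices yields a distinct analysis,
# so the total is the product of the per-slot counts.
analyses = prod(rules_per_slot)
print(analyses)  # 900 analyses from just two ambiguous slots
```

With three or four ambiguous slots the product quickly reaches the tens of thousands, which is exactly the regime where a 20K-30K chart edge limit starts to bite.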
> Is there a way to have the LKB stop after it has found a parse, and not
> try other possibilities? I tried doing that in itsdb (by turning off
> exhaustive search and limiting the maximum number of analyses), but it
> still cannot handle this large grammar for some reason...
> Thank you!
> On Thu, Mar 17, 2016 at 1:34 AM Ann Copestake <aac10 at cl.cam.ac.uk
> <mailto:aac10 at cl.cam.ac.uk>> wrote:
> So the process is running out of memory before hitting the limit
> on the number of chart edges, which stops processing a little more
> gracefully. The LKB batch parse process catches some errors in a
> way that allows the rest of the batch to continue. It may be
> that all that's happening is that the chart edge limit was set too
> high relative to the available memory, although it is possible
> that memory is being used in a way that isn't reflected by the
> edge limit, which is why I suggested also looking at the token
> chart. You could increase the amount of memory available to the
> process and see whether you can get your test set through, but
> unless that's the final test set and you don't intend to work on
> any more complex examples than the ones you have, that's only
> going to be a temporary measure.
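A back-of-the-envelope check makes the "edge limit set too high relative to the available memory" point concrete. The per-edge cost below is an assumption for illustration, not a measurement of the LKB:

```python
# All numbers are assumptions: the real cost per chart edge depends
# on the size of the feature structures the grammar builds.
kb_per_edge = 50          # assumed average memory cost of one edge
edge_limit = 30_000       # the 30K limit mentioned in the thread
mb_needed = edge_limit * kb_per_edge / 1024
print(round(mb_needed))   # ~1465 MB before the edge limit ever fires
```

If the per-edge cost is anywhere near that order of magnitude, the process can exhaust its heap well before the edge limit stops the parse, which matches the crash-before-limit behaviour described above.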
> I don't think it will matter whether you look at examples that can
> eventually be parsed, something that fails after a huge number of
> edges or something that causes the memory crash - your task is to
> find out whether there is something you can do to cut down the
> number of rule applications. The good news is that you won't need
> to find many cases of over-application to make a dramatic
> improvement. I think you will see the issues with the grammar when
> you look at a chart, even with a small edge limit.
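The multiplicative structure of the ambiguity is also why a few fixes go a long way: pruning over-applying rules in a single position class divides the whole product. A rough sketch with made-up numbers:

```python
from math import prod

# Made-up per-slot rule counts before and after cleaning up one
# over-applying position class (30 candidate rules cut to 5).
before = [30, 1, 30]
after = [5, 1, 30]   # only the first slot is changed

print(prod(before))  # 900
print(prod(after))   # 150: a 6x reduction from fixing one slot
```

This is the sense in which a small number of over-application cases can account for most of the chart growth.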
> On 17/03/2016 01:55, Olga Zamaraeva wrote:
>> Thank you Ann!
>> I suppose I should try to pin down an input that can be
>> successfully parsed, but does produce a huge chart. Of course my
>> most pressing problem is not that some inputs are parsed with
>> huge charts but that some inputs can never be parsed and break
>> the system. But perhaps this is caused by the same problem (or
>> feature) in the grammar.
>> The LKB does give an error message, the same memory allocation
>> error that comes through itsdb when that breaks (attached in the
>> original email).
>> On Tue, Mar 15, 2016 at 2:19 PM Ann Copestake <aac10 at cl.cam.ac.uk
>> <mailto:aac10 at cl.cam.ac.uk>> wrote:
>> I would say that you should attempt to debug in the LKB. I
>> don't know
>> exactly why [incr tsdb()] crashes while the LKB batch fails more
>> gracefully (does the LKB give an error message?) but you
>> should try and
>> understand what's going on to give you such a huge chart.
>> That's not to
>> say that it wouldn't be a good idea to know what the
>> [incr tsdb()] issue
>> is, but it probably won't help you much ...
>> If you're using the LKB's morphophonology, you might want to
>> look at the
>> token chart as well as the parse chart. This is more recent
>> than the
>> book, so isn't documented, but if you have an expanded menu,
>> I think it
>> shows up under Debug. You want the `print token chart' item,
>> which will
>> output to the emacs window. Similarly, if you're trying to
>> debug what's
>> going on and have an enormous parse chart, don't try and look
>> at the
>> chart in a window, but use the `print chart' option. You
>> would want to
>> reduce the maximum number of items to something a lot smaller
>> than 20k
>> before you try that, though.
>> We should have a FAQ that says `ignore all the GC messages'. They're
>> really just a symptom of the underlying system running out of
>> space -
>> nothing to do with the LKB or [incr tsdb()] as such. So
>> there's not a
>> lot of enlightenment to be gained by understanding terms like
>> tenuring ...
>> On 15/03/2016 19:55, Olga Zamaraeva wrote:
>> > Dear developers!
>> > I am trying to use the LKB and [incr tsdb()] to parse a
>> list of verbs
>> > by a grammar of Chintang [ctn]. The language is
>> polysynthetic, plus
>> > the grammar was created automatically using k-means
>> clustering for the
>> > morphology section, so some of the position classes have
>> lots and lots
>> > of inputs and lots and lots of lexical rule types.
>> > I am running into a problem when [incr tsdb()] crashes
>> because of a
>> > memory allocation error. If I don't use itsdb and just go
>> with LKB
>> > batch parsing, it is more robust as it can catch the error and
>> > continue parsing, having reported a failure on the
>> problematic item,
>> > but the problem is still there and the parses still fail.
>> > I am a fairly inexperienced user of both systems, so right
>> now I am
>> > trying to understand the best way for me to:
>> > 1) debug the grammar with respect to the problem, i.e.
>> what is it
>> > about the grammar exactly that causes the issues;
>> > 2) do something with itsdb so that perhaps this does not
>> happen? Limit
>> > it somehow so that it doesn't try as much?
>> > Currently I am mostly just trying to filter out the problematic
>> > items... I also tried limiting the chart size to 30K, and
>> that seems
>> > to have helped a little, but the crashes still happen on
>> some items.
>> > If I limit the chart size to 20K, then it seems like maybe
>> I can go
>> > through the test suite, but then my coverage suffers when I
>> think it
>> > shouldn't: I think there are items which I can parse with
>> 30K limit
>> > but not 20K... Is this the route I should be going in any
>> case? Just
>> > optimizing for the chart size?.. Maybe 25K is my number :).
>> The chart
>> > is the parse chart, is that correct? I need to understand
>> what exactly
>> > makes the chart so huge in my case; how should I approach
>> > that?..
>> > One specific question: what does "tenuring" mean with
>> respect to
>> > garbage collection? Google doesn't know (nor does the
>> manual, I think).
>> > Does anyone have any comment on any of these issues? The (very
>> > helpful) chapter on errors and debugging in Copestake
>> (2002) book
>> > mostly talks about other types of issues such as type
>> loading problems
>> > etc.. I also looked at what I found in ItsdbTop
>> > (http://moin.delph-in.net/ItsdbTop), and it does mention
>> that on
>> > 32-bit systems memory problems are possible, but I think
>> that note has
>> > to do with treebanking, and it doesn't really tell me much
>> about what
>> > I should try in my case... I also looked through the itsdb manual
>> > (http://www.delph-in.net/itsdb/publications/manual.pdf) --
>> but it
>> > looks like some of the sections, specifically about
>> debugging and
>> > options and parameters, are empty?
>> > Anyway, I would greatly appreciate any advice! I attach a
>> picture of a
>> > running testsuite processing, to give an idea about the
>> memory usage
>> > and the chart size, and of the error. It is possible that
>> the grammar
>> > that I have is just not a usage scenario as far as itsdb is
>> > concerned, but I don't yet have a clear understanding of whether
>> that's the case.
>> > Thanks!
>> > Olga