[developers] [incr tsdb()]/LKB memory allocation error

Olga Zamaraeva olzama at uw.edu
Thu Mar 24 22:59:37 CET 2016


(Forgot the attachment; not that it is very useful.)
[Attachment: Screen Shot 2016-03-24 at 2.46.38 PM.png]

On Thu, Mar 24, 2016 at 2:55 PM Olga Zamaraeva <olzama at uw.edu> wrote:

> Looking at the token chart, I see that I do in fact have many lexical rules
> for the same orthography, and that results in too many parses (a snippet of
> the output is attached). For example, for this "a-tis-e" input that I am
> trying, there are dozens of lexical rules for a- and dozens for -a, so the
> combinations are too many. The reason is that the rules were inferred
> automatically by a clustering algorithm, and I asked for many clusters
> (there is a reason for asking for many clusters, too: I am trying to
> compare these results with another algorithm which happened to infer many
> position classes, so I want my clustering to come up with the same number
> of position classes and then compare the two grammars).
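>
> (To make the blow-up concrete: with, say, 30 rules that can apply at the
> prefix slot and 30 at the suffix slot, a single stem already admits
> 30 x 30 = 900 candidate analyses before the syntax combines anything; the
> numbers here are made up, but the multiplication is the point.)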
>
> Is there a way to have the LKB stop after it finds a parse, rather than
> trying other possibilities? I tried doing that in [incr tsdb()] (by turning
> off exhaustive search and limiting the maximum number of analyses), but it
> still cannot handle this large grammar for some reason...
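>
> (Would something like the following in the grammar's Lisp globals be the
> right way to do this in the LKB? A sketch only: *first-only-p* and
> *maximum-number-of-edges* are the globals as I understand them from the
> documentation, so please correct me if these are the wrong knobs.)
>
> (in-package :lkb)
> (setf *first-only-p* t)                  ; stop after the first parse?
> (setf *maximum-number-of-edges* 20000)   ; cap the parse chart size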
>
> Thank you!
> Olga
>
> On Thu, Mar 17, 2016 at 1:34 AM Ann Copestake <aac10 at cl.cam.ac.uk> wrote:
>
>> So the process is running out of memory before hitting the limit on the
>> number of chart edges, which stops processing a little more gracefully.
>> The LKB batch parse process catches some errors in a way that allows the
>> rest of the batch to continue.   It may be that all that's happening is
>> that the chart edge limit was set too high relative to the available
>> memory, although it is possible that memory is being used in a way that
>> isn't reflected by the edge limit, which is why I suggested also looking at
>> the token chart.   You could increase the amount of memory available to the
>> process and see whether you can get your test set through, but unless
>> that's the final test set and you don't intend to work on any more complex
>> examples than the ones you have, that's only going to be a temporary
>> measure.
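>>
>> (If you do try that temporary route and the LKB is running on Allegro CL,
>> something along these lines should grow the Lisp heap -- an illustration
>> only: the sizes are arbitrary, and you should check the ACL documentation
>> for the exact behaviour of sys:resize-areas before relying on it.)
>>
>> ;; grow newspace and oldspace before starting the batch run
>> (sys:resize-areas :new (* 256 1024 1024)   ; ~256MB of newspace
>>                   :old (* 512 1024 1024))  ; ~512MB of oldspace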
>>
>> I don't think it will matter whether you look at examples that can
>> eventually be parsed, something that fails after a huge number of edges or
>> something that causes the memory crash - your task is to find out whether
>> there is something you can do to cut down the number of rule applications.
>> The good news is that you won't need to find many cases of over-application
>> to make a dramatic improvement. I think you will see the issues with the
>> grammar when you look at a chart, even with a small edge limit.
>>
>>
>> Ann
>>
>>
>> On 17/03/2016 01:55, Olga Zamaraeva wrote:
>>
>> Thank you Ann!
>>
>> I suppose I should try to pin down an input that can be successfully
>> parsed, but does produce a huge chart. Of course my most pressing problem
>> is not that some inputs are parsed with huge charts but that some inputs
>> can never be parsed and break the system. But perhaps this is caused by the
>> same problem (or feature) in the grammar.
>>
>> The LKB does give an error message: the same memory allocation error that
>> comes through [incr tsdb()] when it crashes (attached in the original email).
>>
>> Olga
>> On Tue, Mar 15, 2016 at 2:19 PM Ann Copestake <aac10 at cl.cam.ac.uk> wrote:
>>
>>> I would say that you should attempt to debug in the LKB.  I don't know
>>> exactly why [incr tsdb()] crashes while the LKB batch fails more
>>> gracefully (does the LKB give an error message?), but you should try and
>>> understand what's going on to give you such a huge chart. That's not to
>>> say that it wouldn't be a good idea to know what the [incr tsdb()] issue
>>> is, but it probably won't help you much ...
>>>
>>> If you're using the LKB's morphophonology, you might want to look at the
>>> token chart as well as the parse chart.  This is more recent than the
>>> book, so isn't documented, but if you have an expanded menu, I think it
>>> shows up under Debug.  You want the `print token chart' item, which will
>>> output to the emacs window.  Similarly, if you're trying to debug what's
>>> going on and have an enormous parse chart, don't try and look at the
>>> chart in a window, but use the `print chart' option.  You would want to
>>> reduce the maximum number of items to something a lot smaller than 20k
>>> before you try that, though.
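>>>
>>> (Something like the following before using `print chart' -- a sketch:
>>> *maximum-number-of-edges* is the global I mean, and 2000 is an arbitrary
>>> choice of `a lot smaller than 20k'.)
>>>
>>> (setf lkb::*maximum-number-of-edges* 2000)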
>>>
>>> We should have a FAQ that says `ignore all the GC messages'.  It's
>>> really just a symptom of the underlying Lisp system running out of space -
>>> nothing to do with the LKB or [incr tsdb()] as such.  So there's not a
>>> lot of enlightenment to be gained by understanding terms like tenuring
>>> (which is just the garbage collector promoting objects that survive
>>> collections into an older generation) ...
>>>
>>> Best,
>>>
>>> Ann
>>>
>>> On 15/03/2016 19:55, Olga Zamaraeva wrote:
>>> > Dear developers!
>>> >
>>> > I am trying to use the LKB and [incr tsdb()] to parse a list of verbs
>>> > by a grammar of Chintang [ctn]. The language is polysynthetic, plus
>>> > the grammar was created automatically using k-means clustering for the
>>> > morphology section, so some of the position classes have lots and lots
>>> > of inputs and lots and lots of lexical rule types and instances.
>>> >
>>> > I am running into a problem where [incr tsdb()] crashes because of a
>>> > memory allocation error. If I don't use itsdb and just go with LKB
>>> > batch parsing, things are more robust: the LKB can catch the error and
>>> > continue parsing, having reported a failure on the problematic item,
>>> > but the problem is still there and the parses still fail.
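>>> >
>>> > (For reference, the batch run is invoked with something like the
>>> > following -- a sketch: parse-sentences is the function as I understand
>>> > it, and the file names here are made up.)
>>> >
>>> > (in-package :lkb)
>>> > ;; one test item per line in the input file; results and failure
>>> > ;; reports are written to the output file
>>> > (parse-sentences "chintang-verbs.txt" "chintang-verbs.out")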
>>> >
>>> > I am a fairly inexperienced user of both systems, so right now I am
>>> > trying to understand the best way for me to:
>>> >
>>> > 1) debug the grammar with respect to the problem, i.e. figure out what
>>> > exactly it is about the grammar that causes the issues;
>>> > 2) do something with itsdb so that perhaps this does not happen? Limit
>>> > it somehow so that it doesn't try as much?
>>> >
>>> > Currently I am mostly just trying to filter out the problematic
>>> > items... I also tried limiting the chart size to 30K, and that seems
>>> > to have helped a little, but the crashes still happen on some items.
>>> > If I limit the chart size to 20K, then it seems like maybe I can get
>>> > through the test suite, but then my coverage suffers when I think it
>>> > shouldn't: I think there are items which I can parse with the 30K limit
>>> > but not with 20K... Is this the route I should be going down in any
>>> > case? Just optimizing the chart size?.. Maybe 25K is my number :). The
>>> > chart in question is the parse chart, is that correct? I need to
>>> > understand what exactly makes the chart so huge in my case; how should
>>> > I approach debugging that?..
>>> >
>>> > One specific question: what does "tenuring" mean with respect to
>>> > garbage collection? Google doesn't know (nor does the manual, I think).
>>> >
>>> > Does anyone have any comment on any of these issues? The (very
>>> > helpful) chapter on errors and debugging in the Copestake (2002) book
>>> > mostly talks about other types of issues, such as type loading
>>> > problems. I also looked at ItsdbTop (http://moin.delph-in.net/ItsdbTop),
>>> > which does mention that memory problems are possible on 32-bit systems,
>>> > but I think that note has to do with treebanking, and it doesn't really
>>> > tell me much about what I should try in my case... I also looked
>>> > through the itsdb manual
>>> > (http://www.delph-in.net/itsdb/publications/manual.pdf) -- but it looks
>>> > like some of the sections, specifically those about debugging and about
>>> > options and parameters, are empty?
>>> >
>>> > Anyway, I would greatly appreciate any advice! I attach a picture of
>>> > test suite processing in progress, to give an idea of the memory usage
>>> > and the chart size, and a picture of the error. It is possible that the
>>> > grammar I have is simply not an intended usage scenario as far as itsdb
>>> > is concerned, but I don't yet have a clear understanding of whether
>>> > that's the case.
>>> >
>>> > Thanks!
>>> > Olga
>>>
>>>
>>
[Attachment: Screen Shot 2016-03-24 at 2.46.38 PM.png (image/png, 363688 bytes)
<http://lists.delph-in.net/archives/developers/attachments/20160324/014a81ab/attachment-0001.png>]

