[developers] Training a parse ranking model for Jacy

Lea frermann at coli.uni-saarland.de
Fri Jan 13 09:07:51 CET 2012


Hi,
yes, I used thinned profiles for training, indeed.
Thanks a lot for the quick help!
Lea.
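
For anyone hitting the same symptom: here is a minimal sketch (illustrative Python, not actual [incr tsdb()] code; all names are made up) of why a thinned profile yields exactly one event per item. Discriminative parse ranking needs the competing candidate parses for each item; if `result' keeps only the one selected parse, there is nothing to contrast the gold parse against.

```python
# Hypothetical sketch: one training event is generated per stored parse.
# A thinned profile keeps only the selected parse in `result', so every
# item contributes exactly 1 event and the learner sees no contrasts.

def cache_features(item_results):
    """Return (item_id, n_events) pairs, one event per stored parse."""
    return [(item_id, len(parses)) for item_id, parses in item_results.items()]

# Full profile: all candidate parses are kept per item.
full = {30009000: ["gold", "alt1", "alt2"], 30009001: ["gold", "alt1"]}
# Thinned profile: only the selected (gold) parse survives.
thinned = {30009000: ["gold"], 30009001: ["gold"]}

print(cache_features(full))     # several events per item: usable contrasts
print(cache_features(thinned))  # 1 event per item: no discriminative signal
```

This matches the log output below, where every Japanese item reports "1 event" while the English items report varying counts.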



On Friday, 13 January 2012, 01:46 AM, Dan Flickinger wrote:
> Hi Lea -
>
> Could it be that you are accidentally using a thinned version of the gold profile for Japanese, where only the one selected parse for each item is stored in `result'?  Just a thought.
>
>   Dan
>
> ----- Original Message -----
> From: "Lea"<frermann at coli.uni-saarland.de>
> To: developers at delph-in.net
> Sent: Wednesday, January 11, 2012 9:12:32 PM
> Subject: [developers] Training a parse ranking model for Jacy
>
> Hello,
>
> I am currently training two kinds of parse ranking models for Jacy and
> the ERG:
> (a) on gold annotated profiles of the (Japanese-English, parallel)
> Tanaka corpus
> (b) on Tanaka profiles which were treebanked automatically (using MRS
> alignment)
>
> In both cases the results for Japanese are worse than expected. I train
> the same models on the same data in English for the ERG, and here
> everything seems to work fine. I use the 'load' script and the
> 'train.lisp' script, which do both feature-caching and context-caching.
>
> setting (a)
> Training on the gold-annotated Tanaka profile 006, only one event per
> sentence is extracted for Japanese during feature caching, while for
> English the number looks reasonable. The model returned for Japanese is
> tiny compared to the English one, and performs very poorly, as expected.
>
> Japanese:
> [11:44:48] operate-on-profiles(): running `pet' [30009000 - 30009200|.
> [11:44:48] open-fc(): new BDB `fc.bdb'.
> [11:44:48] cache-features(): item # 30009000: 1 event;
> [11:44:48] cache-features(): item # 30009004: 1 event;
> [11:44:48] cache-features(): item # 30009006: 1 event;
> [11:44:48] cache-features(): item # 30009007: 1 event;
> [11:44:48] cache-features(): item # 30009008: 1 event;
> ...
> Events in  = /tmp/.model.lfrermann.19628.events
> Params out = /tmp/.model.lfrermann.19628.weights
> Marginal   = pseudo-likelihood
> Smoothing  = none
> Procs      = 1
> Classes    = 749
> Contexts   = 715
> Features   = 7 / 7
> Non-zeros  = 2197
>
>
> English:
> [11:45:59] operate-on-profiles(): running `pet' [30009000 - 30009200|.
> [11:46:00] open-fc(): new BDB `fc.bdb'.
> [11:46:00] cache-features(): item # 30009000: 11 events;
> [11:46:00] cache-features(): item # 30009001: 6 events;
> [11:46:01] cache-features(): item # 30009002: 11 events;
> [11:46:01] cache-features(): item # 30009003: 11 events;
> [11:46:01] cache-features(): item # 30009004: 2 events;
> ...
> Events in  = /tmp/.model.lfrermann.19803.events
> Params out = /tmp/.model.lfrermann.19803.weights
> Marginal   = pseudo-likelihood
> Smoothing  = none
> Procs      = 1
> Classes    = 8739
> Contexts   = 1147
> Features   = 6595 / 8650
> Non-zeros  = 667364
>
>
> setting (b)
> When I train ranking models on an automatically treebanked profile, a
> reasonable number of parses is extracted for both languages (looking
> similar to the English output above), and the model sizes are comparable:
>
> Japanese:
> Events in  = /tmp/.model.lfrermann.20201.events
> Params out = /tmp/.model.lfrermann.20201.weights
> Marginal   = pseudo-likelihood
> Smoothing  = none
> Procs      = 1
> Classes    = 4620
> Contexts   = 495
> Features   = 3432 / 4314
> Non-zeros  = 421357
>
> English:
> Events in  = /tmp/.model.lfrermann.20140.events
> Params out = /tmp/.model.lfrermann.20140.weights
> Marginal   = pseudo-likelihood
> Smoothing  = none
> Procs      = 1
> Classes    = 5067
> Contexts   = 617
> Features   = 4177 / 5234
> Non-zeros  = 342117
>
> When I parse a test profile for English and Japanese using the
> respective model, and compare the resulting ranks to the gold
> annotations, I get 50% accuracy for English but only 39% for Japanese.
> The difference might partly reflect language-specific differences in
> training and evaluation, but it still seems too big to me.
>
> I'd be very grateful for any suggested solutions, especially given the
> approaching ACL deadline (15 January).
> Thank you very much for your help in advance!
> Lea.
>
>
>
