[developers] [itsdb] Howto train a model on more than one profile

Berthold Crysmann crysmann at dfki.de
Thu Nov 9 15:02:51 CET 2006

On Thu, 2006-10-19 at 17:53 +0200, Stephan Oepen wrote:

> hei!
> > I would like to train and evaluate a parse-selection model on our
> > current German treebank. The treebank is, however, distributed across
> > several profiles. Is there a way to select more than one profile as
> > the source for parameter estimation? Or do I have to combine the
> > profiles into one large profile?
> well, by popular demand, here is my secret about virtual profiles:
> as of late, it is possible to create `virtual profiles', which can then
> serve as the target profile for _some_ [incr tsdb()] operations.
> a virtual profile, like any other profile, is a directory somewhere in
> the [incr tsdb()] profile database `home' directory.  the only file one
> needs to put into a virtual profile directory is one called `virtual'.
> the virtual file, in turn, contains the profile names of sub-profiles,
> e.g.
>   "jh0"
>   "jh1"
>   "jh2"
>   "jh3"
>   "jh4"
>   "jh5"
>   "ps"
>   "tg"
> here, `jh0' et al. must be valid profile names (visible in the podium),
> and the double quotes are mandatory.
> a few restrictions: virtual profiles are read-only and currently do not
> show in the [incr tsdb()] podium.  yet, they can be useful in training
> and evaluating parse selection models.
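
[For the archive: the recipe above boils down to a few shell commands.
The [incr tsdb()] home directory and the virtual profile name
`jhpstg-all' below are made-up placeholders, not names from the
original instructions; substitute your own.]

```shell
# TSDB_HOME is assumed to point at the [incr tsdb()] profile database
# `home' directory; adjust to your installation.
TSDB_HOME="${TSDB_HOME:-$HOME/tsdb/home}"

# a virtual profile is just a directory somewhere under the home
# directory; `jhpstg-all' is a hypothetical name for it.
mkdir -p "$TSDB_HOME/jhpstg-all"

# the only required file is one called `virtual', listing the names of
# the sub-profiles, one per line; the double quotes are mandatory.
cat > "$TSDB_HOME/jhpstg-all/virtual" <<'EOF'
"jh0"
"jh1"
"jh2"
"jh3"
"jh4"
"jh5"
"ps"
"tg"
EOF
```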
> berthold, if you got the latest LOGON build (or CVS), that now includes
> a sub-directory `lingo/redwoods/', which provides the current versions
> of the LOGON ERG treebanks (called JHPSTG).  also, there is a script by
> the name `load' (essentially setting up the environment for a variety
> of experimental tasks) and input files `fc.lisp' (creating the feature
> cache, a one-time operation); `grid.lisp' (executing a large number of
> experiments, with varying feature sets and estimation parameters); 

Hi Stephan, 

it really is a large number of experiments. I started my first
experiment about a week ago, and I am not even into grandparenting yet.
Is there a way to speed things up, e.g. by dropping some of the less
interesting parameter variation? Or is there any support for

Some parameters are not really self-explanatory. Could you provide some
comments on grid.lisp? Which parameters are currently supported in PET?

> and
> finally `train.lisp' (training and serializing a model, using a default
> set of parameters).  you should be able to adapt all of this for your
> Eiche treebank data.  note that, since virtual profiles are read-only,
> you will still need a skeleton for the full data set, as each iteration
> in `grid.lisp' needs to write scores et al.  generally, i would suggest
> to always use the LOGON tree for parse selection experiments.  it also
> includes suitable TADM (and SVM) binaries.
> emily and francis, i hope you might find this useful too :-).
>                                                            best  -  oe
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++