[developers] [itsdb] Howto train a model on more than one profile

Thu Oct 19 17:53:47 CEST 2006

hi!

> I would like to train and evaluate our current German treebank. The
> treebank is, however, distributed across different profiles. Is there
> a way to select more than one profile as the source for parameter
> estimation? Or do I have to combine the profiles into one large
> profile?

well, by popular request, here is my secret about virtual profiles:
as of late, it is possible to create `virtual profiles', which can then
serve as the target profile for _some_ [incr tsdb()] operations.

a virtual profile, like any other profile, is a directory somewhere in
the [incr tsdb()] profile database `home' directory.  the only file one
needs to put into a virtual profile directory is one called `virtual'.
the `virtual' file, in turn, lists the names of the sub-profiles that
make up the virtual profile, e.g.

  "jh0"
  "jh1"
  "jh2"
  "jh3"
  "jh4"
  "jh5"
  "ps"
  "tg"

here, `jh0' et al. must be valid profile names (visible in the podium),
and the double quotes are mandatory.
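
for illustration, here is a minimal python sketch of how such a
`virtual' file could be generated.  it is not part of [incr tsdb()];
the function name `write_virtual_profile' and the check that each
sub-profile exists as a directory below the profile home are
assumptions made for this example only (the real criterion is that the
name is a valid profile visible in the podium).

```python
from pathlib import Path


def write_virtual_profile(home, name, sub_profiles):
    """Create <home>/<name>/virtual, listing the given sub-profiles.

    `home` is the [incr tsdb()] profile database home directory and
    `name` the (hypothetical) name of the new virtual profile.
    """
    home = Path(home)
    # assumption: treat a sub-profile as valid if its directory exists
    missing = [p for p in sub_profiles if not (home / p).is_dir()]
    if missing:
        raise ValueError(f"unknown sub-profiles: {missing}")
    target = home / name
    target.mkdir(parents=True, exist_ok=True)
    # one profile name per line; the double quotes are mandatory
    lines = [f'"{p}"' for p in sub_profiles]
    virtual = target / "virtual"
    virtual.write_text("\n".join(lines) + "\n")
    return virtual
```

a call along the lines of `write_virtual_profile(home, "jhpstg",
["jh0", "jh1", "jh2", "jh3", "jh4", "jh5", "ps", "tg"])' would then
produce the file shown above; the profile name `jhpstg' and the value
of `home' are placeholders.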

a few restrictions: virtual profiles are read-only and currently do not
show up in the [incr tsdb()] podium.  still, they can be useful for
training and evaluating parse selection models.

berthold, the latest LOGON build (or CVS) now includes a sub-directory
`lingo/redwoods/', which provides the current versions of the LOGON ERG
treebanks (called JHPSTG).  there is also a script named `load'
(essentially setting up the environment for a variety of experimental
tasks), along with input files `fc.lisp' (creating the feature cache, a
one-time operation); `grid.lisp' (executing a large number of
experiments, with varying feature sets and estimation parameters); and
finally `train.lisp' (training and serializing a model, using a default
set of parameters).  you should be able to adapt all of this for your
Eiche treebank data.  note that, since virtual profiles are read-only,
you will still need a skeleton for the full data set, as each iteration
in `grid.lisp' needs to write scores et al.  generally, i would suggest
always using the LOGON tree for parse selection experiments; it also
includes suitable TADM (and SVM) binaries.

emily and francis, i hope you might find this useful too :-).

                                                           best  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++