[developers] [itsdb] Howto train a model on more than one profile
Stephan Oepen
oe at csli.Stanford.EDU
Thu Oct 19 17:53:47 CEST 2006
hei!
> I would like to train and evaluate our current German treebank. The
> treebank is, however, distributed across different profiles. Is there
> a way to select more than one profile as the source for parameter
> estimation? Or do I have to combine the profiles into one large
> profile?
well, by popular request, here is my secret about virtual profiles:
as of late, it is possible to create `virtual profiles', which can then
serve as the target profile for _some_ [incr tsdb()] operations.
a virtual profile, like any other profile, is a directory somewhere in
the [incr tsdb()] profile database `home' directory. the only file one
needs to put into a virtual profile directory is one called `virtual'.
the virtual file, in turn, contains the profile names of sub-profiles,
e.g.
  "jh0"
  "jh1"
  "jh2"
  "jh3"
  "jh4"
  "jh5"
  "ps"
  "tg"
here, `jh0' et al. must be valid profile names (visible in the podium),
and the double quotes are mandatory.
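
as a rough illustration of the layout just described, here is a minimal
python sketch that creates a virtual profile directory and writes its
`virtual' file, one double-quoted sub-profile name per line. it is not
part of [incr tsdb()]; the function name `write_virtual_profile', the
`check' flag, and the directory paths in the usage note are all
hypothetical, and the existence check simply assumes that each
sub-profile is a directory below the same `home' directory.

```python
from pathlib import Path


def write_virtual_profile(home, name, sub_profiles, check=True):
    """Create directory `name` under `home` and write its `virtual` file.

    Hypothetical helper, not an [incr tsdb()] API. Each sub-profile name
    is written on its own line, wrapped in double quotes.
    """
    home = Path(home)
    if check:
        # Assumption: sub-profiles live as directories below `home`.
        missing = [p for p in sub_profiles if not (home / p).is_dir()]
        if missing:
            raise FileNotFoundError(f"no such profile(s) under {home}: {missing}")
    target = home / name
    target.mkdir(parents=True, exist_ok=True)
    virtual = target / "virtual"
    # The double quotes around each profile name are mandatory.
    virtual.write_text("".join(f'"{p}"\n' for p in sub_profiles))
    return virtual
```

for example, `write_virtual_profile("/path/to/tsdb/home", "jhpstg",
["jh0", "jh1", "ps", "tg"])' would write the file
`/path/to/tsdb/home/jhpstg/virtual' (both the path and the profile name
`jhpstg' are placeholders).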
a few restrictions: virtual profiles are read-only and currently do not
show in the [incr tsdb()] podium. yet, they can be useful in training
and evaluating parse selection models.
berthold, if you have the latest LOGON build (or CVS), that now includes
a sub-directory `lingo/redwoods/', which provides the current versions
of the LOGON ERG treebanks (called JHPSTG). also, there is a script by
the name `load' (essentially setting up the environment for a variety
of experimental tasks) and input files `fc.lisp' (creating the feature
cache, a one-time operation); `grid.lisp' (executing a large number of
experiments, with varying feature sets and estimation parameters); and
finally `train.lisp' (training and serializing a model, using a default
set of parameters). you should be able to adapt all of this for your
Eiche treebank data. note that, since virtual profiles are read-only,
you will still need a skeleton for the full data set, as each iteration
in `grid.lisp' needs to write scores et al., which it cannot do in the
virtual profile itself. generally, i would suggest always using the
LOGON tree for parse selection experiments; it also includes suitable
TADM (and SVM) binaries.
emily and francis, i hope you might find this useful too :-).
best - oe
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++ --- oe at csli.stanford.edu; oe at ifi.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++