[developers] What is the deal with the skeleton virtual profiles?

Mon Mar 16 14:19:24 CET 2009

hi bill (and francis),

> When you create a new virtual profile in the logon system, you also
> have to create a corresponding "skeleton" in an LKB directory.  What
> is the deal with this skeleton?  What does it do, why is it
> necessary?  What happens if I leave it out?  How do you create one?

i suspect you are asking in the context of parse selection?  a virtual
profile is just a way of putting a read-only `skin' around a collection
of real profiles.  we keep profiles to between 500 and 2000 items each,
just to limit the size of each individual database (especially once the
tsdb(1) database engine uncompresses and reads the relations).  virtual
profiles, currently, are not used beyond parse selection experiments,
they will always be read-only, and to date are only supported in parts
of the [incr tsdb()] code base.

once you install the Redwoods add-on, `lingo/redwoods/tsdb/home/jhpstg'
is a virtual profile.  its `virtual' file contains:

  "jh0"
  "jh1"
  "jh2"
  "jh3"
  "jh4"
  "jh5"
  "ps"
  "tg"

this is what (in the default setup) provides the treebank input to the
feature caching, grid parameter search, and model training.  of these,
the (optional) grid search step is the only process that will generate
new profiles, each recording the output of one n-fold cross validation
experiment.  these output profiles (with the long, funny-looking names)
are minimal in size, in that they do not copy the original treebank but
instead only contain the item information, plus the `fold' and `score'
relations (the results of n-fold cross validation).  to create these
output profiles, [incr tsdb()] requires a skeleton.  for JHPSTG, this is
the directory `lingo/lkb/src/tsdb/skeletons/english/logon/jhpstg', and
this is where the items used for each experiment originate.  note that
it is legitimate to have more items in the skeleton than are present in
the actual treebank.  e.g. one could trim down the virtual profile to a
sub-set of the JHPSTG segments and still use the full JHPSTG skeleton.
as long as the feature cache reflects the actual treebank data, then in
practice only items attested in the treebank will be considered for the
grid search (or training) steps.
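
to make the relation between the `virtual' file, the treebank profiles,
and the skeleton a little more concrete, here is a minimal sketch of two
sanity checks.  the function names are made up for illustration and are
not part of [incr tsdb()].  the sketch assumes that each entry in the
`virtual' file is one double-quoted name per line (as shown above), that
`item' files are uncompressed text with `@'-separated fields, and that
the item identifier is the first field.

```shell
#!/bin/sh

# hypothetical helper: list the member profiles named in a `virtual' file,
# one per line, stripping surrounding whitespace and double quotes and
# dropping empty lines.
virtual_members() {
  sed -e 's/^[[:space:]]*"//' -e 's/"[[:space:]]*$//' -e '/^[[:space:]]*$/d' "$1"
}

# hypothetical helper: print item identifiers that occur in the given
# profile `item' files but not in the skeleton `item' file (first argument).
# assumes `@'-separated fields with the identifier in the first field.
missing_from_skeleton() {
  skeleton=$1; shift
  tmp=$(mktemp) || return 1
  cut -d@ -f1 "$skeleton" | sort -u > "$tmp"
  cut -d@ -f1 "$@" | sort -u | comm -23 - "$tmp"
  rm -f "$tmp"
}
```

empty output from `missing_from_skeleton' would suggest that the skeleton
covers all items of the treebank; extra items in the skeleton are not
reported, in line with the remark above that a larger skeleton is fine.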

as to creating the skeleton corresponding to a virtual profile, JHPSTG
for example: assuming the individual profiles already use disjoint item
identifiers (which, incidentally, is an assumption made throughout the
MaxEnt experimentation environment), it will suffice to concatenate the
individual item files.  in other words, i believe i did the following:

  cd $LOGONROOT/lingo/lkb/src/tsdb/skeletons/english/logon
  mkdir jhpstg
  # concatenate the `item' and `item-set' relations of all segments
  cat jh{0,1,2,3,4,5}/item {ps,tg}/item > jhpstg/item
  cat jh{0,1,2,3,4,5}/item-set {ps,tg}/item-set > jhpstg/item-set
  # copy the database schema
  cp ../Relations jhpstg/relations
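
since the concatenation relies on disjoint item identifiers, it may be
worth checking that assumption first.  the following is only a sketch:
the function name is hypothetical, and it assumes that `item' files are
uncompressed text with `@'-separated fields and the item identifier in
the first field.

```shell
#!/bin/sh

# hypothetical helper: print every item identifier that occurs more than
# once across the given `item' files (first `@'-separated field).
duplicate_item_ids() {
  cut -d@ -f1 "$@" | sort | uniq -d
}
```

for example, running `duplicate_item_ids jh{0,1,2,3,4,5}/item {ps,tg}/item'
in the skeleton directory should print nothing if the identifiers are
indeed disjoint; any identifier it prints is shared by two or more items.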

the database schema (`relations') and the `item' relation are mandatory
in all skeletons.  some skeletons provide additional input information,
e.g. grouping items into item sets, phenomenon annotations, or various
types of information on expected outputs.  but, for your purposes, the
`item' relation alone will probably be sufficient.  note that to make the
skeleton visible to [incr tsdb()], you also need to edit `Index.lisp'.

> I remember when I was working with Francis on the Japanese setup, the
> skeleton and the virtual profile directories were softlinked
> together.  Is this sufficient?

i do not quite see how that could work.  and generally there should not
be a need for a virtual profile directory, i guess; see comments above.

> PS.  Is there a logon-specific mailing list?  Stephan suggested I use
> that instead of developers, but I don't have a contact email for it.

i was thinking of the list `logon at emmtee.net', see:

  http://lists.emmtee.net/mailman/listinfo/logon

my inclination is to only support MaxEnt experimentation with the LOGON
tree (where various third-party components are available), thus i would
suggest we keep detailed discussions of the experimentation environment
on that list.  but we have long struggled with defining the interfaces
between various lists, hence for each new thread the author should feel
free to decide which DELPH-IN list (or set of lists even) works best.

                                                    best wishes  -  oe

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++