[developers] Characterisation in cheap

Stephan Oepen oe at ifi.uio.no
Wed Aug 19 23:37:29 CEST 2009

> Ok. With that information in hand I managed to hunt down the bug in
> HaG as well. Still I am quite surprised that it worked as expected in
> the LKB.

the LKB still has the procedural magic in place, so no real surprise.
before too long, i hope to extend the LKB to also create token FSs and
unify those into lexical items, and then the magic can go, and one can
debug characterization on equal terms in both platforms.

> BTW: why does the LKB count characters, but PET tokens? And how do I
> change the behaviour in any of these platforms?

in the PET example you emailed, you are using untokenized input.  the
built-in tokenizer in PET does not count characters, i.e. in principle
no characterization information is available.  it appears the machine
is trying to be `smart' and using lattice positions instead (which is
not really helpful, in my view).
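
to make the difference concrete, here is a minimal python sketch (not
code from either platform; the function name and the whitespace
splitting are assumptions for illustration only).  for each token it
shows the character span next to the lattice positions that would be
reported when no character counts are available.

```python
import re

def characterize(text):
    """Return (form, char_from, char_to, start_vertex, end_vertex) per token.

    Tokens are maximal runs of non-whitespace, which is a simplification;
    real tokenizers do more than split on whitespace.
    """
    rows = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        # character offsets come from the raw string,
        # lattice vertices simply count tokens from the left
        rows.append((match.group(), match.start(), match.end(), i, i + 1))
    return rows

for row in characterize("the dog barks"):
    print(row)
```

for `dog' the character span is 4 to 7, while the lattice positions are
1 to 2, so the two ways of counting quickly diverge.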

to get actual characterization in PET, you need to give it tokenized
input, i.e. the output of running REPP, which takes you back to using
an input format that has room for the extra information, e.g. YY 2.0.
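
as a rough illustration of what `room for the extra information' means,
the python sketch below prints one YY-style token per word, with the
character span in angle brackets.  the field layout (id, start, end,
<from:to>, path, form, ipos, lrule) is written down from memory and
should be treated as an assumption; please check it against the PET
input documentation before relying on it.  the whitespace splitting is
only a stand-in for real REPP output, and `yy_tokens' is a hypothetical
helper name.

```python
import re

def yy_tokens(text):
    """Format whitespace-separated tokens as YY-style token strings.

    Assumed layout: (id, start, end, <from:to>, path, "form", ipos, "lrule").
    This is an illustration, not a replacement for REPP.
    """
    tokens = []
    for i, match in enumerate(re.finditer(r"\S+", text)):
        # escape backslashes first, then double quotes, inside the form
        form = match.group().replace("\\", "\\\\").replace('"', '\\"')
        tokens.append(
            '({id}, {start}, {end}, <{cfrom}:{cto}>, 1, "{form}", 0, "null")'.format(
                id=i + 1, start=i, end=i + 1,
                cfrom=match.start(), cto=match.end(), form=form,
            )
        )
    return tokens

print(" ".join(yy_tokens("the dog barks")))
```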

btw, a few months ago, i wrote up some basic documentation on all this,


                                                    good night  -  oe

+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at ifi.uio.no; oe at csli.stanford.edu; stephan at oepen.net ---
