[pet] xml_counts mode

Wed Feb 7 05:47:28 CET 2007

Hi all,

I have been playing around with the xml_counts input mode (-tok=xml_counts)
and lexical type predictions, and have run into a slight problem with lexical
rules (esp. inflectional rules). What I want to be able to do is stipulate one
or more lexical types per token and have the full lexical rule machinery kick
in, conditioned on those lexical types, e.g. something like the following:

<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
 SYSTEM "/usr/share/lkb/src/preprocess/maf/pic.dtd">
<pet-input-chart>
<!-- The chance to laugh comes about -->
  <w id="W1" cstart="1" cend="4">
    <surface>The</surface>
  </w>
  <w id="W2" cstart="6" cend="11">
    <surface>chance</surface>
    <typeinfo id="W2S1" baseform="no" prio="1.0">
    <stem>$generic_n_vp_c_le</stem>
    </typeinfo>
  </w>
  <w id="W3" cstart="13" cend="14">
    <surface>to</surface>
  </w>
  <w id="W4" cstart="16" cend="20">
    <surface>laugh</surface>
  </w>
  <w id="W5" cstart="22" cend="25">
    <surface>comes</surface>
    <typeinfo id="W5S1" baseform="no" prio="1.0">
    <stem>$generic_v_p_le</stem>
    </typeinfo>
  </w>
  <w id="W6" cstart="27" cend="31">
    <surface>about</surface>
    <typeinfo id="W6S1" baseform="no" prio="1.0">
    <stem>$generic_p_np_ptcl_le</stem>
    </typeinfo>
  </w>
</pet-input-chart>

What I find I need to do in practice is the following (note the second
typeinfo on W5, which names the inflectional rule explicitly):

<?xml version="1.0" encoding="utf-8" standalone="no" ?>
<!DOCTYPE pet-input-chart
 SYSTEM "/usr/share/lkb/src/preprocess/maf/pic.dtd">
<pet-input-chart>
<!-- The chance to laugh comes about -->
  <w id="W1" cstart="1" cend="4">
    <surface>The</surface>
  </w>
  <w id="W2" cstart="6" cend="11">
    <surface>chance</surface>
    <typeinfo id="W2S1" baseform="no" prio="1.0">
    <stem>$generic_n_vp_c_le</stem>
    </typeinfo>
  </w>
  <w id="W3" cstart="13" cend="14">
    <surface>to</surface>
  </w>
  <w id="W4" cstart="16" cend="20">
    <surface>laugh</surface>
  </w>
  <w id="W5" cstart="22" cend="25">
    <surface>comes</surface>
    <typeinfo id="W5S1" baseform="no" prio="1.0">
    <stem>$generic_v_p_le</stem>
    </typeinfo>
    <typeinfo id="W5S2" baseform="no" prio="1.0">
    <stem>$generic_v_p_le</stem>
    <infl name="$third_sg_fin_verb_orule"/>
    </typeinfo>
  </w>
  <w id="W6" cstart="27" cend="31">
    <surface>about</surface>
    <typeinfo id="W6S1" baseform="no" prio="1.0">
    <stem>$generic_p_np_ptcl_le</stem>
    </typeinfo>
  </w>
</pet-input-chart>

i.e. I have to either add in all the possible lexical rules that can apply to
that lexical type, or try to disambiguate up front which lexical rules apply
to each token (which is exactly what I want to rely on the grammar to do for
me). Have I perhaps misunderstood the XML input formalism, or is there some
magic trick on the PET side of things that I need? For the record, this is the
invocation of PET I am using:

# cat input.xml | cheap -tok=xml_counts -packing /nlptools/erg/20060905-supertagging/english.grm

The version of PET I am using is 0.99.13, and the version of the ERG is
Jul-06, with some naive playing around with gen-lex.tdl to be able to specify
terminal lexical types.
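
For concreteness, the workaround above (one extra typeinfo per candidate
lexical rule) could be scripted along the following lines. This is only a
rough sketch in Python, not anything PET provides: the CANDIDATE_RULES table
and the expand_typeinfos function are hypothetical names of my own, the table
lists only the one rule that appears in this message, and the new ids simply
continue the W5S1, W5S2 numbering used above.

```python
import copy
import xml.etree.ElementTree as ET

# ASSUMPTION: a hand-maintained table from lexical type to the inflectional
# rules that might apply to it. Only the rule shown in this message is listed;
# a real table would have to be derived from the grammar.
CANDIDATE_RULES = {
    "$generic_v_p_le": ["$third_sg_fin_verb_orule"],
}


def expand_typeinfos(chart, rules=CANDIDATE_RULES):
    """Add one extra <typeinfo> per candidate rule, keeping the bare one."""
    for w in list(chart.iter("w")):
        originals = w.findall("typeinfo")
        count = len(originals)
        for ti in originals:
            stem = (ti.findtext("stem") or "").strip()
            for rule in rules.get(stem, []):
                count += 1
                # Copy the existing entry and attach the rule as <infl>.
                alt = copy.deepcopy(ti)
                alt.set("id", f"{w.get('id')}S{count}")
                ET.SubElement(alt, "infl", {"name": rule})
                w.append(alt)
    return chart


# Cut-down version of the input chart above (token W5 only).
SAMPLE = """<pet-input-chart>
  <w id="W5" cstart="22" cend="25">
    <surface>comes</surface>
    <typeinfo id="W5S1" baseform="no" prio="1.0">
    <stem>$generic_v_p_le</stem>
    </typeinfo>
  </w>
</pet-input-chart>"""

if __name__ == "__main__":
    chart = expand_typeinfos(ET.fromstring(SAMPLE))
    print(ET.tostring(chart, encoding="unicode"))
```

This only mechanises the enumeration; it does nothing to solve the underlying
problem of having to know in advance which rules are candidates for each
lexical type.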

Tim