[developers] xml_counts mode

Thu Feb 8 02:00:55 CET 2007

Hi all,

I originally sent this to the PET developers' list, but it doesn't seem to be
particularly active, so I have reposted it here. I hope this isn't too much of
a protocol gaffe.

> I have been playing around with -tok=xml_counts mode and lexical type predictions,
> and have run into a slight problem with lexical rules (esp. inflectional
> rules). What I want to be able to do is stipulate one or more lexical types per
> token and have the full lexical rule machinery kick in, conditioned on those
> lexical types, e.g. something like the following:
> 
> <?xml version="1.0" encoding="utf-8" standalone="no" ?>
> <!DOCTYPE pet-input-chart
>  SYSTEM "/usr/share/lkb/src/preprocess/maf/pic.dtd">
> <pet-input-chart>
> <!-- The chance to laugh comes about -->
>   <w id="W1" cstart="1" cend="4">
>     <surface>The</surface>
>   </w>
>   <w id="W2" cstart="6" cend="11">
>     <surface>chance</surface>
>     <typeinfo id="W2S1" baseform="no" prio="1.0">
>       <stem>$generic_n_vp_c_le</stem>
>     </typeinfo>
>   </w>
>   <w id="W3" cstart="13" cend="14">
>     <surface>to</surface>
>   </w>
>   <w id="W4" cstart="16" cend="20">
>     <surface>laugh</surface>
>   </w>
>   <w id="W5" cstart="22" cend="25">
>     <surface>comes</surface>
>     <typeinfo id="W5S1" baseform="no" prio="1.0">
>       <stem>$generic_v_p_le</stem>
>     </typeinfo>
>   </w>
>   <w id="W6" cstart="27" cend="31">
>     <surface>about</surface>
>     <typeinfo id="W6S1" baseform="no" prio="1.0">
>       <stem>$generic_p_np_ptcl_le</stem>
>     </typeinfo>
>   </w>
> </pet-input-chart>
> 
> What I find I need to do in practice is the following, where "comes" gets a
> second typeinfo (W5S2) that names the inflectional rule explicitly:
> 
> <?xml version="1.0" encoding="utf-8" standalone="no" ?>
> <!DOCTYPE pet-input-chart
>  SYSTEM "/usr/share/lkb/src/preprocess/maf/pic.dtd">
> <pet-input-chart>
> <!-- The chance to laugh comes about -->
>   <w id="W1" cstart="1" cend="4">
>     <surface>The</surface>
>   </w>
>   <w id="W2" cstart="6" cend="11">
>     <surface>chance</surface>
>     <typeinfo id="W2S1" baseform="no" prio="1.0">
>       <stem>$generic_n_vp_c_le</stem>
>     </typeinfo>
>   </w>
>   <w id="W3" cstart="13" cend="14">
>     <surface>to</surface>
>   </w>
>   <w id="W4" cstart="16" cend="20">
>     <surface>laugh</surface>
>   </w>
>   <w id="W5" cstart="22" cend="25">
>     <surface>comes</surface>
>     <typeinfo id="W5S1" baseform="no" prio="1.0">
>       <stem>$generic_v_p_le</stem>
>     </typeinfo>
>     <typeinfo id="W5S2" baseform="no" prio="1.0">
>       <stem>$generic_v_p_le</stem>
>       <infl name="$third_sg_fin_verb_orule"/>
>     </typeinfo>
>   </w>
>   <w id="W6" cstart="27" cend="31">
>     <surface>about</surface>
>     <typeinfo id="W6S1" baseform="no" prio="1.0">
>       <stem>$generic_p_np_ptcl_le</stem>
>     </typeinfo>
>   </w>
> </pet-input-chart>
> 
> 
> i.e. either add a separate typeinfo for every lexical rule that can apply to
> that lexical type, or work out for myself which lexical rules apply to each
> token (which I would rather rely on the grammar to do for me). Have I perhaps
> misunderstood the XML input format, or is there some magic trick on the PET
> side of things that I need? For the record, this is the invocation of PET I am
> using:
> 
> # cat input.xml | cheap -tok=xml_counts -packing /nlptools/erg/20060905-supertagging/english.grm
> 
> The version of PET I am using is 0.99.13, and the version of the ERG is
> Jul-06, with some naive playing around with gen-lex.tdl to be able to specify
> terminal lexical types.

Adding to this, I can see how I could get around the problem by extracting the
lexical rule information from the original gold-standard parses and inserting
it into the input, but I am not sure how meaningful such an evaluation would
be.
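
In case it is useful to anyone trying the same thing, the brute-force
workaround from the quoted message (one extra typeinfo per candidate lexical
rule) could be generated mechanically. Below is a minimal Python sketch, not
anything PET itself provides: expand_typeinfos and the CANDIDATE_RULES table
are hypothetical names of my own, and the table would have to be filled in by
hand from the grammar. Only $third_sg_fin_verb_orule is taken from the example
above.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical table: lexical type -> inflectional rules that might apply.
# Only $third_sg_fin_verb_orule comes from the example chart; a real table
# would have to be derived from the grammar.
CANDIDATE_RULES = {
    "$generic_v_p_le": ["$third_sg_fin_verb_orule"],
}


def expand_typeinfos(chart, candidate_rules):
    """Add one <typeinfo> copy per candidate rule for each rule-less <typeinfo>."""
    for w in chart.iter("w"):
        # Only expand entries that do not already name an inflectional rule.
        originals = [t for t in w.findall("typeinfo") if t.find("infl") is None]
        count = len(w.findall("typeinfo"))
        for typeinfo in originals:
            stem = typeinfo.findtext("stem", default="").strip()
            for rule in candidate_rules.get(stem, []):
                count += 1
                variant = copy.deepcopy(typeinfo)
                # Assumes existing ids are already numbered S1..Sn for this token.
                variant.set("id", f"{w.get('id')}S{count}")
                ET.SubElement(variant, "infl", {"name": rule})
                w.append(variant)
    return chart


# Cut-down chart containing only the W5 token from the example.
chart = ET.fromstring(
    '<pet-input-chart>'
    '<w id="W5" cstart="22" cend="25">'
    '<surface>comes</surface>'
    '<typeinfo id="W5S1" baseform="no" prio="1.0">'
    '<stem>$generic_v_p_le</stem>'
    '</typeinfo>'
    '</w>'
    '</pet-input-chart>'
)
expand_typeinfos(chart, CANDIDATE_RULES)
print(ET.tostring(chart, encoding="unicode"))
```

On the W5 token this keeps the W5S1 entry and appends a W5S2 copy carrying the
infl element, as in the second chart above. It only automates the enumeration;
the grammar still has to reject the readings that do not fit.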

Tim