[developers] [pet] xml_counts mode

Wed Feb 14 07:21:13 CET 2007

Hi Bernd,

Apologies for the slow response. There's been a lot going on the last couple
of weeks, and not a lot of it research.

> what do you mean by "lexical rules"? 

I was referring to the irules and orules in inflr.tdl, i.e. inflectional rules.

> Those that do spelling changes will not be applied because you
> specified a type and the surface form is not considered any more.
> 
> The assumption was that if some module is able to generate the
> correct lexical type, it can also compute the applicable spelling
> rules. But that surely is something that we should discuss.

I can see the logic behind this. This all relates closely to a conversation I
had recently with the U of Tokyo mob about their supertagger, as part of which
we came to the realisation that we have been doing supertagging independent of
morphology (i.e. predicting only the lexical type and not the irule and orule
applications), whereas they compile out a lexicon by running all of their
lexical rules (presumably inflectional and derivational) over the base
lexicon, and supertag over the compiled lexicon. This raises the question of
which is the harder task (fewer lexical types but more lexical and syntactic
variation in my case, and vice versa in the Tokyo case), to which I don't
claim to have an answer at this point. Either way, for my immediate purposes,
I needed results ASAP and to get results I needed the spelling rules, so I
added in the gold-standard rule applications from the preferred parses. Not
ideal, but it at least allowed me to control for spelling rules and focus more
directly on the lexical type prediction.

> I just included a change that was requested by Berthold, namely
> including the processing of spelling changes for generic entries, which
> was based on the surface form given in the input. I could do something
> similar here, provided a non-empty surface form is given and no lexical
> rules are specified. But there's another problem: does the absence of
> those rules in the input signal that NO such rules should be applied or
> does it simply mean the input processor has no idea about that?

Interesting. At present, it would be the latter, but I can certainly see that
you need to be able to differentiate between the two cases.
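
To make the distinction concrete, here is a rough Python sketch. It is not
PET code: the function name, the return labels and the rule name in the usage
line are all made up for illustration. It assumes the input can tell a missing
rule list (None) apart from an explicitly empty one, and that an empty surface
form means there is nothing to analyse:

```python
from typing import Optional, Sequence


def spelling_rule_policy(rules: Optional[Sequence[str]], surface: str) -> str:
    """Decide how to treat spelling rules for one input token (hypothetical)."""
    if rules is None:
        # No rule information at all: the input processor has no opinion,
        # so fall back on the surface form if there is one (assumption).
        return "analyse_surface" if surface else "apply_none"
    if len(rules) == 0:
        # Explicitly empty list: the input says NO rules should be applied.
        return "apply_none"
    # Rules were supplied: apply exactly those.
    return "apply_given"
```

With this, spelling_rule_policy(None, "dogs") and spelling_rule_policy([], "dogs")
come out differently, which is the differentiation in question; a call with a
non-empty list such as ["some_orule"] takes the third branch.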

A related question, hopefully with a simpler answer: is there a batch input
mode in xml_counts? The wiki suggests there is a way of specifying a file
containing a list of file names, each of which in turn contains a single
PIC. I couldn't find any documentation of such a facility or find any obvious
sign of such a thing in the source code, but would dearly like to get away
from my current mode of operation, of firing up PET each time I want to parse
a single PIC.

Tim