[Fwd: Re: [developers] processing of lexical rules]

Stephan Oepen oe at csli.Stanford.EDU
Sun Feb 13 20:28:14 CET 2005


hi again,

> [...] the rule filter approach that Bernd has implemented in cheap
> does a good job eliminating loads of useless segmentation hypotheses.
> As it is done now, it requires the spelling rules to be organised in
> one contiguous block, though. (For the German grammar this is not
> much of a problem, since this had been the case before anyway.)
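a quick gloss, mostly to check my own understanding: the pre-computed
rule filter is consulted while hypothesizing rule chains, so a chain is
only pursued when adjacent rules can actually feed each other; keeping
the spelling rules in one contiguous block then reduces the `is this an
orthographemic rule' test to a range check on rule ids.  in made-up
python (none of these names come from the cheap sources):

  # illustrative only, not the cheap implementation: prune a hypothesized
  # chain of orthographemic rules against a pre-computed rule filter,
  # assuming those rules occupy one contiguous range of ids.
  def chain_survives(chain, feeds, ortho_lo, ortho_hi):
      """chain: rule ids, outermost first; feeds[r][s]: can s feed r?"""
      if any(not (ortho_lo <= rid <= ortho_hi) for rid in chain):
          return False                      # contiguity makes this a range test
      return all(feeds[outer][inner]        # adjacent rules must be compatible
                 for outer, inner in zip(chain, chain[1:]))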

somehow efficiency of orthographemic rules appears to be the topic of
the week.  francis bond is visiting CSLI and we sat down to performance
tune JaCY a little.  it turned out that half the processing time in PET
was in the orthographemic component (and that PET failed to report this
time as part of overall processing time).  so we brainstormed a little
and came up with additional facilities that could be used to stipulate
boundaries for the string segmentation component (a toy sketch of these
checks follows the list):

  (a) a limit on the maximum depth of each chain of orthographemic
      rules (as was already available in the LKB),
  (b) a minimum length for the hypothesized stem at the `bottom' of the
      chain, such that no chains will be hypothesized that assume stems
      shorter than that threshold,
  (c) a boolean flag determining whether to allow multiple occurrences
      of the same rule within a chain (there was already a duplicate
      filter in place for multiple uses of the same rule on the _same_
      intermediate form, but JaCY has lots of suffix substitution rules
      and manages to cycle through distinct forms nonetheless),
  (d) the ability to declare certain rules as `terminal', i.e. blocking
      further application of orthographemic rules.
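
to make these a bit more concrete, here is a toy sketch (python; nothing
to do with the actual PET code, and all rule and parameter names are
invented) of how (a) through (d) could constrain the hypothesizing of
suffix substitution chains; i read (d) as saying that a terminal rule
has to be the outermost rule of its chain:

  from collections import namedtuple

  # toy suffix substitution rule: surface suffix is replaced by stem suffix
  Rule = namedtuple("Rule", "name surface stem terminal")

  def hypothesize(form, rules, max_depth=4, min_stem=2,
                  allow_duplicates=False, chain=()):
      """yield (stem, chain) pairs; chain lists rule names outermost first."""
      yield form, chain                                    # the form itself as a stem
      if len(chain) >= max_depth:                          # (a) depth limit per chain
          return
      for rule in rules:
          if rule.terminal and chain:                      # (d) terminal: outermost only
              continue
          if not allow_duplicates and rule.name in chain:  # (c) no rule twice per chain
              continue
          if not form.endswith(rule.surface):
              continue
          stem = form[:len(form) - len(rule.surface)] + rule.stem
          if len(stem) < min_stem:                         # (b) minimum stem length
              continue
          yield from hypothesize(stem, rules, max_depth, min_stem,
                                 allow_duplicates, chain + (rule.name,))

(the actual code of course also has to check hypotheses against the rule
filter and, eventually, the lexicon, which the sketch ignores.)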

we implemented these in PET and, for JaCY, (a) and (c) bring processing
back into the two-digit millisecond range: thus, for grammars where the
rule filter approach would be problematic (due to a need to intersperse
some non-orthographemic rules in a chain), there may be other options.
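
in terms of the toy sketch above, (a) and (c) amount to nothing more
than the `max_depth' and `allow_duplicates' parameters (again, these
names are mine, not PET options), e.g.

  rules = [Rule("past", "ed", "", False), Rule("plural", "s", "", False)]
  # cap chains at depth two; ban repeated use of one rule within a chain
  print(list(hypothesize("walked", rules, max_depth=2, allow_duplicates=False)))
  # -> [('walked', ()), ('walk', ('past',))]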

i plan to submit these changes to the PET source repository within the
next few days, hoping bernd would then merge them into the main branch?
i guess i should at least try to implement the same arsenal in the LKB,
even though i can only agree with ann: the current orthographemic code
should really be rewritten from scratch.
> I can send some details on the reduction in search space, if anyone is
> interested.

can i get a copy of the grammar, so i can test these additional filters
on examples that i can actually read (my japanese is a bit rusty :-)?

> To be honest, with the exception of compounding, I find the current
> functionality sufficient to deal with German (even with Umlaut).

i was cheered up by that finding ... i am fond of the built-in approach
myself.  i think we should add %infix() and branching lexical rules to
the wish list then, in order to stand a chance of doing compounding in
the same spirit.
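
just to make that wish a little more concrete: by `branching' i mean
allowing the segmentation component to split a form into two candidate
stems (plus, with %infix(), linking material in the middle), rather than
only peeling affixes off one end.  a toy python sketch, with an invented
lexicon predicate and linker inventory, nowhere near a real proposal:

  # toy `branching' segmentation for compounds: split a surface form into
  # two candidate stems, optionally absorbing linking material in between.
  def compound_splits(form, in_lexicon, linkers=("s", "es", ""), min_stem=3):
      """yield (left, linker, right) splits whose parts the lexicon accepts."""
      for i in range(min_stem, len(form) - min_stem + 1):
          left, rest = form[:i], form[i:]
          for linker in linkers:
              if rest.startswith(linker) and len(rest) - len(linker) >= min_stem:
                  right = rest[len(linker):]
                  if in_lexicon(left) and in_lexicon(right):
                      yield left, linker, right

e.g. compound_splits("arbeitszimmer", lambda w: w in {"arbeit", "zimmer"})
would yield ('arbeit', 's', 'zimmer').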

                                                  all the best  -  oe

nb: oh, i also added a counter :mtcpu to PET (orthographemic processing
time), tracked separately in [incr tsdb()] but included in overall time
(i.e. :total).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
+++       --- oe at csli.stanford.edu; oe at hf.uio.no; stephan at oepen.net ---
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


