[Fwd: Re: [developers] processing of lexical rules]

Berthold Crysmann crysmann at dfki.de
Mon Feb 14 00:12:56 CET 2005

Stephan Oepen wrote:

>hi again,
>>[...] the rule filter approach that Bernd has implemented in cheap
>>does a good job eliminating loads of useless segmentation hypotheses.
>>As it is done now, it requires the spelling rules to be organised in
>>one contiguous block, though. (For the German grammar this is not
>>much of a problem, since this had been the case before anyway.)
>somehow efficiency of orthographemic rules appears to be the topic of
>the week.  francis bond is visiting CSLI and we sat down to performance
>tune JaCY a little.  it turned out that half the processing time in PET
>was in the orthographemic component (and that PET failed to report this
>time as part of overall processing time).  so we brainstormed a little
>and came up with additional facilities that could be used to stipulate
>boundaries for the string segmentation component:
>  (a) a limit on the maximum depth for each chain of orthographemic
>      rules (like it was available in the LKB already), 
>  (b) a minimum length for the hypothesized stem at the `bottom' of the
>      chain, such that no chains will be hypothesized that assume stems
>      shorter than that threshold,
>  (c) a boolean flag as to whether to allow multiple occurrences of the
>      same rule within a chain (there was a duplicate filter in place
>      for multiple usages of the same rule on the _same_ intermediate
>      form, but JaCY has lots of suffix substitution rules and manages
>      to go through distinct forms even),
>  (d) the ability to declare certain rules as `terminal', i.e. blocking
>      further application of orthographemic rules.
>we implemented these in PET and, for JaCY, (a) and (c) bring processing
>back into the two-digit millisecond range: thus, for grammars where the
>rule filter approach would be problematic (due to a need to intersperse
>some non-orthographemic rules in a chain), there may be other options.
>i plan to submit these changes to the PET source repository within the
>next few days, hoping bernd would then merge them into the main branch?
>i guess i should at least try to implement the same arsenal in the LKB,
>even though i can only agree with ann: the current orthographemic code
>should really be rewritten from scratch.
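To make the four bounds (a) to (d) concrete, here is a minimal sketch in Python. It is not the PET code: SuffixRule, segment() and the parameter names are made up for illustration. It also assumes that `terminal' means that once such a rule has been stripped, nothing further is hypothesised beneath it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuffixRule:
    """Hypothetical suffix-substitution rule, viewed from the analysis side."""
    name: str
    suffix: str             # surface suffix that is stripped
    replacement: str = ""   # material restored on the stem
    terminal: bool = False  # (d): assumed to block any further stripping


def segment(form, rules, max_depth=3, min_stem=2, allow_repeats=False):
    """Enumerate (stem, rule chain) hypotheses for a surface form."""
    results = []

    def walk(current, chain):
        results.append((current, tuple(chain)))
        if len(chain) >= max_depth:                  # (a) depth limit
            return
        if chain and chain[-1].terminal:             # (d) terminal rule
            return
        for rule in rules:
            if not allow_repeats and rule in chain:  # (c) no repeated rule
                continue
            if not rule.suffix or not current.endswith(rule.suffix):
                continue
            stem = current[: len(current) - len(rule.suffix)] + rule.replacement
            if len(stem) < min_stem:                 # (b) minimum stem length
                continue
            walk(stem, chain + [rule])

    walk(form, [])
    return results


rules = [SuffixRule("e", "e"), SuffixRule("en", "en")]
for stem, chain in segment("benenen", rules, allow_repeats=True):
    print(stem, [r.name for r in chain])
```

With allow_repeats=False the same call stops after the first "en", which is the kind of effect (c) is meant to have on suffix substitution rules that keep producing distinct intermediate forms.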
Good. Have a look at the rule filter stuff though. I should really 
produce figures, but it does eliminate lots of hypotheses!
Just think of German: there are plenty of affixes (verbal, adjectival, 
nominal) adding "e" or "(e)n".
Given a hypothetical word like "benenenenenenenenenenenenenenenen" you 
can imagine what the number of decompositions might be with the original 
code. With the rule filter extensions by Bernd, decomposition stops 
pretty early, since most of the "e" or "(e)n" affixation rules cannot 
possibly feed each other, which is statically determined...
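A toy illustration of the effect, in Python. This is not Bernd's implementation: the rule names and categories are invented, and a simple category match stands in for whatever the real filter computes statically from the rule definitions.

```python
from itertools import product

# Hypothetical rules: (name, suffix, input category, output category).
RULES = [
    ("verb_inf_en", "en", "v-stem", "v-word"),
    ("noun_pl_en", "en", "n-stem", "n-word"),
    ("adj_infl_e", "e", "a-stem", "a-word"),
    ("adj_infl_en", "en", "a-stem", "a-word"),
]

# Static filter, computed once: the inner rule can feed the outer rule
# only if its output category is what the outer rule takes as input.
FEEDS = {
    (inner[0], outer[0]): inner[3] == outer[2]
    for inner, outer in product(RULES, RULES)
}


def count_chains(form, use_filter, outer=None):
    """Count hypotheses made while stripping suffixes from the outside in."""
    total = 1  # the hypothesis that `form` itself is the stem
    for name, suffix, _, _ in RULES:
        if not form.endswith(suffix) or len(form) == len(suffix):
            continue
        if use_filter and outer is not None and not FEEDS[(name, outer)]:
            continue  # statically known: this rule cannot feed `outer`
        total += count_chains(form[: -len(suffix)], use_filter, outer=name)
    return total


word = "b" + "en" * 8
print(count_chains(word, use_filter=False))
print(count_chains(word, use_filter=True))
```

In this toy setup none of the rules feed each other, so the filtered count stays at one level of stripping, while the unfiltered count roughly triples with every additional "en".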

>>I can send some details on the reduction  in search space, if anyone is 
>can i get a copy of the grammar, so i can test these additional filters
>on examples that i can actually read (my japanese is a bit rusty :-)?
Sure. What would you need? A .grm file, or sources?

>>To be honest, with the exception of compounding, I find the current
>>functionality sufficient to deal with German (even with Umlaut).
>i was cheered up by that finding ... 
Do not expect to find any higher-order stuff, though: I did have to 
enumerate all possible intervening consonant clusters. Still, the list 
of codas that have to be skipped is quite short.
Using disjoint consonantal character classes was not workable: the LKB 
appears to expand them all, yielding some 160,000-odd combinations, 
which drove memory use up from 20% to 80% on just one rule!
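The arithmetic behind that kind of blow-up is just a product of class sizes. The sizes below are made up for illustration and are not the ones from the grammar:

```python
from math import prod


def expansions(class_sizes):
    """Number of literal patterns obtained when every character-class
    position in a rule is expanded into its individual letters."""
    return prod(class_sizes)


# Made-up example: four pattern positions, each a class of 20 consonants.
print(expansions([20, 20, 20, 20]))

# Enumerating, say, a dozen whole clusters in a single position instead.
print(expansions([12]))
```

So a handful of explicitly listed clusters stays cheap, whereas a few independent classes in one rule multiply out very quickly.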

>i am fond of the built-in approach
>myself.  i think we should add %infix() and branching lexical rules to
>the wish list then, in order to stand a chance of doing compounding in
>the same spirit.
Regarding compounding: since we're currently doing stem access via db 
lookup, it may be worth adding that I got a good speedup in a similar 
scenario by exploiting graphotactics (reflecting phonotactics, really) 
to predict split positions: since compounding always involves word 
constituents, and since words must be composed of well-formed syllables, 
one can easily extract the possible onset and coda consonant clusters 
and, together with a nucleus requirement, use them to reduce the number 
of hypotheses about the position of the word boundary.
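Roughly like this, as a sketch in Python. The onset and coda inventories here are tiny and invented for the example; in practice they would be extracted from the lexicon, and split_points() is just a name chosen for illustration.

```python
import re

VOWELS = "aeiouäöüy"

# Deliberately tiny, made-up inventories of permitted consonant clusters.
ONSETS = {"", "b", "d", "f", "h", "k", "l", "m", "n", "r", "s", "t", "w",
          "sch", "st", "tr", "br"}
CODAS = {"", "l", "m", "n", "r", "s", "t", "ch", "st", "nd", "rt", "rbst"}

# Left part must end in a nucleus plus a coda, right part must start
# with an onset plus a nucleus.
CODA_RE = re.compile(f"[{VOWELS}]([^{VOWELS}]*)$")
ONSET_RE = re.compile(f"([^{VOWELS}]*)[{VOWELS}]")


def split_points(word):
    """Return the positions at which a word boundary is graphotactically
    possible (lower-case input assumed)."""
    points = []
    for i in range(1, len(word)):
        coda = CODA_RE.search(word[:i])
        onset = ONSET_RE.match(word[i:])
        if coda is None or onset is None:
            continue  # nucleus requirement: each side needs a vowel
        if coda.group(1) in CODAS and onset.group(1) in ONSETS:
            points.append(i)
    return points


print(split_points("herbstwind"))
```

With these inventories only the boundary between "herbst" and "wind" survives out of the nine possible positions, so only that one split needs to go to the db lookup.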

What's actually good about this is that, underlyingly, the graphotactic 
restrictions do conform to universal phonological constraints (sonority 
hierarchy), tightened of course, by language-specific ones. As a result, 
one would expect this to carry over to a good deal of other languages...

Independently of compounding, it would be good to have some 
concatenation operator to deal with affixation: at least for the 
separable prefix verbs, it would be nice to have a single rule to 
prepend the lexicalised prefix to the stem, not 50 different ones. I 
mean, I could sort this out with perl(1), but why not give TDL++ a chance...



PS: Did you know that we have got a perl TDL parser here???

>                                                   all the best  -  oe
>nb: oh, i also added a counter :mtcpu to PET (orthographemic processing
>time), tracked separately in [incr tsdb()] but included in overall time
>(i.e. :total).
>+++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
>+++     CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
>+++       --- oe at csli.stanford.edu; oe at hf.uio.no; stephan at oepen.net ---
