<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> <title></title> </head> <body bgcolor="#ffffff" text="#000000"> Stephan Oepen wrote: <blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <pre wrap="">hi again, </pre> <blockquote type="cite"> <pre wrap="">[...] the rule filter approach that Bernd has implemented in cheap does a good job eliminating loads of useless segmentation hypotheses. As it is done now, it requires the spelling rules to be organised in one contiguous block, though. (For the German grammar this is not much of a problem, since this had been the case before anyway.) </pre> </blockquote> <pre wrap=""> somehow efficiency of orthographemic rules appears to be the topic of the week. francis bond is visiting CSLI and we sat down to performance tune JaCY a little. it turned out that half the processing time in PET was in the orthographemic component (and that PET failed to report this time as part of overall processing time). so we brainstormed a little and came up with additional facilities that could be used to stipulate boundaries for the string segmentation component: (a) a limit on the maximum depth for each chain of orthographemic rules (like it was available in the LKB already), (b) a minimum length for the hypothesized stem at the `bottom' of the chain, such that no chains will be hypothesized that assume stems shorter than that threshold, (c) a boolean flag as to whether to allow multiple occurences of the same rule within a chain (there was a duplicate filter in place for multiple usages of the same rule on the _same_ intermediate form, but JaCY has lots of suffix substitution rules and manages to go through distinct forms even), (d) the ability to declare certain rules as `terminal', i.e. blocking further application of orthographemic rules. 
we implemented these in PET and, for JaCY, (a) and (c) bring processing back into the two-digit millisecond range: thus, for grammars where the rule filter approach would be problematic (due to a need to intersperse some non-orthographemic rules in a chain), there may be other options. i plan to submit these changes to the PET source repository within the next few days, hoping bernd would then merge them into the main branch? i guess i should at least try to implement the same arsenal in the LKB, even though i can only agree with ann: the current orthographemic code should really be rewritten from scratch. </pre> </blockquote> Good. Have a look at the rule filter stuff, though. I should really produce figures, but it does eliminate lots of hypotheses! Just think of German: there are plenty of affixes (verbal, adjectival, nominal) adding "e" or "(e)n". Given a hypothetical word like "benenenenenenenenenenenenenenenen" you can imagine what the number of decompositions might be with the original code. With the rule filter extensions by Bernd, decomposition stops pretty early, since most of the "e" or "(e)n" affixation rules cannot possibly feed each other, and this is determined statically... <blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <blockquote type="cite"> <pre wrap="">I can send some details on the reduction in search space, if anyone is interested. </pre> </blockquote> <pre wrap=""> can i get a copy of the grammar, so i can test these additional filters on examples that i can actually read (my japanese is a bit rusty :-)? </pre> </blockquote> Sure. What would you need? A .grm file, or sources? <blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <pre wrap=""></pre> <blockquote type="cite"> <pre wrap="">To be honest, with the exception of compounding, I find the current functionality sufficient to deal with German (even with Umlaut). </pre> </blockquote> <pre wrap=""> i was cheered up by that finding ... 
</pre> </blockquote> Do not expect to find any higher-order stuff, though: I did have to enumerate all possible intervening consonant clusters. Still, the list of codas to be skipped is quite short. Using disjoint consonantal character classes was not workable: the LKB appears to expand them all, yielding some 160,000-odd combinations, which drove memory use up from 20% to 80% on just one rule! <blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <pre wrap="">i am fond of the built-in approach myself. i think we should add %infix() and branching lexical rules to the wish list then, in order to stand a chance of doing compounding in the same spirit. </pre> </blockquote> Regarding compounding: since we're currently doing stem access via db lookup, it may be worth adding that I got a good speedup in a similar scenario by exploiting graphotactics (reflecting phonotactics, really) to predict split positions: since compounding always involves word constituents, and since words must be composed of well-formed syllables, one can easily extract the possible onset and coda consonant clusters, together with a nucleus requirement, to reduce the hypotheses about the position of the word boundary. What's actually good about this is that, underlyingly, the graphotactic restrictions conform to universal phonological constraints (the sonority hierarchy), tightened, of course, by language-specific ones. As a result, one would expect this to carry over to a good many other languages... Independently of compounding, it would be good to have some concatenation operator to deal with affixation: at least for the separable prefix verbs, it would be nice to have a single rule to prepend the lexicalised prefix to the stem, not 50 different ones. I mean, I could sort this out with perl(1), but why not give TDL++ a chance... Cheers, Berthold PS: Did you know that we have got a Perl TDL parser here? 
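For illustration only, here is a minimal Python sketch of the graphotactic split-point filter described above for compounding. Everything named in it is hypothetical: the ONSETS and CODAS inventories are tiny hand-picked samples rather than clusters extracted from a real lexicon, and the function names are invented for this example. A split position is kept only if both halves contain a vowel (the nucleus requirement), start with a known onset cluster and end with a known coda cluster. <pre wrap="">
```python
# Illustrative sketch only: the inventories are small assumed samples,
# not the clusters one would extract from an actual German lexicon.
VOWELS = set("aeiouyäöü")

# Hypothetical word-initial onset clusters ("" allows a vowel-initial word).
ONSETS = {"", "b", "bl", "br", "d", "dr", "f", "fl", "fr", "g", "gl", "gr",
          "h", "k", "kl", "kr", "l", "m", "n", "p", "pf", "pl", "pr", "r",
          "s", "sch", "sp", "st", "str", "t", "tr", "w", "z"}

# Hypothetical word-final coda clusters ("" allows a vowel-final word).
CODAS = {"", "b", "ch", "d", "f", "g", "k", "l", "m", "n", "nd", "ng", "p",
         "r", "rb", "rbst", "rt", "s", "sch", "st", "t"}


def leading_consonants(word):
    """Return the consonant cluster before the first vowel."""
    i = 0
    while i != len(word) and word[i] not in VOWELS:
        i += 1
    return word[:i]


def trailing_consonants(word):
    """Return the consonant cluster after the last vowel."""
    i = len(word)
    while i != 0 and word[i - 1] not in VOWELS:
        i -= 1
    return word[i:]


def plausible_word(word):
    """Nucleus requirement plus onset and coda checks for one constituent."""
    has_nucleus = any(ch in VOWELS for ch in word)
    return (has_nucleus
            and leading_consonants(word) in ONSETS
            and trailing_consonants(word) in CODAS)


def split_positions(word):
    """Return every index i where word[:i] and word[i:] both look like words."""
    return [i for i in range(1, len(word))
            if plausible_word(word[:i]) and plausible_word(word[i:])]


# split_positions("herbstblume") returns [6, 9, 10]
```
</pre> With these sample inventories, only 3 of the 10 possible split positions in "herbstblume" survive, among them the intended boundary between "herbst" and "blume"; the remaining candidates would still have to be checked against the stem lexicon. 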
<blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <pre wrap=""> all the best - oe nb: oh, i also added a counter :mtcpu to PET (orthographemic processing time), tracked separately in [incr tsdb()] but included in overall time (i.e. :total). </pre> </blockquote> <blockquote cite="mid200502131928.j1DJSE2T022609@mv.uio.no" type="cite"> <pre wrap="">+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ +++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989 +++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515 +++ --- <a class="moz-txt-link-abbreviated" href="mailto:oe@csli.stanford.edu">oe@csli.stanford.edu</a>; <a class="moz-txt-link-abbreviated" href="mailto:oe@hf.uio.no">oe@hf.uio.no</a>; <a class="moz-txt-link-abbreviated" href="mailto:stephan@oepen.net">stephan@oepen.net</a> --- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ </pre> </blockquote> </body> </html>