Date: Thu, 10 Feb 2005 14:15:49 +0100
From: Berthold Crysmann <crysmann@dfki.de>
To: "Emily M. Bender" <ebender@u.washington.edu>
Subject: Re: [developers] processing of lexical rules

Emily M. Bender wrote:

>Hello everyone,
>
>This is somewhat orthogonal to the discussion I believe Stephan
>meant to start off, but I wanted to chime in because I've been
>thinking about the best way to handle a morphophonology-morphosyntax
>interface in languages with interesting morphophonology (i.e.,
>where the current LKB %suffix() etc. aren't up to the task).
>
>What's currently looking like the best solution is a dual string-based
>and database-based interface between independent "morphology"
>(morphophonology) and "syntax" (morphosyntax, syntax, semantics, i.e.,
>the LKB components). At run time, the morphological analyzer takes a
>surface string and returns a string of abstract morphemes. These
>will probably look like "eat+1per+sg+past" (parsing direction
>used for ease of exposition). This is then the input to the existing
>LKB, which uses ordinary %suffix() (or %prefix(), as appropriate)
>rules to handle the +1per etc. suffixes.

I am not sure that will necessarily give you efficient processing,
unless you can control the morphological component: e.g.
for an adjective like "anderen", I get 7 distinct analyses from the
Xerox German demo, owing to syncretism in German:

  anderen  ander+PAdj+Indef+Fem+Sg+DatGen+Wk
  anderen  ander+PAdj+Indef+Neut+Sg+Gen
  anderen  ander+PAdj+Indef+Neut+Sg+Dat+Wk
  anderen  ander+PAdj+Indef+Masc+Sg+AccGen
  anderen  ander+PAdj+Indef+Masc+Sg+Dat+Wk
  anderen  ander+PAdj+Indef+MFN+Pl+Dat
  anderen  ander+PAdj+Indef+MFN+Pl+NomAccGen+Wk

In the LKB, I can compactly represent all these readings as one type.
Either it will be necessary to map all these readings to an
underspecified representation, as we used to do with Morphix output,
or one would have to modify the implementation of the finite-state
grammars to better reflect the possibilities offered by type
abstraction. The second possibility may also depend on licensing
issues, so it might not be possible at all.

Berthold

> In order to avoid
>duplicating entries for every stem in the morphological analyzer and
>in the LKB lexicon, we'll want to extend the lexical database to
>include morphophonological information. LKB lexical entries will
>point to stem entries in the database, as well as to lexical types.
>The stem entries will bear information about morphotactics and
>lexically-specific morphophonological rules. We'd then want a tool
>to compile from this database the source files for a morphological
>analyzer (for present purposes, built with XFST).
>
>The idea/hope is that by segregating morphophonological analysis from
>morphosyntactic analysis (the unification part of the lexical rules),
>we'll gain efficiencies both at run time and in development. Perhaps
>one of these is that, since the abstract affixes will presumably have
>something funny in their spelling ('+' or otherwise), fewer
>inappropriate stems will be hypothesized.
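[Editor's note: Berthold's first option above -- collapsing the seven syncretic readings into one underspecified representation, as was done with Morphix output -- can be sketched roughly as follows. The intersection-based generalization and the treatment of tags as an unordered set are illustrative assumptions, not the actual Morphix mapping or LKB type logic:]

```python
# Sketch: collapse syncretic finite-state analyses into one
# underspecified feature bundle by keeping only the tags shared by
# every reading.  Illustrative assumption: tags are treated as an
# unordered set, and whatever varies across readings is simply
# dropped (left underspecified), standing in for type abstraction.

analyses = [
    "ander+PAdj+Indef+Fem+Sg+DatGen+Wk",
    "ander+PAdj+Indef+Neut+Sg+Gen",
    "ander+PAdj+Indef+Neut+Sg+Dat+Wk",
    "ander+PAdj+Indef+Masc+Sg+AccGen",
    "ander+PAdj+Indef+Masc+Sg+Dat+Wk",
    "ander+PAdj+Indef+MFN+Pl+Dat",
    "ander+PAdj+Indef+MFN+Pl+NomAccGen+Wk",
]

def underspecify(readings):
    """Return (stem, shared tags): the generalization over all readings."""
    tag_sets = [set(r.split("+")[1:]) for r in readings]
    stem = readings[0].split("+")[0]
    return stem, set.intersection(*tag_sets)

stem, shared = underspecify(analyses)
# Only the part of speech and (in)definiteness are constant across all
# seven readings; gender, number, and case remain underspecified.
```

A real mapping would of course want to name the underspecified dimensions (e.g. an MFN gender type) rather than drop them, which is exactly where the LKB's type hierarchy does better than a flat tag set.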
>
>Emily
>
>On Wed, Feb 09, 2005 at 11:16:25AM -0800, Stephan Oepen wrote:
>
>>dear all,
>>
>>bernd emailed with some issues regarding interactions of lexical rules
>>and the orthographemic component (the %suffix() and similar annotations
>>on some lexical rules). i thought i would take this opportunity to get
>>some traffic going on this new list. in my view the issue is recurring
>>and a general solution not quite obvious.
>>
>>as i understand it, berthold and bernd at DFKI are experimenting with a
>>new set of orthographemic rules and soon enough faced efficiency issues.
>>i suspect this is another instance of what we saw in NorSource earlier,
>>viz. combinatoric explosion in string segmentation hypotheses produced
>>by the application of %suffix() et al. rules, particularly when combined
>>with a large lexicon (such that hypothesized one- and two-letter stems
>>are actually available). for completeness, i attach two analyses i did
>>for JaCY and NorSource, respectively, (in 2003) to this message below.
>>
>>bernd and berthold, did you try *maximal-morphological-rule-depth*? as
>>long as you are willing to impose an upper bound on the number of steps
>>in string decomposition, it might make a real difference.
>>
>>to summarize my current understanding of the process:
>>
>> - phase 1: string segmentation, exclusively using %suffix() rules and
>>   not interleaving actual unification; the only requirement for each
>>   chain of hypothesized rules to be evaluated is the existence of the
>>   stem at the `bottom' of the chain in the lexicon. morph-analyse()
>>   is the LKB function corresponding to this phase.
>>
>> - phase 2: instantiating hypothesized chains, additionally attempting
>>   to intersperse other lexical rules at each point. this step calls
>>   the unifier for each step and (in the LKB) uses the rule filter and
>>   quick check.
however, the LKB runs this phase outside of the chart
>>   (in the function apply-all-lexical-and-morph-rules(), mostly), such
>>   that i suspect it forgoes dynamic programming potential. PET does
>>   this phase as part of regular chart processing (annotating edges as
>>   to remaining orthographemic rules to go through, before such edges
>>   can undergo syntactic rules). i would expect it to be dramatically
>>   faster on inputs with large numbers of hypothesized chains.
>>
>>bernd and berthold, which of these two phases goes bad for you? is
>>there an observable difference between the LKB and PET?
>>
>>i believe bernd has a proposal for improvement already, though i am not
>>sure i understand it fully yet. bernd was planning to email this list
>>in response to my posting.
>>
>>while we are at it, maybe just a recap of why the interspersing of
>>lexical rules without orthographemic effects is necessary at each
>>point. even in a language as simple as english, we might want to
>>analyze
>>
>>  give V: <NP, NP, NP>
>>    [dative shift] --> give V: <NP, NP, PP[to]>
>>    [agentive nominalization] --> giver N: <NP, PP[of], PP[to]>
>>    [plural] --> givers
>>
>>or even
>>
>>  skip V: <NP, NP>
>>    [agentive nominalization] --> skipper N: <PP[of]>
>>    [verbing] --> skipper V: <NP, NP>
>>    [past] --> skippered
>>
>>i admit the latter may be restricted in productivity, but at least the
>>above example conforms to MW :-).
>>
>>i bring this up because a few months ago dan and i discovered that the
>>ability to intersperse orthographemic and non-orthographemic rules had
>>broken in versions of PET from the stable branch as of sometime in
>>2003. i have a patch that dan has been testing, which i plan to submit
>>to the PET source repository really soon now.
>>
>> all the best - oe
>>
>>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>+++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
>>+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
>>+++ --- oe@csli.stanford.edu; oe@hf.uio.no; stephan@oepen.net ---
>>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++