Date: Thu, 10 Feb 2005 14:15:49 +0100
From: Berthold Crysmann <crysmann@dfki.de>
To: "Emily M. Bender" <ebender@u.washington.edu>
Subject: Re: [developers] processing of lexical rules

Emily M. Bender wrote:

>Hello everyone,
>
>This is somewhat orthogonal to the discussion I believe Stephan
>meant to start off, but I wanted to chime in because I've been
>thinking about the best way to handle a morphophonology-morphosyntax
>interface in languages with interesting morphophonology (i.e.,
>where the current LKB %suffix() etc. aren't up to the task).
>
>What's currently looking like the best solution is a dual string-based
>and database-based interface between independent "morphology"
>(morphophonology) and "syntax" (morphosyntax, syntax, semantics, i.e.,
>the LKB components). At run time, the morphological analyzer takes a
>surface string and returns a string of abstract morphemes. These
>will probably look like "eat+1per+sg+past" (parsing direction
>used for ease of exposition). This is then the input to the existing
>LKB, which uses ordinary %suffix() (or %prefix(), as appropriate)
>rules to handle the +1per etc. suffixes.

I am not sure that will necessarily give you efficient processing,
unless you can control the morphological component: e.g.
for an adjective like "anderen", I get 7 distinct analyses from the
Xerox German demo, owing to syncretism in German:

  anderen  ander+PAdj+Indef+Fem+Sg+DatGen+Wk
  anderen  ander+PAdj+Indef+Neut+Sg+Gen
  anderen  ander+PAdj+Indef+Neut+Sg+Dat+Wk
  anderen  ander+PAdj+Indef+Masc+Sg+AccGen
  anderen  ander+PAdj+Indef+Masc+Sg+Dat+Wk
  anderen  ander+PAdj+Indef+MFN+Pl+Dat
  anderen  ander+PAdj+Indef+MFN+Pl+NomAccGen+Wk

In the LKB, I can compactly represent all these readings as one type.
Either it will be necessary to map all these readings to an
underspecified representation, as we used to do with Morphix output,
or one would have to modify the implementation of the finite-state
grammars to better reflect the possibilities offered by type
abstraction. The second possibility may also depend on licensing
issues, so it might not be possible at all.

Berthold

> In order to avoid
>duplicating entries for every stem in the morphological analyzer and
>in the LKB lexicon, we'll want to extend the lexical database to
>include morphophonological information. LKB lexical entries will
>point to stem entries in the database, as well as to lexical types.
>The stem entries will bear information about morphotactics and
>lexically-specific morphophonological rules. We'd then want a tool
>to compile from this database the source files for a morphological
>analyzer (for present purposes, built with XFST).
>
>The idea/hope is that by segregating morphophonological analysis from
>morphosyntactic analysis (the unification part of the lexical rules),
>we'll gain efficiencies both at run time and in development. Perhaps
>one of these is that, since the abstract affixes will presumably have
>something funny in their spelling ('+' or otherwise), fewer
>inappropriate stems will be hypothesized.
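[Editor's note: Berthold's first option above -- collapsing the seven syncretic readings into one underspecified representation, as was done with Morphix output -- can be sketched roughly as follows. The intersection-based generalization and the treatment of tags as an unordered set are illustrative assumptions, not the actual Morphix mapping or LKB type logic:]

```python
# Sketch: collapse syncretic finite-state analyses into one
# underspecified feature bundle by keeping only the tags shared by
# every reading.  Illustrative assumption: tags are treated as an
# unordered set, and whatever varies across readings is simply
# dropped (left underspecified), standing in for type abstraction.

analyses = [
    "ander+PAdj+Indef+Fem+Sg+DatGen+Wk",
    "ander+PAdj+Indef+Neut+Sg+Gen",
    "ander+PAdj+Indef+Neut+Sg+Dat+Wk",
    "ander+PAdj+Indef+Masc+Sg+AccGen",
    "ander+PAdj+Indef+Masc+Sg+Dat+Wk",
    "ander+PAdj+Indef+MFN+Pl+Dat",
    "ander+PAdj+Indef+MFN+Pl+NomAccGen+Wk",
]

def underspecify(readings):
    """Return (stem, shared tags): the generalization over all readings."""
    tag_sets = [set(r.split("+")[1:]) for r in readings]
    stem = readings[0].split("+")[0]
    return stem, set.intersection(*tag_sets)

stem, shared = underspecify(analyses)
# Only the part of speech and (in)definiteness are constant across all
# seven readings; gender, number, and case remain underspecified.
```

A real mapping would of course want to name the underspecified dimensions (e.g. an MFN gender type) rather than drop them, which is exactly where the LKB's type hierarchy does better than a flat tag set.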
>
>Emily
>
>On Wed, Feb 09, 2005 at 11:16:25AM -0800, Stephan Oepen wrote:
>
>>dear all,
>>
>>bernd emailed with some issues regarding interactions of lexical rules
>>and the orthographemic component (the %suffix() and similar annotations
>>on some lexical rules). i thought i would take this opportunity to get
>>some traffic going on this new list. in my view the issue is recurring
>>and a general solution not quite obvious.
>>
>>as i understand it, berthold and bernd at DFKI are experimenting with a
>>new set of orthographemic rules and soon enough faced efficiency issues.
>>i suspect this is another instance of what we saw in NorSource earlier,
>>viz. combinatoric explosion in string segmentation hypotheses produced
>>by the application of %suffix() et al. rules, particularly when combined
>>with a large lexicon (such that hypothesized one- and two-letter stems
>>are actually available). for completeness, i attach two analyses i did
>>for JaCY and NorSource, respectively, (in 2003) to this message below.
>>
>>bernd and berthold, did you try *maximal-morphological-rule-depth*? as
>>long as you are willing to impose an upper bound on the number of steps
>>in string decomposition, it might make a real difference.
>>
>>to summarize my current understanding of the process:
>>
>> - phase 1: string segmentation, exclusively using %suffix() rules and
>>   not interleaving actual unification; the only requirement for each
>>   chain of hypothesized rules to be evaluated is the existence of the
>>   stem at the `bottom' of the chain in the lexicon. morph-analyse()
>>   is the LKB function corresponding to this phase.
>>
>> - phase 2: instantiating hypothesized chains, additionally attempting
>>   to intersperse other lexical rules at each point. this step calls
>>   the unifier for each step and (in the LKB) uses the rule filter and
>>   quick check.
however, the LKB runs this phase outside of the chart
>>   (in the function apply-all-lexical-and-morph-rules(), mostly), such
>>   that i suspect it forgoes dynamic programming potential. PET does
>>   this phase as part of regular chart processing (annotating edges as
>>   to remaining orthographemic rules to go through, before such edges
>>   can undergo syntactic rules). i would expect it to be dramatically
>>   faster on inputs with large numbers of hypothesized chains.
>>
>>bernd and berthold, which of these two phases goes bad for you? is
>>there an observable difference between the LKB and PET?
>>
>>i believe bernd has a proposal for improvement already, though i am not
>>sure i understand it fully yet. bernd was planning to email this list
>>in response to my posting.
>>
>>while we are at it, maybe just a recap of why the interspersing of
>>lexical rules without orthographemic effects is necessary at each
>>point. even in a language as simple as english, we might want to
>>analyze
>>
>>  give V: <NP, NP, NP>
>>    [dative shift] --> give V: <NP, NP, PP[to]>
>>    [agentive nominalization] --> giver N: <NP, PP[of], PP[to]>
>>    [plural] --> givers
>>
>>or even
>>
>>  skip V: <NP, NP>
>>    [agentive nominalization] --> skipper N: <PP[of]>
>>    [verbing] --> skipper V: <NP, NP>
>>    [past] --> skippered
>>
>>i admit the latter may be restricted in productivity, but at least the
>>above example conforms to MW :-).
>>
>>i bring this up because a few months ago dan and i discovered that the
>>ability to intersperse orthographemic and non-orthographemic rules had
>>broken in versions of PET from the stable branch as of sometime in
>>2003. i have a patch that dan has been testing, which i plan to submit
>>to the PET source repository really soon now.
>>
>> all the best - oe
>>
>>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>+++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
>>+++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
>>+++ --- oe@csli.stanford.edu; oe@hf.uio.no; stephan@oepen.net ---
>>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++