[developers] processing of lexical rules
Emily M. Bender
ebender at u.washington.edu
Wed Feb 9 20:55:11 CET 2005
This is somewhat orthogonal to the discussion I believe Stephan
meant to start off, but I wanted to chime in because I've been
thinking about the best way to handle a morphophonology-morphosyntax
interface in languages with interesting morphophonology (i.e.,
where the current LKB %suffix() etc. aren't up to the task).
What's currently looking like the best solution is a dual string-based
and database-based interface between independent "morphology"
(morphophonology) and "syntax" (morphosyntax, syntax, semantics, i.e.,
the LKB components). At run time, the morphological analyzer takes a
surface string and returns one or more strings of abstract morphemes.
These will probably look like "eat+1per+sg+past" (parsing direction
used for ease of exposition). This is then the input to the existing
LKB, which uses ordinary %suffix() (or %prefix(), as appropriate)
rules to handle the +1per etc. suffixes. In order to avoid
duplicating entries for every stem in the morphological analyzer and
in the LKB lexicon, we'll want to extend the lexical database to
include morphophonological information. LKB lexical entries will
point to stem entries in the database, as well as to lexical types.
The stem entries will bear information about morphotactics and
lexically-specific morphophonological rules. We'd then want a tool
to compile from this database the source files for a morphological
analyzer (for present purposes, built with XFST).
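
To make the intended division of labor concrete, here is a toy sketch
(in Python; everything in it is invented for illustration). In the real
setup the first table would be an XFST transducer compiled from the
lexical database, and the affix peeling would be done by ordinary
%suffix() rules in the LKB.

# Stand-in for the morphophonological analyzer: surface form -> one or
# more abstract morpheme strings of the form "stem+affix+affix...".
TOY_ANALYSES = {
    "ate": ["eat+1per+sg+past", "eat+3per+sg+past"],
}

# Stand-in for the morphosyntactic side: each abstract affix names the
# lexical rule that a %suffix() rule would fire for it (rule names made up).
TOY_AFFIX_RULES = {
    "+1per": "1per-lex-rule",
    "+3per": "3per-lex-rule",
    "+sg":   "sg-lex-rule",
    "+past": "past-verb-lex-rule",
}

def analyze(surface):
    """Morphophonology: map a surface string to abstract morpheme strings."""
    return TOY_ANALYSES.get(surface, [])

def to_stem_and_rules(abstract):
    """Morphosyntax: split off the '+'-delimited affixes, innermost first,
    and name the lexical rules that will apply to the stem."""
    stem, *affixes = abstract.split("+")
    return stem, [TOY_AFFIX_RULES["+" + a] for a in affixes]

for abstract in analyze("ate"):
    print(abstract, "->", to_stem_and_rules(abstract))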
The idea/hope is that by segregating morphophonological analysis from
morphosyntactic analysis (the unification part of the lexical rules),
we'll gain efficiencies both at run time and in development. One of
these may be that, since the abstract affixes will presumably have
something funny in their spelling ('+' or otherwise), fewer inappropriate
stems will be hypothesized during segmentation.
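
A small illustration of that last point, with an invented lexicon and
invented suffix inventories: naive peeling of a surface form can land on
short stems that happen to exist in a large lexicon, whereas an abstract
morpheme string only ever gives up its '+'-marked pieces.

LEXICON = {"eat", "leaf", "leave", "lea"}          # 'lea' really is a noun
SURFACE_SUFFIXES = ["s", "es", "ves"]
ABSTRACT_SUFFIXES = ["+1per", "+sg", "+past", "+pl"]

def stem_hypotheses(form, suffixes):
    """Strip any chain of suffixes off `form'; collect remainders that
    are stems in the lexicon."""
    found = set()
    def peel(rest):
        if rest in LEXICON:
            found.add(rest)
        for suffix in suffixes:
            if rest.endswith(suffix) and len(rest) > len(suffix):
                peel(rest[: -len(suffix)])
    peel(form)
    return found

print(stem_hypotheses("leaves", SURFACE_SUFFIXES))             # {'leave', 'lea'} (order may vary)
print(stem_hypotheses("eat+1per+sg+past", ABSTRACT_SUFFIXES))  # {'eat'}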
On Wed, Feb 09, 2005 at 11:16:25AM -0800, Stephan Oepen wrote:
> dear all,
> bernd emailed with some issues regarding interactions of lexical rules
> and the orthographemic component (the %suffix() and similar annotations
> on some lexical rules). i thought i would take this opportunity to get
> some traffic going on this new list. in my view the issue is recurring
> and a general solution not quite obvious.
> as i understand it, berthold and bernd at DFKI are experimenting with a
> new set of orthographemic rules and soon enough faced efficiency issues.
> i suspect this is another instance of what we saw in NorSource earlier,
> viz. combinatoric explosion in string segmentation hypotheses produced
> by the application of %suffix() et al. rules, particularly when combined
> with a large lexicon (such that hypothesized one- and two-letter stems
> are actually available). for completeness, i attach below two analyses
> i did (in 2003) for JaCY and NorSource, respectively.
> bernd and berthold, did you try *maximal-morphological-rule-depth*? as
> long as you are willing to impose an upper bound on the number of steps
> in string decomposition, it might make a real difference.
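
To put rough numbers on the depth bound: if on average b suffix rules
match at each peeling step, the number of segmentation hypotheses
explored grows roughly like b + b^2 + ... + b^d, so capping d cuts the
work sharply. A back-of-the-envelope sketch (the branching factor of 3
is just an illustrative guess):

def chains_explored(b, d):
    """Rough count of segmentation hypotheses with branching factor b
    and at most d decomposition steps."""
    return sum(b ** i for i in range(1, d + 1))

for d in (3, 5, 8):
    print(d, chains_explored(3, d))   # 39, 363, 9840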
> to summarize my current understanding of the process:
> - phase 1: string segmentation, exclusively using %suffix() rules and
> not interleaving actual unification; the only requirement for each
> chain of hypothesized rules to be evaluated is the existence of the
> stem at the `bottom' of the chain in the lexicon. morph-analyse()
> is the LKB function corresponding to this phase.
> - phase 2: instantiating hypothesized chains, additionally attempting
> to intersperse other lexical rules at each point. this step calls
> the unifier for each step and (in the LKB) uses the rule filter and
> quick check. however, the LKB runs this phase outside of the chart
> (in the function apply-all-lexical-and-morph-rules(), mostly), such
> that i suspect it forgoes dynamic programming potential. PET does
> this phase as part of regular chart processing (annotating edges as
> to remaining orthographemic rules to go through, before such edges
> can undergo syntactic rules). i would expect it to be dramatically
> faster on inputs with large numbers of hypothesized chains.
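
Here is a toy rendering of the two phases, just to have something
concrete to point at. Lexical entries, rules, and "unification" are all
faked with plain category labels, so this only shows the control
structure, not LKB or PET internals; the interspersing of non-spelling
rules (which real phase 2 also does) is deferred to the sketch further
down.

LEXICON = {"give": "verb", "skip": "verb", "skipper": "noun"}
SPELLING_RULES = {"er": "agent-nom", "s": "plural", "ed": "past"}

def phase1(form, max_depth=3):
    """Phase 1: pure string segmentation; the only requirement is that
    the stem at the bottom of each hypothesized chain is in the lexicon."""
    chains = []
    def peel(rest, applied):
        if rest in LEXICON:
            chains.append((rest, tuple(applied)))   # outermost suffix first
        if len(applied) >= max_depth:
            return
        for suffix, rule in SPELLING_RULES.items():
            if rest.endswith(suffix) and len(rest) > len(suffix):
                peel(rest[: -len(suffix)], applied + [rule])
    peel(form, [])
    return chains

# A drastically simplified stand-in for unification: each rule requires
# an input category and yields an output category.
RULES = {"agent-nom": ("verb", "noun"),
         "plural":    ("noun", "noun"),
         "past":      ("verb", "verb")}

def phase2(chains):
    """Phase 2: instantiate each hypothesized chain bottom-up, checking
    each step; chains whose steps are incompatible are discarded here."""
    survivors = []
    for stem, applied in chains:
        category = LEXICON[stem]
        for rule in reversed(applied):              # innermost rule first
            wants, yields = RULES[rule]
            if category != wants:
                break                               # this chain dies
            category = yields
        else:
            survivors.append((stem, applied, category))
    return survivors

print(phase2(phase1("skippers")))    # [('skipper', ('plural',), 'noun')]
print(phase2(phase1("skippered")))   # [] -- needs a non-spelling rule; see below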
> bernd and berthold, which of these two phases goes bad for you? is there
> an observable difference between the LKB and PET?
> i believe bernd has a proposal for improvement already, though i am not
> sure i understand it fully yet. bernd was planning to email this list
> in response to my posting.
> while we are at it, maybe just a recap of why the interspersing of lexical
> rules without orthographemic effects is necessary at each point. even
> in a language as simple as english, we might want to analyze
> give V: <NP, NP, NP>
> [dative shift] --> give V: <NP, NP, PP[to]>
> [agentive nominalization] --> giver N: <NP, PP[of], PP[to]>
> [plural] --> givers
> or even
> skip V: <NP, NP>
> [agentive nominalization] --> skipper N: <PP[of]>
> [verbing] --> skipper V: <NP, NP>
> [past] --> skippered
> i admit the latter may be restricted in productivity, but at least the
> above example conforms to MW :-).
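
Continuing in the same toy terms, here is why the non-spelling rules
have to be allowed at each point in a chain. "verbing" below is a
zero-derivation rule with no orthographic effect, and "skipper" is
simply listed as a noun stem so as to sidestep the consonant doubling
(which the real %suffix() patterns would handle). The phase-1
segmentation of "skippered" as skipper + past only instantiates once the
zero rule is slotted in.

LEXICON = {"skip": "verb", "skipper": "noun"}
SPELLING = {"past": ("verb", "verb"), "plural": ("noun", "noun")}
ZERO = {"verbing": ("noun", "verb")}      # no spelling change

def instantiate(stem, chain, max_zero=1):
    """Apply the spelling rules in `chain' bottom-up, optionally slotting
    in zero-derivation rules before a step; allow at most `max_zero' zero
    rules overall to keep the toy finite."""
    results = []
    def step(category, remaining, derivation, zeros_used):
        if not remaining:
            results.append((derivation, category))
            return
        rule = remaining[0]
        wants, yields = SPELLING[rule]
        if category == wants:                       # apply the spelling rule
            step(yields, remaining[1:], derivation + [rule], zeros_used)
        if zeros_used < max_zero:                   # or insert a zero rule first
            for zrule, (zwants, zyields) in ZERO.items():
                if category == zwants:
                    step(zyields, remaining, derivation + [zrule], zeros_used + 1)
    step(LEXICON[stem], chain, [], 0)
    return results

print(instantiate("skipper", ["past"], max_zero=0))   # []
print(instantiate("skipper", ["past"], max_zero=1))   # [(['verbing', 'past'], 'verb')]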
> i bring this up because a few months ago dan and i discovered that the
> ability to intersperse orthographemic and non-orthographemic rules had
> stopped working in versions of PET from the stable branch as of sometime in
> 2003. i have a patch that dan has been testing, which i plan to submit
> to the PET source repository really soon now.
> all the best - oe
> +++ Universitetet i Oslo (ILF); Boks 1102 Blindern; 0317 Oslo; (+47) 2285 7989
> +++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++ --- oe at csli.stanford.edu; oe at hf.uio.no; stephan at oepen.net ---