[Fwd: Re: [developers] processing of lexical rules]
crysmann at dfki.de
Thu Feb 17 16:03:06 CET 2005
here are just some comments on the issues you've raised.
Ann Copestake wrote:
>This email is about the redesign of morphology. I'm afraid it may seem very
>basic, but I've realised that I didn't understand this in a suitably abstract
>way and I haven't seen a good description of what's going on that I can simply
>take over. I think I now have a better handle on the abstractions - what I'm
>after at the moment is comments about this and comments about principled
>restrictions that could/should be made in the way that the machinery handles
>things. See Questions below, though any sort of comments would be welcome.
>For any given input, we assume that there's some tokenisation T (t1 ... tn)
>which is done by a relatively linguistically uninformed preprocessor.
>Morphological processing is necessary because the tokens t1 etc don't all
>correspond directly to items in the lexicon (assuming one doesn't go for a
>full form lexicon). A morphological processor can take a token t, and produce
>from it a series of morphemes m1 ... mn, possibly with some associated
>bracketing. These are either stems (in which case they correspond to items in
>a lexicon) or affixes (in which case, depending on the approach, they could
>correspond to items in an affix lexicon or (as in the current LKB) to rules).
>So, Choice 1: a) affixes as rules or b) affixes as lexical items?
Well, given that there are languages with subtractive morphology (see
e.g. Anderson 1992), choice 1b is probably out. That choice would also
be problematic for fused elements in cluster morphology (e.g. clusters
of pronominal affixes), where you cannot easily assign a well-behaved
sign-like feature structure to the fused morphs.
>There are two main strategies for dealing with the morphemes. One is to use
>the morphemes as a new tokenisation and then to combine morphemes with rules.
>To make this effective, the morphemes have to be associated with FSs that
>specify some constraints on how they are combined. These FSs could encode any
>partial bracketing constraints (indirectly). This is not what the LKB does at
>the moment, but I think that if we had a morphology system which could be seen
>as providing such structures, we could support the recombination of morphemes
>without much problem since it'd work the same way as syntax. This most
>naturally goes with choice 1b above. I've been referring to this as
>word-syntax. So, e.g., the chart gets instantiated with:
>`I' `walk' `+ed'
>except that these things are actually FSs not just strings.
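To make the word-syntax picture concrete, here is a minimal sketch (Python, purely illustrative; the feature names and values are invented, and real LKB feature structures are typed and far richer) of seeding a chart with one edge per morpheme, each carrying a small feature structure that constrains later combination:

```python
# Illustrative sketch only -- not LKB code. Under strategy 2a the chart is
# seeded with one edge per morpheme; here a feature structure is just a dict.

def seed_chart(morphemes):
    """Build initial chart edges from a morpheme sequence.

    Each morpheme is a (form, fs) pair; the fs would normally be the
    full typed feature structure looked up in the (affix) lexicon.
    """
    chart = []
    for i, (form, fs) in enumerate(morphemes):
        chart.append({"from": i, "to": i + 1, "form": form, "fs": fs})
    return chart

# `I' `walk' `+ed' as three chart edges (feature values are invented):
edges = seed_chart([
    ("I",    {"cat": "pron"}),
    ("walk", {"cat": "v", "infl": "needs-affix"}),
    ("+ed",  {"cat": "affix", "attaches-to": "v", "gives": "past"}),
])
print(len(edges))  # -> 3, spanning vertices 0-3
```

Combination rules would then operate over these edges exactly as syntactic rules do over word edges.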
>The other strategy is to see the morphological processor (i.e., the thing
>which identifies morphemes) as providing a derivation tree or a
>partially-specified derivation tree with the leaves being the stems. What the
>LKB does at the moment is to treat the spelling rules as producing a partially
>specified derivation chain (i.e. no branching), where the partiality arises
>because, in principle, lexical rules with no spelling effects can operate
>in-between rules with morphological effect. At the moment, this partially
>specified chain does not involve full FSs and is constructed outside the main
>LKB system - what I've been suggesting in prior emails is to rewrite this code
>so that the construction of this chain uses the chart and so that full FSs are
>available (although for efficiency we might restrict the FSs). My previous
>email suggested an approach where we had to construct full derivation chains
>as part of deriving the morphemes, which involves postulating lexical rules
>before finding a stem, but now I think we might be able to support partially
>specified chains, as before and still utilise the chart. But, if we're going
>to support compounding (or other processes which require multiple stems in a
>word), then this needs to be a partially-specified derivation tree, rather
>than just a chain. So e.g. we get
>`I' (past `walk')
>`I' (past ... `walk')
>where ... corresponds to the partial specification e.g. for
>`I' (past (noun-to-verb `tango'))
>In principle, this isn't incompatible with 1b, if we allow partially specified
>trees rather than just chains, because we could have:
>(affixation `walk' `ed')
>And I think we could allow compounding similarly, in principle, but am worried
>about the practicality.
>So, Choice 2: a) morphemes as a new tokenisation or b) morphemes as partial
>specification of a derivation tree?
I had a bit of a problem here understanding how 2b differs from 1a, or
2a from 1b. Can you clarify this?
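For concreteness, what it means for a concrete rule chain to instantiate a partially specified chain like (past ... `walk') could be sketched as follows (Python, notation invented for illustration; `...` may only absorb rules that have no spelling effect):

```python
# Sketch (not LKB code): does a concrete chain of lexical rules match a
# partially specified derivation chain? "..." stands for zero or more rules
# with no spelling effect, so that
#   (past ... `walk')  matches  (past (noun-to-verb `tango')).

GAP = "..."

def matches(partial, concrete, spelling_rules):
    """True if `concrete` (a list of rule names, outermost first) is an
    instantiation of `partial`, where GAP may absorb any run of rules
    that are not in `spelling_rules`."""
    if not partial:
        return not concrete
    head, rest = partial[0], partial[1:]
    if head == GAP:
        # try absorbing 0..n non-spelling rules
        for k in range(len(concrete) + 1):
            if any(r in spelling_rules for r in concrete[:k]):
                break  # a gap may never swallow a spelling rule
            if matches(rest, concrete[k:], spelling_rules):
                return True
        return False
    return bool(concrete) and concrete[0] == head and \
        matches(rest, concrete[1:], spelling_rules)

spelling = {"past"}  # rules with morphological (spelling) effect
print(matches(["past", GAP], ["past", "noun-to-verb"], spelling))  # True
print(matches(["past", GAP], ["noun-to-verb", "past"], spelling))  # False
```

Generalising this from chains to partially specified trees would mean matching tree fragments rather than lists, but the gap condition stays the same.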
>As I currently see it, Choice 2a allows for some options that can't be done
>with 2b. For instance, we could instantiate the chart with
>`derivational' `grammar' `ian'
>and bracket as
>(`derivational' `grammar') `ian'
>Question 1a: Are there phenomena like this that people really want to deal with?
Can you foresee a semantic solution to bracketing paradoxes? Having
talked to Markus Egg, I remember that he thinks that most of these
issues can be dealt with on that level.
>Question 1b: If not, should we claim it as a principled restriction that we
>The downside of 2a is that there's no linkage between the analysis into
>morphemes and the (re)construction of the structure, and this seems wrong in
>principle to me. The morpheme FSs could be set up so that they guided the
>bracketing in a way which was similar to what is done with 2b but I don't see
>how to take advantage of the rule structure during analysis into morphemes.
>(It would also be possible, in principle, to have a mixed 2a/2b strategy. In
>a sense this is one way of thinking about what happens with English compounds.)
>One further thing - if we made the specification of the rules in the 2b
>strategy correspond to types rather than instances of rules, the specification
>of the derivation tree could be underspecified in another way - we could allow
>for several rules to be associated with one affix.
>Question 2 (probably mostly to Emily): what about incorporation? Could we
>handle this on the full 2b strategy?
>Question 3: could we restrict the 2b strategy? As far as compounding goes,
>could it be restricted to the bottom of the tree or does it need to be fully
>interleaved with affixation?
Most certainly not, at least as far as derivation is concerned. Here is
an example from German:
[[[halt]bar]keit]+s+[datum] (Haltbarkeitsdatum) `expiry date'
As regards inflectional morphology the situation is a bit better: the
received wisdom (e.g. Scalise 1984) has it that *projective* inflection
is only found peripherally. However, traces of inflection can still be
found compound-internally, e.g. Italian pomidoro (plural of pomodoro)
`tomatoes', literally `apples of gold'.
Given that compounds are referential islands, one would not expect
inflection of non-head parts of a compound to be productive, so
lexicalisation may always be a good solution.
What one does have to take care of are the linking segments
(Fugenelemente), like the +s+ in the German example above. In Japanese,
there's also a voicing of initial consonants at the juncture (see Ito
and Mester 199?).
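A treatment of linking segments could be sketched like this (Python, invented toy data; a real system would consult the stem lexicon plus the derivational rules rather than a hard-wired set):

```python
# Sketch: splitting a German compound while allowing an optional linking
# segment (Fugenelement) such as +s+ between the parts. The tiny "lexicon"
# here is hypothetical, just enough for the example above.

LEXICON = {"haltbarkeit", "datum"}
LINKERS = ("s", "es", "n", "en", "")   # common German Fugenelemente

def split_compound(word):
    """Return (first, linker, second) if word = first + linker + second
    with both parts in LEXICON, else None."""
    for i in range(1, len(word)):
        first = word[:i]
        if first not in LEXICON:
            continue
        for linker in LINKERS:
            rest = word[i:]
            if rest.startswith(linker) and rest[len(linker):] in LEXICON:
                return first, linker, rest[len(linker):]
    return None

print(split_compound("haltbarkeitsdatum"))
# -> ('haltbarkeit', 's', 'datum')
```

Junctural alternations like Japanese rendaku voicing would need a mapping at the boundary rather than a fixed linker set, but the lookup structure is the same.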
As I understand it, compounding can only be addressed realistically if
we can impose enough constraints to significantly reduce lexical
lookup. For humans, these constraints
are in part phonological in nature: only prosodic words (or metrical
feet) can form the base of a compound, and all prosodic words must be
composed of well-formed syllables. Syllable structure is, however, quite
well-studied, and the constraints appear to have a universal basis, the
sonority hierarchy (less sonorous consonants before more sonorous ones
in the onset, the reverse order in the coda, more or less).
Language-specific variation is mostly limited to the complexity of
permissible onsets (1-n) or codas (0-n) and the type of segments allowed
in each position. Many languages put a strong restriction on coda
elements, e.g. Hausa, where basically only l,n, s, r and glides are
permissible. Japanese is probably similar. Most languages also require a
vocalic nucleus, although some do permit syllabic nasals (e.g. Bantu),
or liquids, as in the name of the Croatian island Krk. But whatever the
range of possibilities, syllable structure is certainly constrained both
universally and language-specifically.
With written input we cannot use stress patterns to detect phonological
wordhood, but what we can do is exploit the underlying constraints on
syllable structure by using their graphemic correlates for the
prediction of split points. Actually, these patterns may even be
specified as an orthographic component to the lexical compounding rules...
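The idea of predicting split points from graphemic correlates of syllable structure could be sketched roughly as follows (Python; the sonority scale and vowel set are crude simplifications invented for illustration, and the sketch deliberately over-generates, leaving the lexicon to filter the survivors):

```python
# Sketch with a toy sonority scale: a split before position i is ruled out
# if it would leave an ill-formed onset at the start of the second part
# (sonority must rise strictly towards the nucleus).

SONORITY = {  # higher = more sonorous; toy values
    "p": 1, "t": 1, "k": 1, "b": 1, "d": 1, "g": 1,
    "f": 2, "s": 2, "x": 2, "v": 2, "z": 2,
    "m": 3, "n": 3,
    "l": 4, "r": 4,
    "j": 5, "w": 5,
}
VOWELS = set("aeiouy")

def well_formed_onset(cluster):
    """Sonority must rise strictly through the onset cluster."""
    sons = [SONORITY.get(c, 0) for c in cluster]
    return all(a < b for a, b in zip(sons, sons[1:]))

def candidate_splits(word):
    """Positions where the right-hand part starts with a vowel or a
    well-formed onset followed by a vowel."""
    splits = []
    for i in range(2, len(word) - 1):
        right = word[i:]
        onset = []
        for c in right:
            if c in VOWELS:
                break
            onset.append(c)
        else:
            continue  # no vowel at all in the right-hand part
        if well_formed_onset(onset):
            splits.append(i)
    return splits

# Wald+rand: the true split (4) survives, plus two spurious candidates
# that lexical lookup would then eliminate.
print(candidate_splits("waldrand"))  # -> [3, 4, 5]
```

Graphemic clusters like German <sch> show why the correlates would have to be stated over graphemes rather than single letters, as the orthographic component mentioned above suggests.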
>Are there any reasons to allow more than binary branching?
I can't think of one.
>Choice 3 - how do we formalise the spelling part of this? This is the bit I'm
>really not interested in - I think we should support alternatives to the
>current string unification but I don't want to implement them ...
>On top of all this is multiword expression handling ... I don't think I want
>to go into that right now!
>My current thoughts about redesign are that I want to do 2b properly, but I am
>not sure about allowing compounding, at least not if it can't be restricted.
At least for German, a treatment of compounding appears indispensable,
so we had better try hard to find ways of restricting it...
>As part of this, I'd redo the string unification code so it just handled a
>single affixation with the recursion handled in the chart as I mentioned
>before and I'd put in suitable hooks for alternative ways of handling the
>spelling stuff. I would allow input from external morphological analysers
>that adopted the 2a strategy (this really comes for free once we allow a chart
>with FSs as input). General question - does this make sense?
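The proposed single-affixation hook might look roughly like this (an illustrative Python sketch; the alternation notation is invented and is not the LKB's %suffix syntax). Recursion over multiple rules is left to the chart, as suggested above:

```python
# Sketch only: one reverse application of a suffixation spelling rule
# during analysis, proposing candidate stems for lexicon lookup.

def unapply_suffix(surface, suffix, alternations):
    """Return candidate stems for `surface` under a single suffix rule.

    `alternations` is a list of (surface_tail, stem_tail) pairs: each
    proposes replacing `surface_tail` at the end of the de-suffixed form
    with `stem_tail`. Lexicon lookup (not shown) filters the candidates.
    """
    if not surface.endswith(suffix) or len(surface) == len(suffix):
        return []
    base = surface[: -len(suffix)]
    stems = set()
    for surf_tail, stem_tail in alternations:
        if base.endswith(surf_tail):
            cut = len(base) - len(surf_tail)
            stems.add(base[:cut] + stem_tail)
    return sorted(stems)

# past-tense +ed: plain attachment, e-restoration, and undoubling
ED = [("", ""), ("", "e"), ("nn", "n"), ("dd", "d")]
print(unapply_suffix("walked", "ed", ED))   # -> ['walk', 'walke']
print(unapply_suffix("planned", "ed", ED))  # -> ['plan', 'plann', 'planne']
```

Alternative spelling machinery (finite-state transducers, say) could then be plugged in behind the same hook, as long as it delivers one affixation step at a time.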