[Fwd: Re: [developers] processing of lexical rules]
Ann.Copestake at cl.cam.ac.uk
Tue Feb 15 16:14:58 CET 2005
This email is about the redesign of morphology. I'm afraid it may seem very
basic, but I've realised that I didn't understand this in a suitably abstract
way and I haven't seen a good description of what's going on that I can simply
take over. I think I now have a better handle on the abstractions - what I'm
after at the moment is comments about this and comments about principled
restrictions that could/should be made in the way that the machinery handles
things. See Questions below, though any sort of comments would be welcome.
For any given input, we assume that there's some tokenisation T (t1 ... tn)
which is done by a relatively linguistically uninformed preprocessor.
Morphological processing is necessary because the tokens t1 etc don't all
correspond directly to items in the lexicon (assuming one doesn't go for a
full form lexicon). A morphological processor can take a token t, and produce
from it a series of morphemes m1 ... mk, possibly with some associated
bracketing. These are either stems (in which case they correspond to items in
a lexicon) or affixes (in which case, depending on the approach, they could
correspond to items in an affix lexicon or (as in the current LKB) to rules).
So, Choice 1: a) affixes as rules or b) affixes as lexical items?
There are two main strategies for dealing with the morphemes. One is to use
the morphemes as a new tokenisation and then to combine morphemes with rules.
To make this effective, the morphemes have to be associated with FSs that
specify some constraints on how they are combined. These FSs could encode any
partial bracketing constraints (indirectly). This is not what the LKB does at
the moment, but I think that if we had a morphology system which could be seen
as providing such structures, we could support the recombination of morphemes
without much problem since it'd work the same way as syntax. This most
naturally goes with choice 1b above. I've been referring to this as
word-syntax. So, e.g., the chart gets instantiated with:
`I' `walk' `+ed'
except that these things are actually FSs not just strings.
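To make the 2a picture concrete, here is a minimal Python sketch of instantiating a chart with morphemes as the new tokenisation, each carrying a small feature structure (plain dicts stand in for FSs; the lookup table, category names and `tense' feature are all invented for illustration and are not the LKB's actual machinery):

```python
# Sketch: Choice 2a -- morphemes become chart edges carrying small
# feature structures (here just dicts) that constrain recombination.
# All names and structures are illustrative, not the LKB's own.

def tokenize_morphemes(token):
    """Hypothetical morphological processor: token -> morphemes."""
    # a full-form table stands in for a real analyser/FST here
    table = {"walked": [("walk", {"cat": "v-stem"}),
                        ("+ed", {"cat": "v-affix", "tense": "past"})]}
    return table.get(token, [(token, {"cat": "stem"})])

def instantiate_chart(tokens):
    """Lay out the morphemes as the new tokenisation, one edge each."""
    chart = []
    pos = 0
    for tok in tokens:
        for form, fs in tokenize_morphemes(tok):
            chart.append((pos, pos + 1, form, fs))
            pos += 1
    return chart

chart = instantiate_chart(["I", "walked"])
for edge in chart:
    print(edge)
# The chart now holds `I' `walk' `+ed' as three FS-bearing edges.
```

Recombination would then proceed exactly as in syntax, with the FSs (rather than string adjacency alone) licensing how the morphemes combine.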
The other strategy is to see the morphological processor (i.e., the thing
which identifies morphemes) as providing a derivation tree or a
partially-specified derivation tree with the leaves being the stems. What the
LKB does at the moment is to treat the spelling rules as producing a partially
specified derivation chain (i.e. no branching), where the partiality arises
because, in principle, lexical rules with no spelling effects can operate
in-between rules with morphological effect. At the moment, this partially
specified chain does not involve full FSs and is constructed outside the main
LKB system - what I've been suggesting in prior emails is to rewrite this code
so that the construction of this chain uses the chart and so that full FSs are
available (although for efficiency we might restrict the FSs). My previous
email suggested an approach where we had to construct full derivation chains
as part of deriving the morphemes, which involves postulating lexical rules
before finding a stem, but now I think we might be able to support partially
specified chains, as before, and still utilise the chart. But, if we're going
to support compounding (or other processes which require multiple stems in a
word), then this needs to be a partially-specified derivation tree, rather
than just a chain. So e.g. we get
`I' (past `walk')
`I' (past ... `walk')
where ... corresponds to the partial specification e.g. for
`I' (past (noun-to-verb `tango'))
In principle, this isn't incompatible with 1b, if we allow partially specified
trees rather than just chains, because we could have:
(affixation `walk' `ed')
And I think we could allow compounding similarly, in principle, but am worried
about the practicality.
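To make the partiality in the 2b chains concrete, here is a minimal Python sketch (all names invented; this is not the LKB's internal representation) of checking whether a fully instantiated derivation chain is compatible with a partially specified one, with `...' as the gap where spelling-less lexical rules may intervene:

```python
# Sketch: a partially specified derivation chain (Choice 2b).
# GAP marks where lexical rules without spelling effects may apply;
# the matcher checks whether a concrete chain instantiates it.

GAP = "..."

def compatible(partial, full):
    """Does the full chain instantiate the partial one?
    Both are lists of rule/stem names from outermost rule down to
    the stem; GAP matches zero or more intervening rules."""
    if not partial:
        return not full
    if partial[0] == GAP:
        # the gap absorbs zero or more rules of the full chain
        return any(compatible(partial[1:], full[i:])
                   for i in range(len(full) + 1))
    return bool(full) and full[0] == partial[0] and \
        compatible(partial[1:], full[1:])

# (past ... `walk') is compatible both with the plain chain and with
# one where a spelling-less rule intervenes before the stem:
print(compatible(["past", GAP, "walk"], ["past", "walk"]))             # True
print(compatible(["past", GAP, "tango"],
                 ["past", "noun-to-verb", "tango"]))                   # True
print(compatible(["past", "walk"], ["past", "noun-to-verb", "walk"]))  # False
```

Extending this from chains to partially-specified trees (for compounding) would mean letting a node carry a list of daughters rather than a single one, which is where the practicality worry above comes in.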
So, Choice 2: a) morphemes as a new tokenisation or b) morphemes as partial
specification of a derivation tree?
As I currently see it, Choice 2a allows for some options that can't be done
with 2b. For instance, we could instantiate the chart with
`derivational' `grammar' `ian'
and bracket as
(`derivational' `grammar') `ian'
Question 1a: Are there phenomena like this that people really want to deal
with?
Question 1b: If not, should we claim it as a principled restriction that we
don't support them?
The downside of 2a is that there's no linkage between the analysis into
morphemes and the (re)construction of the structure, and this seems wrong in
principle to me. The morpheme FSs could be set up so that they guided the
bracketing in a way which was similar to what is done with 2b but I don't see
how to take advantage of the rule structure during analysis into morphemes.
(It would also be possible, in principle, to have a mixed 2a/2b strategy. In
a sense this is one way of thinking about what happens with English compounds.)
One further thing - if we made the specification of the rules in the 2b
strategy correspond to types rather than instances of rules, the specification
of the derivation tree could be underspecified in another way - we could allow
for several rules to be associated with one affix.
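A quick Python sketch of that further underspecification (the rule-type hierarchy and rule names here are invented): the derivation tree mentions a rule *type*, and each type expands to its candidate rule instances:

```python
# Sketch: specifying rule *types* rather than instances, so one
# affix can be associated with several concrete rules. The
# hierarchy below is invented for illustration.

RULE_SUBTYPES = {
    "past-affixation": ["past-regular", "past-semi-irregular"],
}

def expand(partial_chain):
    """Expand each rule type into its candidate instances,
    yielding every fully instantiated chain."""
    if not partial_chain:
        yield []
        return
    head, rest = partial_chain[0], partial_chain[1:]
    for choice in RULE_SUBTYPES.get(head, [head]):
        for tail in expand(rest):
            yield [choice] + tail

for chain in expand(["past-affixation", "walk"]):
    print(chain)
# -> ['past-regular', 'walk'] and ['past-semi-irregular', 'walk']
```

The chart would then be responsible for filtering the expanded candidates by unification, just as with the gap-style underspecification.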
Question 2 (probably mostly to Emily): what about incorporation? Could we
handle this on the full 2b strategy?
Question 3: could we restrict the 2b strategy? As far as compounding goes,
could it be restricted to the bottom of the tree or does it need to be fully
interleaved with affixation? Are there any reasons to allow more than binary
branching?
Choice 3 - how do we formalise the spelling part of this? This is the bit I'm
really not interested in - I think we should support alternatives to the
current string unification but I don't want to implement them ...
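For concreteness, here is a toy Python sketch of what a single affixation step on the spelling side could look like, with the recursion left to the chart as proposed; the regex-based rules below are invented stand-ins, not the current string-unification notation or any proposed replacement:

```python
# Sketch: undo one "+ed" affixation on the spelling side; repeated
# rule application is the chart's job, not this function's. The
# orthographic patterns are toys, for illustration only.

import re

SUFFIX_RULES = [
    # (surface pattern, stem reconstruction)
    (re.compile(r"^(.*([bcdfgmnprt]))\2ed$"), r"\1"),  # consonant doubling
    (re.compile(r"^(.*)ied$"), r"\1y"),                # y -> ied
    (re.compile(r"^(.*)ed$"), r"\1"),                  # plain +ed
]

def strip_one_affix(form):
    """Yield candidate stems from undoing one '+ed' affixation.
    Several rules may match; bad candidates are assumed to be
    filtered later by lexicon lookup in the chart."""
    for pattern, repl in SUFFIX_RULES:
        if pattern.match(form):
            yield pattern.sub(repl, form)

print(list(strip_one_affix("walked")))   # ['walk']
print(list(strip_one_affix("stopped")))  # ['stop', 'stopp']
print(list(strip_one_affix("tried")))    # ['try', 'tri']
```

The overgeneration (`stopp', `tri') is deliberate in this sketch: hooks for alternative spelling machinery would slot in at exactly this one-affix-at-a-time interface, with the chart and lexicon doing the filtering.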
On top of all this is multiword expression handling ... I don't think I want
to go into that right now!
My current thoughts about redesign are that I want to do 2b properly, but I am
not sure about allowing compounding, at least not if it can't be restricted.
As part of this, I'd redo the string unification code so it just handled a
single affixation with the recursion handled in the chart as I mentioned
before and I'd put in suitable hooks for alternative ways of handling the
spelling stuff. I would allow input from external morphological analysers
that adopted the 2a strategy (this really comes for free once we allow a chart
with FSs as input). General question - does this make sense?