[Fwd: Re: [developers] processing of lexical rules]

Tue Feb 15 16:14:58 CET 2005

Dear all,

This email is about the redesign of morphology.  I'm afraid it may seem very 
basic, but I've realised that I didn't understand this in a suitably abstract 
way and I haven't seen a good description of what's going on that I can simply 
take over.  I think I now have a better handle on the abstractions - what I'm 
after at the moment is comments about this and comments about principled 
restrictions that could/should be made in the way that the machinery handles 
things. See Questions below, though any sort of comments would be welcome.

For any given input, we assume that there's some tokenisation T (t1 ... tn) 
which is done by a relatively linguistically uninformed preprocessor.  
Morphological processing is necessary because the tokens t1 etc don't all 
correspond directly to items in the lexicon (assuming one doesn't go for a 
full form lexicon).  A morphological processor can take a token t and produce 
from it a series of morphemes m1 ... mk, possibly with some associated 
bracketing.  These are either stems (in which case they correspond to items in 
a lexicon) or affixes (in which case, depending on the approach, they could 
correspond to items in an affix lexicon or, as in the current LKB, to rules).

So, Choice 1: a) affixes as rules or b) affixes as lexical items?

There are two main strategies for dealing with the morphemes.  One is to use 
the morphemes as a new tokenisation and then to combine morphemes with rules.
To make this effective, the morphemes have to be associated with FSs that 
specify some constraints on how they are combined.  These FSs could encode any 
partial bracketing constraints (indirectly).  This is not what the LKB does at 
the moment, but I think that if we had a morphology system which could be seen 
as providing such structures, we could support the recombination of morphemes 
without much problem since it'd work the same way as syntax.  This most 
naturally goes with choice 1b above.  I've been referring to this as 
word-syntax.  So, e.g., the chart gets instantiated with:

`I' `walk' `+ed'

except that these things are actually FSs not just strings.
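
To make the word-syntax picture concrete, here is a minimal Python sketch of a chart seeded with one edge per morpheme, each carrying a (toy) feature structure.  This is not LKB code: the `Edge` class, the `init_chart` function and the feature names `cat` and `attaches_to` are all made up for illustration, and a real FS would be typed and much richer than a flat dict.

```python
from dataclasses import dataclass

@dataclass
class Edge:
    start: int   # chart vertex where the morpheme begins
    end: int     # chart vertex where the morpheme ends
    form: str    # surface form of the morpheme
    fs: dict     # toy stand-in for a feature structure (hypothetical features)

def init_chart(morphemes):
    """Seed the chart with one edge per morpheme, in input order."""
    return [Edge(i, i + 1, form, dict(feats))
            for i, (form, feats) in enumerate(morphemes)]

# Hypothetical FSs: the affix says what kind of thing it attaches to,
# which is how bracketing constraints could be encoded indirectly.
chart = init_chart([
    ("I", {"cat": "pronoun"}),
    ("walk", {"cat": "verb-stem"}),
    ("+ed", {"cat": "affix", "attaches_to": "verb-stem"}),
])

for edge in chart:
    print(edge.start, edge.end, edge.form, edge.fs)
```

From here on, combining `walk' and `+ed' would be an ordinary rule application over adjacent edges, exactly as in syntax.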

The other strategy is to see the morphological processor (i.e., the thing 
which identifies morphemes) as providing a derivation tree or a 
partially-specified derivation tree with the leaves being the stems.  What the 
LKB does at the moment is to treat the spelling rules as producing a partially 
specified derivation chain (i.e. no branching), where the partiality arises 
because, in principle, lexical rules with no spelling effects can operate 
in-between rules with morphological effect.  At the moment, this partially 
specified chain does not involve full FSs and is constructed outside the main 
LKB system - what I've been suggesting in prior emails is to rewrite this code 
so that the construction of this chain uses the chart and so that full FSs are 
available (although for efficiency we might restrict the FSs).  My previous 
email suggested an approach where we had to construct full derivation chains 
as part of deriving the morphemes, which involves postulating lexical rules 
before finding a stem, but now I think we might be able to support partially 
specified chains, as before, and still utilise the chart.  But, if we're going 
to support compounding (or other processes which require multiple stems in a 
word), then this needs to be a partially-specified derivation tree, rather 
than just a chain.  So e.g. we get

`I' (past `walk')

or actually:

`I' (past ... `walk')

where ... corresponds to the partial specification e.g. for

`I' (past (noun-to-verb `tango'))
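
As a rough illustration of what "partially specified" means for a chain, the Python sketch below treats `...` as a gap that may be filled by zero or more rules with no spelling effect.  It is only a sketch under assumptions: the function name `fill_gaps`, the `max_fill` bound and the tuple representation are invented here, and real filling would be constrained by unification of FSs rather than by blind enumeration.

```python
from itertools import product

GAP = "..."  # marks a position where rules without spelling effects may apply

def fill_gaps(chain, silent_rules, max_fill=1):
    """Yield fully specified chains, replacing each GAP with 0..max_fill silent rules."""
    slots = []
    for item in chain:
        if item == GAP:
            options = [()]  # the gap may stay empty
            for n in range(1, max_fill + 1):
                options.extend(product(silent_rules, repeat=n))
            slots.append(options)
        else:
            slots.append([(item,)])  # a rule or stem fixed by the spelling analysis
    for combo in product(*slots):
        yield tuple(x for part in combo for x in part)

# The outermost rule comes first and the stem last, as in (past ... `tango').
chains = list(fill_gaps(("past", GAP, "tango"), ["noun-to-verb"]))
print(chains)
```

With the single silent rule given, this yields the chain with the gap left empty and the chain with noun-to-verb inserted; which of them survives would be decided by the FSs.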

In principle, this isn't incompatible with 1b, if we allow partially specified 
trees rather than just chains, because we could have:

(affixation `walk' `ed')

And I think we could allow compounding similarly, in principle, but am worried 
about the practicality.

So, Choice 2: a) morphemes as a new tokenisation or b) morphemes as partial 
specification of a derivation tree?

As I currently see it, Choice 2a allows for some options that can't be done 
with 2b.  For instance, we could instantiate the chart with

`derivational' `grammar' `ian'

and bracket as 

(`derivational' `grammar') `ian'
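
To see what freedom 2a gives in principle, this small Python sketch enumerates every binary bracketing of a morpheme sequence, ignoring any constraints from the FSs.  The function name `bracketings` and the nested-tuple representation are just assumptions for illustration.

```python
def bracketings(items):
    """Return all binary bracketings of a sequence as nested tuples."""
    items = tuple(items)
    if len(items) == 1:
        return [items[0]]
    results = []
    for split in range(1, len(items)):
        for left in bracketings(items[:split]):
            for right in bracketings(items[split:]):
                results.append((left, right))
    return results

trees = bracketings(["derivational", "grammar", "ian"])
for tree in trees:
    print(tree)
```

For three morphemes there are two bracketings, one of which groups `derivational' with `grammar' before `ian' attaches.  With 2a, the morpheme FSs would have to rule the unwanted ones out.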

Question 1a: Are there phenomena like this that people really want to deal 
with?
Question 1b: If not, should we claim it as a principled restriction that we 
disallow this?!

The downside of 2a is that there's no linkage between the analysis into 
morphemes and the (re)construction of the structure, and this seems wrong in 
principle to me. The morpheme FSs could be set up so that they guided the 
bracketing in a way which was similar to what is done with 2b, but I don't see 
how to take advantage of the rule structure during analysis into morphemes.

(It would also be possible, in principle, to have a mixed 2a/2b strategy.  In 
a sense this is one way of thinking about what happens with English compounds
currently ...)  

One further thing - if we made the specification of the rules in the 2b 
strategy correspond to types rather than instances of rules, the specification 
of the derivation tree could be underspecified in another way - we could allow 
for several rules to be associated with one affix.
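
A minimal Python sketch of that idea: the affix is linked to a rule type, and the type is then expanded to whichever rule instances belong to it.  The type name `ed-suffix-rule`, the instance names and both tables are hypothetical, chosen only because English `+ed' marks both past tense and past participle.

```python
# Hypothetical rule type with its rule instances.
RULE_TYPES = {
    "ed-suffix-rule": ["past-tense", "past-participle"],
}

# Hypothetical link from the affix found by spelling analysis to a rule type.
AFFIX_TO_TYPE = {
    "+ed": "ed-suffix-rule",
}

def candidate_rules(affix):
    """Return every rule instance compatible with the affix (empty if unknown)."""
    return RULE_TYPES.get(AFFIX_TO_TYPE.get(affix), [])

print(candidate_rules("+ed"))
```

The derivation tree would then mention only the type, leaving the choice of instance to be settled later.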

Question 2 (probably mostly to Emily): what about incorporation?  Could we 
handle this on the full 2b strategy?

Question 3: could we restrict the 2b strategy?  As far as compounding goes, 
could it be restricted to the bottom of the tree or does it need to be fully 
interleaved with affixation?  Are there any reasons to allow more than binary 
branching?

Choice 3 - how do we formalise the spelling part of this?  This is the bit I'm 
really not interested in - I think we should support alternatives to the 
current string unification but I don't want to implement them ...

On top of all this is multiword expression handling ... I don't think I want 
to go into that right now!

My current thoughts about redesign are that I want to do 2b properly, but I am 
not sure about allowing compounding, at least not if it can't be restricted.  
As part of this, I'd redo the string unification code so that it handles just 
a single affixation, with the recursion handled in the chart as I mentioned 
before, and I'd put in suitable hooks for alternative ways of handling the 
spelling stuff.  I would allow input from external morphological analysers 
that adopted the 2a strategy (this really comes for free once we allow a chart 
with FSs as input).  General question - does this make sense?

Ann