[Fwd: Re: [developers] processing of lexical rules]

Berthold Crysmann crysmann at dfki.de
Thu Feb 17 16:03:06 CET 2005


Dear Ann,

here are just some comments on the issues you've raised.

Ann Copestake wrote:

>Dear all,
>
>This email is about the redesign of morphology.  I'm afraid it may seem very 
>basic, but I've realised that I didn't understand this in a suitably abstract 
>way and I haven't seen a good description of what's going on that I can simply 
>take over.  I think I now have a better handle on the abstractions - what I'm 
>after at the moment is comments about this and comments about principled 
>restrictions that could/should be made in the way that the machinery handles 
>things. See Questions below, though any sort of comments would be welcome.
>
>For any given input, we assume that there's some tokenisation T (t1 ... tn) 
>which is done by a relatively linguistically uninformed preprocessor.  
>Morphological processing is necessary because the tokens t1 etc don't all 
>correspond directly to items in the lexicon (assuming one doesn't go for a 
>full form lexicon).  A morphological processor can take a token t, and produce 
>from it a series of morphemes m1 ... mn, possibly with some associated 
>bracketing.  These are either stems (in which case they correspond to items in 
>a lexicon) or affixes (in which case, depending on the approach, they could 
>correspond to items in an affix lexicon or (as in the current LKB) to rules).
>
>So, Choice 1: a) affixes as rules or b) affixes as lexical items?
>
Well, given that there are languages with subtractive morphology (see 
e.g. Anderson 1992), choice 1b is probably out. That choice would also 
be problematic for fused elements in cluster morphology (e.g. clusters 
of pronominal affixes), where you cannot easily assign a well-behaved 
sign-like feature structure to the fused morphs.
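
(Just to fix terminology for myself, a toy sketch, in Python rather
than anything LKB-like, of what the two choices would deliver for
English `walked'; all type and rule names below are invented.)

    # Choice 1a: the affix surfaces as (the name of) a lexical rule
    # to be applied to the stem's feature structure.
    analysis_1a = ("walk", ["past-v_irule"])      # stem + rule chain

    # Choice 1b: the affix is itself a lexical item with its own
    # feature structure, recombined with the stem by word-syntax.
    analysis_1b = [
        {"ORTH": "walk", "TYPE": "v-lexeme"},
        {"ORTH": "+ed",  "TYPE": "past-affix"},
    ]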

>There are two main strategies for dealing with the morphemes.  One is to use 
>the morphemes as a new tokenisation and then to combine morphemes with rules.
>To make this effective, the morphemes have to be associated with FSs that 
>specify some constraints on how they are combined.  These FSs could encode any 
>partial bracketing constraints (indirectly).  This is not what the LKB does at 
>the moment, but I think that if we had a morphology system which could be seen 
>as providing such structures, we could support the recombination of morphemes 
>without much problem since it'd work the same way as syntax.  This most 
>naturally goes with choice 1b above.  I've been referring to this as 
>word-syntax.  So, e.g., the chart gets instantiated with:
>
>`I' `walk' `+ed'
>
>except that these things are actually FSs not just strings.
>
>The other strategy is to see the morphological processor (i.e., the thing 
>which identifies morphemes) as providing a derivation tree or a 
>partially-specified derivation tree with the leaves being the stems.  What the 
>LKB does at the moment is to treat the spelling rules as producing a partially 
>specified derivation chain (i.e. no branching), where the partiality arises 
>because, in principle, lexical rules with no spelling effects can operate 
>in-between rules with morphological effect.  At the moment, this partially 
>specified chain does not involve full FSs and is constructed outside the main 
>LKB system - what I've been suggesting in prior emails is to rewrite this code 
>so that the construction of this chain uses the chart and so that full FSs are 
>available (although for efficiency we might restrict the FSs).  My previous 
>email suggested an approach where we had to construct full derivation chains 
>as part of deriving the morphemes, which involves postulating lexical rules 
>before finding a stem, but now I think we might be able to support partially 
>specified chains, as before, and still utilise the chart.  But if we're going 
>to support compounding (or other processes which require multiple stems in a 
>word), then this needs to be a partially-specified derivation tree, rather 
>than just a chain.  So e.g. we get
>
>`I' (past `walk')
>
>or actually:
>
>`I' (past ... `walk')
>
>where ... corresponds to the partial specification e.g. for
>
>`I' (past (noun-to-verb `tango'))
>
>In principle, this isn't incompatible with 1b, if we allow partially specified 
>trees rather than just chains, because we could have:
>
>(affixation `walk' `ed')
>
>And I think we could allow compounding similarly, in principle, but am worried 
>about the practicality.
>
>So, Choice 2: a) morphemes as a new tokenisation or b) morphemes as partial 
>specification of a derivation tree?
>
I had a bit of a problem here understanding how 2b differs from 1a, or
2a from 1b. Can you clarify this?
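
(My current reading of the distinction, as a toy Python sketch; the
helper and all names are invented, and none of this is LKB code.)

    def fs(orth, **av):
        """Stand-in for a typed feature structure."""
        return dict(ORTH=orth, **av)

    # 2a: the morphemes *replace* the token as a new tokenisation;
    # each becomes a chart edge, and structure is rebuilt by ordinary
    # rule application over these edges.
    chart_edges_2a = [fs("walk", CAT="v"), fs("+ed", ATTACHES_TO="v")]

    # 2b: the token itself stays; the morphological processor instead
    # emits a partially specified derivation chain over it, with None
    # marking slots where non-spelling lexical rules may intervene.
    derivation_2b = ("past-v_irule", None, fs("walk", CAT="v"))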

>As I currently see it, Choice 2a allows for some options that can't be done 
>with 2b.  For instance, we could instantiate the chart with
>
>`derivational' `grammar' `ian'
>
>and bracket as 
>
>(`derivational' `grammar') `ian'
>
>Question 1a: Are there phenomena like this that people really want to deal 
>with?
>
Can you foresee a semantic solution to bracketing paradoxes? Having
talked to Markus Egg, I remember that he thinks most of these issues
can be dealt with at that level.

>Question 1b: If not, should we claim it as a principled restriction that we 
>disallow this?!
>
>The downside of 2a is that there's no linkage between the analysis into 
>morphemes and the (re)construction of the structure, and this seems wrong in 
>principle to me. The morpheme FSs could be set up so that they guided the 
>bracketing in a way which was similar to what is done with 2b but I don't see 
>how to take advantage of the rule structure during analysis into morphemes.
>
>(It would also be possible, in principle, to have a mixed 2a/2b strategy.  In 
>a sense this is one way of thinking about what happens with English compounds
>currently ...)  
>
>One further thing - if we made the specification of the rules in the 2b 
>strategy correspond to types rather than instances of rules, the specification 
>of the derivation tree could be underspecified in another way - we could allow 
>for several rules to be associated with one affix.
>
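(A sketch of how I read the rules-as-types idea, with made-up names:
one affix then maps to a rule *type*, and any rule instance of that
type can later discharge it.)

    rule_type_for_affix = {"+ed": "ed-suffix_irule"}
    instances_of = {"ed-suffix_irule": ["past-v_irule", "psp-v_irule"]}

    def candidate_rules(affix):
        """All rule instances compatible with a single affix."""
        return instances_of[rule_type_for_affix[affix]]

    # candidate_rules("+ed") -> ["past-v_irule", "psp-v_irule"]
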
>Question 2 (probably mostly to Emily): what about incorporation?  Could we 
>handle this on the full 2b strategy?
>
>Question 3: could we restrict the 2b strategy?  As far as compounding goes, 
>could it be restricted to the bottom of the tree or does it need to be fully 
>interleaved with affixation?  
>
Most certainly not, at least as far as derivation is concerned. Here is 
an example from German:

[[[halt]bar]keit]+s+[datum] `expiry date'

As regards inflectional morphology, the situation is a bit better: the
received wisdom (e.g. Scalise 1984) has it that *productive* inflection
is only found peripherally. However, traces of inflection can still be
found internally, e.g. Italian pomidoro (pl.) `tomatoes = apples of
gold'. Given that compounds are referential islands, one would not
expect inflection of non-head parts of a compound to be productive, so
lexicalisation may always be a good solution.

What one does have to take care of are the linking segments 
(Fugenelemente), like the +s+ in the German example above. In Japanese, 
there's also a voicing of initial consonants at the juncture (see Ito 
and Mester 199?).
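
(A rough sketch of letting the orthographic component of a compounding
rule absorb such linking segments; the linker inventory and the helper
are invented, and `is_stem' stands in for lexical lookup.)

    LINKERS = ["", "s", "es", "n"]     # toy German-style inventory

    def linked_splits(word, is_stem):
        """Yield (left, right) stem pairs, stripping an optional linker."""
        for i in range(1, len(word)):
            left, rest = word[:i], word[i:]
            for l in LINKERS:
                base = rest[len(l):]
                if rest.startswith(l) and is_stem(left) and is_stem(base):
                    yield left, base

    # e.g. linked_splits("haltbarkeitsdatum", lookup) could yield
    # ("haltbarkeit", "datum") via the linker "s".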

As I understand it, compounding can only be addressed realistically if
we can put in enough constraints to significantly reduce lexical
lookup. For humans, these constraints are in part phonological in
nature: only prosodic words (or metrical feet) can form the base of a
compound, and all prosodic words must be composed of well-formed
syllables. Syllable structure is, however, quite well-studied, and the
constraints appear to have a universal basis, the sonority hierarchy
(less sonorous consonants before more sonorous ones in the onset, the
reverse order in the coda, more or less). Language-specific variation
is mostly limited to the complexity of permissible onsets (1-n) or
codas (0-n) and the type of segments allowed in each position. Many
languages put a strong restriction on coda elements, e.g. Hausa, where
basically only l, n, s, r and glides are permissible. Japanese is
probably similar. Most languages also require a vocalic nucleus,
although some do permit syllabic nasals (e.g. Bantu) or liquids, as in
the name of the island Krk. But whatever the range of possibilities,
it is certainly restricted, both universally and language-specifically.

With written input we cannot use stress patterns to detect
phonological wordhood, but what we can do is exploit the underlying
constraints on syllable structure by using their graphemic correlates
to predict split points. These patterns might even be specified as an
orthographic component of the lexical compounding rules...
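
(To make this concrete, a toy sketch of such a graphemic filter; the
vowel set and onset list are purely illustrative, not a real grammar
fragment.)

    VOWELS = set("aeiouy")
    BAD_ONSETS = {"sd", "rt", "nk", "md"}   # sonority-violating, toy list

    def plausible_splits(word):
        """Keep only splits where both halves could be prosodic words."""
        for i in range(2, len(word) - 1):
            left, right = word[:i], word[i:]
            if not (set(left) & VOWELS and set(right) & VOWELS):
                continue                    # each half needs a nucleus
            if right[:2] in BAD_ONSETS:
                continue                    # ill-formed onset at juncture
            yield left, right

    # Only the surviving (left, right) pairs are handed to lexical
    # lookup, so e.g. a right half "sdatum" is never even tried.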

>Are there any reasons to allow more than binary 
>branching?
>
I can't think of one.

>Choice 3 - how do we formalise the spelling part of this?  This is the bit I'm 
>really not interested in - I think we should support alternatives to the 
>current string unification but I don't want to implement them ...
>
>On top of all this is multiword expression handling ... I don't think I want 
>to go into that right now!
>
>My current thoughts about redesign are that I want to do 2b properly, but I am 
>not sure about allowing compounding, at least not if it can't be restricted.  
>
At least for German, a treatment of compounding appears indispensable,
so we had better try hard to find ways of restricting it...

Berthold

>As part of this, I'd redo the string unification code so that it handled just a 
>single affixation, with the recursion done in the chart as I mentioned 
>before, and I'd put in suitable hooks for alternative ways of handling the 
>spelling stuff.  I would allow input from external morphological analysers 
>that adopted the 2a strategy (this really comes for free once we allow a chart 
>with FSs as input).  General question - does this make sense?
>
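(A minimal sketch of the narrowed interface being proposed; the rule
list and its format are invented here, and the real code would use
string unification rather than regexes.)

    import re

    # one toy suffix rule: rule name plus a pattern for a SINGLE affix
    SUFFIX_RULES = [("past-v_irule", re.compile(r"^(.+[^e])ed$"))]

    def strip_one_affix(form):
        """Return (rule, shorter form) pairs for one affixation step."""
        return [(rule, m.group(1))
                for rule, pat in SUFFIX_RULES
                if (m := pat.match(form))]

    # strip_one_affix("walked") -> [("past-v_irule", "walk")]
    # Recursion is then the chart's job: it calls this again on
    # "walk", just as it would process any other edge.
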
>Ann