[Fwd: Re: [developers] processing of lexical rules]

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Sat Feb 12 18:22:19 CET 2005


Hi all,

I think that we should keep supporting simple morphological analysis in the LKB
(as well as external morphological analysers) and we can still use the string
unification approach, but the code will need to be rewritten.  Here are my
current thoughts.

I'm now sure that we need to go to an architecture where there are just charts
and feature structures.  Preprocessing is defined to return a chart
instantiated with feature structures (expressed in XML for the external
preprocessors).  It can be coupled with morphological analysis in the case
where we have external morphological analysers.  In this case it should be
possible to support a word syntax approach or an approach where the affixes
correspond to lexical rules. 
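
To make the intended data flow concrete, here is a very rough Python sketch of
what a preprocessor-instantiated chart might amount to; the class and field
names are purely illustrative, not the actual LKB structures or the XML
interchange format:

    from dataclasses import dataclass, field

    @dataclass
    class FeatureStructure:
        # minimal stand-in for a typed feature structure: a type name
        # plus a flat attribute-value dictionary
        type: str
        avm: dict = field(default_factory=dict)

    @dataclass
    class Edge:
        # a chart edge spanning positions start..end, carrying the
        # feature structure the preprocessor instantiated
        start: int
        end: int
        fs: FeatureStructure

    # a preprocessor (internal or external) would return a chart as a
    # collection of such edges, e.g. for a single token:
    chart = [Edge(0, 1, FeatureStructure("token", {"ORTH": "writings"}))]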

As far as the LKB internal morphology goes, I think the first thing to
appreciate is that morphological processing in analysis is inherently weird -
we have to take the morphologically complex string and split it into stem plus
affixes and then we do the FS unifications corresponding to the rules.  Right
now, what happens is that the string unifications associated with a rule
operate in one direction, with no associated feature structures, and when the
stem plus affix strings are found, we replay the rules in the other direction,
doing the feature structure unifications.  This can lead to a huge search space
and also requires lots of hacky code so that we can put edges on the chart.
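
To see where the blow-up comes from, here is a toy Python sketch of the first,
string-only phase (invented rules, no spelling changes, nothing like the real
string unification code): every hypothesis it produces has to be replayed with
the feature structure unifications afterwards, whether or not it could ever
succeed.

    # hypothetical affix rules as plain (rule name, suffix) pairs
    SUFFIX_RULES = [("plural_noun", "s"), ("gerund", "ing")]

    def string_hypotheses(form):
        """Phase 1: enumerate stem plus rule-sequence hypotheses by pure
        string operations, with no feature structures involved."""
        results = [(form, [])]          # the form itself may already be a stem
        for rule, suffix in SUFFIX_RULES:
            if form.endswith(suffix) and len(form) > len(suffix):
                shorter = form[: -len(suffix)]
                # the stripped form may itself be further analysable
                for stem, rules in string_hypotheses(shorter):
                    results.append((stem, rules + [rule]))
        return results

    # phase 2 (not shown) replays each rule sequence in the other
    # direction, doing the feature structure unifications; hypotheses
    # that fail there were generated for nothing
    print(string_hypotheses("writings"))
    # [('writings', []), ('writing', ['plural_noun']), ('writ', ['gerund', 'plural_noun'])]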

If we assume a uniform chart/feature structure architecture, we still have to
operate the rules in the stemming direction, because we need to have the stem
to do lexical lookup, but in principle we could use the rule feature structures
at this point too, which could lead to greater efficiency.  For example, if we have
a token with the spelling `writings', we run the full affix rules to get one
alternative corresponding to an application of the lexical rule for plural,
with stem `writing' plus affix `s' (which allows lexical lookup and no further
analysis by affixation rules) and also another alternative where the unanalysed
spelling is `writing', which licenses another affix rule to be tried, again
leading to two options.  All the alternatives go in the chart.  When we get a
structure that allows lexical lookup, i.e. one with a STEM, we do the lexical
lookup and unify in the lexical feature structure.  This further instantiates
the rule applications where unification succeeds but could also result in some
edges being removed from the chart.  There is no additional search at this
point.  The lexical rules which don't have any associated affixation also have
to be applied in the `reverse' direction for this approach to work.
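
Here is a rough Python sketch of that control flow for `writings' (toy
dict-based unification, invented rule and lexicon entries, and no spelling
changes; a real chart would also record the intermediate edges, which this
sketch only keeps on the agenda):

    def unify(fs1, fs2):
        """Toy unification over flat dictionaries: merge, fail on a clash."""
        result = dict(fs1)
        for key, value in fs2.items():
            if key in result and result[key] != value:
                return None
            result[key] = value
        return result

    # hypothetical affix rules: (rule name, suffix, the rule's FS contribution)
    AFFIX_RULES = [
        ("plural_noun", "s",   {"NUM": "pl"}),
        ("gerund",      "ing", {"VFORM": "prp"}),
    ]

    # hypothetical stem lexicon
    LEXICON = {
        "writing": {"STEM": "writing", "POS": "n"},
        "write":   {"STEM": "write",   "POS": "v"},
    }

    def analyse(form):
        chart = []                        # completed analyses
        agenda = [(form, [], {})]         # (unanalysed spelling, rules applied, FS so far)
        while agenda:
            spelling, rules, fs = agenda.pop()
            # lexical lookup: if the remaining spelling is a stem, unify in
            # the lexical FS; a failure simply leaves no edge behind
            entry = LEXICON.get(spelling)
            if entry is not None:
                result = unify(fs, entry)
                if result is not None:
                    chart.append((rules, result))
            # affix rules applied in the stemming direction, with the rule's
            # FS unified in immediately rather than replayed later
            for name, suffix, rule_fs in AFFIX_RULES:
                if spelling.endswith(suffix) and len(spelling) > len(suffix):
                    result = unify(fs, rule_fs)
                    if result is not None:
                        agenda.append((spelling[: -len(suffix)], rules + [name], result))
        return chart

    # 'writings' -> plural_noun applied to the stem 'writing'; the competing
    # 'writ' (plural_noun then gerund) path is hypothesised but fails at
    # lexical lookup because there is no stem 'writ'
    print(analyse("writings"))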

It is clear that this is formally OK, but I am not sure whether it will be
efficient or not because unification is expensive compared to pure string
operations and because of the need to do the non-morphological lexical rules at
the same time.  I suspect the rules might have to be tightened up to be more
efficient `in reverse'.  However, it means we're bringing all the info we have
available to bear as early as possible, which should be a good thing in
principle, and it means we're using a chart throughout, avoiding redundant
processing.  We would definitely need to use some version of the rule filter
mechanism, and I think we would also need to have an additional lexical filter,
which filtered out paths that couldn't lead to a stem.
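
The lexical filter could be something as crude as the following sketch
(invented names, and assuming, unrealistically, no spelling changes): since
stripping affixes only ever removes material from the right-hand end, a path
can only lead to a stem if some known stem is a prefix of what is left.

    # hypothetical set of lexicon stems
    LEXICON_STEMS = {"write", "writing", "walk", "dog"}

    def could_lead_to_stem(spelling):
        """Crude lexical filter: reject a hypothesis unless some lexicon
        stem is a prefix of the remaining spelling.  With real spelling
        changes this test would have to be made more permissive."""
        return any(spelling.startswith(stem) for stem in LEXICON_STEMS)

    assert could_lead_to_stem("writings")      # 'writing' is a prefix, keep going
    assert not could_lead_to_stem("xyzzys")    # nothing reachable, prune the path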

If we did this, then the spelling code would reduce to checking individual
affixations, which would be nice.  It would be formally cleaner than the
current situation, where morphology is replayed along with additional
interpolations of rules.  The best thing, perhaps, would be to try this out
initially with a fake morphology component that didn't do the spelling changes
- if that was reasonably efficient, we could rewrite the string unification
component properly.
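
For that experiment, the fake component might be nothing more than literal
suffix stripping, along these lines (hypothetical names, no spelling changes
handled at all):

    # stand-in for the spelling component: literal, concatenative suffix
    # stripping, enough to exercise the chart/FS machinery
    FAKE_AFFIXES = {"plural_noun": "s", "gerund": "ing", "past": "ed"}

    def fake_segment(spelling, rule):
        """Return the stripped spelling if the rule's affix matches
        literally, else None; the real string unification component would
        also handle spelling changes ('stopped' -> 'stop', 'tries' -> 'try')."""
        affix = FAKE_AFFIXES[rule]
        if spelling.endswith(affix) and len(spelling) > len(affix):
            return spelling[: -len(affix)]
        return None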

Ann


