[Fwd: Re: [developers] processing of lexical rules]
Ann Copestake
Ann.Copestake at cl.cam.ac.uk
Mon Feb 14 18:42:55 CET 2005
I'd hope we can limit the things we do that are formally `extra'. For example,
blocking rules from applying more than once is not so good for derivational
morphology (`antiantimissile', etc.). If we have spelling rules that go along
with rules expressed in TFSs, we should be able to use the TFSs to stop rules
applying more than once if we want to. Right now, the separation between the
spelling stage and the TFS stage means that won't work for blocking explosion
in hypotheses at the spelling stage, but the formalism makes it look like it
ought to work. Anyway, as I said, I think if we move to a situation where
everything involves FSs we can probably explore a range of possibilities for
efficiency while keeping strictly constraint-based. I think the string
unification paradigm can be extended to simple cases of infixation though not
to things like Arabic morphology.
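
As a rough illustration of using the feature structure itself to control reapplication, here is a minimal Python sketch. It is not how the LKB does it: the flat dict encoding, the INFLECTED feature, and the names `unify`, `RULES` and `apply_rule` are all assumptions made up for this example. The plural rule constrains its input, so it cannot apply twice, while the `anti' rule places no constraint and can stack.

```python
def unify(fs, constraint):
    """Unify two flat feature structures (dicts of atomic values).

    Returns a new dict, or None if any feature has clashing values.
    """
    result = dict(fs)
    for feat, val in constraint.items():
        if feat in result and result[feat] != val:
            return None
        result[feat] = val
    return result


# Hypothetical rules: (constraint on input, features set on output, spelling).
RULES = {
    "plural": ({"INFLECTED": "-"}, {"INFLECTED": "+"}, lambda s: s + "s"),
    "anti": ({}, {}, lambda s: "anti" + s),
}


def apply_rule(name, entry):
    """Apply a rule to an entry {'ORTH': str, 'FS': dict}; None if blocked."""
    constraint, changes, spell = RULES[name]
    fs = unify(entry["FS"], constraint)
    if fs is None:
        return None  # the input constraint failed, so the rule is blocked
    # The output copies the input's features except those the rule sets.
    fs.update(changes)
    return {"ORTH": spell(entry["ORTH"]), "FS": fs}


missile = {"ORTH": "missile", "FS": {"INFLECTED": "-"}}
anti_twice = apply_rule("anti", apply_rule("anti", missile))
plural_once = apply_rule("plural", missile)
plural_twice = apply_rule("plural", plural_once)
```

Here `anti_twice` carries the spelling `antiantimissile', and `plural_twice` is None because the second application fails the input constraint. The sketch checks spelling and features in one step, which is exactly what the current two-stage setup does not do.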
However, I don't see how compounding works in the paradigm of `morphological
rule is just a lexical rule with attached spelling'. By definition, there are
multiple stems in a compound and I wouldn't have thought it made sense to
treat one as an affix. We could, I suppose, have a new sort of rule which
was unary with respect to tokens in the chart but binary with respect to stem
lookup. This is somewhat orthogonal to the issue of how one detects
compounds. In English, it works OK with productive compounds to assume that
any affixation affects the compound head alone and then to form the compound
with the inflected form. For example, `towel rails' is analysed as (towel (rail+s))
rather than as (towel rail)+s. What about German?
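
The English strategy could be sketched roughly as follows, in Python. This is only a toy under stated assumptions: `LEXICON`, `analyse_word` and `analyse_compound` are hypothetical names, the plural rule is a bare `-s', and treating the last token as the head (with right-branching brackets for longer compounds) is an assumption of the sketch, not something the discussion settles.

```python
LEXICON = {"towel", "rail"}  # hypothetical stem lexicon


def analyse_word(token):
    """Return (stem, affixes) for a token, or None if it has no analysis.

    Tries the bare stem first, then a toy plural in -s.
    """
    if token in LEXICON:
        return (token, [])
    if token.endswith("s") and token[:-1] in LEXICON:
        return (token[:-1], ["+s"])
    return None


def analyse_compound(tokens):
    """Analyse a compound whose affixation affects the head alone.

    The head (assumed to be the last token) may be inflected; every
    other member must be a bare stem. Returns a nested tuple bracketing.
    """
    *modifiers, head = tokens
    result = analyse_word(head)
    if result is None:
        return None
    for modifier in modifiers:
        if analyse_word(modifier) != (modifier, []):
            return None  # inflected or unknown non-head: reject
    for modifier in reversed(modifiers):
        result = (modifier, result)
    return result


towel_rails = analyse_compound(["towel", "rails"])
towels_rail = analyse_compound(["towels", "rail"])
```

With this lexicon, `towel_rails` is `("towel", ("rail", ["+s"]))`, mirroring (towel (rail+s)), and `towels_rail` is None because the non-head is inflected.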
Re Emily's comments about XFST - to me a string `stem+nom+pl' would be better
reformulated as a (simple) feature structure because it's structured - to deal
with it, one has to parse the string. So minimally I'd do something like
  [ STEM stem
    AFFIXES < nom, pl > ]
but perhaps it could be done as a more directly grammar compatible structure
  [ plural
    DTRS < [ nom
             DTRS < [ STEM stem ] > ] > ]
or whatever. Anyway, the point isn't really a linguistic one - all I'm saying
is that we need a flexible data structure to encode complex information and we
may as well stick to (T)FSs.
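
To make the two options concrete, here is a small Python sketch that takes a tag string of the form `stem+nom+pl' and builds both the flat and the nested structure as plain dicts. The function names, the dict encoding and the TYPE key are assumptions for illustration only, and the affix tags are used directly as type names rather than mapped to grammar types like `plural'.

```python
def flat_structure(analysis):
    """Split 'stem+nom+pl' into a flat STEM / AFFIXES structure."""
    stem, *affixes = analysis.split("+")
    return {"STEM": stem, "AFFIXES": affixes}


def nested_structure(analysis):
    """Build the nested variant, where each affix wraps what it applied to.

    The innermost daughter holds the stem; the outermost node is typed
    by the last affix in the string.
    """
    stem, *affixes = analysis.split("+")
    node = {"STEM": stem}
    for affix in affixes:
        node = {"TYPE": affix, "DTRS": [node]}
    return node


flat = flat_structure("stem+nom+pl")
nested = nested_structure("stem+nom+pl")
```

Either way, the consumer no longer has to parse the string itself: the order of affixation is explicit in the list or in the nesting.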
Other types of complex information include characterisation and the
original string, which we want for the (R)MRS. These are currently dealt with
in a rather hacky way. We also need to have complex data structures for named
entity stuff, including as the simplest case the ersatz entries in the
preprocessor used with the ERG which really need to have an instantiated CARG.
I can see the situation getting worse as we decide we'd like to incorporate
further information. For instance, knowing whether a word is in italics or
not is sometimes crucial. Finally, although XFST doesn't return bracketed
structures, some morphology systems do.
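
As a sketch of the kind of record this implies, here is a hypothetical Python token structure carrying the original string, character offsets, an instantiated CARG and an italics flag. None of the field names, the `ersatz_token` helper or the `DateErsatz` label come from the ERG preprocessor; they are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    form: str                   # string handed to lexical lookup (may be an ersatz)
    original: str               # the original surface string
    cfrom: int                  # start character offset in the input
    cto: int                    # end character offset (exclusive)
    carg: Optional[str] = None  # constant argument for named entities
    italic: bool = False        # whether the source text was in italics


def ersatz_token(text, start, end, ersatz, italic=False):
    """Replace text[start:end] with an ersatz form, keeping the original.

    The CARG is instantiated from the original string, so the
    information survives the substitution.
    """
    original = text[start:end]
    return Token(form=ersatz, original=original, cfrom=start, cto=end,
                 carg=original, italic=italic)


token = ersatz_token("on 14/02/2005", 3, 13, "DateErsatz")
```

Here `token.carg` is `14/02/2005` while `token.form` is the ersatz. A flat record like this is the simplest case; the argument above is that a (T)FS would let the same information grow without new ad hoc fields.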
Ann