[Fwd: Re: [developers] processing of lexical rules]

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Mon Feb 14 18:42:55 CET 2005

I'd hope we can limit the things we do that are formally `extra'.  For example,
blocking rules from applying more than once is not so good for derivational
morphology (`antiantimissile' etc).  If we have spelling rules that go along
with rules expressed in TFSs, we should be able to use the TFSs to stop rules
applying more than once if we want to.  Right now, the separation between the
spelling stage and the TFS stage means that this won't work for blocking the
explosion of hypotheses at the spelling stage, even though the formalism makes
it look as if it ought to.  Anyway, as I said, I think if we move to a situation
where everything involves FSs we can probably explore a range of possibilities
for efficiency while staying strictly constraint-based.  I think the string
unification paradigm can be extended to simple cases of infixation, though not
to things like Arabic morphology.
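
To make the `antiantimissile' point concrete, here is a minimal Python sketch
of prefix stripping at the spelling stage.  It is plain string matching, not
TFS unification, and the function name, the `allow_repeat' switch and the toy
prefix and stem lists are all assumptions made up for the illustration.

```python
def strip_prefixes(form, prefixes, stems, allow_repeat=True, used=()):
    """Yield (applied_prefixes, stem) hypotheses for a surface form.

    Toy spelling-stage analyser: it only strips literal prefixes and checks
    the remainder against a stem set (both hypothetical inputs).
    """
    if form in stems:
        yield list(used), form
    for prefix in prefixes:
        if not prefix or not form.startswith(prefix):
            continue  # skip empty prefixes and non-matching rules
        if not allow_repeat and prefix in used:
            continue  # hard block: this rule has already applied once
        yield from strip_prefixes(
            form[len(prefix):], prefixes, stems, allow_repeat, used + (prefix,)
        )


# With repetition allowed the doubly prefixed form gets an analysis.
print(list(strip_prefixes("antiantimissile", ["anti"], {"missile"})))
# With a blanket "apply at most once" block the analysis is lost.
print(list(strip_prefixes("antiantimissile", ["anti"], {"missile"},
                          allow_repeat=False)))
```

The second call returns no hypotheses at all, which is the cost of a blanket
block; a constraint stated in the TFSs could in principle be more selective.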

However, I don't see how compounding works in the paradigm of `morphological 
rule is just a lexical rule with attached spelling'.  By definition, there are 
multiple stems in a compound and I wouldn't have thought it made sense to 
treat one as an affix.   We could, I suppose, have a new sort of rule which 
was unary with respect to tokens in the chart but binary with respect to stem 
lookup.  This is somewhat orthogonal to the issue of how one detects 
compounds.  In English, it works OK for productive compounds to assume that
any affixation affects the compound head alone and then to form the compound
with the inflected form.  For example, `towel rails' is analysed as
(towel (rail+s)) as opposed to (towel rail)+s.  What about German?
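
The head-first analysis of English compounds described above can be sketched
roughly as follows.  This is only an illustration: the two-word lexicon, the
function names and the restriction to a regular plural -s are assumptions, and
nothing here reflects how any existing system does it.

```python
LEXICON = {"towel", "rail"}  # hypothetical toy stem lexicon


def analyse_head(token):
    """Analyse the compound head, allowing only a regular plural -s."""
    if token in LEXICON:
        return token
    if token.endswith("s") and token[:-1] in LEXICON:
        return f"({token[:-1]}+s)"
    raise ValueError(f"unknown head: {token!r}")


def analyse_compound(tokens):
    """Right-headed analysis: inflect the head alone, then attach modifiers."""
    *modifiers, head = tokens
    analysis = analyse_head(head)
    for modifier in reversed(modifiers):
        if modifier not in LEXICON:
            # Modifiers must be bare stems: affixation is on the head only.
            raise ValueError(f"unknown modifier stem: {modifier!r}")
        analysis = f"({modifier} {analysis})"
    return analysis


print(analyse_compound(["towel", "rails"]))  # (towel (rail+s))
```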

Re Emily's comments about XFST - to me a string `stem+nom+pl' would be better 
reformulated as a (simple) feature structure because it's structured - to deal 
with it, one has to parse the string.  So minimally I'd do something like

[ STEM stem
  AFFIXES < nom, pl > ]

but perhaps it could be done as a more directly grammar compatible structure

[ plural
  DTRS < [ nom
           DTRS < [ STEM stem ] > ] > ]

or whatever.  Anyway, the point isn't really a linguistic one - all I'm saying 
is that we need a flexible data structure to encode complex information and we 
may as well stick to (T)FSs.
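
As a rough illustration of the two encodings above, the following Python
sketch turns a tag string like `stem+nom+pl' into nested dicts standing in for
feature structures.  The function names and the `TYPE' key are assumptions for
the example, and the raw tag (e.g. `pl') is used as the type label where the
structure above writes `plural'.

```python
def flat_fs(analysis):
    """Parse 'stem+nom+pl' into a flat structure with STEM and AFFIXES."""
    stem, *affixes = analysis.split("+")
    return {"STEM": stem, "AFFIXES": affixes}


def nested_fs(analysis):
    """Build the nested variant: each affix wraps the structure below it,
    so the last affix in the string ends up as the outermost node."""
    stem, *affixes = analysis.split("+")
    node = {"STEM": stem}
    for affix in affixes:
        node = {"TYPE": affix, "DTRS": [node]}
    return node


print(flat_fs("stem+nom+pl"))
print(nested_fs("stem+nom+pl"))
```

Either way the string is parsed once, up front, and later stages work with a
structure instead of re-parsing the tags.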

Other types of complex information include things like characterisation and
the original string, which we want for the (R)MRS - these are currently dealt
with in a rather hacky way.  We also need complex data structures for named
entity stuff, including, as the simplest case, the ersatz entries in the
preprocessor used with the ERG, which really need to have an instantiated CARG.
I can see the situation getting worse as we decide we'd like to incorporate
further information.  For instance, knowing whether a word is in italics or
not is sometimes crucial.  Finally, although XFST doesn't return bracketed
structures, some morphology systems do.
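
One possible shape for a token record carrying this extra information is
sketched below in Python.  It assumes that characterisation means character
offsets into the original string; the class, field and function names and the
`EmailErsatz' label are hypothetical and are not taken from the ERG
preprocessor.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str                   # string handed on to lexical lookup
    cfrom: int                  # start offset in the original text (assumed)
    cto: int                    # end offset, exclusive (assumed)
    carg: Optional[str] = None  # original string for ersatz / named entities
    italic: bool = False        # example of further markup information


def tokenise(text):
    """Split on whitespace, recording character offsets for each token."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]


def to_ersatz(token, label):
    """Swap the form for a placeholder label, keeping the original as CARG."""
    return replace(token, form=label, carg=token.form)


tokens = tokenise("mail ann@example.org now")
print(to_ersatz(tokens[1], "EmailErsatz"))
```

The offsets survive the ersatz substitution, so the original string can still
be recovered from the input text later on.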

