[developers] Revised tokenisation and morphology

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Thu Apr 21 00:37:23 CEST 2005


The branch is now in a state where I can parse and generate with small grammars
and parse some sentences with the ERG.  The active parser isn't supported yet,
although the required changes are minor. I can't really test it in the way I
would like - partly because I'm working on a not very powerful Windows laptop
with an iffy network connection, and partly because I am out of time.  I am
going to ask Ben whether he can work on it, so that we can make it the main
trunk within a reasonable time frame.

The changes are primarily in the handling of morphology - there's also support
for a more general approach to tokenisation. I have rewritten the string
unification code (morph.lsp - the old Bernie Jones code) completely and the
parser code will also support a variety of other morphological analysis
approaches.  The string unification code is now called on a rule-by-rule basis.
There is a *tchart* which gets instantiated by tokenisation and
morphophonology.  The *chart* now includes all edges derived from lexical and
morphological rule processing (i.e., morphosyntax).
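
To make the division concrete, here is a minimal sketch of the two-chart
arrangement - the structure and slot names (token-edge, chart-edge,
rule-sequence and so on) are made up for illustration and are not the actual
LKB definitions:

  ;;; Tokenisation and morphophonology fill *tchart*; lexical and
  ;;; morphological (morphosyntactic) rule application fills *chart*.
  (defstruct token-edge
    from to          ; vertices in the token chart
    string           ; surface form, e.g. "barked"
    stem             ; stem hypothesis from morphophonology, e.g. "bark"
    rule-sequence)   ; rules the morphophonology requires, e.g. (past)

  (defstruct chart-edge
    from to
    category         ; e.g. v or n
    rules-applied)   ; lexical/morphological rules used so far

  (defparameter *tchart* nil "Edges from tokenisation and morphophonology.")
  (defparameter *chart* nil "Edges from lexical and morphological processing.")

  (defun add-token-edge (edge) (push edge *tchart*))
  (defun add-chart-edge (edge) (push edge *chart*))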

When you run it you will see these edges on the chart display - if things are
working correctly it should be doing no more work than before in terms of the
number of rules applying (and with luck fewer), but previously rule application
was `hidden' because edges were not put on the chart.  I have not yet done any
low-level optimisation of the code - I feel it needs more checking before that
would be sensible - but I have put in a more extensive approach to filtering,
and there's another obvious thing that could be done in terms of checking
lexical types.  You will see, for instance, noun edges corresponding to
`barked', which is a bit counter-intuitive but straightforwardly filterable.
The approach is basically the one I sketched in previous emails - the
morphophonology is seen as specifying a partial tree which the parser has to
conform to - so the `ed' affix will correspond to rules like `past' and all
edges that are licensed by `barked' have to contain that rule.  A nominal edge
for `barked' can't be validly ruled out (without the filter) because there could
be a noun->verb conversion rule.  Believe it or not, there are extensive comments
in the code, so you could look at these to figure out what's happening.
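
As a concrete illustration of the filtering idea, here is a sketch in the same
spirit - the function names and the feeding-table representation are invented
for the example, not taken from the actual code:

  ;;; The morphophonology for "barked" demands the rule PAST somewhere in
  ;;; the derivation, so a completed analysis of the token is acceptable
  ;;; only if every required rule has been applied.
  (defun edge-satisfies-requirements-p (required-rules rules-applied)
    "True if every rule demanded by the morphophonology has been applied."
    (subsetp required-rules rules-applied))

  ;;; The extra lexical-type check mentioned above: a noun edge for "bark"
  ;;; could be discarded immediately if no chain of lexical rules (such as
  ;;; a noun->verb conversion) can ever feed PAST from that type.
  (defun lexical-type-can-feed-p (lexical-type rule feeding-table)
    "True if RULE is reachable from LEXICAL-TYPE via the feeding table."
    (member rule (gethash lexical-type feeding-table)))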

Right now, I am not supporting the various global variables that were being
used with some grammars to constrain morphology by cutting off various search
paths - I rather hope we can do something cleaner, but they can go back in if
necessary.

Lots of grubby bits of code that had evolved because of the old approach where
morphology wasn't part of the chart have been removed.  morph-history etc. have
gone.  This may break some grammars that have patches or the like which refer
to these.

It would be great if someone could test this on the Norwegian grammar.  In
principle I think this should improve performance once we do the optimisations
etc., and even if not, it will at least allow sensible experimentation with
packing, edge ranking and so on.

There are new menu commands under Debug to print the token chart and the
lexical rule `fsm' (not really an fsm at the moment but the obvious
generalisation of the rule filter for the lexical rules).  You will see warning
messages when you load the ERG about rules that can self-feed - this isn't
necessarily something that needs to be fixed.
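
For what the `fsm' amounts to at the moment, the following sketch may help:
it computes a feeding table over the lexical rules and warns about
self-feeding ones, roughly in the spirit of the messages you see when loading
the ERG.  The names and the can-feed-p predicate are made up for the example:

  ;;; CAN-FEED-P stands in for the real (unification-based) test of whether
  ;;; the output of one rule is a possible input to another.
  (defun compute-feeding-table (rules can-feed-p)
    "Return a table mapping each lexical rule to the rules it can feed."
    (let ((table (make-hash-table)))
      (dolist (r1 rules table)
        (dolist (r2 rules)
          (when (funcall can-feed-p r1 r2)
            (push r2 (gethash r1 table))))
        (when (member r1 (gethash r1 table))
          (warn "Lexical rule ~A can feed itself" r1)))))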

Ann