[developers] PET/ERG frontend
Ann Copestake
Ann.Copestake at cl.cam.ac.uk
Fri Feb 2 19:47:23 CET 2007
rb432 at cam.ac.uk said:
> Yet, from a more pragmatic point of view the easiest thing to do may be to
> add some limited frontend capability to the PET. The RASP sentence splitter
> is 110 lines of LEX/C code, and the tokenizer is another 160 lines. A
> baseline type-guessing mechanism could simply try through lexical types in
> order of their a-priori probability of occurrence.
PET can already use the same preprocessor as the LKB. Each grammar in the
LKB/PET will in principle have its own tokenisation rules, since these are
language-specific; the preprocessor simply runs them. The ERG rules are the
files with a .fsr extension in the ERG directory. There are also interfaces to
more complex segmenters/morphological analysers for languages like Japanese.
There's some info on this on the wiki and some relevant discussion on the
developers list.
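To illustrate the general idea of a preprocessor that just runs whatever rules a grammar supplies, here is a minimal sketch in Python. It is not the LKB/PET preprocessor and does not use the .fsr rule syntax: the rule list, its contents and the function name tokenise are all assumptions made up for this example.

```python
import re

# Hypothetical rule set: ordered (pattern, replacement) pairs.
# Illustrative only; these are NOT the ERG's rules or the .fsr syntax.
ENGLISH_RULES = [
    (re.compile(r"([,;:!?])"), r" \1 "),                  # split off punctuation
    (re.compile(r"\.(\s*)$"), r" .\1"),                   # split off a final period
    (re.compile(r"(\w)n't\b"), r"\1 n't"),                # separate the negation clitic
    (re.compile(r"(\w)'(s|re|ve|ll|d|m)\b"), r"\1 '\2"),  # separate other clitics
]


def tokenise(text, rules):
    """Apply each rewrite rule in order, then split on whitespace."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text.split()


print(tokenise("Kim doesn't bark, does she?", ENGLISH_RULES))
```

The point of the design is that the tokenise function knows nothing about English: a different grammar would pass in a different rule list, and a language needing a real segmenter would replace the function altogether.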
As far as unknown word guessing goes, it might well be rational to have a
dedicated type guesser that used the same tokeniser as the ERG rather than
mess around with the RASP tagger/ERG token alignment. This could be a POS
tagger, but that's probably not ideal. Note Frederik Fouvry's work, which
provides an alternative strategy for unknown words. Also note that we do need
sloppy alignment strategies for mapping ERG-RMRS to RASP-RMRS.
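For concreteness, the baseline guesser suggested in the quoted message (trying lexical types in order of their prior probability) could be sketched roughly as below. Everything here is an assumption for illustration: the type names are placeholders rather than ERG lexical types, the counts are invented, and accepts stands in for whatever check the parser would actually make.

```python
from collections import Counter


def rank_types(observed_types):
    """Order lexical types by relative frequency in some annotated sample."""
    counts = Counter(observed_types)
    total = sum(counts.values())
    return [(lex_type, n / total) for lex_type, n in counts.most_common()]


def guess_type(token, ranked, accepts, max_tries=3):
    """Return the most probable type the (hypothetical) check accepts, else None."""
    for lex_type, _prob in ranked[:max_tries]:
        if accepts(token, lex_type):
            return lex_type
    return None


# Invented sample of type occurrences; placeholder names, not ERG types.
sample = ["noun_type"] * 5 + ["verb_type"] * 3 + ["adj_type"] * 2
ranked = rank_types(sample)

# Stand-in for "does the sentence parse if the token gets this type?"
def parses_as_verb(token, lex_type):
    return lex_type == "verb_type"

print(guess_type("blorked", ranked, parses_as_verb))
```

Because the ranking ignores the token itself, this is only a baseline; a tagger or a more informed strategy would condition on the word form and its context.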
Ann