[developers] PET/ERG frontend

Fri Feb 2 19:47:23 CET 2007

rb432 at cam.ac.uk said:
> Yet, from a more pragmatic point of view the easiest thing to do may be to
> add some limited frontend capability to the PET. The RASP sentence splitter
> is 110 lines of LEX/C code, and the tokenizer is another 160 lines. A
> baseline type-guessing mechanism could simply try through lexical types in
> order of their a-priori probability of occurrence.

PET can already use the same preprocessor that the LKB uses. Each grammar in
the LKB/PET can in principle use different tokenisation rules: the
preprocessor just runs the rules, and the rules themselves are
language-specific. The ERG rules are in the files with a .fsr extension in the
ERG directory. There are also interfaces to more complex
segmenters/morphological analysers for languages like Japanese. There's some
info on this on the wiki and some relevant discussion on the developers list.
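
To make the "preprocessor runs the rules" idea concrete, here is a minimal
sketch in Python of a rule-driven tokeniser: an ordered list of
pattern/replacement pairs applied in sequence, followed by a whitespace split.
The rule format, the rules themselves and the names RULES and tokenise are
made up for illustration; this is not the .fsr syntax and not how the LKB or
PET implement it.

```python
import re

# Hypothetical rule set: (pattern, replacement) pairs applied in order.
# Illustrative only; these are NOT taken from the ERG's .fsr files.
RULES = [
    (r"([,;:!?])", r" \1 "),     # split off clause-level punctuation
    (r"\.(\s*)$", r" .\1"),      # split off a sentence-final full stop
    (r"(\w)n't\b", r"\1 n't"),   # separate contracted negation
]


def tokenise(text, rules=RULES):
    """Apply each rewrite rule in order, then split on whitespace."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.split()


if __name__ == "__main__":
    print(tokenise("Kim doesn't sleep, really."))
```

The point of the design is that the engine is grammar-independent: swapping
in a different rule list changes the tokenisation without touching the code.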

As far as unknown word guessing goes, it might well be more sensible to have a
dedicated type guesser that uses the same tokeniser as the ERG, rather than
mess around with aligning RASP tagger tokens to ERG tokens. This could be a
POS tagger, but that's probably not ideal. Note Frederik Fouvry's work, which
provides an alternative strategy for unknown words. Also note that we do need
sloppy alignment strategies for mapping ERG-RMRS to RASP-RMRS.
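
For what it's worth, the baseline from the quoted message (try lexical types
in order of a-priori probability) could look roughly like the Python sketch
below. Everything here is an assumption for illustration: the type names, the
idea of estimating priors from a list of observed types, the accepts callback
(standing in for "does the parser get an analysis with this type?") and the
max_tries cutoff. None of it is existing PET or LKB code.

```python
from collections import Counter


def rank_types(observed_types):
    """Order lexical types by observed frequency, most frequent first.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter(observed_types)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [lexical_type for lexical_type, _ in ranked]


def guess_type(token, ranked_types, accepts, max_tries=3):
    """Return the first of the top max_tries types that accepts() allows.

    accepts(token, lexical_type) is a hypothetical callback; returns None
    if no candidate type is accepted.
    """
    for lexical_type in ranked_types[:max_tries]:
        if accepts(token, lexical_type):
            return lexical_type
    return None


if __name__ == "__main__":
    # Made-up type names and counts, purely for illustration.
    seen = ["noun", "noun", "verb", "adj", "verb", "noun"]
    ranked = rank_types(seen)
    print(ranked)
    print(guess_type("blorp", ranked, lambda tok, t: t == "verb"))
```

Because the guesser only sees tokens, it can sit directly behind the ERG
tokeniser, which is what avoids the token alignment problem.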

Ann