[developers] PET/ERG frontend

Ulrich Schaefer ulrich.schaefer at dfki.de
Sun Feb 4 22:34:14 CET 2007

R. Bergmair wrote:
> At that point, if there is to be a new release of the HoG, the maintainers
> will face the tokenization problem in much the same way as I do now, if I
> want to get any frontend running with a current version of the ERG. I
> assume that applies to JTok+TNT in much the same way as it does for the
Yes, probably.
> One way or the other, it will probably be the more distant future, when I
> will next be able to download a HoG and have it work out of the box with
> up-to-date components. That's what I meant, when I said the HoG release
> cycle may be too slow for me. -- Since the HoG depends on everything else,
> it is particularly hard to keep up to date.
Now I understand, and I agree. I still hope that HoG being open source
will encourage contributions and fixes from different sites. We now have,
for example, valuable contributions (though not related to your problem)
from the SmartWeb project (by Greg Gulrajani), which I hope to be able to
integrate in the next release.
> I was assuming this is the kind of complexity level I should be looking
> at for my simple setup, where I don't need POS tagging, chunking,
> named entity recognition or anything fancy like that. This was what I
> had in mind when I said the HoG may be infrastructural overkill for my
> purposes. 
I don't think that integrating sentence splitting in PET would be a good
idea. This should be kept outside (developers: shout if you think I'm wrong),
and for corpus parsing, I don't see why this couldn't be a separate process.
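
To illustrate the separate-process idea, here is a minimal sketch of a
standalone sentence splitter that turns running text into one sentence per
line, which could then be fed to a parser. It is not taken from PET or HoG:
the names split_sentences and run_filter and the boundary rule are
assumptions made up for this example, and the rule is deliberately naive
(it will mis-split after abbreviations such as "Dr.").

```python
import re

# Hypothetical, deliberately naive boundary rule (an assumption of this sketch):
# sentence-final punctuation, then whitespace, then an uppercase letter or quote.
_BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=["\'A-Z])')


def split_sentences(text):
    """Return the sentences found in `text`, one string per sentence."""
    # Collapse line breaks and runs of spaces so hard-wrapped input is handled.
    normalized = " ".join(text.split())
    if not normalized:
        return []
    return _BOUNDARY.split(normalized)


def run_filter(instream, outstream):
    """Read all text from `instream` and write one sentence per line."""
    for sentence in split_sentences(instream.read()):
        outstream.write(sentence + "\n")
```

Calling run_filter(sys.stdin, sys.stdout) from a small script would make this
an ordinary Unix-style filter, so the splitter can be swapped for a better one
without touching the parser.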

In my opinion, manually written rules for type guessing would not reach the
quality you could get from RASP or TnT for unknown words. So I think you
need PoS tagging and NER, and then an infrastructure such as HoG should
be exactly what you need, modulo the nasty and subtle issues that still arise
because of different tokenizations (any contribution to improvement is of
course welcome), and with the added benefit of a multilingual framework that
potentially works for languages other than English (but it sounds as if
this is not an issue for you).

Did you already check the FSR preprocessor (Lisp code from LKB, but I
think now also available in PET via ECL)? Doesn't it support PoS tag input
for unknown word handling (I'm not an expert on this tool)?

