[developers] PET/ERG frontend

Mon Feb 5 03:33:50 CET 2007

On Mon, 5 Feb 2007, Timothy Baldwin wrote:

> We have been working on such a type guesser at Melbourne which uses the
> tokeniser supplied with the relevant grammar (experiments so far have focused
> on the ERG and JaCY). The main person behind this is currently finishing up
> his PhD, so don't ask for a cleaned-up version of the code until the dust has
> settled a bit, but the hope is we'll be able to release everything as part of
> the DELPH-IN resource package.

Sounds very interesting. I'll make sure to look into it
once it's available.

> Otherwise, Yi Zhang has a mature type-guessing
> solution for unknown words which is properly integrated with PET, so you might
> like to talk to him.

Thanks for the hint! I'll do that.

Regarding Ulrich Schäfer's reply:

I'll have a look at FSR preprocessing in PET. If it does
rely on POS tags for unknown words, that, of course, raises
the question of where to get POS tags for the correct tokenization.
But we've discussed that in some depth now anyway.

Sentence-splitting in the PET: Yes, that would be a bad idea,
indeed. Once Ann mentioned there were *grammar* rules for
sentence and token splitting, that became clear to me.

Having "manually preprocessed" a small subset of my data to play
around with, I now definitely see your point about named entities.
These actually seemed to cause many more OOVs on my dataset (from
the RTE-3 challenge) than dictionary words that weren't in the ERG
lexicon.

Thanks!

Richard