[developers] PET/ERG frontend
R. Bergmair
rb432 at cam.ac.uk
Fri Feb 2 19:03:16 CET 2007
Thanks for your comments!
Yes, I absolutely agree that architecturally the HoG is what I should be
looking at. I'll explain some of the more pragmatic issues I see with the
HoG being used in a situation like mine.
If there's going to be a new release with RASP officially able to do
RMRS, that's great news. The RASP3 currently in the Cambridge CVS gives
me an error when I use the undocumented "-or" output format. Ann gave me a piece of
glue code to call her RASP/RMRS conversion directly from the current LKB
codebase. That works just fine, but I'm not sure if RASP RMRSs are
sufficiently deep for my purposes.
I understand her RMRS code is currently under heavy development, and I
assume the planned release of RASP will be based on it. The new
converter will also get rid of the message-quantifiers, which will make
RASP even more incompatible with PET/ERG, until Dan Flickinger releases
the new ERG branch which will also get rid of messages.
At that point, if there is to be a new release of the HoG, its
maintainers will face the tokenization problem in much the same way as I
do now in getting any frontend running with a current version of the
ERG. I assume that applies to JTok+TnT in much the same way as it does
to RASP.
One way or another, it will probably be some way into the future before
I can next download a HoG and have it work out of the box with
up-to-date components. That's what I meant when I said the HoG release
cycle may be too slow for me. -- Since the HoG depends on everything
else, it is particularly hard to keep up to date.
>The gold solution (as discussed in some DELPH-IN developer meetings)
>would be to have common and component-specific transducers operating
>on the SMAF format, e.g. implemented in XSLT 2.0, that would both
>integrate annotations from shallow preprocessing (tokens, PoS, NE)
>and at the same time make it compatible with ERG tokenization
>assumptions.
That seems to be the most elegant solution in the HoG architecture.
Yet, from a more pragmatic point of view, the easiest thing to do may be
to add some limited frontend capability to the PET. The RASP sentence
splitter is 110 lines of LEX/C code, and the tokenizer is another 160
lines. A baseline type-guessing mechanism could simply try lexical types
in order of their a-priori probability of occurrence.
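Just to make concrete what I have in mind (this is a rough sketch, not
actual PET code; the function names and the (word, type) lexicon format
are my own invention), such a baseline could look roughly like this:

```python
# Hypothetical sketch of the baseline type-guesser described above:
# for an unknown token, propose lexical types in order of their
# a-priori probability, estimated from (word, type) pairs in a lexicon.
from collections import Counter

def train_type_priors(lexicon):
    """lexicon: iterable of (word, lexical_type) pairs."""
    counts = Counter(lextype for _, lextype in lexicon)
    total = sum(counts.values())
    # Most frequent types first.
    return sorted(((t, n / total) for t, n in counts.items()),
                  key=lambda pair: -pair[1])

def guess_types(priors, n=3):
    """Return the n a-priori most probable lexical types; the word
    itself is ignored, which is what makes this a pure baseline."""
    return [t for t, _ in priors[:n]]
```

A frontend would then let the parser try these candidate types for an
unknown word until one yields an analysis.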
This is the level of complexity I was assuming I should be looking at
for my simple setup, where I don't need POS tagging, chunking,
named-entity recognition, or anything fancy like that. That's what I
had in mind when I said the HoG may be infrastructural overkill for my
purposes.
If this is the kind of thing planned for the PET anyway, I'd be happy
to get in on the action and contribute some code. I think it would be
really useful.
>Relying on RASP tokenization and tagging alone would be
>(in my opinion) a bad solution as this is just another closed source
>system...
I tend to agree. Converting the tokenizations is probably as complex as
writing a new tokenizer anyway. What I wasn't sure about is how hard it
is to get type guesses of a quality comparable to using POS tags from
RASP with the current PET unknown-words mapping.
Any comments on that?
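For comparison, as I understand it, the POS-tag route amounts to
something like the following mapping from tags to generic lexical
entries (the entry names here are purely illustrative and not PET's
actual unknown-word configuration; the tags are CLAWS-style, as RASP
assigns):

```python
# Illustrative sketch of POS-driven unknown-word handling: the tag
# assigned by the tagger selects a generic lexical entry for the
# unknown token. Entry names below are made up for illustration.
POS_TO_GENERIC = {
    "NN1": "$generic_sg_noun",   # CLAWS singular common noun
    "NN2": "$generic_pl_noun",   # CLAWS plural common noun
    "VV0": "$generic_verb",      # CLAWS base-form lexical verb
    "JJ":  "$generic_adj",       # CLAWS general adjective
}

def generic_entries(pos_tag):
    """Return candidate generic entries for an unknown word,
    or an empty list if the tag has no mapping."""
    entry = POS_TO_GENERIC.get(pos_tag)
    return [entry] if entry else []
```

The question is whether the prior-based guesser above can approach the
accuracy of this tag-driven selection without running a tagger at all.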
Thanks again! Your replies have been very helpful!
Richard