[developers] PET/ERG frontend
rb432 at cam.ac.uk
Fri Feb 2 13:42:20 CET 2007
I am currently trying to RMRSify the RTE-3 datasets, i.e.
- I have short independent snippets of English text,
each containing a few (maybe up to three or four)
sentences.
- These snippets have not been preprocessed in
any way: I don't know the sentence boundaries.
I don't have any previous tokenization, and I
don't have POS tags.
- The sentences are reasonably short and simple,
but they do contain a fair share of tokens
unknown to the ERG lexicon.
- For each sentence, I want to have the 5 or
so best RMRSs.
- It's important that I get reasonably knowledge-rich
results, as my subsequent processing relies
quite heavily on the actual logic of the sentences.
Now, what is the most straightforward way of putting
together a frontend and middleware based on DELPH-IN
infrastructure to make this happen?
I understand the HoG is supposed to be a plug-and-play
solution, but it might be infrastructural overkill for
such a simple task. Furthermore, its release cycle may
be too slow for my needs.
My initial intuition was to run RASP on the bare text,
and convert its tokenization, sentence splitting and
POS-tagging into a PET input chart, which would then
deal with unknown words and still give me
knowledge-rich RMRSs from the ERG.
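To make the conversion step concrete, here is a minimal Python sketch of what I have in mind: turning POS-tagged tokens (as RASP would give me) into a PET-input-chart-style XML document. The element and attribute names (pet-input-chart, w, surface, pos) follow my reading of PET's pic.dtd, so please treat them as assumptions to be checked, not as the definitive format:

```python
# Sketch: build a PET-input-chart-style XML document from
# POS-tagged tokens, e.g. as produced by RASP.
# NOTE: element/attribute names are my assumptions based on
# pic.dtd; verify against the PET documentation before use.
import xml.etree.ElementTree as ET

def tokens_to_pic(tagged_tokens):
    """tagged_tokens: list of (surface, pos_tag) pairs for one sentence."""
    root = ET.Element("pet-input-chart")
    offset = 0
    for i, (surface, tag) in enumerate(tagged_tokens, start=1):
        w = ET.SubElement(root, "w", {
            "id": "W%d" % i,
            "cstart": str(offset),
            "cend": str(offset + len(surface) - 1),
        })
        ET.SubElement(w, "surface").text = surface
        ET.SubElement(w, "pos", {"tag": tag, "prio": "1.0"})
        offset += len(surface) + 1  # assumes single-space separation
    return ET.tostring(root, encoding="unicode")

print(tokens_to_pic([("Kim", "NP"), ("sleeps", "VVZ"), (".", ".")]))
```

The POS annotations are what should let PET handle tokens unknown to the ERG lexicon, so getting the pos elements right is presumably the critical part.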
Ann Copestake just explained to me that this won't
work with recent versions of the ERG, which assume
punctuation to be tokenized differently than RASP does.
I could put some work into making this transformation
happen. Do people have any comments on that? Is it
feasible, or are there any reasons why I would not
want to go down that road?
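For what it's worth, the transformation I have in mind looks roughly like this: RASP emits punctuation marks as separate tokens, while (on my understanding of Ann's explanation) recent ERG versions expect punctuation attached to the neighbouring word. A minimal sketch, where the punctuation set and the attach-to-the-left policy are my assumptions and would need checking against the grammar:

```python
# Sketch of the RASP-to-ERG retokenization step.
# ASSUMPTIONS: PUNCT covers the relevant marks, and the ERG wants
# punctuation glued onto the preceding word; both need verifying.
PUNCT = {".", ",", ";", ":", "!", "?"}

def reattach_punctuation(tokens):
    """Merge punctuation-only tokens into the preceding word token."""
    merged = []
    for tok in tokens:
        if tok in PUNCT and merged:
            merged[-1] += tok
        else:
            merged.append(tok)
    return merged

print(reattach_punctuation(["Kim", "sleeps", ",", "and", "Sandy", "snores", "."]))
# -> ['Kim', 'sleeps,', 'and', 'Sandy', 'snores.']
```

The fiddly part is presumably not this merge itself but keeping the character offsets in the input chart consistent afterwards, which is part of why I'm asking whether the road is worth going down.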
I understand there are people currently in a similar
situation (trying to RMRSify the BNC and so on).
What sort of frontends are currently being used when
it comes to semantic analysis of "real" data?
Any help would be greatly appreciated! Thanks!
P.S.: Since this is my first post to the list, I should
introduce myself: I'm a new PhD student of Ann Copestake,
and I will be working on RMRS-based entailment/similarity.
 My hardware is a Fedora Core 6 Linux on i64.
 Currently, I have the following software set up:
- about a dozen versions of ERG + lkb code,
most recent one is the DEC-14 build.
- PET 0.99.12, the PET currently in the SVN,
including ECL, MRS code and other deps
- RASP3, RASP2
- the most recent HoG
All of the components compile, and seem to work
as designed. PET/ERG produce RMRS output, and if
given a pet-input-chart also work with unknown words.
 I'd generally like to keep the bulk of my own code
in Python, but I have an Allegro Common Lisp 8.0
build environment available as well.