[developers] PET/ERG frontend

Ulrich Schaefer ulrich.schaefer at dfki.de
Fri Feb 2 16:34:18 CET 2007


R. Bergmair wrote:
> Hello everyone,
>
> I am currently trying to RMRSify the RTE-3 datasets, i.e.
>
>  - I have short independent snippets of English text,
>    each containing a few (maybe up to three or four)
>    sentences.
>
>  - These snippets have not been preprocessed in
>    any way: I don't know the sentence boundaries.
>    I don't have any previous tokenization, and I
>    don't have POS tags.
>
>  - The sentences are reasonably short and simple,
>    but they do contain a fair share of tokens
>    unknown to the ERG lexicon.
>
>  - For each sentence, I want to have the 5 or
>    so best RMRSs.
>
>  - It's important that I get reasonably knowledge-
>    rich results, as my subsequent processing relies
>    quite heavily on the actual logic of the sentences.
>
> Now, what is the most straightforward way of putting
> together a frontend and middleware based on DELPH-IN
> infrastructure [2] to make this happen?
>
> I understand the HoG is supposed to be a plug-and-play
> solution, 
Yes, it does exactly what you want, except that you would additionally need
a RASP that generates RMRS. (Ann, couldn't you/we make a version
available via DELPH-IN? The current official RASP is RMRS-less.)
> but I think the HoG might be infrastructural
> overkill for my simple need. 
??? Infrastructure isn't bad per se; HoG was in fact designed
for cases such as the one you describe.
> -- Furthermore, the HoG release
> cycle may be too slow for my needs.
???
> My initial intuition was to run RASP on the bare text,
> and convert its tokenization, sentence splitting and
> POS-tagging into a PET input chart, which would then
> deal with unknown words and still give me knowledge-
> rich RMRSs from the ERG.
Here is how we use HoG for robust processing with the ERG (e.g. on the
EUROPARL and QA corpora).
For sentence splitting, I'd recommend an additional ("offline")
preprocessing step using JTok in a JTokSession, as explained
in the HoG documentation (http://heartofgold.dfki.de/doc/heartofgolddoc.pdf),
page 55 ("Raw Input Text Preprocessing and Sentence Splitting",
2 Sentence Splitting).
This step (analyzeall.py with an xmlrpcsession.cfg in which all modules
except JTok are turned off, i.e. jtok_session.cfg) reads the whole input
text and generates a single JTok XML annotation output file.
You would then transform this file, using the indicated stylesheet, into
a text file in which <tu> elements (text units, i.e. sentences) are
replaced by newlines and all other XML tags are stripped off.
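If you prefer to avoid XSLT, here is a minimal Python sketch of that
transformation. It assumes only what is said above, namely that the JTok
annotation is an XML file with one <tu> element per sentence (if the
output uses XML namespaces, the element name would need qualifying); the
file names are placeholders:

    import xml.etree.ElementTree as ET

    # Write one sentence per line: keep the character data of each <tu>
    # (text unit) element and discard all other markup.
    tree = ET.parse("jtok_annotation.xml")
    with open("sentences.txt", "w") as out:
        for tu in tree.iter("tu"):
            # itertext() concatenates the text of <tu> and all of its
            # descendants, which effectively strips the embedded markup
            sentence = " ".join("".join(tu.itertext()).split())
            if sentence:
                out.write(sentence + "\n")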
This text file you would then use as input for analyzeAll.py (important:
with the -n option!) with all relevant modules turned on (e.g. the default
configuration with JTok, TnT, SProUT or LingPipe, RASP, PET). You may
configure the desired output annotations in the array at the beginning
of analyzeAll.py.
Using this session configuration, unknown words will be handled by TnT
and SProUT (or, alternatively, LingPipe). If you need a recent LingPipe
(2.4.0), I can send you the Lingpipe2Module version we will publish soon
as part of the next HoG release.
> Ann Copestake just explained to me that this won't
> work with recent versions of the ERG, which assumes
> punctuation to be tokenized differently from RASP.
well, ERG tokenization assumptions are extraterrestrial 8-). Presumably,
there will be tokenization issues in any of the current preprocessing 
solutions
(including the one I just described).
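To make the mismatch concrete, here is a hand-written illustration of
the general pattern (the token lists are examples, not actual tool
output): RASP emits punctuation as tokens of its own, while the current
ERG expects punctuation to stay attached to the adjacent word and strips
it via lexical rules.

    sentence = 'He said "yes", then left.'

    # hand-written examples, not tool output
    rasp_style = ["He", "said", '"', "yes", '"', ",", "then", "left", "."]
    erg_style  = ["He", "said", '"yes",', "then", "left."]

    # A PET input chart built from the RASP tokens would therefore have
    # to re-attach punctuation (and adjust the character spans) before
    # the ERG could use it.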
> I could put some work into making this transformation
> happen. Do people have any comments on that? Is it
> feasible, or are there any reasons why I would not
> want to go down that road?
The gold solution (as discussed at some DELPH-IN developer meetings)
would be to have common and component-specific transducers operating
on the SMAF format, e.g. implemented in XSLT 2.0, that would both
integrate annotations from shallow preprocessing (tokens, PoS, NEs)
and at the same time make them compatible with ERG tokenization
assumptions. Relying on RASP tokenization and tagging alone would
(in my opinion) be a bad solution, as that is just another
closed-source system...
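As a toy illustration of what such a component-specific transducer would
do, here is a Python sketch that folds standalone punctuation tokens
back into the preceding token. Real SMAF is standoff XML and the rules
would have to be far more careful (opening quotes, for instance, belong
to the following token), but the span arithmetic is the same; the data
is hand-written.

    # Tokens are (start, end, form) triples over the original string.
    PUNCT = set('.,;:!?"')

    def merge_punct(tokens):
        merged = []
        for start, end, form in tokens:
            # a pure-punctuation token directly adjacent to the previous
            # token is folded into it, ERG-style
            if merged and all(c in PUNCT for c in form) and merged[-1][1] == start:
                prev_start, _, prev_form = merged[-1]
                merged[-1] = (prev_start, end, prev_form + form)
            else:
                merged.append((start, end, form))
        return merged

    print(merge_punct([(0, 4, "then"), (5, 9, "left"), (9, 10, ".")]))
    # -> [(0, 4, 'then'), (5, 10, 'left.')]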
> I understand there are people currently in a similar
> situation (trying to RMRSify the BNC and so on).
>
> What sort of frontends are currently being used when
> it comes to semantic analysis of "real" data?
>
> Any help would be greatly appreciated! Thanks!
>
> Richard Bergmair
>
> P.S.: Since this is my first post to the list, I should
> introduce myself: I'm a new PhD student of Ann Copestake,
> and I will be working on RMRS-based entailment/similarity.
>
>
>
> Notes:
>
>  [1]  My hardware is a Fedora Core 6 Linux on i64.
>
>  [2]  Currently, I have the following software set up:
>
>       - about a dozen versions of ERG + lkb code,
>         most recent one is the DEC-14 build.
>       - PET 0.99.12, the PET currently in the SVN,
>         including ECL, MRS code and other deps
>       - RASP3, RASP2
RASP2: no RMRS. I don't know whether RASP3 is compatible
with RASP1, for which the RaspModule in HoG was written.
>       - the most recent HoG
>
>       All of the components compile, and seem to work
>       as designed. PET/ERG produce RMRS output, and if
>       given a pet-input-chart also work with unknown words.
>
>  [3]  I'd generally like to keep the bulk of my own code
>       in Python, but I have an Allegro Common Lisp 8.0
>       build environment available as well.
>
So you will be happy with my (admittedly quick-and-dirty) Python scripts.
There is also a HoG list hosted at DELPH-IN where you could post
HoG-specific questions (Cc'ed).

-Uli

-- 
Ulrich Schaefer
Senior Software Engineer
Language Technology Lab
German Research Center for Artificial Intelligence (DFKI)
Stuhlsatzenhausweg 3
D-66123 Saarbruecken
phone +49 681 302 5154
http://www.dfki.de/~uschaefer
