[developers] About the German Grammar

Tue May 19 14:06:25 CEST 2009

Dear Michael, 

I am cc'ing the developers list, since your question may be of interest
to other DELPH-IN members as well.  

On Mon, 2009-05-18 at 18:52 +0200, Peter Adolphs wrote:
> Hi Michael!
> 
> Sorry for the late reply. I've been off the last few days.
> 
> Michael Wayne Goodman wrote:
> > As a project for Emily Bender's seminar this term, I will be doing
> > some tests with various Delph-In grammars. I would like to use the GG
> > for this, and, as I understand it, you are the current maintainer of
> > the GG?
> 
> I'm rather the current contact at DFKI for the GG since Berthold is currently
> doing practically all the GG development.
> 

Essentially correct. Despite rumours to the contrary, I am still
maintaining the GG. 

> > 1. What kind of test suite do you have? Is it all in the SVN
> > repository?
> 
>  There are two repositories:
> http://gg.opendfki.de/repos/gg/trunk (http://gg.opendfki.de/browser)
> http://svn.emmtee.net/trunk/dfki/gg

> Berthold commits to the emmtee.net repository and sometimes transfers proven
> code to the DFKI repository.
> 

The LOGON repository is the active development branch. 
The reason behind this is that most of my recent work was done in
conjunction with my baby transfer grammars deen and ende.  

I regularly update the DFKI branch as well. I shall do the next sync in
due time.

> > I'm looking for 1500 or more sentences (they don't have to
> > be treebanked, just processable with PET and the LKB).
> 
> We have some treebank skeletons, e.g. Eiche (transcripts of spoken language,
> from Verbmobil), NEGRA (newspaper), TSNLP, etc. I have to check the licenses,
> but I'm sure I can send you the skeletons.
> 

There are several treebanks at the moment.  

1. Regression test suites

The MRS treebank tends to be in sync with the latest grammar since it is
used for the GG and DeEn web demos as well. 

I should have versions of Babel and TSNLP test suites here in Bonn that
have been produced with recent versions of the grammar. I shall talk to
my student assistant to update them once more, then commit them as part
of the grammar. 

2. Stephan Oepen has produced a MaxEnt model based on the TiGer treebank
as distributed in the CoNLL shared task. This model is in sync with a
fairly recent version of the GG (only bug fixes since then). I am not
sure whether we can distribute the model with the grammar. Stephan?

The grammar currently has a coverage of around 30% on TiGer. 

3. I had some treebanking of TiGer data done as part of my Checkpoint
project back in late 2007. Hans Uszkoreit has kindly given me permission
to use these annotated data to provide updated MaxEnt models for the
grammar. The realisation ranking model trgp.mem distributed with the
grammar was based on that treebank. 

The language model that comes with it was trained on the deWaC web
corpus.  

4. There is an extensive treebank of Verbmobil data created back in
2005/2006. See my 2008 ACL PaGe paper for details. Unfortunately, the
grammar has seen some major changes since 2005/2006 when that treebank
was created, most notably a treatment of punctuation, a more elegant
(and efficient) approach to adjectival inflection, and an underspecified
treatment of scopal modification in the clausal domain. 

It may be possible to migrate the Verbmobil treebank to the current
version of GG. However, without some additional support on the side of
the Redwoods update mechanism I am not sure this can be done in a
time-effective (semi-automatic) way. What would be needed is either

a) a conversion from tree discriminators to semantic discriminators,
which I expect to be more robust to the changes I have introduced
in the meantime, or

b) a way to specify rules to ignore (inflectional layer for punctuation
rules) or rules to match sloppily (h-adjunct -> h-adjunct, v-isect, or
v-scop). 
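
For illustration only, option b) could be sketched roughly as follows in
Python. This is not part of the Redwoods update mechanism; the flat-list
representation of a derivation and the rule names punct-infl and subj-h
are assumptions made up for the example.

```python
# Hypothetical sketch of "ignore" and "sloppy match" rules when comparing
# an old derivation with a new one. A derivation is simplified here to a
# flat list of rule names, which is an assumption for illustration.

# Rules to skip entirely (assumed name for an inflectional punctuation rule).
IGNORED_RULES = {"punct-infl"}

# Old rule name -> new rule names accepted as its counterpart.
SLOPPY_MATCHES = {
    "h-adjunct": {"h-adjunct", "v-isect", "v-scop"},
}


def normalise(rules):
    """Drop ignored rules from a sequence of rule names."""
    return [r for r in rules if r not in IGNORED_RULES]


def rules_match(old_rule, new_rule):
    """True if new_rule is an acceptable counterpart of old_rule."""
    return new_rule in SLOPPY_MATCHES.get(old_rule, {old_rule})


def derivations_match(old_rules, new_rules):
    """Compare two rule sequences after dropping ignored rules."""
    old, new = normalise(old_rules), normalise(new_rules)
    return len(old) == len(new) and all(
        rules_match(o, n) for o, n in zip(old, new)
    )
```

With these tables, an old derivation using h-adjunct would still be
accepted against a new one that uses v-scop and carries an extra
punctuation rule.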

Maybe Stephan has some ideas here. 

Finally, the VM treebank has never been released officially by DFKI,
although Hans and I had agreed in 2007 that we wanted to make the
treebank available soon. The considerable delay is mainly due to the
fact that we needed to move to the punctuation-aware grammar in
Checkpoint, yet didn't have the time to do the major upgrade. 
However, I'd be happy to coordinate any upgrade effort, in case there
is some general interest. 

5. We once had some CLEF QA parallel treebanks for English and German.
Annotation was done at DFKI in 2006 (if I remember the date correctly).
Anyway, these treebanks are fairly small, so it might be quicker to just
reannotate from scratch. Maybe Micha Jellinghaus has some data already
annotated with the current grammar, as part of his PhD project. 

To summarise:

Depending on what you need, there are several options: 

1. Regression testing:

I can provide MRS, Babel and TSNLP treebanks and models. 

2. Parsing newspaper text (TiGer)

Try the model created by Stephan, if possible. 

I should check how much effort it will be to update the treebank created
in Checkpoint and possibly provide an updated version of that. 

3. Generation: 

The tgrp.g.mem model still performs quite well. 
I can probably also provide some models based on the regression test
suites shortly. 

> > 2. How granular are repository commits? Eg. Jacy is pretty good at
> > committing a new version for every small change, while the ERG tends
> > to get a new version for large or multiple changes.

> You had better ask Berthold about that, since I haven't made any major changes
> to the grammar so far.
> 

Pretty regularly. Minor fixes actually make it into the LOGON branch
quite quickly. Micha Jellinghaus has done some recent testing of the MRS
output and I tend to provide fixes within a few days (if it isn't
anything that requires major changes to the grammar). 

> > 3. I'm doing automatic error detection, so I need to be able to ask a
> > grammar maintainer if the errors I find are valid. Could I ask you
> > these questions?
> 
> Yes, Berthold and me, please.
> 

I'd be pleased to get your bug reports.  

Hope this helps. 

Berthold

> Please be aware that I won't have much time to answer your mails until the end
> of May, since we have a project deadline then.
> 
> Cheers,
> 
> Peter
>