[developers] Treebank corpus reader for NLTK

Tue Jul 5 23:02:55 CEST 2011

Hi developers,

I was originally going to send this email only to those who showed
interest at the summit, but since we agreed to use the mailing lists
more, I'm sending it here instead.

Someone brought up the idea of creating a corpus reader for NLTK (the
Natural Language ToolKit; nltk.org) for loading and interacting with
HPSG-style treebanks. I briefly looked into how that might be
accomplished, and I have some questions for you.

I imagine that, at a minimum, we would like to load derivation trees (or
just POS-labeled trees?). There is a BracketParseCorpusReader that
loads parse trees in the form of (S(NP(N dogs))(VP(V bark))), and we
could extend that class to handle our derivation trees. The
AlpinoCorpusReader provides an example of how to extend that class.
But as far as I can tell, BracketParseCorpusReader only accommodates
two fields per node (a tag and either a word or child nodes, as in
("N", "dogs")), so would we want to throw away other information like
node id, score, start, and end?
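
To make the trade-off concrete, here is a rough sketch in plain Python
(no NLTK calls). It assumes derivation nodes are written as
(id label score start end daughters...) with leaves as ("form"), which
is my reading of the [incr tsdb()] derivation layout and may not match
every export. DerivNode, parse_derivation, and to_bracketed are
hypothetical names, not part of NLTK or any existing tool. The sketch
parses such a string while keeping every field, then flattens it to the
two-field bracketed form that BracketParseCorpusReader appears to
expect.

```python
import re
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DerivNode:
    """One derivation node. Field names are illustrative, not an NLTK API."""
    id: int
    label: str
    score: float
    start: int
    end: int
    children: List[Union["DerivNode", str]] = field(default_factory=list)


# Tokens: parentheses, double-quoted surface forms, or bare atoms.
_TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')


def parse_derivation(text: str) -> DerivNode:
    """Parse '(id label score start end daughters...)' (assumed layout)."""
    tokens = _TOKEN.findall(text)
    pos = 0

    def expect(tok: str) -> None:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise ValueError(f"expected {tok!r} at token {pos}")
        pos += 1

    def parse_node() -> Union[DerivNode, str]:
        nonlocal pos
        expect("(")
        # A leaf is a quoted surface form: ("dogs")
        if pos < len(tokens) and tokens[pos].startswith('"'):
            form = tokens[pos][1:-1]
            pos += 1
            expect(")")
            return form
        if pos + 5 > len(tokens):
            raise ValueError("truncated node")
        ident, label, score, start, end = tokens[pos:pos + 5]
        node = DerivNode(int(ident), label, float(score), int(start), int(end))
        pos += 5
        while pos < len(tokens) and tokens[pos] == "(":
            node.children.append(parse_node())
        expect(")")
        return node

    root = parse_node()
    if isinstance(root, str):
        raise ValueError("top-level node must not be a bare leaf")
    return root


def to_bracketed(node: Union[DerivNode, str]) -> str:
    """Flatten to two-field bracketing; id, score, start, end are dropped."""
    if isinstance(node, str):
        return node
    return "(" + node.label + " " + " ".join(to_bracketed(c) for c in node.children) + ")"


example = '(3 subj-head 1.5 0 2 (1 dogs_n 0.5 0 1 ("dogs")) (2 bark_v 0.7 1 2 ("bark")))'
root = parse_derivation(example)
print(to_bracketed(root))  # (subj-head (dogs_n dogs) (bark_v bark))
```

The flattened string is something a bracket-style reader could load,
but the id, score, and span fields survive only on the DerivNode
objects, so keeping them would mean extending the reader rather than
just converting the input. The example string and its rule names are
made up for illustration.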

We probably also want to load the semantics and pair it with a
derivation. I don't see an obvious way of doing this with the current
machinery, so I will ask about it later on nltk-devel at lists.sourceforge.net.

Other questions are:

1. From what format should we attempt to load the treebanks? [incr
tsdb()] profiles? An exported tree-only format? Something else?

2. What kind of information do we want to be made available? Parse
trees, semantics, other?

3. Can we submit a treebank (Redwoods, Hinoki, other?) to the NLTK
corpora? This would increase exposure, but it would be harder to
ensure people use up-to-date versions.

Feel free to respond even if you were not at the summit.

Thanks

-- 
-Michael Wayne Goodman