[developers] Treebank corpus reader for NLTK

Wed Jul 6 02:24:00 CEST 2011

I haven't used NLTK in a few years now, and I think the answers to some
of these questions depend on what people want to do with the corpus
after it is read. But having played a fair bit recently with
processing treebank data in various formats, I'd definitely say read
directly from [incr tsdb()] profiles. Having an extra export step just
makes it less likely the data will get used or updated, and the profiles
are a (reasonably) stable format that we are all used to.

Rebecca

On 06/07/11 07:02, Michael Wayne Goodman wrote:
> Hi developers,
>
> I was originally going to send this email only to those who showed
> interest at the summit, but since we agreed to use the mailing lists
> more I'm doing so here.
>
> Someone brought up the idea of creating a corpus reader for NLTK (the
> Natural Language ToolKit; nltk.org) for loading and interacting with
> HPSG-style treebanks. I briefly looked into how that might be
> accomplished, and I have some questions for you.
>
> I imagine that we, at least, would like to load derivation trees (or
> just POS-labeled trees?). There is a BracketParseCorpusReader that
> loads parse trees in the form of (S(NP(N dogs))(VP(V bark))), and we
> could extend that class to handle our derivation trees. The
> AlpinoCorpusReader provides an example of how to extend that class.
> But as far as I can tell, BracketParseCorpusReader only accommodates
> two fields per node (tag and word/node, as in ("N", "dogs")), so would
> we want to throw away other information like node id, score, start and
> end?
>
> We probably also want to load the semantics and pair it with a
> derivation. I don't see an obvious way of doing this with the current
> machinery, so I will later ask on nltk-devel at lists.sourceforge.net.
>
> Other questions are:
>
> 1. From what format should we attempt to load the treebanks? [incr
> tsdb()] profiles? An exported tree-only format? Something else?
>
> 2. What kind of information do we want to be made available? Parse
> trees, semantics, other?
>
> 3. Can we submit a treebank (Redwoods, Hinoki, other?) to the NLTK
> corpora? This would increase exposure, but it would be harder to
> ensure people use up-to-date versions.
>
> Feel free to respond even if you were not at the summit.
>
> Thanks
>
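
To make the two-field limitation raised in the quoted message concrete, here is a minimal sketch in plain Python. It does not use NLTK's reader classes. It parses the bracketed form (S(NP(N dogs))(VP(V bark))) into nested lists, then shows one way the extra per-node fields (id, score, start, end) could be kept rather than thrown away. The names parse_sexp, Node and to_node are hypothetical, and the node layout "(id label score start end daughters...)" is an assumption made for illustration, not a statement of the exact derivation format stored in the profiles.

```python
import re
from dataclasses import dataclass, field

# Tokens are "(", ")", or any run of characters that is neither
# whitespace nor a parenthesis.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_sexp(text):
    """Parse one bracketed expression into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # consume the closing ")"
            return items
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the first expression")
    return tree


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    meta: dict = field(default_factory=dict)  # id, score, start, end


def to_node(sexp):
    """Convert a parsed expression into a Node.

    Assumed (hypothetical) layout for derivation nodes:
        (id label score start end daughter...)
    Anything else is treated as a plain (label daughter...) node.
    """
    if isinstance(sexp, str):
        return Node(label=sexp)
    head = sexp[:5]
    if len(head) == 5 and all(isinstance(x, str) for x in head):
        node_id, label, score, start, end = head
        meta = {"id": int(node_id), "score": float(score),
                "start": int(start), "end": int(end)}
        return Node(label, [to_node(c) for c in sexp[5:]], meta)
    label, *rest = sexp
    return Node(label, [to_node(c) for c in rest])


plain = parse_sexp("(S(NP(N dogs))(VP(V bark)))")
rich = to_node(parse_sexp(
    "(1 S 0.9 0 2 (2 N 0.5 0 1 (dogs)) (3 V 0.4 1 2 (bark)))"))
```

In this sketch the plain tree carries only a label and daughters per node, while the second form keeps the additional fields in a side dictionary. Whether a reader built on BracketParseCorpusReader could expose such a dictionary is exactly the open question in the thread.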