[developers] Treebank corpus reader for NLTK

Thu Jul 7 11:24:57 CEST 2011

G'day,

Thanks, Michael, for taking the initiative with this (I answer some
of your questions further down).

On 6 July 2011 08:24, Rebecca Dridan <bec.dridan at gmail.com> wrote:
> I haven't used NLTK in a few years now, and I think the answers to some of
> these questions depend on what people want to do with the corpus after it is
> read, but having played a fair bit recently with processing treebank data
> in various formats, I'd definitely say read in from [incr tsdb()] profiles.
> Having an extra step of exporting just makes it less likely the data will
> get used or updated, and the profiles are a (reasonably) stable format that
> we are all used to.

I agree that straight profiles are the way to go if at all possible.
One potential issue is that a profile can only store one form of MRS
(as far as I know), and we may want more than one in the corpus.

> Rebecca
>
>> 1. From what format should we attempt to load the treebanks? [incr
>> tsdb()] profiles? An exported tree-only format? Something else?

From profiles if possible (and it would be nice if we can get all of
the information from -tsdbdump).
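
To make "reading from profiles" a bit more concrete, here is a rough
Python sketch, not a proposed NLTK API. It assumes that each table in a
profile is a plain text file with one record per line and fields
separated by "@", and that the field names come from the profile's
relations file. The function name read_records and the two field names
used in the sample are only illustrative, and escaping of special
characters inside field values is ignored.

```python
from typing import Dict, Iterable, Iterator, List

# Assumption: fields in a profile table file are separated by "@".
FIELD_SEP = "@"


def read_records(lines: Iterable[str], fields: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty line, pairing each value with its field name.

    `fields` is supplied by the caller; in a real reader it would be
    taken from the profile's schema rather than hard-coded.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        values = line.split(FIELD_SEP)
        if len(values) != len(fields):
            raise ValueError(
                f"line {lineno}: expected {len(fields)} fields, got {len(values)}"
            )
        yield dict(zip(fields, values))


# Hypothetical two-field sample; real tables have many more fields.
sample = ["1@The dog barks.", "2@Dogs bark."]
records = list(read_records(sample, ["i-id", "i-input"]))
```

A corpus reader built this way could open the table files in a profile
directory directly, so no separate export step would be needed.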

>> 2. What kind of information do we want to be made available? Parse
>> trees, semantics, other?

Parse trees, derivation trees (possibly in the same structure), and
MRS (possibly as dependencies).

>> 3. Can we submit a treebank (Redwoods, Hinoki, other?) to the NLTK
>> corpora? This would increase exposure, but it would be harder to
>> ensure people use up-to-date versions.

I would be happy to submit part or all of the Hinoki treebank (with
the proviso that it is a little out of date).

-- 
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University