[developers] Treebank corpus reader for NLTK

Tue Jul 19 01:32:52 CEST 2011

Hi Mike, Francis and Rebecca,

Picking up this thread somewhat belatedly.  I agree with Rebecca
that it would be very helpful to have some use cases in mind
before going further with design.  Here are a few fairly random
ideas, without regard for whether they are feasible within NLTK
or Python in general:

--- Linguistic search (perhaps people should just use Treebank
Search, but there are still things that even TS can't
do [yet])
--- Training statistical parsers, either based on derivation
trees or (in the style of Luke Zettlemoyer's work) MRSs and
surface strings only
--- Using the lexical types or labels as POS tags to get
POS tagged corpora
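
To make the last idea a bit more concrete, here is a minimal sketch,
assuming we already have the leaves of a derivation tree as
(surface form, lexical type) pairs.  The function name and the sample
data are hypothetical, and the type names are only illustrative; the
"word/TAG" layout follows the usual convention for tagged corpora.

```python
from typing import Iterable, Tuple


def to_tagged_line(leaves: Iterable[Tuple[str, str]]) -> str:
    """Join (surface form, lexical type) pairs into one 'word/TAG' line."""
    return " ".join(f"{form}/{lextype}" for form, lextype in leaves)


# Hypothetical leaves of one derivation tree (illustrative type names).
leaves = [("the", "d_-_the_le"), ("dog", "n_-_c_le"), ("barks", "v_-_le")]
print(to_tagged_line(leaves))
```

A real reader would need to escape or otherwise handle surface forms
that themselves contain "/", which this sketch ignores.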

Emily

On Thu, Jul 7, 2011 at 2:24 AM, Francis Bond <bond at ieee.org> wrote:
> G'day,
>
> Thanks Michael for taking the initiative with this (and I answer some
> of your questions further down).
>
> On 6 July 2011 08:24, Rebecca Dridan <bec.dridan at gmail.com> wrote:
>> I haven't used NLTK in a few years now, and I think the answers to some of
>> these questions depend on what people want to do with the corpus after it is
>> read, but having played a fair bit recently with processing treebank data
>> in various formats, I'd definitely say read in from [incr tsdb()] profiles.
>> Having an extra step of exporting just makes it less likely the data will
>> get used or updated, and the profiles are a (reasonably) stable format that
>> we are all used to.
>
> I agree that straight profiles are the way to go if at all possible.
> One potential issue with this is that you can only store one form of
> MRS (as far as I know), and we may want more than one in the corpus.
>
>> Rebecca
>>
>>> 1. From what format should we attempt to load the treebanks? [incr
>>> tsdb()] profiles? An exported tree-only format? Something else?
>
> From profiles if possible (and it would be nice if we can get all of
> the information from -tsdbdump).
>
>>> 2. What kind of information do we want to be made available? Parse
>>> trees, semantics, other?
>
> Parse trees, derivation trees (possibly in the same structure), and MRS
> (possibly as dependencies).
>
>>> 3. Can we submit a treebank (Redwoods, Hinoki, other?) to the NLTK
>>> corpora? This would increase exposure, but it would be harder to
>>> ensure people use up-to-date versions.
>
> I would be happy to submit part or all of the Hinoki treebank (with
> the proviso it is a little out of date).
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>

-- 
Emily M. Bender
Associate Professor
Department of Linguistics
Check out CLMA on facebook! http://www.facebook.com/uwclma