[developers] New ERG with improved tokenization/preprocessing for PET

Francis Bond fcbond at gmail.com
Mon Apr 13 04:59:01 CEST 2009


G'day,

> I am pleased to announce a new release of the ERG, tagged '0902' (year/month -- it's taken us a little while to package and test it :).  This release benefits greatly from a new tokenization component of the ERG, mainly developed by Stephan Oepen, making use of the newly available chart-mapping facility that Peter Adolphs and Stephan have been developing.  One immediate benefit of this new facility is that when the grammar is run with PET, all of the preprocessing (tokenization adjustments, punctuation, dates, measure-NPs, numbers, etc.) is now done internally, so it should be much easier to use the grammar/parser as a module in applications.  Unknown-word handling is also well supported now, including both proper names and open-class words, and assumes the TnT tagset, with generic entries created on the fly triggered by these tags for verbs, nouns, adjectives, and adverbs.  Characterization information is now preserved in the MRSs produced, and contentful predicate names are introduced for unknown words.  For emerging documentation on this new facility, see
> wiki.delph-in.net/PetInput

That sounds fantastic.

A couple of comments.

I think that should be wiki.delph-in.net/moin/PetInput

I think there may be an issue with TnT: as far as I know, TnT is
not open, so some DELPH-IN members will not be able to use the
unknown-word handling.  For example, I am fairly sure I don't have a
local license for TnT, and you (probably) also need a WSJ license
for the model (which I do have).  Does everything apart from the
unknown-word handling work without TnT?

> In this release, you'll find some additional treebank profiles in the 'gold' subdirectory, namely the first four sections of the emerging WeScience corpus which consists of 100 Wikipedia articles on computational linguistics.  We expect to complete the treebanking of the full corpus (now 25% complete) by this summer.

Let me be the first to say Wheeeeeeeee!


-- 
Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
NICT Language Infrastructure Group



