arademaker at gmail.com
Tue Jun 11 03:56:49 CEST 2019
Thank you, and I am sorry: I asked before reading the article… Oops! ;-)
I am considering following the ideas from http://moin.delph-in.net/WeScience to build another gold-standard dataset from a subset of Wikipedia pages in a different domain, in particular Petroleum Geology (https://en.wikipedia.org/wiki/Category:Petroleum_geology).
The question is how much of this category was already present in the 2008 Wikipedia release. It would be nice to reproduce the whole processing pipeline on the latest Wikipedia dump. Is that possible? The tokeniser described in the papers (in particular http://www.delph-in.net/wescience/Ytrestol:09.pdf) is no longer available… If one can produce new GML files from the current Wikipedia dump, the next question is how to process them with ACE or PET. I am assuming ACE is currently the best option, right?
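Given the ⌊δ ... δ⌋ delimiters Stephan describes below, a minimal sketch of how one might recover article names from GML text, assuming the tags directly enclose the article title (the sample string is hypothetical, not taken from the actual corpus):

```python
import re

# Hypothetical sample of WikiWoods GML text; per the reply below,
# article names into the corresponding Wikipedia dump are encoded
# with ⌊δ ... δ⌋ tags.
sample = "⌊δPetroleum geologyδ⌋ Petroleum geology is the study of ..."

# Non-greedy match of the shortest span between the delta delimiters.
ARTICLE_TAG = re.compile(r"⌊δ(.*?)δ⌋")

def article_names(gml_text):
    """Return all article names encoded with ⌊δ ... δ⌋ tags."""
    return ARTICLE_TAG.findall(gml_text)

print(article_names(sample))  # → ['Petroleum geology']
```

With names extracted this way, sentences in the corpus could in principle be grouped back by their source page, which is what the original question asks about.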
> On 10 Jun 2019, at 20:03, Stephan Oepen <oe at ifi.uio.no> wrote:
> hi alexandre,
> which specific version of WikiWoods are you looking at?
> starting from 1212 (i.e. the larger and cleaner GML version of the
> texts), article names (into the corresponding wikipedia dump) should
> be encoded using ⌊δ ... δ⌋ tags.
> best wishes, oe
> On Mon, Jun 10, 2019 at 11:58 PM Alexandre Rademaker
> <arademaker at gmail.com> wrote:
>> Does anyone know if in the files from http://moin.delph-in.net/WikiWoods corpus we can identify the original wikipedia page of each sentence? That is, can we reconstruct the text of the wikipedia page?
>> Alexandre Rademaker