[developers] Parsing linux forum data

Andrew MacKinlay admackin at gmail.com
Tue Nov 17 16:30:28 CET 2009


Hi,

Following on from my previous post, my more general task is parsing
Linux forum data, as some of you may be aware.

Ann suggested I might probe the DELPH-IN hive mind for any suggestions
on dealing with this kind of data, e.g. recommendations for
POS-tagging, sentence splitting, etc.

Posts are, as you might imagine, wildly varying in quality and contain
combinations of parseable text (which may contain entities such as
URLs that we would like to treat as atomic) and console output. For
example, here is a randomly selected post that shows some of these
features:


=============
This is my second run at this problem (history is at  http://www.linuxquestions.org/questi...hreadid=3109) 
.

After recompiling kernel 2.4 and doing all the 'make' options, I ran  
'lilo'.  -
I have one problem and three questions:
-
PROBLEM:
When I run lilo, get the following:
-
  Warning: device 0x0306 exceeds 1024 cylinder limit.
Fatal: Sector 51220658 too large for linear mode
(try 'lba32' instead) .
-
Questions:
1.) How do I find out what device 0x0306 is?
2.) How do I find out what is on sector 51220658 and why the system  
says it is too big?
3.)  I am booting from a floppy, which doesn't seem to care
about device 0x0306 or sector 51220658. Why isn't this a problem using  
a floppy?
Thanks to the guys that helped me the first time around and many  
thanks for taking another look. Hope I don't seem to be stupid or  
thankless, but, as yet, I don't understand.
==============
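
To make the two issues concrete, here is a minimal sketch of the kind of
pre-processing such a post seems to need before parsing: replacing URLs
with a single placeholder token, and setting aside lines that look like
console output. Everything in it is an assumption made for illustration:
the names (mask_urls, looks_like_console, preprocess), the _URL_
placeholder and the two regular expressions are hypothetical, and the
console heuristic only catches lines that start with "Warning:",
"Fatal:" or "Error:", so continuation lines such as "(try 'lba32'
instead)" would still slip through as ordinary text.

```python
import re

# Hypothetical patterns, chosen only for illustration.
# A URL runs up to the next whitespace or closing parenthesis.
URL_RE = re.compile(r"https?://[^\s)]+")
# Crude console heuristic: the line starts with a typical message prefix.
CONSOLE_RE = re.compile(r"^(warning|fatal|error):", re.IGNORECASE)


def mask_urls(line):
    """Replace each URL with the token _URL_; return (new_line, urls)."""
    urls = URL_RE.findall(line)
    return URL_RE.sub("_URL_", line), urls


def looks_like_console(line):
    """Return True if the line matches the console-output heuristic."""
    return bool(CONSOLE_RE.match(line))


def preprocess(post):
    """Return a list of (kind, text, urls) tuples, one per content line.

    kind is "console" or "text"; blank lines and lone "-" separator
    lines are dropped.
    """
    rows = []
    for raw in post.splitlines():
        line = raw.strip()
        if not line or line == "-":
            continue
        if looks_like_console(line):
            rows.append(("console", line, []))
        else:
            masked, urls = mask_urls(line)
            rows.append(("text", masked, urls))
    return rows
```

Only the "text" rows would then be passed on to sentence splitting and
tagging, with the original URLs kept alongside so they can be restored
afterwards.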

If anyone has any advice they could offer on the basis of experiences  
they've had with similar data in the past, it would be gratefully  
received.

Thanks,
Andy
  


