[developers] pyDelphin / [incr tsdb()] question
arademaker at gmail.com
Sat Apr 13 16:55:36 CEST 2019
No problem! I am sorry for so many typical newcomer questions, which often deviate from the original conversation.
What caught my attention is this new perspective on the kind of data the profiles can hold. So far, I was taking profiles to be clean data, composed of sentences for me to process, typically imported from a text file with one sentence per line. That is, we should first clean the data before we create the profile.
This view imposes a separation between a profile (used for treebanking, grammar debugging, etc.), where the fundamental units are sentences, and a format to be used in a document-processing pipeline, where we typically want to preserve the original document structure and attach annotations to the relevant segments.
But your message opens a new perspective. For example, I am constantly dealing with text extracted from PDF documents. Following the way you worked with mbox files, I could also think about ingesting the text file obtained from the PDF-to-text tools directly, preserving all the content but using the i-length field to mark which sentences are to be processed and which are to be ignored. I am not sure what the benefits of that approach would be, besides having the whole data preserved in the profile. In any case, preprocessing will still be necessary to split the data into sentences.
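To make the idea concrete, here is a minimal sketch in plain Python. It does not use pyDelphin's API; the dictionaries only borrow the item field names i-id, i-input and i-length. The `looks_linguistic` heuristic, the whitespace word count, and the choice of 0 (rather than -1) for noise lines are all assumptions made for illustration.

```python
import re


def looks_linguistic(line):
    """Hypothetical heuristic: at least three alphabetic words and
    sentence-final punctuation. A real pipeline would need something better."""
    words = re.findall(r"[A-Za-z]+", line)
    return len(words) >= 3 and line.rstrip().endswith((".", "?", "!"))


def make_items(lines):
    """Keep every line as an item; noise lines get i-length 0 (assumed value)."""
    items = []
    for i, line in enumerate(lines, start=1):
        text = line.strip()
        # Whitespace token count stands in for the 'linguistic length'.
        length = len(text.split()) if looks_linguistic(text) else 0
        items.append({"i-id": i, "i-input": text, "i-length": length})
    return items


def items_to_parse(items):
    """Flag-like use of i-length: skip items without linguistic content."""
    return [item for item in items if item["i-length"] > 0]


# Lines as they might come out of a PDF-to-text tool (made-up data).
raw = [
    "Page 3 of 12",
    "The dog barked at the mailman.",
    "----------",
    "Did the mailman run away?",
]
items = make_items(raw)
selected = items_to_parse(items)
```

All four lines stay in `items`, so the original content is preserved, while only the items with a positive i-length would be handed to the parser.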
So far, to deal with this sort of content, I was looking at markup schemes based on standoff annotation, such as XCES (https://en.wikipedia.org/wiki/XCES).
> On 11 Apr 2019, at 21:09, Stephan Oepen <oe at ifi.uio.no> wrote:
> my apologies for the opaque inside joke, alexandre! some of us used to work at a start-up (called YY Technologies) a few years ago, developing the premier email auto response solution at the time. hence support for mbox files was relevant at least back then :-). and the ERG treebanks still include four profiles with e-commerce emails.
> regarding interpretation of the i-length field, i would maybe argue that the two use cases ultimately have the same meaning: the field quantifies the length of linguistic content; when there is nothing worth parsing in an item, its ‘linguistic length’ is zero (or maybe -1, not quite sure just now). but, yes, this results in a flag-like behavior for the processing commands: skipping over (what you might call ‘noise’) items which lack linguistic content.
> best, oe