[developers] pyDelphin / [incr tsdb()] question

goodman.m.w at gmail.com goodman.m.w at gmail.com
Sat Apr 13 08:36:44 CEST 2019


Oops, I accidentally replied off-list to Stephan and Alexandre. I'll
summarize my responses (no longer inline to the main thread; sorry!)

Alexandre wrote:
> Is this the generic python split method from strings? Is it safe? What
> about possible non whitespace exceptions on the tokenisation?

It is. This is just an ad hoc fix for the Xigt codebase to keep Emily's
group moving along with their research. We're hoping to make a more robust
version eventually.
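
A minimal sketch of what a whitespace-split length computation amounts to,
assuming i-length is simply the token count of i-input. The function name
i_length is hypothetical and is not taken from the Xigt or pyDelphin code:

```python
def i_length(i_input: str) -> int:
    """Approximate i-length as the number of whitespace-separated tokens."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so an empty or all-whitespace
    # input yields zero tokens.
    return len(i_input.split())


print(i_length("The dog barks."))  # punctuation stays attached to "barks."
print(i_length("   "))             # no tokens at all
```

Note that this counts "barks." as a single token, so the result can differ
from what a grammar's own tokenizer would produce.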

Stephan wrote:
> this results in a flag-like behavior for the processing commands:
> skipping over (what you might call ‘noise’) items which lack linguistic
> content.

I thought that's what i-wf = 2 is for?
http://moin.delph-in.net/ItsdbReference#Well_Formedness_.28i-wf.29
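
To make the two candidate skip conditions concrete, here is a rough sketch.
It assumes that i-wf = 2 marks items that should not be processed (the
reading suggested above, not verified here) and that processing requires a
positive i-length (as Stephan describes). The dict representation, the
helper names, and the sample rows are all made up for illustration:

```python
# Hypothetical item rows; the field names follow the item relation, but
# the values and the dict layout are illustrative only.
items = [
    {"i-id": 1, "i-input": "From: someone@example.com", "i-wf": 2, "i-length": -1},
    {"i-id": 2, "i-input": "Where is my order?", "i-wf": 1, "i-length": 4},
    {"i-id": 3, "i-input": "", "i-wf": 1, "i-length": 0},
]


def has_linguistic_content(item):
    # Stephan's description: processing requires a non-zero, positive i-length.
    return item["i-length"] > 0


def marked_processable(item):
    # Assumption: i-wf = 2 flags items that should be skipped.
    return item["i-wf"] != 2


to_process = [
    item["i-id"]
    for item in items
    if has_linguistic_content(item) and marked_processable(item)
]
print(to_process)
```

With these sample rows both conditions agree on item 1, while item 3 is
skipped only by the i-length test.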

On Fri, Apr 12, 2019 at 8:09 AM Stephan Oepen <oe at ifi.uio.no> wrote:

> my apologies for the opaque inside joke, alexandre!  some of us used to
> work at a start-up (called YY Technologies) a few years ago, developing the
> premier email auto response solution at the time.  hence support for mbox
> files was relevant at least back then :-).  and the ERG treebanks still
> include four profiles with e-commerce emails.
>
> regarding interpretation of the i-length field, i would maybe argue that
> the two use cases ultimately are the same meaning: the field quantifies the
> length of linguistic content; when there is nothing worth parsing in an
> item, its ‘linguistic length’ is zero (or maybe -1, not quite sure just
> now).  but, yes, this results in a flag-like behavior for the processing
> commands: skipping over (what you might call ‘noise’) items which lack
> linguistic content.
>
> best, oe
>
>
> On Thu, 11 Apr 2019 at 19:31 Alexandre Rademaker <arademaker at gmail.com>
> wrote:
>
>> Hi Stephan,
>>
>> > On 11 Apr 2019, at 13:52, Stephan Oepen <oe at ifi.uio.no> wrote:
>> >
>> > hi emily and mike,
>> >
>> > the [incr tsdb()] import facilities support some mixed-content document
>> > formats, notably the un*x mbox format (you can guess when that was useful
>> > functionality).
>>
>> Sorry, I don’t! Do you mean that mbox files can be imported to profiles
>> directly? So far, I always thought that a profile's items are all sentences
>> or phrases subject to be analysed by a grammar.
>>
>> >  to represent all data in the profile while not pretending that there
>> > is linguistic content (worth sending to the parser) in email headers, the
>> > corresponding items are marked as i-length = -1 (or maybe 0, not quite
>> > sure).  this is the reason for the ‘Process | ...’ commands to require a
>> > non-zero, positive length ... in other words a reassurance that there
>> > actually is linguistic content in the item.  in this regard, i-length (like
>> > i-id, i-input, and possibly i-wf as well) is a mandatory field in the item
>> > relation.
>>
>> So i-length has two meanings, it is at the same time the length of the
>> input (in tokens) but also a flag. The -1 has special meaning, right?
>>
>> That is, what you are saying is that a profile can also accommodate noise
>> data and we can explicitly use i-length to mark which items are relevant
>> for processing. Is that right?
>>
>> Best,
>> Alexandre
>>
>>
>>

-- 
-Michael Wayne Goodman