[developers] top-level cfrom/cto values in xml always -1

Ben Waldron benjamin.waldron at cl.cam.ac.uk
Mon Oct 10 17:43:17 CEST 2005


Ann Copestake wrote:
>Of course I agree that cfrom and cto have to refer to the original document (by
>some notion of original) otherwise the standoff annotation idea does not work.
>But the idea that the global cfrom/cto might refer to some arbitrary `extra'
>stuff that extends beyond the boundaries of what is tokenised needs some work
>to make precise.  Can we say it refers to the boundaries of whatever the
>sentence splitter returns?  e.g., suppose we have a bit of a document:
>
><p> 
><b>Carp</b>: a fish.  Sometimes eaten.<br>But not oftenNevertheless it is
>tasty.
>
>and this gets processed by a sentence splitter to give:
>
>"<b>Carp</b>: a fish.  "  
>"Sometimes eaten."
>"But not often."
>"Nevertheless it is tasty."
>
>cto of sentence 1 is cfrom of sentence 2, but cto of sentence 2 is less than
>cfrom of sentence 3 because the <br> has been removed.  Sentence 3 has an
>additional . which wasn't in the original (assume the sentence splitter is
>doing some regularisation - I can give a more convincing example if necessary
>but I really hope nobody is going to argue that we should disallow this in
>principle), so the `often' has a cto which is the same as the period's cfrom
>and cto.  Basically, if we go down this route, it means we have to be more
>formal about specifying what the sentence splitter does, not because we expect
>all sentence splitters to behave in the same way (we don't) but because we want
>the cfrom and cto to be consistent with different implementations of the same
>sentence splitting algorithm.  
>
>Is this generally agreed?
>  
I would say that, ideally, the standoff pointers (cfrom/cto) should at 
all stages be grounded in the original document. Supposing this original 
document is the following UTF-8 string:
<p><b>Carp</b>: a fish.  Sometimes eaten.<br>But not oftenNevertheless it is tasty.</p>

The output of the sentence splitter could then be:

"<b>Carp</b>: a fish.  "   = range(3,25)
"Sometimes eaten." = range(25,41)
"But not often." = range(45,58) + range(58,58)
"Nevertheless it is tasty." = range(58,83)

Given the above, ranges for the tokens generated from each split
sentence can be calculated. The "." in the third sentence is an
inserted character, with cfrom=58 and cto=58. Some parts of the source
text (e.g. "<br>" = range(41,45)) are not included in the output of the
sentence splitter.
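
To make the offsets concrete, here is a minimal sketch in Python (not the
LKB preprocessor code). It checks the ranges above against the original
string and derives token ranges for one sentence. The names DOC,
SENTENCES, check_grounding and token_ranges are made up for illustration,
the whitespace tokeniser is only a stand-in, and offsets are counted in
characters (which coincide with UTF-8 bytes here because the example is
pure ASCII).

```python
import re

# The original document; all cfrom/cto values index into this string.
DOC = "<p><b>Carp</b>: a fish.  Sometimes eaten.<br>But not oftenNevertheless it is tasty.</p>"

# Hypothetical splitter output: each sentence is a list of
# (text, cfrom, cto) segments. A segment with cfrom == cto is text
# inserted by the splitter that has no counterpart in DOC.
SENTENCES = [
    [("<b>Carp</b>: a fish.  ", 3, 25)],
    [("Sometimes eaten.", 25, 41)],
    [("But not often", 45, 58), (".", 58, 58)],
    [("Nevertheless it is tasty.", 58, 83)],
]


def check_grounding(doc, sentences):
    """Check that every non-inserted segment equals the DOC slice it points at."""
    for segments in sentences:
        for text, cfrom, cto in segments:
            if cfrom != cto:
                assert doc[cfrom:cto] == text, (text, cfrom, cto)


def token_ranges(segments):
    """Whitespace-tokenise each segment, returning (token, cfrom, cto)
    with offsets relative to the original document."""
    out = []
    for text, cfrom, cto in segments:
        for m in re.finditer(r"\S+", text):
            if cfrom == cto:
                # Inserted text: every token keeps the zero-width range.
                out.append((m.group(), cfrom, cto))
            else:
                out.append((m.group(), cfrom + m.start(), cfrom + m.end()))
    return out


check_grounding(DOC, SENTENCES)
print(token_ranges(SENTENCES[2]))
# [('But', 45, 48), ('not', 49, 52), ('often', 53, 58), ('.', 58, 58)]
```

Note that "often" ends at 58, which is also the cfrom and cto of the
inserted ".", as in Ann's description.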

Of course, the trick is in implementing this... I've implemented such
ranges for a preprocessor (an extension of the regex preprocessor) in
the LKB. I hope to announce this on the developers list soon (when the
code is presentable).

- Ben




