[developers] top-level cfrom/cto values in xml always -1
benjamin.waldron at cl.cam.ac.uk
Mon Oct 10 17:43:17 CEST 2005
Ann Copestake wrote:
>Of course I agree that cfrom and cto have to refer to the original document (by
>some notion of original) otherwise the standoff annotation idea does not work.
>But the idea that the global cfrom/cto might refer to some arbitrary `extra'
>stuff that extends beyond the boundaries of what is tokenised needs some work
>to make precise. Can we say it refers to the boundaries of whatever the
>sentence splitter returns? e.g., suppose we have a bit of a document:
><b>Carp</b>: a fish. Sometimes eaten.<br>But not oftenNevertheless it is tasty.
>and this gets processed by a sentence splitter to give:
>"<b>Carp</b>: a fish. "
>"Sometimes eaten."
>"But not often."
>"Nevertheless it is tasty."
>cto of sentence 1 is cfrom of sentence 2, but cto of sentence 2 is less than
>cfrom of sentence 3 because the <br> has been removed. Sentence 3 has an
>additional . which wasn't in the original (assume the sentence splitter is
>doing some regularisation - I can give a more convincing example if necessary
>but I really hope nobody is going to argue that we should disallow this in
>principle), so the `often' has a cto which is the same as the period's cfrom
>and cto. Basically, if we go down this route, it means we have to be more
>formal about specifying what the sentence splitter does, not because we expect
>all sentence splitters to behave in the same way (we don't) but because we want
>the cfrom and cto to be consistent with different implementations of the same
>sentence splitting algorithm.
>Is this generally agreed?
I would say that, ideally, the standoff pointers (cfrom/cto) should at
all stages be grounded in the original document. Supposing this original
document is the following UTF-8 string:
<p><b>Carp</b>: a fish.  Sometimes eaten.<br>But not oftenNevertheless it is tasty.</p>
Output of the sentence splitter could be
"<b>Carp</b>: a fish.  " = range(3,25)
"Sometimes eaten." = range(25,41)
"But not often." = range(45,58) + range(58,58)
"Nevertheless it is tasty." = range(58,83)
Given the above, ranges for tokens generated from each split sentence can be calculated. The "." in the third sentence ("But not often.") is an inserted character, with cfrom=58 and cto=58. Some parts of the source text (e.g. "<br>" = range(41,45)) are not included in the output of the sentence splitter.
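To make the ranges above concrete, here is a minimal sketch in Python (chosen only for illustration; it is not the LKB code). It assumes half-open character offsets into the original string, and it assumes the sample text has two spaces after "fish.", which is what the offsets above imply. The names Span, surface, extent and gaps are hypothetical.

```python
from typing import List, NamedTuple, Tuple

# Assumed original document. Offsets are half-open [cfrom, cto) character
# positions. Two spaces follow "fish." (an assumption implied by the ranges).
DOC = "<p><b>Carp</b>: a fish.  Sometimes eaten.<br>But not oftenNevertheless it is tasty.</p>"


class Span(NamedTuple):
    cfrom: int
    cto: int
    inserted: str = ""  # text added by the splitter; used only when cfrom == cto


def surface(doc: str, spans: List[Span]) -> str:
    """Rebuild a sentence from document slices plus inserted characters."""
    return "".join(
        s.inserted if s.cfrom == s.cto else doc[s.cfrom:s.cto] for s in spans
    )


def extent(spans: List[Span]) -> Tuple[int, int]:
    """Overall (cfrom, cto) of a sentence in the original document."""
    return min(s.cfrom for s in spans), max(s.cto for s in spans)


def gaps(sentences: List[List[Span]], start: int, end: int) -> List[Tuple[int, int]]:
    """Source ranges inside [start, end) that no sentence covers."""
    out, pos = [], start
    for cfrom, cto in sorted(extent(s) for s in sentences):
        if cfrom > pos:
            out.append((pos, cfrom))
        pos = max(pos, cto)
    if pos < end:
        out.append((pos, end))
    return out


# The four sentences from the example; the final "." of the third sentence
# is an inserted character with a zero-width range at offset 58.
SENTENCES = [
    [Span(3, 25)],
    [Span(25, 41)],
    [Span(45, 58), Span(58, 58, ".")],
    [Span(58, 83)],
]

if __name__ == "__main__":
    for sent in SENTENCES:
        print(extent(sent), repr(surface(DOC, sent)))
    print("uncovered:", [(g, DOC[g[0]:g[1]]) for g in gaps(SENTENCES, 0, len(DOC))])
```

Under these assumptions the uncovered ranges are the opening "<p>", the "<br>" at range(41,45), and the closing "</p>", and every non-inserted character of every sentence can be traced back to the original document.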
Of course, the trick is in implementing this... I've implemented such ranges for a preprocessor (extension of the regex preprocessor) in the LKB. I hope to announce this on the developers list soon (when the code is presentable).
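As an illustration of how a regex-based rewriting step might carry such ranges along, here is a rough Python sketch. It is not the LKB preprocessor (whose code is not shown in this message); it is one possible design, in which each working character keeps its own cfrom/cto, replaced text inherits the range of the text it replaces, and purely inserted text gets a zero-width range. The names ground and sub_grounded are hypothetical.

```python
import re
from typing import List, Tuple

# (character, cfrom, cto), with offsets into the original document
Char = Tuple[str, int, int]


def ground(text: str, base: int = 0) -> List[Char]:
    """Attach original-document offsets to each character of a slice."""
    return [(c, base + i, base + i + 1) for i, c in enumerate(text)]


def sub_grounded(pattern: str, repl: str, chars: List[Char]) -> List[Char]:
    """Regex substitution that keeps every output character grounded."""
    text = "".join(c for c, _, _ in chars)
    out: List[Char] = []
    pos = 0
    for m in re.finditer(pattern, text):
        out.extend(chars[pos:m.start()])  # untouched text keeps its ranges
        if m.end() > m.start():
            # Replacement characters cover the whole matched source range.
            cfrom, cto = chars[m.start()][1], chars[m.end() - 1][2]
        elif m.start() < len(chars):
            # Zero-width match: insertion before an existing character.
            cfrom = cto = chars[m.start()][1]
        else:
            # Zero-width match at the end: insertion after the last character.
            cfrom = cto = chars[-1][2] if chars else 0
        out.extend((c, cfrom, cto) for c in m.expand(repl))
        pos = m.end()
    out.extend(chars[pos:])
    return out


if __name__ == "__main__":
    # Regularise "But not often" (range(45,58)) by appending a period.
    sent = sub_grounded(r"$", ".", ground("But not often", 45))
    print(sent[-1])
    # Strip markup from a slice that starts at offset 3.
    stripped = sub_grounded(r"</?b>", "", ground("<b>Carp</b>: a fish.", 3))
    print(stripped[0], stripped[4])
```

With this scheme the appended period comes out with cfrom=58 and cto=58, matching the example above, and the characters that survive markup stripping still point at their original positions.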