[developers] top-level cfrom/cto values in xml always -1

Mon Oct 10 17:07:01 CEST 2005

Of course I agree that cfrom and cto have to refer to the original document (by
some notion of original); otherwise the standoff annotation idea does not work.
But the idea that the global cfrom/cto might refer to some arbitrary `extra'
stuff that extends beyond the boundaries of what is tokenised needs some work
to make precise.  Can we say it refers to the boundaries of whatever the
sentence splitter returns?  e.g., suppose we have a bit of a document:

Carp: a fish. Sometimes eaten. But not oftenNevertheless it is
tasty.

and this gets processed by a sentence splitter to give:

"Carp: a fish. " 
"Sometimes eaten."
"But not often."
"Nevertheless it is tasty."

cto of sentence 1 is cfrom of sentence 2, but cto of sentence 2 is less than
cfrom of sentence 3 because the space between them has been removed. Sentence 3 has an
additional . which wasn't in the original (assume the sentence splitter is
doing some regularisation - I can give a more convincing example if necessary
but I really hope nobody is going to argue that we should disallow this in
principle), so the `often' has a cto which is the same as the period's cfrom
and cto. Basically, if we go down this route, it means we have to be more
formal about specifying what the sentence splitter does, not because we expect
all sentence splitters to behave in the same way (we don't) but because we want
the cfrom and cto to be consistent with different implementations of the same
sentence splitting algorithm.
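
To make the offsets concrete, here is a minimal Python sketch of the example.
It is only an illustration under stated assumptions, not a proposal for how any
real splitter should be written. The assumptions: the document fragment is the
single line quoted above, taken exactly as it appears in this message; the
splitter keeps the space after sentence 1 but drops the one after sentence 2;
and characters added by regularisation get a zero-width span at the point
where they are inserted. The names Span, locate and inserted are made up for
the sketch.

```python
from dataclasses import dataclass

# Assumption: the document fragment on one line, exactly as quoted in this message.
ORIGINAL = "Carp: a fish. Sometimes eaten. But not oftenNevertheless it is tasty."


@dataclass
class Span:
    text: str   # text as returned by the (hypothetical) sentence splitter
    cfrom: int  # offset in ORIGINAL of the first character covered
    cto: int    # offset in ORIGINAL just past the last character covered


def locate(piece: str, start: int) -> Span:
    """Anchor text copied from ORIGINAL at its first occurrence at or after start."""
    cfrom = ORIGINAL.index(piece, start)
    return Span(piece, cfrom, cfrom + len(piece))


def inserted(piece: str, at: int) -> Span:
    """Text added by regularisation covers nothing in ORIGINAL: zero-width span."""
    return Span(piece, at, at)


s1 = locate("Carp: a fish. ", 0)          # keeps its trailing space
s2 = locate("Sometimes eaten.", s1.cto)   # the following space is dropped
s3_body = locate("But not often", s2.cto)
s3_period = inserted(".", s3_body.cto)    # the "." that was not in the original
s3 = Span(s3_body.text + s3_period.text, s3_body.cfrom, s3_period.cto)
s4 = locate("Nevertheless it is tasty.", s3.cto)
often = locate("often", s3.cfrom)

for s in (s1, s2, s3, s4):
    print(s.cfrom, s.cto, repr(s.text))
```

Under these assumptions the four sentences come out as 0-14, 14-30, 31-44 and
44-69, and the added period sits at 44-44, the same value as the cto of
`often'. A different implementation of the same splitting algorithm would have
to reproduce exactly these numbers, which is why the behaviour needs to be
specified.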

Is this generally agreed?

Ann