[developers] [rmrs] semantics of global cfrom/cto

Mon Nov 6 15:47:19 CET 2006

Dear Ben,

thanks for the lucid summary.  I am also CCing this to developers to
make sure everyone who took part in the discussion there can follow
it.  I have also put together a summary of the current issues, based
on this and on last year's discussion, at
http://wiki.delph-in.net/moin/RmrsSpan.

On 11/6/06, Ben Waldron <bmw20 at cl.cam.ac.uk> wrote:
> Both the MRS and RMRS specs contain global cfrom/cto slots in addition
> to the cfrom/cto appearing on individual EPs. This gives something of
> the form:
>
> <rmrs cfrom='-1' cto='-1'>
> ...
> <ep cfrom='0' cto='3'>...</ep>
> ...
> </rmrs>
>
> The semantics of the cfrom/cto on the EP are clearly defined. They
> correspond to the (character) span between the start of the first token
> and the end of the last token spanned by the chart edge which introduced
> the predicate (or, more simply, the text span to which the EP corresponds).
>
> The semantics of the global cfrom/cto is currently unclear/undocumented.
> I believe this should correspond to the character span of the input
> segment ("sentence") which was given to the parser. For example, the
> input segment might have certain characters at the end (or start)
> stripped by the preprocessor, or the final token might be semantically
> vacuous (oe's insight). Such a semantics also fits neatly with the text
> interface spec, which contains a global cfrom/cto (I think this was at
> Uli's suggestion) corresponding to exactly the span of the input
> segment. An alternative semantics, perhaps that originally envisioned,
> is to simply set the global cfrom to the minimum EP cfrom, and to set
> the global cto to the maximum EP cto.
>
> I hope we can get a short discussion going, come to an agreed semantics,
> document it, and then get an implementation where the global cfrom/cto
> slots are set (the current '-1' values look a little silly). There was
> some discussion on this issue a year ago, but it appears the thread
> petered out before coming to any conclusion.

I think that a global cfrom/cto corresponding to the span of the
input segment was the general consensus last year.  The discussion
finished where this one is starting, with a need for more
documentation on specific details, in particular on what to do
if/when the tokenizer changes the input string.
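
To make the difference between the two candidate semantics concrete,
here is a minimal Python sketch.  It is only an illustration: the XML
fragment is a heavily abbreviated, made-up RMRS, the function names
ep_envelope and segment_span are hypothetical, and the offsets assume
the character-point convention discussed further down.

```python
import xml.etree.ElementTree as ET

# Made-up, abbreviated RMRS for the segment "The dog barks."
# (real RMRS XML carries much more structure than this).
SAMPLE = """<rmrs cfrom='-1' cto='-1'>
  <ep cfrom='0' cto='3'/>
  <ep cfrom='4' cto='7'/>
  <ep cfrom='8' cto='13'/>
</rmrs>"""


def ep_envelope(rmrs_xml):
    """Alternative semantics: (minimum EP cfrom, maximum EP cto)."""
    root = ET.fromstring(rmrs_xml)
    spans = [(int(ep.get("cfrom")), int(ep.get("cto")))
             for ep in root.iter("ep")]
    # Ignore EPs whose span is unset (-1).
    spans = [(f, t) for f, t in spans if f >= 0 and t >= 0]
    if not spans:
        return (-1, -1)
    return (min(f for f, _ in spans), max(t for _, t in spans))


def segment_span(segment_offset, segment_text):
    """Proposed semantics: span of the input segment given to the parser.

    Assumes character points, so cto = cfrom + length of the segment.
    """
    return (segment_offset, segment_offset + len(segment_text))


print(ep_envelope(SAMPLE))                # (0, 13)
print(segment_span(0, "The dog barks."))  # (0, 14)
```

The two results differ because the final full stop introduces no EP,
which is exactly the "semantically vacuous final token" case.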

Note, we should probably also make clear here whether CFROM/CTO refer
to character positions or character points.   Quoting Ben:
=======================================
 Eg. given the text 'abcd' the range CFROM=0 to
CTO=2 refers to the "abc" substring.

abcd
0123 = character positions

I would like to suggest we use character _points_ (the points between
characters) instead of the above -- more expressive and allows the
specification of empty ranges. Eg. given the text 'abcd' the range
CFROM=0 to CTO=2 would refer to the "ab" substring, whilst the range
CFROM=0 to CTO=3 would refer to the "abc" substring

.a.b.c.d.
0 1 2 3 4 = character points
==========================================

My understanding was that at Jerez we agreed that this was a good
idea, but I am not sure of the current state of implementation.  I
think the use of points may be necessary to deal with some of the
tokenization issues.
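
For anyone who wants to check the two conventions against the 'abcd'
example quoted above, here is a small Python sketch.  The function
names are hypothetical and exist only for this illustration; the
point convention happens to coincide with ordinary Python slicing.

```python
def substring_by_positions(text, cfrom, cto):
    """Character positions: cto names the last character (inclusive)."""
    return text[cfrom:cto + 1]


def substring_by_points(text, cfrom, cto):
    """Character points: cfrom/cto name the gaps between characters."""
    return text[cfrom:cto]


print(substring_by_positions("abcd", 0, 2))     # abc
print(substring_by_points("abcd", 0, 2))        # ab
print(substring_by_points("abcd", 0, 3))        # abc
print(repr(substring_by_points("abcd", 2, 2)))  # '' (an empty range)
```

With positions there is no way to write the empty range at all, which
is the expressiveness argument Ben makes.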

As a consensus emerges, I volunteer to write up the documentation on
the RMRS wiki.

-- 
Francis Bond  <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Natural Language Research Group