[developers] token-level ambiguity in the LKB parser
Ben Waldron
benjamin.waldron at cl.cam.ac.uk
Tue Jan 31 09:48:55 CET 2006
Stephan Oepen wrote:
> - warn you that i changed add-token-edge(), so as to return the edge
> just created; looking at current callers, nobody seems to care (but
> it is a change in API, after all).
>
I think this is an improvement on the old API.
> - ask you to have a look at the token chart in the attachment to see
> whether you would expect this kind of token-level ambiguity to work
> in the LKB parser: tokens 3 + 5, jointly have the same span as 7; i
> think it should, but some re-assurance would be welcome.
>
Be reassured, the parser is quite happy to receive such ambiguity in the
token chart. In fact, this is a very useful feature. The
"x-preprocessor" code produces a token chart of this form (given
suitable fsr rules), although the older "preprocessor" code does not
support it. The NorSource grammar, for example, uses this mechanism to
handle tokenization ambiguity by providing multiple paths through the
token lattice in two cases: (i) a period at the end of a word, which
may be part of an abbreviation, a sentence-final period, or both; and
(ii) the Norwegian possessive "s", which, unlike in English, is written
without an apostrophe and can attach to any word. By preserving such
ambiguity in the token lattice, one can do away with the many
complicated preprocessor rules that would otherwise be needed to ensure
that the single correct choice is made during tokenization (such as the
punctuation splitting rules in the ERG's preprocessor).
LKB(58): (print-tchart)
> token/spelling chart dump:
...
1-2 [3] 밥 => (밥) [] <3 c 4>
...
1-3 [7] 밥이나 => (밥이나) [] <3 c 6>
...
2-3 [5] 이나 => (이나) [] <4 c 6>
...
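
To make the "multiple paths" reading of the dump concrete, here is a
rough sketch in Python (not LKB code) that treats the three edges shown
above as a tiny lattice and lists the ways of getting from vertex 1 to
vertex 3. The names TOKEN_EDGES and token_paths are made up for
illustration, and only the edges visible in the dump are included.

```python
from collections import defaultdict

# Edges copied from the chart dump: (start vertex, end vertex, edge id, surface form).
TOKEN_EDGES = [
    (1, 2, 3, "밥"),
    (1, 3, 7, "밥이나"),
    (2, 3, 5, "이나"),
]


def token_paths(edges, start, end):
    """Return every sequence of edge ids leading from vertex `start` to vertex `end`."""
    # Index edges by the vertex they start at.
    outgoing = defaultdict(list)
    for edge in edges:
        outgoing[edge[0]].append(edge)

    def walk(vertex):
        if vertex == end:
            return [[]]  # exactly one way to finish: take no more edges
        paths = []
        for _, to_vertex, edge_id, _ in outgoing[vertex]:
            for rest in walk(to_vertex):
                paths.append([edge_id] + rest)
        return paths

    return walk(start)


print(token_paths(TOKEN_EDGES, 1, 3))
```

Each inner list is one tokenization of the span: edges 3 and 5 taken
together, or edge 7 on its own. Keeping both in the chart lets the
grammar decide later which one leads to an analysis.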
>finally, woodley, in case you read this far: with token-level ambiguity
>the LUI chart display will need extending as regards `decoration' with
>surface elements at the bottom. it would be tempting to just send you
>
> #E[-0 -1 0 -1 "존이" "" []]
> #E[-1 -2 1 -1 "밥" "" []]
> #E[-1 -3 2 -1 "밥이나" "" []]
> #E[-2 -3 3 -1 "이나" "" []]
> #E[-3 -4 4 -1 "빵을" "" []]
> #E[-4 -5 5 -1 "먹었다" "" []]
>
>and have the tokens organize themselves in multiple rows. do you think
>future LUI releases might include support for this?
>
I would like to second this feature request for the LUI.
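
For what it is worth, one simple way the tokens could "organize
themselves in multiple rows" is a greedy layout that puts each token in
the first row where it does not overlap anything already placed. The
sketch below is Python, not LUI code. It assumes that the first two
fields of each #E line are the (negated) start and end positions of the
token, which is my reading of the example rather than a documented
format, and the name assign_rows is hypothetical.

```python
# (start, end, surface form), read off the quoted #E lines under the
# assumption that the first two fields are negated start/end positions.
TOKENS = [
    (0, 1, "존이"),
    (1, 2, "밥"),
    (1, 3, "밥이나"),
    (2, 3, "이나"),
    (3, 4, "빵을"),
    (4, 5, "먹었다"),
]


def assign_rows(tokens):
    """Greedy layout: map each (start, end) span to the first row it fits in."""
    row_ends = []  # rightmost end position used so far in each row
    layout = {}
    for start, end, _form in sorted(tokens, key=lambda t: (t[0], t[1])):
        for row, row_end in enumerate(row_ends):
            if row_end <= start:  # no overlap with what is already in this row
                row_ends[row] = end
                layout[(start, end)] = row
                break
        else:
            # Overlaps every existing row, so open a new one.
            row_ends.append(end)
            layout[(start, end)] = len(row_ends) - 1
    return layout


for (start, end), row in sorted(assign_rows(TOKENS).items()):
    print(start, end, "row", row)
```

With the six tokens above, only the span 1-3 ends up in a second row;
the other five share the first row.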
- Ben