[developers] token-level ambiguity in the LKB parser

Ben Waldron benjamin.waldron at cl.cam.ac.uk
Tue Jan 31 09:48:55 CET 2006


Stephan Oepen wrote:
>  - warn you that i changed add-token-edge(), so as to return the edge
>    just created; looking at current callers, nobody seems to care (but
>    it is a change in API, after all).
>  
I think this is an improvement on the old API.

>  - ask you to have a look at the token chart in the attachment to see
>    whether you would expect this kind of token-level ambiguity to work
>    in the LKB parser: tokens 3 + 5, jointly have the same span as 7; i
>    think it should, but some re-assurance would be welcome.
>  
Be reassured, the parser is quite happy to receive such ambiguity in the 
token chart. In fact, this is a very useful feature. The 
"x-preprocessor" code produces a token chart of this form (given 
suitable fsr rules), although the older "preprocessor" code does not 
support this capability. The NorSource grammar, for example, uses this 
mechanism to handle tokenization ambiguity by providing multiple paths 
through the token lattice in cases of (i) a period at the end of a 
word, which may be part of an abbreviation, a sentence-final period, or 
both, and (ii) the Norwegian possessive "s", which, unlike in English, 
comes without an apostrophe and can attach to any word. By preserving 
such ambiguity in the token lattice one can do away with the many 
complicated preprocessor rules which might otherwise be necessary to 
ensure that the single correct choice is made during tokenization (such 
as the punctuation splitting rules in the ERG's preprocessor).

LKB(58): (print-tchart)

 > token/spelling chart dump:

...
1-2 [3] 밥 => (밥) [] <3 c 4>
...
1-3 [7] 밥이나 => (밥이나) [] <3 c 6>
...
2-3 [5] 이나 => (이나) [] <4 c 6>
...
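
To make the lattice reading of this dump concrete, here is a minimal 
sketch in Python (not LKB code) that treats the three edges shown above 
as arcs between chart vertices and enumerates every gap-free path across 
the span 1-3. The names EDGES and paths() are hypothetical and exist only 
for this illustration; the elided edges are left out.

```python
from collections import defaultdict

# Token edges copied from the chart dump: (edge id, start, end, form).
# Only the three edges shown in the dump are listed; elided ones are omitted.
EDGES = [
    (3, 1, 2, "밥"),
    (7, 1, 3, "밥이나"),
    (5, 2, 3, "이나"),
]


def paths(edges, start, end):
    """Return every sequence of edge ids that covers start..end without gaps."""
    # Index edges by their start vertex so each step only looks at
    # edges that can continue the current path.
    by_start = defaultdict(list)
    for edge_id, s, e, _form in edges:
        by_start[s].append((edge_id, e))

    def walk(vertex):
        if vertex == end:
            return [[]]  # one way to finish: add nothing more
        found = []
        for edge_id, nxt in by_start[vertex]:
            for rest in walk(nxt):
                found.append([edge_id] + rest)
        return found

    return walk(start)


print(paths(EDGES, 1, 3))  # [[3, 5], [7]]
```

Both paths cover the same span, which is the situation described above: 
tokens 3 + 5 together, or token 7 alone.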

>finally, woodley, in case you read this far: with token-level ambiguity
>the LUI chart display will need extending as regards `decoration' with
>surface elements at the bottom.  it would be tempting to just send you
>
>  #E[-0 -1 0 -1 "존이" "" []]
>  #E[-1 -2 1 -1 "밥" "" []]
>  #E[-1 -3 2 -1 "밥이나" "" []]
>  #E[-2 -3 3 -1 "이나" "" []]
>  #E[-3 -4 4 -1 "빵을" "" []]
>  #E[-4 -5 5 -1 "먹었다" "" []]
>
>and have the tokens organize themselves in multiple rows.  do you think
>future LUI releases might include support for this?
>  
I would like to second this feature request for the LUI.
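
As a rough idea of what "organizing in multiple rows" could mean, here is 
a small Python sketch of a greedy row assignment. It is an assumption on 
my part, not a description of how the LUI works or would implement this. 
It also assumes that the first two fields of each #E[...] line quoted 
above are the (negated) start and end vertices; TOKENS and assign_rows() 
are hypothetical names.

```python
# Spans read off the quoted #E[...] lines, with the minus signs dropped
# (assumed to be start and end vertices): (start, end, surface form).
TOKENS = [
    (0, 1, "존이"),
    (1, 2, "밥"),
    (1, 3, "밥이나"),
    (2, 3, "이나"),
    (3, 4, "빵을"),
    (4, 5, "먹었다"),
]


def assign_rows(tokens):
    """Greedily place each token in the first row where it does not overlap."""
    row_ends = []  # row_ends[i] is the end vertex of the last token in row i
    placed = []
    for start, end, form in sorted(tokens, key=lambda t: (t[0], t[1])):
        for row, last_end in enumerate(row_ends):
            if last_end <= start:
                row_ends[row] = end  # token fits after the last one in this row
                break
        else:
            # No existing row has room, so open a new one.
            row = len(row_ends)
            row_ends.append(end)
        placed.append((row, start, end, form))
    return placed


for row, start, end, form in assign_rows(TOKENS):
    print(row, start, end, form)
```

With these six tokens, everything lands in row 0 except the token spanning 
1-3, which overlaps its neighbours and is pushed to row 1.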

- Ben



