[developers] grammar locale issues

Mon Jun 20 17:04:44 CEST 2005

Ann Copestake wrote:

>yes, but *grammar-locale* is not used anywhere, is it?  Has there been an
>agreement that this is going to be used?  If not, and if you think this is the
>right way to handle it, please propose it to the developers list and get
>agreement.  I know that you did send email to me, oe and Melanie about some
>variant of this idea some time ago, but developers is now the right venue, and
>now would be a good time to start discussion.
>
>The problem is that, as things stand, if NorSource is using this and nobody
>else is, then the NorSource guys are going to have problems that are
>incomprehensible to other users.  I think that as a matter of principle one
>should not add new functions to user-fns for individual grammars - it's not
>good in general for people to be running variants of the code since it's much
>more difficult to help them.  If this is good for NorSource, it would be good
>for other grammars too - hence should be proposed and turned into a general
>mechanism.
>  
>
Here would be my proposal:

MOTIVATION

Individual grammars encode their files using various encoding systems
(e.g. Latin-1, EUC, Unicode, ...). The encoding is a property of the
grammar files as a whole. Internally, Lisp (at least Allegro Common
Lisp) stores all strings as 16-bit Unicode. For the grammar to run
correctly inside the LKB, characters entering the Lisp universe must
therefore be decoded correctly. Hence:

- (1) Lisp must use the correct encoding when reading in the grammar files.
- (2) Characters received from (or sent to) Emacs must be decoded correctly.
- (3) Characters received from (or sent to) the GUI (CLIM) must be 
decoded correctly.

(3) is no problem because (as I understand it) a CLIM input window can
accept only plain ASCII characters anyway. Hence (2) becomes vital for
any grammar expecting to process non-ASCII characters. If (1) is not
satisfied, the grammar will fail to load, spitting out errors such as
those shown below without telling the user why these strange errors
are occurring:

Syntax error at position 62971:
Incorrect syntax following type name #\[
Ignoring (part of) entry for #\[

Incorrect syntax following type name POS_PL_A_5_0_4_0_3_0_2_0_1_0_SM堺
Ignoring (part of) entry for POS_PL_A_5_0_4_0_3_0_2_0_1_0_SM堺

Alternatively, the grammar files could appear to load fine, but inside
the LKB the entries will be mangled and will not function properly.
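
To make the two failure modes concrete, here is a minimal sketch in
Python (used only because byte decoding behaves the same way in any
language; this is not LKB code, and the word "blåbær" is just a
made-up lexicon entry). Decoding with the wrong encoding either fails
loudly or succeeds and quietly produces garbage:

```python
# Illustration only: Python stands in for the Lisp reader here.
word = "blåbær"  # hypothetical entry containing non-ASCII characters

latin1_bytes = word.encode("latin-1")  # the entry as saved in a Latin-1 file
utf8_bytes = word.encode("utf-8")      # the same entry saved in a UTF-8 file

# Failure mode 1: Latin-1 bytes read as UTF-8 are not valid UTF-8,
# so the reader stops with an error (compare the syntax errors above).
try:
    latin1_bytes.decode("utf-8")
    loud_failure = False
except UnicodeDecodeError:
    loud_failure = True

# Failure mode 2: UTF-8 bytes read as Latin-1 always "succeed", because
# every byte is a valid Latin-1 character, but the text is mangled.
mangled = utf8_bytes.decode("latin-1")

# With the matching encoding the entry comes back intact.
restored = latin1_bytes.decode("latin-1")

print(loud_failure, repr(mangled), repr(restored))
```

The second case is the dangerous one, since nothing is reported at load
time and the damage only shows up later as entries that never match
their input.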

SOLUTIONS?

- (A) Leave things as they are...
- (B) Provide users with full, comprehensible instructions for navigating
the encoding maze, perhaps giving feedback at grammar load time if any
settings are likely to be problematic.
- (C) Ensure encodings are automatically set correctly at grammar load time.

I'd like to propose a solution along the lines of (C).

Note that issue (1) can be resolved (with Allegro Common Lisp, at least)
by setting *locale* at grammar load time, e.g. in globals.lsp:

;; assign the existing ACL variable rather than redefining it
#+:allegro
(setf excl:*locale* (excl::find-locale "no.latin1"))

Issue (2) requires that Lisp and Emacs talk to each other using the same
encoding, and also that this encoding can handle all characters passed
in either direction. Unicode satisfies this requirement. Hence we need
only run Emacs from within a Unicode environment (the encoding is
inherited), or include the following in the .emacs configuration file:

(set-language-environment "utf-8")

The standard streams (*terminal-io*, *standard-input*, 
*standard-output*, *error-output*, *trace-output*, *query-io*, and 
*debug-io*) are set at Lisp startup (and are unaffected by any -locale 
argument passed to the Lisp process). So long as Emacs is run as above, 
emacs-mule will be used for both the interprocess communication and for 
the encoding of these streams (and we are assured that they will happily 
process any character we throw at them).
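
The requirement behind issue (2), that the shared encoding must be able
to represent every character passed in either direction, can be checked
with a small round-trip test. Again this is only a Python sketch rather
than anything in the LKB; round_trips is a hypothetical helper, and the
samples are a made-up Norwegian word and the CJK character from the
error output above:

```python
# Illustration only: which encodings can carry which characters?
samples = ["blåbær", "堺"]

def round_trips(text, encoding):
    """Return True if text survives an encode/decode cycle in encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for some character.
        return False

for enc in ("latin-1", "utf-8"):
    print(enc, [round_trips(s, enc) for s in samples])
```

Latin-1 can carry the Norwegian word but not the CJK character, whereas
UTF-8 carries both, which is why a Unicode encoding is the safe choice
for the Emacs/Lisp channel regardless of how the grammar files
themselves are encoded.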

Issue (3) will not give rise to encoding errors, since CLIM accepts only
ASCII input. But moving to a GUI able to accept more than plain ASCII
would be helpful for many users. Note that CLIM will happily display
non-ASCII characters, as long as they were decoded correctly at the time
they entered the Lisp universe.

Feedback much appreciated,
-Ben