[developers] Re: grammar locale issues

Ben Waldron benjamin.waldron at cl.cam.ac.uk
Fri Jun 24 12:22:24 CEST 2005


Hi Stephan-

>i am sympathetic to your proposal to build more checks and balances in,
>specifically make the `default' encoding for each grammar explicit as a
>global variable (we have long had that property in PET).  i thought the
>*grammar-encoding* global, plus read-script-file-aux() doing some sanity
>checking was a promising idea.
>  
>
This seemed to me initially to be the simplest approach too -- e.g. 
ensure the LKB loads the grammar files using the correct encoding (by 
checking that *grammar-locale* is the same as the LKB's *locale*) and, 
when this check fails, tell the user in simple terms what's wrong and 
how to fix it (?). But it's just as easy to set *locale* to the value 
of *grammar-locale* defined in the grammar files. If we do this, 
everything just works, and individual users no longer need to ensure 
that the appropriate "-locale" option is passed to the LKB at startup 
(be it from Emacs, from the command line, by double-clicking an icon, 
or whatever). So I would argue that

(C) Ensure encodings are automatically set correctly at grammar load 
time [whenever possible]

is the approach we should be taking.

>also, i had been planning to toggle
>
>  (defparameter cdb::*cdb-ascii-p* nil)
>
>since the current LKB `dot.emacs' defaults to UTF-8 now, and i agree to
>your expectations that ASCII grammar will not break by writing two-byte
>CDB entries (i was hoping to test that assumption, though :-).
>  
>
I'll test that our assumption is indeed correct and get back to the list.

>finally, i am nervous about the set-up you propose where we try to make
>ELI communication always be UTF-8, but potentially have another coding
>convention for the grammar files (or i/o with sub-processes).  this is
>a new idea to me (i did not think it would be possible), but in general
>my experience has been that ensuring _one_ consistent coding system at
>all levels is the path to happiness: i believe your proposal could mean
>that the *common-lisp* buffer has a different (process) coding system
>than buffers visiting TDL files (e.g. for JaCY, where files continue to
>be in EUC, for now). 
>
No, this is not the case, at least in the sense that I think you mean. 
Emacs buffers internally use a single multibyte character encoding 
(emacs-mule); similarly Allegro Lisp uses 16-bit Unicode, tsdb++ uses 
utf8, etc. Individual buffers can use various encodings for i/o to 
files, the terminal, processes, etc. So the *common-lisp* buffer uses 
one encoding to talk to Lisp (if we use an encoding such as emacs-mule 
or utf8 we can be assured all characters can pass in either direction), 
but a different setting for saving to a file (which we don't care about 
anyway). Each buffer associated with a TDL file will internally 
represent characters as emacs-mule, but uses a separate encoding when 
reading from or saving to the filesystem (the default set at Emacs 
startup from the OS locale, or set by set-language-environment, or set 
by the "-*-...-*-" header in individual files). Keyboard input uses an 
encoding set at Emacs startup (and not changed by 
set-language-environment, although other commands could change this if 
there was any reason to do so).

M-x describe-current-coding-system displays the encodings associated 
with the current buffer. For example:

==
Coding system for saving this buffer:
  Not set locally, use the default.
Default coding system (for new files):
  1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
Coding system for keyboard input:
  u -- utf-8 (alias of mule-utf-8)
Coding system for terminal output:
  u -- utf-8 (alias of mule-utf-8)
Coding systems for process I/O:
  encoding input to the process: = -- emacs-mule
  decoding output from the process: = -- emacs-mule
Defaults for subprocess I/O:
  decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
  encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
==
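
The same settings can also be read from Lisp, which may be handy when 
debugging a user's set-up. What follows is only a minimal sketch: the 
function name my/show-buffer-coding-systems is made up for 
illustration, while the variable and functions it calls are standard 
Emacs Lisp.

```elisp
;; Hypothetical helper (not part of the LKB or ELI): summarise the
;; coding systems that apply to the current buffer.
(defun my/show-buffer-coding-systems ()
  "Show file, keyboard, terminal and process coding systems."
  (interactive)
  (let* ((proc (get-buffer-process (current-buffer)))
         ;; `process-coding-system' returns (DECODING . ENCODING);
         ;; this stays nil when the buffer has no process.
         (proc-cs (and proc (process-coding-system proc))))
    (message "file: %S  keyboard: %S  terminal: %S  process: %S"
             buffer-file-coding-system
             (keyboard-coding-system)
             (terminal-coding-system)
             proc-cs)))
```

Running M-x my/show-buffer-coding-systems in the *common-lisp* buffer 
and then in a buffer visiting a TDL file should show the two kinds of 
setting side by side.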

> i fear that will be harder to set up reliably for
>emacs(1) than just one consistent scheme and create potential for user
>confusion (and i have seen difficulties pasting in X across encodings).
>  
>
I agree that we need one consistent scheme. I would like to do this by 
(i) minimizing the number of per-grammar settings and (ii) minimizing 
further the number of per-user settings (e.g. the lines they must add 
to their Emacs config file). So I would propose that

- Emacs-Lisp interprocess communication should always be set to 
emacs-mule (or unicode -- either one, it doesn't matter),
- the grammar files themselves should specify their encoding (to the 
LKB, by setting *locale* in the script, and potentially also to Emacs, 
via the "-*-...-*-" header),
- if the "-*-...-*-" header is not used, each user running the grammar 
must specify set-language-environment in their Emacs config file.
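
As a rough illustration of the first and last points, and not a tested 
recipe for ELI: the helper name my/lisp-process-use-utf-8 is 
hypothetical, I use utf-8 rather than emacs-mule for the process, and I 
assume the Lisp side is set up to read and write utf-8 as well.

```elisp
;; Sketch only: force utf-8 in both directions on the Lisp subprocess.
;; "*common-lisp*" is the buffer name mentioned in this thread;
;; my/lisp-process-use-utf-8 is a hypothetical helper name.
(defun my/lisp-process-use-utf-8 ()
  "Make the process in *common-lisp* exchange utf-8 with Emacs."
  (interactive)
  (let ((proc (get-buffer-process "*common-lisp*")))
    (when proc
      ;; Arguments: PROCESS, DECODING (output from Lisp),
      ;; ENCODING (input to Lisp).
      (set-process-coding-system proc 'utf-8 'utf-8))))

;; Per-file alternative to set-language-environment: a first-line
;; header in the grammar file itself, e.g. for an EUC-encoded file
;;   ;;; -*- coding: euc-jp -*-
```

With a header like that, Emacs picks the file encoding per buffer, 
independently of whatever the process coding system is.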

We should also recommend that people use Unicode (utf8) whenever 
possible. This would mean that everyone could simply use

(set-language-environment "utf-8")

and UTF-8 will increasingly be the OS default in many cases anyway.

>i am not convinced this level of sophistication is really needed.  some
>of the currently documented procedures are more complex than i think is
>required (today).  for example, the following just works for me (modulo
>substitution of $DELPHINHOME, of course):
>
>  emacs -q &
>  M-x load-file RET $DELPHINHOME/lkb/etc/dot.emacs RET
>  M-x japanese RET
>  (read-script-file-aux "$DELPHINHOME/japanese/lkb/ascript")
>  (do-parse-tty "食べた")
>  
>
But this doesn't work for newbies developing new grammars. I also think 
that in general it's useful to have the option to run the grammars 
without relying on the sophisticated settings in the Emacs config file.

>--- melanie will be visiting here in july, and francis and i expect to
>streamline set-up for JaCY during her visit.
>
>somewhat more high-level, i am inclined to encourage more people to use
>UTF-8, 
>
absolutely

>but in western europe and japan, at least, there appears to be a
>strong, established non-UniCode tradition :-{.
>

-Ben