<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=UTF-8" http-equiv="Content-Type"> <title></title> </head> <body bgcolor="#ffffff" text="#000000"> Hi Stephan- <blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap="">i am sympathetic to your proposal to build more checks and balances in, specifically make the `default' encoding for each grammar explicit as a global variable (we have long had that property in PET). i thought the *grmmar-encoding* global, plus read-script-file-aux() doing some sanity checking was a promising idea. </pre> </blockquote> This seemed to me initially to be the simplest approach too -- eg. ensure the LKB loads the grammar files using the correct encoding (by checking that *grammar-locale* is the same as the LKB's *locale*) and tell the user in simple terms what's wrong and how to fix it (?) when this check fails. But it's just as easy to set *locale* to the value of *grammar-locale* defined in the grammar files. If we do this, everything just works, and individual users no longer need to ensure the appropriate "-locale" option is passed to the LKB at startup (be it from Emacs or from the command line or by double clicking an icon or whatever). So I would argue that (C) Ensure encodings are automatically set correctly at grammar load time [whenever possible] is the approach we should be taking. <blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap=""> also, i had been planning to toggle (defparameter cdb::*cdb-ascii-p* nil) since the current LKB `dot.emacs' defaults to UTF-8 now, and i agree to your expectations that ASCII grammar will not break by writing two-byte CDB entries (i was hoping to test that assumption, though :-). </pre> </blockquote> I'll test that our assumption is indeed correct and get back to the list.
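To make option (C) a little more concrete, here is a minimal sketch in plain Common Lisp of what the load-time step could look like. It is only an illustration under assumptions: *grammar-locale* is the variable proposed in this thread, while *current-locale* and sync-locale-with-grammar are hypothetical names standing in for the LKB's *locale* and for whatever hook the script loader would actually call.

```lisp
;; Sketch only. *grammar-locale* is the variable proposed in this thread;
;; *current-locale* and sync-locale-with-grammar are hypothetical stand-ins
;; (a real implementation would set the LKB's own *locale* instead).

(defvar *grammar-locale* nil
  "Locale name declared by the grammar script, or NIL if none was declared.")

(defvar *current-locale* "C"
  "Stand-in for the locale the Lisp image was started with.")

(defun sync-locale-with-grammar ()
  "Adopt the grammar's declared locale and return the locale now in effect."
  (cond ((null *grammar-locale*)
         ;; Nothing declared: keep the startup locale, but tell the user.
         (warn "Grammar declares no locale; keeping ~a." *current-locale*))
        ((string-equal *grammar-locale* *current-locale*)
         ;; Already consistent: nothing to do.
         nil)
        (t
         ;; Mismatch: switch instead of asking the user to restart.
         (format t "Switching locale from ~a to ~a for this grammar.~%"
                 *current-locale* *grammar-locale*)
         (setf *current-locale* *grammar-locale*)))
  *current-locale*)
```

A grammar script would then contain something like (setf *grammar-locale* "ja_JP.EUC"), where the locale string is just an illustrative value, and the loader would call (sync-locale-with-grammar) before reading any grammar files. The point of switching rather than merely checking is that the user never has to pass "-locale" by hand.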
<blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap=""> finally, i am nervous about the set-up you propose where we try to make ELI communication always be UTF-8, but potentially have another coding convention for the grammar files (or i/o with sub-processes). this is a new idea to me (i did not think it would be possible), but in general my experience has been that ensuring _one_ consistent coding system at all levels is the path to happiness: i believe your proposal could mean that the *common-lisp* buffer has a different (process) coding system than buffers visiting TDL files (e.g. for JaCY, where files continue to be in EUC, for now). </pre> </blockquote> No, this is not the case, at least in the sense that I think you mean. Emacs buffers internally use a single multibyte character encoding (emacs-mule); similarly Allegro Lisp uses 16 bit Unicode, tsdb++ uses utf8, etc. Individual buffers can use various encodings for i/o to files, the terminal, processes, etc. So the *common-lisp* buffer uses one encoding to talk to Lisp (if we use an encoding such as emacs-mule or utf8 we can be assured all characters can pass in either direction), but a different setting for saving to a file (which we don't care about anyway). Each buffer associated with a TDL file will internally represent characters as emacs-mule, but uses a separate encoding when saving to or reading from the filesystem (the default set at Emacs startup from the OS locale, or set by set-language-environment, or set by the "-*-...-*-" header in individual files). Keyboard input uses an encoding set at Emacs startup (and not changed by set-language-environment, although other commands could change this if there was any reason to do so). M-x describe-current-coding-system displays the encodings associated with the current buffer. Eg.
<pre wrap="">==
Coding system for saving this buffer:
  Not set locally, use the default.
Default coding system (for new files):
  1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
Coding system for keyboard input:
  u -- utf-8 (alias of mule-utf-8)
Coding system for terminal output:
  u -- utf-8 (alias of mule-utf-8)
Coding systems for process I/O:
  encoding input to the process: = -- emacs-mule
  decoding output from the process: = -- emacs-mule
Defaults for subprocess I/O:
  decoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
  encoding: 1 -- iso-latin-1 (alias: iso-8859-1 latin-1)
==</pre>
<blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap=""> i fear that will be harder to set up reliably for emacs(1) than just one consistent scheme and create potential for user confusion (and i have seen difficulties pasting in X across encodings). </pre> </blockquote> I agree that we need one consistent scheme. I would like to do this by (i) minimizing the number of per-grammar settings and (ii) minimizing further the number of per-user settings (eg. the lines they must add to their Emacs config file). So I would propose that<br>
- Emacs-Lisp interprocess communication should always be set to emacs-mule (or unicode -- either one, it doesn't matter),<br>
- the grammar files themselves should specify their encoding (to the LKB, by setting *locale* in the script, and potentially also to Emacs, via the "-*-...-*-" header),<br>
- if the "-*-...-*-" header is not used, each user running the grammar must specify set-language-environment in their Emacs config file.<br>
We should also recommend that people use Unicode (utf8) whenever possible. This would mean that everyone could simply use (set-language-environment "utf-8"), which will increasingly be the OS default in many cases anyway.
<blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap=""> i am not convinced this level of sophistication is really needed. some of the currently documented procedures are more complex than i think is required (today). 
for example, the following just works for me (modulo substitution of $DELPHINHOME, of course):

  emacs -q &
  M-x load-file RET $DELPHINHOME/lkb/etc/dot.emacs RET
  M-x japanese RET
  (read-script-file-aux "$DELPHINHOME/japanese/lkb/ascript")
  (do-parse-tty "食べた")
</pre> </blockquote> But this doesn't work for newbies developing new grammars. I also think that in general it's useful to have the option to run the grammars without relying on the sophisticated settings in the Emacs config file. <blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap="">--- melanie will be visiting here in july, and francis and i expect to streamline set-up for JaCY during her visit. somewhat more high-level, i am inclined to encourage more people to use UTF-8, </pre> </blockquote> absolutely <blockquote cite="mid200506211903.j5LJ3UXh006368@mv.uio.no" type="cite"> <pre wrap="">but in western europe and japan, at least, there appears to be a strong, established non-UniCode tradition :-{.</pre> </blockquote> -Ben </body> </html>