[developers] Re: emacs encoding issues (JACY)

Ben Waldron benjamin.waldron at cl.cam.ac.uk
Sat Jun 25 22:35:42 CEST 2005


Ben Waldron wrote:

> Francis Bond wrote:
>
>> G'day,
>>
>>     Yes. Shall I check in a patch to the JACY CVS?
>>
>>
>> Sure.  Would that be patching user-fns.lsp?
>
>
> Yes, here:
>
> #+:chasen
> (defun preprocess-sentence-string (string &key (verbose 
> *chasen-debug-p*) posp)
> ...

I've had another look at this:

==

We have found that for the latest eli and emacs 21.4, that it always 
sets the (stream-external-format *terminal-io*) to :emacs-mule. We 
prefer it to be EUC-JP, so we evaluate the following in the lisp buffer:

(setf excl:*default-external-format*
 (setf (stream-external-format *terminal-io*) :euc))

==

The above, taken from the Wiki, was necessary because the japanify 
function was resetting (to EUC) the encoding Emacs uses to translate 
communications to/from the Lisp process. Meanwhile the Lisp process 
kept using the encoding which Emacs had told it to use at Lisp startup 
((stream-external-format *terminal-io*) set to :emacs-mule). The above 
code ensured that the encodings matched again.

The solution would seem to be not to alter the encoding in the first place.

==

;;; this sets up an encoding
(defun japanify (buffer encoding)
  (save-excursion
    (switch-to-buffer buffer)
    (set-language-environment 'japanese)
    (set-buffer-file-coding-system encoding)
    (set-buffer-process-coding-system encoding encoding)) ;; <= X
  (setq default-buffer-file-coding-system encoding))

(defun lisp (&optional prefix)
  (setq lkb-tmp-dir "/tmp")
  (interactive "P")
  (load "/usr/local/delphin/acl/eli/fi-site-init")
  (setq fi:common-lisp-image-name "/usr/local/delphin/acl/alisp")
  (setq fi:common-lisp-image-file "/usr/local/delphin/acl/bclim.dxl")
  (setq fi:common-lisp-image-arguments 
    (list 
     "-locale" "japan.EUC"
     "-qq" "-L" "/usr/local/delphin/cl-init.cl"))
  (fi:common-lisp)
  (japanify "*common-lisp*" 'euc-jp)) ;; <= X

==

Regarding ChaSen: if we ensure that Emacs and Lisp agree on an encoding 
we can run the 'chasen' command, and if we further ensure that Lisp's 
*locale* is set to EUC (as it is for the JACY grammar) then we will 
exchange data with the ChaSen process in the encoding it can handle.
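
As a rough check of the Lisp side (a sketch only, assuming Allegro CL; 
it does not show what Emacs is using), something like the following 
can be evaluated in the lisp buffer to print the settings discussed 
above:

==
#+:allegro
(format t "~&terminal-io: ~s~%default:     ~s~%locale:      ~s~%"
        ;; encoding Lisp uses when talking to Emacs
        (stream-external-format *terminal-io*)
        ;; encoding used for newly opened streams
        excl:*default-external-format*
        ;; current locale (should be japan.EUC for JACY)
        excl:*locale*)
==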

According to the (limited) documentation that I was able to find on the 
net, it should be possible to tell ChaSen to use alternative encodings 
to EUC:

==
2. How to use ChaSen system
---------------------------

   Suppose a Japanese text file `nihongo', which should be encoded in
   Japanese EUC (Extended UNIX Code), JIS (ISO-2022-JP), Shift_JIS
   (MS Kanji) or UTF-8.  Issue the following command:

   % chasen nihongo # Use the system default encode

   % chasen -i e nihongo-euc # Use EUC-JP or JIS

   % chasen -i s nihongo-euc # Use Shift_JIS

   % chasen -i w nihongo-euc # Use UTF-8

   The result of the morphological analysis is shown on the standard
   output.  If your terminal has a direct input facility of Japanese
   characters, simply type

   % chasen

   then input a Japanese sentence followed by a carriage return.
==

This doesn't work when I try it, though, so I'll just put a wrapper in 
preprocess-sentence-string to ensure that *locale* is 'japan.EUC' when 
we talk to ChaSen.
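
A minimal sketch of what such a wrapper might look like. The macro 
name with-chasen-locale is hypothetical, and this assumes that streams 
opened to the ChaSen process take their external format from *locale*, 
which I have not verified:

==
#+:allegro
(defmacro with-chasen-locale (&body body)
  ;; hypothetical helper: rebind excl:*locale* to japan.EUC for the
  ;; dynamic extent of BODY only; the previous locale is restored
  ;; when BODY exits
  `(let ((excl:*locale* (excl::find-locale "japan.EUC")))
     ,@body))

;; intended use inside preprocess-sentence-string (body elided):
;; (with-chasen-locale
;;   ... existing code that runs 'chasen' and reads its output ...)
==

Binding the variable with let, rather than setting it globally, keeps 
the change local to the ChaSen call, so a grammar that uses a different 
locale elsewhere would not be affected.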

To summarize, I think the code suggested on the Wiki can be reduced to 
the following:

==
(defun lisp (&optional prefix)
  (interactive "P")
  ;; set input method/default coding for files
  (set-language-environment 'japanese)
  ;; ensure new files saved in correct encoding
  (setq default-buffer-file-coding-system 'euc-jp)
  (load "/usr/local/acl/acl70/eli/fi-site-init")
  (setq fi:common-lisp-image-name "/usr/local/acl/acl70/alisp")
  (setq fi:common-lisp-image-file "/usr/local/acl/acl70/bclim.dxl")
  ;; ensure Lisp loads grammar files in correct encoding
  (setq fi:common-lisp-image-arguments
    (list "-locale" "japan.EUC")) ;; <= X
  (fi:common-lisp))
==

The marked line can go if we include the following in globals.lsp (so 
that Lisp sets its locale appropriately):

==
#+:allegro
(defparameter excl:*locale* (excl::find-locale "japan.EUC"))
==

Some other questions I had concerning ChaSen:
- is there an up-to-date manual in English (I can't read Japanese 
without the help of the Google translator...)?
- has anyone run/considered running ChaSen in server mode?
- can ChaSen return a lattice?

Thanks,
-Ben


