[developers] Re: emacs encoding issues (JACY)

Francis Bond fcbond at gmail.com
Mon Jun 27 10:17:01 CEST 2005


G'day,
[snip]

> ==
> 
> We have found that for the latest eli and emacs 21.4, that it always
> sets the (stream-external-format *terminal-io*) to :emacs-mule. We
> prefer it to be EUC-JP, so we evaluate the following in the lisp buffer:
> 
> (setf excl:*default-external-format*
> (setf (stream-external-format *terminal-io*) :euc))
> 
> ==
> 
> The above, taken from the Wiki, was necessary because the japanify
> function was resetting (to EUC) the encoding used by Emacs in
> translating communications to/from the Lisp process, whilst the Lisp
> process was using the encoding which Emacs had told it to use at Lisp
> startup ((stream-external-format *terminal-io*) set to :emacs-mule). The
> above code ensured that the encodings matched again.
> 
> The solution would seem to be not to alter the encoding in the first 
> place.
> 
> ==
> 
> ;;; this sets up an encoding
> (defun japanify (buffer encoding)
> (save-excursion
> (switch-to-buffer buffer)
> (set-language-environment 'japanese)
> (set-buffer-file-coding-system encoding)
> (set-buffer-process-coding-system encoding encoding)) ;; <= X
> (setq default-buffer-file-coding-system encoding))
> 
> (defun lisp (&optional prefix)
> (setq lkb-tmp-dir "/tmp")
> (interactive "P")
> (load "/usr/local/delphin/acl/eli/fi-site-init")
> (setq fi:common-lisp-image-name "/usr/local/delphin/acl/alisp")
> (setq fi:common-lisp-image-file "/usr/local/delphin/acl/bclim.dxl")
> (setq fi:common-lisp-image-arguments
> (list
> "-locale" "japan.EUC"
> "-qq" "-L" "/usr/local/delphin/cl-init.cl"))
> (fi:common-lisp)
> (japanify "*common-lisp*" 'euc-jp)) ;; <= X
> 
> ==


Modifying as few things as possible seems like a good idea. I will test this
and see how it goes. One potential problem is that many (most?) users in
Japan have already set these encodings for other purposes (I normally have
it set to junet (iso-2022-jp-2)). So if we don't specify the encoding
explicitly, there could be surprises.


> Regarding ChaSen: if we ensure that Emacs and Lisp agree on an encoding
> we can run the 'chasen' command, and if we further ensure that Lisp's
> *locale* is set to EUC (as it is for the JACY grammar) then we will
> exchange data with the ChaSen process in the encoding it can handle.
> 
> According to the (limited) documentation that I was able to find on the
> net, it should be possible to tell ChaSen to use alternative encodings
> to EUC:
> 
> ==
> 2. How to use ChaSen system
> ---------------------------
> 
> Suppose a Japanese text file `nihongo', which should be encoded in
> Japanese EUC (Extended UNIX Code), JIS (ISO-2022-JP), Shift_JIS
> (MS Kanji) or UTF-8. Issue the following command:
> 
> % chasen nihongo # Use the system default encode
> 
> % chasen -i e nihongo-euc # Use EUC-JP or JIS
> 
> % chasen -i s nihongo-euc # Use Shift_JIS
> 
> % chasen -i w nihongo-euc # Use UTF-8
> 
> The result of the morphological analysis is shown on the standard
> output. If your terminal has a direct input facility of Japanese
> characters, simply type
> 
> % chasen
> 
> then input a Japanese sentence followed by a carriage return.
> ==
> 
> This doesn't work when I try it though... So I'll just put a wrapper in
> preprocess-sentence-string to ensure that *locale* is 'japan.EUC' when
> we talk to ChaSen.
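
A minimal sketch of such a wrapper, assuming (as the discussion above does)
that binding excl:*locale* around the call is enough to make Lisp talk EUC
to ChaSen. The macro name with-chasen-locale is made up for illustration and
is not existing LKB or JACY code; the #-:allegro fallback is only there so
the file still loads in other Lisps:

```lisp
;; Hypothetical helper, not part of the LKB or JACY.
;; On Allegro, bind the locale to japan.EUC for the duration of BODY.
#+:allegro
(defmacro with-chasen-locale (&body body)
  "Run BODY with Allegro's locale bound to japan.EUC."
  `(let ((excl:*locale* (excl::find-locale "japan.EUC")))
     ,@body))

;; Fallback for other Lisps: no locale handling at all, just run BODY.
#-:allegro
(defmacro with-chasen-locale (&body body)
  "Run BODY unchanged (no locale support assumed outside Allegro)."
  `(progn ,@body))
```

The part of preprocess-sentence-string that talks to ChaSen would then go
inside (with-chasen-locale ...), leaving the locale untouched everywhere
else.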


ChaSen handles different encodings by converting the dictionaries, and 
loading a different dictionary (this avoids the performance penalty of 
converting on the fly). This recoding has to be done by hand, using the 
ChaSen dict-utils and there have been versions where it didn't work as 
advertised. I think the safe thing is to add the wrapper, as you suggest, and
maybe even explicitly tell ChaSen to be EUC-JP (it is the default, but again 
I prefer to avoid surprises). Something like:

(command (format 
nil 
"~a -i e -F '(\"%m\" \"%M\" \"%P-+%Tn-%Fn\" \"%y\")\\n'" 
*chasen-application*)))
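
If the encoding ever needs to change, the flag could be made a parameter.
A rough sketch, using only the -i values from the documentation quoted
above (e, s, w); chasen-command and *chasen-encoding-flags* are hypothetical
names, not existing LKB code:

```lisp
;; Hypothetical table mapping an encoding keyword to ChaSen's -i argument,
;; taken from the usage text quoted earlier (e = EUC-JP or JIS,
;; s = Shift_JIS, w = UTF-8).
(defparameter *chasen-encoding-flags*
  '((:euc-jp . "e") (:shift-jis . "s") (:utf-8 . "w")))

(defun chasen-command (application &optional (encoding :euc-jp))
  "Return the shell command string for running APPLICATION with an
explicit input encoding and the output format used above."
  (let ((flag (cdr (assoc encoding *chasen-encoding-flags*))))
    (unless flag
      (error "Unsupported ChaSen encoding: ~s" encoding))
    (format nil
            "~a -i ~a -F '(\"%m\" \"%M\" \"%P-+%Tn-%Fn\" \"%y\")\\n'"
            application flag)))
```

With the default encoding, (chasen-command *chasen-application*) yields the
same command string as the snippet above. Whether the s and w variants
actually work depends on the dictionaries having been recoded, as noted.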

> To summarize, I think the code suggested on the Wiki can be reduced to
> the following:
> 
> ==
> (defun lisp (&optional prefix)
> (interactive "P")
> (set-language-environment 'japanese) ;; set input method/default coding for files
> (setq default-buffer-file-coding-system 'euc-jp) ;; ensure new files saved in correct encoding
> (load "/usr/local/acl/acl70/eli/fi-site-init")
> (setq fi:common-lisp-image-name "/usr/local/acl/acl70/alisp")
> (setq fi:common-lisp-image-file "/usr/local/acl/acl70/bclim.dxl")
> (setq fi:common-lisp-image-arguments (list "-locale" "japan.EUC"))
> ;; (<= X) ensure Lisp loads grammar files in correct encoding
> (fi:common-lisp))
> ==
> 
> The marked line can go if we include the following in globals.lsp (so
> that Lisp sets its locale appropriately):

> ==
> #+:allegro
> (defparameter excl:*locale* (excl::find-locale "japan.EUC"))
> ==
> 

That's a great idea, I would much prefer to store this information with the 
grammar.

Perhaps we could then go one step further and take out:
(set-language-environment 'japanese) ;; set input method/default coding for files
(setq default-buffer-file-coding-system 'euc-jp) ;; ensure new files saved in correct encoding
 
These aren't necessary to run the grammar, they just make life easier. The 
lisp command would then be the same as for other grammars, and we could 
always make a new command e.g. lisp-ja,

(defun lisp-ja ()
(interactive)
(set-language-environment 'japanese)
(setq default-buffer-file-coding-system 'euc-jp) 
(lisp))

for those users who aren't in a Japanese environment by default.

> Some other questions I had concerning ChaSen:
> - is there an up-to-date manual in English (I can't read Japanese
> without the help of the Google translator...)?


No.

> - has anyone run/considered running ChaSen in server mode?


Not that I know of.

> - can ChaSen return a lattice?


In theory yes (chasen -p), but I couldn't get it to work.

MeCab, a new morphological analyser that has most of ChaSen's functionality
and is a fair bit faster, will return n-best results (e.g., mecab -N 100),
but not formatted as a lattice. You can access the lattice through the
script bindings: http://chasen.org/~taku/software/mecab/bindings.html

As the bindings are written with SWIG, it should be possible to create them 
for ACL.

Kudo-san (now at Google) recommended we move to MeCab, but we (the JACY
developers) haven't really discussed it yet. MeCab is still not as widely
available as ChaSen, although it is becoming so.

-- 
Francis Bond <http://www.kecl.ntt.co.jp/icl/mtg/members/bond/>
NTT Communication Science Laboratories | Machine Translation Research Group
Now visiting the LOGON MT project in Oslo
<http://www.emmtee.net/>