G'day,<br>
[snip]<br>
<div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">==<br><br>We have found that, for the latest eli and emacs 21.4, it always<br>sets the (stream-external-format *terminal-io*) to :emacs-mule. We
<br>prefer it to be EUC-JP, so we evaluate the following in the lisp buffer:<br><br>(setf excl:*default-external-format*<br> (setf (stream-external-format *terminal-io*) :euc))<br><br>==<br><br>The above, taken from the Wiki, was necessary because the japanify
<br>function was resetting (to EUC) the encoding used by Emacs in<br>translating communications to/from the Lisp process, whilst the Lisp<br>process was using the encoding which Emacs had told it to use at Lisp<br>startup ((stream-external-format *terminal-io*) set to :emacs-mule). The
<br>above code ensured that the encodings matched again.<br><br>The solution would seem to be not to alter the encoding in the first place.<br><br>==<br><br>;;; this sets up an encoding<br>(defun japanify (buffer encoding)
<br> (save-excursion<br> (switch-to-buffer buffer)<br> (set-language-environment 'japanese)<br> (set-buffer-file-coding-system encoding)<br> (set-buffer-process-coding-system encoding encoding)) ;; <= X<br>
(setq default-buffer-file-coding-system encoding))<br><br>(defun lisp (&optional prefix)<br> (interactive "P")<br> (setq lkb-tmp-dir "/tmp")<br> (load "/usr/local/delphin/acl/eli/fi-site-init")
<br> (setq fi:common-lisp-image-name "/usr/local/delphin/acl/alisp")<br> (setq fi:common-lisp-image-file "/usr/local/delphin/acl/bclim.dxl")<br> (setq fi:common-lisp-image-arguments<br> (list<br>
"-locale" "japan.EUC"<br> "-qq" "-L" "/usr/local/delphin/cl-init.cl"))<br> (fi:common-lisp)<br> (japanify "*common-lisp*" 'euc-jp)) ;; <= X
<br><br>==</blockquote><div><br>
Modifying as few things as possible seems like a good idea. I
will test this and see how it goes. One potential problem is
that many (most?) users in Japan have already set these encodings
for other purposes (I normally have it set to junet
(iso-2022-jp-2)), so if we don't specify it, there could be
surprises.<br>
<br>
</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Regarding ChaSen: if we ensure that Emacs and Lisp agree on an encoding<br>we can run the 'chasen' command, and if we further ensure that Lisp's
<br>*locale* is set to EUC (as it is for the JACY grammar) then we will<br>exchange data with the ChaSen process in the encoding it can handle.<br><br>According to the (limited) documentation that I was able to find on the
<br>net, it should be possible to tell ChaSen to use alternative encodings<br>to EUC:<br><br>==<br>2. How to use ChaSen system<br>---------------------------<br><br> Suppose a Japanese text file `nihongo', which should be encoded in
<br> Japanese EUC (Extended UNIX Code), JIS (ISO-2022-JP), Shift_JIS<br> (MS Kanji) or UTF-8. Issue the following command:<br><br> % chasen nihongo # Use the system default encode<br><br> % chasen -i e nihongo-euc # Use EUC-JP or JIS
<br><br> % chasen -i s nihongo-euc # Use Shift_JIS<br><br> % chasen -i w nihongo-euc # Use UTF-8<br><br> The result of the morphological analysis is shown on the standard<br> output. If your terminal has a direct input facility of Japanese
<br> characters, simply type<br><br> % chasen<br><br> then input a Japanese sentence followed by a carriage return.<br>==<br><br>This doesn't work when I try it, though... so I'll just put a wrapper in<br>preprocess-sentence-string to ensure that *locale* is '
japan.EUC' when<br>we talk to ChaSen.</blockquote><div><br>
ChaSen handles different encodings by converting the dictionaries and
loading a different dictionary (this avoids the performance penalty of
converting on the fly). This recoding has to be done by hand,
using the ChaSen dict-utils, and there have been versions where it
didn't work as advertised. I think the safe thing is to add the
wrapper, as you suggest, and maybe even explicitly tell ChaSen to use
EUC-JP (it is the default, but again I prefer to avoid
surprises). Something like:<br>
<br>
(command (format<br>
 nil<br>
 "~a -i e -F '(\"%m\" \"%M\" \"%P-+%Tn-%Fn\" \"%y\")\\n'"<br>
 *chasen-application*))<br>
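<br>
A minimal sketch of such a wrapper, assuming Allegro CL (the function name is hypothetical; preprocess-sentence-string is the LKB entry point mentioned above, and the exact way it is hooked in is an assumption):<br>
<br>
#+:allegro<br>
(defun chasen-preprocess (string)<br>
&nbsp;&nbsp;;; bind excl:*locale* to japan.EUC only for the ChaSen exchange,<br>
&nbsp;&nbsp;;; so the rest of the session keeps its own locale<br>
&nbsp;&nbsp;(let ((excl:*locale* (excl::find-locale "japan.EUC")))<br>
&nbsp;&nbsp;&nbsp;&nbsp;(preprocess-sentence-string string)))<br>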
</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">To summarize, I think the code suggested on the Wiki can be reduced to<br>the following:
<br><br>==<br>(defun lisp (&optional prefix)<br> (interactive "P")<br> (set-language-environment 'japanese) ;; set input method/default coding for files<br> (setq default-buffer-file-coding-system 'euc-jp) ;; ensure new files saved in correct encoding<br> (load "/usr/local/acl/acl70/eli/fi-site-init")<br> (setq fi:common-lisp-image-name "/usr/local/acl/acl70/alisp")<br> (setq fi:common-lisp-image-file "/usr/local/acl/acl70/bclim.dxl")
<br> (setq fi:common-lisp-image-arguments (list "-locale" "japan.EUC"))<br>;; (<= X) ensure Lisp loads grammar files in correct encoding<br> (fi:common-lisp))<br>==<br><br>The marked line can go if we include the following in
globals.lsp (so<br>that Lisp sets its locale appropriately):</blockquote><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">==<br>
#+:allegro<br>(defparameter excl:*locale* (excl::find-locale "japan.EUC"))<br>==<br>
</blockquote><div>
<div><br>
That's a great idea; I would much prefer to store this information with the grammar.<br>
<br>
Perhaps we could then go one step further and take out:<br>
(set-language-environment 'japanese) ;; set input method/default coding for files<br>
(setq default-buffer-file-coding-system 'euc-jp) ;; ensure new files saved in correct encoding<br>
</div>
<br>
These aren't necessary to run the grammar; they just make life
easier. The lisp command would then be the same as for other
grammars, and we could always make a new command, e.g. lisp-ja:<br>
<br>
(defun lisp-ja ()<br>
(interactive)<br>
(set-language-environment 'japanese)<br>
(setq default-buffer-file-coding-system 'euc-jp) <br>
(lisp))<br>
<br>
for those users who aren't in a Japanese environment by default.<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Some other questions I had concerning ChaSen:<br>- is there an up-to-date manual in English (I can't read Japanese
<br>without the help of the Google translator...)?</blockquote><div><br>
No.<br>
<br>
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- has anyone run/considered running ChaSen in server mode?</blockquote><div><br>
Not that I know of.<br>
</div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">- can ChaSen return a lattice?</blockquote><div><br>
In theory yes (chasen -p), but I couldn't get it to work.<br>
<br>
Mecab, a new morphological analyser that has most of ChaSen's
functionality and is a fair bit faster, will return the n-best
analyses (e.g., mecab -N 100), but not formatted as a lattice.
You can access the lattice through the script bindings:
<a href="http://chasen.org/~taku/software/mecab/bindings.html">http://chasen.org/~taku/software/mecab/bindings.html</a><br>
<br>
As the bindings are written with SWIG, it should be possible to create them for ACL.<br>
<br>
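For n-best output, mecab could also be driven as a plain subprocess from Lisp. A sketch, assuming Allegro CL, mecab on the PATH, and both ends using EUC-JP (the one-shot echo pipeline and the lack of shell quoting are simplifications; excl:run-shell-command's :output :stream / :wait nil behaviour is per my reading of the ACL docs):<br>
<br>
#+:allegro<br>
(defun mecab-nbest (sentence &key (n 10))<br>
&nbsp;&nbsp;;; one-shot invocation: the process exits after one sentence,<br>
&nbsp;&nbsp;;; so reading to end-of-file is safe; only for trusted input<br>
&nbsp;&nbsp;(let ((out (excl:run-shell-command<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(format nil "echo '~a' | mecab -N ~d" sentence n)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;:output :stream :wait nil)))<br>
&nbsp;&nbsp;&nbsp;&nbsp;(unwind-protect<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(loop for line = (read-line out nil nil)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;while line collect line)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(close out))))<br>
<br>
This keeps the existing preprocessing loop unchanged; only the parsing of mecab's output lines (one morpheme per line, analyses separated by EOS) would need to replace the ChaSen-specific format string.<br>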
</div></div>Kudo-san (now at Google) recommended we move to mecab, but
we (the JACY developers) haven't really discussed it yet. Mecab
is still not as widely available as chasen, although it is becoming so.<br>
<br>-- <br>Francis Bond <<a href="http://www.kecl.ntt.co.jp/icl/mtg/members/bond/">www.kecl.ntt.co.jp/icl/mtg/members/bond/</a>><br>NTT Communication Science Laboratories | Machine Translation Research Group<br>Now visiting the LOGON MT project in Oslo <
<a href="http://www.emmtee.net/">www.emmtee.net/</a>>