[developers] Re: emacs encoding issues
Ben Waldron
benjamin.waldron at cl.cam.ac.uk
Wed Jun 22 17:04:49 CEST 2005
Francis Bond wrote:
> G'day,
>
> Yes. Shall I check in a patch to the JACY CVS?
>
>
> Sure. Would that be patching user-fns.lsp?
Yes, here:
#+:chasen
(defun preprocess-sentence-string (string &key (verbose *chasen-debug-p*) posp)
...
>
> On a related note, can the lexical database be a different encoding
> from the grammar (and the TDL lexicon files)? I would like to move to
> utf-8, but am very wary about breaking anything, as we have several
> people treebanking and a lot of different versions still being used.
> I suspect I should wait until I am back in Japan and try to move
> everything over in one fell swoop.
Yes. There are three settings which are relevant here: the database
server encoding, the 'database client' encoding, and the encoding used
by Lisp when decoding the octets it receives from the 'database client'.
When you create a lexical database you can set the database encoding:
bash install-lexdb.sh jap ~/jap/lexdb.fld ~/jap/lexdb.dfn "-E EUC_JP"
The setup script will use Unicode unless you tell it otherwise.
When you run a database client, the client has its own client encoding.
This defaults to the database encoding, but can be set in other ways,
for example via a libpq function or via the environment variable
PGCLIENTENCODING.
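A minimal sketch of the environment-variable route, assuming a POSIX shell and assuming the client program (the LKB, or any other libpq-based client) is then started from that same shell so that it inherits the variable:

```shell
# Set the client encoding for libpq-based programs started from this shell.
# EUC_JP is the PostgreSQL name for Japanese EUC.
export PGCLIENTENCODING=EUC_JP

# Show the value that child processes will inherit.
echo "PGCLIENTENCODING=$PGCLIENTENCODING"
```

The variable only affects clients started after it is set; a connection that is already open keeps the client encoding it was opened with.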
When you run the LKB, string conversion (for data obtained from the
'database client') uses the encoding of *locale* by default. The
LKB/LexDB code, and also the LKB code to read in grammar files, will use
this 'default' setting.
Hence what is necessary is to ensure that the client encoding and the
encoding used by the LKB/LexDB code are the same. Suppose your database
is in Unicode (utf8), and the LKB *locale* uses Japanese EUC. We can set
the database client encoding to EUC by running the following Lisp code
(before connecting to the lexical database):
(setf (sys:getenv "PGCLIENTENCODING") "EUC_JP")
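Why the two settings have to agree can be seen at the byte level. The following is only an illustrative sketch, not part of the LKB: it assumes a POSIX shell and an iconv that knows the names UTF-8 and EUC-JP, and the variable names are made up for the example. The same octets are decoded once with the encoding they were written in and once with the wrong one.

```shell
# UTF-8 octets for the two-character string "日本", written as octal
# escapes so the script itself stays plain ASCII.
utf8_octets='\346\227\245\346\234\254'
original=$(printf "$utf8_octets")

# Matching case: convert to EUC-JP and decode as EUC-JP again.
round_trip=$(printf "$utf8_octets" | iconv -f UTF-8 -t EUC-JP | iconv -f EUC-JP -t UTF-8)
if [ "$round_trip" = "$original" ]; then
  echo "matching encodings: text survives"
fi

# Mismatched case: the sender wrote UTF-8 but the reader decodes as EUC-JP.
if printf "$utf8_octets" | iconv -f EUC-JP -t UTF-8 >/dev/null 2>&1; then
  echo "mismatch: decoded without error, but not to the original text"
else
  echo "mismatch: octets rejected as invalid EUC-JP"
fi
```

Either outcome of the mismatched case is a failure from the user's point of view, which is why the client encoding and the Lisp-side decoding need to name the same encoding.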
The Emacs/LexDB interface piggybacks on the LKB Lisp code, so no extra
setting is necessary.
That's it. (But I'm considering patching the LKB/LexDB code so that no
special setting will be necessary in the grammar, whatever encodings you
use. That is, Unicode will be hard-coded as both the client encoding and
the string conversion encoding.)
The character sets available in the PostgreSQL universe are here:
www.postgresql.org/docs/8.0/static/multibyte.html
-Ben