[developers] Re: emacs encoding issues

Wed Jun 22 17:04:49 CEST 2005

Francis Bond wrote:

> G'day,
>
>     Yes. Shall I check in a patch to the JACY CVS?
>
>
> Sure.  Would that be patching user-fns.lsp?

Yes, here:

#+:chasen
(defun preprocess-sentence-string (string &key (verbose *chasen-debug-p*) posp)
...

>
> On a related note, can the lexical database be a different encoding 
> from the grammar (and the TDL lexicon files)?  I would like to move to 
> utf-8, but am very wary about breaking anything, as we have several 
> people treebanking and a lot of different versions still being used.  
> I suspect I should wait until I am back in Japan and try to move 
> everything over in one fell swoop.

Yes. There are three settings which are relevant here: the database 
server encoding, the 'database client' encoding, and the encoding used 
by Lisp when decoding the octets it receives from the 'database client'.

When you create a lexical database you can set the database encoding:

bash install-lexdb.sh jap ~/jap/lexdb.fld ~/jap/lexdb.dfn "-E EUC_JP"

The setup script will use Unicode unless you tell it otherwise.

When you run a database client, the client has its own client encoding. 
This defaults to the database encoding, but can be set in other ways, 
for example via a libpq function or via the environment variable 
PGCLIENTENCODING.
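
As a rough sketch of the environment-variable route from a shell: the 
helper name show_encodings and the database name in the usage line are 
made up for illustration, and the helper assumes psql is on your PATH and 
a server is running. It simply asks the server what it thinks the two 
encodings are.

```shell
# Ask libpq-based clients (psql included) to use EUC_JP as the client encoding.
export PGCLIENTENCODING=EUC_JP

# Hypothetical helper: print the server encoding and the client encoding
# for the database named in $1. Needs psql and a running PostgreSQL server.
show_encodings() {
    db="$1"
    # -A: unaligned output, -t: print values only (no headers or footers)
    psql -At -d "$db" -c 'SHOW server_encoding;'
    psql -At -d "$db" -c 'SHOW client_encoding;'
}
```

Usage would be something like "show_encodings jap", where jap is only an 
assumed database name. If the second value is not the encoding you 
exported, something else is overriding the client encoding.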

When you run the LKB, string conversion (for data obtained from the 
'database client') uses the encoding of *locale* by default. The 
LKB/LexDB code, and also the LKB code to read in grammar files, will use 
this 'default' setting.

So all you need to do is ensure that the client encoding matches the 
encoding used by the LKB/LexDB code. Suppose your database is in Unicode 
(utf8) and the LKB *locale* uses Japanese EUC. You can set the database 
client encoding to EUC by running the following Lisp code (before 
connecting to the lexical database):

(setf (sys:getenv "PGCLIENTENCODING") "EUC_JP")

The Emacs/LexDB interface piggybacks on the LKB Lisp code, so no extra 
setting is necessary.
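
To see why the client encoding and the decoding side have to agree, here 
is a small shell illustration that involves neither PostgreSQL nor the 
LKB. It assumes an iconv command that knows EUC-JP is installed; the 
variable names are just for this example. The same octets are decoded 
once with the matching encoding and once with the wrong one.

```shell
# "日本語" as UTF-8 octets, written as octal escapes so this script is pure ASCII.
utf8_text=$(printf '\346\227\245\346\234\254\350\252\236')

# Matching case: convert to EUC-JP (what a client set to EUC_JP would hand over),
# then decode those octets as EUC-JP again. This recovers the original string.
roundtrip=$(printf '%s' "$utf8_text" | iconv -f UTF-8 -t EUC-JP | iconv -f EUC-JP -t UTF-8)

# Mismatch case: the receiver treats the EUC-JP octets as UTF-8.
# The pipeline's exit status is that of the last iconv, which fails on invalid input.
if printf '%s' "$utf8_text" | iconv -f UTF-8 -t EUC-JP | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
then
    mismatch=accepted
else
    mismatch=rejected
fi

echo "round trip intact: $([ "$roundtrip" = "$utf8_text" ] && echo yes || echo no)"
echo "mismatched decode: $mismatch"
```

iconv is strict and refuses the mismatched input outright. A more lenient 
decoder might instead hand back garbage characters without complaint, 
which is the harder failure to spot.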

That's it. (But I'm considering patching the LKB/LexDB code so that no 
special setting will be necessary in the grammar, whatever encodings you 
use. That is, Unicode will be hard-coded as both the client encoding and 
the string conversion encoding.)

The character sets available in the PostgreSQL universe are here: 
www.postgresql.org/docs/8.0/static/multibyte.html

-Ben