Multi-charset (Cyrillic) support for Internet servers

by Eugene G. Crosser

For those who do not know: different computer platforms use different 8-bit character sets for Cyrillic. This creates the problem of presenting the same textual data in different charsets to network users on different computers.

Many people prefer to implement multi-charset support for Internet services (such as Mail, News and WWW) with some kind of "proxy" server. One example is the "cyrproxy" proxy server by Alex Tutubalin, which is available from ftp://ftp.lexa.ru/pub/domestic/lexa/. The idea sounds very good, because this approach does not require any modification of the server code. But it also has at least two serious drawbacks:

  1. Data encryption cannot be properly supported. This becomes important with the growing use of SSL.

  2. The server cannot properly do authentication and accounting based on the client IP address, because all incoming connections originate from the same host where the proxy resides.

That is why the author took another approach: modifying the server code to incorporate character set translation functionality.

Libmcs

All the actual translation and client recognition work is done in a fairly universal toolkit, libmcs-2.03.tar.gz, which implements dynamic translation table loading and character translation.

(See also older versions: libmcs-2.02.tar.gz, libmcs-2.1.tar.gz and libtrans.tar.gz)

The basic concept (proposed by Igor V. Semenyuk <iga@sovam.com>) is to set up multiple IP interfaces on the same machine (this is possible with all modern Unices, both BSD and SVR4 flavours), one interface for each supported codetable. These interfaces are assigned mnemonic names corresponding to the character set they support, e.g. koi.smtp.online.ru, win.smtp.online.ru, etc.

The server code is modified as follows: when an incoming connection is accepted with the accept(2) call, the function settrtab_bysocket() from the libtrans package should be called, with the accept()ed socket file descriptor as the parameter. settrtab_bysocket() determines the IP address of the local end of the connection with getsockname(2), and resolves it to a host name with gethostbyaddr(3). Then it tries to load a translation table from a file named after that hostname. If there is no such file (i.e. the hostname does not match any charset mnemonic name), a "no-translation" table is set up. The address of the table is stored in static memory, so that subsequent calls to the TR_IN() and TR_OUT() macros perform the character conversion appropriate for the particular charset.

Then, every character received from the client should be passed through the TR_IN() macro, and every character to be written to the client - through the TR_OUT() macro.
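
To make the table mechanism concrete, here is a small sketch of how such per-byte translation could work. All names (struct trtab, trtab_load(), MY_TR_IN(), MY_TR_OUT()) are hypothetical and are not the libmcs definitions, and the file layout (256 "in" bytes followed by 256 "out" bytes) is an assumption made purely for illustration.

```c
#include <stdio.h>

/* Hypothetical table type: one 256-entry lookup per direction. */
struct trtab {
    unsigned char in[256];   /* client charset -> server charset */
    unsigned char out[256];  /* server charset -> client charset */
};

/* Plays the role of the table kept in static memory. */
static struct trtab current_tab;

/* Set up a "no-translation" table: every byte maps to itself. */
void trtab_identity(struct trtab *t)
{
    for (int i = 0; i < 256; i++) {
        t->in[i] = (unsigned char)i;
        t->out[i] = (unsigned char)i;
    }
}

/*
 * Load a table from the file at path (assumed layout: 256 "in" bytes,
 * then 256 "out" bytes). If the file is missing or too short, fall back
 * to the identity table. Returns 1 if a table was loaded, 0 otherwise.
 */
int trtab_load(struct trtab *t, const char *path)
{
    struct trtab tmp;
    int ok = 0;
    FILE *f = fopen(path, "rb");

    if (f != NULL) {
        ok = fread(tmp.in, 1, 256, f) == 256 &&
             fread(tmp.out, 1, 256, f) == 256;
        fclose(f);
    }
    if (ok)
        *t = tmp;
    else
        trtab_identity(t);
    return ok;
}

/* Per-character conversion, in the spirit of TR_IN() and TR_OUT(). */
#define MY_TR_IN(c)  (current_tab.in[(unsigned char)(c)])
#define MY_TR_OUT(c) (current_tab.out[(unsigned char)(c)])
```

With this shape, a server's read loop would apply MY_TR_IN() to each byte taken from the client, and its write loop would apply MY_TR_OUT() to each byte sent back. Because a missing file yields the identity table, the same code path serves the non-translating interface.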

There is also a "low-level" function, settrtab_byname(), which can be called when the server has already determined its local domain name, to avoid redundant calls to getsockname() and gethostbyaddr(). It can also be useful if there is another way to determine the desired character set, e.g. if the client can specify it explicitly, as with the "Accept-Charset:" header in the HTTP protocol.

Ilya Etingof developed a Solaris STREAMS module that uses libmcs, which can be found here: http://www.glas.net/~ilya/software/osc.html.


WWW

As a modern application-level Internet protocol, HTTP is aware of character sets. There is a header, "Accept-Charset:", that is specially designed to tell the server which charsets the client is able to handle and which of them it prefers most.

Unfortunately, not all WWW clients are aware of this header. If "Accept-Charset:" is not available, one could try to guess the codetable the client uses from the "User-Agent:" header. In most cases this header contains, in addition to the browser name, an operating system identification string.

But again, this may not be enough information. There are cases when different charsets are used on the same operating system. E.g. some MS/Windows® users install koi8-r fonts as a kludge to read text from long-running WWW servers that only support the koi8-r codetable for Cyrillic. Another example is a UNIX system using ISO-8859-5 instead of the common koi8-r.

To deal with all these cases, the following solution is implemented: several virtual hosts are created on the same document tree, one for each supported codetable and one "generic". Normally, users come to the generic virtual host. The codetable they need is determined from the "Accept-Charset:" header or, if it is not available, from the "User-Agent:" header. In most cases, the user will get what she needs. If not, she has the option to request the same document from a specific virtual host, making the server use the codetable that she explicitly specifies via the host name.
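
The decision order described above can be sketched as follows. This is only an illustration, not the code from the Apache patches: the function name choose_codetable(), the "koi."/"win."/"iso." host name prefixes, the rule that a "Win" substring in "User-Agent:" means windows-1251, and the koi8-r default are all assumptions. The sketch also ignores q-values in "Accept-Charset:" and simply takes the first known charset listed.

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>   /* strncasecmp */

/* Case-insensitive substring search; returns NULL if needle is absent. */
static const char *find_nocase(const char *hay, const char *needle)
{
    size_t n = strlen(needle);

    for (; *hay != '\0'; hay++)
        if (strncasecmp(hay, needle, n) == 0)
            return hay;
    return NULL;
}

/*
 * Hypothetical selection routine. Any argument may be NULL when the
 * corresponding request header is absent.
 */
const char *choose_codetable(const char *host,
                             const char *accept_charset,
                             const char *user_agent)
{
    static const char *const known[] = {
        "koi8-r", "windows-1251", "iso-8859-5"
    };

    /* 1. A charset-specific virtual host is an explicit user choice. */
    if (host != NULL) {
        if (strncasecmp(host, "koi.", 4) == 0) return "koi8-r";
        if (strncasecmp(host, "win.", 4) == 0) return "windows-1251";
        if (strncasecmp(host, "iso.", 4) == 0) return "iso-8859-5";
    }

    /* 2. Generic host: take the known charset listed first. */
    if (accept_charset != NULL) {
        const char *best = NULL, *best_pos = NULL;

        for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
            const char *pos = find_nocase(accept_charset, known[i]);
            if (pos != NULL && (best_pos == NULL || pos < best_pos)) {
                best_pos = pos;
                best = known[i];
            }
        }
        if (best != NULL)
            return best;
    }

    /* 3. Otherwise guess from the operating system in User-Agent. */
    if (user_agent != NULL && find_nocase(user_agent, "Win") != NULL)
        return "windows-1251";

    /* 4. Assumed default when nothing else matches. */
    return "koi8-r";
}
```

Checking the host name first reflects the point made in the text: the specific virtual hosts exist so that the user can override a wrong guess.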

Here are the patches for the Apache server (both SSL and non-SSL) that implement the described approach:


SMTP

Although the successors of the POP protocol have commands to post new mail from the client, most POP/IMAP client programs use SMTP for that. Thus, in addition to converting messages that the user receives over POP/IMAP, it is necessary to convert messages that are received over SMTP. So, in addition to a normal non-translating SMTP server, we establish additional virtual servers with domain names corresponding to the character table mnemonic names. Clients should use those virtual hosts as SMTP servers to send mail.

Here are patches to smtpserver and smtp transport of Zmailer 2.99.27 (smtp transport should be patched to avoid converting 8bit subjects to quoted-printable in outgoing mail).

And here are patches to smtpserver of Zmailer 2.99.38, 2.99.44 and 2.99.47 (no need to patch smtp transport, just use "-8H" options).

Starting from version 2.99.48, on-the-fly translation code is included into mainstream Zmailer release. See the file smtpserver/README.translation in Zmailer distribution for description.

Of course, after applying the patch, you need to modify the Makefile (by hand) to add the appropriate "-I" directive to CFLAGS and "-lmcs" to LDFLAGS.


POP/IMAP

Changes to the IMAP/POP server are analogous to those for the SMTP server. When a client comes to a virtual server whose IP address is back-resolved to a valid codetable mnemonic name, she will receive mail translated to that encoding. Here are patches for the UW imap-4.BETA and imap-4.1.BETA servers. To make things compile, you also need to modify the section in c-client/Makefile for your platform, adding "-I/where/is/libtrans.h" and "-lmcs" where appropriate.

News

Here is an old patch for nnrpd from INN-1.4, and a more recent patch for nnrpd from INN-2.2.1. You need to configure INN like this:

LIBS="-L/usr/local/lib -lmcs" INCLUDES="-I/usr/local/include" ./configure ...

(assuming that you have libmcs installed in /usr/local)


average.org