Selecting the character set for your server

All data in your server is stored as numeric codes. For example, the letter “a” is encoded as 97 (decimal). A character set is a specific collection of characters (including alphabetic and numeric characters, symbols, and nonprinting control characters) and their assigned numerical values, or codes. A character set generally contains the characters for an alphabet (for example, the Latin alphabet used in the English language) or for a script such as Cyrillic, which is used with languages such as Russian, Serbian, and Bulgarian. Character sets that are platform-specific and support a subset of languages, for example, the Western European languages, are called native or national character sets. All character sets that come with Adaptive Server, except for Unicode UTF-8, are native character sets.
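As a small illustration of the character-to-code mapping described above, the following Python sketch prints the code assigned to “a” and shows that an accented letter can have a different code in each character set. The codec names used here (latin_1, cp437, utf-8) are Python's names for ISO 8859-1, CP 437, and UTF-8; they are assumptions made for the example and are not Adaptive Server character set names.

```python
# Illustrative only: Python codecs stand in for the server's character sets.
letter_code = ord("a")  # numeric code assigned to the letter "a"

# The same character can be assigned a different code in each character set.
e_acute = {name: "é".encode(name) for name in ("latin_1", "cp437", "utf-8")}

print(letter_code)
for name, raw in e_acute.items():
    print(f"{name:8} {raw.hex()}")
```

The letter “a” has the same code in all three, because it falls in the range that the character sets share; “é” does not.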

A script is a writing system, a collection of all the elements that characterize the written form of a human language—for example, Latin, Japanese, or Arabic. Depending on the languages supported by an alphabet or script, a character set can support one or more languages. For example, the Latin alphabet supports the languages of Western Europe (see Group 1 in Table 7-1). On the other hand, the Japanese script supports only one language, Japanese. Therefore, the Group 1 character sets support multiple languages, while many character sets, such as those in Group 101, support only one language.

The language or languages covered by a character set are called a language group. A language group can contain many languages or only one language; a native character set is the platform-specific encoding of the characters for the language or languages of a particular language group.

Within a client/server network, you can support data processing in multiple languages if all the languages belong to the same language group (see Table 7-1). For example, if data in the server is encoded in a Group 1 character set, you could have French, German, and Italian data and any of the other Group 1 languages in the same database. However, you cannot store data from another language group in the same database. For example, you cannot store Japanese data with French or German data.

Unlike the native character sets just described, Unicode is an international character set that supports over 650 of the world’s languages, such as Japanese, Chinese, Russian, French, and German. Unicode allows you to mix different languages from different language groups in the same server, no matter what the platform.
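The restriction on mixing language groups, and the way Unicode lifts it, can be sketched outside the server. The Python example below is only an approximation under stated assumptions: the helper can_encode is hypothetical, and Python's cp1252, cp932, and utf-8 codecs stand in for a Group 1 character set, a Group 101 character set, and Unicode UTF-8. It does not show how Adaptive Server itself validates data.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in charset (hypothetical helper)."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


french = "déjà vu"
japanese = "日本語"
mixed = french + " / " + japanese

# Columns: charset, French only, Japanese only, French and Japanese together.
for charset in ("cp1252", "cp932", "utf-8"):
    print(charset,
          can_encode(french, charset),
          can_encode(japanese, charset),
          can_encode(mixed, charset))
```

Each native character set handles the text of its own language group, but only UTF-8 can represent the mixed string.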

Because all character sets support the Latin script, and therefore English, a character set outside Group 1 always supports at least two languages: English and at least one other language.

Many languages are supported by more than one character set. The character set you install for a language depends on the client’s platform and operating system.

Adaptive Server supports the following languages and character sets:

Table 7-1: Supported languages and character sets

| Language group | Languages | Character sets |
| --- | --- | --- |
| Group 1 | Western European: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP 437, CP 850, CP 860, CP 863, CP 1252 (a), ISO 8859-1, ISO 8859-15, Macintosh Roman, ROMAN8 |
| Group 2 | Eastern European: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene (and English) | CP 852, CP 1250, ISO 8859-2, Macintosh Central European |
| Group 4 | Baltic (and English) | CP 1257 |
| Group 5 | Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian (and English) | CP 855, CP 866, CP 1251, ISO 8859-5, Koi8, Macintosh Cyrillic |
| Group 6 | Arabic (and English) | CP 864, CP 1256, ISO 8859-6 |
| Group 7 | Greek (and English) | CP 869, CP 1253, GREEK8, ISO 8859-7, Macintosh Greek |
| Group 8 | Hebrew (and English) | CP 1255, ISO 8859-8 |
| Group 9 | Turkish (and English) | CP 857, CP 1254, ISO 8859-9, Macintosh Turkish, TURKISH8 |
| Group 101 | Japanese (and English) | CP 932, DEC Kanji, EUC-JIS, Shift-JIS |
| Group 102 | Simplified Chinese (PRC) (and English) | CP 936, EUC-GB |
| Group 103 | Traditional Chinese (ROC) (and English) | Big 5, CP 950 (b), EUC-CNS |
| Group 104 | Korean (and English) | EUC-KSC |
| Group 105 | Thai (and English) | CP 874, TIS 620 |
| Group 106 | Vietnamese (and English) | CP 1258 |
| Unicode | Over 650 languages | UTF-8 |

a. CP 1252 is identical to ISO 8859-1 except for the code points 0x80–0x9F, which CP 1252 maps to additional characters.

b. CP 950 is identical to Big 5.
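The relationship described in footnote a can be checked with a short Python sketch. It assumes that Python's latin_1 and cp1252 codecs are faithful stand-ins for ISO 8859-1 and CP 1252; the helper decode_byte is hypothetical. Python's cp1252 codec leaves a few bytes in the range undefined, so the sketch substitutes U+FFFD for those rather than raising an error.

```python
def decode_byte(value: int, charset: str) -> str:
    """Decode a single byte; undefined bytes become U+FFFD (hypothetical helper)."""
    return bytes([value]).decode(charset, errors="replace")


# Collect every byte value that the two character sets interpret differently.
differing = [b for b in range(256)
             if decode_byte(b, "latin_1") != decode_byte(b, "cp1252")]

print(f"0x{differing[0]:02X}-0x{differing[-1]:02X}: "
      f"{len(differing)} code points differ")
```

Every differing byte value falls in 0x80–0x9F, and all other code points decode identically in both character sets.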

Note: The English language is supported by all character sets because the first 128 (decimal) characters of any character set include the Latin alphabet (defined as “ASCII-7”). The characters beyond the first 128 differ between character sets and are used to support the characters in different native languages. For example, code points 0-127 of CP 932 and CP 874 both support English and the Latin alphabet. However, code points 128-255 support Japanese characters in CP 932 and Thai characters in CP 874.
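The note above can be illustrated with Python's cp932 and cp874 codecs, which are assumed here to match the CP 932 and CP 874 character sets. The byte 0xB1 is chosen because CP 932 treats it as a single-byte character; many other values above 127 in CP 932 are the first byte of a two-byte character, so they cannot be decoded on their own.

```python
# Code points 0-127 decode to the same characters in both character sets.
ascii_range = bytes(range(128))
shared = (ascii_range.decode("cp932")
          == ascii_range.decode("cp874")
          == ascii_range.decode("ascii"))

# The same byte above 127 means something different in each character set.
sample = b"\xb1"
as_japanese = sample.decode("cp932")  # halfwidth katakana letter
as_thai = sample.decode("cp874")      # Thai letter

print(shared, as_japanese, as_thai)
```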

The following character sets support the European currency symbol, the “euro”: CP 1252 (Western Europe); CP 1250 (Eastern Europe); CP 1251 (Cyrillic); CP 1256 (Arabic); CP 1253 (Greek); CP 1255 (Hebrew); CP 1254 (Turkish); CP 874 (Thai); and Unicode UTF-8.
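A quick way to see which character sets contain the euro symbol is to try encoding it. The sketch below uses Python codec names as assumed equivalents of the character sets listed above; the helper supports_euro is hypothetical, and the result reflects Python's codec tables, not a query against Adaptive Server.

```python
EURO = "\u20ac"  # the euro sign

# Python codec names assumed to correspond to the character sets listed above.
listed = ["cp1252", "cp1250", "cp1251", "cp1256", "cp1253",
          "cp1255", "cp1254", "cp874", "utf-8"]


def supports_euro(charset: str) -> bool:
    """Return True if the euro sign can be encoded in charset (hypothetical helper)."""
    try:
        EURO.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Byte value(s) that each listed character set assigns to the euro sign.
euro_bytes = {cs: EURO.encode(cs).hex() for cs in listed}

for cs, raw in euro_bytes.items():
    print(f"{cs:8} {raw}")
print("latin_1 ", supports_euro("latin_1"))  # ISO 8859-1 has no euro sign
```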

To mix languages from different language groups you must use Unicode. If your server character set is Unicode, you can support more than 650 languages in a single server and mix languages from any language group.