FireFly Media Server

03/10/2006 at 10:01 PM #6577

Participant

(I shouldn’t write this, I shouldn’t write this… 😉 )

The locales “de_DE.UTF-8” and “en_GB.UTF-8” are not the same. The character encoding is the same (UTF-8), but the language and country differ, and that makes programs behave differently.

ISO-8859-1 and Windows-1252 are not the same.
IBM code pages 850 and 437 are not the same.
In Japan they usually use Shift-JIS, which has the same kind of issues with ASCII (and Latin-X).
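
One way to see the difference is to decode the same byte under several encodings. This is only a small Python sketch: the two byte values are ones I picked for illustration, the helper name decode_byte is made up, and the codec names are the ones Python's standard library uses.

```python
# Decode the same single byte under several 8-bit encodings.
# The byte values (0xE5, 0x80) are just illustrative picks.
samples = {
    0xE5: ["iso-8859-1", "cp1252", "cp437", "cp850"],
    0x80: ["iso-8859-1", "cp1252"],
}


def decode_byte(value, encoding):
    """Return the character that one byte maps to in the given encoding."""
    return bytes([value]).decode(encoding)


results = {
    (value, enc): decode_byte(value, enc)
    for value, encodings in samples.items()
    for enc in encodings
}

for (value, enc), char in results.items():
    # repr() keeps control characters visible instead of printing them raw.
    print(f"0x{value:02X} in {enc:<10} -> U+{ord(char):04X} {char!r}")
```

Byte 0xE5 is the same letter in ISO-8859-1 and Windows-1252 but something else in the two IBM code pages, while byte 0x80 is where ISO-8859-1 and Windows-1252 part ways.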

There is no single character encoding that is “German”; there are many (as for any language). Note that many countries use several scripts, like three in Japan.

On servers I usually use the locale “C”, and on desktops I use “sv_SE.UTF-8”.

In Unix (and Linux) you can set your locale when you log in or when you start a program.
The locale tells the program which language/country you want it to use AND which character encoding (code page in MS Windows lingo) you want it to use. You usually set the variable LANG to your preferred locale. (The locale ‘C’ is special and usually means ASCII and American English. It is used in scripts etc.: no conversion, and characters are ordered by their ASCII code.)
Locales should make the program sort and compare strings the right way (in French, ‘é’, ‘è’ and ‘e’ count as the same letter at the first level of comparison), write times and dates in the right format, etc.
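
To make the ordering part concrete, here is a minimal Python sketch using the standard locale module. It only selects the “C” locale, since that one always exists; the word list is made up for the example.

```python
import locale
import os

# LANG is the variable usually used to pick a locale; fall back to "C".
requested = os.environ.get("LANG", "C")

# The "C" locale is always available, so it is safe to select explicitly.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["b", "A", "a", "B"]

# locale.strxfrm turns a string into a sort key for the current LC_COLLATE.
c_sorted = sorted(words, key=locale.strxfrm)

print("LANG requested:", requested)
print("sorted in C locale:", c_sorted)
```

In the “C” locale the upper-case letters sort before the lower-case ones, because that is their ASCII order. With another locale (say “sv_SE.UTF-8”, if it is installed) the same sorted() call would follow that language's rules instead; setlocale raises locale.Error when the locale is not installed.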

With this you can have a different locale for each and every program that you run. You can have one Mozilla talking Swedish, and the next time you run it, it could talk German to you.

Anyway, a locale name has three parts. Take “sv_DE.ISO-8859-1” as an example.

“sv” says that you want the program to talk Swedish (if it can).
“DE” says that you want the program to use German dates, currencies etc.
“ISO-8859-1” says which character encoding you want the program to use.
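
The three parts can be pulled apart with a few lines of Python. This is a hypothetical helper (split_locale and LocaleParts are names I made up), and it ignores the optional “@modifier” suffix that some locale names carry.

```python
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]


def split_locale(name: str) -> LocaleParts:
    """Split 'language_TERRITORY.codeset' into its parts.

    Hypothetical helper for illustration; "@modifier" suffixes are not handled.
    """
    rest, _, codeset = name.partition(".")
    language, _, territory = rest.partition("_")
    return LocaleParts(language, territory or None, codeset or None)


print(split_locale("sv_DE.ISO-8859-1"))
print(split_locale("C"))
```

A name like “C” has only the first part, so the other two come back as None.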

There are lots of encodings, like CCRDude wrote. ASCII is an old one that the Americans used; it's also called US-ASCII, and it is the US variant of ISO-646. (EBCDIC is another old encoding, an 8-bit one used mainly by IBM.) ASCII encodes each character in 7 bits. There are national variants of ISO-646 where some characters, like {[]}|@$, are usually replaced by national symbols (like åäöÅÄÖ for Swedish in ISO-646-SE). Still only 7 bits though.

Later ASCII was extended to 8 bits in the ISO-8859-X family, also known as Latin-X (the parts are numbered 1 to 16). Still not all characters at the same time, but at least programmers could have both ‘[’ and ‘Å’ in the same file if they used ISO-8859-1 😉

So each program talks to you in your language and character encoding (code page). The next person who logs in on the same machine can have another locale.

But that still doesn't give you all human characters in one document. So then came two standards, Unicode and ISO-10646 (or UCS). Unicode started with 16 bits for each character, but now Unicode and ISO-10646 code points go up to U+10FFFF (21 bits), which are typically stored in 32 bits. They also use the same encodings and are mostly equal. With those two, we can code all characters used in human languages. Like ‘A’, ‘þ’, ‘Å’ and ‘Å’ (yes, ‘Å’ and ‘Å’ are different: one is the letter “A with ring” (U+00C5), the other is the unit sign “Ångström” (U+212B), but I used Latin 1 here though :oops:). In this way it’s possible to convert and still keep the same semantics. There are also combining characters in Unicode (and ISO-10646), so the single character ‘Å’ can also be written as the character ‘A’ followed by ‘combining ring above’ (U+030A). So Unicode is not as simple as ISO-8859-1 or ASCII. But human characters are not simple either.
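
The three ways of writing ‘Å’ can be checked with Python's standard unicodedata module. A small sketch, with variable names that are my own:

```python
import unicodedata

a_ring = "\u00c5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"    # ANGSTROM SIGN
combined = "A\u030a"   # 'A' followed by COMBINING RING ABOVE

for text in (a_ring, angstrom, combined):
    # Show how many code points each spelling uses and what they are called.
    names = [unicodedata.name(ch) for ch in text]
    print(len(text), names)

# Normalization form C maps all three spellings to the same single code point.
nfc = [unicodedata.normalize("NFC", t) for t in (a_ring, angstrom, combined)]
print(all(t == a_ring for t in nfc))
```

The three strings compare as different until you normalize them, which is why programs that compare text usually normalize first.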

Anyway, you don’t always need all those bits to encode a C program. You mostly only need 7-bit ASCII. To save space (and make old programs kind of work), there is an encoding of Unicode called UTF-8. For ordinary ASCII characters you only need one 8-bit byte, with the 8th bit set to 0. If the 8th bit is 1, the byte is part of a multi-byte sequence, and the following bytes supply more bits of the Unicode code point. One character can take up to four bytes this way in UTF-8. There are also UTF-16 and UTF-32 (also called UCS-4).

So UTF-8 is a multi-byte way of representing Unicode (or ISO-10646). If you use UTF-32, you don’t need to handle variable-width code points (combining characters are still separate code points, though).
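
To see the variable width, here is a small Python sketch that encodes a few characters in the three forms. The sample characters and the helper name widths are my own picks; the “-be” codec variants are used so that no byte order mark is added to the output.

```python
# Sample characters: 'A', 'Å' (U+00C5), '€' (U+20AC) and U+1F609 (a smiley).
samples = ["A", "\u00c5", "\u20ac", "\U0001F609"]


def widths(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )


for ch in samples:
    utf8 = ch.encode("utf-8")
    # The top bit of the first byte is 0 only for plain ASCII characters.
    high_bit = utf8[0] >> 7
    print(f"U+{ord(ch):04X} utf-8={utf8.hex()} high_bit={high_bit} widths={widths(ch)}")
```

UTF-8 goes from one to four bytes, UTF-16 uses two or four, and UTF-32 always uses four.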

Short summary: read a book about Unicode, it’s fun…

Wikipedia:
ISO-646: http://en.wikipedia.org/wiki/ISO_646
ISO-8859: http://en.wikipedia.org/wiki/ISO_8859
ISO-10646: http://en.wikipedia.org/wiki/Universal_character_set
Unicode: http://en.wikipedia.org/wiki/Unicode
Shift-JIS: http://en.wikipedia.org/wiki/Shift-JIS
Web pages:
Unicode: http://unicode.org/
Unicode FAQ: http://www.unicode.org/faq/
Free ISO-standards: http://isotc.iso.org/livelink/livelink/fetch/2000/2489/Ittf_Home/PubliclyAvailableStandards.htm
Character encodings: http://czyborra.com/charsets/