What is Unicode

Traditionally, character encodings use 8 bits per character, and are thus limited to 256 characters. This causes problems because:

  1. it's not enough for some languages;

  2. people whose languages use different encodings have to choose a single encoding, and have to switch the system's state when changing language, which makes it difficult to mix several languages in the same file;

  3. etc...

Thus the UCS (Universal Character Set), also known as Unicode, was created to handle and mix all of the world's scripts. It is standardised by ISO as the 10646-1 standard. Its canonical form is a 32-bit (4-byte) encoding, known as UCS4 because of the size of its characters. The most widely used characters of UCS are contained in UCS2, the 16-bit subset of UCS; this is the subset used by the Linux console.
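
As a rough illustration of these fixed-width forms, the sketch below encodes a three-character string both ways. It assumes Python 3 and uses its "utf-32-be" and "utf-16-be" codecs as stand-ins for UCS4 and UCS2; that substitution only holds for characters inside the 16-bit subset, and the variable names are made up for the example.

```python
# Three characters from the 16-bit subset: "A", e-acute, and the euro sign.
text = "A\u00e9\u20ac"

# Assumption: "utf-32-be" stands in for UCS4 (4 bytes per character).
ucs4 = text.encode("utf-32-be")

# Assumption: "utf-16-be" stands in for UCS2 (2 bytes per character);
# the two only coincide for characters that fit in 16 bits.
ucs2 = text.encode("utf-16-be")

print(len(text), len(ucs4), len(ucs2))
print(ucs4.hex(" ", 4))  # one group of 4 bytes per character
print(ucs2.hex(" ", 2))  # one group of 2 bytes per character
```

Each character occupies the same number of bytes whatever its value, which is simple to index but wasteful for text that is mostly ASCII.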

For convenience, the UTF-8 encoding was designed as a variable-length encoding that is backward-compatible with ASCII: every character that has a UCS4 encoding can be expressed as a UTF-8 sequence, and vice versa.
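
The variable-length scheme can be sketched as follows, taking UTF-8 (see utf-8(7)) as the concrete form. This is a minimal illustration in Python 3, not a reference implementation: the function name utf8_encode is hypothetical, the range is assumed to stop at U+10FFFF as in the current UTF-8 definition, and surrogate values are not rejected.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one character value as a UTF-8 byte sequence (1 to 4 bytes)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("character value out of range")
    if code_point < 0x80:
        # ASCII range: a single byte, identical to the ASCII code.
        return bytes([code_point])
    if code_point < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    encoded = utf8_encode(cp)
    # Round trip: decoding the sequence gives back the same character.
    assert ord(encoded.decode("utf-8")) == cp
    print(f"U+{cp:04X} -> {encoded.hex(' ')}")
```

ASCII characters keep their one-byte form, which is what makes the encoding backward-compatible; every other character becomes a sequence of bytes that all have the high bit set.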

The Unicode Consortium defines additional properties for UCS2 characters, which are also known as Unicode characters.

See: unicode(7), utf-8(7).