UTF-8 is a "variable-length character encoding" used to encode characters that are not available in the now outdated ASCII character set (aka "plain text").
With UTF-8, you can encode any character defined in the Unicode standard: accented letters, Japanese syllabaries, Chinese characters, the Arabic abjad, mathematical and scientific symbols, etc.
UTF-8 is the most commonly used character encoding on the web.
International sites like Wikipedia use it.
Note: In this Instructable, pseudocode is written in a C/C++ dialect, and real sample code is written in C.
Step 1: Optional reminder about text files and charsets
A computer's binary files (pictures, music, executables, etc.) and its text files (.txt files) are the same thing: they're all computer files.
A computer file is a list of bytes.
A byte is formed of 8 bits.
A bit is a fundamental binary (two-state) element. It can be set (contains 1) or unset (contains 0).
By changing the states of the 8 bits of a byte, it's possible to make 256 different combinations.
Each combination forms a binary number.
It is possible to convert binary numbers into decimal numbers.
It is thus possible to count in binary:
00000000 (0)
00000001 (1)
00000010 (2)
00000011 (3)
00000100 (4)
00000101 (5)
...
11111100 (252)
11111101 (253)
11111110 (254)
11111111 (255)
Thus, each byte of a computer file contains a numeric value from 00000000 to 11111111 in binary (from 0 to 255 in decimal).
We can then use a byte to store any integer from 0 to 255.
If we want to store historical dates like 1783 or mathematical values like 1.41421, we are forced to "encode" them using several bytes.
With two bytes, it's possible to store integers between 0 and 65,535.
With 4 bytes, it's possible to encode real numbers over a very wide range, with some approximation (for example, as a 32-bit floating-point number).
The same goes for text: each character of a string is encoded as a value from 0 to 255, which gives a maximum of 256 different characters.
In the beginning, as computers were mainly a Western technology, 256 possible characters were more than enough: 26 lowercase letters, 26 capital letters, 10 digits, a few punctuation symbols ...
Americans created the ASCII standard (American Standard Code for Information Interchange).
It was widely used (and adapted) in Europe too. ASCII itself defines only 128 characters; it was later extended, using the other 128 values of the byte, to include most of the accented characters widely used in Europe.
Thus, each byte of an ASCII (or plain text) file contains one character.
However, not every country in the world uses the Latin alphabet.
For instance, Russians created their own standard for Cyrillic, and Greeks created their own for Greek. Each of these reused the same byte values for different characters, so they were incompatible with the extended ASCII sets used in Western Europe and with each other.
For a long time, on the internet, it was very difficult to display several different alphabets together on the same page, because each alphabet needed a different "charset encoding", and only one "charset encoding" per page was easily possible.
International sites like Wikipedia would have been very difficult to make.
The most common trick to display mathematical formulas or Chinese characters on an English page was to display them as pictures ...
It soon became clear that 256 characters were not enough, and that all the characters and symbols of the world needed to be grouped into a single, universal character set: Unicode.

You're so smart! Nice job!
I wrote it directly in English; the documents I studied are all in English too ...
Wikipedia articles probably have a French version. They could be a source of information ...
Eventually, when I have some spare time to kill, I may think about translating this Instructable.
=oP