Introduction: Programming : How to Detect and Read UTF-8 Characters in Text Strings ...

The purpose of this instructable is to explain to programmers how to extract UTF-8 characters from text strings when no Unicode library is available. This may help them make their applications UTF-8 compatible.

UTF-8 is a "variable-length character encoding" used to encode characters that are not available in the now outdated ASCII character set (aka "plain text").

With UTF-8, you can encode any character defined in the Unicode standard : accented letters, Japanese syllabaries, Chinese characters, Arabic abjads, mathematical and scientific symbols, etc.

UTF-8 is the most commonly used character encoding standard.
International sites like Wikipedia use it.

note : In this instructable, pseudo-code will be written in a C/C++ dialect, and real sample code will be written in C.

Step 1: Optional Reminder About Text Files and Charsets :

(If you already know how ASCII characters are encoded into text-files, you can skip this step.)

A computer's binary files (pictures, music, executables, etc.) and its text files (.txt files) are the same thing : they're all computer files.

A computer file is a list of bytes.
A byte is formed of 8 bits.
A bit is a fundamental binary (2 state) element. It can be set (contains 1) or unset (contains 0).

By changing the states of the 8 bits of a byte, it's possible to make 256 different combinations.
Each combination forms a binary number.
It is possible to convert binary numbers into decimal numbers.
It is, thus, possible to count in binary :

00000000 (0)
00000001 (1)
00000010 (2)
00000011 (3)
00000100 (4)
00000101 (5)
...
11111100 (252)
11111101 (253)
11111110 (254)
11111111 (255)

Thus, each byte of a computer file contains a numeral value from 00000000 to 11111111 in binary (from 0 to 255 in decimal).

We can then use a byte to store any integer number from 0 to 255.
If we want to store historical dates like 1783 or mathematical values like 1.41421, we are forced to "encode" them using several bytes.
With two bytes, it's possible to store integer numbers between 0 and 65,535.
With 4 bytes, it's possible to encode (with some possible approximation) any real number.
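As an illustration (a minimal C sketch, not part of the original text), here is how a value like 1783, which does not fit into one byte, can be split over two bytes and recombined :

#include <stdio.h>

int main( void )
{
    // 1783 does not fit in a single byte (0 to 255),
    // so we split it into a "high" byte and a "low" byte.
    unsigned int  year = 1783;
    unsigned char high = ( year >> 8 ) & 0xFF;   // 6
    unsigned char low  =   year        & 0xFF;   // 247

    // recombining the two bytes gives the original value back :
    unsigned int decoded = ( (unsigned int)high << 8 ) | low;
    printf( "%d %d -> %u\n", high, low, decoded );   // prints : 6 247 -> 1783
    return 0;
}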

The same goes for text : each character of a string is encoded into a value from 0 to 255, giving, thus, a maximum of 256 different characters.

At the beginning, as computers were mainly a Western technology, 256 possible characters were more than enough : 26 lowercase letters, 26 capital letters, 10 digits, a few punctuation symbols ...
Americans created the ASCII standard (American Standard Code for Information Interchange).
It was widely used (and adapted) in Europe too. It was even extended to contain most of the accented characters widely used in Europe.

Thus, each byte of an ASCII (or plain text) file contains 1 character.
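As a minimal C sketch (my own illustration), this is easy to see : a character and its code are the same byte.

#include <stdio.h>

int main( void )
{
    char c = 'A';
    // in an ASCII text file, the letter 'A' is stored as the single byte 65 (01000001)
    printf( "'%c' is stored as the byte %d\n", c, c );   // prints : 'A' is stored as the byte 65
    return 0;
}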

However, not every country around the world uses the Latin alphabet.
For instance, Russians created their own standard, which was incompatible with the ASCII standard. Greeks created their own standard, which was also incompatible, etc.

For a long time, on the internet, it was very difficult to display several different alphabets together on the same page, because each alphabet needed a different "charset encoding", and only one "charset encoding" per page was easily possible.

International sites like Wikipedia would have been very difficult to make.
The most common trick to display mathematical formulas or Chinese characters on an English page was to display them as pictures ...

People quickly came to the conclusion that 256 characters were not enough, and that every different character and symbol of the world needed to be grouped into a single and universal character set : Unicode.

Step 2: Optional Reminder About Unicode :

(If you already know what Unicode is, you can skip this step.)

Unicode is compatible with the old ASCII standard (this means that the first 128 characters of Unicode have the same codes as those of ASCII), and contains a code for every possible character and symbol of every alphabet, abjad and logographic script of every nation and culture of the world. There are currently about 100,000 different characters.

This means that we need more than 1 byte to store the code of most of them.

With one byte (8 bits), we could encode only the first 256 Unicode characters (which include the 128 ASCII characters).
With two bytes (16 bits), we could encode the first 65,536 Unicode characters.
With four bytes (32 bits), we could encode them all, and even more ...

So, it seems the most universal way to store Unicode-compatible text in computer files would be to use 4 bytes per character.

However, old ASCII text files would become unreadable (as they use only 1 byte per character). And converting them to 4 bytes per character would waste a lot of space (four times more space) ...

That's why various encoding methods were invented to encode Unicode text without wasting too much space, while keeping compatibility with old ASCII files. These encoding methods are named : UTF-7, UTF-8, UTF-16 and UTF-32.

Step 3: What's UTF-8 ?

UTF-8 is a means to encode any Unicode character in the middle of a "traditional" ASCII (plain text) file.

ASCII files need only one byte per character. It's perfect when you only write in English.

However, you may need to write a Chinese character or a mathematical formula in the middle of your text, and UTF-8 makes it possible : when the code of a Unicode character does not fit in a single byte, it is encoded into 2, 3 or 4 bytes.

This encoding tries not to break the old and traditional ASCII encoding.
This means that if you read a UTF-8 text with a text editor that is not UTF-8 compatible, the editor will not crash, nor will the formatting of the text be all messed up.
Instead of displaying a single and correct Unicode character, the incompatible editor will display 2, 3 or 4 extended-ASCII characters (for instance, the two UTF-8 bytes of "é", 0xC3 0xA9, show up as "Ã©" in a Latin-1 editor).

On the other hand, a malformed UTF-8 code may lead to unexpected problems if the UTF-8 compatible text editor has not been correctly coded.

Step 4: Keeping Compatibility With ASCII.

Unicode keeps the compatibility with ASCII.

ASCII characters are encoded from 32 to 127.
Codes from 0 to 31 are control codes mainly used for pagination : tabulation, carriage return, end of string, etc ...
(note : many of those control codes are outdated today.)

Thus, ASCII characters only need 7 of the 8 bits of a byte to be encoded : 00000000 to 01111111 in binary.

This means that the 8th bit of an ASCII code is always set to 0. (Reminder : keep in mind that bits are counted from right to left. The 1st bit is, thus, on the right, and the last one is on the left.)

As, in most programming languages, the 8th bit of a byte is usually used to define the sign (positive or negative) of a signed value, this also means that a signed byte containing an ASCII code will always be positive (8th bit set to 0).

For compatibility purposes, the UTF-8 encoding avoids using positive byte values (0 to 127) inside multi-byte sequences : an incompatible application will interpret a Unicode character encoded into 4 bytes as 4 different extended-ASCII characters, and if control codes (0 to 31) could appear among them, this might lead to various unexpected results.
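To make this concrete, here is a minimal C sketch (my own illustration, the function names are not from this article) showing two equivalent ways to test the 8th bit of a byte, either through its sign or through a bit mask :

#include <stdio.h>

// returns 1 if the byte is a plain ASCII code (8th bit set to 0),
// 0 if it could be part of a UTF-8 multi-byte sequence (8th bit set to 1)
int Is_ASCII_Byte( char c )
{
    signed char s = (signed char)c;   // signed view : ASCII bytes are >= 0
    return ( s >= 0 );
}

int Is_ASCII_Byte_Masked( unsigned char c )
{
    return ( ( c & 0x80 ) == 0 );     // unsigned view : 8th bit must be 0
}

int main( void )
{
    printf( "%d\n", Is_ASCII_Byte( 'A' ) );           // prints 1 : 'A' is plain ASCII
    printf( "%d\n", Is_ASCII_Byte_Masked( 0xC3 ) );   // prints 0 : 0xC3 starts a UTF-8 sequence
    return 0;
}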

Step 5: How Are Unicode Characters "UTF-8 Encoded" ?

Let's read the first byte of a text file :

signed char myByte = Read_A_Byte_From( myFile );


Codes from 0 to 127 are encoded into a single byte.
(from 0x00 to 0x7F)

They are ASCII compatible.
They only need 7 of the 8 bits of a byte : 127 in decimal == 01111111 in binary.
The 8th bit of the byte is, thus, always set to 0.

That's how our UTF-8 compatible application will know that our character is encoded in a single byte.
If our byte is positive (8th bit set to 0), this means that it's an ASCII character.


if ( myByte >= 0 ) return myByte;


Codes greater than 127 are encoded into several bytes.

On the other hand, if our byte is negative, this means that it's probably a UTF-8 encoded character whose code is greater than 127.
This also means that it should be followed by at least one more negative byte.

UTF-8 is designed to encode any Unicode character using as little space as possible.

If it's possible to encode a Unicode character within only 2 bytes, we will not use more than those 2 bytes. We will use 4 bytes only if absolutely required.

We then need a method to determine how many bytes a character is encoded into.
We can extract this information from the first negative byte, by counting how many of its leading (leftmost) bits are set to 1 :

110xxxxx : 2 leading bits set to 1 means our character is encoded into 2 bytes.
We have to read 1 more negative byte.

1110xxxx : 3 leading bits set to 1 means our character is encoded into 3 bytes.
We have to read 2 more negative bytes.

11110xxx : 4 leading bits set to 1 means our character is encoded into 4 bytes.
We have to read 3 more negative bytes.

The following extra negative byte(s) all have the 8th bit set to 1 (negative sign), and the 7th bit set to 0 : 10yyyyyy
If the following extra byte(s) are positive or have their 7th bit set to 1, this means that the UTF-8 encoded character is malformed. Our application absolutely has to detect malformed encodings.
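Here is a minimal C sketch of these rules (the function names are mine, not from the article) : detecting the length of a sequence from its first byte, and checking that a following byte is a valid 10yyyyyy continuation byte :

// returns how many bytes the character occupies (1 to 4),
// or 0 if the leading byte is malformed (e.g. 10xxxxxx)
int UTF8_Sequence_Length( unsigned char lead )
{
    if ( ( lead & 0x80 ) == 0x00 ) return 1;   // 0xxxxxxx : ASCII, single byte
    if ( ( lead & 0xE0 ) == 0xC0 ) return 2;   // 110xxxxx : 2-byte sequence
    if ( ( lead & 0xF0 ) == 0xE0 ) return 3;   // 1110xxxx : 3-byte sequence
    if ( ( lead & 0xF8 ) == 0xF0 ) return 4;   // 11110xxx : 4-byte sequence
    return 0;                                  // malformed leading byte
}

// returns 1 if the byte matches 10yyyyyy, 0 otherwise
int Is_Continuation_Byte( unsigned char b )
{
    return ( ( b & 0xC0 ) == 0x80 );
}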

Codes from 128 to 2047 are encoded into 2 bytes.
(from 0x0080 to 0x07FF)

Characters encoded into two bytes look like this :
110xxxxx, 10yyyyyy

To decode it, we simply have to group our 5 x bits with our 6 y bits : xxxxxyyyyyy


if ( myByte IS LIKE 110xxxxx )
{
    // we're going to read the next byte
    myNextByte = Read_A_Byte_From( myFile );

    if ( ( myNextByte >= 0 ) OR ( myNextByte IS NOT LIKE 10yyyyyy ) )
    {
        // if our next byte is positive
        // or is malformed,
        // this means that our UTF-8 code is malformed,
        // or that it's not a UTF-8 text at all.
        // Maybe it's an extended-ASCII text ?
        // The best we can do, here, is to
        // treat myByte and myNextByte as
        // two extended-ASCII chars.

        // We cancel the reading of myNextByte ...
        Unread_A_Byte_From( myFile );

        // ... and we return myByte as if it was an extended-ASCII char.
        return myByte;
    }

    // If we are here, this means we have
    // a well-formed UTF-8 code.
    // We're going to decode it :
    myUnicode = xxxxx << 6 | yyyyyy;

    // we grouped our 5 x bits and our 6 y bits
    // into myUnicode.
    // We can now return this code :
    return myUnicode;
}


Codes from 2048 to 55295 and from 57344 to 65535 are encoded into 3 bytes.
(from 0x0800 to 0xD7FF, and from 0xE000 to 0xFFFF)

Characters encoded into three bytes look like this :
1110xxxx, 10yyyyyy, 10zzzzzz

Codes from 65536 to 1114111 are encoded into 4 bytes.
(from 0x010000 to 0x10FFFF)

Characters encoded into four bytes look like this :
11110xxx, 10yyyyyy, 10zzzzzz, 10wwwwww
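Assuming the bytes have already been validated, here is a minimal C sketch (my own illustration, the function names are not from the article) of how the x, y, z and w bits are grouped for the 3-byte and 4-byte forms :

// 1110xxxx, 10yyyyyy, 10zzzzzz  ->  xxxxyyyyyyzzzzzz
unsigned int Decode_3_Bytes( unsigned char b0, unsigned char b1, unsigned char b2 )
{
    return ( (unsigned int)( b0 & 0x0F ) << 12 )
         | ( (unsigned int)( b1 & 0x3F ) <<  6 )
         |   (unsigned int)( b2 & 0x3F );
}

// 11110xxx, 10yyyyyy, 10zzzzzz, 10wwwwww  ->  xxxyyyyyyzzzzzzwwwwww
unsigned int Decode_4_Bytes( unsigned char b0, unsigned char b1, unsigned char b2, unsigned char b3 )
{
    return ( (unsigned int)( b0 & 0x07 ) << 18 )
         | ( (unsigned int)( b1 & 0x3F ) << 12 )
         | ( (unsigned int)( b2 & 0x3F ) <<  6 )
         |   (unsigned int)( b3 & 0x3F );
}

For instance, the euro sign "€" (U+20AC) is stored in UTF-8 as the three bytes 0xE2 0x82 0xAC, and Decode_3_Bytes( 0xE2, 0x82, 0xAC ) gives back 0x20AC.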

Step 6: Malformed UTF-8 Codes ...

Malformed UTF-8 codes may lead to various bugs and possible crashes if the compatible application is not appropriately programmed.

You may find malformed UTF-8 codes for various reasons :
- the text is an extended-ASCII one (extended ASCII uses all 256 codes instead of only 128)
- the text is not complete : some bytes are missing ...
- there is a bug in the application that generated the UTF-8 encoded text ...

Your application must detect all of that !
If your UTF-8 application is not appropriately designed, it may be vulnerable to hackers.

When you design your application, you must keep all of that in mind.

Here, as a simple example, is a function to detect UTF-8 encoding and to extract Unicode out of a string of char. (function not tested).


#include <stddef.h>

// utf8 points to a byte of a text string
// Uni  points to a variable which will store the Unicode code point
// the function returns how many bytes have been read
int UTF8_to_Unicode ( char * utf8, unsigned int * Uni )
{
    if ( utf8 == NULL ) return 0;
    if ( Uni  == NULL ) return 0;

    unsigned char * u = (unsigned char *)utf8;

    // U-00000000 - U-0000007F
    // ASCII code ? (8th bit set to 0)
    if ( u[0] < 0x80 ) { *Uni = u[0]; return 1; }

    int len = 0;
    *Uni = 0;

    // U-00000080 - U-000007FF : 110xxxxx 10xxxxxx
    if ( (u[0] & 0xE0) == 0xC0 ) { len = 2; *Uni = u[0] & 0x1F; }
    else
    // U-00000800 - U-0000FFFF : 1110xxxx 10xxxxxx 10xxxxxx
    if ( (u[0] & 0xF0) == 0xE0 ) { len = 3; *Uni = u[0] & 0x0F; }
    else
    // U-00010000 - U-001FFFFF : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    if ( (u[0] & 0xF8) == 0xF0 ) { len = 4; *Uni = u[0] & 0x07; }
    else
    {
        // our UTF-8 character is malformed
        // let's return it as an extended-ASCII char
        *Uni = u[0];
        return 1;
    }

    // we're going to read the following bytes
    int a;
    for ( a = 1; a < len; a++ )
    {
        // every following byte must be a continuation byte : 10yyyyyy
        if ( (u[a] & 0xC0) != 0x80 )
        {
            // our UTF-8 code is malformed ...
            // let's return the first byte as an extended-ASCII char
            *Uni = u[0];
            return 1;
        }

        // so far, everything seems OK.
        // we safely build our Unicode code point
        *Uni = (*Uni << 6) | (u[a] & 0x3F);
    }

    // According to Unicode 5.0,
    // codes in the range 0xD800 to 0xDFFF (surrogates)
    // are not allowed.
    if ( ( *Uni >= 0xD800 ) && ( *Uni <= 0xDFFF ) )
    {
        // In this case, our UTF-8 code was well formed,
        // but its value is forbidden.
        // So, either we break it into extended-ASCII codes,
        // or we display another symbol instead ...
        // We should read the Unicode 5.0 book to
        // know the official recommendations, though ...
        *Uni = '?';
        return 1;
    }

    // it's done !
    // *Uni contains our Unicode code point.
    // we simply return how many bytes
    // it was stored in.
    return len;
}
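As a quick (untested) usage example of the function above, assuming it is in the same file : the letter "é" (U+00E9) is stored in UTF-8 as the two bytes 0xC3 0xA9.

#include <stdio.h>

int main( void )
{
    char text[] = { (char)0xC3, (char)0xA9, 0 };   // "é" in UTF-8, null-terminated
    unsigned int uni = 0;

    int len = UTF8_to_Unicode( text, &uni );
    printf( "read %d byte(s), Unicode U+%04X\n", len, uni );   // read 2 byte(s), Unicode U+00E9
    return 0;
}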


For instance, let's say that you have a UTF-8 encoded string, and that you want to count how many characters it has.
If you use the old and traditional strlen() function, it will return the number of bytes, which is not the number of characters.


(code not tested)
int UTF8_strlen( char * str )
{
    if ( str == NULL ) return 0;

    unsigned int Uni = 0;
    int Len = 0;
    int Cnt = 0;

    while ( *str != 0 )
    {
        Cnt = UTF8_to_Unicode ( str, &Uni );
        if ( ( Cnt == 0 ) || ( Uni == 0 ) ) return Len;
        str += Cnt;
        Len++;
    }

    return Len;
}
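As a quick (untested) usage example, assuming UTF8_strlen() and UTF8_to_Unicode() are in the same file : "é" takes 2 bytes in UTF-8, so the string "héllo" is 6 bytes long but contains only 5 characters.

#include <stdio.h>
#include <string.h>

int main( void )
{
    char text[] = "h\xC3\xA9llo";   // "héllo" encoded in UTF-8

    printf( "strlen      : %u bytes\n",      (unsigned)strlen( text ) );   // prints 6
    printf( "UTF8_strlen : %d characters\n", UTF8_strlen( text ) );        // prints 5
    return 0;
}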

Step 7: Links, Resources and References ...

ASCII table of characters : http://www.asciitable.com/
Unicode official website : http://unicode.org/

Wikipedia article about ASCII : http://en.wikipedia.org/wiki/Ascii
Wikipedia article about Unicode : http://en.wikipedia.org/wiki/Unicode
Wikipedia article about UTF-8 : http://en.wikipedia.org/wiki/UTF-8