Programming : how to detect and read UTF-8 characters in text strings ...

The purpose of this instructable is to explain to programmers how to extract UTF-8 characters from text strings when no Unicode library is available. This may help them make their applications UTF-8 compatible.

UTF-8 is a "variable-length character encoding": each character takes between 1 and 4 bytes. It is used to encode characters that are not available in the older ASCII character set (aka "plain text").

With UTF-8, you can encode any character defined in the Unicode standard: accented letters, Japanese syllabaries, Chinese characters, the Arabic abjad, mathematical and scientific symbols, etc.

UTF-8 is the most commonly used character encoding standard.
International sites like Wikipedia use it.

Note: in this instructable, pseudo-code is written in a C/C++ dialect, and real sample code is written in C.
 
 

Step 1: Optional reminder about text files and charsets
(If you already know how ASCII characters are encoded into text-files, you can skip this step.)

A computer's binary files (pictures, music, executables, etc.) and its text files (.txt files) are fundamentally the same thing: they are all computer files.

A computer file is a list of bytes.
A byte is formed of 8 bits.
A bit is a fundamental binary (2 state) element. It can be set (contains 1) or unset (contains 0).

By changing the states of the 8 bits of a byte, it's possible to make 256 different combinations.
Each combination forms a binary number.
It is possible to convert binary numbers into decimal numbers.
It is, thus, possible to count in binary :

00000000 (0)
00000001 (1)
00000010 (2)
00000011 (3)
00000100 (4)
00000101 (5)
...
11111100 (252)
11111101 (253)
11111110 (254)
11111111 (255)

Thus, each byte of a computer file contains a numeric value from 00000000 to 11111111 in binary (from 0 to 255 in decimal).

We can therefore use a single byte to store any integer from 0 to 255.
If we want to store historical dates like 1783 or mathematical values like 1.41421, we are forced to "encode" them using several bytes.
With two bytes, it's possible to store integers between 0 and 65,535.
With 4 bytes, it's possible to encode real numbers within a limited range, possibly with some approximation (this is what a typical floating-point format does).

The same goes for text: each character of a string is encoded as a value from 0 to 255, giving a maximum of 256 different characters.

In the beginning, as computers were mainly a Western technology, 256 possible characters were more than enough: 26 lowercase letters, 26 capital letters, 10 digits, a few punctuation symbols ...
Americans created the ASCII standard (American Standard Code for Information Interchange), which defines 128 characters (values 0 to 127).
It was widely used (and adapted) in Europe too. It was even extended, using the remaining values 128 to 255, to contain most of the accented characters widely used in Europe.

Thus, each byte of an ASCII (or plain text) file contains one character.

However, not every country in the world uses the Latin alphabet.
For instance, Russians created their own standard, which was incompatible with the extended character sets used in Western Europe. Greeks created their own standard, which was incompatible as well, and so on.

For a long time, on the internet, it was very difficult to display several different alphabets together on the same page, because each alphabet needed a different "charset encoding", and only one "charset encoding" per page was easily possible.

International sites like Wikipedia would have been very difficult to make.
The most common trick to display mathematical formulas or Chinese characters on an English page was to display them as pictures ...

It soon became clear that 256 characters were not enough, and that all the characters and symbols of the world needed to be grouped into a single, universal character set: Unicode.

Author: chooseausername