2024-01-29

Understanding Character Encoding

If we try to interpret bytes in memory as a string, we have to know the underlying encoding in order to decode them correctly.

It does not make sense to have a string without knowing what encoding it uses.

-- Joel Spolsky
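
The practical consequence: decoding the same bytes with the wrong encoding produces garbled text. A minimal Python sketch (assuming Python 3):

    # The same bytes decode differently depending on the assumed encoding.
    data = "Héllo".encode("utf-8")   # b'H\xc3\xa9llo'
    print(data.decode("utf-8"))      # Héllo
    print(data.decode("latin-1"))    # HÃ©llo (wrong encoding, mojibake)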

Watch: Unicode, in friendly terms: ASCII, UTF-8, code points, character encodings, and more

How characters are encoded

Take the string Hello. The steps are as follows:

  1. Every character is part of a character set defined by Unicode.
  2. Every character (grapheme) in Hello gets a unique identifier called a code point.
  3. Every code point is encoded using an encoding scheme like UTF-8.
  4. The encoded value is then converted into binary and written to storage.

Graphemes --> Code Points --> Encoding --> Binary
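
Each stage can be traced in a few lines of Python. A sketch for the single grapheme H (assuming Python 3):

    ch = "H"                      # grapheme
    cp = ord(ch)                  # code point: 72, i.e. U+0048
    encoded = ch.encode("utf-8")  # encoding: b'H' (0x48)
    print(f"U+{cp:04X}", encoded.hex(), f"{encoded[0]:08b}")
    # U+0048 48 01001000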

Interpreted as Unicode code points, Hello looks like this:

U+0048 U+0065 U+006C U+006C U+006F

Now a character encoding standard can be applied to represent each Unicode character in memory. The most common ones are UTF-8 and UTF-16.

Remark: A character set like Unicode is not the same as a character encoding standard like UTF-8. The character set assigns each character a code point; the encoding defines how that code point is stored as bytes.
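
To make the distinction concrete, here is a small Python sketch (the example character is my own choice): the code point of the euro sign is fixed by the character set, while the bytes differ per encoding.

    ch = "€"                       # code point U+20AC in the Unicode character set
    print(f"U+{ord(ch):04X}")      # U+20AC
    print(ch.encode("utf-8"))      # b'\xe2\x82\xac' (three bytes)
    print(ch.encode("utf-16-be"))  # b' \xac' (two bytes: 0x20 0xAC)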

UTF-8

Takes up one to four bytes per character, depending on the code point. In this example, every character of Hello falls into the ASCII range (U+0000 to U+007F) and can be encoded using a single byte.
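
A quick sketch of the variable width (the sample characters are my own choice, assuming Python 3):

    # UTF-8 grows with the code point: one to four bytes per character.
    for ch in "Aé€😀":
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # A 1, é 2, € 3, 😀 4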

Hex

\x48\x65\x6c\x6c\x6f

Binary

01001000 01100101 01101100 01101100 01101111

The concatenated binary representation is the storage-ready version.
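
Both listings can be reproduced directly (a Python 3 sketch):

    encoded = "Hello".encode("utf-8")
    print(encoded.hex())                          # 48656c6c6f
    print(" ".join(f"{b:08b}" for b in encoded))
    # 01001000 01100101 01101100 01101100 01101111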

UTF-16

Takes up two bytes (one 16-bit code unit) for most characters and four bytes for characters outside the Basic Multilingual Plane. Encoding Hello, every character fits into a single 16-bit code unit.

Hex

\u0048\u0065\u006c\u006c\u006f

Binary

01001000 00000000 01100101 00000000 01101100 00000000 01101100 00000000 01101111 00000000

Note that the listing above is in little-endian byte order: the low byte of each 16-bit code unit comes first, followed by its zero high byte.
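
The same bytes can be produced in Python (a sketch; utf-16-le is chosen to match the little-endian listing and to avoid a prepended byte order mark):

    encoded = "Hello".encode("utf-16-le")
    print(encoded.hex())                          # 480065006c006c006f00
    print(" ".join(f"{b:08b}" for b in encoded))
    # 01001000 00000000 01100101 00000000 01101100 00000000
    # 01101100 00000000 01101111 00000000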