Introducing UTF-21, a toy character encoding

In short: I created UTF-21, an impractical alternative to character encodings like UTF-8.

Quick crash course: character encoding & Unicode

Before you can understand my horrible creation, you need to understand a little about Unicode. You can skip this if you want.

Each character has a number

Character encoding is the process of converting characters to numbers and back, typically for digital storage and transmission.

You’ve probably heard of ASCII, which maps 128 characters to numbers. For example, W is number 87 and number 36 is $.

As you might expect, there are more than 128 characters in the world. Characters like ñ and 🥺 can’t be represented as ASCII.

Unicode is like ASCII, but instead of 128 characters, there are 1,114,112 possible code points, numbered 0 through 1,114,111. Way more! That lets us store characters like ñ (character #241) and 🥺 (character #129402). It’s a little more complex than this, but that’s the rough idea.

Here are a few examples from the big Unicode table:

Character | Unicode scalar
F | 70
ñ | 241
🥺 | 129402

(Note that some glyphs, like 👩🏾‍🌾, are made up of multiple characters and therefore have multiple scalars. For more, see this post.)

If you want to represent this full range—0 to ~1.1 million—you need 21 bits of data. How do people store these bits?
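
You can check both claims yourself. In Python, the built-in ord gives a character’s scalar value, and int.bit_length counts the bits needed for the largest scalar (a quick illustration, nothing UTF-21-specific):

    # Unicode scalar values via ord(), and the bit width needed
    # for the largest scalar, 0x10FFFF (1,114,111).
    print(ord("F"), ord("ñ"), ord("🥺"))  # 70 241 129402
    print((0x10FFFF).bit_length())        # 21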

Storing the numbers

Unicode has three official ways of storing these numbers: UTF-8, UTF-16, and UTF-32.

I think UTF-32 is the simplest. Each number is put into a 32-bit integer, or 4 bytes. This is called a “fixed-width” encoding. Because you only need 21 bits of data, more than a third of the space is wasted, but it’s simpler and faster for some operations.

In contrast, UTF-8 and UTF-16 are “variable-width” encodings. UTF-16 tries to fit characters into a single 16-bit number, and if it can’t, it expands to two. UTF-8 is conceptually similar, but it uses 8-bit numbers (bytes) as the smallest unit. (Fun fact: UTF-8 is a superset of ASCII.)

For example, for the character F, which has a scalar value of 70 (46 in hex):

Encoding | Bytes
UTF-32 | 00 00 00 46
UTF-16 | 00 46
UTF-8 | 46

And for the character 🥺, which has a scalar value of 129402 (01f97a in hex):

Encoding | Bytes
UTF-32 | 00 01 f9 7a
UTF-16 | d8 3e dd 7a
UTF-8 | f0 9f a5 ba
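
These tables are easy to reproduce. In Python, str.encode emits all three encodings; the “-be” (big-endian) variants match the byte order shown above and skip the byte-order mark:

    # Reproduce the tables above for both example characters.
    for ch in ("F", "🥺"):
        for enc in ("utf-32-be", "utf-16-be", "utf-8"):
            print(ch, enc, ch.encode(enc).hex(" "))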

Introducing UTF-21

UTF-8, UTF-16, and UTF-32 are widely used and a lot of smart people have worked on them.

Today, I’m introducing UTF-21, a toy character encoding made by me, a lone dingus.

To represent the full range of Unicode scalars, you need 21 bits. That’s precisely what UTF-21 does. Each scalar is represented by a 21-bit number, packed back-to-back with no space between.

For example, F has a scalar value of 70, which is encoded like this in binary:

 000000000000001000110

🥺, which has a scalar value of 129402, is encoded like this:

 000011111100101111010

Modern computers like to store data as bytes, not bits. Therefore, the end of the data is padded with zeroes until its length is a whole number of bytes. That means there will be between 0 and 7 bits of padding at the end of a UTF-21 data stream.

The string F🥺 would be encoded like this in binary:

 000000000000001000110 000011111100101111010 000000

The first 21 bits are for the F, the next 21 are for 🥺, and the last 6 are padding.
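
In case it helps to see this in code, here is a minimal sketch of an encoder and decoder (my own illustration, with made-up names encode_utf21 and decode_utf21; it skips validation such as rejecting surrogates):

    def encode_utf21(text: str) -> bytes:
        """Pack each scalar into 21 bits, then zero-pad to a whole byte."""
        bits = "".join(format(ord(ch), "021b") for ch in text)
        bits += "0" * (-len(bits) % 8)  # 0 to 7 bits of padding
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    def decode_utf21(data: bytes) -> str:
        """Read 21-bit chunks; the leftover partial chunk is padding."""
        bits = "".join(format(b, "08b") for b in data)
        return "".join(
            chr(int(bits[i:i + 21], 2))
            for i in range(0, len(bits) - 20, 21)
        )

    assert encode_utf21("F🥺").hex(" ") == "00 02 30 7e 5e 80"
    assert decode_utf21(encode_utf21("F🥺")) == "F🥺"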

How does it perform?

UTF-21 is just a toy project, but how does it stack up against the official UTFs?

In short, it’s always more compact than UTF-32 (21 bits per scalar instead of 32), but it’s usually worse than UTF-8 and UTF-16, which can spend as few as 8 or 16 bits on each character.
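
To put rough numbers on that, here’s a small hand-rolled comparison (Python again; the utf21_size helper is mine, and just computes ceil(21n/8) bytes for n scalars):

    import math

    def utf21_size(text: str) -> int:
        # n scalars take 21n bits, rounded up to whole bytes.
        return math.ceil(21 * len(text) / 8)

    for s in ("hello", "🥺🥺🥺"):
        print(s,
              len(s.encode("utf-8")),      # 5, 12
              len(s.encode("utf-16-be")),  # 10, 12
              len(s.encode("utf-32-be")),  # 20, 12
              utf21_size(s))               # 14, 8

So ASCII-heavy text loses badly to UTF-8 (14 bytes versus 5), while emoji-heavy text actually comes out ahead (8 bytes versus 12).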

Why did I do this?

For fun!

UTF-21 probably falls under the category of “useless stuff”. It’s not particularly efficient or good, but I learned a bunch about how Unicode works and had a lot of fun building it.

I hope this was equally fun and informative to read!

Thanks to Manuel Strehl for reviewing an early draft of this post.