Introducing UTF-21, a toy character encoding
In short: I created UTF-21, an impractical alternative to character encodings like UTF-8.
Quick crash course: character encoding & Unicode
Before you can understand my horrible creation, you need to understand a little about Unicode. You can skip this if you want.
Each character has a number
Character encoding is the process of converting characters to numbers and back, typically for digital storage and transmission.
You’ve probably heard of ASCII, which maps 128 characters to numbers. For example, W
is number 87 and number 36 is $
.
As you might expect, there are more than 128 characters in the world. Characters like ñ
and 🥺
can’t be represented as ASCII.
Unicode is like ASCII, but instead of 128 characters, there are 1,114,111 characters. Way more! That lets us store characters like ñ
(character #241) and 🥺
(character #129402). It’s a little more complex than this, but that’s the rough idea.
Here are a few examples from the big Unicode table:
Character | Unicode scalar |
---|---|
F | 70 |
ñ | 241 |
🥺 | 129402 |
(Note that some glyphs, like 👩🏾🌾
, are made up of multiple characters and therefore have multiple scalars. For more, see this post.)
If you want to represent this full range—0 to ~1.1 million—you need 21 bits of data. How do people store these bits?
Storing the numbers
Unicode has three official ways of storing these numbers: UTF-8, UTF-16, and UTF-32.
I think UTF-32 is the simplest. Each number is put into a 32-bit integer, or 4 bytes. This is called a “fixed-width” encoding. Because you only need 21 bits of data, more than a third of the space is wasted, but it’s simpler and faster for some operations.
In constrast, UTF-8 and UTF-16 are “variable-width” encodings. UTF-16 tries to fit characters into a single 16-bit number, and if it can’t, it expands to two. UTF-8 is conceptually similar, but it uses 8-bit numbers (bytes) as the smallest unit. (Fun fact: UTF-8 is a superset of ASCII.)
For example, for the character F
, which has a scalar value of 70
(46
in hex):
Encoding | Bytes |
---|---|
UTF-32 | 00 00 00 46 |
UTF-16 | 00 46 |
UTF-8 | 46 |
And for the character 🥺
, which has a scalar value of 129402
(01f97a
in hex):
Encoding | Bytes |
---|---|
UTF-32 | 00 01 f9 7a |
UTF-16 | d8 3e dd 7a |
UTF-8 | f0 9f a5 ba |
Introducing UTF-21
UTF-8, UTF-16, and UTF-32 are widely used and a lot of smart people have worked on them.
Today, I’m introducing UTF-21, a toy character encoding made by me, a lone dingus.
To represent the full range of Unicode scalars, you need 21 bits. That’s precisely what UTF-21 does. Each scalar is represented by a 21-bit number, packed back-to-back with no space between.
For example, F
has a scalar value of 70
, which is encoded like this in binary:
000000000000001000110
🥺
, which has a scalar value of 129402
, is encoded like this:
000011111100101111010
Modern computers like to store data as bytes, not bits. Therefore, the end of the data is padded with zeroes until it fits in a byte. That means there will be between 0 and 7 bits of padding at the end of a UTF-21 data stream.
The string F🥺
would be encoded like this in binary:
000000000000001000110 000011111100101111010 000000
The first 21 bits are for the F
, the next 21 are for 🥺
, and the last 6 are padding.
How does it perform?
UTF-21 is just a toy project, but how does it stack up against the official UTFs?
UTF-32: UTF-21 is always more efficient than UTF-32 because it’s 11 bits smaller per character. You’d expect UTF-21 strings to be roughly 66% the size of their UTF-32 counterparts.
For example,
Hello world!
is 32 bytes in UTF-21 and 48 bytes in UTF-32.UTF-16: UTF-21 is less efficient than UTF-16 when you don’t need two code units (surrogate pairs) and more efficient when you do. It depends on your use case, but I suspect most strings will be more efficient in UTF-16.
For example, UTF-16 wins for the string
foo bar
: 14 bytes of UTF-16 and 19 bytes of UTF-21. But it loses for the string🌍🌏🌎
: 12 bytes of UTF-16 and only 8 for UTF-21.UTF-8: UTF-21 is less efficient than UTF-8 unless you need three or four bytes for the character, and then UTF-21 is more efficient.
For example, UTF-8 wins for the string
foo bar
—7 bytes versus 19—but loses for the string안녕하세요
—14 bytes to 15.
In short, it’s more efficient than UTF-32 but probably worse than the others in most cases.
Why did I do this?
For fun!
UTF-21 probably goes under the category of “useless stuff”. It’s not particularly efficient or good, but I learned a bunch about how Unicode works and had a lot of fun building it.
I hope this was equally fun and informative to read!
Thanks to Manuel Strehl for reviewing an early draft of this post.