Getting the UTF-32 bytes of JavaScript strings
This post assumes you understand UTF-32.
Recently, I wanted to get the UTF-32 bytes of a JavaScript string for a demo I was working on. I couldn’t find anyone else who had done this, so I thought I’d write this post.
My goal was to write a generator function that yielded each UTF-32 byte.
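I started by generating the string’s Unicode code points. Iterating over a JavaScript string with for...of yields one single-character string per code point rather than per UTF-16 code unit, and codePointAt(0) turns each of those into a number.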
function* unicodeCodePoints(str) {
  for (const character of str) {
    const codePoint = character.codePointAt(0);
    yield codePoint;
  }
}
[...unicodeCodePoints("hi 🌍")];
// => [104, 105, 32, 127757]
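To see why for...of matters here, the following minimal sketch compares it with index-based access on the same string. The variable names are only illustrative. Indexing walks UTF-16 code units, so the emoji shows up as two surrogate halves, while for...of keeps it as one code point.

```javascript
const str = "hi 🌍";

// Index-based access walks UTF-16 code units, so the emoji
// is split into a high and a low surrogate.
const codeUnits = [];
for (let i = 0; i < str.length; i++) {
  codeUnits.push(str.charCodeAt(i));
}
// codeUnits => [104, 105, 32, 55356, 57101]

// for...of walks code points, so the emoji stays whole.
const codePoints = [];
for (const character of str) {
  codePoints.push(character.codePointAt(0));
}
// codePoints => [104, 105, 32, 127757]
```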
Now I needed to turn these into bytes. I did a little bit of masking and shifting to split each code point, treated as a four-byte number, into four one-byte numbers:
function* utf32Bytes(str) {
  for (const character of str) {
    const codepoint = character.codePointAt(0);
    // Get the most significant byte.
    // For example, given 0x12345678, yield 0x12.
    yield (codepoint & 0xff000000) >> 24;
    // Get the next most significant byte, and so on.
    yield (codepoint & 0x00ff0000) >> 16;
    yield (codepoint & 0x0000ff00) >> 8;
    yield codepoint & 0x000000ff;
  }
}
[...utf32Bytes("hi 🌍")];
// => [0, 0, 0, 104, 0, 0, 0, 105, 0, 0, 0, 32, 0, 1, 243, 13]
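As a worked example of the masking and shifting, here is a small sketch that splits just the emoji's code point, 0x1F30D, by hand. The variable names are illustrative only. The result matches the last four bytes of the output above.

```javascript
const codePoint = 0x1f30d; // "🌍".codePointAt(0), decimal 127757

// Each mask keeps the bits of one byte; the shift moves them
// down so they form a value between 0 and 255.
const byte0 = (codePoint & 0xff000000) >> 24; // 0x00
const byte1 = (codePoint & 0x00ff0000) >> 16; // 0x01
const byte2 = (codePoint & 0x0000ff00) >> 8; // 0xf3 (243)
const byte3 = codePoint & 0x000000ff; // 0x0d (13)

const bytes = [byte0, byte1, byte2, byte3];
// bytes => [0, 1, 243, 13]
```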
And that’s it! I could now get the UTF-32 bytes of a JavaScript string.
I want the results as a buffer
My solution uses a generator. If you want the results as a Uint8Array, simply pass the result to the Uint8Array constructor:
new Uint8Array(utf32Bytes("hi 🌍"));
// => Uint8Array(16) [0, 0, 0, 104, 0, ...]
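Once the bytes are in a Uint8Array, one way to sanity-check them is to decode them again. The sketch below repeats the generator from earlier so it runs on its own, and adds a hypothetical helper, decodeUtf32Be, which is not part of the original demo. It assumes the input length is a multiple of four and that every value is a valid code point.

```javascript
function* utf32Bytes(str) {
  for (const character of str) {
    const codepoint = character.codePointAt(0);
    yield (codepoint & 0xff000000) >> 24;
    yield (codepoint & 0x00ff0000) >> 16;
    yield (codepoint & 0x0000ff00) >> 8;
    yield codepoint & 0x000000ff;
  }
}

// Hypothetical helper: turn big-endian UTF-32 bytes back into a string.
function decodeUtf32Be(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  let result = "";
  for (let offset = 0; offset < view.byteLength; offset += 4) {
    // getUint32 reads big-endian unless told otherwise.
    result += String.fromCodePoint(view.getUint32(offset));
  }
  return result;
}

const encoded = new Uint8Array(utf32Bytes("hi 🌍"));
const decoded = decodeUtf32Be(encoded);
// decoded => "hi 🌍"
```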
I want the little endian bytes
My solution yields big-endian results (UTF-32BE), not little-endian (UTF-32LE). If you want little-endian results, you can just reverse the order of the yields.
function* utf32LeBytes(str) {
  for (const character of str) {
    const codepoint = character.codePointAt(0);
    // Get the least significant byte.
    // For example, given 0x12345678, yield 0x78.
    yield codepoint & 0x000000ff;
    // Get the next least significant byte, and so on.
    yield (codepoint & 0x0000ff00) >> 8;
    yield (codepoint & 0x00ff0000) >> 16;
    yield (codepoint & 0xff000000) >> 24;
  }
}
[...utf32LeBytes("hi 🌍")];
// => [104, 0, 0, 0, 105, 0, 0, 0, 32, 0, 0, 0, 13, 243, 1, 0]
You can also attach the byte order mark (U+FEFF) to your result by adding a few yields at the beginning.
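A minimal sketch of that idea follows. The function name utf32BytesWithBom and its littleEndian parameter are assumptions made for illustration, not something from the original demo. The byte order mark is U+FEFF encoded like any other code point, so it comes out as 00 00 FE FF in big-endian order and FF FE 00 00 in little-endian order.

```javascript
// Hypothetical variant: yields a byte order mark, then the UTF-32 bytes.
function* utf32BytesWithBom(str, littleEndian = false) {
  // U+FEFF in the chosen byte order.
  yield* littleEndian ? [0xff, 0xfe, 0x00, 0x00] : [0x00, 0x00, 0xfe, 0xff];

  for (const character of str) {
    const codepoint = character.codePointAt(0);
    // Bytes from most significant to least significant.
    const bytes = [
      (codepoint & 0xff000000) >> 24,
      (codepoint & 0x00ff0000) >> 16,
      (codepoint & 0x0000ff00) >> 8,
      codepoint & 0x000000ff,
    ];
    // Little-endian is the same four bytes in reverse order.
    yield* littleEndian ? bytes.reverse() : bytes;
  }
}

[...utf32BytesWithBom("hi")];
// => [0, 0, 254, 255, 0, 0, 0, 104, 0, 0, 0, 105]
[...utf32BytesWithBom("hi", true)];
// => [255, 254, 0, 0, 104, 0, 0, 0, 105, 0, 0, 0]
```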
I want something else
I also needed to get the UTF-8 and UTF-16 bytes for my little demo, so I wrote up how to do those too: