Getting the UTF-32 bytes of JavaScript strings

This post assumes you understand UTF-32.

Recently, I wanted to get the UTF-32 bytes of a JavaScript string for a demo I was working on. I couldn’t find anyone else who had done this, so I thought I’d write this post.

My goal was to write a generator function that yielded each UTF-32 byte.

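First, I generated the string’s Unicode code points. Iterating over a JavaScript string with for...of yields one short string per code point (not per UTF-16 code unit), and calling codePointAt(0) on each of those strings gives the code point as a number.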

function* unicodeCodePoints(str) {
  for (const character of str) {
    const codePoint = character.codePointAt(0);
    yield codePoint;
  }
}

[...unicodeCodePoints("hi 🌍")];
// => [104, 105, 32, 127757]
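
To see why iterating by code point matters here, the following minimal sketch compares it with walking the string by index. The helper name utf16CodeUnits is hypothetical and exists only for this comparison; it is not part of the solution above.

```javascript
// Hypothetical helper, for comparison only: walks the string by index,
// which yields UTF-16 code units rather than Unicode code points.
function* utf16CodeUnits(str) {
  for (let i = 0; i < str.length; i++) {
    yield str.charCodeAt(i);
  }
}

// Indexing splits the emoji into a surrogate pair...
const earthCodeUnits = [...utf16CodeUnits("🌍")];
// => [55356, 57101], i.e. [0xd83c, 0xdf0d]

// ...while string iteration keeps it together as one code point.
const earthCodePoints = Array.from("🌍", (character) => character.codePointAt(0));
// => [127757], i.e. [0x1f30d]
```

Encoding the two surrogate halves separately would produce eight bytes that are not valid UTF-32, which is why the generator iterates with for...of.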

Now I needed to turn these into bytes. I used a little bit masking and shifting to split each code point, treated as a four-byte number, into four one-byte numbers:

function* utf32Bytes(str) {
  for (const character of str) {
    const codepoint = character.codePointAt(0);

    // Get the most significant byte.
    // For example, given 0x12345678, yield 0x12.
    yield (codepoint & 0xff000000) >> 24;

    // Get the next most significant byte, and so on.
    yield (codepoint & 0x00ff0000) >> 16;
    yield (codepoint & 0x0000ff00) >> 8;
    yield codepoint & 0x000000ff;
  }
}

[...utf32Bytes("hi 🌍")];
// => [0, 0, 0, 104, 0, 0, 0, 105, 0, 0, 0, 32, 0, 1, 243, 13]
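
As a worked example of the masking and shifting, here is a small sketch that splits a single code point and prints each byte in hexadecimal. The variable names are only for illustration.

```javascript
// Split one code point (U+1F30D, the globe emoji) into its four bytes.
const globeCodePoint = "🌍".codePointAt(0); // 0x0001f30d

const globeBytes = [
  (globeCodePoint & 0xff000000) >> 24, // most significant byte
  (globeCodePoint & 0x00ff0000) >> 16,
  (globeCodePoint & 0x0000ff00) >> 8,
  globeCodePoint & 0x000000ff, // least significant byte
];

// Show each byte as two hex digits so it lines up with 0x0001f30d.
const globeHex = globeBytes.map((byte) => byte.toString(16).padStart(2, "0"));
// => ["00", "01", "f3", "0d"]
```

Because Unicode code points never exceed 0x10FFFF, the most significant byte is always zero.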

And that’s it! I could now get the UTF-32 bytes of a JavaScript string.

I want the results as a buffer

My solution uses a generator. If you want the results as a Uint8Array, pass the generator to the Uint8Array constructor, which accepts any iterable:

new Uint8Array(utf32Bytes("hi 🌍"));
// => Uint8Array(16) [0, 0, 0, 104, 0, ...]
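
If you would rather skip the generator and write straight into a buffer, one possible alternative is to let DataView do the byte splitting. This is a sketch under assumptions, not the approach used above: the function name utf32BytesViaDataView and its littleEndian parameter are made up for illustration.

```javascript
// Hypothetical alternative: builds the whole buffer up front instead of
// yielding one byte at a time.
function utf32BytesViaDataView(str, littleEndian = false) {
  // Collect the code points first so the buffer size is known.
  const codePoints = Array.from(str, (character) => character.codePointAt(0));
  const view = new DataView(new ArrayBuffer(codePoints.length * 4));

  codePoints.forEach((codePoint, index) => {
    // setUint32 writes big-endian unless the third argument is true.
    view.setUint32(index * 4, codePoint, littleEndian);
  });

  return new Uint8Array(view.buffer);
}

const bufferBytes = utf32BytesViaDataView("hi 🌍");
// => Uint8Array(16) [0, 0, 0, 104, 0, 0, 0, 105, 0, 0, 0, 32, 0, 1, 243, 13]
```

The trade-off is that this version holds every code point in memory at once, whereas the generator produces bytes lazily.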

I want the little endian bytes

My solution yields big-endian results (UTF-32BE), not little-endian (UTF-32LE). If you want little-endian results, reverse the order of the yields:

function* utf32LeBytes(str) {
  for (const character of str) {
    const codepoint = character.codePointAt(0);

    // Get the least significant byte.
    // For example, given 0x12345678, yield 0x78.
    yield codepoint & 0x000000ff;

    // Get the next least significant byte, and so on.
    yield (codepoint & 0x0000ff00) >> 8;
    yield (codepoint & 0x00ff0000) >> 16;
    yield (codepoint & 0xff000000) >> 24;
  }
}

[...utf32LeBytes("hi 🌍")];
// => [104, 0, 0, 0, 105, 0, 0, 0, 32, 0, 0, 0, 13, 243, 1, 0]

You can also prepend the byte order mark (U+FEFF) to your result by adding four yields at the beginning: 0x00, 0x00, 0xfe, 0xff for UTF-32BE, or 0xff, 0xfe, 0x00, 0x00 for UTF-32LE.
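
Here is a minimal sketch of what that could look like for the big-endian version. The function name utf32BeBytesWithBom is hypothetical; the byte order mark is the code point U+FEFF, encoded the same way as any other character.

```javascript
// Hypothetical variant of the big-endian generator that emits a byte
// order mark before the encoded string.
function* utf32BeBytesWithBom(str) {
  // U+FEFF encoded as UTF-32BE.
  yield* [0x00, 0x00, 0xfe, 0xff];

  for (const character of str) {
    const codePoint = character.codePointAt(0);
    yield (codePoint & 0xff000000) >> 24;
    yield (codePoint & 0x00ff0000) >> 16;
    yield (codePoint & 0x0000ff00) >> 8;
    yield codePoint & 0x000000ff;
  }
}

const bytesWithBom = [...utf32BeBytesWithBom("hi")];
// => [0, 0, 254, 255, 0, 0, 0, 104, 0, 0, 0, 105]
```

For little-endian output, the mark’s bytes would be reversed too: 0xff, 0xfe, 0x00, 0x00.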

I want something else

I also needed to get the UTF-8 and UTF-16 bytes for my little demo, so I wrote up how to do those too: