The Nuances of Base64 Encoding Strings in JavaScript

The article explains that JavaScript's native btoa() and atob() functions only handle characters that fit in a single byte, so correctly base64-encoding Unicode strings requires converting them to UTF-8 bytes (a Uint8Array) with TextEncoder and decoding them with TextDecoder. It also recommends checking for lone surrogates via isWellFormed() or encodeURIComponent() to avoid silent data loss.

Sohu Tech Products

Dec 6, 2023

Base64 encoding and decoding is a common way to convert binary content into text suitable for the web. It is typically used in data URLs, such as for embedded images.

What happens when you apply base64 encoding and decoding to strings in JavaScript? This article explores these details and common pitfalls to avoid.

The btoa() and atob() Functions

The core functions for base64 encoding and decoding in JavaScript are btoa() and atob(). The btoa() function converts a string to a base64-encoded string, while atob() decodes it.

Unfortunately, as noted in the MDN documentation, this only works for strings in which every character can be represented with a single byte, that is, code points 0 through 255 (ASCII plus the Latin-1 range). In other words, this does not work for arbitrary Unicode text.
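A quick check of this limit, as a minimal sketch that assumes btoa() and atob() are available as globals (browsers and recent Node.js versions). The sample strings and variable names are illustrative only.

```javascript
// ASCII round-trips cleanly.
const asciiEncoded = btoa('hello');
const asciiDecoded = atob(asciiEncoded);
console.log(asciiEncoded, asciiDecoded); // aGVsbG8= hello

// U+00E9 (é) is not ASCII, but its code point is below 256, so btoa() accepts it.
const latin1Encoded = btoa('\u00E9');
console.log(latin1Encoded); // 6Q==

// U+26F3 (⛳) has a code point above 255, so btoa() throws.
let btoaThrew = false;
try {
  btoa('hello\u26F3');
} catch (error) {
  btoaThrew = true;
  console.log(`btoa() failed: ${error.name}`);
}
```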

Unicode and Strings in JavaScript

Unicode is the current global character encoding standard, assigning numbers to specific characters for use in computer systems. JavaScript handles strings using UTF-16, which breaks functions like btoa() that assume each character in the string maps to a single byte.

Unicode has two common methods for converting code points to byte sequences that computers can consistently interpret: UTF-8 and UTF-16. In UTF-8, a code point takes one to four bytes (8 bits each). In UTF-16, a code point takes one 16-bit code unit (two bytes) or, for code points above U+FFFF, two code units (four bytes).
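To make the size difference concrete, the sketch below counts UTF-8 bytes and UTF-16 code units for four sample characters. The samples and variable names are chosen for illustration; TextEncoder always produces UTF-8.

```javascript
const utf8Encoder = new TextEncoder();

// Each sample: [label, string]. Escapes keep the code points explicit.
const samples = [
  ['U+0061 a', '\u0061'],
  ['U+00E9 é', '\u00E9'],
  ['U+26F3 ⛳', '\u26F3'],
  ['U+1F9C0 🧀', '\u{1F9C0}'],
];

const sizes = samples.map(([label, text]) => ({
  label,
  utf8Bytes: utf8Encoder.encode(text).length,
  utf16CodeUnits: text.length, // String length counts UTF-16 code units
}));

for (const { label, utf8Bytes, utf16CodeUnits } of sizes) {
  console.log(`${label}: ${utf8Bytes} UTF-8 byte(s), ${utf16CodeUnits} UTF-16 code unit(s)`);
}
// UTF-8 bytes:        1, 2, 3, 4
// UTF-16 code units:  1, 1, 1, 2
```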

btoa() and atob() with Unicode

Calling btoa() on a string such as 'hello⛳❤️🧀' throws an InvalidCharacterError, because the string contains UTF-16 code units whose values do not fit in a single byte. MDN provides useful example code to solve this "Unicode problem":

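// Decode base64 into a binary string, then read each character as one byte.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// Build a binary string with one character per byte, then base64-encode it.
// Array.from() is used instead of spreading the bytes into String.fromCodePoint(),
// which can exceed the engine's argument limit for large inputs.
function bytesToBase64(bytes) {
  const binString = Array.from(bytes, (byte) => String.fromCodePoint(byte)).join('');
  return btoa(binString);
}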

const validUTF16String = 'hello⛳❤️🧀';

const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
console.log(`Encoded string: [${validUTF16StringEncoded}]`);

const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
console.log(`Decoded string: [${validUTF16StringDecoded}]`);

The encoding process works as follows:

Use the TextEncoder interface to convert the UTF-16 encoded JavaScript string to a UTF-8 encoded byte stream via TextEncoder.encode().

This returns a Uint8Array, a less commonly used data type in JavaScript and a subclass of TypedArray.

Pass this Uint8Array to the bytesToBase64() function, which uses String.fromCodePoint() to treat each byte in the Uint8Array as a code point and create a string from it.

Use btoa() to base64 encode this string.

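The decoding process runs the same steps in reverse: base64ToBytes() uses atob() to recover the binary string and reads each character back into a Uint8Array, and TextDecoder.decode() then interprets those bytes as UTF-8 to produce a JavaScript string.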
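The steps above can be traced on a single character. This is a sketch for illustration, with the helper logic inlined and variable names invented here; it uses é (U+00E9) because its UTF-8 form is two bytes.

```javascript
const traceInput = '\u00E9'; // é

// Step 1: UTF-16 string -> UTF-8 bytes (0xC3 0xA9, i.e. 195 and 169).
const traceBytes = new TextEncoder().encode(traceInput);

// Step 2: one character per byte (a "binary string").
const traceBinString = Array.from(traceBytes, (byte) => String.fromCodePoint(byte)).join('');
console.log(traceBinString.length); // 2

// Step 3: base64-encode the binary string.
const traceBase64 = btoa(traceBinString);
console.log(traceBase64); // w6k=

// Decoding: base64 -> binary string -> bytes -> UTF-16 string.
const traceDecodedBytes = Uint8Array.from(atob(traceBase64), (ch) => ch.codePointAt(0));
const traceOutput = new TextDecoder().decode(traceDecodedBytes);
console.log(traceOutput === traceInput); // true
```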

Cases of Silent Failure

Using the same code but with a different string containing a lone surrogate:

const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

const partiallyInvalidUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(partiallyInvalidUTF16String));
console.log(`Encoded string: [${partiallyInvalidUTF16StringEncoded}]`);

const partiallyInvalidUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(partiallyInvalidUTF16StringEncoded));
console.log(`Decoded string: [${partiallyInvalidUTF16StringDecoded}]`);

The decoded string shows a replacement character (�) instead of the original lone surrogate. It didn't fail or throw an error, but the input and output data have been silently changed.

String Mutation in JavaScript APIs

UTF-16 has a concept called surrogate pairs. For code points larger than 65535 (the maximum value for a 16-bit number), UTF-16 uses two 16-bit code units called surrogates. A lone surrogate occurs when only one half of a surrogate pair is present.
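As an illustration, the sketch below inspects the two surrogates that make up 🧀 (U+1F9C0) and shows how slicing between them produces a lone surrogate. The variable names are illustrative.

```javascript
const cheese = '\u{1F9C0}'; // 🧀, a code point above U+FFFF

console.log(cheese.length); // 2 (two UTF-16 code units)
const highSurrogate = cheese.charCodeAt(0).toString(16);
const lowSurrogate = cheese.charCodeAt(1).toString(16);
console.log(highSurrogate, lowSurrogate); // d83e ddc0
console.log(cheese.codePointAt(0).toString(16)); // 1f9c0

// Slicing between the two code units leaves a lone high surrogate.
const loneHigh = cheese.slice(0, 1);
const loneCode = loneHigh.charCodeAt(0);
const isSurrogate = loneCode >= 0xd800 && loneCode <= 0xdfff;
console.log(loneHigh.length, isSurrogate); // 1 true
```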

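In JavaScript, some APIs work despite lone surrogates while others fail. TextEncoder and TextDecoder both default to replacing malformed data with the replacement character � (U+FFFD, written '\uFFFD' in JavaScript) rather than throwing. In the example above, the substitution already happens in TextEncoder.encode(), which converts the lone surrogate to U+FFFD before producing UTF-8 bytes. TextDecoder applies the same replacement to malformed bytes unless it is constructed with the fatal option.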
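The sketch below shows where the replacement character enters the round trip. Note that the lone surrogate is already replaced when TextEncoder.encode() produces the bytes, so the decoder only ever sees valid UTF-8 for U+FFFD. The input string and variable names are illustrative.

```javascript
const loneSurrogateInput = 'hi\uDE75'; // ends with an unpaired low surrogate

// TextEncoder.encode() substitutes U+FFFD for the lone surrogate before encoding.
const encodedBytes = new TextEncoder().encode(loneSurrogateInput);
const lastThree = Array.from(encodedBytes.slice(-3), (b) => b.toString(16)).join(' ');
console.log(lastThree); // ef bf bd (UTF-8 for U+FFFD)

// The default TextDecoder decodes those bytes without complaint.
const lossyOutput = new TextDecoder().decode(encodedBytes);
console.log(lossyOutput === loneSurrogateInput); // false
console.log(lossyOutput.endsWith('\uFFFD')); // true

// With { fatal: true }, malformed bytes throw instead of being replaced.
const strictDecoder = new TextDecoder('utf-8', { fatal: true });
let strictThrew = false;
try {
  strictDecoder.decode(new Uint8Array([0xff])); // 0xFF never appears in valid UTF-8
} catch (error) {
  strictThrew = true;
}
console.log(strictThrew); // true
```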

Checking for Well-Formed Strings

Recent browser versions provide a string method for detecting lone surrogates: isWellFormed(). You can also achieve similar results using encodeURIComponent(), which throws a URIError if the string contains lone surrogates.

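function isWellFormed(str) {
  // Prefer the native method where the engine provides it.
  if (typeof str.isWellFormed === 'function') {
    return str.isWellFormed();
  }
  // Fallback: encodeURIComponent() throws a URIError on lone surrogates.
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false;
  }
}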
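To see how the encodeURIComponent() fallback behaves, the sketch below runs it over a few sample strings and, where the native method exists, checks that both approaches agree. The helper name wellFormedViaURIComponent and the samples are hypothetical, chosen for this example.

```javascript
// Hypothetical helper: the fallback path on its own.
function wellFormedViaURIComponent(str) {
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false; // URIError: the string contains a lone surrogate
  }
}

const wellFormedSamples = ['hello', 'hello\u{1F9C0}', 'hello\uDE75', '\uD83E'];
const fallbackResults = wellFormedSamples.map(wellFormedViaURIComponent);
console.log(fallbackResults.join(', ')); // true, true, false, false

// Where the native method exists, it should agree with the fallback.
let nativeAgrees = true;
if (typeof ''.isWellFormed === 'function') {
  const nativeResults = wellFormedSamples.map((s) => s.isWellFormed());
  nativeAgrees = nativeResults.every((value, i) => value === fallbackResults[i]);
}
console.log(nativeAgrees); // true
```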

Putting It All Together

Now that you know how to handle Unicode and lone surrogates, you can put everything together to create code that handles all cases without performing silent text replacement.

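// Decode base64 into a binary string, then read each character as one byte.
function base64ToBytes(base64) {
  const binString = atob(base64);
  return Uint8Array.from(binString, (m) => m.codePointAt(0));
}

// Build a binary string with one character per byte, then base64-encode it.
// Array.from() avoids spreading a large array into function arguments.
function bytesToBase64(bytes) {
  const binString = Array.from(bytes, (byte) => String.fromCodePoint(byte)).join('');
  return btoa(binString);
}

function isWellFormed(str) {
  // Prefer the native method where the engine provides it.
  if (typeof str.isWellFormed === 'function') {
    return str.isWellFormed();
  }
  // Fallback: encodeURIComponent() throws a URIError on lone surrogates.
  try {
    encodeURIComponent(str);
    return true;
  } catch (error) {
    return false;
  }
}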

const validUTF16String = 'hello⛳❤️🧀';
const partiallyInvalidUTF16String = 'hello⛳❤️🧀\uDE75';

if (isWellFormed(validUTF16String)) {
  const validUTF16StringEncoded = bytesToBase64(new TextEncoder().encode(validUTF16String));
  console.log(`Encoded string: [${validUTF16StringEncoded}]`);

  const validUTF16StringDecoded = new TextDecoder().decode(base64ToBytes(validUTF16StringEncoded));
  console.log(`Decoded string: [${validUTF16StringDecoded}]`);
} else {
  // Ignore
}

if (isWellFormed(partiallyInvalidUTF16String)) {
  // Ignore
} else {
  console.log(`Cannot process a string with lone surrogates: [${partiallyInvalidUTF16String}]`);
}

This code can be improved in many ways, such as packaging the check as a polyfill or constructing TextDecoder with the fatal option so that malformed bytes throw instead of being silently replaced. With this knowledge and code, you can explicitly decide how to handle malformed strings: reject the data, explicitly enable replacement, or throw errors for later analysis.
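One way to take the "throw errors" route is sketched below. The function names (hasLoneSurrogates, encodeBase64Strict, decodeBase64Strict) are hypothetical and not part of any standard API; the sketch assumes global btoa(), atob(), TextEncoder, and TextDecoder.

```javascript
// Hypothetical helper: true if the string contains a lone surrogate.
function hasLoneSurrogates(str) {
  if (typeof str.isWellFormed === 'function') return !str.isWellFormed();
  try {
    encodeURIComponent(str);
    return false;
  } catch (error) {
    return true;
  }
}

// Hypothetical strict encoder: rejects malformed input instead of replacing it.
function encodeBase64Strict(str) {
  if (hasLoneSurrogates(str)) {
    throw new TypeError('Cannot base64-encode a string with lone surrogates');
  }
  const bytes = new TextEncoder().encode(str);
  return btoa(Array.from(bytes, (byte) => String.fromCodePoint(byte)).join(''));
}

// Hypothetical strict decoder: { fatal: true } makes invalid UTF-8 throw.
function decodeBase64Strict(base64) {
  const bytes = Uint8Array.from(atob(base64), (ch) => ch.codePointAt(0));
  return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
}

const strictInput = 'hello⛳❤️🧀';
const strictRoundTrip = decodeBase64Strict(encodeBase64Strict(strictInput));
console.log(strictRoundTrip === strictInput); // true

let encodeRejected = false;
try { encodeBase64Strict('hello\uDE75'); } catch (error) { encodeRejected = true; }

let decodeRejected = false;
try { decodeBase64Strict('/w=='); } catch (error) { decodeRejected = true; } // "/w==" is the single byte 0xFF
console.log(encodeRejected, decodeRejected); // true true
```

Throwing on both sides keeps the failure visible at the point where the data is malformed, so callers can decide whether to reject the input or fall back to explicit replacement.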
