Parse 16‑Digit Timestamps Up to 700× Faster Than std::stringstream

This article explores why standard string‑to‑integer conversions become performance bottlenecks in high‑concurrency scenarios and presents a series of increasingly optimized C++ solutions—from native library calls to loop‑unrolled, byteswap, divide‑and‑conquer, and SIMD tricks—demonstrating dramatic speed gains backed by Google Benchmark results.

Programmer DD

Sep 8, 2021

In performance‑critical settings such as programming contests, converting fixed‑length (16‑digit) timestamp strings to integers often becomes a hotspot, and the usual tools (Integer.valueOf / Long.valueOf in Java, std::stringstream in C++) are far from optimal.

Problem Statement

Given a 16‑character numeric string, we need the fastest possible parsing method.

timestamp
1585201087123567
1585201087123585
1585201087123621

The benchmark baseline (named BM_mov) does no parsing at all: it simply loads the constant into a register, which gives a lower bound to compare every parser against.

Native Solutions

std::atoll

std::stringstream

std::from_chars (C++17, <charconv>)

boost::spirit::qi

Google Benchmark results show stringstream is ~391× slower than the baseline, while charconv and boost::spirit perform much better.
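For reference, the standard-library options listed above might be called as in the following minimal sketch. The wrapper names (parse_atoll, parse_stringstream, parse_from_chars) are hypothetical and not taken from the benchmark code, and boost::spirit::qi is left out because it needs Boost.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>

// Hypothetical wrapper names; each turns a decimal string into a 64-bit value.

// std::atoll needs a null-terminated string, so a std::string is taken here.
inline std::uint64_t parse_atoll(const std::string& s) {
  return static_cast<std::uint64_t>(std::atoll(s.c_str()));
}

// std::stringstream builds a stream object on every call before extracting.
inline std::uint64_t parse_stringstream(const std::string& s) {
  std::stringstream in(s);
  std::uint64_t value = 0;
  in >> value;
  return value;
}

// std::from_chars works on a character range and reports errors through std::errc.
inline std::uint64_t parse_from_chars(std::string_view s) {
  std::uint64_t value = 0;
  const auto res = std::from_chars(s.data(), s.data() + s.size(), value);
  assert(res.ec == std::errc{} && res.ptr == s.data() + s.size());
  (void)res;  // silences the unused warning when assertions are compiled out
  return value;
}
```

All three return the same value for a valid 16‑digit input; they differ in how much work they do around the actual digit loop.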

Naive Loop

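#include <cstdint>
#include <string_view>

// Precondition (unchecked): s contains only ASCII digits.
inline std::uint64_t parse_naive(std::string_view s) noexcept {
  std::uint64_t result = 0;
  for (char digit : s) {
    result *= 10;            // shift the digits seen so far one decimal place left
    result += digit - '0';   // append the new digit
  }
  return result;
}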

Despite its simplicity, this approach can beat the standard library when input validation is unnecessary.
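To make concrete what "input validation" would add, here is a minimal sketch of a checked variant. The name parse_naive_checked and the choice of std::optional as the error channel are assumptions for illustration, not part of the benchmarked code.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical checked variant: returns std::nullopt unless the input is
// exactly 16 ASCII digits.
inline std::optional<std::uint64_t> parse_naive_checked(std::string_view s) noexcept {
  if (s.size() != 16) return std::nullopt;
  std::uint64_t result = 0;
  for (char c : s) {
    // Unsigned subtraction: characters below '0' wrap around to large values,
    // so a single comparison rejects everything outside '0'..'9'.
    const unsigned digit = static_cast<unsigned char>(c) - static_cast<unsigned>('0');
    if (digit > 9) return std::nullopt;
    result = result * 10 + digit;
  }
  return result;
}
```

The extra length check and the per‑character branch are the kind of work the unchecked loop skips.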

Loop‑Unrolled Solution

inline std::uint64_t parse_unrolled(std::string_view s) noexcept {
  std::uint64_t result = 0;
  result += (s[0] - '0') * 1000000000000000ULL;
  result += (s[1] - '0') * 100000000000000ULL;
  // ... omitted for brevity ...
  result += (s[15] - '0');
  return result;
}

Removing the loop reduces overhead and yields a noticeable speedup.
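The same unrolling can also be left to the compiler. The sketch below is an alternative formulation using a C++17 fold expression, not the article's hand‑written version; pow10_v, parse_unrolled_impl and parse_unrolled_fold are hypothetical names, and whether it performs the same as the manual version would need to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>

// Hypothetical helper: 10^E as a compile-time constant.
template <std::size_t E>
inline constexpr std::uint64_t pow10_v = 10 * pow10_v<E - 1>;
template <>
inline constexpr std::uint64_t pow10_v<0> = 1;

// The fold expression expands to one "digit * power of ten" term per index,
// so no loop is left in the generated expression.
template <std::size_t... I>
constexpr std::uint64_t parse_unrolled_impl(const char* s,
                                            std::index_sequence<I...>) noexcept {
  return (... + (static_cast<std::uint64_t>(s[I] - '0') *
                 pow10_v<sizeof...(I) - 1 - I>));
}

// Precondition (unchecked): s holds exactly 16 ASCII digits.
constexpr std::uint64_t parse_unrolled_fold(std::string_view s) noexcept {
  return parse_unrolled_impl(s.data(), std::make_index_sequence<16>{});
}
```

Because everything is constexpr, a literal timestamp can even be checked at compile time with static_assert.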

Byteswap Technique

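By copying eight characters of the string into a 64‑bit integer with std::memcpy, subtracting the ASCII zero constant from every byte, and applying __builtin_bswap64 so that the first digit lands in the most significant byte on a little‑endian machine, we achieve the fastest native implementation so far.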

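#include <cstdint>
#include <cstring>

// Primary template; only the specialisation below is defined.
template <typename T> inline T get_zeros_string() noexcept;

// Returns eight ASCII '0' bytes packed into one integer (0x3030303030303030).
template<> inline std::uint64_t get_zeros_string<std::uint64_t>() noexcept {
  std::uint64_t result = 0;
  constexpr char zeros[] = "00000000";
  std::memcpy(&result, zeros, sizeof(result));
  return result;
}

inline std::uint64_t parse_8_chars(const char* string) noexcept {
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));  // safe unaligned load of 8 characters
  // Each byte becomes a digit value 0..9; the swap puts the first digit in the top byte.
  chunk = __builtin_bswap64(chunk - get_zeros_string<std::uint64_t>());
  // ...
  return chunk;
}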
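A small worked example may help show what the load, subtract and byteswap steps leave in the register. This sketch assumes a little‑endian target and a GCC/Clang compiler (for __builtin_bswap64); the helper name digits_as_bytes is hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper name. Assumes little-endian and 8 ASCII digits.
inline std::uint64_t digits_as_bytes(const char* string) noexcept {
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));  // "12345678" loads as 0x3837363534333231
  chunk -= 0x3030303030303030ULL;              // minus '0' per byte:  0x0807060504030201
  return __builtin_bswap64(chunk);             // reversed byte order: 0x0102030405060708
}
```

After these steps the register holds one digit per byte in reading order, but it is not yet the decimal value; the digits still have to be combined.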

Divide‑and‑Conquer (Bitmask) Solution

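// Assumes a little-endian target and exactly 8 ASCII digits at `string`.
inline std::uint64_t parse_8_chars(const char* string) noexcept {
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));
  // 1‑byte mask: combine 4 pairs of single digits into four 2‑digit values
  // (the 0x0f mask also strips the ASCII '0' offset)
  std::uint64_t lower = (chunk & 0x0f000f000f000f00) >> 8;
  std::uint64_t upper = (chunk & 0x000f000f000f000f) * 10;
  chunk = lower + upper;
  // 2‑byte mask: combine 2 pairs of 2‑digit values into two 4‑digit values
  lower = (chunk & 0x00ff000000ff0000) >> 16;
  upper = (chunk & 0x000000ff000000ff) * 100;
  chunk = lower + upper;
  // 4‑byte mask: combine the pair of 4‑digit values into one 8‑digit value
  lower = (chunk & 0x0000ffff00000000) >> 32;
  upper = (chunk & 0x000000000000ffff) * 10000;
  chunk = lower + upper;
  return chunk;
}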

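Each step combines adjacent groups of digits in parallel inside a single register, so eight digits need only three rounds of mask, shift, multiply, and add. The number of rounds grows as O(log n) in the digit count rather than O(n).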
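To see the three rounds at work, the following sketch records the register after each step. It repeats the arithmetic of parse_8_chars under the same assumptions (little‑endian, 8 ASCII digits); MaskTrace, trace_8_chars and lane are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical trace type: the value of the register after each mask step.
struct MaskTrace {
  std::uint64_t pairs;   // four 2-digit values, one per 16-bit lane
  std::uint64_t quads;   // two 4-digit values, one per 32-bit lane
  std::uint64_t result;  // the final 8-digit value
};

inline MaskTrace trace_8_chars(const char* string) noexcept {
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));
  MaskTrace t{};
  chunk = ((chunk & 0x0f000f000f000f00) >> 8) + (chunk & 0x000f000f000f000f) * 10;
  t.pairs = chunk;
  chunk = ((chunk & 0x00ff000000ff0000) >> 16) + (chunk & 0x000000ff000000ff) * 100;
  t.quads = chunk;
  chunk = ((chunk & 0x0000ffff00000000) >> 32) + (chunk & 0x000000000000ffff) * 10000;
  t.result = chunk;
  return t;
}

// Extracts lane `index` (counted from the least significant end) of width `bits`.
inline std::uint64_t lane(std::uint64_t value, unsigned bits, unsigned index) noexcept {
  return (value >> (bits * index)) & ((std::uint64_t{1} << bits) - 1);
}
```

For the input "12345678" the 16‑bit lanes after the first step hold 12, 34, 56 and 78, the 32‑bit lanes after the second step hold 1234 and 5678, and the final value is 12345678.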

Trick Combination

inline std::uint64_t parse_trick(std::string_view s) noexcept {
  std::uint64_t upper = parse_8_chars(s.data());
  std::uint64_t lower = parse_8_chars(s.data() + 8);
  return upper * 100000000ULL + lower;
}

The combined approach improves performance by roughly 56 % over the pure unrolled version.
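Since the fast path skips all validation and depends on byte order, it seems prudent to cross‑check it against a library parser. The sketch below repeats parse_8_chars and parse_trick so that it compiles on its own; the helper matches_from_chars and the GCC/Clang byte‑order macros used in the guard are assumptions, not part of the article's code.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <system_error>

#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
static_assert(__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__,
              "the mask constants assume little-endian loads");
#endif

// Repeated from the article so this sketch is self-contained.
inline std::uint64_t parse_8_chars(const char* string) noexcept {
  std::uint64_t chunk = 0;
  std::memcpy(&chunk, string, sizeof(chunk));
  chunk = ((chunk & 0x0f000f000f000f00) >> 8) + (chunk & 0x000f000f000f000f) * 10;
  chunk = ((chunk & 0x00ff000000ff0000) >> 16) + (chunk & 0x000000ff000000ff) * 100;
  chunk = ((chunk & 0x0000ffff00000000) >> 32) + (chunk & 0x000000000000ffff) * 10000;
  return chunk;
}

inline std::uint64_t parse_trick(std::string_view s) noexcept {
  assert(s.size() == 16);  // debug-only precondition check
  const std::uint64_t upper = parse_8_chars(s.data());
  const std::uint64_t lower = parse_8_chars(s.data() + 8);
  return upper * 100000000ULL + lower;
}

// Hypothetical helper: true when the fast path and std::from_chars agree on `s`.
inline bool matches_from_chars(std::string_view s) noexcept {
  std::uint64_t expected = 0;
  const auto res = std::from_chars(s.data(), s.data() + s.size(), expected);
  return res.ec == std::errc{} && res.ptr == s.data() + s.size() &&
         parse_trick(s) == expected;
}
```

Running matches_from_chars over the sample timestamps and a few edge cases (all zeros, all nines) is a cheap way to catch a wrong mask constant before benchmarking.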

SIMD Trick

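Using SSE intrinsics (SSE3, SSSE3, and SSE4.1; no AVX is required), we load all 16 bytes at once, subtract the ASCII zero from each byte, and apply vectorised multiply‑add operations that collapse adjacent digits into 2‑digit, then 4‑digit, then 8‑digit values.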

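#include <cstdint>
#include <immintrin.h>  // SSE3, SSSE3 and SSE4.1 intrinsics; build with e.g. -msse4.1

inline std::uint64_t parse_16_chars(const char* string) noexcept {
  // Unaligned load of all 16 characters, then turn ASCII into digit values 0..9.
  auto chunk = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(string));
  auto zeros = _mm_set1_epi8('0');
  chunk = _mm_sub_epi8(chunk, zeros);
  // 16 digits -> eight 2‑digit values (16‑bit lanes).
  const auto mult1 = _mm_set_epi8(1,10,1,10,1,10,1,10,1,10,1,10,1,10,1,10);
  chunk = _mm_maddubs_epi16(chunk, mult1);
  // Eight 2‑digit values -> four 4‑digit values (32‑bit lanes).
  const auto mult2 = _mm_set_epi16(1,100,1,100,1,100,1,100);
  chunk = _mm_madd_epi16(chunk, mult2);
  // Pack back to 16‑bit lanes, then four 4‑digit values -> two 8‑digit values.
  chunk = _mm_packus_epi32(chunk, chunk);
  const auto mult3 = _mm_set_epi16(0,0,0,0,1,10000,1,10000);
  chunk = _mm_madd_epi16(chunk, mult3);
  // Low 64 bits: first 8 digits in bits 0..31, last 8 digits in bits 32..63.
  const auto packed = static_cast<std::uint64_t>(_mm_cvtsi128_si64(chunk));
  return ((packed & 0xffffffffULL) * 100000000ULL) + (packed >> 32);
}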

On modern CPUs this reaches ~0.75 ns per conversion, a several‑hundred‑fold improvement over stringstream.
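The lane arithmetic behind the intrinsics can be modelled in plain C++, which may make the three multiply‑add stages easier to follow. This is a scalar illustration only, with the hypothetical name parse_16_chars_model; it is not vectorised and makes no performance claim.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical scalar model of the SIMD lanes.
// Precondition (unchecked): s holds exactly 16 ASCII digits.
inline std::uint64_t parse_16_chars_model(std::string_view s) noexcept {
  // Subtract '0' from every byte (models _mm_sub_epi8).
  std::array<std::uint32_t, 16> digits{};
  for (std::size_t i = 0; i < 16; ++i) {
    digits[i] = static_cast<std::uint32_t>(s[i] - '0');
  }

  // Adjacent digits -> eight 2-digit values (models _mm_maddubs_epi16 with 10, 1).
  std::array<std::uint32_t, 8> pairs{};
  for (std::size_t i = 0; i < 8; ++i) {
    pairs[i] = digits[2 * i] * 10 + digits[2 * i + 1];
  }

  // Adjacent pairs -> four 4-digit values (models _mm_madd_epi16 with 100, 1).
  std::array<std::uint32_t, 4> quads{};
  for (std::size_t i = 0; i < 4; ++i) {
    quads[i] = pairs[2 * i] * 100 + pairs[2 * i + 1];
  }

  // Adjacent quads -> two 8-digit values (models the pack plus
  // _mm_madd_epi16 with 10000, 1); each is at most 99999999.
  const std::uint64_t high = quads[0] * 10000 + quads[1];
  const std::uint64_t low = quads[2] * 10000 + quads[3];

  // Final scalar combine, as in the return statement of parse_16_chars.
  return high * 100000000ULL + low;
}
```

In the intrinsic version each of the three loops is a single instruction operating on all lanes at once, which is where the speedup comes from.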

Conclusion

Standard conversion utilities are often sufficient, but when parsing massive streams of fixed‑length numeric strings they become bottlenecks. By applying low‑level tricks—loop unrolling, byteswap, bitmasking, and SIMD vectorisation—C++ developers can achieve order‑of‑magnitude speedups, which is crucial for high‑throughput data‑processing systems.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
