Parse 16‑Digit Timestamps Up to 700× Faster Than std::stringstream
This article explores why standard string‑to‑integer conversions become performance bottlenecks in high‑concurrency scenarios and presents a series of increasingly optimized C++ solutions—from native library calls to loop‑unrolled, byteswap, divide‑and‑conquer, and SIMD tricks—demonstrating dramatic speed gains backed by Google Benchmark results.
In performance‑focused programming contests, converting fixed‑length (16‑digit) timestamp strings to integers often becomes a hotspot, and the usual tools (Integer.valueOf / Long.valueOf in Java, std::stringstream in C++) are far from optimal.
Problem Statement
Given a 16‑character numeric string, we need the fastest possible parsing method.
timestamp
1585201087123567
1585201087123585
1585201087123621
The benchmark baseline (named BM_mov) simply loads the constant into a register.
Native Solutions
std::atoll
std::stringstream
C++17 charconv
boost::spirit::qi
Google Benchmark results show stringstream is ~391× slower than the baseline, while charconv and boost::spirit perform much better.
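For reference, the three standard-library options can be wrapped as shown below. This is a minimal sketch rather than the benchmarked code: the wrapper names parse_atoll, parse_stringstream and parse_from_chars are hypothetical, and boost::spirit::qi is left out to keep the example dependency-free.

```cpp
#include <charconv>
#include <cstdint>
#include <cstdlib>
#include <sstream>
#include <string>
#include <string_view>
#include <system_error>

// std::atoll expects a NUL-terminated string, so the view is copied first.
inline std::uint64_t parse_atoll(std::string_view s) {
    const std::string copy(s);
    return static_cast<std::uint64_t>(std::atoll(copy.c_str()));
}

// std::stringstream: builds a stream object, then does locale-aware extraction.
inline std::uint64_t parse_stringstream(std::string_view s) {
    std::stringstream in{std::string(s)};
    std::uint64_t value = 0;
    in >> value;
    return value;
}

// std::from_chars (C++17, <charconv>): no allocation, no locale, errors reported via ec.
inline std::uint64_t parse_from_chars(std::string_view s) noexcept {
    std::uint64_t value = 0;
    const std::from_chars_result r = std::from_chars(s.data(), s.data() + s.size(), value);
    return r.ec == std::errc{} ? value : 0;  // returning 0 on error is an assumption of this sketch
}
```

All three return the same value for a well-formed timestamp; they differ in how much work (allocation, locale handling, error checking) happens on the way.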
Naive Loop
inline std::uint64_t parse_naive(std::string_view s) noexcept {
    std::uint64_t result = 0;
    for (char digit : s) {
        result *= 10;
        result += digit - '0';
    }
    return result;
}
Despite its simplicity, this approach can beat the standard library when input validation is unnecessary.
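To show what skipping validation means in practice, here is a checked variant of the same loop. It is only an illustration: the name parse_naive_checked and the std::optional return type are assumptions, not part of the original benchmark.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical checked variant: accepts exactly 16 ASCII digits, rejects everything else.
inline std::optional<std::uint64_t> parse_naive_checked(std::string_view s) noexcept {
    if (s.size() != 16) return std::nullopt;
    std::uint64_t result = 0;
    for (char c : s) {
        if (c < '0' || c > '9') return std::nullopt;  // the branch the unchecked loop avoids
        result = result * 10 + static_cast<std::uint64_t>(c - '0');
    }
    return result;
}
```

The extra length test and per-character branch are exactly the work the unchecked version omits, so the unchecked version is only safe when the input format is guaranteed upstream.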
Loop‑Unrolled Solution
inline std::uint64_t parse_unrolled(std::string_view s) noexcept {
    std::uint64_t result = 0;
    result += (s[0] - '0') * 1000000000000000ULL;
    result += (s[1] - '0') * 100000000000000ULL;
    // ... omitted for brevity ...
    result += (s[15] - '0');
    return result;
}
Removing the loop reduces overhead and yields a noticeable speedup.
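Writing sixteen terms by hand is error-prone. One alternative, which is not the article's code, is to let the compiler generate the terms with a C++17 fold expression. The names Pow10, parse_unrolled_impl and parse_unrolled_fold are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <utility>

// Compile-time powers of ten: Pow10<N>::value == 10^N.
template <std::size_t N>
struct Pow10 { static constexpr std::uint64_t value = 10 * Pow10<N - 1>::value; };
template <>
struct Pow10<0> { static constexpr std::uint64_t value = 1; };

template <std::size_t... I>
inline std::uint64_t parse_unrolled_impl(const char* s, std::index_sequence<I...>) noexcept {
    // The fold expands to one "digit * constant" term per index, with no loop.
    return ((static_cast<std::uint64_t>(s[I] - '0') * Pow10<sizeof...(I) - 1 - I>::value) + ...);
}

// Assumes s holds at least 16 ASCII digits; no validation is performed.
inline std::uint64_t parse_unrolled_fold(std::string_view s) noexcept {
    return parse_unrolled_impl(s.data(), std::make_index_sequence<16>{});
}
```

Whether this produces the same machine code as the hand-written version depends on the compiler and flags, so it should be benchmarked rather than assumed equivalent.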
Byteswap Technique
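By copying eight characters into a 64‑bit integer with std::memcpy, subtracting the ASCII '0' from every byte in one 64‑bit subtraction, and applying __builtin_bswap64 (a GCC/Clang builtin), we get the eight digit values packed one per byte. The byte swap matters on little‑endian machines such as x86, where memcpy places the first character in the least significant byte. This is the basis of the fastest native implementation so far.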
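To make the reinterpretation concrete, the sketch below performs only the three steps just described and returns the packed digit bytes. The helper name digits_to_bytes is hypothetical, and the code assumes a little-endian target and a GCC/Clang compiler.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the 8 digit values, one per byte,
// with the first character in the most significant byte.
inline std::uint64_t digits_to_bytes(const char* s) noexcept {
    std::uint64_t chunk = 0;
    std::memcpy(&chunk, s, sizeof(chunk));  // little-endian: s[0] lands in the lowest byte
    chunk -= 0x3030303030303030ULL;         // subtract '0' (0x30) from all eight bytes at once
    return __builtin_bswap64(chunk);        // reverse the bytes so s[0] becomes most significant
}
```

For the input "12345678" the memcpy yields 0x3837363534333231, the subtraction gives 0x0807060504030201, and the swap produces 0x0102030405060708. No byte ever borrows from its neighbour because every ASCII digit is at least 0x30.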
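#include <cstdint>
#include <cstring>
// Primary template; only the std::uint64_t specialization below is defined.
template <typename T> inline T get_zeros_string() noexcept;
// Returns eight '0' characters packed into one 64-bit integer.
template <> inline std::uint64_t get_zeros_string<std::uint64_t>() noexcept {
    std::uint64_t result = 0;
    constexpr char zeros[] = "00000000";
    std::memcpy(&result, zeros, sizeof(result));
    return result;
}
inline std::uint64_t parse_8_chars(const char* string) noexcept {
    std::uint64_t chunk = 0;
    std::memcpy(&chunk, string, sizeof(chunk));
    // Convert every byte from ASCII to its digit value, then reverse the byte order.
    chunk = __builtin_bswap64(chunk - get_zeros_string<std::uint64_t>());
    // ...
    return chunk;
}
Divide‑and‑Conquer (Bitmask) Solution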
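inline std::uint64_t parse_8_chars(const char* string) noexcept {
    std::uint64_t chunk = 0;
    std::memcpy(&chunk, string, sizeof(chunk));
    // 1-byte mask: combine 4 pairs of single digits into 4 two-digit values
    std::uint64_t lower = (chunk & 0x0f000f000f000f00) >> 8;
    std::uint64_t upper = (chunk & 0x000f000f000f000f) * 10;
    chunk = lower + upper;
    // 2-byte mask: combine 2 pairs of two-digit values into 2 four-digit values
    lower = (chunk & 0x00ff000000ff0000) >> 16;
    upper = (chunk & 0x000000ff000000ff) * 100;
    chunk = lower + upper;
    // 4-byte mask: combine the pair of four-digit values into the final result
    lower = (chunk & 0x0000ffff00000000) >> 32;
    upper = (chunk & 0x000000000000ffff) * 10000;
    chunk = lower + upper;
    return chunk;
}
This combines digits pairwise, in parallel, inside a single register: eight digits need only three mask‑shift‑multiply‑add steps (log2 8 = 3) instead of eight sequential multiply‑adds, which gives O(log n) behaviour in the number of digits. The masks assume the first character sits in the lowest byte, so this version works on little‑endian machines without a byte swap.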
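The three combining steps are easier to follow with the intermediate values exposed. The sketch below repeats the same masks but records each stage; the Trace struct and trace_parse_8_chars are hypothetical names, and a little-endian target is assumed.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical record of the register contents after each combining step.
struct Trace {
    std::uint64_t pairs;  // four 16-bit lanes, each a two-digit value
    std::uint64_t quads;  // two 32-bit lanes, each a four-digit value
    std::uint64_t value;  // the final eight-digit number
};

inline Trace trace_parse_8_chars(const char* s) noexcept {
    std::uint64_t chunk = 0;
    std::memcpy(&chunk, s, sizeof(chunk));
    Trace t{};
    // Step 1: digit pairs (d0 d1), (d2 d3), ... become d0*10 + d1, and so on.
    t.pairs = ((chunk & 0x0f000f000f000f00ULL) >> 8) + (chunk & 0x000f000f000f000fULL) * 10;
    // Step 2: neighbouring two-digit values become four-digit values.
    t.quads = ((t.pairs & 0x00ff000000ff0000ULL) >> 16) + (t.pairs & 0x000000ff000000ffULL) * 100;
    // Step 3: the two four-digit values become the eight-digit result.
    t.value = ((t.quads & 0x0000ffff00000000ULL) >> 32) + (t.quads & 0x000000000000ffffULL) * 10000;
    return t;
}
```

For "12345678" the lanes after step 1 hold 12, 34, 56 and 78 (0x004E00380022000C), after step 2 they hold 1234 and 5678 (0x0000162E000004D2), and step 3 yields 12345678.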
Trick Combination
inline std::uint64_t parse_trick(std::string_view s) noexcept {
    std::uint64_t upper = parse_8_chars(s.data());
    std::uint64_t lower = parse_8_chars(s.data() + 8);
    return upper * 100000000ULL + lower;
}
The combined approach improves performance by roughly 56 % over the pure unrolled version.
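Bit tricks like these are easy to get subtly wrong, so it is worth cross-checking them against a trusted parser. The sketch below restates the two functions under the hypothetical names parse_8_digits_le and parse_16_digits_le so that it compiles on its own, and compares the result with std::from_chars. The helper matches_from_chars is also an assumption, and a little-endian target is assumed.

```cpp
#include <charconv>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <system_error>

// Compact restatement of the bitmask parser (little-endian only, no validation).
inline std::uint64_t parse_8_digits_le(const char* s) noexcept {
    std::uint64_t chunk = 0;
    std::memcpy(&chunk, s, sizeof(chunk));
    chunk = ((chunk & 0x0f000f000f000f00ULL) >> 8) + (chunk & 0x000f000f000f000fULL) * 10;
    chunk = ((chunk & 0x00ff000000ff0000ULL) >> 16) + (chunk & 0x000000ff000000ffULL) * 100;
    chunk = ((chunk & 0x0000ffff00000000ULL) >> 32) + (chunk & 0x000000000000ffffULL) * 10000;
    return chunk;
}

// First eight digits are the high half, so they are scaled by 10^8.
inline std::uint64_t parse_16_digits_le(std::string_view s) noexcept {
    return parse_8_digits_le(s.data()) * 100000000ULL + parse_8_digits_le(s.data() + 8);
}

// Hypothetical check: true when the trick agrees with std::from_chars.
inline bool matches_from_chars(std::string_view s) noexcept {
    std::uint64_t expected = 0;
    const std::from_chars_result r = std::from_chars(s.data(), s.data() + s.size(), expected);
    return r.ec == std::errc{} && parse_16_digits_le(s) == expected;
}
```

Running such a check over the sample timestamps, or over randomly generated 16-digit strings, gives some confidence before the fast path replaces the standard one.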
SIMD Trick
Using SSE intrinsics (SSE3 for the load, SSSE3 for _mm_maddubs_epi16, SSE4.1 for _mm_packus_epi32), we load all 16 bytes at once, subtract the ASCII zero, and apply vectorised multiply‑add operations to collapse digits.
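#include <immintrin.h>  // SSE3, SSSE3 and SSE4.1 intrinsics
inline std::uint64_t parse_16_chars(const char* string) noexcept {
    // Load 16 characters and turn each ASCII byte into its digit value.
    auto chunk = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(string));
    auto zeros = _mm_set1_epi8('0');
    chunk = _mm_sub_epi8(chunk, zeros);
    // 16 digits -> 8 two-digit values (16-bit lanes).
    const auto mult1 = _mm_set_epi8(1,10,1,10,1,10,1,10,1,10,1,10,1,10,1,10);
    chunk = _mm_maddubs_epi16(chunk, mult1);
    // 8 two-digit values -> 4 four-digit values (32-bit lanes).
    const auto mult2 = _mm_set_epi16(1,100,1,100,1,100,1,100);
    chunk = _mm_madd_epi16(chunk, mult2);
    // Narrow back to 16-bit lanes, then 4 four-digit values -> 2 eight-digit values.
    chunk = _mm_packus_epi32(chunk, chunk);
    const auto mult3 = _mm_set_epi16(0,0,0,0,1,10000,1,10000);
    chunk = _mm_madd_epi16(chunk, mult3);
    // chunk[0] relies on the GCC/Clang vector extension; its low 32 bits hold the first eight digits.
    return ((chunk[0] & 0xffffffffULL) * 100000000ULL) + (chunk[0] >> 32);
}
On modern CPUs this reaches ~0.75 ns per conversion, a several‑hundred‑fold improvement over stringstream.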
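The final line of parse_16_chars indexes the vector as chunk[0], which is a GCC/Clang extension. A possible variation, sketched below under the hypothetical name parse_16_chars_portable, extracts the low 64 bits with _mm_cvtsi128_si64 instead. The SSE41_TARGET macro and the scalar fallback for non-x86 targets are also assumptions added so the sketch builds without special compiler flags.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>

#if defined(__GNUC__)
// Lets GCC/Clang compile this one function with SSE4.1 even without -msse4.1.
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

SSE41_TARGET inline std::uint64_t parse_16_chars_portable(const char* s) noexcept {
    __m128i chunk = _mm_lddqu_si128(reinterpret_cast<const __m128i*>(s));
    chunk = _mm_sub_epi8(chunk, _mm_set1_epi8('0'));
    chunk = _mm_maddubs_epi16(chunk, _mm_set_epi8(1,10,1,10,1,10,1,10,1,10,1,10,1,10,1,10));
    chunk = _mm_madd_epi16(chunk, _mm_set_epi16(1,100,1,100,1,100,1,100));
    chunk = _mm_packus_epi32(chunk, chunk);
    chunk = _mm_madd_epi16(chunk, _mm_set_epi16(0,0,0,0,1,10000,1,10000));
    // Low 32 bits: first eight digits; high 32 bits: last eight digits.
    const std::uint64_t packed = static_cast<std::uint64_t>(_mm_cvtsi128_si64(chunk));
    return (packed & 0xffffffffULL) * 100000000ULL + (packed >> 32);
}
#else
// Scalar fallback so the sketch still builds on non-x86 targets.
inline std::uint64_t parse_16_chars_portable(const char* s) noexcept {
    std::uint64_t result = 0;
    for (int i = 0; i < 16; ++i) result = result * 10 + static_cast<std::uint64_t>(s[i] - '0');
    return result;
}
#endif
```

The function always reads exactly 16 bytes, so the caller must guarantee that many readable characters; no validation is performed.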
Conclusion
Standard conversion utilities are often sufficient, but when parsing massive streams of fixed‑length numeric strings they become bottlenecks. By applying low‑level tricks—loop unrolling, byteswap, bitmasking, and SIMD vectorisation—C++ developers can achieve order‑of‑magnitude speedups, which is crucial for high‑throughput data‑processing systems.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
