
Why the Most‑Copied Stack Overflow Byte‑Count Snippet Is Flawed (and How to Fix It)

This article examines the popular Java humanReadableByteCount function, reveals hidden bugs in its handling of edge cases and floating‑point precision, presents a logarithm‑based solution, discusses empirical research on Stack Overflow code reuse, and offers a robust final implementation.

Programmer DD

Aug 12, 2021

One day, while browsing Stack Overflow in search of reputation points, I came across a question asking how to format a byte count as a human-readable string such as "123.5 MB": a value in the range 1 to 999.9 followed by an appropriate unit suffix.

The accepted answer looped over the suffixes (EB, PB, TB, GB, MB, kB, B) and their corresponding magnitudes, selecting the first magnitude that does not exceed the byte count. The pseudocode was:

suffixes   = [ "EB", "PB", "TB", "GB", "MB", "kB", "B" ]
magnitudes = [ 10^18, 10^15, 10^12, 10^9, 10^6, 10^3, 10^0 ]
i = 0
while (i < magnitudes.length - 1 && magnitudes[i] > byteCount)
    i++
printf("%.1f %s", byteCount / magnitudes[i], suffixes[i])
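
As a rough, runnable Java rendering of the pseudocode above, the loop might look like the sketch below. The class name `LoopByteFormat`, the rejection of negative input, the stop at the last array entry, and the use of `Locale.ROOT` are my own assumptions and not part of the accepted answer.

```java
import java.util.Locale;

final class LoopByteFormat {
    private static final String[] SUFFIXES = {"EB", "PB", "TB", "GB", "MB", "kB", "B"};
    private static final long[] MAGNITUDES = {
        1_000_000_000_000_000_000L, // 10^18
        1_000_000_000_000_000L,     // 10^15
        1_000_000_000_000L,         // 10^12
        1_000_000_000L,             // 10^9
        1_000_000L,                 // 10^6
        1_000L,                     // 10^3
        1L                          // 10^0
    };

    static String format(long byteCount) {
        if (byteCount < 0) {
            // Assumption: the pseudocode does not say what to do with negative input.
            throw new IllegalArgumentException("byteCount must be non-negative");
        }
        int i = 0;
        // Skip magnitudes larger than the value, stopping at the last entry ("B") at the latest.
        while (i < MAGNITUDES.length - 1 && MAGNITUDES[i] > byteCount) {
            i++;
        }
        // Locale.ROOT pins the decimal separator to '.' regardless of the system locale.
        return String.format(Locale.ROOT, "%.1f %s", (double) byteCount / MAGNITUDES[i], SUFFIXES[i]);
    }

    public static void main(String[] args) {
        System.out.println(format(123_500_000L)); // 123.5 MB
        System.out.println(format(1_500L));       // 1.5 kB
        System.out.println(format(0L));           // 0.0 B
    }
}
```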

The loop works, but it felt wasteful to me. The magnitudes are simply powers of 1000 (or 1024 for IEC units), so the right suffix can be derived with a logarithm instead. My first alternative implementation was:

public static String humanReadableByteCount(long bytes, boolean si) {
    int unit = si ? 1000 : 1024;
    if (bytes < unit) return bytes + " B";
    int exp = (int) (Math.log(bytes) / Math.log(unit));
    String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp-1) + (si ? "" : "i");
    return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);
}
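
To make the logarithm step concrete, here is a small sketch that isolates the exponent and prefix calculation from the function above. The class and method names (`ExponentSketch`, `exponent`, `prefix`) are hypothetical and introduced only for illustration; `prefix` assumes an exponent of at least 1.

```java
final class ExponentSketch {
    // In exact arithmetic this is floor(log_unit(bytes)): the number of times
    // bytes can be divided by unit while the result stays >= 1.
    static int exponent(long bytes, int unit) {
        return (int) (Math.log(bytes) / Math.log(unit));
    }

    // Maps exponent 1..6 to k/M/G/T/P/E, appending "i" for the IEC (1024-based) units.
    static String prefix(int exp, boolean si) {
        return (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
    }

    public static void main(String[] args) {
        int siExp = exponent(1_500_000L, 1000);
        System.out.println(siExp + " " + prefix(siExp, true));    // 2 M
        int iecExp = exponent(5_000_000_000L, 1024);
        System.out.println(iecExp + " " + prefix(iecExp, false)); // 3 Gi
    }
}
```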

This version eliminated the loop, but at a cost in readability and in performance, since it calls log and pow.

I later discovered a 2018 empirical study by Sebastian Baltes, “Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects” (Empirical Software Engineering), which found that most developers who copy code from Stack Overflow do so without proper attribution.

The study extracted snippets from Stack Overflow data dumps and matched them against public GitHub repositories, revealing that the humanReadableByteCount function appears thousands of times. To check whether your own repository contains a copy, run:

$ git grep humanReadableByteCount

Further investigation uncovered several subtle bugs in the original snippet:

For an input of 999 999 bytes (SI mode) the function returned "1000.0 kB" instead of the correct "1.0 MB". Because 999 999 is below 10^6, the exponent stays at 1, and 999.999 then rounds up to 1000.0 when formatted with one decimal place.

Similar rounding errors occurred at higher scales (e.g., 999 949 999 999 999 999 bytes produced "1000.0 PB" instead of "999.9 PB").

Floating‑point precision limits of double caused incorrect results near Long.MAX_VALUE.
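
The first bug and the precision limit can be reproduced with a short sketch. `firstVersion` below is the first log-based implementation from earlier in this article, with `Locale.ROOT` added only to pin the decimal separator; `survivesDoubleRoundTrip` is a hypothetical helper of mine that checks whether a long can be represented exactly as a double.

```java
import java.util.Locale;

final class FirstVersionDemo {
    // The first log-based version; Locale.ROOT is an addition for reproducible output.
    static String firstVersion(long bytes, boolean si) {
        int unit = si ? 1000 : 1024;
        if (bytes < unit) return bytes + " B";
        int exp = (int) (Math.log(bytes) / Math.log(unit));
        String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
        return String.format(Locale.ROOT, "%.1f %sB", bytes / Math.pow(unit, exp), pre);
    }

    // Hypothetical helper: true if converting to double and back loses nothing.
    static boolean survivesDoubleRoundTrip(long value) {
        return (long) (double) value == value;
    }

    public static void main(String[] args) {
        // exp is 1 because 999 999 < 10^6, and 999.999 rounds to 1000.0 at one decimal.
        System.out.println(firstVersion(999_999L, true)); // 1000.0 kB

        // A double has a 53-bit significand, so an odd value this large cannot be stored exactly.
        System.out.println(survivesDoubleRoundTrip(999_949_999_999_999_999L)); // false
        System.out.println(survivesDoubleRoundTrip(1_000_000L));               // true
    }
}
```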

To address these issues I introduced a threshold, Math.ceil(Math.pow(unit, exp) * (unit - 0.05)), and incremented the exponent whenever the byte count reaches that threshold. I also added handling for negative inputs by computing the exponent from the absolute value, and, to improve precision, divided the byte count by the unit once before formatting when the exponent is large.
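
The two ideas in this fix can be illustrated in isolation. The sketch below is not the author's code: it computes the same threshold with `BigDecimal` instead of `double` so the values are exact, and the names `ThresholdSketch`, `rollover`, and `safeAbs` are hypothetical. It assumes half-up rounding to one decimal place.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class ThresholdSketch {
    // Smallest byte count at scale unit^exp whose one-decimal rendering would
    // reach unit.0 (e.g. "1000.0"), i.e. ceil(unit^exp * (unit - 0.05)).
    static long rollover(int unit, int exp) {
        BigDecimal scale = BigDecimal.valueOf(unit).pow(exp);
        BigDecimal limit = BigDecimal.valueOf(unit).subtract(new BigDecimal("0.05"));
        return scale.multiply(limit).setScale(0, RoundingMode.CEILING).longValueExact();
    }

    // Math.abs(Long.MIN_VALUE) overflows and stays negative, so that input is clamped.
    static long safeAbs(long bytes) {
        return bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
    }

    public static void main(String[] args) {
        System.out.println(rollover(1000, 1));                            // 999950
        System.out.println(rollover(1024, 1));                            // 1048525
        System.out.println(Math.abs(Long.MIN_VALUE) < 0);                 // true
        System.out.println(safeAbs(Long.MIN_VALUE) == Long.MAX_VALUE);    // true
    }
}
```

Under these assumptions, any SI byte count from 999 950 upward should be shown with the next unit, which is why 999 999 becomes "1.0 MB" after the fix.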

The final version, declared strictfp, is:

// From: https://programming.guide/worlds-most-copied-so-snippet.html
public static strictfp String humanReadableByteCount(long bytes, boolean si) {
    int unit = si ? 1000 : 1024;
    long absBytes = bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
    if (absBytes < unit) return bytes + " B";
    int exp = (int) (Math.log(absBytes) / Math.log(unit));
    long th = (long) Math.ceil(Math.pow(unit, exp) * (unit - 0.05));
    if (exp < 6 && absBytes >= th - ((th & 0xFFF) == 0xD00 ? 51 : 0)) exp++;
    String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp-1) + (si ? "" : "i");
    if (exp > 4) {
        bytes /= unit;
        exp -= 1;
    }
    return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);
}
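
A few edge cases can be checked against this final version with a small harness. The wrapper class `FinalVersionCheck` and the choice of inputs are mine; the method body is copied from the listing above. `Locale.setDefault(Locale.ROOT)` is an assumption added so that the decimal separator is always '.'.

```java
import java.util.Locale;

final class FinalVersionCheck {
    // From: https://programming.guide/worlds-most-copied-so-snippet.html
    static strictfp String humanReadableByteCount(long bytes, boolean si) {
        int unit = si ? 1000 : 1024;
        long absBytes = bytes == Long.MIN_VALUE ? Long.MAX_VALUE : Math.abs(bytes);
        if (absBytes < unit) return bytes + " B";
        int exp = (int) (Math.log(absBytes) / Math.log(unit));
        long th = (long) Math.ceil(Math.pow(unit, exp) * (unit - 0.05));
        if (exp < 6 && absBytes >= th - ((th & 0xFFF) == 0xD00 ? 51 : 0)) exp++;
        String pre = (si ? "kMGTPE" : "KMGTPE").charAt(exp - 1) + (si ? "" : "i");
        if (exp > 4) {
            bytes /= unit;
            exp -= 1;
        }
        return String.format("%.1f %sB", bytes / Math.pow(unit, exp), pre);
    }

    public static void main(String[] args) {
        // String.format without a locale argument uses the default locale.
        Locale.setDefault(Locale.ROOT);

        System.out.println(humanReadableByteCount(999L, true));            // 999 B
        System.out.println(humanReadableByteCount(1000L, true));           // 1.0 kB
        System.out.println(humanReadableByteCount(999_949L, true));        // 999.9 kB
        System.out.println(humanReadableByteCount(999_999L, true));        // 1.0 MB
        System.out.println(humanReadableByteCount(1024L, false));          // 1.0 KiB
        System.out.println(humanReadableByteCount(-1000L, true));          // -1.0 kB
        System.out.println(humanReadableByteCount(Long.MAX_VALUE, true));  // 9.2 EB
        System.out.println(humanReadableByteCount(Long.MAX_VALUE, false)); // 8.0 EiB
        System.out.println(humanReadableByteCount(Long.MIN_VALUE, true));  // -9.2 EB
    }
}
```

On Java 17 and later, floating-point evaluation is always strict, so the strictfp modifier is redundant there and the compiler may emit a warning; it is kept here to match the listing.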

Despite its original goal of avoiding loops and branches, the final code is more complex than the initial answer, illustrating that even highly up‑voted snippets can contain hidden flaws. The key take‑aways are to test all edge cases, be cautious with floating‑point arithmetic, and always attribute copied code.

Stack Overflow code, even with thousands of up‑votes, may still be buggy; thorough testing and proper attribution are essential.

