How I Cut a 1‑Billion‑Row PHP Parser from 25 Minutes to 27 Seconds

This article walks through a step‑by‑step performance‑engineering journey for the One Billion Row Challenge in PHP: starting from a naïve CSV parser, it replaces fgetcsv() with fgets(), applies reference operators and explicit type casting, enables the JIT, and finally uses multithreaded parallel processing to shrink the runtime from 25 minutes to under 30 seconds.

Open Source Tech Hub

Overview

The 1 billion‑row challenge (1brc) on GitHub (https://github.com/gunnarmorling/1brc) requires computing the minimum, maximum and average temperature for each city from a 13 GB CSV file. This summary describes a step‑by‑step optimisation of a PHP script that solves the problem.
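To make the input and output formats concrete, here is a toy end‑to‑end sketch (the sample lines are illustrative, not from the real dataset): each input line is `<city>;<temperature>`, and the result reports min/mean/max per city.

```php
<?php
// Toy version of the task on four sample lines (illustrative data).
// Each entry in $stations is [min, max, sum, count].
$lines = "Hamburg;12.0\nBulawayo;8.9\nHamburg;34.2\nBulawayo;-3.1\n";
$stations = [];
foreach (explode("\n", trim($lines)) as $line) {
    [$city, $temp] = explode(';', $line);
    $temp = (float)$temp;
    if (!isset($stations[$city])) {
        $stations[$city] = [$temp, $temp, $temp, 1];
    } else {
        $s = &$stations[$city];
        $s[3]++;
        $s[2] += $temp;
        if ($temp < $s[0]) $s[0] = $temp;
        elseif ($temp > $s[1]) $s[1] = $temp;
        unset($s);
    }
}
ksort($stations);
foreach ($stations as $city => $s) {
    printf("%s=%.1f/%.1f/%.1f\n", $city, $s[0], $s[2] / $s[3], $s[1]);
}
```

On this sample the script prints `Bulawayo=-3.1/2.9/8.9` and `Hamburg=12.0/23.1/34.2`; the real challenge applies the same aggregation to a billion such lines.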

Naïve implementation

The initial version reads the file with fgetcsv(), stores per‑city data in an associative array $stations (min, max, sum, count) and sorts the result with ksort(). On a typical laptop the runtime is about 25 minutes.

<?php
$stations = [];
$fp = fopen('measurements.txt', 'r');
// Each entry: [min, max, sum, count]
while ($data = fgetcsv($fp, null, ';')) {
    if (!isset($stations[$data[0]])) {
        $stations[$data[0]] = [$data[1], $data[1], $data[1], 1];
    } else {
        $stations[$data[0]][3]++;
        $stations[$data[0]][2] += $data[1];
        if ($data[1] < $stations[$data[0]][0]) $stations[$data[0]][0] = $data[1];
        if ($data[1] > $stations[$data[0]][1]) $stations[$data[0]][1] = $data[1];
    }
}
ksort($stations);
echo '{';
foreach ($stations as $k => &$station) {
    $station[2] = $station[2] / $station[3]; // replace sum with mean
    echo $k, '=', $station[0], '/', $station[2], '/', $station[1], ', ';
}
echo '}';
?>

Replace fgetcsv() with fgets()

Reading each line with fgets() and splitting on the semicolon removes the CSV parser overhead.

while ($data = fgets($fp, 999)) {
    $pos = strpos($data, ';');
    $city = substr($data, 0, $pos);
    $temp = substr($data, $pos + 1, -1); // drop the trailing newline
    if (!isset($stations[$city])) {
        $stations[$city] = [$temp, $temp, $temp, 1];
    } else {
        $stations[$city][3]++;
        $stations[$city][2] += $temp;
        if ($temp < $stations[$city][0]) $stations[$city][0] = $temp;
        if ($temp > $stations[$city][1]) $stations[$city][1] = $temp;
    }
}

This change reduces the runtime to 19 minutes 49 seconds (≈21 % faster).

Use reference operator

Assigning a reference to the city entry avoids repeated hash look‑ups.

$station = &$stations[$city];
$station[3]++;
$station[2] += $temp;
// instead of $stations[$city][3]++ etc.

Runtime improves to 17 minutes 48 seconds (≈10 % faster).

Combine min/max checks

Merge the two separate comparisons into a single conditional block, shaving another ~2 %.
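A minimal sketch of the merged comparison (assuming the existing `$station` entry and the `$temp` value from the previous steps; the sample numbers are illustrative). Once an entry exists, a single value can never be both a new minimum and a new maximum, so `elseif` is safe:

```php
<?php
// $station holds [min, max, sum, count] for an existing city.
$station = [10.0, 20.0, 30.0, 2];
$temp = 25.0;

// One branch instead of two independent ifs: if $temp sets a new
// minimum it cannot also exceed the current maximum, and vice versa.
if ($temp < $station[0]) {
    $station[0] = $temp;
} elseif ($temp > $station[1]) {
    $station[1] = $temp;
}
```

This saves one comparison per line on the common path, which adds up over a billion iterations.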

Explicit type casting

Convert the temperature string to a float once, rather than on each use.

$temp = (float)substr($data, $pos + 1, -1);

Runtime drops to 13 minutes 32 seconds (≈21 % faster).
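The effect of the single cast can be seen in isolation (the sample line is illustrative): after the cast, every later comparison and addition operates on a float instead of repeatedly coercing a numeric string.

```php
<?php
// Cast once at parse time; all later uses see a float.
$data = "Hamburg;-12.3\n";
$pos = strpos($data, ';');
$temp = (float)substr($data, $pos + 1, -1); // strip the newline, convert once

var_dump($temp);       // float(-12.3)
var_dump($temp < 0.0); // bool(true)
```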

Enable PHP JIT

In CLI mode OPcache is disabled by default. Enable it and allocate a JIT buffer:

opcache.enable_cli=1
opcache.jit_buffer_size=10M

With JIT the script finishes in 7 minutes 19 seconds (≈45.9 % reduction).
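The same settings can also be supplied per run with `-d` flags instead of editing php.ini (the script name here is illustrative):

```shell
php -d opcache.enable_cli=1 -d opcache.jit_buffer_size=10M 1brc.php
```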

Multithreaded parallel processing

Because each line can be processed independently, the file is split into $threads_cnt chunks aligned on newline boundaries. Each chunk is processed by a separate \parallel\Runtime worker. The workers return partial $stations maps, which the main thread merges.

<?php
$file = 'measurements.txt';
$threads_cnt = 16;

function get_file_chunks(string $file, int $cpu_count): array {
    $size = filesize($file);
    $chunk_size = $cpu_count === 1 ? $size : (int)($size / $cpu_count);
    $fp = fopen($file, 'rb');
    $chunks = [];
    $chunk_start = 0;
    while ($chunk_start < $size) {
        $chunk_end = min($size, $chunk_start + $chunk_size);
        if ($chunk_end < $size) {
            fseek($fp, $chunk_end);
            fgets($fp); // advance past the current line so the chunk ends on a newline
            $chunk_end = ftell($fp);
        }
        $chunks[] = [$chunk_start, $chunk_end];
        $chunk_start = $chunk_end; // next chunk starts at the byte after the newline
    }
    fclose($fp);
    return $chunks;
}

$process_chunk = function (string $file, int $chunk_start, int $chunk_end): array {
    $stations = [];
    $fp = fopen($file, 'rb');
    fseek($fp, $chunk_start);
    while ($data = fgets($fp)) {
        $chunk_start += strlen($data);
        if ($chunk_start > $chunk_end) break;
        $pos = strpos($data, ';');
        $city = substr($data, 0, $pos);
        $temp = (float)substr($data, $pos + 1, -1);
        if (isset($stations[$city])) {
            $station = &$stations[$city];
            $station[3]++;
            $station[2] += $temp;
            if ($temp < $station[0]) $station[0] = $temp;
            elseif ($temp > $station[1]) $station[1] = $temp;
        } else {
            $stations[$city] = [$temp, $temp, $temp, 1];
        }
    }
    return $stations;
};

$chunks = get_file_chunks($file, $threads_cnt);
$futures = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    $runtime = new \parallel\Runtime();
    $futures[$i] = $runtime->run($process_chunk, [$file, $chunks[$i][0], $chunks[$i][1]]);
}

$results = [];
for ($i = 0; $i < $threads_cnt; $i++) {
    $chunk_result = $futures[$i]->value();
    foreach ($chunk_result as $city => $measurement) {
        if (isset($results[$city])) {
            $result = &$results[$city];
            $result[2] += $measurement[2];
            $result[3] += $measurement[3];
            if ($measurement[0] < $result[0]) $result[0] = $measurement[0];
            if ($measurement[1] > $result[1]) $result[1] = $measurement[1];
        } else {
            $results[$city] = $measurement;
        }
    }
}
ksort($results);

echo "{\n";
foreach ($results as $k => &$station) {
    echo "\t", $k, '=', $station[0], '/', ($station[2] / $station[3]), '/', $station[1], ",\n";
}
echo "}\n";
?>
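One operational caveat not covered above: the parallel extension only loads on a thread‑safe (ZTS) PHP build. A defensive guard (a hypothetical sketch, not part of the original script) can fall back to a single worker so the script still runs elsewhere:

```php
<?php
// Hypothetical guard: use multiple workers only when ext-parallel is
// available; otherwise process the file in one chunk (slowly).
$threads_cnt = extension_loaded('parallel') ? 16 : 1;
printf("using %d worker(s)\n", $threads_cnt);
```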

With 16 threads the total runtime is 1 minute 35 seconds. After recompiling PHP 8.3 with optimized CFLAGS and using 10 threads, the final runtime reaches 27.7 seconds.

Key take‑aways

High‑level helpers such as fgetcsv() hide costly operations; low‑level line reading (fgets()) gives full control.

Reference variables and explicit type casting reduce hash look‑ups and repeated conversions.

Enabling JIT in CLI can dramatically speed up CPU‑bound workloads.

The parallel extension allows near‑linear scaling when the work is cleanly split.

Performance tuning is layered: algorithmic changes, language‑level options, and hardware‑aware parallelism all contribute.
