How to Process a 1‑Billion‑Row Text File in Go 30× Faster (3.4 s)

This article walks through a step‑by‑step performance engineering journey that reduces the processing time of a 1 billion‑line, 13 GB text file from 105 seconds to 3.4 seconds in Go by applying custom parsing, bulk I/O, a hand‑rolled hash table, and parallelism.


Challenge Background

Input: one billion lines, each line formatted as city;temperature. Compute the minimum, maximum, and average temperature for each city using only the Go standard library (no unsafe, assembly, or memory‑mapped files).

Generating Test Data

A helper program creates a large file with random city names and temperatures.

package main

import (
    "bufio"
    "fmt"
    "math/rand"
    "os"
    "time"
)

// generateData creates a test file with the given number of lines.
func generateData(filename string, lines int) error {
    stations := []string{"Berlin", "Paris", "London", "NewYork", "Tokyo", "Sydney", "Moscow", "Beijing", "Delhi", "Rio"}
    f, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer f.Close()
    w := bufio.NewWriter(f) // buffer writes; a fmt.Fprintf per line straight to the file is slow
    r := rand.New(rand.NewSource(time.Now().UnixNano()))
    for i := 0; i < lines; i++ {
        station := stations[r.Intn(len(stations))]
        temp := float64(r.Intn(1980)-990) / 10.0 // -99.0 to +98.9
        fmt.Fprintf(w, "%s;%.1f\n", station, temp)
    }
    return w.Flush()
}

func main() {
    lines := 10000000 // 10 million for quick tests; use 1000000000 for the full challenge
    if err := generateData("measurements.txt", lines); err != nil {
        panic(err)
    }
    fmt.Printf("generated: measurements.txt (%d lines)\n", lines)
}

Running the program produces measurements.txt, which can be used for benchmarks.

Optimization Steps

1. Baseline (≈105 s)

The straightforward implementation uses bufio.Scanner, splits each line with strings.Cut, parses the temperature with strconv.ParseFloat, and stores per‑city statistics in a map[string]*StationStats.

scanner := bufio.NewScanner(file)
stats := make(map[string]*StationStats)

for scanner.Scan() {
    line := scanner.Text()
    station, tempStr, _ := strings.Cut(line, ";")
    tempF, _ := strconv.ParseFloat(tempStr, 64)
    temp := int(tempF * 10) // store as tenths of a degree
    s, ok := stats[station]
    if !ok {
        s = &StationStats{}
        stats[station] = s
    }
    s.Update(temp)
}

This version allocates a new string for every line, runs a generic floating‑point parser on each temperature, and hashes string keys in the map.

Execution time ~105 seconds.
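For reference, the StationStats type used by the baseline (shown in full in the final program) keeps integer aggregates in tenths of a degree; a runnable sketch:

```go
package main

import "fmt"

// StationStats accumulates per-city aggregates in tenths of a degree,
// keeping the hot loop free of floating-point work.
type StationStats struct {
	min, max, sum, count int
}

// Update folds one measurement into the running aggregates.
func (s *StationStats) Update(temp int) {
	if s.count == 0 {
		s.min, s.max = temp, temp
	}
	if temp < s.min {
		s.min = temp
	}
	if temp > s.max {
		s.max = temp
	}
	s.sum += temp
	s.count++
}

func main() {
	var s StationStats
	for _, t := range []int{-15, 302, 88} { // -1.5, 30.2, 8.8 degrees
		s.Update(t)
	}
	fmt.Println(s.min, s.max, s.sum, s.count) // -15 302 375 3
}
```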

2. Custom temperature parser (≈55 s)

Temperatures always have an optional leading minus sign, one or two integer digits, and exactly one decimal digit. A hand‑written parser can therefore work directly on the byte slice, eliminating the heavy strconv.ParseFloat call.

func parseTemp(b []byte) int {
    neg := false
    if b[0] == '-' {
        neg = true
        b = b[1:]
    }
    index := 0
    temp := int(b[0] - '0')
    index++
    if b[index] != '.' {
        temp = temp*10 + int(b[index]-'0')
        index++
    }
    index++ // skip '.'
    temp = temp*10 + int(b[index]-'0')
    if neg {
        return -temp
    }
    return temp
}

This roughly halves parsing overhead, reducing total runtime to ~55 seconds.
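As a sanity check, the custom parser can be compared against strconv.ParseFloat on a few representative inputs (a quick test harness, not part of the original program):

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseTemp is the custom parser from the article: optional minus sign,
// one or two integer digits, a dot, and exactly one decimal digit,
// returned as an integer count of tenths of a degree.
func parseTemp(b []byte) int {
	neg := false
	if b[0] == '-' {
		neg = true
		b = b[1:]
	}
	index := 0
	temp := int(b[0] - '0')
	index++
	if b[index] != '.' {
		temp = temp*10 + int(b[index]-'0')
		index++
	}
	index++ // skip '.'
	temp = temp*10 + int(b[index]-'0')
	if neg {
		return -temp
	}
	return temp
}

func main() {
	ok := true
	for _, s := range []string{"0.0", "9.9", "-9.9", "12.3", "-99.0", "98.9"} {
		f, _ := strconv.ParseFloat(s, 64)
		want := int(math.Round(f * 10)) // round to dodge binary float artifacts
		if got := parseTemp([]byte(s)); got != want {
			ok = false
			fmt.Printf("mismatch on %s: got %d, want %d\n", s, got, want)
		}
	}
	fmt.Println(ok) // true: parseTemp agrees with strconv on all samples
}
```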

3. Bulk read instead of Scanner (≈41 s)

Replace bufio.Scanner with a large pre‑allocated buffer and manual \n splitting to remove per‑line slice allocations and scanner bookkeeping.

buf := make([]byte, 1<<20) // 1 MB read buffer
var leftover []byte        // partial line carried over from the previous read
for {
    n, err := file.Read(buf)
    if n == 0 && err == io.EOF {
        break
    }
    data := append(leftover, buf[:n]...)
    start := 0
    for i := 0; i < len(data); i++ {
        if data[i] == '\n' {
            processLine(data[start:i]) // uses parseTemp internally
            start = i + 1
        }
    }
    leftover = append([]byte(nil), data[start:]...) // keep the incomplete tail
}

Reduces function‑call and slice‑allocation overhead, achieving ~41 seconds (2.5× faster than baseline).
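To see that manual newline scanning splits lines exactly the way bufio.Scanner does, here is a small self-contained comparison on an in-memory sample (illustrative only):

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

func main() {
	sample := []byte("Berlin;12.3\nParis;-5.0\nTokyo;31.9\n")

	// Manual scan: walk the buffer once and slice out each line in place.
	var manual []string
	start := 0
	for i := 0; i < len(sample); i++ {
		if sample[i] == '\n' {
			manual = append(manual, string(sample[start:i]))
			start = i + 1
		}
	}

	// Reference: bufio.Scanner over the same bytes.
	var scanned []string
	sc := bufio.NewScanner(bytes.NewReader(sample))
	for sc.Scan() {
		scanned = append(scanned, sc.Text())
	}

	fmt.Println(strings.Join(manual, "|") == strings.Join(scanned, "|")) // true
}
```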

4. Custom hash table (≈22 s)

The standard map[string]*StationStats incurs per‑key string allocation and garbage‑collection pressure. A bespoke open‑addressing hash table stores keys as []byte and uses FNV‑1a hashing.

type entry struct {
    key   []byte
    stats StationStats
    used  bool
}

type hashTable struct {
    buckets []entry
}

func (h *hashTable) getOrInsert(key []byte) *StationStats {
    hash := fnv1a(key)
    idx := int(hash & uint64(len(h.buckets)-1))
    for {
        e := &h.buckets[idx]
        if !e.used {
            e.key = append([]byte(nil), key...)
            e.used = true
            return &e.stats
        }
        if bytes.Equal(e.key, key) {
            return &e.stats
        }
        idx = (idx + 1) & (len(h.buckets) - 1)
    }
}

Eliminates string allocations and reduces GC load, cutting the runtime to ~22 seconds.

5. Parallel processing (≈3.4 s)

Split the file into chunks, process each chunk in a separate goroutine, and merge the per‑chunk hash tables.

numCPU := runtime.NumCPU()
chunk := size / int64(numCPU) // size is the file length in bytes
results := make([]*hashTable, numCPU)
var wg sync.WaitGroup

for i := 0; i < numCPU; i++ {
    start := int64(i) * chunk
    end := start + chunk
    if i == numCPU-1 {
        end = size
    }
    wg.Add(1)
    go func(idx int, s, e int64) {
        defer wg.Done()
        results[idx] = processChunk(filename, s, e)
    }(i, start, end)
}
wg.Wait()

Fully utilizes multi‑core CPUs, bringing total processing time down to ~3.4 seconds—a ~30× speedup over the naive baseline.
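The subtle part of parallel splitting is chunk boundaries: a chunk whose start falls mid-line must skip forward to the next line start, while the previous chunk finishes the straddling line. A toy sketch over an in-memory buffer (linesInChunk is an illustrative helper, not from the original code):

```go
package main

import (
	"bytes"
	"fmt"
)

// linesInChunk returns the lines whose first byte falls in [start, end).
// A chunk starting mid-line skips to the next line start; a line that
// begins before end is processed to completion even if it crosses end,
// so every line lands in exactly one chunk.
func linesInChunk(data []byte, start, end int) []string {
	begin := start
	if start > 0 {
		// Scan from start-1 so a chunk landing exactly on a line
		// boundary does not skip that line.
		nl := bytes.IndexByte(data[start-1:], '\n')
		if nl < 0 {
			return nil
		}
		begin = start - 1 + nl + 1
	}
	var out []string
	for begin < end && begin < len(data) {
		nl := bytes.IndexByte(data[begin:], '\n')
		if nl < 0 {
			out = append(out, string(data[begin:]))
			break
		}
		out = append(out, string(data[begin:begin+nl]))
		begin += nl + 1
	}
	return out
}

func main() {
	data := []byte("Berlin;12.3\nParis;-5.0\nTokyo;31.9\nRio;20.1\n")
	mid := len(data) / 2 // an arbitrary split point, likely mid-line
	all := append(linesInChunk(data, 0, mid), linesInChunk(data, mid, len(data))...)
	fmt.Println(len(all)) // 4: every line appears exactly once
}
```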

Final Complete Program

The following Go program combines all of the optimizations above: the custom temperature parser, bulk reads, a hand‑rolled hash table, and parallel chunk processing.

package main

import (
    "bytes"
    "fmt"
    "hash/fnv"
    "io"
    "os"
    "runtime"
    "sync"
)

type StationStats struct {
    min, max, sum, count int
}

func (s *StationStats) Update(temp int) {
    if s.count == 0 {
        s.min, s.max = temp, temp
    }
    if temp < s.min {
        s.min = temp
    }
    if temp > s.max {
        s.max = temp
    }
    s.sum += temp
    s.count++
}

// ---- custom hash table ----

type entry struct {
    key   []byte
    stats StationStats
    used  bool
}

type hashTable struct {
    buckets []entry
}

func newHashTable(size int) *hashTable {
    return &hashTable{buckets: make([]entry, size)}
}

func fnv1a(data []byte) uint64 {
    h := fnv.New64a()
    h.Write(data)
    return h.Sum64()
}

func (h *hashTable) getOrInsert(key []byte) *StationStats {
    hash := fnv1a(key)
    idx := int(hash & uint64(len(h.buckets)-1))
    for {
        e := &h.buckets[idx]
        if !e.used {
            e.key = append([]byte(nil), key...)
            e.used = true
            return &e.stats
        }
        if bytes.Equal(e.key, key) {
            return &e.stats
        }
        idx = (idx + 1) & (len(h.buckets) - 1)
    }
}

// ---- high‑performance temperature parser ----
func parseTemp(b []byte) int {
    neg := false
    if b[0] == '-' {
        neg = true
        b = b[1:]
    }
    index := 0
    temp := int(b[0] - '0')
    index++
    if b[index] != '.' {
        temp = temp*10 + int(b[index]-'0')
        index++
    }
    index++ // skip '.'
    temp = temp*10 + int(b[index]-'0')
    if neg {
        return -temp
    }
    return temp
}

func processChunk(filename string, start, end int64) *hashTable {
    ht := newHashTable(1 << 20) // power of two, so the hash can be masked
    f, err := os.Open(filename)
    if err != nil {
        return ht
    }
    defer f.Close()
    pos := start
    if start > 0 {
        // Start one byte early so a chunk that begins exactly on a line
        // boundary is not skipped, then consume through the first
        // newline; the previous chunk finishes the straddling line.
        f.Seek(start-1, io.SeekStart)
        pos = start - 1
        one := make([]byte, 1)
        for {
            n, _ := f.Read(one)
            if n == 0 {
                return ht
            }
            pos++
            if one[0] == '\n' {
                break
            }
        }
    }
    buf := make([]byte, 1<<20) // 1 MB read buffer
    var leftover []byte
    dataStart := pos // absolute offset of the first byte of leftover
    for {
        n, _ := f.Read(buf)
        if n == 0 {
            // Final line without a trailing newline, if any.
            if len(leftover) > 0 && dataStart < end {
                processLineInto(ht, leftover)
            }
            break
        }
        data := append(leftover, buf[:n]...)
        ls := 0 // current line start within data
        for i := 0; i < len(data); i++ {
            if data[i] != '\n' {
                continue
            }
            // A line belongs to this chunk if it starts before end.
            if dataStart+int64(ls) >= end {
                return ht
            }
            if i > ls {
                processLineInto(ht, data[ls:i])
            }
            ls = i + 1
        }
        leftover = append([]byte(nil), data[ls:]...)
        dataStart += int64(ls)
        if dataStart >= end {
            return ht
        }
    }
    return ht
}

func processLineInto(ht *hashTable, line []byte) {
    idx := bytes.LastIndexByte(line, ';')
    if idx < 0 {
        return
    }
    ht.getOrInsert(line[:idx]).Update(parseTemp(line[idx+1:]))
}

func optimized(filename string, workers int) (map[string]*StationStats, error) {
    f, err := os.Open(filename)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    fi, err := f.Stat()
    if err != nil {
        return nil, err
    }
    size := fi.Size()
    chunk := size / int64(workers)
    results := make([]*hashTable, workers)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        start := int64(i) * chunk
        end := start + chunk
        if i == workers-1 {
            end = size
        }
        wg.Add(1)
        go func(idx int, s, e int64) {
            defer wg.Done()
            results[idx] = processChunk(filename, s, e)
        }(i, start, end)
    }
    wg.Wait()
    final := make(map[string]*StationStats)
    for _, ht := range results {
        for _, e := range ht.buckets {
            if !e.used {
                continue
            }
            k := string(e.key)
            s, ok := final[k]
            if !ok {
                // Seed min/max from the first merged chunk.
                s = &StationStats{min: e.stats.min, max: e.stats.max}
                final[k] = s
            }
            s.sum += e.stats.sum
            s.count += e.stats.count
            if e.stats.min < s.min {
                s.min = e.stats.min
            }
            if e.stats.max > s.max {
                s.max = e.stats.max
            }
        }
    }
    return final, nil
}

func main() {
    res, err := optimized("measurements.txt", runtime.NumCPU())
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    for k, v := range res {
        avg := float64(v.sum) / float64(v.count) / 10.0
        fmt.Printf("%s=%.1f (min: %.1f, max: %.1f)\n", k, avg, float64(v.min)/10.0, float64(v.max)/10.0)
    }
}

Results

Baseline implementation: ~105 seconds.

Fully optimized version: ~3.4 seconds, roughly a 30× speedup.

Takeaways

Start with a correct baseline to ensure functional correctness.

Identify the main bottlenecks: string allocations, generic parsing, map look‑ups, and I/O.

Iteratively eliminate unnecessary work—custom parsers, bulk reads, bespoke data structures.

When the standard library becomes a limiting factor, a well‑designed custom structure can provide order‑of‑magnitude gains.

Leverage concurrency to fully exploit modern multi‑core CPUs for the final performance boost.

Reference: https://benhoyt.com/writings/go-1brc/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Optimization, large-file, 1BRC, hash-table
Written by Code Wrench

Focuses on code debugging, performance optimization, and real-world engineering, sharing efficient development tips and pitfall guides. We break down technical challenges in a down-to-earth style, helping you craft handy tools so every line of code becomes a problem‑solving weapon. 🔧💻