How to Process a 1‑Billion‑Row Text File in Go 30× Faster (3.4 s)
This article walks through a step‑by‑step performance‑engineering journey that cuts the time to process a one‑billion‑line, 13 GB text file in Go from 105 seconds to 3.4 seconds by applying custom parsing, bulk I/O, a hand‑rolled hash table, and parallelism.
Challenge Background
Input: one billion lines, each line formatted as city;temperature. Compute the minimum, maximum, and average temperature for each city using only the Go standard library (no unsafe, assembly, or memory‑mapped files).
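For illustration, a few lines of the input look like this (the station names match the generator below; the real data set contains many more distinct stations, and every value has exactly one decimal place):
Berlin;12.3
Paris;8.9
London;-5.4
Berlin;31.0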
Generating Test Data
A helper program creates a large file with random city names and temperatures.
package main

import (
    "fmt"
    "math/rand"
    "os"
    "time"
)

// generateData creates a test file with the given number of lines.
func generateData(filename string, lines int) error {
    stations := []string{"Berlin", "Paris", "London", "NewYork", "Tokyo", "Sydney", "Moscow", "Beijing", "Delhi", "Rio"}
    f, err := os.Create(filename)
    if err != nil {
        return err
    }
    defer f.Close()
    rand.Seed(time.Now().UnixNano())
    for i := 0; i < lines; i++ {
        station := stations[rand.Intn(len(stations))]
        temp := float64(rand.Intn(1980)-990) / 10.0 // -99.0 to +98.9
        fmt.Fprintf(f, "%s;%.1f\n", station, temp)
    }
    return nil
}

func main() {
    lines := 10000000 // 10 million lines for a quick test; raise toward 1e9 for the full benchmark
    if err := generateData("measurements.txt", lines); err != nil {
        panic(err)
    }
    fmt.Printf("generated: measurements.txt (%d lines)\n", lines)
}
Running the program produces measurements.txt, which can be used for benchmarks.
Optimization Steps
1. Baseline (≈105 s)
The straightforward implementation uses bufio.Scanner, splits each line with strings.Cut, parses the temperature with strconv.ParseFloat, and stores per‑city statistics in a map[string]*StationStats.
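For reference, the StationStats type and its Update method used by every version below (and repeated unchanged in the final program) keep all values as integer tenths of a degree:
// StationStats accumulates per-station statistics in tenths of a degree.
type StationStats struct {
    min, max, sum, count int
}

func (s *StationStats) Update(temp int) {
    if s.count == 0 {
        s.min, s.max = temp, temp
    }
    if temp < s.min {
        s.min = temp
    }
    if temp > s.max {
        s.max = temp
    }
    s.sum += temp
    s.count++
}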
scanner := bufio.NewScanner(file)
stats := make(map[string]*StationStats)
for scanner.Scan() {
    line := scanner.Text()
    station, tempStr, _ := strings.Cut(line, ";")
    tempF, _ := strconv.ParseFloat(tempStr, 64)
    temp := int(math.Round(tempF * 10)) // store as tenths of a degree; rounding avoids float truncation errors
    s, ok := stats[station]
    if !ok {
        s = &StationStats{}
        stats[station] = s
    }
    s.Update(temp)
}
This version allocates a new string for every line, runs a general-purpose floating-point parser on every value, and hashes string keys on every map lookup.
Execution time ~105 seconds.
2. Custom temperature parser (≈55 s)
Temperatures always have one or two integer digits, a decimal point, and exactly one fractional digit (for example -5.4 or 12.3). A hand‑written parser that works directly on the byte slice and returns tenths of a degree eliminates the heavy strconv.ParseFloat call and the intermediate string.
func parseTemp(b []byte) int {
    neg := false
    if b[0] == '-' {
        neg = true
        b = b[1:]
    }
    index := 0
    temp := int(b[0] - '0')
    index++
    if b[index] != '.' { // second integer digit, if present
        temp = temp*10 + int(b[index]-'0')
        index++
    }
    index++ // skip '.'
    temp = temp*10 + int(b[index]-'0')
    if neg {
        return -temp
    }
    return temp
}
Parsing overhead is roughly halved, reducing the total runtime to ~55 seconds.
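A quick sanity check of the fixed-point mapping (assuming parseTemp from above is in scope):
fmt.Println(parseTemp([]byte("12.3"))) // 123
fmt.Println(parseTemp([]byte("-5.4"))) // -54
fmt.Println(parseTemp([]byte("0.0")))  // 0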
3. Bulk read instead of Scanner (≈41 s)
Replace bufio.Scanner with a large pre‑allocated buffer and manual \n splitting, carrying any partial line at the end of one read over to the next, to remove per‑line allocations and scanner bookkeeping.
buf := make([]byte, 1<<20) // 1 MB read buffer
leftover := 0              // length of a partial line carried over from the previous read
for {
    n, err := file.Read(buf[leftover:])
    if n == 0 && err == io.EOF {
        break
    }
    n += leftover
    start := 0
    for i := 0; i < n; i++ {
        if buf[i] == '\n' {
            processLine(buf[start:i]) // splits the record and uses parseTemp internally
            start = i + 1
        }
    }
    // move the trailing partial line to the front of the buffer
    leftover = copy(buf, buf[start:n])
}
This reduces function-call and slice-allocation overhead, bringing the runtime down to ~41 seconds (about 2.5× faster than the baseline).
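processLine itself is not shown above; a minimal sketch consistent with the earlier steps (it uses the bytes package and assumes the stats map and parseTemp from the previous sections are in scope, and it still allocates a string key, which step 4 removes) could look like this:
// processLine splits one "station;temperature" record and updates its stats.
func processLine(line []byte) {
    idx := bytes.LastIndexByte(line, ';')
    if idx < 0 {
        return // skip malformed or empty lines
    }
    station := string(line[:idx]) // still allocates; eliminated by the custom hash table in step 4
    s, ok := stats[station]
    if !ok {
        s = &StationStats{}
        stats[station] = s
    }
    s.Update(parseTemp(line[idx+1:]))
}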
4. Custom hash table (≈22 s)
The standard map[string]*StationStats incurs string-key allocation and garbage‑collection pressure. A bespoke open‑addressing hash table with linear probing stores keys as []byte and uses FNV‑1a hashing.
type entry struct {
    key   []byte
    stats StationStats
    used  bool
}

type hashTable struct {
    buckets []entry // length must be a power of two
}

func (h *hashTable) getOrInsert(key []byte) *StationStats {
    hash := fnv1a(key)
    idx := int(hash & uint64(len(h.buckets)-1))
    for {
        e := &h.buckets[idx]
        if !e.used {
            // empty slot: claim it with a copy of the key
            e.key = append([]byte(nil), key...)
            e.used = true
            return &e.stats
        }
        if bytes.Equal(e.key, key) {
            return &e.stats
        }
        idx = (idx + 1) & (len(h.buckets) - 1) // linear probing
    }
}
This eliminates per-line string allocations and reduces GC load, cutting the runtime to ~22 seconds.
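The fnv1a helper used above appears in the final program built on hash/fnv from the standard library; an equivalent hand-rolled version with the standard 64-bit FNV-1a constants avoids creating a hasher per call:
// fnv1a computes the 64-bit FNV-1a hash of data without any allocation.
func fnv1a(data []byte) uint64 {
    const (
        offset64 uint64 = 14695981039346656037
        prime64  uint64 = 1099511628211
    )
    hash := offset64
    for _, b := range data {
        hash ^= uint64(b)
        hash *= prime64
    }
    return hash
}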
5. Parallel processing (≈3.4 s)
Split the file into chunks, process each chunk in a separate goroutine, and merge the per‑chunk hash tables.
numCPU := runtime.NumCPU()
chunk := size / int64(numCPU) // size is the total file size in bytes
results := make([]*hashTable, numCPU)
var wg sync.WaitGroup
for i := 0; i < numCPU; i++ {
    start := int64(i) * chunk
    end := start + chunk
    if i == numCPU-1 {
        end = size // the last worker takes any remainder
    }
    wg.Add(1)
    go func(idx int, s, e int64) {
        defer wg.Done()
        results[idx] = processChunk(filename, s, e)
    }(i, start, end)
}
wg.Wait()
This fully utilizes multi-core CPUs, bringing the total processing time down to ~3.4 seconds, a roughly 30× speedup over the naive baseline.
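The fan-out above leaves one hash table per worker; the merge step (shown in full inside optimized in the final program) folds every used bucket into a single standard map:
final := make(map[string]*StationStats)
for _, ht := range results {
    for i := range ht.buckets {
        e := &ht.buckets[i]
        if !e.used {
            continue
        }
        k := string(e.key)
        s, ok := final[k]
        if !ok {
            c := e.stats // first chunk to report this station: copy its stats
            final[k] = &c
            continue
        }
        s.sum += e.stats.sum
        s.count += e.stats.count
        if e.stats.min < s.min {
            s.min = e.stats.min
        }
        if e.stats.max > s.max {
            s.max = e.stats.max
        }
    }
}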
Final Complete Program
The following Go program combines all of the optimizations above: the custom parser, bulk reads, the hand‑rolled hash table, and parallel execution.
package main

import (
    "bytes"
    "fmt"
    "hash/fnv"
    "io"
    "os"
    "runtime"
    "sync"
)
// StationStats accumulates per-station statistics in tenths of a degree.
type StationStats struct {
    min, max, sum, count int
}

func (s *StationStats) Update(temp int) {
    if s.count == 0 {
        s.min, s.max = temp, temp
    }
    if temp < s.min {
        s.min = temp
    }
    if temp > s.max {
        s.max = temp
    }
    s.sum += temp
    s.count++
}
// ---- custom hash table ----

type entry struct {
    key   []byte
    stats StationStats
    used  bool
}

type hashTable struct {
    buckets []entry
}

func newHashTable(size int) *hashTable {
    return &hashTable{buckets: make([]entry, size)}
}

func fnv1a(data []byte) uint64 {
    h := fnv.New64a()
    h.Write(data)
    return h.Sum64()
}

func (h *hashTable) getOrInsert(key []byte) *StationStats {
    hash := fnv1a(key)
    idx := int(hash & uint64(len(h.buckets)-1))
    for {
        e := &h.buckets[idx]
        if !e.used {
            e.key = append([]byte(nil), key...)
            e.used = true
            return &e.stats
        }
        if bytes.Equal(e.key, key) {
            return &e.stats
        }
        idx = (idx + 1) & (len(h.buckets) - 1)
    }
}
// ---- high‑performance temperature parser ----

func parseTemp(b []byte) int {
    neg := false
    if b[0] == '-' {
        neg = true
        b = b[1:]
    }
    index := 0
    temp := int(b[0] - '0')
    index++
    if b[index] != '.' {
        temp = temp*10 + int(b[index]-'0')
        index++
    }
    index++ // skip '.'
    temp = temp*10 + int(b[index]-'0')
    if neg {
        return -temp
    }
    return temp
}
func processChunk(filename string, start, end int64) *hashTable {
    f, err := os.Open(filename)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Every chunk except the first skips the partial line at its start;
    // the previous chunk reads past its own end to finish that line.
    pos := start
    if start > 0 {
        f.Seek(start-1, io.SeekStart)
        pos = start - 1
        one := make([]byte, 1)
        for {
            n, err := f.Read(one)
            if n == 0 || err != nil {
                break
            }
            pos++
            if one[0] == '\n' {
                break
            }
        }
    }

    buf := make([]byte, 1<<20) // 1 MB read buffer
    ht := newHashTable(1 << 20)
    var leftover []byte
    for pos < end {
        n, err := f.Read(buf)
        if n > 0 {
            data := append(leftover, buf[:n]...)
            lines := bytes.Split(data, []byte("\n"))
            // The last element is an incomplete line; it is finished on the next read.
            leftover = append([]byte(nil), lines[len(lines)-1]...)
            lines = lines[:len(lines)-1]
            for _, line := range lines {
                if pos >= end {
                    break // this line starts in the next chunk
                }
                pos += int64(len(line)) + 1
                if len(line) == 0 {
                    continue
                }
                idx := bytes.LastIndexByte(line, ';')
                station := line[:idx]
                temp := parseTemp(line[idx+1:])
                ht.getOrInsert(station).Update(temp)
            }
        }
        if err != nil {
            break
        }
    }
    // The data file ends with a newline, so there is no trailing partial line to handle.
    return ht
}
func optimized(filename string, workers int) (map[string]*StationStats, error) {
    f, err := os.Open(filename)
    if err != nil {
        return nil, err
    }
    defer f.Close()
    fi, err := f.Stat()
    if err != nil {
        return nil, err
    }
    size := fi.Size()
    chunk := size / int64(workers)

    results := make([]*hashTable, workers)
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        start := int64(i) * chunk
        end := start + chunk
        if i == workers-1 {
            end = size // the last worker takes any remainder
        }
        wg.Add(1)
        go func(idx int, s, e int64) {
            defer wg.Done()
            results[idx] = processChunk(filename, s, e)
        }(i, start, end)
    }
    wg.Wait()

    // Merge the per-chunk hash tables into a single map.
    final := make(map[string]*StationStats)
    for _, ht := range results {
        for i := range ht.buckets {
            e := &ht.buckets[i]
            if !e.used {
                continue
            }
            k := string(e.key)
            s, ok := final[k]
            if !ok {
                c := e.stats // first chunk to report this station: copy its stats
                final[k] = &c
                continue
            }
            s.sum += e.stats.sum
            s.count += e.stats.count
            if e.stats.min < s.min {
                s.min = e.stats.min
            }
            if e.stats.max > s.max {
                s.max = e.stats.max
            }
        }
    }
    return final, nil
}
func main() {
    res, err := optimized("measurements.txt", runtime.NumCPU())
    if err != nil {
        panic(err)
    }
    for k, v := range res {
        avg := float64(v.sum) / float64(v.count) / 10.0
        fmt.Printf("%s=%.1f (min: %.1f, max: %.1f)\n", k, avg, float64(v.min)/10.0, float64(v.max)/10.0)
    }
}
Results
Baseline implementation: ~105 seconds.
Fully optimized version: ~3.4 seconds, roughly a 30× speedup.
Takeaways
Start with a correct baseline to ensure functional correctness.
Identify the main bottlenecks: string allocations, generic parsing, map look‑ups, and I/O.
Iteratively eliminate unnecessary work—custom parsers, bulk reads, bespoke data structures.
When the standard library becomes a limiting factor, a well‑designed custom structure can provide order‑of‑magnitude gains.
Leverage concurrency to fully exploit modern multi‑core CPUs for the final performance boost.
Reference: https://benhoyt.com/writings/go-1brc/