
How to Process 10GB Logs in 30 Seconds with Grep, Sed, and Awk

This comprehensive guide shows how to use the GNU tools grep, sed, and awk to quickly analyse massive Nginx access logs, covering their streaming design, optimal command parameters, real‑world examples, performance tricks, security safeguards and step‑by‑step scripts for fault isolation and reporting.

Overview

During a Double‑11 (Singles’ Day) outage the author needed to analyse ~12 GB of Nginx access logs within minutes. Traditional editors were too slow, so a one‑liner using grep, sed and awk identified the offending IP in about 30 seconds.

Why the “three swordsmen” are fast

Streaming processing: each tool reads a line, processes it and discards it, so memory usage is constant regardless of file size.

C implementation: the tools are written in C and use low‑level system I/O, giving them a performance edge over interpreted languages.

Pipelines: Unix pipes let commands pass data directly without temporary files, and the stages of a pipeline run in parallel.

Optimised regex engines: grep’s DFA and fixed‑string (-F) modes are highly efficient.
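
A minimal illustration of these points, assuming an access.log like the one generated later in this guide: the fixed‑string filter skips the regex engine, the pipe avoids temporary files, and memory stays proportional to the number of distinct IPs rather than the file size.

# fixed-string filter streamed straight into an awk aggregation – no intermediate files
LC_ALL=C grep -F " 500 " access.log | awk '{ip[$1]++} END{for(i in ip) print ip[i], i}'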

Tool responsibilities

grep – fast pattern search, ideal for filtering large logs.

sed – stream editor for in‑place text substitution or deletion.

awk – full programming language for field‑wise calculations, aggregations and formatted output.
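
A sketch of how the three typically divide the work in one pipeline, assuming the access‑log format used later in this article:

# grep narrows the stream, awk extracts the URL field, sed strips uniq's leading padding
grep " 404 " access.log | awk '{print $7}' | sort | uniq -c | sort -rn | sed 's/^ *//' | head -10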

Typical scenarios

Fault isolation – locate error lines, count occurrences, identify abusive IPs.

Log analytics – compute request counts, response‑time statistics, top URLs.

Configuration management – bulk edit config files safely.

Data processing – CSV/JSON conversion, report generation.

Automation – embed the three tools in shell scripts for repeatable tasks.
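
As a concrete instance of the data‑processing case, a small sketch that turns the log into a CSV report (field positions assume the access‑log format generated later in this guide):

# ip, url, status, bytes, response time -> CSV
awk 'BEGIN{OFS=","; print "ip,url,status,bytes,resp_time"}
     {print $1, $7, $9, $10, $NF}' access.log > report.csv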

When not to use them

Complex multi‑file joins – use ELK, Python, or dedicated log platforms.

Persistent storage or stateful processing – the tools are stateless.

Deeply nested data structures – awk’s flat associative arrays make JSON‑like nesting awkward.

Environment requirements

The commands were tested on Ubuntu 22.04, CentOS 8 and macOS Sonoma. GNU versions required: grep 3.8+, sed 4.8+, gawk 5.1+. On macOS install the GNU variants via Homebrew (ggrep, gsed, gawk) and use the g prefix or aliases.
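
A quick way to confirm the installed versions (on macOS substitute the g‑prefixed binaries):

grep --version | head -1    # GNU grep 3.8 or newer
sed --version  | head -1    # GNU sed 4.8 or newer
awk --version  | head -1    # GNU Awk (gawk) 5.1 or newer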

Core command parameters

grep

grep -n "ERROR"               # show line numbers
grep -i "error"               # case‑insensitive
grep -c "ERROR"               # count matches
grep -v "DEBUG"               # exclude pattern
grep -A 5 "Exception"         # show 5 lines after match
grep -B 2 "ERROR"            # show 2 lines before match
grep -C 3 "FATAL"            # show 3 lines context
grep -l "password" *.conf    # list files containing pattern
grep -r "TODO" ./src         # recursive search
grep -w "error"               # whole‑word match
grep -o "ip=[0-9.]+'"          # output only the match
grep -E "err|warn"           # extended regex
grep -P "\d{4}"              # Perl regex
grep -F "fixed[string]"       # fixed‑string search

Choosing -E vs -P : use -E for simple extended regex, -P when you need PCRE features such as \d, \s or look‑ahead.
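
For example, pulling dates out of the timestamp field works with either engine; -P (which requires a grep built with PCRE support) only becomes necessary once you want shorthand classes like \d or look‑arounds:

grep -oE '[0-9]{2}/[A-Za-z]{3}/[0-9]{4}' access.log | head -3   # extended regex is enough here
grep -oP '\d{2}/[A-Za-z]{3}/\d{4}' access.log | head -3         # same match written with PCRE \d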

sed

# basic substitution
sed 's/old/new/' file
sed 's/old/new/g' file

# in‑place edit with backup
sed -i.bak 's/worker_processes auto/worker_processes 8/' /etc/nginx/nginx.conf

# delete lines
sed '/DEBUG/d' file

# address range
sed '10,20d' file

# multiple commands
sed -e 's/a/b/' -e 's/c/d/' file

# script file
sed -f script.sed file

Safety tip : always back up files before using sed -i in production.
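
A minimal backup‑and‑verify workflow along those lines, using the nginx.conf edit from above as the example:

cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.$(date +%F)    # explicit dated backup first
sed -i.bak 's/worker_processes auto/worker_processes 8/' /etc/nginx/nginx.conf
diff /etc/nginx/nginx.conf.bak /etc/nginx/nginx.conf          # review exactly what changed
nginx -t                                                      # validate before reloading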

awk

# field printing
awk '{print $1}' access.log

# conditional filtering
awk '$3 > 100' file

# aggregation
awk '{ip[$1]++} END{for(i in ip) print ip[i], i}' access.log

# BEGIN/END blocks
awk 'BEGIN{FS=":"} {print $1}' file

# arrays and formatting
awk '{count[$7]++; sum[$7]+=$NF}
     END{for(u in count) printf "%s %d %.3f\n", u, count[u], sum[u]/count[u]}' access.log

Step‑by‑step example

Generate test data

#!/bin/bash
LOG_FILE="access.log"
TOTAL_LINES=10000000   # ~1 GB
IPS=("192.168.1.100" "192.168.1.101" "10.0.0.50" "10.0.0.51" "172.16.0.10" "8.8.8.8" "1.1.1.1" "203.0.113.50")
URLS=("/api/users" "/api/orders" "/api/products" "/api/search" "/static/js/main.js")
STATUS_CODES=("200" "200" "200" "200" "200" "201" "301" "302" "400" "401" "403" "404" "500" "502" "503")
USER_AGENTS=("Mozilla/5.0" "curl/7.88.1" "python-requests/2.28.0")
for ((i=1;i<=TOTAL_LINES;i++)); do
  ip=${IPS[$RANDOM % ${#IPS[@]}]}
  url=${URLS[$RANDOM % ${#URLS[@]}]}
  status=${STATUS_CODES[$RANDOM % ${#STATUS_CODES[@]}]}
  ua=${USER_AGENTS[$RANDOM % ${#USER_AGENTS[@]}]}
  size=$((RANDOM % 50000 + 100))
  resp=$(printf "%d.%03d" $((RANDOM % 5)) $((RANDOM % 1000)))   # 0.000–4.999 s; avoids forking awk and re-seeding srand() every line
  ts=$(date "+%d/%b/%Y:%H:%M:%S +0800")
  echo "$ip - - [$ts] \"GET $url HTTP/1.1\" $status $size \"$ua\" $resp"
done > "$LOG_FILE"

Validate the log

# size and line count
ls -lh access.log
wc -l access.log

# preview
head -5 access.log
tail -5 access.log

# random sample
shuf -n 10 access.log | less

Common use‑cases with concrete commands

Top‑10 IPs

# method 1 – classic pipeline
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10

# method 2 – pure awk (lower memory)
awk '{ip[$1]++} END{for(i in ip) print ip[i], i}' access.log | sort -rn | head -10

Top‑10 URLs by request count

awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10

Top‑10 URLs by average response time

awk '{url=$7; t=$NF; cnt[url]++; sum[url]+=t}
     END{for(u in cnt) printf "%s %.3f %d\n", u, sum[u]/cnt[u], cnt[u]}' access.log |
sort -k2 -rn | head -10

5xx error details

grep " 500 " access.log
awk '$9 ~ /^5/ {print}' access.log | head -20
awk '$9 ~ /^5/ {err[$7]++} END{for(u in err) print err[u], u}' access.log | sort -rn | head -10
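
To judge whether the 5xx responses are a spike or background noise, a quick status‑code breakdown over the whole file helps ($9 is the status field in this format):

awk '{code[$9]++} END{for(c in code) printf "%s %d (%.2f%%)\n", c, code[c], code[c]*100/NR}' access.log | sort -k2 -rn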

Performance optimisation tips

Filter with grep before feeding data to awk to reduce volume.

Use LC_ALL=C for pure ASCII processing; byte‑wise comparison is much faster than locale‑aware matching.

Split huge files and process chunks in parallel with GNU parallel or xargs.

Prefer ripgrep (rg) over grep for faster fixed‑string searches.

Avoid unnecessary sort; use associative arrays in awk when possible.

Give sort more memory with --buffer-size (e.g. sort -S 1G) so large inputs spill to temporary files less often.
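
A sketch combining the LC_ALL=C and parallel‑chunk tips above, assuming GNU split and xargs are available: the file is cut into eight pieces on line boundaries and each piece is scanned by its own grep process.

split -n l/8 access.log chunk_                                                     # 8 chunks, lines kept intact
ls chunk_* | LC_ALL=C xargs -n1 -P8 grep -c " 500 " | awk '{s+=$1} END{print s}'   # parallel counts, summed
rm chunk_*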

Security best practices

Never run sed -i directly on production files without a backup; use sed -i.bak or a copy‑restore workflow.

Sanitise user‑supplied patterns, or treat them as literal strings with grep -F --, before passing them to grep or sed, to avoid regex or shell injection (a sketch follows this list).

Limit resource usage with ulimit, timeout or nice when processing very large logs.
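
A sketch of those safeguards in a small wrapper, with $USER_PATTERN standing in for hypothetical untrusted input:

# -F treats the input as a literal string and -- stops option parsing, so the pattern
# can never be read as a regex or an extra flag; timeout and nice cap the impact
timeout 300 nice -n 10 grep -F -- "$USER_PATTERN" access.log | head -100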

Troubleshooting checklist

Check file size and line count.

Verify field separators (awk defaults to whitespace; use -F to set a custom one).

Inspect a few lines with head to confirm format.

Use awk '{for(i=1;i<=NF;i++) print i":"$i}' to debug field extraction.

Ensure correct line endings (LF vs CRLF) and character encoding.
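
Two quick checks along those lines: confirm that one field count dominates, and let file report line endings and encoding.

awk '{print NF}' access.log | sort -n | uniq -c   # a single field count should dominate
file access.log                                   # flags CRLF line terminators and odd encodings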

Simple monitoring script

#!/bin/bash
LOG_FILE="/var/log/app/app.log"
THRESHOLD=10
INTERVAL=60

prev=0
while true; do
  total=$(grep -c "ERROR" "$LOG_FILE")
  delta=$((total - prev))     # only alert on errors that appeared since the last check
  prev=$total
  if [ "$delta" -ge "$THRESHOLD" ]; then
    curl -X POST -H "Content-Type: application/json" \
      -d "{\"text\":\"[ALERT] $delta errors in the last ${INTERVAL}s\"}" \
      https://your-webhook-url
  fi
  sleep "$INTERVAL"
done

Conclusion

The three GNU tools provide a lightweight, high‑performance toolbox for SREs facing massive log files. Understanding their streaming nature, regex engines and how to combine them in pipelines enables fault isolation, analytics and automation without heavyweight platforms.

SRE · log analysis · shell scripting · grep · awk · sed
Written by

Ops Community

A leading IT operations community where professionals share and grow together.
