How to Process 10GB Logs in 30 Seconds with Grep, Sed, and Awk
This comprehensive guide shows how to use the GNU tools grep, sed, and awk to quickly analyse massive Nginx access logs, covering their streaming design, optimal command parameters, real‑world examples, performance tricks, security safeguards and step‑by‑step scripts for fault isolation and reporting.
Overview
During a Double‑11 outage the author needed to analyse ~12 GB of Nginx access logs within minutes. Traditional editors were too slow, so a one‑liner using grep, sed and awk identified the offending IP in about 30 seconds.
Why the “three swordsmen” are fast
Streaming processing: each tool reads a line, processes it and discards it, so memory usage stays constant regardless of file size.
C implementation: the tools are written in C and use low‑level system I/O, giving them a performance edge over interpreted languages.
Pipelines: Unix pipes let commands pass data directly without temporary files, and all stages of a pipeline run concurrently.
Optimised regex engines: grep's DFA and fixed‑string (-F) modes are highly efficient.
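A quick way to see the streaming model in action: the sketch below generates a million lines (made-up data, never written to disk) and filters them through grep, which holds only one line in memory at a time.

```shell
# Stream a million generated lines straight into grep; memory stays flat
# because every stage processes one line and discards it.
seq 1000000 |
awk '{ if ($1 % 1000 == 0) print "ERROR line " $1; else print "ok line " $1 }' |
LC_ALL=C grep -c "ERROR"
# prints 1000
```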
Tool responsibilities
grep – fast pattern search, ideal for filtering large logs.
sed – stream editor for in‑place text substitution or deletion.
awk – full programming language for field‑wise calculations, aggregations and formatted output.
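The division of labour can be sketched in one pipeline over two invented lines in the article's Nginx log format: grep filters, sed rewrites, awk projects fields.

```shell
# Each tool does one job: grep selects error lines, sed strips the quotes,
# awk prints just the ip, method and url fields. (Sample lines are made up.)
printf '%s\n' \
  '1.2.3.4 - - [x] "GET /api/users HTTP/1.1" 500 123 "curl" 0.9' \
  '5.6.7.8 - - [x] "GET /api/orders HTTP/1.1" 200 456 "curl" 0.1' |
grep ' 500 ' |
sed 's/"//g' |
awk '{print $1, $5, $6}'
# prints: 1.2.3.4 GET /api/users
```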
Typical scenarios
Fault isolation – locate error lines, count occurrences, identify abusive IPs.
Log analytics – compute request counts, response‑time statistics, top URLs.
Configuration management – bulk edit config files safely.
Data processing – CSV/JSON conversion, report generation.
Automation – embed the three tools in shell scripts for repeatable tasks.
When not to use them
Complex multi‑file joins – use ELK, Python, or dedicated log platforms.
Persistent storage or stateful processing – the tools are stateless.
Deeply nested data structures – awk’s arrays are limited.
Environment requirements
The commands were tested on Ubuntu 22.04, CentOS 8 and macOS Sonoma. GNU versions required: grep 3.8+, sed 4.8+, gawk 5.1+. On macOS install the GNU variants via Homebrew (ggrep, gsed, gawk) and use the g prefix or aliases.
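A quick sanity check before running the examples; these commands only print version banners, and the gawk line assumes you have it installed:

```shell
# Confirm GNU versions (stock macOS ships BSD tools that lack flags used below).
grep --version | head -n 1               # expect "grep (GNU grep) 3.8" or newer
sed --version 2>/dev/null | head -n 1    # BSD sed has no --version and prints nothing
gawk --version 2>/dev/null | head -n 1   # expect "GNU Awk 5.1" or newer, if installed
# macOS, after `brew install grep gnu-sed gawk`, either call ggrep/gsed/gawk
# directly or add shell aliases:
#   alias grep=ggrep sed=gsed awk=gawk
```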
Core command parameters
grep
grep -n "ERROR" # show line numbers
grep -i "error" # case‑insensitive
grep -c "ERROR" # count matches
grep -v "DEBUG" # exclude pattern
grep -A 5 "Exception" # show 5 lines after match
grep -B 2 "ERROR" # show 2 lines before match
grep -C 3 "FATAL" # show 3 lines context
grep -l "password" *.conf # list files containing pattern
grep -r "TODO" ./src # recursive search
grep -w "error" # whole‑word match
grep -oE "ip=[0-9.]+" # output only the match (+ requires -E)
grep -E "err|warn" # extended regex
grep -P "\d{4}" # Perl regex
grep -F "fixed[string]" # fixed‑string search
Choosing -E vs -P: use -E for simple extended regex, -P when you need PCRE features such as \d, \s or look‑ahead.
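To make the -E versus -P distinction concrete, here is a small sketch (the input strings are invented):

```shell
# -E handles counted repeats with POSIX classes; \d and look-around need -P.
# (-P is a GNU grep feature and may be missing if grep was built without PCRE.)
echo "order id 2024" | grep -oE "[0-9]{4}"        # prints 2024
echo "order id 2024" | grep -oP "\d{4}"           # same match, PCRE shorthand
echo "error-retry error-fatal" | grep -oP "error(?!-retry)"  # only the 2nd "error"
```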
sed
# basic substitution
sed 's/old/new/' file
sed 's/old/new/g' file
# in‑place edit with backup
sed -i.bak 's/worker_processes auto/worker_processes 8/' /etc/nginx/nginx.conf
# delete lines
sed '/DEBUG/d' file
# address range
sed '10,20d' file
# multiple commands
sed -e 's/a/b/' -e 's/c/d/' file
# script file
sed -f script.sed file
Safety tip: always back up files before using sed -i in production.
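A minimal sketch of that backup-first workflow on a scratch copy (the config line is a stand-in, not a real nginx.conf):

```shell
# Edit in place but keep a .bak copy, then review exactly what changed.
cd "$(mktemp -d)"
echo "worker_processes auto;" > nginx.conf            # stand-in config line
sed -i.bak 's/worker_processes auto/worker_processes 8/' nginx.conf
diff nginx.conf.bak nginx.conf || true                # show the change
# roll back if the diff looks wrong:
#   mv nginx.conf.bak nginx.conf
cat nginx.conf                                        # worker_processes 8;
```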
awk
# field printing
awk '{print $1}' access.log
# conditional filtering
awk '$3 > 100' file
# aggregation
awk '{ip[$1]++} END{for(i in ip) print ip[i], i}' access.log
# BEGIN/END blocks
awk 'BEGIN{FS=":"} {print $1}' file
# arrays and formatting
awk '{count[$7]++; sum[$7]+=$NF}
END{for(u in count) printf "%s %d %.3f\n", u, count[u], sum[u]/count[u]}' access.log
Step‑by‑step example
Generate test data
#!/bin/bash
LOG_FILE="access.log"
TOTAL_LINES=10000000 # ~1 GB
IPS=("192.168.1.100" "192.168.1.101" "10.0.0.50" "10.0.0.51" "172.16.0.10" "8.8.8.8" "1.1.1.1" "203.0.113.50")
URLS=("/api/users" "/api/orders" "/api/products" "/api/search" "/static/js/main.js")
STATUS_CODES=("200" "200" "200" "200" "200" "201" "301" "302" "400" "401" "403" "404" "500" "502" "503")
USER_AGENTS=("Mozilla/5.0" "curl/7.88.1" "python-requests/2.28.0")
for ((i=1;i<=TOTAL_LINES;i++)); do
ip=${IPS[$RANDOM % ${#IPS[@]}]}
url=${URLS[$RANDOM % ${#URLS[@]}]}
status=${STATUS_CODES[$RANDOM % ${#STATUS_CODES[@]}]}
ua=${USER_AGENTS[$RANDOM % ${#USER_AGENTS[@]}]}
size=$((RANDOM % 50000 + 100))
# random response time 0.000–4.999 s in pure bash: forking awk per line is slow,
# and awk's srand() reseeds from the clock, repeating values within one second
resp="$((RANDOM % 5)).$(printf '%03d' $((RANDOM % 1000)))"
ts=$(date "+%d/%b/%Y:%H:%M:%S +0800")
echo "$ip - - [$ts] \"GET $url HTTP/1.1\" $status $size \"$ua\" $resp"
done > "$LOG_FILE"
Validate the log
# size and line count
ls -lh access.log
wc -l access.log
# preview
head -5 access.log
tail -5 access.log
# random sample
shuf -n 10 access.log | less
Common use‑cases with concrete commands
Top‑10 IPs
# method 1 – classic pipeline
awk '{print $1}' access.log | sort | uniq -c | sort -rn | head -10
# method 2 – pure awk (lower memory)
awk '{ip[$1]++} END{for(i in ip) print ip[i], i}' access.log | sort -rn | head -10
Top‑10 URLs by request count
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10
Top‑10 URLs by average response time
awk '{url=$7; t=$NF; cnt[url]++; sum[url]+=t}
END{for(u in cnt) printf "%s %.3f %d\n", u, sum[u]/cnt[u], cnt[u]}' access.log |
sort -k2 -rn | head -10
5xx error details
grep " 500 " access.log
awk '$9 ~ /^5/ {print}' access.log | head -20
awk '$9 ~ /^5/ {err[$7]++} END{for(u in err) print err[u], u}' access.log | sort -rn | head -10
Performance optimisation tips
Filter with grep before feeding data to awk to reduce volume.
Use LC_ALL=C for pure ASCII processing.
Split huge files and process chunks in parallel with GNU parallel or xargs.
Prefer ripgrep (rg) over grep for faster fixed‑string searches.
Avoid unnecessary sort; use associative arrays in awk when possible.
Give sort more memory ( sort --buffer-size=1G) so large inputs need fewer temporary‑file merge passes; grep itself has no buffer‑size flag.
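The chunk-and-parallelise tip can be sketched with plain xargs, no GNU parallel needed (GNU split's -n l/4 keeps chunks line-aligned; file names and counts are illustrative):

```shell
# Split the log into 4 line-aligned chunks, grep each in parallel, sum the counts.
cd "$(mktemp -d)"
seq 1 100 | sed 's/$/ ERROR/'  > access.log     # toy log: 100 error lines
seq 1 100 | sed 's/$/ ok/'    >> access.log     # plus 100 clean lines
split -n l/4 access.log chunk_                  # chunk_aa .. chunk_ad
ls chunk_* | xargs -P 4 -I{} sh -c 'LC_ALL=C grep -c "ERROR" {} || true' |
awk '{s+=$1} END{print s}'
# prints 100
```

The `|| true` keeps xargs happy on chunks with zero matches, where grep -c exits non-zero.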
Security best practices
Never run sed -i directly on production files without a backup; use sed -i.bak or a copy‑restore workflow.
Sanitise any user‑supplied patterns before passing them to grep or sed to avoid command injection.
Limit resource usage with ulimit, timeout or nice when processing very large logs.
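Two of those safeguards sketched concretely: `-F -e … --` makes grep treat untrusted input as a literal string rather than a regex or option, and timeout/nice cap a heavy scan (paths and values below are illustrative):

```shell
# Hostile-looking "pattern" from a user: -F takes it literally, -e/-- stop it
# from being parsed as grep options.
user_input='-rf --exclude=*'
echo 'log line with -rf --exclude=* inside' | grep -F -e "$user_input" --
# prints the matching line

# Cap runtime (5 min) and CPU priority for a large scan:
printf 'ERROR boot\nok line\n' > /tmp/demo.log      # tiny stand-in log
timeout 300 nice -n 19 grep -c "ERROR" /tmp/demo.log
# prints 1
```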
Troubleshooting checklist
Check file size and line count.
Verify field separators (default whitespace, use -F for custom).
Inspect a few lines with head to confirm format.
Use awk '{for(i=1;i<=NF;i++) print i":"$i}' to debug field extraction.
Ensure correct line endings (LF vs CRLF) and character encoding.
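The CRLF point is easy to demonstrate: a trailing `\r` rides along on the last field and quietly breaks `$NF` handling until it is stripped (the sample line is invented):

```shell
# Simulate a CRLF line, inspect it, then normalise to LF.
cd "$(mktemp -d)"
printf 'GET /api 0.123\r\n' > sample.log
head -1 sample.log | cat -A          # line ends in ^M$ — CRLF
sed -i 's/\r$//' sample.log          # strip the carriage returns
awk '{print $NF}' sample.log
# prints 0.123
```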
Simple monitoring script
#!/bin/bash
LOG_FILE="/var/log/app/app.log"
THRESHOLD=10
INTERVAL=60
prev=$(grep -c "ERROR" "$LOG_FILE")   # baseline so pre-existing errors don't alert
while true; do
sleep "$INTERVAL"
total=$(grep -c "ERROR" "$LOG_FILE")
cnt=$((total - prev))                 # errors added since the last check
prev=$total
if [ "$cnt" -ge "$THRESHOLD" ]; then
curl -X POST -H "Content-Type: application/json" \
-d "{\"text\":\"[ALERT] $cnt errors in last minute\"}" \
https://your-webhook-url
fi
done
Conclusion
The three GNU tools provide a lightweight, high‑performance toolbox for SREs facing massive log files. Understanding their streaming nature, regex engines and how to combine them in pipelines enables fault isolation, analytics and automation without heavyweight platforms.
Ops Community
