How to Build a Cost‑Effective, High‑Throughput Log Collection System with ClickHouse
This article examines the challenges of scaling log storage and retrieval for high‑traffic services and analyzes the cost and performance limits of traditional ELK‑based pipelines. It then presents a streamlined, UDP‑driven architecture built on ClickHouse that sharply reduces hardware expenses while sustaining multi‑gigabyte‑per‑second log volumes.
Background
In daily work we need to store logs such as request parameters, info and error messages, which provide the basis for troubleshooting system issues.
Log storage and retrieval is a common task; many frameworks exist. For a few machines we can log to local files with Log4j. When log volume grows we use the ELK stack, and for even larger volumes we introduce a message queue like Kafka or use Filebeat to ship logs.
The typical log pipeline consists of temporary storage, transport, persistence, and fast query.
Scale and Cost Explosion
A single module of the JD App receives 2‑5 × 10⁴ requests per second (up to a million at peak), and each request generates 40 KB‑2 MB of log data (median ~60 KB). At 30,000 requests per second this comes to about 1.8 GB of log data per second, or roughly 14.4 Gbit/s of sustained throughput for a single module.
Storing raw logs on local disks, shipping them to an MQ cluster, and persisting them consumes massive disk I/O, network bandwidth, and CPU, often requiring thousands of servers.
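The arithmetic behind these figures can be checked with a short calculation. This is a back‑of‑the‑envelope sketch that assumes decimal units (1 KB = 1,000 bytes) and uses the median log size for every request; the function and constant names are illustrative.

```python
# Back-of-the-envelope log volume for one module, using the article's figures.
REQUESTS_PER_SECOND = 30_000       # within the 2-5 x 10^4 req/s range
MEDIAN_LOG_BYTES = 60 * 1000       # ~60 KB per request, decimal units assumed


def log_volume(requests_per_second: int, bytes_per_request: int) -> dict:
    """Convert a request rate and per-request log size into volume figures."""
    bytes_per_second = requests_per_second * bytes_per_request
    return {
        "gigabytes_per_second": bytes_per_second / 1e9,
        "gigabits_per_second": bytes_per_second * 8 / 1e9,
        "terabytes_per_day": bytes_per_second * 86_400 / 1e12,
    }


volume = log_volume(REQUESTS_PER_SECOND, MEDIAN_LOG_BYTES)
print(volume)
```

Under these assumptions the module produces about 1.8 GB/s, which is about 14.4 Gbit/s on the wire and on the order of 155 TB per day before any compression.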
Shortening the Process and Reducing Traffic
Applying Occam’s razor, we remove unnecessary steps: instead of writing to local disks and MQ, logs are kept in memory and sent directly via UDP to workers.
Compression (Snappy, ZSTD) reduces payload size by roughly 80‑90 %, turning a 60 KB message into about 6‑8 KB. This cuts bandwidth sharply and improves worker throughput. If the compressed payload still exceeds UDP's limit of about 64 KB per datagram, HTTP can be used instead.
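The compress‑then‑choose‑transport step can be sketched as follows. This is a minimal illustration, not the article's implementation: it uses `zlib` and JSON from the Python standard library as stand‑ins for Snappy/ZSTD and the real serialization format, and the function names and sample records are hypothetical.

```python
import json
import zlib

# Largest UDP payload over IPv4: 65,535 minus the 20-byte IP and 8-byte UDP headers.
MAX_UDP_PAYLOAD = 65_507


def pack(records: list) -> bytes:
    """Serialize and compress a batch of log records (JSON + zlib as stand-ins)."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)


def choose_transport(payload: bytes) -> str:
    """Return 'udp' when the payload fits in one datagram, otherwise 'http'."""
    return "udp" if len(payload) <= MAX_UDP_PAYLOAD else "http"


# Hypothetical sample batch: repetitive log lines compress well.
sample = [
    {"trace_id": "t-1", "level": "INFO", "msg": f"order lookup step {i % 5} ok"}
    for i in range(1000)
]
raw_size = len(json.dumps(sample, separators=(",", ":")).encode("utf-8"))
packed = pack(sample)
print(raw_size, len(packed), choose_transport(packed))
```

Real compression ratios depend on the codec and on how repetitive the log content is, so the sizes printed here should not be read as a benchmark.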
A Stronger Log Collection System
The new architecture consists of:
Configuration Center: stores worker IPs for clients.
Client: fetches worker IPs, aggregates logs, compresses them, and sends them via UDP (or HTTP if needed).
Worker: receives logs, parses them, and writes to ClickHouse.
ClickHouse: a high‑performance column‑oriented DB with excellent compression and write speed.
Dashboard: visual query interface built on ClickHouse.
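The client‑to‑worker hop in this architecture can be illustrated with a loopback sketch. It is an assumption‑laden toy: the worker address comes from a locally bound socket rather than a configuration center, `zlib` stands in for the real codec, and all function names are hypothetical.

```python
import socket
import zlib


def start_worker_socket(host: str = "127.0.0.1") -> socket.socket:
    """Worker side: bind a UDP socket on a free port chosen by the OS."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    sock.settimeout(2.0)  # avoid blocking forever if the datagram is lost
    return sock


def send_log(worker_addr: tuple, message: bytes) -> int:
    """Client side: compress one message and send it as a single datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
        return client.sendto(zlib.compress(message), worker_addr)


def receive_log(worker: socket.socket) -> bytes:
    """Worker side: read one datagram and decompress it."""
    datagram, _sender = worker.recvfrom(65_535)
    return zlib.decompress(datagram)


worker = start_worker_socket()
# In the article the client learns this address from the configuration center.
worker_addr = worker.getsockname()
send_log(worker_addr, b"GET /order/42 status=200 cost_ms=13")
received = receive_log(worker)
worker.close()
print(received)
```

UDP offers no delivery guarantee, so a design like this accepts that a small fraction of log packets may be dropped in exchange for avoiding connection and acknowledgement overhead.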
Client‑Side Log Aggregation
The client captures request parameters (via HTTP filter or RPC interceptor) and key log entries (info, error) using custom appenders for Log4j/Logback/Log4j2. Logs are kept in memory, serialized with Protobuf, compressed, and sent to workers via UDP.
If the compressed packet exceeds the UDP datagram limit, the client falls back to HTTP. TransmittableThreadLocal, a thread‑local variant that carries values into pooled threads, preserves trace IDs across thread pools.
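The trace‑ID problem is that a value bound to the request thread is not visible in a pooled worker thread unless it is explicitly carried over. The sketch below shows the same idea in Python using `contextvars` as an analog; it is not TransmittableThreadLocal itself, and `submit_with_context` and `log_line` are hypothetical helpers.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

trace_id = contextvars.ContextVar("trace_id", default="no-trace")


def log_line(message: str) -> str:
    """Prefix a message with whatever trace ID is visible in the current context."""
    return f"[{trace_id.get()}] {message}"


def submit_with_context(pool: ThreadPoolExecutor, fn, *args):
    """Capture the caller's context at submit time and run fn inside that copy."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


with ThreadPoolExecutor(max_workers=1) as pool:
    trace_id.set("trace-abc")  # set on the request thread
    # Plain submit: the pooled thread usually does not see the caller's value.
    plain = pool.submit(log_line, "query inventory").result()
    # Context-carrying submit: the trace ID travels with the task.
    carried = submit_with_context(pool, log_line, "query inventory").result()

print(plain)
print(carried)
```

Capturing the context at submit time, rather than at thread creation, is what matters for pools, because pooled threads are created once and reused across many unrelated requests.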
Worker‑Side Consumption and Ingestion
Workers must buffer massive incoming logs, parse them, and write to ClickHouse. They use large‑memory containers (8 CPU / 32 GB) and a double‑queue system: one queue for raw data, another for parsed rows ready for batch insertion.
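A double‑queue worker of this kind can be sketched with two bounded queues and two threads. This is a simplified illustration under stated assumptions: the log format, the `parse_loop` and `insert_loop` names, and the in‑memory `inserted_batches` list (a stand‑in for a ClickHouse batch insert) are all hypothetical.

```python
import queue
import threading

raw_queue = queue.Queue(maxsize=10_000)   # raw bytes as received from clients
row_queue = queue.Queue(maxsize=10_000)   # parsed rows waiting for batch insert
inserted_batches = []                     # stand-in for ClickHouse batch inserts


def parse_loop() -> None:
    """Drain raw payloads, parse them into rows, and hand them to the row queue."""
    while True:
        raw = raw_queue.get()
        if raw is None:                   # sentinel: propagate shutdown downstream
            row_queue.put(None)
            return
        level, _, message = raw.decode("utf-8").partition(" ")
        row_queue.put({"level": level, "message": message})


def insert_loop(batch_size: int) -> None:
    """Collect parsed rows into batches and 'insert' each full batch."""
    batch = []
    while True:
        row = row_queue.get()
        if row is None:
            break
        batch.append(row)
        if len(batch) >= batch_size:
            inserted_batches.append(batch)
            batch = []
    if batch:                             # flush the final partial batch
        inserted_batches.append(batch)


threads = [
    threading.Thread(target=parse_loop),
    threading.Thread(target=insert_loop, args=(3,)),
]
for t in threads:
    t.start()
for i in range(7):
    raw_queue.put(f"INFO request {i} handled".encode("utf-8"))
raw_queue.put(None)
for t in threads:
    t.join()
print([len(b) for b in inserted_batches])  # prints [3, 3, 1]
```

Separating the queues lets receiving and parsing proceed at their own pace while inserts stay large. A production worker would likely also flush on a time interval so that rows do not wait indefinitely for a batch to fill.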
Benchmarks show that a single worker container can process 1‑5 × 10⁷ raw log lines per second (about 2 × 10⁴ client requests per second) and write more than 200 MB/s of compressed data to ClickHouse, equivalent to more than 1 GB/s of original data.
ClickHouse Power
ClickHouse provides high‑availability clustering (domain → CHProxy → CH nodes) and vectorized execution with SIMD, delivering multi‑dimensional, real‑time analytics on massive datasets.
Multi‑Condition Query Console
The dashboard offers SQL‑based queries over tables partitioned by day. Combined with PREWHERE filtering and carefully chosen index fields (e.g., timestamp, user ID), this achieves sub‑second response times on billions of rows.
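A query of this shape might be assembled as in the sketch below. The table name, columns, and sort key are hypothetical and not taken from the article; the SQL uses ClickHouse's `{name:Type}` query‑parameter placeholders, and the Python only builds the statement and its parameters without connecting to a server.

```python
from datetime import date

# Hypothetical schema: day-level partitions, sorted by user ID then timestamp.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS app_logs (
    ts       DateTime,
    user_id  String,
    trace_id String,
    level    LowCardinality(String),
    message  String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(ts)
ORDER BY (user_id, ts)
"""


def build_query(day: date, user_id: str, keyword: str, limit: int = 100):
    """Return (sql, params) for a one-day, one-user keyword search."""
    sql = (
        "SELECT ts, trace_id, level, message "
        "FROM app_logs "
        # PREWHERE reads only the cheap filter columns first, then fetches
        # the remaining columns for the rows that pass.
        "PREWHERE toYYYYMMDD(ts) = {day:UInt32} AND user_id = {user_id:String} "
        "WHERE positionCaseInsensitive(message, {keyword:String}) > 0 "
        "ORDER BY ts DESC "
        "LIMIT " + str(int(limit))
    )
    params = {
        "day": int(day.strftime("%Y%m%d")),
        "user_id": user_id,
        "keyword": keyword,
    }
    return sql, params


sql, params = build_query(date(2024, 1, 15), "u-1001", "timeout")
print(sql)
print(params)
```

Restricting every query to a day and a user lets the server skip whole partitions and most of the sorted data before it touches the large `message` column, which is the main reason such a console can stay responsive.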
Summary and Comparison
The traditional pipeline duplicates data across local disks, the MQ, and the database, consuming roughly three times the raw log volume in storage. The new UDP‑ClickHouse pipeline stores data only in ClickHouse, whose compression brings the on‑disk size to about 80 % of the raw size. That is about 0.8× raw against 3× raw, a storage saving of roughly 73 % by these figures.
Hardware savings exceed 70 % because fewer servers are needed for both I/O and CPU. Client‑side CPU cost drops to a single serialization step, while workers achieve more than 10× the consumption throughput of MQ‑based solutions.
Overall, the design delivers a cost‑effective, high‑throughput log collection system suitable for billion‑scale traffic.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
JD Cloud Developers
JD Cloud Developers (Developer of JD Technology) is a JD Technology Group platform offering technical sharing and communication for AI, cloud computing, IoT and related developers. It publishes JD product technical information, industry content, and tech event news. Embrace technology and partner with developers to envision the future.
