How to Build a Cost‑Effective High‑Throughput Log Collection System with ClickHouse and UDP

This article analyzes the challenges of massive log storage and retrieval, calculates the bandwidth and hardware costs of traditional pipelines, and presents a streamlined architecture that uses in‑memory buffering, UDP transport, compression, and ClickHouse to achieve petabyte‑scale throughput while cutting storage costs by over 75%.

dbaplus Community

Background

In everyday services we store logs such as request parameters and info and error messages to aid troubleshooting. Traditional approaches start with local files (Log4j), evolve to ELK stacks, then add message queues such as Kafka, and finally move to Filebeat-based ingestion. These solutions work for modest traffic but become costly at scale.

Cost Explosion at Large Scale

Using a JD.com App module as an example, a single request generates 40 KB to 2 MB of log data (median ~60 KB). At 30,000 requests per second, the median alone yields about 1.8 GB/s of raw logs (60 KB × 30,000), and peak traffic can exceed 15 GB/s. Storing the raw files, writing to Kafka (which also persists to disk), and replicating the data would require thousands of servers, making the solution financially untenable.
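The arithmetic behind the 1.8 GB/s figure can be checked with a short sketch. It uses decimal units to match the article's rounding. The per-day total is derived here from the same inputs and is not a number quoted in the article.

```python
# Back-of-the-envelope log volume, using the figures quoted above.
KB = 1_000              # decimal units, matching the article's rounding
GB = 1_000_000_000
TB = 1_000_000_000_000

median_log_bytes = 60 * KB       # median log size per request
requests_per_second = 30_000     # steady-state request rate

raw_bytes_per_second = median_log_bytes * requests_per_second
raw_gb_per_second = raw_bytes_per_second / GB

# Derived for illustration only: the same rate sustained for 24 hours.
raw_tb_per_day = raw_bytes_per_second * 86_400 / TB

print(f"{raw_gb_per_second:.1f} GB/s before compression")
print(f"{raw_tb_per_day:.2f} TB/day if that rate were sustained")
```

The result covers only a single copy of the raw logs, before any replication or Kafka persistence.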

Shortening the Pipeline

Applying Occam's razor ("remove unnecessary entities"), the design discards local disk writes and Kafka. Logs are kept in memory, compressed with Snappy or ZSTD, and sent directly to a worker cluster via UDP, or via HTTP if the compressed payload exceeds the roughly 64 KB limit of a UDP datagram.
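A minimal sketch of the compress-then-choose-transport step follows. It is not the SDK's actual code. It assumes `zlib` from the Python standard library as a stand-in for Snappy or ZSTD, and the function names and the worker address argument are hypothetical.

```python
import socket
import zlib

# Largest UDP payload over IPv4: 65,535 bytes minus the 20-byte IP header
# and the 8-byte UDP header.
MAX_UDP_PAYLOAD = 65_507


def choose_transport(payload: bytes) -> str:
    """Return "udp" when the payload fits in one datagram, else "http"."""
    return "udp" if len(payload) <= MAX_UDP_PAYLOAD else "http"


def pack_log(message: str) -> tuple[str, bytes]:
    """Compress an in-memory log message and pick a transport for it.

    zlib is an assumption made for this sketch; the article names Snappy or ZSTD.
    """
    payload = zlib.compress(message.encode("utf-8"))
    return choose_transport(payload), payload


def send_udp(payload: bytes, worker_addr: tuple[str, int]) -> None:
    """Fire-and-forget send to one worker (hypothetical address)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, worker_addr)
```

Because the size check happens after compression, a repetitive log message much larger than 64 KB can still travel as a single UDP datagram.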

Robust Log Collection System

The new architecture consists of four components:

Configuration Center: Stores worker IPs for clients to discover.

Client: Pulls worker addresses, compresses logs, and streams them over UDP.

Worker: Receives UDP packets, parses them, and batches inserts into ClickHouse.

ClickHouse: A column-oriented OLAP database with high compression and write performance, partitioned by day for fast queries.

Workers are the performance bottleneck; they use large‑memory containers (8 CPU / 32 GB) and a double‑buffer queue to absorb bursts before writing to ClickHouse.
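The double-buffer idea can be sketched as follows: receivers append to one buffer while a flusher swaps the buffers and drains the other one in a single batch. This is an illustration under assumptions, not the workers' actual implementation. The class name and the `flush` callback, which stands in for a batched ClickHouse insert, are hypothetical.

```python
import threading


class DoubleBuffer:
    """Two buffers: producers fill the front one, the flusher drains the back one."""

    def __init__(self, flush):
        self._front = []   # receives new rows
        self._back = []    # drained by the flusher; empty between flushes
        self._swap_lock = threading.Lock()    # guards appends and the swap
        self._flush_lock = threading.Lock()   # allows one flusher at a time
        self._flush = flush  # hypothetical callback, e.g. a batched insert

    def put(self, row):
        with self._swap_lock:
            self._front.append(row)

    def swap_and_flush(self) -> int:
        """Swap buffers, hand the filled one to flush, and return its row count."""
        with self._flush_lock:
            with self._swap_lock:
                self._front, self._back = self._back, self._front
            count = len(self._back)
            if count:
                # The callback must not keep a reference: the list is reused.
                self._flush(self._back)
                self._back.clear()
            return count
```

Producers only hold the lock for an append or a swap, so a slow insert does not block incoming packets. A timer thread or a size threshold would typically call `swap_and_flush`.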

Client‑Side Log Aggregation

The client SDK provides filters for HTTP and RPC frameworks to capture request/response payloads, and custom appenders for Log4j/Logback/Log4j2 that buffer logs in memory and forward them via UDP. Large messages that still exceed UDP limits are sent over HTTP. Thread‑local storage (TransmittableThreadLocal) preserves trace IDs across thread pools.
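The SDK described here is Java-based. As a rough Python analogue only, the sketch below pairs a logging handler that buffers lines in memory with a context variable that carries a trace ID into a thread pool. Here `contextvars` plays the role that TransmittableThreadLocal plays in the article. All names are hypothetical, and the `ship` callback stands in for the compress-and-send step.

```python
import contextvars
import logging
from concurrent.futures import ThreadPoolExecutor

# Trace ID for the current request; "-" when none has been set.
trace_id = contextvars.ContextVar("trace_id", default="-")


class InMemoryShippingHandler(logging.Handler):
    """Buffers formatted lines in memory and ships them in batches."""

    def __init__(self, capacity, ship):
        super().__init__()
        self.capacity = capacity
        self.ship = ship       # hypothetical callback taking a list of lines
        self.buffer = []

    def emit(self, record):
        # Handler.handle() already holds the handler lock while emit() runs.
        self.buffer.append(
            f"{trace_id.get()} {record.levelname} {record.getMessage()}"
        )
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        self.acquire()
        try:
            if self.buffer:
                batch, self.buffer = self.buffer, []
                self.ship(batch)
        finally:
            self.release()


def submit_with_context(pool, fn, *args):
    """Run fn in the pool with a copy of the caller's context (trace ID included)."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)
```

A caller would set `trace_id` at the start of a request and use `submit_with_context` instead of `pool.submit`, so that log lines written on pool threads carry the same ID.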

Worker‑Side Consumption and Ingestion

Workers can process 10‑50 million raw log rows per second, translating to ~2 × 10⁴ client QPS. ClickHouse ingestion stabilises at 160‑200 MB/s per worker, meaning a few hundred workers can handle hundreds of gigabytes of raw logs per second. All data remains compressed until a user query triggers decompression.
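A rough capacity estimate can be sketched from the per-worker ingestion rate. The article does not say whether the 160 to 200 MB/s figure refers to raw or compressed bytes, and it gives no compression ratio. Both are therefore explicit parameters here, and the function name is hypothetical.

```python
import math


def workers_needed(raw_bytes_per_s, per_worker_bytes_per_s, compression_ratio=1.0):
    """Workers required to absorb a given raw log rate.

    compression_ratio is an assumption of this sketch (raw size / wire size);
    1.0 means the per-worker rate is treated as applying to raw bytes.
    """
    wire_bytes_per_s = raw_bytes_per_s / compression_ratio
    return math.ceil(wire_bytes_per_s / per_worker_bytes_per_s)


MB = 1_000_000
GB = 1_000_000_000

# Peak rate from the cost section (15 GB/s) against both ends of the
# per-worker range quoted above.
low_end = workers_needed(15 * GB, 160 * MB)
high_end = workers_needed(15 * GB, 200 * MB)
print(low_end, high_end)
```

Raising `compression_ratio` lowers the count proportionally, which is one way to read the article's claim that a few hundred workers cover hundreds of gigabytes of raw logs per second.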

ClickHouse Advantages

ClickHouse combines vectorised execution, SIMD optimisations, and columnar storage. Writing to local tables instead of distributed tables yields 2-3× higher write throughput. The cluster employs a three-layer architecture (Domain → CHProxy → CH nodes) with automatic fail-over, ensuring high availability.

Multi‑Condition Query Console

The UI provides simple SQL‑based queries, leveraging ClickHouse features such as PREWHERE and proper sharding to achieve sub‑second response times on billions of rows. Indexes on time and user identifiers further accelerate look‑ups.
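A query of the kind the console might issue can be sketched as a small builder. The table and column names are hypothetical, since the article does not give the schema. The `{name:Type}` placeholders use ClickHouse's server-side query parameter syntax. The selective date and user filters go in PREWHERE so that ClickHouse reads those columns first and loads the other columns only for matching rows.

```python
def build_log_query(table: str = "logs_local") -> str:
    """Build a parameterised lookup; the table and columns are hypothetical."""
    return (
        "SELECT log_time, trace_id, message\n"
        f"FROM {table}\n"
        # Cheap, selective filters first: day partition and user identifier.
        "PREWHERE log_date = {day:Date} AND user_id = {uid:String}\n"
        # The more expensive text match runs only on rows that survive PREWHERE.
        "WHERE message LIKE {pattern:String}\n"
        "ORDER BY log_time DESC\n"
        "LIMIT 100"
    )


print(build_log_query())
```

The values for `day`, `uid`, and `pattern` would be supplied separately by the client when the query is sent, so user input is never spliced into the SQL text.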

Summary & Comparison

Compared with the traditional pipeline (disk + Kafka + DB), the new design reduces disk usage to ~0.8 × ClickHouse’s footprint (after compression) and cuts overall storage cost by >75 %. CPU consumption also drops because the client only performs a single protobuf serialization, and workers avoid double‑disk writes. The result is a scalable, low‑cost log collection system capable of handling petabyte‑scale daily traffic.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
