Design and Implementation of a High‑Performance, Scalable MySQL Log Analysis System
This document describes the background, problems, requirements, analysis, and detailed design of a high‑performance, highly available, and scalable log‑analysis pipeline for MySQL that leverages Kafka, ClickHouse, MySQL, and custom services to aggregate, enrich, and visualize massive query logs.
A sniffer captures the SQL statements executed against MySQL and writes the raw logs to ClickHouse, enabling analyses such as which applications access which tables, reclamation of unused resources, and audit support.
Key terms are defined: AppCode (service name), ClickHouse (OLAP columnar DB), Kafka (distributed queue), Zookeeper (coordination service), scalability, and aggregation ratio.
Current shortcomings include insufficient ClickHouse ingestion capacity (roughly 80% of log data is dropped), inaccurate service attribution because IP‑to‑AppCode mappings change frequently, limited aggregation dimensions, and the absence of monitoring and high‑availability mechanisms.
The requirements call for high performance, high availability, high scalability, sub‑minute latency for propagating IP‑to‑AppCode mapping changes, richer aggregation dimensions with drill‑down capability, and comprehensive monitoring.
Two architectural options were considered; the chosen solution moves enrichment and aggregation to the server side (SnifferServer) to avoid adding load to the sniffer.
The overall design consists of three main services. dubaiMeta provides the IP‑to‑AppCode mapping as a stateless service backed by MySQL, with a local cache kept consistent through change notifications. SnifferServer is split into consumer, aggregator, and writer modules: the consumer reads from the Kafka‑Sniffer topic, the aggregator enriches and aggregates the data, and the writer batch‑inserts the results into ClickHouse. SnifferAnalyze periodically reads the aggregated data from ClickHouse, performs further analysis, and stores raw results in ClickHouse and summary results in MySQL.
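To make the SnifferServer split concrete, the sketch below models the consumer → aggregator → writer hand‑off as a staged pipeline connected by bounded in‑memory queues. It is illustrative only: the event and row shapes, queue capacities, and class names are assumptions, since the article does not publish the actual implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of the consumer -> aggregator -> writer staging inside SnifferServer.
// Field layouts and queue sizes are assumptions, not the real implementation.
public class SnifferServerPipeline {
    record SnifferEvent(String sourceIp, String tableName) {}
    record AggregatedRow(String sourceIp, String tableName, long count) {}

    private final BlockingQueue<SnifferEvent> rawQueue = new ArrayBlockingQueue<>(100_000);
    private final BlockingQueue<List<AggregatedRow>> batchQueue = new ArrayBlockingQueue<>(1_000);

    // Consumer module: in the real system this would be fed by a Kafka poll loop.
    void consume(SnifferEvent event) throws InterruptedException {
        rawQueue.put(event);
    }

    // Aggregator module: drains raw events, rolls them up, and hands a batch to the writer.
    void aggregateOnce(int maxEvents) throws InterruptedException {
        List<SnifferEvent> drained = new ArrayList<>();
        rawQueue.drainTo(drained, maxEvents);

        Map<String, Long> counts = new HashMap<>();
        for (SnifferEvent e : drained) {
            counts.merge(e.sourceIp() + "|" + e.tableName(), 1L, Long::sum);
        }

        List<AggregatedRow> batch = new ArrayList<>();
        counts.forEach((key, count) -> {
            String[] parts = key.split("\\|", 2);
            batch.add(new AggregatedRow(parts[0], parts[1], count));
        });
        if (!batch.isEmpty()) {
            batchQueue.put(batch);
        }
    }

    // Writer module: in the real system this would batch-insert the rows into ClickHouse.
    List<AggregatedRow> takeBatch() throws InterruptedException {
        return batchQueue.take();
    }
}
```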
dubaiMeta stores the latest mapping in MySQL, caches it locally, and propagates changes via a notification mechanism to keep all nodes consistent while remaining effectively stateless.
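A minimal sketch of how such a node could keep its local cache consistent with MySQL follows, assuming a simple ip_appcode table and a per‑IP change notification; the real schema and notification transport are not described in the article.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative dubaiMeta-style node: MySQL stays the source of truth, the mapping is
// mirrored into a local in-memory cache, and single entries are refreshed on notification.
// Table and column names (ip_appcode, ip, app_code) are assumptions for the example.
public class IpAppCodeCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final String jdbcUrl;

    public IpAppCodeCache(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    // Full reload, e.g. at startup; the node stays effectively stateless.
    public void reloadAll() throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement("SELECT ip, app_code FROM ip_appcode");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                cache.put(rs.getString("ip"), rs.getString("app_code"));
            }
        }
    }

    // Called when a change notification for a single IP arrives.
    public void onChangeNotification(String ip) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT app_code FROM ip_appcode WHERE ip = ?")) {
            stmt.setString(1, ip);
            try (ResultSet rs = stmt.executeQuery()) {
                if (rs.next()) {
                    cache.put(ip, rs.getString("app_code"));
                } else {
                    cache.remove(ip); // mapping was deleted upstream
                }
            }
        }
    }

    public String lookup(String ip) {
        return cache.get(ip);
    }
}
```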
SnifferServer performance tests show that a single consumer instance processes more than 200k events/s and two instances reach roughly 330k events/s, scaling horizontally to meet the input rate of about 1.33 million (133万) events/s; aggregation ratios fall in the 10–30% range, and writer batches of at least 10k rows keep ClickHouse ingestion efficient.
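The writer's batching behavior can be illustrated with plain JDBC against the ClickHouse driver, accumulating rows and flushing in batches of at least 10k; the table, column, and host names here are placeholders, not the system's real schema.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch of the writer module's batched insert into ClickHouse. The point is that rows
// are accumulated and flushed in batches of >= 10k via addBatch/executeBatch rather than
// inserted one at a time. Names (sniffer_log_agg, ch-host, ...) are illustrative only.
public class ClickHouseBatchWriter {
    private static final int BATCH_SIZE = 10_000;

    public void write(List<Object[]> rows) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:clickhouse://ch-host:8123/sniffer");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO sniffer_log_agg (app_code, table_name, cnt) VALUES (?, ?, ?)")) {
            int pending = 0;
            for (Object[] row : rows) {
                stmt.setString(1, (String) row[0]);
                stmt.setString(2, (String) row[1]);
                stmt.setLong(3, (Long) row[2]);
                stmt.addBatch();
                if (++pending >= BATCH_SIZE) {
                    stmt.executeBatch(); // flush a full batch
                    pending = 0;
                }
            }
            if (pending > 0) {
                stmt.executeBatch(); // flush the remainder
            }
        }
    }
}
```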
Modules start in the order writer → aggregator → consumer and shut down in the reverse order (consumer → aggregator → writer), so in‑flight data is drained and nothing is lost. Each module's thread count and cache size are tunable to trade performance against resource usage.
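A sketch of this ordered start/stop sequence, assuming each module exposes start and stop‑and‑drain hooks (the interface and method names are illustrative): the downstream stage is always running before its upstream starts, and shutdown drains each stage before stopping the next.

```java
// Illustrative lifecycle wrapper around the three SnifferServer modules.
public class SnifferServerLifecycle {
    interface Module { void start(); void stopAndDrain(); }

    private final Module writer;
    private final Module aggregator;
    private final Module consumer;

    public SnifferServerLifecycle(Module writer, Module aggregator, Module consumer) {
        this.writer = writer;
        this.aggregator = aggregator;
        this.consumer = consumer;
    }

    public void start() {
        writer.start();      // downstream first: the ClickHouse sink is ready
        aggregator.start();  // then the middle stage
        consumer.start();    // finally begin pulling from Kafka
    }

    public void stop() {
        consumer.stopAndDrain();   // stop intake first
        aggregator.stopAndDrain(); // flush remaining aggregates downstream
        writer.stopAndDrain();     // flush the last batches into ClickHouse
    }
}
```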
SnifferAnalyze acquires a distributed lock before each processing cycle so that only one node runs at a time; it writes raw analysis results to ClickHouse and summary data to MySQL. Because only a single node can be active, the component's horizontal scalability is limited.
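The article does not say which primitive the distributed lock is built on; one simple option, given that MySQL is already part of the stack, is MySQL's GET_LOCK/RELEASE_LOCK, sketched below with a hypothetical lock name.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Single-active-node processing behind a distributed lock. This sketch uses MySQL's
// GET_LOCK as one possible implementation; the lock name 'sniffer_analyze' is made up.
public class SnifferAnalyzeLock {
    public boolean runExclusively(String jdbcUrl, Runnable analysis) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            try (PreparedStatement lock = conn.prepareStatement("SELECT GET_LOCK('sniffer_analyze', 0)");
                 ResultSet rs = lock.executeQuery()) {
                rs.next();
                if (rs.getInt(1) != 1) {
                    return false; // another node holds the lock; skip this cycle
                }
            }
            try {
                analysis.run(); // read aggregates from ClickHouse, write results back
                return true;
            } finally {
                try (PreparedStatement unlock = conn.prepareStatement("SELECT RELEASE_LOCK('sniffer_analyze')")) {
                    unlock.executeQuery();
                }
            }
        }
    }
}
```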
The ClickHouse deployment uses replicated MergeTree and Distributed table engines with ZooKeeper coordination, tolerating the loss of one replica per shard and scaling horizontally by adding shards without migrating existing data.
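Such a deployment typically corresponds to a replicated local table on every shard plus a Distributed table that fans reads and writes across shards; the DDL below is a hedged example with made‑up database, cluster, and column names rather than the system's actual schema.

```java
// Illustrative ClickHouse table layout implied by the deployment described above.
// Database, cluster, table, and column names are assumptions for the example.
public final class ClickHouseTableDdl {
    public static final String LOCAL_TABLE = """
            CREATE TABLE sniffer.sql_log_local ON CLUSTER sniffer_cluster (
                event_time DateTime,
                app_code   String,
                table_name String,
                cnt        UInt64
            )
            ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/sql_log_local', '{replica}')
            PARTITION BY toYYYYMMDD(event_time)
            ORDER BY (event_time, app_code)
            """;

    public static final String DISTRIBUTED_TABLE = """
            CREATE TABLE sniffer.sql_log_all ON CLUSTER sniffer_cluster
            AS sniffer.sql_log_local
            ENGINE = Distributed(sniffer_cluster, sniffer, sql_log_local, rand())
            """;
}
```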
Monitoring currently displays node health and performance metrics and integrates with the company’s existing alerting system.
Operational results show average SnifferServer memory ~12 GB, GC pauses ≤0.6 ms, CPU utilization ~70% on 32 cores, network throughput ~330 Mb/s up, 400 Mb/s down, and end‑to‑end latency well within targets, with UI dashboards illustrating per‑business response times and slow‑query statistics.
The system achieves high performance, availability, and scalability, providing developers with actionable insights into database behavior and risk assessment, especially when combined with a slow‑query risk index.