ClickHouse Deployment in Lenovo Manufacturing: Architecture, Data Integration, and Performance Optimization
This article details Lenovo's implementation of ClickHouse in a manufacturing environment, covering the current data landscape, cluster architecture, integration challenges, and performance optimizations, including Seatunnel-based ingestion and query pre-aggregation. It illustrates how an OLAP engine can address real-time analytics and concurrency problems in production data pipelines.
Lenovo, a traditional manufacturing enterprise, faces a fragmented data environment with numerous business systems using heterogeneous databases such as MySQL, PostgreSQL, Hive, MongoDB, Oracle, and SQL Server, leading to slow T+1/T+2 reporting cycles.
The data flow moves from business systems to an integration platform, then to a data center containing various storage solutions, and finally to MySQL for front‑end consumption, resulting in latency of one to several weeks for metric delivery.
Key pain points include massive UPDATE operations on order records that cause deadlocks and extremely long SQL queries with dozens of LEFT JOINs, sometimes taking hours to complete.
The proposed solution records data as immutable events, leveraging an OLAP engine (ClickHouse) to perform direct, detailed queries without complex joins, and builds wide tables to simplify query logic.
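The append-only pattern can be sketched in ClickHouse SQL. The table and column names below are illustrative, not Lenovo's actual schema: each order change is inserted as a new event row, and the latest state is recovered at query time with `argMax` instead of updating rows in place.

```sql
-- Hypothetical order-event table: every change is appended as a new row.
CREATE TABLE order_events
(
    order_id   UInt64,
    status     LowCardinality(String),
    qty        UInt32,
    event_time DateTime
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (order_id, event_time);

-- Latest state per order, derived at read time -- no UPDATEs, no deadlocks.
SELECT
    order_id,
    argMax(status, event_time) AS latest_status,
    argMax(qty, event_time)    AS latest_qty
FROM order_events
GROUP BY order_id;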
ClickHouse is deployed in a two‑shard, two‑replica cluster managed by ZooKeeper for replication, with Nginx load balancing to distribute query traffic; data ingestion occurs via Kafka, separating write and read paths to improve concurrency.
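A two-shard, two-replica layout of this kind is typically expressed with a `ReplicatedMergeTree` local table, a `Distributed` table for fan-out queries, and a Kafka engine table feeding a materialized view. The sketch below assumes a cluster named `lenovo_cluster` and placeholder broker/topic names; only the overall wiring reflects the article.

```sql
-- Replicated local table on each of the 2x2 nodes; {shard} and {replica}
-- are filled in from each server's macros configuration.
CREATE TABLE orders_local ON CLUSTER lenovo_cluster
(
    order_id   UInt64,
    status     String,
    event_time DateTime
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/orders', '{replica}')
ORDER BY (order_id, event_time);

-- Distributed table that fans queries out across both shards.
CREATE TABLE orders_all ON CLUSTER lenovo_cluster AS orders_local
ENGINE = Distributed(lenovo_cluster, currentDatabase(), orders_local, rand());

-- Kafka engine table plus a materialized view keeps ingestion off the
-- query path (broker, topic, and consumer-group names are placeholders).
CREATE TABLE orders_queue
(
    order_id   UInt64,
    status     String,
    event_time DateTime
)
ENGINE = Kafka SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list  = 'orders',
    kafka_group_name  = 'ch_orders',
    kafka_format      = 'JSONEachRow';

CREATE MATERIALIZED VIEW orders_mv TO orders_local AS
SELECT * FROM orders_queue;
```

Reads go through `orders_all` (behind Nginx), while writes arrive via Kafka, which is how the write and read paths stay separated.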
Initial JDBC‑based ingestion became a bottleneck at billions of rows due to heavy merge operations. Introducing Seatunnel allowed ClickHouse to receive pre‑merged data files directly, bypassing the merge step and dramatically increasing write throughput, especially for bulk historical loads.
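The merge pressure that made JDBC ingestion a bottleneck is visible in ClickHouse's system tables. These queries (thresholds left to the operator) show the symptom that a file-based, pre-merged load path avoids:

```sql
-- Active part counts per table: a steadily growing number under heavy
-- row-by-row inserts signals that background merges cannot keep up.
SELECT table, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY table
ORDER BY active_parts DESC;

-- Currently running merges and their progress.
SELECT database, table, elapsed, progress
FROM system.merges;
```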
Query concurrency was further enhanced by separating read/write workloads, scaling ZooKeeper memory, and applying pre‑aggregation (Projection) techniques that reduced typical query times from seconds to milliseconds, at the cost of increased storage.
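A projection of the kind described might look like the following, assuming a hypothetical `order_events` detail table; the projection name and aggregation are illustrative. The projection stores per-day, per-status counts alongside the detail rows, trading extra storage for millisecond lookups:

```sql
-- Pre-aggregated projection maintained automatically on insert.
ALTER TABLE order_events
    ADD PROJECTION daily_status_counts
    (
        SELECT
            toDate(event_time) AS day,
            status,
            count()
        GROUP BY day, status
    );

-- Build the projection for parts that existed before it was added.
ALTER TABLE order_events MATERIALIZE PROJECTION daily_status_counts;
```

After materialization, aggregate queries that match the projection's `GROUP BY` are answered from the pre-computed data rather than the raw rows.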
The Q&A section addresses incremental synchronization via date partitions, Seatunnel's dual modes (JDBC and file‑based), replica recovery procedures, and methods to boost concurrency by adding shards or replicas.
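Incremental synchronization by date partition is commonly implemented as an idempotent drop-and-reload. The sketch below assumes a table partitioned by `toYYYYMMDD(event_time)` and a hypothetical staging table fed by Seatunnel; the date is a placeholder:

```sql
-- Re-sync one day: drop its partition, then re-insert the full day
-- from the staging area. Safe to re-run if the load fails midway.
ALTER TABLE order_events DROP PARTITION 20240101;

INSERT INTO order_events
SELECT order_id, status, qty, event_time
FROM source_staging
WHERE toYYYYMMDD(event_time) = 20240101;
```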
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.