Impala at NetEase: Architecture, Iceberg Integration, Management System, Optimizations and Future Roadmap
This talk presents NetEase's practical experience with Impala, covering its core architecture, new features in version 3.x, integration with Apache Iceberg, a custom management platform, profiling and statistics enhancements, as well as future plans involving Kubernetes, Alluxio caching and pre‑computation strategies.
Impala is an interactive SQL query engine originally contributed by Cloudera to Apache, providing high‑concurrency, low‑latency analytics without storing data itself. NetEase uses Impala as a front‑end for querying data stored in HDFS, Kudu, HBase and other systems, and integrates it with internal BI tools.
Impala Positioning and Usage – Impala follows a decentralized MPP architecture with two node roles: Coordinators (SQL parsing, planning) and Executors (data scanning, aggregation). Each Impala daemon (Impalad) contains a Java‑based Frontend for parsing and a C++‑based Backend for execution.
Core Services – The cluster consists of three main services: the Impalad process (Coordinator + Executor), the catalogd metadata service (caches Hive Metastore information), and the statestored publish‑subscribe service for synchronizing metadata and resource queues.
Impala 3.x New Features – Versions 3.0‑3.4 add support for multiple DISTINCT operators, graceful shutdown, ORC file format, DATE type, caching remote files, CBO enhancements, and JSON‑exportable query profiles.
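Support for multiple DISTINCT operators means a single query such as `SELECT COUNT(DISTINCT user), COUNT(DISTINCT page) FROM hits` is now legal, where older Impala versions allowed only one DISTINCT expression per query block. A minimal sketch (not Impala internals) of what such a query requires of the engine, i.e. maintaining one distinct set per aggregate over a single scan:

```python
def multi_distinct_counts(rows, columns):
    """Count distinct values per column over one pass through the rows."""
    seen = {col: set() for col in columns}   # one distinct set per aggregate
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is not None:            # SQL aggregates ignore NULLs
                seen[col].add(value)
    return {col: len(vals) for col, vals in seen.items()}

# Illustrative data: equivalent to
#   SELECT COUNT(DISTINCT user), COUNT(DISTINCT page) FROM hits
rows = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/home"},
    {"user": "a", "page": "/docs"},
]
print(multi_distinct_counts(rows, ["user", "page"]))  # {'user': 2, 'page': 2}
```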
Internal Enhancements at NetEase – Development includes Iceberg table support (catalog types, partition specifications, CREATE/INSERT syntax), a dedicated Impala management system that persists query information to MySQL, profile parsing for detailed node metrics, and a compute‑stats module for automatic statistics collection.
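The persistence step can be pictured as follows: pull a few summary fields out of a JSON-exported query profile and turn them into a row for a MySQL history table. This is a hedged sketch only; the field names (`query_id`, `bytes_read`, etc.) and the `query_history` table schema are illustrative, not Impala's actual profile layout or NetEase's schema.

```python
import json

def profile_to_row(profile_json):
    """Extract summary fields from a JSON query profile (illustrative keys)."""
    p = json.loads(profile_json)
    return (p["query_id"], p["user"], p["duration_ms"], p["bytes_read"])

# Hypothetical history-table schema for the management system's MySQL store.
INSERT_SQL = (
    "INSERT INTO query_history (query_id, user, duration_ms, bytes_read) "
    "VALUES (%s, %s, %s, %s)"
)

profile = json.dumps({
    "query_id": "q001", "user": "bi_etl",
    "duration_ms": 5230, "bytes_read": 1 << 30,
})
row = profile_to_row(profile)
# with a real MySQL connection: cursor.execute(INSERT_SQL, row)
print(row)  # ('q001', 'bi_etl', 5230, 1073741824)
```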
Iceberg Integration – Impala can create and query Iceberg tables, defining partitions through Iceberg's partition spec; predicates are pushed down to the Iceberg APIs, enabling efficient file-list pruning while reusing Impala's existing scan code.
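The file-pruning win can be sketched in a few lines: given Iceberg-style per-file partition metadata, an equality predicate on a partition column lets the planner discard whole files before any scan starts. The file paths and metadata layout here are illustrative; real pruning goes through the Iceberg planning APIs.

```python
def prune_files(files, column, value):
    """Keep only data files whose partition value satisfies the predicate."""
    return [f["path"] for f in files if f["partition"].get(column) == value]

# Illustrative Iceberg-style file list with per-file partition tuples.
files = [
    {"path": "dt=2023-01-01/a.parq", "partition": {"dt": "2023-01-01"}},
    {"path": "dt=2023-01-02/b.parq", "partition": {"dt": "2023-01-02"}},
    {"path": "dt=2023-01-02/c.parq", "partition": {"dt": "2023-01-02"}},
]
# A predicate like WHERE dt = '2023-01-02' now scans 2 of 3 files.
print(prune_files(files, "dt", "2023-01-02"))
# ['dt=2023-01-02/b.parq', 'dt=2023-01-02/c.parq']
```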
Management System Features – The system provides modules for profile analysis, statistics computation, and resource‑queue configuration, allowing operators to monitor top‑N queries by bytes read, file‑open latency, and join performance, as well as to adjust queue settings via a web UI.
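The top-N view amounts to ranking the persisted query records by a chosen metric. A small sketch using a heap, with illustrative query records (the record fields are assumptions, not the management system's actual schema):

```python
import heapq

def top_n_by(records, key, n=3):
    """Return the n records with the largest value for the given metric."""
    return heapq.nlargest(n, records, key=lambda r: r[key])

records = [
    {"query_id": "q1", "bytes_read": 10},
    {"query_id": "q2", "bytes_read": 500},
    {"query_id": "q3", "bytes_read": 120},
    {"query_id": "q4", "bytes_read": 40},
]
# Heaviest scans first, for the operator dashboard.
print([r["query_id"] for r in top_n_by(records, "bytes_read", n=2)])
# ['q2', 'q3']
```

The same helper works for other metrics the system tracks, such as file-open latency, by changing the `key` argument.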
Future Roadmap – Planned improvements include Kubernetes‑based deployment and dynamic scaling, Alluxio caching with partition‑level file caching, pre‑computation via materialized views, SQL routing to optimized intermediate tables, and further extensions to the resource‑queue and statistics modules.
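The SQL-routing idea can be sketched as a lookup: if an incoming query matches a registered materialized-view definition, rewrite it to read the smaller pre-computed table instead. A production router would match on query plans or signatures; normalized-text matching below is the simplest possible stand-in, and the view name `mv_hits_daily` is hypothetical.

```python
def normalize(sql):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(sql.lower().split())

# Registered materialized views: source-query text -> precomputed table.
MV_ROUTES = {
    normalize("SELECT dt, COUNT(*) FROM hits GROUP BY dt"): "mv_hits_daily",
}

def route(sql):
    """Rewrite a query to its materialized view if one is registered."""
    mv = MV_ROUTES.get(normalize(sql))
    return f"SELECT * FROM {mv}" if mv else sql

print(route("select dt, count(*)   from hits group by dt"))
# SELECT * FROM mv_hits_daily
print(route("select * from other_table"))  # unchanged: no matching view
```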
The presentation concludes with an invitation to the DataFunTalk community for further discussion and collaboration.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, with the aim of empowering a million data scientists. The community regularly hosts live technical talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.