Big Data Technology Trends and Cloud Data Warehouse Architecture Practices
The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.
In the era of data explosion, as enterprise business data volumes keep growing, semi-structured and unstructured data are increasing significantly, and traditional data warehouses face major challenges. Building new data warehouses on big data technologies such as Hadoop and Spark has become the approach a growing number of enterprises take to address these challenges.
This article features 堵俊平 (Du Junping), head of Tencent Cloud's big data infrastructure team, sharing recent technology trends in the big data field. The speaker is a big data technology expert who previously worked at EMC and VMware and led the YARN team at Hortonworks. He is an Apache Hadoop Committer and PMC member.
Big Data Technology: Development and Trends
The big data era traces back to Google's three papers: GFS (2003), MapReduce (2004), and BigTable (2006). GFS is a distributed file system and MapReduce an execution engine; both were crucial to the birth of Hadoop in 2006, which ushered in the big data era.
Key milestones include: 2008, HBase became a top-level project as the first NoSQL database built on Hadoop; 2009, AWS launched EMR, the first cloud-based big data product; 2011, Facebook's open-sourced Apache Hive brought the traditional data warehouse and SQL engine to the Hadoop ecosystem; 2012, Hadoop 2.0 with YARN became a general scheduling framework supporting multiple big data computing frameworks, including Spark; 2013, Spark joined Apache as an independent project, marking the beginning of the in-memory computing era; 2017, Hadoop 3.0 was released, marking big data's evolution toward containerization.
Big data development trends: 1) Migration to cloud to simplify DevOps; 2) Integration with AI platforms for unified data analysis and machine learning; 3) Data lakes supporting diverse data types without strict ETL requirements; 4) Convergence of batch processing and stream computing (Lambda and Kappa architectures).
Technical Progress
Ozone: Next-generation native storage for Hadoop. It overcomes HDFS limitations by restructuring metadata allocation with Storage Containers, solving the massive-small-files problem and providing an object storage interface.
Spark: Recent versions (2.2, 2.3) show significant improvements in SQL and Streaming. The Hydrogen project aims to integrate deep learning frameworks into the Spark ecosystem, enabling unified data analysis, machine learning, and deep learning through one language and API. Key features include Barrier Execution Mode for gang scheduling, Optimized Data Exchange, and Accelerator Aware Scheduling for heterogeneous platforms (CPU, GPU, FPGA).
Introduction to Data Warehouse Technology
Database vs Data Warehouse: Databases are transaction-oriented, for data updates and changes, using row storage; data warehouses are analysis-oriented, processing large data volumes, typically using column storage.
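The storage contrast above can be sketched in a few lines. This is a minimal illustration with hypothetical data, not a real storage engine: a row layout keeps each record together (good for point updates), while a column layout keeps each column contiguous so an analytical aggregate reads only the data it needs.

```python
# Hypothetical table stored two ways, to contrast the access patterns.

# Row store: each record is kept together -- efficient for point updates.
rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
    {"id": 3, "name": "c", "amount": 30.0},
]

def update_amount(rows, record_id, new_amount):
    """Transactional-style update: touch one record in place."""
    for r in rows:
        if r["id"] == record_id:
            r["amount"] = new_amount

# Column store: each column is contiguous -- an aggregate scans only
# the one column it needs instead of every full record.
columns = {
    "id": [1, 2, 3],
    "name": ["a", "b", "c"],
    "amount": [10.0, 20.0, 30.0],
}

def total_amount(columns):
    """Analytical-style aggregate: reads a single contiguous column."""
    return sum(columns["amount"])
```

In a real columnar format the per-column layout also compresses far better, which is a large part of why data warehouses favor it.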
Data warehouse modeling approaches: 1) Relational model (3NF) proposed by Bill Inmon; 2) Multi-dimensional model (star schema, snowflake schema); 3) Data Vault model emphasizing auditability, historical data, traceability, and atomicity.
Data warehouse layers: ODS (Operational Data Store), DW (Data Warehouse - includes明细数据层 and 汇总数据层), DM (Data Mart), ADS (Application Data Service).
Data integration methods: 1) Timestamp-based; 2) Snapshot (full table comparison with MD5); 3) Triggers; 4) Logs (e.g., MySQL binlog).
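The snapshot method (option 2 above) can be sketched as follows. This is a simplified illustration, assuming each table snapshot is a dict keyed by primary key; the row format and function names are hypothetical.

```python
import hashlib

def row_digest(row):
    """MD5 over a row's concatenated field values (hypothetical row format)."""
    joined = "|".join(str(v) for v in row.values())
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def diff_snapshots(old, new):
    """Compare yesterday's and today's full-table snapshots, keyed by
    primary key, and classify every change as insert/update/delete."""
    inserted = [pk for pk in new if pk not in old]
    deleted = [pk for pk in old if pk not in new]
    updated = [pk for pk in new
               if pk in old and row_digest(new[pk]) != row_digest(old[pk])]
    return inserted, updated, deleted
```

Comparing digests instead of full rows keeps the comparison cheap, but the method still requires scanning both full snapshots, which is why log-based capture (e.g. MySQL binlog) scales better for large tables.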
Query Optimizers and Execution Engines
Query optimizer processes: SQL expression → AST → Unoptimized logical plan → Optimized logical plan → Physical execution plan.
Optimization strategies: RBO (Rule-Based Optimization) using static rules; CBO (Cost-Based Optimization) requiring detailed statistics (max/min values, table size, partition info).
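To make the RBO idea concrete, here is a toy rewrite rule over a tiny logical-plan tree: predicate pushdown, which moves a filter down to the scan so the storage layer can skip non-matching rows early. The node classes and plan shape are hypothetical, a sketch rather than any real optimizer's API.

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal logical-plan nodes (hypothetical shapes).
@dataclass
class Scan:
    table: str
    predicate: Optional[str] = None  # pushed-down filter, if any

@dataclass
class Filter:
    predicate: str
    child: object

@dataclass
class Project:
    columns: list = field(default_factory=list)
    child: object = None

def push_down_predicates(plan):
    """RBO rule: fold a Filter sitting directly above a Scan into the Scan,
    recursing through the rest of the tree unchanged."""
    if isinstance(plan, Filter) and isinstance(plan.child, Scan):
        return Scan(plan.child.table, predicate=plan.predicate)
    if isinstance(plan, Filter):
        return Filter(plan.predicate, push_down_predicates(plan.child))
    if isinstance(plan, Project):
        return Project(plan.columns, push_down_predicates(plan.child))
    return plan

# Unoptimized: Project -> Filter -> Scan; optimized: Project -> Scan(filtered).
plan = Project(["name"], Filter("amount > 100", Scan("orders")))
optimized = push_down_predicates(plan)
```

A CBO goes further: rather than applying rules unconditionally, it uses the statistics listed above to cost alternative plans and pick the cheapest.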
Join algorithms: 1) Broadcast join (small table broadcast to each node, highest performance for large-small table joins); 2) Shuffle hash join (suitable for large-medium table joins); 3) Bucket join (for large-large table joins with pre-sorting).
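The broadcast join (algorithm 1 above) reduces, on each node, to a hash join where the small table has been shipped to every worker. A minimal single-process sketch of that build/probe pattern, with hypothetical data:

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Hash join as run on one worker after the small table is broadcast:
    build a hash table from the small side, then stream the large side
    through it with no shuffle of the large table."""
    built = {}
    for r in small_rows:  # build phase: small table fits in memory
        built.setdefault(r[key], []).append(r)
    for row in large_rows:  # probe phase: one pass over the large table
        for match in built.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            yield merged

orders = [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}]
users = [{"user_id": 1, "name": "a"}, {"user_id": 2, "name": "b"}]
joined = list(broadcast_hash_join(orders, users, "user_id"))
```

The performance win is that only the small table moves over the network; a shuffle hash join instead repartitions both sides by key, which is why it is reserved for cases where neither side fits in memory on every node.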
Join order optimization: Left-deep tree vs Balanced Bushy tree (generates fewer intermediate temporary tables, faster, better for star/snowflake models).
Physical execution models: 1) Volcano model (operator-based with Open/Next/Close functions, but suffers from virtual function call overhead); 2) Column-at-a-time (returns multiple columns as arrays, high query efficiency but high memory/IO overhead); 3) Vectored iterator model (returns vectors that fit in CPU cache, can combine with JIT for SIMD optimization).
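The Volcano model's per-tuple pull loop can be sketched in a few lines. This toy pipeline (Open/Close omitted for brevity, class names hypothetical) shows why the model is simple to compose but pays a function-call cost on every tuple:

```python
class ScanOp:
    """Leaf operator: yields one tuple per next() call."""
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)  # None signals end-of-stream

class FilterOp:
    """Pulls tuples from its child one at a time, passing matches up."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        while True:
            t = self.child.next()
            if t is None or self.pred(t):
                return t  # one virtual call per tuple: the model's overhead

def run(root):
    out = []
    while (t := root.next()) is not None:
        out.append(t)
    return out

result = run(FilterOp(ScanOp([1, 5, 3, 8]), lambda x: x > 3))
```

The vectorized model keeps the same pull structure but has next() return a cache-sized batch of values, amortizing the call overhead and opening the door to SIMD.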
Code generation: Impala uses LLVM to generate native executable binary code with SIMD instructions; Spark SQL's Tungsten generates Java bytecode at runtime.
Performance optimization areas: CPU optimization, I/O optimization (data locality), and memory optimization (hot-data caching, CPU cache hit rate, GC reduction).
Architectural Schools
1) Shared Disk (SMP architectures such as Oracle/DB2; poor scalability); 2) Sharding (early distributed databases; limited expansion); 3) Shared Nothing, covering MPP (Teradata, Greenplum, AWS Redshift: high performance, but scalability bottlenecks and poor fault tolerance) and SQL on Hadoop (Hive, SparkSQL, Impala, Presto, HAWQ).
Cloud Data Warehouse Architecture and Practice
A cloud data warehouse differs from traditional ones in that user data is scattered across different services (MySQL and other RDBMSs, object storage, streaming), so it requires strong integration capabilities. A reference architecture includes IaaS-layer integration, a query engine, a DB IDE/notebook, data management services (data catalog, metadata, data governance), upper-layer data applications, and management/monitoring modules.
Case study: A large trading platform needed PB-level data ingestion, OLAP analysis, data lineage tracking, data quality control, and scalable architecture—making cloud data warehouse an excellent choice.
New Trends in Data Lakes
A data lake differs from a traditional data warehouse in that it uses ELT (load the data first, then transform) rather than ETL (transform before loading), providing more flexibility. It stores structured, semi-structured, and unstructured data without requiring up-front standardization, and provides one-stop storage, management, and analysis capabilities across multiple data sources and applications.
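The ELT pattern above can be illustrated in miniature: raw semi-structured records are landed as-is, and the transform happens later, per use case. This is a schematic sketch with made-up event data, not a real lake pipeline.

```python
import json

# Raw semi-structured events, as they might arrive from a source system.
raw_events = [
    '{"user": "a", "action": "view", "ts": 1}',
    '{"user": "b", "action": "buy", "ts": 2, "amount": 9.9}',  # extra field is fine
]

# Load: land the payloads untouched -- no schema enforced up front.
lake = [json.loads(line) for line in raw_events]

# Transform (later, at query time): project only the fields one analysis
# needs; other analyses can reinterpret the same raw data differently.
purchases = [{"user": e["user"], "amount": e["amount"]}
             for e in lake if e.get("action") == "buy"]
```

In classic ETL the schema decision happens before loading, so fields not anticipated then are lost; keeping the raw data is what buys the lake its flexibility.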
Tencent Cloud Developer