Mastering the Big Data Technology Stack: From Basics to Advanced Applications
This comprehensive guide outlines the entire big data technology ecosystem, covering foundational techniques, data acquisition, transmission, storage, processing frameworks, governance, and the most popular tools and platforms that power modern large‑scale data solutions.
Big data has become a critical production factor across industries, enabling cost reduction, efficiency gains, new product development, and smarter business decisions through massive data storage, computation, analysis, and mining.
1. Big Data Fundamental Technologies
Data sharding and routing split petabyte‑scale datasets across the nodes of a cluster. Replication improves reliability but introduces consistency challenges, because every copy of a record must eventually agree. Efficient algorithms and data structures are essential for high‑performance processing of massive distributed datasets.
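One common way to route keys to shards is consistent hashing. The following is a minimal sketch under stated assumptions, not a description of any particular system: the class name HashRing, the node names, and the choice of 64 virtual nodes are all hypothetical, and replication is modeled simply as "the next distinct nodes on the ring".

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (MD5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes and successor-based replication."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]
        self._node_count = len(set(nodes))

    def replicas(self, key: str, n: int = 2):
        """Return the n distinct nodes that should hold `key`, primary first."""
        n = min(n, self._node_count)
        i = bisect.bisect_right(self._points, _hash(key))
        owners = []
        while len(owners) < n:
            node = self._ring[i % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
            i += 1
        return owners


ring = HashRing(["node-a", "node-b", "node-c"])
placement = {key: ring.replicas(key, n=2) for key in ("user:1", "user:2", "order:9")}
```

With this layout, removing a node only reassigns the keys that node owned; keys whose primary was another node stay where they are, which is the property that makes resharding cheap.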
2. Data Acquisition
Data acquisition is the first stage of the big data lifecycle, encompassing structured, semi‑structured, and unstructured sources such as logs, network traffic, audio, video, and images.
System Log Collection
Collects operational data from databases, servers, and applications.
Instrumentation (tracking) points: browser (PC) tracking, mobile client tracking, and server‑side tracking.
Collection frameworks: Chukwa, Splunk Forwarder, Flume, Fluentd, Logstash, Scribe.
Network Data Collection
Obtains data from websites via crawlers or public APIs, including text, video, and images.
Crawler tools: Nutch, Heritrix, Scrapy, WebCollector.
Device Data Collection
Gathers data from physical devices such as sensors and probes.
3. Data Transmission
Acquired data is moved through channels to storage and downstream applications, enabling timely notifications of data changes.
Message Queues
Middleware that supports log collection, application decoupling, asynchronous messaging, and traffic peak shaving, while providing high performance, high availability, scalability, and eventual consistency.
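The decoupling and traffic-shaping ideas can be illustrated with an in-process stand-in for a message queue. This is only a sketch using Python's standard queue module, not a real broker; the function name run_pipeline and the capacity of 4 are assumptions made for illustration.

```python
import queue
import threading


def run_pipeline(events, capacity=4):
    """Pass events from a producer thread to a consumer thread through a bounded queue."""
    buffer = queue.Queue(maxsize=capacity)  # bounded: a full queue slows the producer down
    processed = []
    sentinel = object()  # marks the end of the stream

    def producer():
        for event in events:
            buffer.put(event)  # blocks while the queue is full (back-pressure)
        buffer.put(sentinel)

    def consumer():
        while True:
            item = buffer.get()
            if item is sentinel:
                break
            processed.append(item.upper())  # stand-in for real downstream work

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


result = run_pipeline([f"log line {i}" for i in range(10)])
```

The producer never calls the consumer directly, so either side can be replaced or scaled independently, and the bounded buffer absorbs short bursts instead of passing them straight to the consumer.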
Data Synchronization
Synchronizes raw operational data from sources such as MySQL into the operational data store (ODS) layer of the data warehouse, using batch extract‑and‑load jobs.
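A batch extract-and-load job can be sketched with a watermark that records the last row already copied. The example below uses two in-memory SQLite databases as stand-ins for the source system and the warehouse; the table names orders and ods_orders, the column names, and the batch size are hypothetical.

```python
import sqlite3


def sync_batch(source, warehouse, last_id, batch_size=2):
    """Copy the next batch of rows with id > last_id and return the new watermark."""
    rows = source.execute(
        "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, batch_size),
    ).fetchall()
    # INSERT OR REPLACE keeps the load idempotent if a batch is retried.
    warehouse.executemany(
        "INSERT OR REPLACE INTO ods_orders (id, amount) VALUES (?, ?)", rows
    )
    warehouse.commit()
    return rows[-1][0] if rows else last_id


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 10.0) for i in range(1, 6)]
)

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE ods_orders (id INTEGER PRIMARY KEY, amount REAL)")

# Repeat until a run finds no new rows.
watermark = 0
while True:
    new_watermark = sync_batch(source, warehouse, watermark)
    if new_watermark == watermark:
        break
    watermark = new_watermark

loaded = warehouse.execute("SELECT COUNT(*) FROM ods_orders").fetchone()[0]
```

An auto-incrementing id only captures inserts; tracking updates would need something like a modification timestamp or a change log, which this sketch leaves out.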
Data Subscription
Delivers real‑time incremental data to support cache updates, asynchronous decoupling, and complex ETL pipelines.
Serialization
Converts objects into transmittable formats; serialization efficiency directly impacts big data transfer performance.
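To make the size difference between formats concrete, the sketch below encodes the same record as JSON text and as a fixed binary layout using Python's standard struct module. The record fields and the layout are made up for illustration; production systems typically use a schema-based format rather than hand-packed structs.

```python
import json
import struct

record = {"user_id": 12345, "temperature": 21.5, "ok": True}

# Text encoding: self-describing but repeats field names in every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: little-endian unsigned 32-bit int, 64-bit float, 1-byte bool.
# The field order is an agreed schema, so no names are transmitted.
LAYOUT = struct.Struct("<Id?")
as_binary = LAYOUT.pack(record["user_id"], record["temperature"], record["ok"])

# Decoding requires the same layout on the receiving side.
user_id, temperature, ok = LAYOUT.unpack(as_binary)

sizes = {"json": len(as_json), "binary": len(as_binary)}
```

The binary form is smaller because the schema lives outside the payload, at the cost of both sides having to agree on it in advance.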
4. Data Organization & Integration
Provides high‑performance, reliable storage for massive heterogeneous data, supporting large‑scale analysis and computation.
Physical Storage
Classified by server type: closed systems (mainframes) and open systems (Windows, UNIX, Linux). Open systems further split into internal and external storage, with external storage divided into Direct‑Attached Storage (DAS) and Fabric‑Attached Storage (FAS), which includes Network‑Attached Storage (NAS) and Storage Area Network (SAN).
Different application scenarios lead to object storage, block storage, and file system storage.
Distributed File/Object Storage Systems
Offer scalable, high‑throughput storage across multiple nodes for both file and object data.
Popular systems: HDFS, OpenStack Swift, Ceph, GlusterFS, Lustre, AFS, OSS.
Distributed Relational Databases
Address the limitations of centralized RDBMSs (performance, scalability) by distributing data across clusters.
Popular solutions: DRDS, TiDB, Greenplum, Cobar, Aurora, Mycat.
Analytical Databases
Designed for online statistical analysis, ad‑hoc queries, and data mining.
Popular solutions: Kylin, AnalyticDB, Druid, ClickHouse, Vertica, MonetDB, InfiniDB, LucidDB.
Search Engines
Provide distributed, high‑performance, scalable search and analysis over massive datasets.
Popular engines: Elasticsearch, Solr, OpenSearch.
Graph Databases
Store and query relationships using graph structures, ideal for social networks and complex relational data.
Popular solutions: Titan, Neo4j, ArangoDB, OrientDB, MapGraph, AllegroGraph.
Columnar Databases
Store data by columns, optimizing batch processing and real‑time queries.
Popular solutions: Phoenix, Cassandra, HBase, Kudu, Hypertable.
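The effect of storing data by column can be shown with plain Python lists. This is a toy illustration of the layout only, with made-up rows and a hypothetical helper to_columns; it says nothing about how any of the systems listed above lay out data on disk.

```python
rows = [
    {"id": 1, "city": "Hangzhou", "amount": 30.0},
    {"id": 2, "city": "Beijing", "amount": 12.5},
    {"id": 3, "city": "Hangzhou", "amount": 7.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values,
# instead of reading every field of every row.
total_amount = sum(columns["amount"])
```

Values in a single column also share a type, which is why column layouts tend to compress well.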
Document Databases
Manage semi‑structured document data.
Popular solutions: MongoDB, CouchDB, OrientDB, MarkLogic.
Key‑Value Stores
Simple, high‑throughput storage for caching and fast lookups.
Popular solutions: Redis, Memcached, Tair.
5. Data Computation
Big data computation handles parallel processing, analysis, and mining to meet diverse business needs.
Streaming Compute
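Processes unbounded, continuous data streams with low latency. It is related to, but not the same as, real‑time compute: streaming describes how data arrives, while real‑time describes how quickly results are required.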
Frameworks: Storm, Flink, Yahoo S4, Kafka Streams, Twitter Heron, Apache Samza, Spark Streaming.
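A basic streaming operation is counting events per fixed (tumbling) time window and emitting each result as soon as its window closes. The generator below is a simplified sketch: the function name tumbling_counts is hypothetical, and it assumes events arrive in timestamp order, which real stream processors cannot rely on.

```python
def tumbling_counts(events, window_seconds):
    """Yield (window_start, counts) each time a window closes.

    `events` is an iterable of (timestamp, key) pairs, assumed to be ordered by timestamp.
    """
    current_start = None
    counts = {}
    for timestamp, key in events:
        start = timestamp - timestamp % window_seconds
        if current_start is not None and start != current_start:
            # The event belongs to a later window, so the current one is complete.
            yield current_start, counts
            counts = {}
        current_start = start
        counts[key] = counts.get(key, 0) + 1
    if current_start is not None:
        yield current_start, counts  # flush the last open window


stream = [(0, "click"), (3, "view"), (9, "click"), (10, "click"), (14, "view"), (21, "click")]
windows = list(tumbling_counts(stream, window_seconds=10))
```

Because it is a generator, results are produced while input is still being consumed, rather than after the whole dataset has been read.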
Batch Compute
Executes large‑scale parallel processing on static datasets.
Frameworks: Tez, MapReduce, Hive, Spark, Pig, Apache Beam.
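The MapReduce model can be sketched in a single process as three phases: map, shuffle, and reduce. The function names and sample documents below are illustrative; in a real framework the phases run in parallel on different machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data big compute", "data lake"]
mapped = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle(mapped))
```

Each map call depends only on its own document and each reduce call only on its own key, which is what allows the work to be spread across a cluster.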
Ad‑hoc Query (Interactive Analysis)
Enables flexible, on‑the‑fly queries for business intelligence.
Frameworks: Impala, Hawq, Dremel, Drill, Phoenix, Tajo, Presto, Hortonworks Stinger.
Incremental Compute
Processes only newly added data to improve efficiency, used in scenarios like search engine index updates.
Architectures: Lambda, Kappa, IOTA.
Frameworks: Microsoft Kineograph, Galaxy, Google Percolator, Druid.
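The search-index case can be sketched as an inverted index that processes only documents it has not seen before, instead of rebuilding from scratch. The class name IncrementalIndex and the sample documents are hypothetical, and updates or deletions of existing documents are deliberately left out.

```python
class IncrementalIndex:
    """Inverted index that only does work for newly added documents."""

    def __init__(self):
        self.postings = {}    # term -> set of document ids
        self.indexed = set()  # ids that have already been processed

    def add_documents(self, documents):
        """Index unseen documents and return how many were actually processed."""
        processed = 0
        for doc_id, text in documents.items():
            if doc_id in self.indexed:
                continue  # already indexed: skip instead of recomputing
            for term in set(text.lower().split()):
                self.postings.setdefault(term, set()).add(doc_id)
            self.indexed.add(doc_id)
            processed += 1
        return processed

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


index = IncrementalIndex()
first_run = index.add_documents({1: "big data storage", 2: "stream compute"})
# The second crawl contains one old and one new document; only the new one costs work.
second_run = index.add_documents({2: "stream compute", 3: "big compute cluster"})
```

The cost of each run is proportional to the new data rather than to the full corpus.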
Graph Compute
Analyzes large graph datasets using specialized models.
Frameworks: Pregel, GraphChi, Spark GraphX, PowerGraph, Apache Giraph, Apache Hama.
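Several of these frameworks follow a vertex-centric model in which vertices exchange messages in synchronized rounds (supersteps). The sketch below imitates that style in a single process to find connected components by propagating the smallest vertex id; it is an illustration of the idea, not the API of any listed framework, and the function name is an assumption.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its connected component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    labels = {vertex: vertex for vertex in neighbors}
    # Superstep 0: every vertex sends its own label to all neighbors.
    inbox = {vertex: [labels[n] for n in neighbors[vertex]] for vertex in neighbors}

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in neighbors}
        for vertex, messages in inbox.items():
            if messages and min(messages) < labels[vertex]:
                labels[vertex] = min(messages)
                # Only vertices whose label changed send messages in the next round.
                for n in neighbors[vertex]:
                    outbox[n].append(labels[vertex])
        inbox = outbox
    return labels


components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Each vertex reads only its own messages and writes only its own label, so the per-vertex work within a superstep can run in parallel.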
6. Distributed Coordination Systems
Provide naming, state synchronization, cluster management, and configuration services for large distributed environments.
Frameworks: Chubby, Alibaba Diamond, Alibaba ConfigServer, ZooKeeper, Eureka, Consul.
7. Cluster Resource Management & Scheduling
Unified management and allocation of resources across clusters and data centers.
Frameworks: Omega, Borg, Mesos, Corona, YARN, Torca.
Monitoring tools: Ambari, Chukwa, Hue.
8. Workflow Management Engines
Handle complex, dependency‑rich data pipelines using directed acyclic graphs (DAGs).
Frameworks: Oozie, Azkaban, Luigi, Airflow.
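The core of DAG-based scheduling is running each task only after its dependencies have finished. A minimal sketch using Python's standard graphlib module (Python 3.9+) follows; the task names are hypothetical, and real engines add retries, scheduling, and distributed execution on top.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_users": set(),
    "join": {"extract_orders", "extract_users"},
    "aggregate": {"join"},
    "publish_report": {"aggregate"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
execution_order = list(TopologicalSorter(pipeline).static_order())


def is_valid_dag(graph):
    """Return False if the dependency graph contains a cycle."""
    try:
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False
```

Rejecting cyclic definitions up front matters because a cycle would leave every task in it waiting on another forever.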
9. Data Warehouse
Separates analytical workloads from transactional systems to enable comprehensive decision support and historical analysis.
10. Data Governance
Preserves the value of massive, heterogeneous data through metadata management, data quality, and data security practices.
Metadata Management
Manages schema, lineage, permissions, and other auxiliary information to improve data discoverability and system maintenance.
Data Quality
Focuses on accuracy, completeness, and consistency of data assets.
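Of these dimensions, completeness is the simplest to measure: the share of records in which a required field is present and non-empty. The sketch below shows one way to compute it; the function name, the sample records, and the rule that None and empty strings count as missing are assumptions.

```python
def completeness(records, required_fields):
    """Return, per field, the fraction of records with a non-empty value."""
    if not records:
        return {field: 0.0 for field in required_fields}
    report = {}
    for field in required_fields:
        filled = sum(1 for record in records if record.get(field) not in (None, ""))
        report[field] = filled / len(records)
    return report


records = [
    {"id": 1, "email": "a@example.com", "city": "Hangzhou"},
    {"id": 2, "email": "", "city": "Beijing"},
    {"id": 3, "email": "c@example.com", "city": None},
    {"id": 4, "email": "d@example.com"},
]
quality_report = completeness(records, ["id", "email", "city"])
```

A check like this is typically run after each load and compared against a threshold, so that a drop in completeness is noticed before downstream reports use the data.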
Data Security
Protects data against breaches and ensures compliance, a prerequisite for any big data application.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
