Mastering the Big Data Technology Stack: From Basics to Advanced Applications

This comprehensive guide outlines the entire big data technology ecosystem, covering foundational techniques, data acquisition, transmission, storage, processing frameworks, governance, and the most popular tools and platforms that power modern large‑scale data solutions.

Alibaba Cloud Developer

Jun 11, 2020

Big data has become a critical production factor across industries, enabling cost reduction, efficiency gains, new product development, and smarter business decisions through massive data storage, computation, analysis, and mining.

1. Big Data Fundamental Technologies

Data sharding and routing split petabyte‑scale datasets across the nodes of a cluster and direct each request to the node that holds the relevant shard. Replication improves reliability but introduces consistency challenges, because every copy must be kept in agreement. Efficient algorithms and data structures are essential for high‑performance processing of massive distributed datasets.
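
One common way to implement sharding and routing is consistent hashing. The sketch below is a minimal illustration, not a production router: the class name ConsistentHashRing, the node names, and the choice of 64 virtual points per node are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable 64-bit integer (md5 is used only to spread keys)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class ConsistentHashRing:
    """Hypothetical router: each node owns several virtual points on a hash ring."""

    def __init__(self, nodes, vnodes=64):
        self._points = []  # sorted hash positions on the ring
        self._owner = {}   # hash position -> node name
        for node in nodes:
            for i in range(vnodes):
                point = _hash(f"{node}#{i}")
                self._owner[point] = node
                bisect.insort(self._points, point)

    def route(self, key: str) -> str:
        """Return the node that owns the first ring point at or after hash(key)."""
        idx = bisect.bisect_left(self._points, _hash(key))
        if idx == len(self._points):  # past the last point: wrap to the start
            idx = 0
        return self._owner[self._points[idx]]

    def replicas(self, key: str, n: int):
        """Return up to n distinct nodes, walking clockwise from the key's position."""
        start = bisect.bisect_left(self._points, _hash(key))
        found = []
        for step in range(len(self._points)):
            node = self._owner[self._points[(start + step) % len(self._points)]]
            if node not in found:
                found.append(node)
            if len(found) == n:
                break
        return found


ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
primary = ring.route("user:42")
copies = ring.replicas("user:42", 2)  # primary plus one replica
```

The point of the ring is that adding a node moves only the keys that the new node takes over; every other key keeps its current owner.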

2. Data Acquisition

Data acquisition is the first stage of the big data lifecycle, encompassing structured, semi‑structured, and unstructured sources such as logs, network traffic, audio, video, and images.

System Log Collection

Collects operational data from databases, servers, and applications.

Instrumentation (event‑tracking) points: browser (PC), mobile client, and server.

Collection frameworks: Chukwa, Splunk Forwarder, Flume, Fluentd, Logstash, Scribe.

Network Data Collection

Obtains data from websites via crawlers or public APIs, including text, video, and images.

Crawler tools: Nutch, Heritrix, Scrapy, WebCollector.

Device Data Collection

Gathers data from physical devices such as sensors and probes.

3. Data Transmission

Acquired data is moved through channels to storage and downstream applications, enabling timely notifications of data changes.

Message Queues

Middleware used for log collection, decoupling applications, asynchronous messaging, and traffic shaping (smoothing load peaks). Message queues aim to provide high performance, high availability, scalability, and eventual consistency.
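
The decoupling and traffic‑shaping roles can be sketched with Python's standard library. The bounded in‑memory queue below is only a stand‑in for a real message broker, and the function name run_pipeline is hypothetical.

```python
import queue
import threading


def run_pipeline(events, maxsize=2):
    """Push events through a bounded queue and return what the consumer processed."""
    buffer = queue.Queue(maxsize=maxsize)  # bounded: put() blocks while the queue is full
    processed = []
    stop = object()  # sentinel that tells the consumer to exit

    def consumer():
        while True:
            item = buffer.get()
            if item is stop:
                break
            processed.append(item.upper())  # placeholder for real processing

    worker = threading.Thread(target=consumer)
    worker.start()
    for event in events:
        buffer.put(event)  # the producer never calls the consumer directly
    buffer.put(stop)
    worker.join()
    return processed


result = run_pipeline(["login", "click", "logout"])
```

Because the queue is bounded, a fast producer is slowed down to the consumer's pace instead of overwhelming it, which is the traffic‑shaping effect in miniature.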

Data Synchronization

Synchronizes raw operational data from sources like MySQL into the data warehouse, typically into its operational data store (ODS) layer, using batch extraction and load processes.
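
A batch extract‑and‑load loop might look like the sketch below. It uses SQLite from the standard library in place of MySQL and a warehouse, and the table names orders and ods_orders, their columns, and the function sync_batch are assumptions for illustration.

```python
import sqlite3


def sync_batch(source, target, batch_size=2):
    """Copy the `orders` table from source to target in fixed-size batches."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS ods_orders (id INTEGER PRIMARY KEY, amount REAL)"
    )
    last_id = 0  # highest primary key copied so far
    copied = 0
    while True:
        rows = source.execute(
            "SELECT id, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        # INSERT OR REPLACE keeps the load idempotent if a batch is retried
        target.executemany(
            "INSERT OR REPLACE INTO ods_orders (id, amount) VALUES (?, ?)", rows
        )
        target.commit()
        last_id = rows[-1][0]
        copied += len(rows)
    return copied


source_db = sqlite3.connect(":memory:")
source_db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
source_db.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 20.5), (3, 5.5)],
)
target_db = sqlite3.connect(":memory:")
rows_copied = sync_batch(source_db, target_db)
```

Paging by primary key rather than by OFFSET keeps each extraction query cheap even when the source table is large.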

Data Subscription

Delivers real‑time incremental data to support cache updates, asynchronous decoupling, and complex ETL pipelines.
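
As an illustration of the cache‑update use case, the sketch below applies a stream of change events to an in‑memory cache. The event layout (op, key, value) and the function name apply_changes are hypothetical and do not reflect the format of any particular subscription service.

```python
def apply_changes(cache, change_events):
    """Apply incremental change events, in order, to a dict-based cache."""
    for event in change_events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            cache[key] = event["value"]
        elif op == "delete":
            cache.pop(key, None)  # ignore deletes for keys that are not cached
    return cache


cache = {"user:1": "Alice"}
events = [
    {"op": "insert", "key": "user:2", "value": "Bob"},
    {"op": "update", "key": "user:1", "value": "Alicia"},
    {"op": "delete", "key": "user:2"},
]
apply_changes(cache, events)
```

Only the changed rows travel to the subscriber, so the cache stays current without re‑reading the whole source table.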

Serialization

Converts objects into transmittable formats; serialization efficiency directly impacts big data transfer performance.
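
To make the efficiency point concrete, the sketch below encodes the same record as JSON text and as a fixed binary layout using Python's standard struct module. The record fields and the layout are made up for the example; real systems usually rely on a schema‑based format rather than hand‑written packing.

```python
import json
import struct

record = {"user_id": 1234567, "ts": 1591833600, "amount": 19.99}

# Text encoding: self-describing, but field names are repeated in every record.
as_json = json.dumps(record).encode("utf-8")

# Binary encoding: little-endian uint32, uint32, float64 (16 bytes, no padding).
RECORD_FORMAT = "<IId"
as_binary = struct.pack(
    RECORD_FORMAT, record["user_id"], record["ts"], record["amount"]
)

# Decoding needs the same format string, which plays the role of a schema.
user_id, ts, amount = struct.unpack(RECORD_FORMAT, as_binary)
```

The binary form is smaller because the field names and types live in the shared format rather than in every message, at the cost of both sides having to agree on that format.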

4. Data Organization & Integration

Provides high‑performance, reliable storage for massive heterogeneous data, supporting large‑scale analysis and computation.

Physical Storage

Classified by server type: closed systems (mainframes) and open systems (Windows, UNIX, Linux). Open systems further split into internal and external storage, with external storage divided into Direct‑Attached Storage (DAS) and Fabric‑Attached Storage (FAS), which includes Network‑Attached Storage (NAS) and Storage Area Network (SAN).

Different application scenarios lead to object storage, block storage, and file system storage.

Distributed File/Object Storage Systems

Offer scalable, high‑throughput storage across multiple nodes for both file and object data.

Popular systems: HDFS, OpenStack Swift, Ceph, GlusterFS, Lustre, AFS, OSS.

Distributed Relational Databases

Address the limitations of centralized RDBMSs (performance, scalability) by distributing data across clusters.

Popular solutions: DRDS, TiDB, Greenplum, Cobar, Aurora, Mycat.

Analytical Databases

Designed for online statistical analysis, ad‑hoc queries, and data mining.

Popular solutions: Kylin, AnalyticDB, Druid, ClickHouse, Vertica, MonetDB, InfiniDB, LucidDB.

Search Engines

Provide distributed, high‑performance, scalable search and analysis over massive datasets.

Popular engines: Elasticsearch, Solr, OpenSearch.

Graph Databases

Store and query relationships using graph structures, ideal for social networks and complex relational data.

Popular solutions: Titan, Neo4j, ArangoDB, OrientDB, MapGraph, AllegroGraph.

Columnar Databases

Store data by columns, optimizing batch processing and real‑time queries.

Popular solutions: Phoenix, Cassandra, HBase, Kudu, Hypertable.
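
The idea of storing data by column can be shown with plain Python structures. This is a toy sketch with made‑up rows and a hypothetical to_columns helper, not a description of how any of the systems above lay out data on disk.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


rows = [
    {"id": 1, "city": "Hangzhou", "amount": 10.0},
    {"id": 2, "city": "Beijing", "amount": 20.5},
    {"id": 3, "city": "Hangzhou", "amount": 5.5},
]
columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])
```

A query that aggregates a single field reads one contiguous list instead of every full row, which is why column layouts suit analytical scans.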

Document Databases

Manage semi‑structured document data.

Popular solutions: MongoDB, CouchDB, OrientDB, MarkLogic.

Key‑Value Stores

Simple, high‑throughput storage for caching and fast lookups.

Popular solutions: Redis, Memcached, Tair.

5. Data Computation

Big data computation handles parallel processing, analysis, and mining to meet diverse business needs.

Streaming Compute

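Processes continuous, unbounded data streams record by record as they arrive, with low latency. Streaming describes the processing model (data in motion), whereas real‑time compute describes a latency requirement, so the two overlap but are not the same thing.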

Frameworks: Storm, Flink, Yahoo S4, Kafka Streams, Twitter Heron, Apache Samza, Spark Streaming.
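
A core building block of stream processing is windowed aggregation. The sketch below implements a tumbling (fixed, non‑overlapping) window count in plain Python. It assumes events arrive in timestamp order and does not use the API of any framework listed above; the function name is hypothetical.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) whenever the stream moves into a new window.

    `events` is an iterable of (timestamp, key) pairs in timestamp order.
    """
    current_start = None
    counts = defaultdict(int)
    for ts, key in events:
        start = ts - ts % window_seconds  # window that this event belongs to
        if current_start is not None and start != current_start:
            yield current_start, dict(counts)  # close the previous window
            counts = defaultdict(int)
        current_start = start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the last open window


stream = [(0, "a"), (3, "b"), (9, "a"), (10, "a"), (21, "b")]
windows = list(tumbling_window_counts(stream, 10))
```

Results are emitted as soon as a window closes instead of after the whole dataset has been read, which is what keeps latency low. Handling late or out‑of‑order events is left out of this sketch.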

Batch Compute

Executes large‑scale parallel processing on static datasets.

Frameworks: Tez, MapReduce, Hive, Spark, Pig, Apache Beam.
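
The MapReduce model can be illustrated with a single‑process word count. The sketch below only mimics the map, shuffle, and reduce phases in memory; the function names are illustrative and nothing here is distributed.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data", "Big compute"])
```

In a real cluster the map calls run in parallel on different input splits and the shuffle moves data across the network, but the contract between the phases is the same.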

Ad‑hoc Query (Interactive Analysis)

Enables flexible, on‑the‑fly queries for business intelligence.

Frameworks: Impala, HAWQ, Dremel, Drill, Phoenix, Tajo, Presto, Hortonworks Stinger.

Incremental Compute

Processes only newly added data to improve efficiency, used in scenarios like search engine index updates.

Architectures: Lambda, Kappa, IOTA.

Frameworks: Microsoft Kineograph, Galaxy, Google Percolator, Druid.
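
The index‑update scenario can be sketched as an inverted index that processes only documents it has not seen before. The class IncrementalIndex is a toy assumption for illustration and is unrelated to the internals of the frameworks above.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that skips documents it has already indexed."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.indexed = set()              # ids of documents already processed

    def update(self, documents):
        """Index only new documents; return how many were actually processed."""
        processed = 0
        for doc_id, text in documents.items():
            if doc_id in self.indexed:
                continue  # unchanged work is not repeated
            for term in text.lower().split():
                self.postings[term].add(doc_id)
            self.indexed.add(doc_id)
            processed += 1
        return processed


index = IncrementalIndex()
first_run = index.update({1: "big data", 2: "data lake"})
second_run = index.update({1: "big data", 2: "data lake", 3: "big lake"})
```

The second call touches one document instead of three. This sketch handles additions only; supporting edits and deletions would also require removing stale postings.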

Graph Compute

Analyzes large graph datasets using specialized models.

Frameworks: Pregel, GraphChi, Spark GraphX, PowerGraph, Apache Giraph, Apache Hama.
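
One such specialized model is the vertex‑centric, superstep‑based style popularized by Pregel. The sketch below imitates it in a single process to find connected components by propagating the smallest vertex id. It illustrates the model only and is not the API of any listed framework.

```python
def connected_components(edges):
    """Label every vertex with the smallest vertex id in its component."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    label = {v: v for v in neighbors}
    # Superstep 0: every vertex sends its own label to all of its neighbors.
    inbox = {v: [label[u] for u in neighbors[v]] for v in neighbors}

    # Keep running supersteps until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < label[v]:
                label[v] = min(messages)  # adopt the smaller label
                for u in neighbors[v]:
                    outbox[u].append(label[v])  # delivered in the next superstep
        inbox = outbox
    return label


components = connected_components([(1, 2), (2, 3), (4, 5)])
```

Each vertex reads only the messages sent to it in the previous superstep and sends messages only when its state changes, so the computation stops on its own once labels are stable.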

6. Distributed Coordination Systems

Provide naming, state synchronization, cluster management, and configuration services for large distributed environments.

Frameworks: Chubby, Alibaba Diamond, Alibaba ConfigServer, ZooKeeper, Eureka, Consul.

7. Cluster Resource Management & Scheduling

Unified management and allocation of resources across clusters and data centers.

Frameworks: Omega, Borg, Mesos, Corona, YARN, Torca.

Monitoring tools: Ambari, Chukwa, Hue.

8. Workflow Management Engines

Handle complex, dependency‑rich data pipelines using directed acyclic graphs (DAGs).

Frameworks: Oozie, Azkaban, Luigi, Airflow.
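
At the heart of a DAG‑based engine is a dependency‑respecting execution order. The sketch below derives one with graphlib from the Python standard library (Python 3.9 or later). The task names and the plan function are hypothetical, and the engines listed above add scheduling, retries, and monitoring on top of this idea.

```python
from graphlib import CycleError, TopologicalSorter


def plan(dependencies):
    """Return an execution order in which every task follows its prerequisites.

    `dependencies` maps each task to the set of tasks it depends on.
    """
    return list(TopologicalSorter(dependencies).static_order())


pipeline = {
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}
order = plan(pipeline)

# A cycle means the pipeline can never start, so it is rejected up front.
try:
    plan({"a": {"b"}, "b": {"a"}})
    has_cycle = False
except CycleError:
    has_cycle = True
```

Requiring the graph to be acyclic is what guarantees that such an order exists at all.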

9. Data Warehouse

Separates analytical workloads from transactional systems to enable comprehensive decision support and historical analysis.

10. Data Governance

Ensures the value of massive, heterogeneous data through governance, quality, and security practices.

Metadata Management

Manages schema, lineage, permissions, and other auxiliary information to improve data discoverability and system maintenance.

Data Quality

Focuses on accuracy, completeness, and consistency of data assets.
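
Completeness, for example, can be measured as the share of records in which a required field is filled in. The sketch below assumes records are plain dictionaries; the field names and the completeness function are made up for illustration.

```python
def completeness(rows, required_fields):
    """Return, per field, the fraction of rows with a non-empty value."""
    total = len(rows)
    report = {}
    for field in required_fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report


customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3},
    {"id": 4, "email": "d@example.com"},
]
report = completeness(customers, ["id", "email"])
```

A check like this is typically run on a schedule and compared against a threshold, so that a drop in completeness is noticed before the data reaches downstream reports.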

Data Security

Protects data against breaches and ensures compliance, a prerequisite for any big data application.
