Overview of Big Data Technologies and Architectures
This article provides a comprehensive overview of major big‑data platforms such as Hadoop, Spark, Flink, Kafka, and related ecosystem components, explaining their core concepts, storage models, processing frameworks, and architectural patterns for handling massive, distributed datasets.
Background
Hadoop: An open‑source data‑analysis platform that solves reliable storage and processing of massive data sets that cannot fit on a single machine, especially unstructured data, using core components HDFS and MapReduce.
HDFS: Provides a fault‑tolerant, distributed storage system across multiple servers.
MapReduce: A standardized processing pipeline that is data‑locality aware: it reads data, maps it, shuffles by key, and reduces to produce final output.
Amazon Elastic MapReduce (EMR): A managed Hadoop service that runs on Amazon EC2 and S3. It is cost-effective for occasional large-scale jobs, but it is built around data stored in S3, and reading from S3 rather than from local disks can add latency.
Hadoop extensions: Includes Sqoop, Flume, Hive, Pig, Mahout, Datafu, and HUE.
Pig: A platform for analyzing large data sets using a high‑level language and execution engine.
Hive: A data‑warehouse system for Hadoop offering an SQL‑like query language for summarization, ad‑hoc queries, and analysis.
HBase: A distributed, scalable NoSQL store that supports random real‑time read/write access.
Sqoop: A tool designed for efficient bulk transfer between Hadoop and structured data stores such as relational databases.
Flume: A distributed, reliable service for efficiently collecting, aggregating, and moving large volumes of log data.
ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and group services.
Cloudera: The most mature Hadoop distribution, with many production deployments; powerful deployment, management, and monitoring tools; and Impala, its project for real-time queries.
Hortonworks: A 100% open‑source Apache Hadoop vendor that contributes many enhancements back to the core, enabling Hadoop to run natively on platforms such as Windows Server and Azure.
MapR: Focuses on performance and usability by supporting the native Unix file system instead of HDFS, offering high‑availability features like snapshots and stateful failover, and leading the Apache Drill project (an open‑source implementation of Google Dremel).
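The MapReduce pipeline described above (read, map, shuffle by key, reduce) can be illustrated with a single-process word count. This is only a sketch of the programming model, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `run_job` are made up for illustration, and a real cluster would run the map and reduce steps in parallel on the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Map: emit (key, value) pairs for one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a final result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


word_counts = run_job(["big data big clusters", "data locality matters"])
```

Only the map and reduce functions are specific to the job; the grouping in between is the part a framework such as Hadoop provides.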
Principles
Data Storage
Our goal is to build a reliable, massively scalable, and easy-to-maintain system. Storage forms a hierarchy: moving up from disk toward memory and CPU caches, access gets faster and the cost per byte rises. Disks and SSDs need careful data placement because their performance varies dramatically with the access pattern. Disks offer persistence, low unit cost, and easy backup, but now that memory is cheap, many data sets can be kept entirely in RAM and distributed across machines, often with key-value stores such as Memcached acting as a cache.
In-memory data can be made durable with battery-backed RAM, write-ahead logs, periodic snapshots, or replication to other machines; on restart, state is reloaded from disk or over the network. Writes are typically appended to a log on disk, while reads are served directly from memory. In-memory systems such as VoltDB and MemSQL (relational databases) and RAMCloud (a key-value store) deliver high performance and avoid much of the complexity of disk management.
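The write-ahead-log pattern mentioned above (append writes to disk, serve reads from memory, reload state on restart) can be sketched in a few lines. The class name `DurableKV` and the JSON-lines log format are assumptions made for this example and do not describe how any of the named systems work internally; the sketch also ignores log compaction and partially written records.

```python
import json
import os
import tempfile


class DurableKV:
    """In-memory key-value store that logs every write to disk first."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # On restart, rebuild the in-memory state from the log on disk.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # Append to the log and force it to disk before updating memory.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads are served from memory only.
        return self.data.get(key)

    def close(self):
        self.log.close()


log_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = DurableKV(log_file)
store.put("user:1", "alice")
store.put("user:1", "bob")
store.close()

recovered = DurableKV(log_file)  # simulates a process restart
recovered_value = recovered.get("user:1")
recovered.close()
```

Because the log is append-only, writes stay sequential on disk; a periodic snapshot would keep replay time bounded as the log grows.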
HyperLogLog, Bloom Filter, Count‑Min Sketch
These are probabilistic algorithms widely used in big-data contexts. All of them hash their input and accept a small, bounded error in exchange for very low memory use. HyperLogLog estimates the cardinality of a massive set: it uses the low-order bits of each hashed value to pick a bucket and counts the leading zeros in the remaining bits. A Bloom filter sets one bit per hash function when an element is added; a lookup checks the same bits to determine possible presence, yielding false positives but no false negatives. Count-Min Sketch applies the same idea to counters instead of bits, so it estimates how often an element occurred, not just whether it exists.
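A minimal sketch of the Bloom filter and Count-Min Sketch described above may make the relationship between them clearer. The sizes (`num_bits`, `num_hashes`, `width`, `depth`) are arbitrary illustrative defaults, and deriving the hash functions by salting SHA-256 with an index is an assumption chosen for simplicity, not a recommendation for production use.

```python
import hashlib


def _hashes(item, count, modulus):
    """Derive `count` hash values in [0, modulus) by salting SHA-256 with an index."""
    for i in range(count):
        digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % modulus


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def add(self, item):
        for pos in _hashes(item, self.num_hashes, self.num_bits):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos]
                   for pos in _hashes(item, self.num_hashes, self.num_bits))


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        # One counter per row, chosen by that row's hash function.
        for row, col in enumerate(_hashes(item, self.depth, self.width)):
            self.rows[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best bound.
        return min(self.rows[row][col]
                   for row, col in enumerate(_hashes(item, self.depth, self.width)))


bloom = BloomFilter()
sketch = CountMinSketch()
for word in ["spark", "kafka", "spark", "flink", "spark"]:
    bloom.add(word)
    sketch.add(word)

seen_spark = bloom.might_contain("spark")
spark_count = sketch.estimate("spark")
```

The estimate returned by `CountMinSketch` is never below the true count, which mirrors the Bloom filter's guarantee of no false negatives.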
CAP Theorem
In simple terms, a distributed system can provide at most two of three properties: Consistency, Availability, and Partition tolerance. Because network partitions cannot be ruled out in practice, the real design trade-off is between consistency and availability while a partition lasts. Other important distributed algorithms and theories include Paxos, Gossip protocols, Quorum, logical clocks, vector clocks, the Byzantine Generals problem, and two-phase commit.
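Of the techniques listed above, quorums are the easiest to show in code. With N replicas, a write acknowledged by W of them and a read that contacts R of them are guaranteed to share at least one replica when R + W > N, so the read sees the latest write. The sketch below checks that property by brute force; the function name `quorums_overlap` and the parameter names are chosen for this example.

```python
from itertools import combinations


def quorums_overlap(n, w, r):
    """True when every write quorum of size w shares a replica with every read quorum of size r."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


strict = quorums_overlap(n=3, w=2, r=2)   # 2 + 2 > 3
relaxed = quorums_overlap(n=3, w=1, r=1)  # 1 + 1 <= 3
```

Choosing smaller quorums lowers latency and raises availability at the cost of possibly stale reads, which is the consistency/availability trade-off in miniature.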
Technologies
Source: thinkbig.teradata.com
Big‑data architectures must be flexible to meet varying latency (SLA) requirements, data volume, update rates, and analytical needs. The diagram above shows component choices across different domains.
Key Google technologies that pioneered modern big-data processing include Spanner, F1, and Dremel.
Spanner: A globally distributed, multi‑version, highly scalable database with synchronous replication, supporting external consistency for distributed transactions across hundreds of data centers and millions of servers.
F1: Built on Spanner, it adds distributed SQL, secondary indexes, and strong transactional consistency, replacing an older MySQL sharding solution for AdWords.
Dremel: A query engine that runs on thousands of servers, offering SQL‑like syntax and the ability to process petabyte‑scale data in seconds.
Spark
Spark, the hottest big-data technology of 2014, focuses on in-memory computation for faster analytics and supports graph, stream, and batch processing. It originated at UC Berkeley's AMPLab, whose members later founded Databricks.
Flink
Flink applies a database-style query optimizer to its data-flow programs, which distinguishes it from Apache Spark and can deliver better performance for certain workloads.
Kafka
Described as LinkedIn’s “central nervous system,” Kafka manages streams of data from many applications, providing near‑real‑time processing for companies such as LinkedIn, Netflix, Uber, and Verizon.
Storm
Storm is Twitter’s real‑time stream processing framework, offering distributed, fault‑tolerant computation for use cases such as online machine learning, ETL, and continuous analytics.
Samza
Developed by LinkedIn, Samza integrates tightly with Kafka and provides a complementary stream processing platform.
Lambda Architecture
Nathan Marz's article "How to Beat the CAP Theorem" introduced the Lambda Architecture. It combines a batch layer, which processes the complete data set at high latency, with a real-time stream layer that covers recent data at low latency, and merges the results of both in a serving layer.
Summingbird addresses the maintenance overhead of Lambda Architecture by providing a unified programming model that runs on both batch and stream engines.
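The split into batch, real-time, and serving layers can be sketched with a page-view counter. The names `build_batch_view`, `SpeedLayer`, and `query`, and the event shape `{"page": ...}`, are invented for this example; real deployments would run these layers on separate batch and stream engines.

```python
from collections import Counter


def build_batch_view(master_dataset):
    """Batch layer: recompute the view from scratch over all archived events."""
    return Counter(event["page"] for event in master_dataset)


class SpeedLayer:
    """Real-time layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.realtime_view = Counter()

    def on_event(self, event):
        self.realtime_view[event["page"]] += 1


def query(page, batch_view, realtime_view):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view[page] + realtime_view[page]


archived = [{"page": "/home"}, {"page": "/home"}, {"page": "/docs"}]
batch_view = build_batch_view(archived)

speed = SpeedLayer()
speed.on_event({"page": "/home"})  # arrived after the batch run

home_views = query("/home", batch_view, speed.realtime_view)
```

Note that the counting logic appears twice, once per layer. That duplication is the maintenance overhead that a unified model such as Summingbird aims to remove.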
NoSQL
Traditional relational databases store data in tables with rigid schemas, which can make heavily interconnected (many-to-many) data awkward to model and scale. Modern NoSQL systems (e.g., Cassandra, MongoDB, Couchbase) fall into document, graph, column-family, and key-value categories, each solving specific problems; there is no one-size-fits-all solution.
Cassandra
In big‑data architectures, Cassandra stores structured data with a column‑family model, offering high availability, durability, and eventual consistency across massive clusters.
SQL on Hadoop
Open‑source projects such as Apache Hive, Spark SQL, Cloudera Impala, Hortonworks Stinger, Facebook Presto, Apache Tajo, and Apache Drill provide SQL‑like query capabilities on top of Hadoop, many inspired by Google Dremel.
Impala
Developed by Cloudera, Impala offers low‑latency SQL queries over data stored in HDFS and HBase, claiming 5‑10× speedup over Hive, though Spark has recently eclipsed it in popularity.
Drill
Apache Drill is the open‑source counterpart to Dremel, designed for interactive analysis of large data sets.
Druid
Druid is an open‑source, column‑oriented, distributed data store optimized for real‑time analytics on billions of rows, delivering sub‑second query response times.
Berkeley Data Analytics Stack (BDAS)
Beyond Spark, BDAS includes projects such as:
Mesos – a distributed resource manager that runs Hadoop, MPI, and Spark workloads in a unified environment.
Tachyon – a fault‑tolerant, memory‑speed distributed file system for sharing data across cluster frameworks.
BlinkDB – an interactive, approximate SQL engine that trades query accuracy for faster response times on massive data.
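The accuracy-for-speed trade mentioned for BlinkDB can be illustrated with plain uniform sampling. This is not BlinkDB's implementation, only the general idea of answering an aggregate query from a sample; the function name `approximate_mean`, the 1% sample fraction, and the synthetic data are assumptions made for the example.

```python
import random


def approximate_mean(values, sample_fraction, seed=0):
    """Estimate the mean from a uniform random sample instead of a full scan."""
    rng = random.Random(seed)
    sample_size = max(1, int(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)
    return sum(sample) / sample_size


# Synthetic data: 100,000 values cycling through 0..99.
latencies = [float(i % 100) for i in range(100_000)]

exact = sum(latencies) / len(latencies)
estimate = approximate_mean(latencies, sample_fraction=0.01)
```

The estimate touches only 1,000 of the 100,000 values; a larger sample fraction narrows the error at the cost of more work.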
Cloudera
Cloudera provides the classic Hadoop solution suite.
HDP (Hortonworks Data Platform)
Hortonworks’ reference architecture for enterprise Hadoop deployments.
Redshift
Amazon Redshift, based on ParAccel, is a massively parallel data‑warehouse solution with a SQL interface, tightly integrated with AWS services and capable of high performance from terabyte to petabyte scales.
Netflix
Netflix runs a fully AWS‑based data processing stack.
Intel
