
Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

21CTO

Nov 19, 2015


1. Background

Hadoop is an open‑source data‑analysis platform for the reliable storage and processing of big data, meaning data too large for a single machine to store or to process in the required time. It is well suited to unstructured data, and its core components are HDFS and MapReduce.

HDFS provides a distributed, elastic data‑storage system across servers.

MapReduce defines a standardized processing flow that is data‑locality aware: read data, map, shuffle by key, and reduce to produce the final output.
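The read, map, shuffle and reduce steps can be illustrated with a small single‑process word count. This is only a sketch of the data flow, not Hadoop's API: the function names (map_phase, shuffle_phase, reduce_phase, run_job) are hypothetical, and nothing here models distribution or data locality.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a final result."""
    return key, sum(values)

def run_job(records):
    """Chain the three phases over an iterable of input records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

lines = ["big data big clusters", "data locality matters"]
word_counts = run_job(lines)
```

In a real cluster the map tasks run on the machines that already hold the input blocks, and the shuffle moves data over the network so that all values for one key reach the same reducer.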

Amazon Elastic MapReduce (EMR) is a hosted solution running on EC2 and S3. It can reduce cost for occasional large‑scale jobs but is tightly optimized for data stored in S3 and may incur higher latency.

Hadoop’s ecosystem also includes extensions such as Sqoop, Flume, Hive, Pig, Mahout, Datafu and HUE.

Pig is a platform for analyzing large data sets using a high‑level language and execution infrastructure.

Hive is a data‑warehouse system for Hadoop that offers an SQL‑like query language for aggregation, ad‑hoc queries and analysis.

HBase is a distributed, scalable big‑data store that supports random real‑time read/write access.

Sqoop is a tool designed for efficient bulk transfer between Hadoop and structured data stores such as relational databases.

Flume is a distributed, reliable service for efficiently collecting, aggregating and moving large volumes of log data.

ZooKeeper provides centralized services for configuration information, naming, distributed synchronization and group services.

Cloudera is the most mature Hadoop distribution, offering strong deployment, management and monitoring tools, and contributes the real‑time processing project Impala.

Hortonworks delivers a 100 % open‑source Apache Hadoop offering and contributes many enhancements that allow Hadoop to run natively on platforms such as Windows Server and Azure.

MapR focuses on performance and usability by supporting the native Unix file system instead of HDFS, offering high‑availability features such as snapshots and stateful failover, and leads the Apache Drill project, an open‑source implementation of Google’s Dremel for SQL‑like real‑time queries.

2. Principles

Data storage aims for reliability, massive scalability and ease of maintenance. The storage hierarchy dictates that storage closer to the processor (memory) is faster but more expensive, while storage farther from it (SSD, disk) is cheaper but slower.

Compared with memory, disk and SSD require careful data placement because their performance varies greatly with the access pattern. Disks provide persistence, low unit cost and easy backup. As memory has become cheaper, many data sets can be kept entirely in memory and distributed across machines, often using key‑value stores such as Memcached.

In‑memory data can be made durable with battery‑backed RAM, write‑ahead logs, snapshots, or replication across machines; on restart, state is restored from disk or over the network. Writes are typically appended to a log on disk, while reads are served directly from memory. Systems such as RAMCloud and the in‑memory relational databases VoltDB and MemSQL offer high performance and avoid much of the overhead of disk management.
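The write‑ahead‑log pattern described above can be sketched as a toy key‑value store. The class name InMemoryStore, the JSON‑lines log format and the file name store.wal are all assumptions made for illustration; none of the systems named above is claimed to work exactly this way.

```python
import json
import os
import tempfile

class InMemoryStore:
    """Toy key-value store: reads hit a dict, writes are logged to disk first."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._replay()
        self.log = open(self.log_path, "a", encoding="utf-8")

    def _replay(self):
        """On restart, rebuild the in-memory state from the log."""
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.data[entry["key"]] = entry["value"]

    def put(self, key, value):
        # Append to the log and force it to disk before touching memory.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads never touch the disk.
        return self.data.get(key)

    def close(self):
        self.log.close()

log_file = os.path.join(tempfile.mkdtemp(), "store.wal")
store = InMemoryStore(log_file)
store.put("user:1", "alice")
store.put("user:1", "bob")
store.close()

recovered = InMemoryStore(log_file)  # simulates a process restart
recovered_value = recovered.get("user:1")
recovered.close()
```

Because the log only grows, a production system would pair it with periodic snapshots so that recovery does not have to replay every write ever made.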

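Probabilistic algorithms such as HyperLogLog, Bloom Filter, and Count‑Min Sketch are widely used in big‑data scenarios because they trade a small, bounded error for large savings in memory. All three hash their input; Bloom Filter and Count‑Min Sketch use several independent hash functions. HyperLogLog estimates the cardinality of a large set by tracking the maximum number of leading zeros seen in hashed values. Bloom Filter sets the bits at an element's hashed positions when the element is inserted; a lookup checks those bits to determine possible presence, allowing false positives but no false negatives. Count‑Min Sketch applies a similar idea with counters instead of bits to estimate the frequency of an element; it may overestimate a count but never underestimates it.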
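A minimal sketch of a Bloom filter and a Count‑Min Sketch follows. The sizes, the number of hash functions and the helper _hashes (which derives several hash functions by salting SHA‑256) are illustrative assumptions, not tuned or standard parameters.

```python
import hashlib

def _hashes(item, count, width):
    """Derive `count` bucket indexes in [0, width) by salting SHA-256."""
    for seed in range(count):
        digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
        yield int.from_bytes(digest[:8], "big") % width

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def add(self, item):
        for idx in _hashes(item, self.num_hashes, self.size):
            self.bits[idx] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[idx]
                   for idx in _hashes(item, self.num_hashes, self.size))

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def add(self, item, count=1):
        # Each row uses its own hash function (seed == row index).
        for row, idx in enumerate(_hashes(item, self.depth, self.width)):
            self.rows[row][idx] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best bound.
        return min(self.rows[row][idx]
                   for row, idx in enumerate(_hashes(item, self.depth, self.width)))

bloom = BloomFilter()
sketch = CountMinSketch()
for word in ["spark", "flink", "spark", "kafka", "spark"]:
    bloom.add(word)
    sketch.add(word)
```

Here bloom.might_contain("spark") is guaranteed to be True, and sketch.estimate("spark") is guaranteed to be at least 3; any error is one‑sided in both structures.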

Distributed systems must balance consistency, availability, and partition tolerance, typically sacrificing one. Advanced algorithms and theories such as Paxos, Gossip protocols, Quorum, logical clocks, vector clocks, Byzantine fault tolerance, and two‑phase commit require careful study.
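Two of the ideas listed above are small enough to sketch directly: vector clocks, which detect whether two versions are causally ordered or concurrent, and the quorum condition R + W > N. The function names below are hypothetical, and clocks are modelled as plain dictionaries from node name to counter.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def merge(a, b):
    """Element-wise maximum, applied when a node receives another's clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

def quorum_overlaps(n, r, w):
    """Read and write quorums are guaranteed to intersect when R + W > N."""
    return r + w > n

# Node A writes, then B receives A's version and writes on top of it.
v1 = increment({}, "A")              # {"A": 1}
v2 = increment(merge(v1, {}), "B")   # {"A": 1, "B": 1}
# Meanwhile A writes again without having seen B's update.
v3 = increment(v1, "A")              # {"A": 2}

relation_v1_v2 = compare(v1, v2)     # v1 happened before v2
relation_v2_v3 = compare(v2, v3)     # neither saw the other: a conflict
```

A "concurrent" result is the signal that the system, or the application, has to reconcile two divergent versions.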

3. Technologies

Big‑data architectures must be flexible to meet varying latency requirements (SLA), data volume, update frequency and analytical needs. The diagram below illustrates component choices across different domains.

Google’s “new troika” (Spanner, F1 and Dremel) is foundational to this generation of systems.

Spanner is a globally distributed, multi‑version, synchronously replicated database that supports externally consistent transactions and is designed to scale across hundreds of data centers and trillions of rows.

F1 builds on Spanner, adding distributed SQL, transactional consistency and secondary indexes; it replaced a legacy MySQL sharding solution in Google’s AdWords.

Dremel enables interactive analysis of petabyte‑scale data across thousands of servers using an SQL‑like language, delivering results in seconds.

Spark

Spark, popular since 2014, focuses on in‑memory computation for faster analytics and also supports graph, streaming and batch processing. It originated from Berkeley AMP Lab and is commercialized by Databricks.

Flink

Flink borrows query‑optimization techniques from SQL databases, which sets it apart from current versions of Spark: it can apply a global optimization plan to an individual query for better performance.

Kafka

Kafka, described as LinkedIn’s “central nervous system,” manages streams of data from many applications, processes them in near‑real time, and distributes results. It powers real‑time pipelines at LinkedIn, Netflix, Uber and Verizon.

Storm

Storm, used by Twitter, is a distributed, fault‑tolerant real‑time computation framework that simplifies continuous stream processing for analytics, online machine learning, ETL and more.

Samza

Samza, promoted by LinkedIn, integrates tightly with Kafka and serves as a complementary stream‑processing engine.

Lambda Architecture

The Lambda Architecture combines batch processing for high‑volume, high‑latency data with real‑time stream processing for low‑latency data, merging the results in a serving layer to achieve both scalability and timeliness.
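The interplay of the three layers can be sketched with a page‑view counter. The class names, the event shape ({"page": ...}) and the ingest helper are assumptions chosen for illustration; in practice the batch layer might be a Hadoop job and the speed layer a Storm topology.

```python
from collections import Counter

class BatchLayer:
    """Recomputes a complete view from the master dataset (slow, thorough)."""

    def __init__(self):
        self.master = []       # append-only record of every event
        self.view = Counter()

    def append(self, event):
        self.master.append(event)

    def recompute(self):
        self.view = Counter(event["page"] for event in self.master)

class SpeedLayer:
    """Incrementally counts events the batch view has not absorbed yet."""

    def __init__(self):
        self.view = Counter()

    def update(self, event):
        self.view[event["page"]] += 1

    def reset(self):
        self.view.clear()

class ServingLayer:
    """Answers queries by merging the batch view with the real-time view."""

    def __init__(self, batch, speed):
        self.batch = batch
        self.speed = speed

    def page_views(self, page):
        return self.batch.view[page] + self.speed.view[page]

batch, speed = BatchLayer(), SpeedLayer()
serving = ServingLayer(batch, speed)

def ingest(event):
    """Every event goes to both layers."""
    batch.append(event)
    speed.update(event)

for page in ["/home", "/docs", "/home"]:
    ingest({"page": page})
before_batch = serving.page_views("/home")  # answered by the speed layer alone

batch.recompute()   # the batch view catches up...
speed.reset()       # ...so the real-time view can be discarded
ingest({"page": "/home"})
after_batch = serving.page_views("/home")   # batch view plus recent events
```

The cost of this design is that the counting logic exists twice, once per layer, which is the duplication that motivates unified models.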

Summingbird

Summingbird, developed at Twitter, enables a single programming model to run on both batch and stream systems, reducing the overhead of maintaining separate pipelines.

NoSQL

Traditional relational databases struggle with many‑to‑many relationships; NoSQL databases such as Cassandra, MongoDB and Couchbase address various use cases with document, graph, column‑family and key‑value models, but no single solution fits all scenarios.

Cassandra

Cassandra, a column‑family store, offers high availability and eventual consistency, supporting massive clusters with petabyte‑scale data.

SQL on Hadoop

Projects like Apache Hive, Spark SQL, Cloudera Impala, Hortonworks Stinger, Facebook Presto, Apache Tajo and Apache Drill bring SQL‑style querying to Hadoop, some inspired by Google Dremel.

Impala

Impala, developed by Cloudera, provides low‑latency SQL queries over data in HDFS and HBase, claiming 5‑10× speedup over Hive, though Spark’s popularity is eclipsing it.

Drill

Drill is the open‑source Apache implementation of Dremel, designed for interactive analysis of large data sets.

Druid

Druid is an open‑source, column‑oriented, distributed data store optimized for real‑time analytics on billions of rows.

Berkeley Data Analytics Stack (BDAS)

Beyond Spark, BDAS includes projects such as Mesos (a cluster resource manager for frameworks such as Hadoop, MPI and Spark), Tachyon (a fault‑tolerant, memory‑centric distributed file system), BlinkDB (an approximate query engine that trades accuracy for speed), and others.

Cloudera

Cloudera remains the leading Hadoop distribution, offering comprehensive solutions.

HDP (Hortonworks Data Platform)

HDP, from Hortonworks, provides a curated stack for enterprise Hadoop deployments.

Redshift

Amazon Redshift, based on ParAccel, is a massively parallel data‑warehouse service with a SQL interface, tightly integrated with AWS services and capable of high performance from terabyte to petabyte scales, especially when using SSD storage.
