Spark vs Hadoop, Flink, HBase/Cassandra, Kafka & Tachyon: Expert Q&A
During a lively “Sit and Discuss” session, experts compared Spark with Hadoop, evaluated Flink against Spark, contrasted HBase with Cassandra, explained why Kafka (often with Flume in front) is their preferred distributed message queue, and shared their views on Tachyon’s role in the big‑data ecosystem.
Introduction
“Sit and Discuss” is a rotating Q&A format: after answering a question, the responder poses the next question to another participant.
During the Big Data themed week, several leading experts from China took part, and their answers were well received.
Key Questions
The discussion covered the following questions:
Can Spark replace Hadoop?
Compare Flink with Spark.
Views on HBase and Cassandra.
Which distributed message queue do you choose for stream processing and why?
Thoughts on Tachyon?
Q1. Can Spark replace Hadoop?
Hadoop comprises Common, HDFS, YARN, and MapReduce. Spark has never claimed to replace Hadoop; at most, it can replace MapReduce.
Hadoop has evolved into an ecosystem that incorporates frameworks like Spark, which integrates seamlessly with HDFS and runs on YARN.
Spark also works with other systems such as Elasticsearch and Cassandra, so Hadoop is not a prerequisite for using Spark, though Spark fits well within the Hadoop ecosystem.
Q2. Compare Flink with Spark
As a Spark evangelist in China, I have long followed Flink, which is well known in Europe.
Both Flink and Spark aim to be unified compute engines for batch, streaming, machine learning, and graph processing, and both integrate with the Hadoop ecosystem (e.g., HDFS, YARN).
However, their approaches differ:
Spark emphasized “in‑memory computing” at launch, while Flink follows an MPP‑like architecture.
Flink claims true real‑time processing by handling records one at a time; Spark Streaming processes data in micro‑batches.
Flink’s incremental iteration processes only changed data, reducing work in later iterations.
Flink also implements its own memory management, dividing the heap into Network Buffers, Memory Manager Pool, and remaining heap.
Spark’s optimization project, Tungsten, also performs its own memory management and other aggressive optimizations, reflected in versions 1.4–1.6.
Spark executes jobs as directed acyclic graphs (DAGs), whereas Flink’s dataflow graphs may contain cycles, which is how it supports iteration natively.
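The incremental iteration mentioned in the list above can be illustrated with a toy connected-components computation. This is a plain-Python sketch, not Flink code: the function names `bulk_iteration` and `delta_iteration`, the work counters, and the sample graph are all hypothetical. The bulk version re-evaluates every vertex in every pass, while the delta version keeps a workset of vertices whose label changed in the previous pass and touches only those.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Undirected adjacency sets built from an edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def bulk_iteration(vertices, edges):
    """Re-evaluate every vertex in every pass until no label changes."""
    adj = build_adjacency(edges)
    labels = {v: v for v in vertices}
    work = 0
    changed = True
    while changed:
        changed = False
        new_labels = {}
        for v in vertices:
            work += 1
            best = min([labels[v]] + [labels[n] for n in adj[v]])
            new_labels[v] = best
            if best != labels[v]:
                changed = True
        labels = new_labels
    return labels, work


def delta_iteration(vertices, edges):
    """Process only the vertices whose label changed in the previous pass."""
    adj = build_adjacency(edges)
    labels = {v: v for v in vertices}   # the "solution set"
    workset = set(vertices)             # vertices with fresh changes
    work = 0
    while workset:
        pending = {}                    # smaller labels proposed in this pass
        for v in workset:
            work += 1
            for n in adj[v]:
                if labels[v] < pending.get(n, labels[n]):
                    pending[n] = labels[v]
        labels.update(pending)          # apply the changes after the pass
        workset = set(pending)          # only changed vertices carry on
    return labels, work


# Hypothetical sample graph: a chain 1-2-3-4 and a separate pair 5-6.
VERTICES = [1, 2, 3, 4, 5, 6]
EDGES = [(1, 2), (2, 3), (3, 4), (5, 6)]

bulk_labels, bulk_work = bulk_iteration(VERTICES, EDGES)
delta_labels, delta_work = delta_iteration(VERTICES, EDGES)
print(bulk_labels == delta_labels)  # True
print(bulk_work, delta_work)        # 24 13
```

Both variants reach the same labels, but on this graph the delta version evaluates 13 vertices instead of 24, because later passes contain only the few vertices that are still changing.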
Q3. Views on HBase and Cassandra
Some engineers believe that Facebook abandoned Cassandra; this is a misconception.
HBase and Cassandra share many similarities:
Both are column‑oriented (column‑family) stores.
Both write to a log, then to an in‑memory structure, and finally flush to disk using LSM‑tree based files (HLog → MemStore → StoreFile for HBase; CommitLog → Memtable → SSTable for Cassandra).
…
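The shared write path (log, then in-memory structure, then flush to sorted files) can be sketched in a few lines. This is a toy model under simplifying assumptions, not the storage engine of either system: the class name `TinyLSMStore`, the `memtable_limit` parameter, and the use of Python lists in place of on-disk files are all hypothetical, and compaction, deletes, and crash recovery are left out.

```python
import bisect


class TinyLSMStore:
    """Toy LSM-style store: commit log -> memtable -> sorted immutable tables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []    # append-only log (HLog / CommitLog)
        self.memtable = {}      # in-memory buffer (MemStore / Memtable)
        self.sstables = []      # flushed sorted tables, oldest first (StoreFile / SSTable)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.commit_log.append((key, value))   # 1. log the write for durability
        self.memtable[key] = value             # 2. apply it in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                       # 3. flush when the buffer is full

    def flush(self):
        """Write the memtable out as one sorted, immutable table."""
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log.clear()  # flushed writes no longer need replaying

    def get(self, key):
        """Check the memtable first, then the tables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


store = TinyLSMStore(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)      # second write fills the memtable and triggers a flush
store.put("a", 3)      # newer value for "a" stays in the memtable for now
print(store.get("a"))  # 3
print(store.get("b"))  # 2
```

Reads consult the newest data first, so an updated key shadows the older value that still sits in a flushed table.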
Key differences include:
HBase relies on ZooKeeper; Cassandra is self‑sufficient.
HBase has a Master node; Cassandra uses seed nodes.
HBase obtains metadata from ZooKeeper; Cassandra uses Gossip for communication.
HBase requires HDFS; Cassandra does not.
Cassandra supports secondary indexes natively; HBase does not.
HBase has coprocessors; Cassandra lacks them.
Overall, HBase is more centralized, while Cassandra is decentralized.
Choice depends on specific requirements.
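To make the decentralized side of this comparison concrete, the sketch below simulates how gossip lets every node learn about every other node without a master. It is a simplified push-pull exchange in plain Python, not Cassandra's actual gossip messages: the node names, the `(version, status)` tuples, and all function names are hypothetical.

```python
import random


def initial_views(names):
    """Each node starts out knowing only its own state."""
    return {name: {name: (1, "UP")} for name in names}


def merge(local, remote):
    """Adopt any entry that is new or carries a higher version."""
    for node, (version, status) in remote.items():
        if node not in local or local[node][0] < version:
            local[node] = (version, status)


def gossip_round(views, rng):
    """Every node exchanges its view with one randomly chosen peer."""
    nodes = sorted(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merge(views[peer], views[node])   # push what this node knows
        merge(views[node], views[peer])   # pull what the peer knows


def run_until_converged(views, rng, max_rounds=100):
    """Return the number of rounds until all views agree, or None."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(views, rng)
        first = next(iter(views.values()))
        if all(view == first for view in views.values()):
            return round_no
    return None


rng = random.Random(42)   # fixed seed so the run is repeatable
views = initial_views(["n1", "n2", "n3", "n4", "n5"])
rounds = run_until_converged(views, rng)
print(rounds, sorted(views["n1"]))
```

No node plays a special role here: cluster state spreads through pairwise exchanges alone, which is the property that distinguishes this design from one coordinated by a master and ZooKeeper.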
Q4. Which distributed message queue do you prefer for stream processing?
Kafka (often with Flume in front) is my choice; it satisfies the needs for speed, scalability, and durability.
Kafka’s advantages:
Writes data to disk but leverages the OS page cache; reads use sendfile for zero‑copy transfer.
Scales easily by adding brokers.
Topics and partitions provide logical separation; from version 0.9 onward, consumers no longer need ZooKeeper for offset tracking.
Flexible deletion policies: max‑age, max‑size, or key‑based compaction.
Active community and broad support from streaming frameworks such as Spark Streaming and Storm.
Tools like Camus can periodically move Kafka data to HDFS.
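The three deletion policies listed above can be modeled on a single in-memory partition log. This is an illustrative sketch only: the `Record` tuple and the function names are hypothetical, and real Kafka removes whole log segments rather than individual records. In Kafka itself these policies are driven by topic settings such as `retention.ms`, `retention.bytes`, and `cleanup.policy=compact`.

```python
from collections import namedtuple

# Hypothetical record layout for one partition's log.
Record = namedtuple("Record", "offset timestamp key value size")


def retain_by_age(log, now, max_age):
    """Max-age policy: keep records no older than max_age."""
    return [r for r in log if now - r.timestamp <= max_age]


def retain_by_size(log, max_bytes):
    """Max-size policy: drop the oldest records until the log fits."""
    kept, total = [], 0
    for r in reversed(log):          # walk from newest to oldest
        if total + r.size > max_bytes:
            break
        kept.append(r)
        total += r.size
    return list(reversed(kept))      # restore offset order


def compact(log):
    """Key-based compaction: keep only the latest record for each key."""
    latest = {}
    for r in log:
        latest[r.key] = r.offset     # later offsets overwrite earlier ones
    return [r for r in log if latest[r.key] == r.offset]


log = [
    Record(0, 100, "user1", "a", 10),
    Record(1, 200, "user2", "b", 10),
    Record(2, 300, "user1", "c", 10),
    Record(3, 400, "user3", "d", 10),
]

print([r.offset for r in retain_by_age(log, now=450, max_age=200)])  # [2, 3]
print([r.offset for r in retain_by_size(log, max_bytes=25)])         # [2, 3]
print([r.offset for r in compact(log)])                              # [1, 2, 3]
```

Age and size limits suit event streams where old data loses value, while compaction suits changelog-style topics where only the latest value per key matters.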
Q5. Thoughts on Tachyon
I tested Tachyon (a distributed in‑memory file system written in Java) early on and see strong potential.
Tachyon stores data in memory and periodically checkpoints to an underlying filesystem (commonly HDFS), mitigating data loss on node failures.
It adopts Spark’s lineage concept for file recovery and provides an HDFS‑compatible interface, allowing MapReduce, Spark, and Flink to use Tachyon with minimal code changes.
Additional features include table support for high‑density column queries.
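The combination of checkpointing and lineage-based recovery described above can be sketched as a toy store. This is a plain-Python model under stated assumptions, not Tachyon's API: the class `TinyLineageStore`, its method names, and the dictionaries standing in for memory and the underlying filesystem are all hypothetical.

```python
class TinyLineageStore:
    """Toy model: in-memory files, checkpoints, and lineage-based recovery."""

    def __init__(self):
        self.memory = {}       # fast in-memory copies
        self.under_fs = {}     # checkpointed copies (stand-in for HDFS)
        self.lineage = {}      # file name -> (function, input file names)
        self.recomputed = []   # names rebuilt from lineage, for illustration

    def write_source(self, name, data):
        """Source data is assumed to be durable in the underlying filesystem."""
        self.under_fs[name] = data
        self.memory[name] = data

    def derive(self, name, func, inputs):
        """Create a file from other files and remember how it was produced."""
        self.lineage[name] = (func, list(inputs))
        self.memory[name] = func(*[self.read(i) for i in inputs])

    def checkpoint(self, name):
        """Persist one file to the underlying filesystem."""
        self.under_fs[name] = self.read(name)

    def lose_memory(self):
        """Simulate a node failure that wipes all in-memory data."""
        self.memory.clear()

    def read(self, name):
        """Serve from memory, else from a checkpoint, else recompute."""
        if name in self.memory:
            return self.memory[name]
        if name in self.under_fs:
            data = self.under_fs[name]
        else:
            func, inputs = self.lineage[name]
            data = func(*[self.read(i) for i in inputs])
            self.recomputed.append(name)
        self.memory[name] = data
        return data


store = TinyLineageStore()
store.write_source("raw", [1, 2, 3, 4])
store.derive("squares", lambda xs: [x * x for x in xs], ["raw"])
store.derive("total", lambda xs: sum(xs), ["squares"])
store.checkpoint("squares")   # "total" is never checkpointed
store.lose_memory()

print(store.read("total"))    # 30
print(store.recomputed)       # ['total']
```

After the simulated failure, the checkpointed file is read back directly, and only the file without a checkpoint is recomputed from its recorded inputs.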
Typical use cases:
Sharing data between Spark jobs.
Sharing data across different frameworks.
Preventing total cache loss when Spark’s BlockManager fails.
Alleviating memory reuse issues.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.