
From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines three stages in the development of big data technology: early storage and batch processing, market-driven data integration with Hive, and today's focus on speed with Spark, Impala and streaming. It also walks through Hadoop ecosystem components such as HDFS, MapReduce and KV stores, and emerging solutions like YDB.

MaGe Linux Operations

May 3, 2017

Three Stages of Big Data Technology Development

Starting in 2009, the BAT companies (Baidu, Alibaba, Tencent) invested heavily in Hadoop to solve massive data storage and simple analysis. The focus was on logging user behavior, computing basic PV/UV (page view / unique visitor) statistics, and improving storage capacity, cluster scale, and extensibility.

Stage 1 – Store and Wait for Opportunity

Data must be accumulated before it can create value; without data there is nothing to analyze.

Website click‑stream logs contain hidden potential.

Simple PV/UV metrics meet basic needs.

Emphasis shifts to storage capability and scalability.

Stage 2 – Marketization

Attention turns to integrating data into a full‑view warehouse. Hive rises, and over 80% of large‑scale clusters run Hive‑like tasks. Companies first exploit data internally, then open it to external users for collaborative mining.

Stage 3 – Speed Is King

Timeliness and response time become critical; faster processing wins business advantage. New Hadoop‑ecosystem technologies such as Spark, Impala, Kylin, Druid and Storm focus on reducing latency.

Alipay performs instant multi‑dimensional analysis in seconds.

Tencent ads generate real‑time audience segments and ad delivery.

Big Data Technology Ecosystem

Big data is now a national strategy. The Hadoop ecosystem—like a kitchen full of tools—provides storage, processing, and query layers, each with specific strengths and trade‑offs.

HDFS

HDFS (Hadoop Distributed File System) enables storage of petabyte‑scale data across hundreds of machines, presenting a single logical file system to users.

MapReduce, Tez, Spark

After data is stored, it has to be processed. MapReduce is the first-generation batch engine; Tez and Spark are second-generation engines that improve memory usage and reduce disk I/O. The classic MapReduce model (Map → Shuffle → Reduce) is commonly illustrated with a word-frequency count: the map step emits a (word, 1) pair for each word, the shuffle step groups the pairs by word, and the reduce step sums each group.
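The word-frequency example can be sketched in plain Python. This is a single-process illustration of the three phases, not Hadoop's API; the function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical and chosen only for this sketch. On a real cluster the map and reduce steps would run in parallel on many machines and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each word's list of ones into a single count."""
    return {word: sum(values) for word, values in groups.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


if __name__ == "__main__":
    docs = ["big data big clusters", "data at rest"]
    print(word_count(docs))
```

Because each phase only sees keys and values, the same structure could be split across machines: mappers work on separate chunks of input, and each reducer receives all values for the keys assigned to it.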

Pig and Hive

Pig and Hive provide higher‑level scripting and SQL interfaces that compile into MapReduce jobs, allowing developers and analysts to write concise, maintainable code.
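To give a sense of why a SQL layer is more concise, the sketch below expresses the word-frequency count mentioned earlier as a single `GROUP BY` query. It assumes SQLite (via Python's standard `sqlite3` module) as a local stand-in, since Hive needs a cluster; the table name `words` and the function `word_count_sql` are hypothetical. Hive would accept a very similar query and turn it into batch jobs behind the scenes.

```python
import sqlite3


def word_count_sql(words):
    """Count word frequencies with one declarative query instead of hand-written phases."""
    conn = sqlite3.connect(":memory:")  # in-memory database, stand-in for a Hive table
    conn.execute("CREATE TABLE words (word TEXT)")
    conn.executemany("INSERT INTO words (word) VALUES (?)", [(w,) for w in words])
    rows = conn.execute(
        "SELECT word, COUNT(*) AS freq FROM words GROUP BY word ORDER BY word"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(word_count_sql(["big", "data", "big", "clusters"]))
```

The analyst states what result is wanted; how the grouping and counting are distributed is left to the engine.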

Impala, Presto, Drill

These interactive SQL engines were created because Hive on MapReduce is too slow for ad‑hoc analysis; they sacrifice some fault tolerance for much faster query execution.

Hive on Tez/Spark and SparkSQL

Hive on Tez/Spark and SparkSQL combine the speed of modern engines with familiar SQL syntax, offering a versatile, single‑system solution.

Storm

Streaming computation (e.g., Storm) processes data in real time as it arrives, enabling sub-second analytics. The cost is flexibility: the computation has to be defined before the data flows through.
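A minimal sketch of the streaming idea, in plain Python rather than Storm's API: each event updates the result the moment it arrives, so an up-to-date answer is always available without re-reading stored data. The function name `running_counts` and the page-view events are assumptions made for illustration.

```python
from collections import Counter


def running_counts(events):
    """Yield (page, count so far) immediately after each incoming event."""
    counts = Counter()
    for page in events:
        counts[page] += 1          # update state incrementally, one event at a time
        yield page, counts[page]   # the fresh result is available right away


if __name__ == "__main__":
    stream = iter(["/home", "/cart", "/home"])  # stands in for an unbounded event source
    for page, count in running_counts(stream):
        print(page, count)
```

Only the metric coded into the loop is maintained. Events are not kept after they pass, so a question nobody anticipated cannot be answered from this state alone.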

KV Stores (Cassandra, HBase, MongoDB)

Key-value stores provide very fast lookups by key over massive datasets, ideal for scenarios like retrieving a record by ID. The trade-off is limited query power: they generally do not support complex joins, and consistency guarantees vary from system to system.
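The access pattern can be sketched with an in-memory dictionary. This is a toy model, not the API of Cassandra, HBase or MongoDB; the class `TinyKVStore` and its methods are hypothetical. It shows why lookup by ID is cheap while any question about the values themselves degrades into a full scan.

```python
class TinyKVStore:
    """Toy key-value store: fast by key, slow for everything else."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Store or overwrite the value for a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Direct lookup by key: a single hash-table access."""
        return self._data.get(key, default)

    def scan(self, predicate):
        """Query by value: every record has to be examined."""
        return [(k, v) for k, v in self._data.items() if predicate(v)]


if __name__ == "__main__":
    store = TinyKVStore()
    store.put("user:1", {"name": "Ann", "city": "Beijing"})
    store.put("user:2", {"name": "Bo", "city": "Shanghai"})
    print(store.get("user:2"))                           # lookup by ID
    print(store.scan(lambda v: v["city"] == "Beijing"))  # needs a full pass
```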

YDB

YDB applies traditional indexing techniques to big‑data workloads, delivering low‑latency, high‑throughput queries without requiring heavyweight Hadoop clusters.

In the big‑data era, success depends not only on data volume but also on processing speed, cost efficiency, and the right combination of tools.


