
Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

21CTO

Nov 26, 2015

Just as telescopes let us perceive the universe and microscopes let us observe microbes, big data is reshaping how we live and understand the world.

Big Data's 4V Characteristics

A Company’s "Big Data"

As the business expands, unstructured data tied to its processes and rules grows explosively. Examples:

Business systems store 200,000 images daily, consuming 100 GB of disk space per day.

6,000 contract video files are generated daily, each about 250 MB, consuming more than 1 TB of disk space per day.
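
As a quick sanity check on these figures, the sketch below computes the implied average image size and the daily video volume. It assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000 GB), and the variable names are illustrative only.

```python
# Back-of-the-envelope check of the daily storage figures quoted above.
# Assumption: decimal units (1 GB = 1000 MB, 1 TB = 1000 GB).
MB_PER_GB = 1000
GB_PER_TB = 1000

# 200,000 images occupying 100 GB per day.
images_per_day = 200_000
image_storage_gb = 100
avg_image_mb = image_storage_gb * MB_PER_GB / images_per_day

# 6,000 videos of about 250 MB each per day.
videos_per_day = 6_000
video_size_mb = 250
video_storage_tb = videos_per_day * video_size_mb / MB_PER_GB / GB_PER_TB

print(f"average image size: {avg_image_mb:.1f} MB")
print(f"daily video volume: {video_storage_tb:.1f} TB")
```

Under these assumptions the images average about 0.5 MB each, and the videos come to roughly 1.5 TB per day, somewhat above the round 1 TB figure.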

Big Data in the Three Kingdoms

The story of "Borrowing Arrows with Straw Boats" can be read as a big data analogy: ancient astronomers collected wind, cloud, temperature, humidity, illumination, and seasonal data. These large, heterogeneous, unstructured inputs were processed by the human brain to reach a conclusion.

Google’s Distributed Computing Trio

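1. Google File System (GFS) addresses storage by spreading data across many inexpensive machines and keeping redundant copies, which provides both high throughput and protection against hardware failure.

2. MapReduce splits a computation into a map phase, which processes pieces of the input in parallel and emits intermediate key/value pairs, and a reduce phase, which aggregates those intermediate results.

3. BigTable stores structured data on top of the distributed system, scaling to very large tables and balancing load across machines.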
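
To make the BigTable idea more concrete, here is a minimal in-memory sketch of its data model viewed as a sorted map from (row key, column, timestamp) to value. It is an illustration only, not Google's implementation; the class `TinyTable`, its methods, and the sample row keys are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy (row, column, timestamp) -> value store; hypothetical, in-memory only."""

    def __init__(self):
        # row key -> column name -> list of (timestamp, value) versions
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        self._rows[row][column].append((timestamp, value))

    def get(self, row, column):
        """Return the value with the highest timestamp, or None if absent."""
        versions = self._rows.get(row, {}).get(column, [])
        if not versions:
            return None
        return max(versions, key=lambda tv: tv[0])[1]

    def scan(self, start_row, end_row):
        """Yield row keys in sorted order within [start_row, end_row)."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row


table = TinyTable()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")
table.put("org.sample/home", "contents:html", 1, "<html>home</html>")

latest = table.get("com.example/index", "contents:html")
com_rows = list(table.scan("com.", "com.~"))
print(latest, com_rows)
```

Keeping rows sorted by key is what makes range scans cheap and lets a large table be cut into contiguous key ranges that can be spread across machines.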

Hadoop Architecture

HDFS Read Process

Client sends a file read request to the NameNode.

NameNode returns DataNode locations.

Client reads the file from the DataNodes.
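
The read steps above can be sketched as a small in-memory simulation. This is not the Hadoop client API: `ToyNameNode`, `ToyDataNode`, and `read_file` are hypothetical names, and real HDFS additionally handles replica selection, checksums, and streaming.

```python
class ToyDataNode:
    """Holds block contents keyed by block id (hypothetical stand-in)."""

    def __init__(self):
        self.blocks = {}


class ToyNameNode:
    """Holds only metadata: file path -> ordered list of (block id, DataNode name)."""

    def __init__(self):
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(path, namenode, datanodes):
    # Steps 1 and 2: ask the NameNode where the blocks live.
    locations = namenode.get_block_locations(path)
    # Step 3: fetch each block directly from its DataNode, in order.
    return b"".join(datanodes[dn].blocks[block_id] for block_id, dn in locations)


datanodes = {"dn1": ToyDataNode(), "dn2": ToyDataNode()}
datanodes["dn1"].blocks["blk_0"] = b"hello "
datanodes["dn2"].blocks["blk_1"] = b"world"

namenode = ToyNameNode()
namenode.block_map["/logs/a.txt"] = [("blk_0", "dn1"), ("blk_1", "dn2")]

content = read_file("/logs/a.txt", namenode, datanodes)
print(content)
```

Note that the file data never passes through the NameNode in this sketch; it serves only metadata, which mirrors the division of labor in the steps above.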

HDFS Write Process

Client sends a file write request to the NameNode.

NameNode, based on file size and block configuration, returns a set of DataNodes.

Client splits the file into blocks and writes each block sequentially to the assigned DataNodes.
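
A minimal sketch of the write steps, under stated assumptions: the block size of 4 bytes and replication factor of 2 are toy values chosen for readability, the round-robin placement is a simplification rather than HDFS's rack-aware policy, and all function names are hypothetical.

```python
from itertools import cycle


def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def allocate_datanodes(num_blocks, datanode_names, replication):
    """Toy NameNode policy: round-robin placement (an assumption for illustration)."""
    ring = cycle(datanode_names)
    return [[next(ring) for _ in range(replication)] for _ in range(num_blocks)]


def write_file(data, datanode_names, block_size=4, replication=2):
    # Each DataNode is modeled as a dict of block id -> block bytes.
    storage = {name: {} for name in datanode_names}
    blocks = split_into_blocks(data, block_size)
    plan = allocate_datanodes(len(blocks), datanode_names, replication)
    # Write blocks one after another to the DataNodes chosen for each block.
    for index, (block, targets) in enumerate(zip(blocks, plan)):
        for name in targets:
            storage[name][f"blk_{index}"] = block
    return plan, storage


plan, storage = write_file(b"0123456789", ["dn1", "dn2", "dn3"])
print(plan)
```

In real HDFS the client sends each block to the first DataNode, which forwards it along a pipeline to the other replicas; the nested loop here only stands in for that behavior.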

MapReduce – Mapping and Reducing Model

Input data → Map (split tasks) → Execute & return results → Reduce (aggregate) → Output results
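
The flow above can be illustrated with the classic word count, written here as plain single-process Python rather than against the Hadoop API. The function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative; in a real cluster the framework runs the map and reduce calls in parallel on many machines and performs the grouping step itself.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key."""
    return key, sum(values)


lines = ["big data big ideas", "data at scale"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```

Because each map call looks at one line and each reduce call looks at one key, both phases can be distributed without the tasks needing to talk to each other.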

HBase – Distributed Data Storage System

Client uses HBase RPC to communicate with HMaster and HRegionServer.

ZooKeeper coordinates services; HMaster monitors HRegionServer health.

HMaster manages table-level operations (creating, altering, and deleting tables) and assigns regions to HRegionServers.

HRegionServer handles I/O requests, reading/writing data from HDFS.

HRegion is the basic unit of distribution: a contiguous range of a table's rows. A table is split into one or more HRegions.

HStore consists of MemStore and StoreFile.

HLog is the write-ahead log: every write is appended to the log file as well as to the MemStore, so data can be recovered if an HRegionServer fails.
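
The interplay of HLog, MemStore, and StoreFile can be sketched as follows. This is a toy model, not HBase code: `ToyStore` is a hypothetical class, and flushing after a fixed number of rows is an assumption made for brevity (HBase flushes based on MemStore size).

```python
class ToyStore:
    """Toy HStore: a MemStore plus flushed StoreFiles, guarded by a write-ahead log."""

    def __init__(self, flush_threshold=2):
        self.hlog = []          # write-ahead log entries (stand-in for HLog)
        self.memstore = {}      # recent writes held in memory
        self.storefiles = []    # immutable flushed snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.hlog.append((row, value))   # log first, so the write survives a crash
        self.memstore[row] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Persist the MemStore as a sorted, immutable StoreFile and start afresh.
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row):
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):  # newest file wins
            if row in storefile:
                return storefile[row]
        return None


store = ToyStore()
store.put("row1", "a")
store.put("row2", "b")   # reaches the threshold and triggers a flush
store.put("row1", "c")   # newer value stays in the MemStore
print(store.get("row1"), store.get("row2"), len(store.storefiles))
```

Reads check the MemStore before the StoreFiles, so the most recent write to a row is returned even when an older version has already been flushed.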

Other NoSQL Products

Why Use NoSQL?

High‑concurrency website DB evolution illustrated with diagrams.

Hadoop 2.0

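Classic MapReduce components: JobTracker (coordinates jobs) and TaskTracker (executes task splits). In Hadoop 2.0, YARN takes over these roles with the ResourceManager, a per-application ApplicationMaster, and NodeManagers.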

Big Data Technology Areas

Tencent Big Data Status (2014)

Tencent Big Data Platform Architecture

Company Data Processing Platform Architecture

Company Big Data Platform Diagram

Application 1 – Data Analysis

Application 2 – Video Storage

Application 3 – Offline Log Analysis

Application 5 – Online Data Analysis

Reference: JD.com’s Samza streaming computation practice.
