Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem
This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.
Just as telescopes let us perceive the universe and microscopes let us observe microbes, big data is reshaping how we live and understand the world.
Big Data's 4V Characteristics
Company’s "big data"
As the business expands, the volume of unstructured data tied to business processes and rules grows explosively. For example:
Business systems store 200,000 images daily, consuming 100 GB of disk space per day.
6,000 contract video files are generated daily, each about 250 MB, consuming roughly 1.5 TB of disk space per day.
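As a quick sanity check on the figures above, the following sketch multiplies file counts by file sizes. It assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB); the function and variable names are illustrative only. With these assumptions, 6,000 files of 250 MB come to about 1.5 TB per day, and the image figures imply an average image size of about 0.5 MB.

```python
# Back-of-the-envelope storage estimate.
# Assumption: decimal units (1 GB = 1,000 MB; 1 TB = 1,000,000 MB).
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def daily_volume_mb(files_per_day: int, avg_file_mb: float) -> float:
    """Total data written per day, in MB."""
    return files_per_day * avg_file_mb


# Images: 200,000 files filling 100 GB per day implies the average image size.
avg_image_mb = 100 * MB_PER_GB / 200_000

# Videos: 6,000 files of about 250 MB each.
video_tb_per_day = daily_volume_mb(6_000, 250) / MB_PER_TB

print(f"average image size: {avg_image_mb} MB")
print(f"video volume per day: {video_tb_per_day} TB")
```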
Big Data in the Three Kingdoms
The "Straw Boat Borrowing Arrows" story can be read as a big data problem: ancient astronomers collected data on wind, cloud, temperature, humidity, illumination, and the seasons. These inputs were large, heterogeneous, and unstructured, and the human brain processed them to reach a conclusion.
Google’s Distributed Computing Triple
1. Google File System (GFS) solves storage by using many cheap machines with redundancy to achieve both speed and safety.
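2. MapReduce splits a computation into two phases: map, which processes each piece of the input independently and emits intermediate key/value pairs, and reduce, which aggregates the values that share a key.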
3. BigTable provides a scalable solution for storing structured data on distributed systems, handling massive tables and load balancing.
Hadoop Architecture
HDFS Read Process
The client asks the NameNode to read a file.
The NameNode returns the locations of the DataNodes that hold the file's blocks.
The client reads the blocks directly from those DataNodes.
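The read steps above can be sketched as a toy in-memory model. This is not the Hadoop client API: `ToyNameNode`, `ToyDataNode`, `read_file`, the block ids, and the file path are all hypothetical names chosen for illustration. The point it tries to show is the separation of roles: the NameNode only answers "where are the blocks?", while the bytes come from the DataNodes, and a replica on another node can be used when one node is unavailable.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataNode:
    """Holds raw block bytes, keyed by block id (hypothetical model)."""
    name: str
    blocks: dict = field(default_factory=dict)

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


@dataclass
class ToyNameNode:
    """Holds metadata only: path -> ordered [(block id, [DataNode names])]."""
    block_map: dict = field(default_factory=dict)

    def get_block_locations(self, path: str) -> list:
        return self.block_map[path]


def read_file(path: str, namenode: ToyNameNode, live_datanodes: dict) -> bytes:
    """Ask the NameNode for locations, then fetch each block from a DataNode."""
    data = bytearray()
    for block_id, replica_names in namenode.get_block_locations(path):
        for name in replica_names:
            node = live_datanodes.get(name)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break  # one healthy replica is enough
        else:
            raise IOError(f"no live replica for block {block_id}")
    return bytes(data)


dn1 = ToyDataNode("dn1", {"blk_0": b"hello "})
dn2 = ToyDataNode("dn2", {"blk_0": b"hello ", "blk_1": b"world"})
namenode = ToyNameNode(
    {"/logs/a.txt": [("blk_0", ["dn1", "dn2"]), ("blk_1", ["dn3", "dn2"])]}
)
live = {"dn1": dn1, "dn2": dn2}  # dn3 is listed by the NameNode but is down

content = read_file("/logs/a.txt", namenode, live)
print(content)
```

In this run the second block is listed on `dn3` first, but because `dn3` is not reachable the client falls back to the replica on `dn2`.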
HDFS Write Process
The client sends a file write request to the NameNode.
Based on the file size and the configured block size, the NameNode returns a set of DataNodes.
The client splits the file into blocks and writes them, one after another, to the assigned DataNodes.
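The write steps above can be pictured with a similar toy model. Everything here is an assumption made for illustration: the block size is a few bytes so the output stays readable (HDFS blocks are far larger, 128 MB by default in Hadoop 2.x), the replication factor is 2 rather than the HDFS default of 3, and `allocate_targets` uses simple round-robin placement instead of HDFS's rack-aware policy.

```python
BLOCK_SIZE = 4    # bytes; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 2   # replicas per block; the HDFS default is 3


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Client side: cut the file into fixed-size blocks (the last may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def allocate_targets(num_blocks: int, datanode_names: list, replication: int) -> list:
    """Toy NameNode: pick `replication` DataNodes per block, round-robin."""
    n = len(datanode_names)
    return [
        [datanode_names[(b + r) % n] for r in range(replication)]
        for b in range(num_blocks)
    ]


def write_file(data: bytes, datanode_names: list, storage: dict) -> list:
    """Split the file, ask for targets, then store each block on its targets."""
    blocks = split_into_blocks(data, BLOCK_SIZE)
    placement = allocate_targets(len(blocks), datanode_names, REPLICATION)
    for index, (block, targets) in enumerate(zip(blocks, placement)):
        # In HDFS the client sends the block to the first target, which
        # forwards it to the next one; here we simply store it on each.
        for name in targets:
            storage[name][index] = block
    return placement


datanodes = ["dn1", "dn2", "dn3"]
storage = {name: {} for name in datanodes}
placement = write_file(b"abcdefghij", datanodes, storage)

print(placement)
print(storage)
```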
MapReduce – Mapping and Reducing Model
Input data → Map (split tasks) → Execute & return results → Reduce (aggregate) → Output results
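The flow above is often illustrated with a word count. The sketch below runs entirely in one process and only mimics the shape of the model; it is not Hadoop's API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are made up for this example. The intermediate `shuffle` step, which groups the mapped pairs by key before reducing, is implied by the arrow between map and reduce.

```python
from collections import defaultdict


def map_phase(line: str):
    """Map: turn one input record into (key, value) pairs."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list):
    """Reduce: aggregate the values for one key."""
    return key, sum(values)


def run_job(lines: list) -> dict:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = run_job(["big data big", "Data hadoop"])
print(counts)
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread across many machines, which is what makes the model suitable for distributed execution.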
HBase – Distributed Data Storage System
Client uses HBase RPC to communicate with HMaster and HRegionServer.
ZooKeeper coordinates services; HMaster monitors HRegionServer health.
HMaster manages table-level operations such as creating, altering, and deleting tables, and assigns regions to HRegionServers.
HRegionServer handles I/O requests, reading/writing data from HDFS.
HRegion is the basic unit of distribution: a contiguous range of rows from one table, so a table is made up of one or more HRegions.
HStore consists of MemStore and StoreFile.
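HLog is the write-ahead log: every write is appended to the log file and also applied to the MemStore, so data not yet flushed to a StoreFile can be recovered after a failure.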
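The interplay of HLog, MemStore, and StoreFile listed above can be sketched with a toy class. This is a simplified model, not HBase code: `ToyRegion` and its methods are hypothetical, the flush is triggered by entry count rather than by memory size, the log is kept per object whereas HBase keeps it at the region-server level, and plain dictionaries stand in for files on HDFS.

```python
class ToyRegion:
    """Toy write path: log first, then MemStore, then flush to a StoreFile."""

    def __init__(self, flush_threshold: int = 2):
        self.hlog = []          # append-only write-ahead log entries
        self.memstore = {}      # recent writes, held in memory
        self.storefiles = []    # immutable flushed snapshots, oldest first
        self.flush_threshold = flush_threshold

    def put(self, row: str, value: str) -> None:
        self.hlog.append((row, value))   # 1. record the write in the log
        self.memstore[row] = value       # 2. apply it to the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the MemStore out as a new sorted StoreFile and clear it."""
        self.storefiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, row: str):
        """Check the MemStore first, then StoreFiles from newest to oldest."""
        if row in self.memstore:
            return self.memstore[row]
        for storefile in reversed(self.storefiles):
            if row in storefile:
                return storefile[row]
        return None


region = ToyRegion(flush_threshold=2)
region.put("r1", "a")
region.put("r2", "b")   # reaches the threshold, so the MemStore is flushed
region.put("r1", "c")   # newer value for r1 stays in the MemStore

print(region.get("r1"), region.get("r2"), region.get("r9"))
```

The read path returns the newest value for `r1` from the MemStore even though an older value sits in a StoreFile, which is why the lookup order matters.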
Other NoSQL Products
Why Use NoSQL?
High‑concurrency website DB evolution illustrated with diagrams.
Hadoop 2.0
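MapReduce components in Hadoop 1.x: JobTracker (coordinates jobs) and TaskTracker (executes the tasks for each input split). In Hadoop 2.0, YARN takes over these roles with the ResourceManager, per-application ApplicationMasters, and NodeManagers.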
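The division of labour between JobTracker and TaskTracker can be pictured with a toy scheduler. This is a rough sketch under stated assumptions, not Hadoop's scheduling logic: `ToyJobTracker` and `ToyTaskTracker` are hypothetical classes, each "task" simply sums a list of numbers, and slots are filled in a fixed order with no heartbeats, data locality, or failure handling.

```python
from collections import deque


class ToyTaskTracker:
    """Worker that runs the tasks it is handed, up to `slots` per round."""

    def __init__(self, name: str, slots: int):
        self.name = name
        self.slots = slots

    def run(self, split: list) -> int:
        return sum(split)   # stand-in for a real map or reduce task


class ToyJobTracker:
    """Coordinator that hands pending splits to trackers with free slots."""

    def __init__(self, trackers: list):
        self.trackers = trackers

    def run_job(self, splits: list):
        pending = deque(enumerate(splits))
        results, assignments = {}, {}
        while pending:
            for tracker in self.trackers:
                for _ in range(tracker.slots):
                    if not pending:
                        break
                    index, split = pending.popleft()
                    results[index] = tracker.run(split)
                    assignments[index] = tracker.name
        return results, assignments


job_tracker = ToyJobTracker([ToyTaskTracker("A", slots=2), ToyTaskTracker("B", slots=1)])
results, assignments = job_tracker.run_job([[1, 2], [3, 4], [5], [6, 7]])

print(results)
print(assignments)
```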
Big Data Technology Areas
Tencent Big Data Status (2014)
Tencent Big Data Platform Architecture
Company Data Processing Platform Architecture
Company Big Data Platform Diagram
Application 1 – Data Analysis
Application 2 – Video Storage
Application 3 – Offline Log Analysis
Application 5 – Online Data Analysis
Reference: JD.com’s Samza streaming computation practice.
21CTO
