Understanding Data: Types, Systems, and Big Data Technologies
This article explains what data is, classifies it into structured, semi‑structured and unstructured forms, describes data mining, databases, data warehouses, the full data lifecycle, and surveys the big‑data ecosystem including storage, batch and real‑time processing, resource scheduling, and visualization technologies.
Data has become a critical asset for product development and company growth. It encompasses user information, behavior logs, user-generated content (UGC), and media in formats such as text, images, video, and audio.
From a technical perspective, data is divided into three categories: structured data that fits relational tables, semi‑structured data with flexible schemas, and unstructured data that cannot be easily represented in two‑dimensional tables.
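The three categories can be made concrete with a small Python sketch. The sample values and the `classify` helper are hypothetical and exist only for illustration; real systems decide the category from the source format, not from a Python type.

```python
import json

# Structured: fixed columns, every row has the same fields (fits a relational table).
structured_row = ("u1001", "alice", 29)  # (user_id, name, age), made-up sample

# Semi-structured: self-describing, fields may differ from record to record.
semi_structured = json.loads(
    '{"user_id": "u1001", "tags": ["new", "mobile"], "profile": {"city": "Hangzhou"}}'
)

# Unstructured: raw bytes or free text with no schema.
unstructured = b"\x89PNG\r\n\x1a\n"  # the 8-byte signature that starts a PNG file


def classify(value):
    """Rough classification by Python type; an illustrative assumption only."""
    if isinstance(value, tuple):
        return "structured"
    if isinstance(value, (dict, list)):
        return "semi-structured"
    return "unstructured"
```

Here the tuple maps directly onto one table row, the parsed JSON carries nested and optional fields, and the image bytes have no tabular shape at all.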
Data mining extracts useful knowledge from massive datasets, supporting statistical analysis and machine‑learning‑based personalization, recommendation, and marketing.
Databases (relational and NoSQL) store online business data, while data warehouses, built on technologies like Hive, store historical data for analytical workloads.
ETL (extract‑transform‑load) moves data from source systems into warehouses, and the data lifecycle progresses from hot (frequently accessed) to warm and finally cold data, each requiring different technical solutions.
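A minimal sketch of both ideas follows, assuming an in-memory SQLite database as a stand-in for the warehouse. The table name `orders_fact`, the sample rows, and the 7-day and 90-day thresholds are hypothetical choices, not values from the original article.

```python
import sqlite3
from datetime import date


def temperature(last_access, today, hot_days=7, warm_days=90):
    """Classify data by days since last access (thresholds are assumptions)."""
    age = (today - last_access).days
    if age <= hot_days:
        return "hot"
    if age <= warm_days:
        return "warm"
    return "cold"


def run_etl(source_rows, warehouse):
    """Extract rows, transform them, and load them into the warehouse table."""
    # Transform: drop rows with a missing amount, normalize ids, cast amounts.
    cleaned = [
        (user_id.strip().lower(), int(amount))
        for user_id, amount in source_rows
        if amount is not None
    ]
    # Load: bulk insert into the fact table.
    warehouse.executemany(
        "INSERT INTO orders_fact (user_id, amount) VALUES (?, ?)", cleaned
    )
    warehouse.commit()
    return len(cleaned)


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
warehouse.execute("CREATE TABLE orders_fact (user_id TEXT, amount INTEGER)")

# Extract: made-up rows as they might arrive from an online business database.
loaded = run_etl([(" U1 ", "10"), ("u2", None), ("u1", 5)], warehouse)
total = warehouse.execute(
    "SELECT SUM(amount) FROM orders_fact WHERE user_id = 'u1'"
).fetchone()[0]
```

In practice the temperature label would drive placement, for example keeping hot data in a low-latency store and moving cold data to cheaper archival storage.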
A complete data system consists of data collection (instrumentation, also called event tracking), queuing (e.g., Kafka, RabbitMQ), processing (offline batch or real‑time streaming), storage (MySQL, sharded databases, or big‑data stores), and visualization (dashboards and charts, for example with ECharts).
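The stages above can be sketched end to end in a few lines. This is a toy, single-process illustration: `queue.Queue` stands in for Kafka or RabbitMQ, a dictionary stands in for MySQL or a big-data store, and a text bar chart stands in for a dashboard. All function names and events are hypothetical.

```python
import queue
from collections import Counter

event_queue = queue.Queue()  # in-process stand-in for a message queue
storage = {}                 # stand-in for a database table


def collect(user_id, action):
    """Collection: instrumentation code emits one event per user action."""
    event_queue.put({"user_id": user_id, "action": action})


def process_all():
    """Processing: drain the queue and aggregate events by action."""
    counts = Counter()
    while not event_queue.empty():
        event = event_queue.get()
        counts[event["action"]] += 1
    return counts


def render_bar_chart(counts):
    """Visualization: one text bar per action, sorted by name."""
    return "\n".join(f"{name:<6}{'#' * n}" for name, n in sorted(counts.items()))


for uid, action in [("u1", "click"), ("u2", "view"), ("u1", "click")]:
    collect(uid, action)

storage["action_counts"] = dict(process_all())  # Storage: persist the result
chart = render_bar_chart(storage["action_counts"])
```

The queue decouples producers from consumers, which is the main reason a real pipeline places Kafka or RabbitMQ between collection and processing.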
Depending on scale, data can be small (handled with sampling), medium (requires more robust pipelines), or large (necessitating big‑data technologies such as Hadoop, HDFS, HBase, Spark, and Kafka).
Big‑data characteristics are commonly described by the 5Vs (volume, variety, velocity, veracity, value). The supporting technologies span file storage (HDFS), NoSQL databases (HBase, Elasticsearch), batch processing (MapReduce, Tez, Spark), interactive SQL engines (Hive, Presto, Impala), real‑time streaming (Storm, Spark Streaming), resource scheduling (YARN, Mesos), coordination (ZooKeeper), and monitoring (Ganglia).
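To illustrate the batch-processing model, here is a single-machine word count written in the map, shuffle, and reduce style. It is a sketch of the programming model only, not of the Hadoop API; on a cluster the framework runs the map and reduce functions in parallel and performs the shuffle across nodes. The function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `word_count(["big data", "Big storage"])` counts `big` twice because the map step lowercases every word before emitting it.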
The Lambda architecture combines batch, speed, and serving layers to provide both historical and real‑time views, typically using Hadoop for batch, Impala for serving, and Storm or Spark for speed.
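The interaction between the three layers can be sketched with page-view counts. This in-memory class is a simplified assumption: in a real deployment the batch view would be recomputed by a Hadoop job, the speed view maintained by a streaming job, and the serving layer would answer queries from both. The class and method names are hypothetical.

```python
from collections import Counter


class LambdaViews:
    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # complete but stale until the next batch run
        self.speed_view = Counter()  # covers only events since the last batch run

    def ingest(self, page):
        """New events go to the master log and to the speed layer."""
        self.master_log.append(page)
        self.speed_view[page] += 1

    def run_batch(self):
        """Batch layer: recompute the view from the full log, then reset speed."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, page):
        """Serving layer: merge the historical and real-time views."""
        return self.batch_view[page] + self.speed_view[page]
```

Recomputing the batch view from the full log is what lets the batch layer correct any errors that accumulated in the speed layer.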
Personalized recommendation systems, a major application of machine learning, rely on log collection, recommendation algorithms (content‑based, association‑rule, collaborative filtering), and UI presentation, and they fit within the broader data ecosystem described above.
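As one concrete instance of the algorithms listed, the following is a minimal sketch of item-based collaborative filtering over implicit feedback, using cosine similarity between the sets of users who interacted with each item. The interaction data and function names are made up for illustration; production systems add weighting, time decay, and far larger sparse data structures.

```python
from collections import defaultdict
from math import sqrt


def item_users(interactions):
    """Invert {user: items} into {item: users}."""
    index = defaultdict(set)
    for user, items in interactions.items():
        for item in items:
            index[item].add(user)
    return index


def similarity(users_a, users_b):
    """Cosine similarity between two items, each represented by its user set."""
    if not users_a or not users_b:
        return 0.0
    return len(users_a & users_b) / sqrt(len(users_a) * len(users_b))


def recommend(user, interactions, top_n=2):
    """Score unseen items by their summed similarity to the user's items."""
    index = item_users(interactions)
    seen = interactions[user]
    scores = defaultdict(float)
    for candidate, candidate_users in index.items():
        if candidate in seen:
            continue
        for item in seen:
            scores[candidate] += similarity(candidate_users, index[item])
    # Highest score first; ties broken by item name for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# Hypothetical interaction logs, as they might be derived from collected events.
interactions = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"d"},
}
```

With this data, item `c` shares users with both items that `u1` has seen, while item `d` shares none, so only `c` receives a positive score for `u1`.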
In summary, data statistics, machine learning, and personalized recommendation are the most impactful uses of data today. Building an appropriate data system, ranging from simple relational databases to full‑scale big‑data platforms, is essential for handling different data volumes and business requirements.
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
