
Building a Billion‑Scale Real‑Time Analytics Platform: Architecture & Techniques

This article explains how an analytics system handling a billion events per day can ingest and query data within seconds without predefined metrics. It covers the product requirements, the technical choices, and the end‑to‑end architecture from collection through storage to real‑time querying.

dbaplus Community

Customer Requirements and Product Design

To handle up to one billion events per day, the system must support private deployment, open‑source extensibility, low ETL cost, data import within seconds, and sub‑second query response. It must also allow multi‑dimensional analysis without metrics or dimensions being defined in advance.

Overall Architecture

The platform follows a five‑step data processing pipeline typical of large‑scale analytics, illustrated in the architecture diagram.

1. Data Collection Subsystem

Three data sources are supported:

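Front‑end operations (iOS, Android, web), collected via three tracking methods: full tracking (all events captured automatically), visual tracking (events selected through a visual interface), and code tracking (explicit instrumentation calls in the application code).

Back‑end logs collected directly from services.

Business data imported via tools or RESTful APIs.

Code tracking provides the richest data, because the developer can attach custom attributes such as order amount and user level to each event.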
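To make code tracking concrete, here is a minimal sketch of what an instrumentation call might assemble before sending it to the collection API. The function name `build_track_payload`, the field names, and the `PayOrder` event are all hypothetical; the source does not describe the SDK or its wire format.

```python
import json
import time


def build_track_payload(distinct_id, event, properties=None, now_ms=None):
    """Build the JSON body a code-tracking call might send (hypothetical format)."""
    if not distinct_id or not event:
        raise ValueError("distinct_id and event are required")
    return {
        "distinct_id": distinct_id,
        "event": event,
        # Millisecond timestamp; injectable so the example is reproducible.
        "time": now_ms if now_ms is not None else int(time.time() * 1000),
        # Custom attributes exist only because the developer passes them in.
        "properties": dict(properties or {}),
    }


payload = build_track_payload(
    "user-42",
    "PayOrder",
    {"order_amount": 129.5, "user_level": "gold"},
    now_ms=1_700_000_000_000,
)
body = json.dumps(payload, sort_keys=True)  # what would go over HTTP
```

Full and visual tracking cannot supply fields like `order_amount`, since that value lives in application state rather than in the UI event itself.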

2. Data Ingestion Subsystem

Data is sent via HTTP API to an Nginx front‑end, which writes requests to log files. An Extractor module reads these logs in real time, validates formats, enriches data (IP‑based location, User‑Agent parsing), performs ID‑mapping, and publishes events to Kafka.
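The Extractor's steps (validate, enrich, map IDs, publish) can be sketched as a small pipeline. This is an illustration under assumptions, not the platform's implementation: the field names, the lookup tables, and the `publish` callback (a stand‑in for a Kafka producer) are hypothetical, and the User‑Agent check is deliberately crude.

```python
import json

REQUIRED_FIELDS = ("distinct_id", "event", "time")  # assumed minimal schema


def parse_user_agent(user_agent):
    """Very rough OS detection; a real parser would be far more thorough."""
    user_agent = user_agent or ""
    if "iPhone" in user_agent or "iPad" in user_agent:
        return "iOS"
    if "Android" in user_agent:
        return "Android"
    return "other"


def extract(line, ip_to_city, id_map):
    """Validate, enrich, and ID-map one logged request; return None if invalid."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    if any(field not in record for field in REQUIRED_FIELDS):
        return None
    props = record.setdefault("properties", {})
    props["city"] = ip_to_city.get(record.get("ip"), "unknown")  # IP-based location
    props["os"] = parse_user_agent(record.get("user_agent"))
    # ID-mapping: resolve a device ID to a known user ID when one exists.
    record["user_id"] = id_map.get(record["distinct_id"], record["distinct_id"])
    return record


def run_extractor(lines, publish, ip_to_city, id_map):
    """Feed valid events to `publish`; return how many were published."""
    published = 0
    for line in lines:
        event = extract(line, ip_to_city, id_map)
        if event is not None:
            publish(json.dumps(event))
            published += 1
    return published


log_lines = [
    '{"distinct_id": "dev-1", "event": "ViewPage", "time": 1, '
    '"ip": "10.0.0.1", "user_agent": "Mozilla/5.0 (iPhone)"}',
    "not json",
    '{"event": "MissingFields"}',
]
topic = []  # a list standing in for a Kafka topic
count = run_extractor(log_lines, topic.append, {"10.0.0.1": "Beijing"}, {"dev-1": "user-42"})
```

Because Nginx only appends requests to log files, the HTTP path stays cheap, and the Extractor can re‑read the logs if a downstream step fails.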

3. Data Model

The system adopts an Event + User model. An Event records a user action with a user ID, event name, and up to 10,000 custom properties, while a User stores static attributes such as age, location, and tags. Both are schema‑free; new fields are discovered automatically.
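Automatic field discovery can be pictured as a registry that learns property names and types as events arrive. The sketch below is an assumption about how such a mechanism could work; `SchemaRegistry`, its type names, and its conflict rule are hypothetical and not taken from the source.

```python
def infer_type(value):
    """Map a JSON value to a coarse column type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "list"
    raise TypeError(f"unsupported property value: {value!r}")


class SchemaRegistry:
    """Tracks the properties seen so far and the type each one was first seen with."""

    def __init__(self):
        self.columns = {}

    def observe(self, properties):
        """Register unseen properties and return their names, sorted.

        Raises TypeError, without registering anything, if a known property
        arrives with a different type.
        """
        pending = {}
        for name, value in properties.items():
            kind = infer_type(value)
            known = self.columns.get(name)
            if known is not None and known != kind:
                raise TypeError(f"property {name!r} is {known}, got {kind}")
            if known is None:
                pending[name] = kind
        self.columns.update(pending)
        return sorted(pending)


registry = SchemaRegistry()
first = registry.observe({"order_amount": 129.5, "user_level": "gold"})
second = registry.observe({"order_amount": 10, "coupon_code": "SPRING"})
```

Here the first event registers two columns and the second adds only `coupon_code`, so no one has to declare a schema before sending data.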

4. Data Import and Storage

Storage is split into a Write‑Optimized Store (WOS) and a Read‑Optimized Store (ROS). The WOS uses Kudu for real‑time writes; once a Kudu table reaches a threshold, its contents are converted to Parquet files in the ROS on HDFS. Parquet files are partitioned by event date and event name, kept at around 512 MB each, and sorted by user ID and timestamp so that scans over the columnar data stay efficient.
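The WOS‑to‑ROS handoff can be illustrated with an in‑memory toy. This is only a sketch of the idea: `TwoTierStore` is hypothetical, Python lists stand in for Kudu and Parquet, the threshold is a row count chosen for the demo, and dates are derived in UTC as an assumption.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event):
    """Partition by (event date, event name); the date is taken in UTC here."""
    day = datetime.fromtimestamp(event["time"] / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
    return (day, event["event"])


class TwoTierStore:
    def __init__(self, threshold=4):
        self.threshold = threshold    # demo row count, not a real-world setting
        self.wos = []                 # stands in for the Kudu table
        self.ros = defaultdict(list)  # partition -> list of sorted "files"

    def write(self, event):
        self.wos.append(event)
        if len(self.wos) >= self.threshold:
            self.flush()

    def flush(self):
        """Convert buffered rows into sorted, partitioned read-optimized files."""
        grouped = defaultdict(list)
        for event in self.wos:
            grouped[partition_key(event)].append(event)
        for key, rows in grouped.items():
            rows.sort(key=lambda e: (e["user_id"], e["time"]))
            self.ros[key].append(rows)
        self.wos = []

    def scan(self, day, name):
        """Read one partition from both tiers: flushed files first, then fresh rows."""
        for rows in self.ros.get((day, name), []):
            yield from rows
        for event in self.wos:
            if partition_key(event) == (day, name):
                yield event


base = 1_700_000_000_000  # milliseconds since the epoch
store = TwoTierStore(threshold=4)
store.write({"user_id": "u2", "event": "PayOrder", "time": base + 3000})
store.write({"user_id": "u1", "event": "PayOrder", "time": base + 2000})
store.write({"user_id": "u1", "event": "PayOrder", "time": base + 1000})
store.write({"user_id": "u1", "event": "ViewPage", "time": base})  # triggers a flush
store.write({"user_id": "u3", "event": "PayOrder", "time": base + 4000})  # stays in the WOS

day, _ = partition_key({"event": "PayOrder", "time": base})
rows = [(e["user_id"], e["time"] - base) for e in store.scan(day, "PayOrder")]
```

After the flush, the `PayOrder` rows for `u1` sit next to each other in time order, which is the layout that makes per‑user analyses cheap to scan; the newest row is still visible because `scan` also reads the write‑optimized tier.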

5. Data Query Subsystem

Queries arrive at a WebServer and are forwarded to a QueryEngine, which translates them into SQL executed by Impala. Impala reads from both Kudu and Parquet through a unified view, so fresh and historical data are queried together. Optimizations include restricting queries to a limited set of query models, each backed by purpose‑built UDF/UDAF implementations, plus custom caching and user‑level sampling.
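Two of the optimizations named above, user‑level sampling and result caching, can be sketched in a few lines. The hashing scheme, the scaling rule, and the `QueryCache` class are assumptions made for illustration; the source does not say how either feature is implemented.

```python
import hashlib


def in_sample(user_id, rate_percent):
    """Deterministically keep a user if their hash bucket falls below the rate."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rate_percent


def sampled_distinct_users(events, event_name, rate_percent):
    """Estimate distinct users for an event from a user-level sample."""
    if not 0 < rate_percent <= 100:
        raise ValueError("rate_percent must be in (0, 100]")
    users = {
        e["user_id"]
        for e in events
        if e["event"] == event_name and in_sample(e["user_id"], rate_percent)
    }
    # Scale the sampled count back up to the full population.
    return round(len(users) * 100 / rate_percent)


class QueryCache:
    """Memoizes query results by key (a stand-in for an external cache)."""

    def __init__(self):
        self._results = {}
        self.hits = 0

    def get_or_compute(self, key, compute):
        if key in self._results:
            self.hits += 1
            return self._results[key]
        result = compute()
        self._results[key] = result
        return result


events = [
    {"user_id": "u1", "event": "PayOrder"},
    {"user_id": "u1", "event": "PayOrder"},
    {"user_id": "u2", "event": "PayOrder"},
    {"user_id": "u3", "event": "ViewPage"},
]
cache = QueryCache()
key = ("distinct_users", "PayOrder", 100)
exact = cache.get_or_compute(key, lambda: sampled_distinct_users(events, "PayOrder", 100))
again = cache.get_or_compute(key, lambda: sampled_distinct_users(events, "PayOrder", 100))
```

Sampling by user rather than by event keeps every sampled user's full event sequence intact, so analyses that follow a user across several events still work on the sample.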

6. Metadata, Monitoring, and Operations

Metadata (schemas, dimensions, configurations) is stored in MySQL and ZooKeeper; query caches reside in Redis. A Monitor module continuously checks system health and performs automatic remediation. Operational tools automate data cleanup, version upgrades, performance analysis, and multi‑project management.

User Segmentation and Behavior Prediction

User segmentation tags users based on historical actions (e.g., recent purchasers). Behavior prediction estimates the probability of future actions. Detailed implementations are deferred to future articles.
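As a small illustration of the segmentation idea, the sketch below tags users who purchased within a recent window. The function name, the `PayOrder` event, and the seven‑day default are hypothetical; the source leaves the real implementation to later articles.

```python
DAY_MS = 24 * 60 * 60 * 1000


def recent_purchasers(events, now_ms, window_days=7, event_name="PayOrder"):
    """Return the IDs of users with at least one purchase event inside the window."""
    cutoff = now_ms - window_days * DAY_MS
    return {
        e["user_id"]
        for e in events
        if e["event"] == event_name and cutoff <= e["time"] <= now_ms
    }


now_ms = 1_700_000_000_000
history = [
    {"user_id": "u1", "event": "PayOrder", "time": now_ms - 2 * DAY_MS},
    {"user_id": "u2", "event": "PayOrder", "time": now_ms - 30 * DAY_MS},
    {"user_id": "u3", "event": "ViewPage", "time": now_ms - 1 * DAY_MS},
]
segment = recent_purchasers(history, now_ms)
```

Only `u1` lands in the segment: `u2` purchased outside the window and `u3` never purchased.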
