Big Data 12 min read

How Suning Built a Scalable Real-Time Log Analysis Platform with Spark Streaming

Suning’s real‑time log analysis system integrates Flume, Kafka, Storm and Spark Streaming to collect, cleanse, and compute metrics like NDCG, ensuring low latency, high throughput, exact‑once processing, and robust data safety while supporting multi‑dimensional analytics on massive online‑offline traffic.

Suning Technology
Suning Technology
Suning Technology
How Suning Built a Scalable Real-Time Log Analysis Platform with Spark Streaming

Introduction

With Hadoop stacks becoming stable, the main bottleneck has shifted from compute capacity to soft requirements such as data diversity, complex business analysis, system stability, and data reliability. Suning.com adopted a dual‑online‑offline strategy for smart retail, making its log analysis system the first step in data‑driven operations.

Data Analysis Process and Architecture

Business Background

Suning’s online and offline operators demand diverse, timely data analysis. The real‑time log analysis system processes billions of log entries daily, requiring low latency, no data loss, and complex analytical logic.

Rich data sources: online/offline traffic, sales, customer service, etc.

Diverse business needs: marketing, procurement, finance, supply‑chain, merchants, and more.

Process and Architecture

The system consists of three stages: collection, cleaning, and metric calculation.

Collection module: gathers logs from various sources and streams them to Kafka via Flume.

Cleaning module: receives logs in real time, transforms and structures them using Storm, then forwards the cleaned data back to Kafka.

Metric calculation: consumes structured data from Kafka and computes indicators using either Storm or Spark Streaming tasks. Spark Streaming is used for near‑real‑time scenarios, offering high throughput, standard SQL support, simple development, and window‑function capabilities. The Suning data cloud platform provides integrated components such as Hive, Spark, Storm, Druid, Elasticsearch, HBase, and Kafka.

After metric calculation, results are stored in HBase, Druid, and other storage engines for downstream business systems to retrieve and provide analytical services.

Spark Streaming in Metric Analysis

Spark Streaming Overview

Spark Streaming adapts the batch‑processing model to achieve near‑real‑time computation by processing data in micro‑batches. It supports multiple data sources (Kafka, Flume, HDFS, Kinesis) and can write results to HDFS, relational databases, and other sinks.

Compared with Storm, Spark Streaming offers higher throughput, native SQL support, better integration with storage systems, convenient development, and window‑function capabilities for complex calculations.

NDCG Metric Analysis

Normalized Discounted Cumulative Gain (NDCG) evaluates search ranking quality. It assigns higher weights to top‑ranked results using a logarithmic discount factor. The ideal DCG (IDCG) is computed first, then the actual DCG, and NDCG = DCG / IDCG, with values closer to 1 indicating better rankings.

Example: for the query "Apple" on Suning.com, using the top‑4 results yields IDCG = 1, DCG = 0.5, resulting in NDCG = 0.5.

NDCG Calculation Design

Analysis of search behavior shows 86% of searches complete within 5 minutes and 90% within 10 minutes. Therefore, a 15‑minute sliding window was chosen for real‑time NDCG calculation, presenting two challenges: windowed computation over the past 15 minutes and deduplication of searches within the window.

Using Spark Streaming’s window feature, a 15‑minute window with a 5‑minute slide computes NDCG for searches occurring between 15 and 10 minutes ago, avoiding duplicate calculations.

Initial tests revealed high resource consumption for storing 15‑minute data. By separating search logs in Kafka and subscribing only to this stream for NDCG, resource usage was significantly reduced.

Performance and Data Safety

Performance Assurance

Capacity planning must account for continuously growing traffic logs and peak loads during major promotions. Horizontal scaling—adjusting Kafka partitions, Storm processing nodes, and Spark Streaming concurrency—ensures the system meets performance requirements.

Multi‑dimensional Analysis Optimization

For NDCG, four dimensions (region, city, channel, keyword) are combined, requiring 15 separate updates in HBase. Adding finer time granularity (hour, week, month) would multiply tasks and storage, prompting the adoption of an OLAP engine. Suning’s data cloud introduced Druid in 2017, offering columnar storage, real‑time Kafka ingestion, and fast aggregations, dramatically improving NDCG computation efficiency.

Data Assurance

To prevent data loss during task restarts, two guarantees are required: the data source must retain data, and processing tasks must ensure data is handled. Kafka provides disk persistence and replication; Storm offers an Ack mechanism; Spark Streaming uses checkpointing (WAL) and the direct Kafka API to record offsets, enabling efficient recovery.

Exactly‑once Semantics

For sales‑related data, processing must be both complete and singular. Two solutions are presented:

Lambda architecture + Redis deduplication : after an order is processed, its ID is stored in Redis to prevent re‑processing.

MPP with primary key : using PG‑Citus as an MPP database, order IDs are defined as primary keys to enforce uniqueness.

Future Architecture Evolution and Optimization

The current stack relies on open‑source components and still faces challenges such as frequent data quality issues, lack of monitoring, and constant rule changes. Starting in late 2017, two modules were added:

Data quality monitoring: configurable rules for real‑time and batch data validation, supporting sampling and full‑volume checks with alerting.

Data cleaning rule configuration system: abstracts cleaning logic into configurable rules using Drools and Groovy, enabling easier updates.

Conclusion and Outlook

The log processing and analysis system underpins BI and data‑mining applications, acting as a crucial bridge between raw traffic and business insights. In large‑scale, multi‑line scenarios, systematic platforms for data quality, timeliness, and stability are essential for the success of smart retail.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big Datadata pipelineReal-time analyticsData QualitySpark StreamingNDCG
Suning Technology
Written by

Suning Technology

Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.