
How to Build a Real-Time Big Data Sentiment Analysis Platform Using Lambda & Kappa

This article explores the design of a large‑scale, real‑time sentiment analysis system, detailing the data ingestion, processing, and storage requirements, comparing Lambda and Kappa architectures, and presenting an Alibaba Cloud solution that combines Tablestore and Blink for unified batch‑and‑stream processing.

21CTO

Introduction

The rapid growth of the Internet has turned every user into a potential broadcaster, generating massive amounts of media, e‑commerce orders, and reviews that must be captured and analyzed in real time for business decision‑making.

Key Requirements for a Big Data Sentiment System

Massive real‑time ingestion : Web crawlers must collect pages from portals and self‑media, de‑duplicate URLs before fetching them, and extract links to sub‑pages once a page has been crawled.

Raw page processing : Convert unstructured HTML into structured fields such as title, summary, and product reviews.

Structured sentiment analysis : Classify content, apply sentiment tagging, and generate outputs like hot topics, influence analysis, propagation paths, user profiling, and alerts.

Flexible storage and interactive queries : Support full‑text search and multi‑field analytical queries for both analysts and business users.

Real‑time alerting : Detect major events and trigger immediate notifications.
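
The first two requirements can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the crawler described in the article: `normalize_url`, `SeenUrls`, and `PageExtractor` are hypothetical names, and the in-memory set stands in for whatever shared store a distributed crawler would actually use.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host and drop the fragment so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenUrls:
    """In-memory de-duplication set keyed by a hash of the normalized URL (assumption:
    a production crawler would keep this state in a shared store)."""

    def __init__(self):
        self._seen = set()

    def should_fetch(self, url):
        key = hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


class PageExtractor(HTMLParser):
    """Collects the <title> text and every <a href> as a candidate sub-page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


def extract(html):
    """Turn raw HTML into a small structured record."""
    parser = PageExtractor()
    parser.feed(html)
    return {"title": parser.title, "links": parser.links}
```

Checking `should_fetch` before each request keeps the crawler from downloading the same page twice, and the `links` list returned by `extract` is what would be fed back into the crawl queue.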

System Design Overview

This two‑part series first presents the overall architecture, then dives into the database schema and sample code.

Lambda Architecture

Lambda combines batch processing (e.g., Hadoop, Spark) with real‑time stream processing (e.g., Spark Streaming) using a queue such as Kafka. Raw pages are written to Kafka, then simultaneously consumed by a batch layer (HDFS → Hadoop) and a speed layer (Spark Streaming). Results are stored in a serving layer (e.g., HBase) and queried via Elasticsearch.
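
To make the three layers concrete, here is a toy, in-memory sketch of how a Lambda query merges a batch view with a speed view. It assumes a simple "count mentions per topic" workload; the function and class names are illustrative and no real Kafka, Spark, or HBase API is involved.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts per topic from the full dataset."""
    return Counter(event["topic"] for event in master_dataset)


class SpeedView:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["topic"]] += 1


def query(topic, batch, speed):
    """Serving layer: merge the complete-but-stale batch view with the fresh speed view."""
    return batch[topic] + speed.counts[topic]
```

The batch view is rebuilt from scratch on every run, while the speed view only covers the gap since the last run. That split is also why the counting logic ends up written twice, once per engine, in a classic Lambda deployment.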

Open‑Source Sentiment Analysis Solution

The pipeline starts with a distributed crawler that pushes raw HTML to Kafka. Stream processing extracts structured fields, performs tokenization and sentiment analysis, and writes results to MySQL or HBase, which are then synchronized to Elasticsearch for flexible queries and alerting.

Stream jobs extract and tokenize content, then apply sentiment dictionaries.

Results are stored in MySQL/HBase and indexed in Elasticsearch; when a major event is detected, a message is written to Kafka to trigger an alert.

Batch jobs periodically reprocess the full dataset with Spark to refine sentiment models.
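
The tokenize-then-score step of the stream job can be illustrated with a dictionary-based scorer. This is a minimal sketch under stated assumptions: the lexicon below is a made-up toy, tokenization is a simple regular expression for English text (Chinese content would need a proper word segmenter), and the negation handling is deliberately crude.

```python
import re

# Hypothetical toy lexicon; a real system would load a curated sentiment dictionary.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "terrible": -2, "broken": -2}
NEGATIONS = {"not", "never", "no"}


def tokenize(text):
    """Split lower-cased text into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Sum lexicon weights; a negation flips the sign of the next lexicon word."""
    score = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATIONS:
            negate = True
            continue
        weight = LEXICON.get(token, 0)
        if weight:
            score += -weight if negate else weight
            negate = False
    return score


def label(text):
    """Map the numeric score to a coarse sentiment tag."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because the scorer is a pure function of the text, the same code can run inside a stream job for new pages and inside the periodic batch job when the dictionary is revised.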

Challenges of the Open‑Source Stack

Operating many components (Kafka, HBase, Spark, Flink, Elasticsearch) increases failure risk and maintenance overhead.

Separate storage for batch and stream leads to data redundancy and duplicated code.

Synchronizing databases with search engines adds complexity and consistency concerns.

Lambda Plus: A Simplified Architecture

To reduce component count, the Lambda Plus design merges batch and stream storage using a single distributed database that supports both random access and sequential log consumption, enabling one codebase for both workloads.

Real‑time writes to the database provide a unified source for batch processing.

The database’s log interface supplies incremental data for stream engines.

Both batch and stream results are written back to the database, offering rich query capabilities.
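
The three points above can be sketched with a toy store that offers both access patterns. Everything here is hypothetical and in-memory (`UnifiedStore`, `enrich`, `run_batch`, `run_stream`); it only illustrates the idea of one storage system and one processing function serving both workloads, not any real database API.

```python
class UnifiedStore:
    """Toy store exposing random access (get/scan) and an ordered change log."""

    def __init__(self):
        self._rows = {}
        self._log = []  # append-only list of (key, row) in write order

    def put(self, key, row):
        self._rows[key] = row
        self._log.append((key, row))

    def get(self, key):
        return self._rows.get(key)

    def scan(self):
        """Full scan, as a batch job would use."""
        return list(self._rows.items())

    def read_log(self, offset):
        """Incremental read, as a stream job would use; returns entries and next offset."""
        return self._log[offset:], len(self._log)


def enrich(row):
    """Shared processing logic: the same function serves batch and stream."""
    return {**row, "length": len(row["text"])}


def run_batch(source, sink):
    """Reprocess every row currently in the source store."""
    for key, row in source.scan():
        sink.put(key, enrich(row))


def run_stream(source, sink, offset):
    """Process only rows written since `offset`; return the new offset."""
    entries, next_offset = source.read_log(offset)
    for key, row in entries:
        sink.put(key, enrich(row))
    return next_offset
```

Since `run_batch` and `run_stream` both call `enrich`, a change to the processing logic is made once, and a full batch rerun produces the same rows the stream path would have written.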

Cloud‑Based Sentiment System on Alibaba Cloud

We selected Alibaba Cloud TableStore (a multi‑model distributed database) for storage and Blink for unified stream‑batch computation.

TableStore integrates with Blink through its Tunnel Service, eliminating the need for custom data‑flow code.

The solution reduces the typical open‑source stack from six or seven components to just TableStore and Blink, both fully managed and horizontally scalable.

Developers focus on data‑processing logic; TableStore handles both table and queue semantics.

Blink supports real‑time and batch jobs on the same codebase, simplifying development and operations.

Advantages

Significant reduction in operational complexity and cost.

Unified storage eliminates data duplication and synchronization overhead.

Real‑time alerts are implemented via TableStore triggers and Function Compute.

Multi‑model indexing in TableStore replaces the HBase + Solr/Elasticsearch combo, improving consistency and query simplicity.
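
As a rough illustration of the kind of rule a real‑time alert might evaluate, the sketch below fires when too many negative records arrive within a sliding time window. It is plain Python with hypothetical names and does not use the TableStore trigger or Function Compute APIs; it assumes records arrive with non‑decreasing timestamps, and the window and threshold values are arbitrary.

```python
from collections import deque


class NegativeSpikeAlert:
    """Fires when at least `threshold` negative records fall inside the time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()  # timestamps of recent negative records

    def on_record(self, timestamp, sentiment):
        """Return True if this record pushes the window count to the threshold."""
        if sentiment != "negative":
            return False
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop negatives that are older than the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold
```

A handler invoked for each newly written row could call `on_record` and send a notification whenever it returns `True`.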

Conclusion

Building on the "Hundred‑Billion‑Scale Global Sentiment Analysis System Storage Design" paper, the proposed Lambda Plus architecture leverages TableStore and Blink to achieve real‑time, scalable sentiment analysis with far fewer components and lower maintenance burden.
