Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing
This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.
Scenario Requirements and Challenges
The article begins by enumerating typical big‑data use cases: financial risk control (user‑profile libraries, web‑crawling, anti‑fraud, order data), personalized recommendation (user behavior analysis, profiling, recommendation engine, massive real‑time data), social feeds (posts, comments, real‑time processing), spatio‑temporal analytics (monitoring data, trajectories, device data, geographic information, regional statistics and queries), and general big‑data workloads (dimension tables, result tables, offline analysis, massive real‑time storage).
New Challenges
It highlights the characteristics of Apache HBase for online queries—schema‑free tables, random and range queries, native distributed storage, high throughput, low latency, multi‑version, incremental import, and multidimensional deletion. New challenges include streaming and batch ingestion, complex analytics, machine‑learning and graph computing, and ecosystem/federated analysis.
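The access patterns listed above (random reads, range queries over sorted row keys, multiple versions per cell) can be illustrated with a small in-memory model. This is only a sketch of the ideas, not the HBase client API; the class name ToySortedStore, its methods, and the sample row keys are all hypothetical.

```python
import bisect
from collections import defaultdict


class ToySortedStore:
    """In-memory stand-in for a row-key-sorted, multi-version table (not HBase itself)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._keys = []                  # row keys, kept in sorted order
        self._cells = defaultdict(list)  # (row, column) -> [(timestamp, value)], newest first

    def put(self, row, column, value, ts):
        # Insert the row key at its sorted position if it is new.
        i = bisect.bisect_left(self._keys, row)
        if i == len(self._keys) or self._keys[i] != row:
            self._keys.insert(i, row)
        versions = self._cells[(row, column)]
        versions.append((ts, value))
        versions.sort(key=lambda cell: cell[0], reverse=True)  # newest first
        del versions[self.max_versions:]                       # keep only the newest N

    def get(self, row, column, versions=1):
        # Random (point) read; .get avoids creating empty entries for missing cells.
        return [value for _, value in self._cells.get((row, column), [])[:versions]]

    def scan(self, start, stop):
        # Range read over row keys in [start, stop), using the sorted order.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return self._keys[lo:hi]
```

Because row keys are kept sorted, a prefix such as "user#" turns into a cheap range scan, which is why row-key design matters so much for online queries.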
Why Choose Spark
Speed: query optimization and in-memory caching let Spark analyze data at scale quickly; iterative workloads such as logistic regression are cited as running up to 100× faster than Hadoop MapReduce.
One‑stop solution: Spark supports complex SQL, streaming, ML, and graph processing in a single application.
Developer‑friendly: native APIs for SQL, Python, Scala, Java, and R.
Rich ecosystem: integrates with Kafka, HBase, Cassandra, MongoDB, Redis, MySQL, SQL Server, etc.
Platform Architecture and Cases
One‑Stop Data Processing Platform Architecture
Data ingestion: Spark Streaming performs streaming ETL and incremental loading into HBase/Phoenix.
Online query: HBase/Phoenix provides high‑concurrency query services.
Offline analysis & algorithms: Spark SQL, ML, and graph libraries process data stored in HBase/Phoenix.
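The ingestion step of this architecture can be sketched as a per-record transform that turns raw events into HBase-style cells, applied to each micro-batch. The event fields (user_id, ts), the row-key layout, and the column family name "cf" are assumptions made for illustration; the article does not specify a schema. In a real job, a function like etl_micro_batch might be called from a streaming sink callback, with the actual write handled by an HBase or Phoenix connector.

```python
import json


def to_hbase_cells(raw_line, family="cf"):
    """Parse one JSON event into (row_key, [(family, qualifier, value), ...]).

    Returns None for malformed input so bad records can be filtered out.
    """
    try:
        event = json.loads(raw_line)
        user_id = str(event["user_id"])
        ts = int(event["ts"])
    except (ValueError, KeyError, TypeError):
        return None
    # Hypothetical row key: user id plus zero-padded epoch milliseconds,
    # so one user's events sort together in time order.
    row_key = f"{user_id}#{ts:013d}"
    cells = [
        (family, name, str(value))
        for name, value in sorted(event.items())
        if name not in ("user_id", "ts")
    ]
    return row_key, cells


def etl_micro_batch(lines):
    """Transform one micro-batch of raw lines, dropping records that fail to parse."""
    parsed = (to_hbase_cells(line) for line in lines)
    return [item for item in parsed if item is not None]
```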
Typical Business Scenario – Crawler + Search Engine
Performance: streaming throughput of 200,000 records/second.
Query capability: HBase syncs to Solr for full‑text search.
One‑stop solution: Spark natively reads HBase via SQL, enabling an integrated Spark + HBase + Solr platform.
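Reading HBase from Spark SQL can look roughly like the following. This is a sketch that assumes the open-source Apache hbase-connectors data source ("org.apache.hadoop.hbase.spark") and its "hbase.table" and "hbase.columns.mapping" options; the managed service described in the article may ship a different connector, and depending on the connector version further options or an HBaseContext may be needed. The helper names, table name, and columns are hypothetical.

```python
def hbase_columns_mapping(columns):
    """Build the column-mapping string from (sql_name, sql_type, hbase_target) triples.

    hbase_target is ":key" for the row key or "family:qualifier" for a cell.
    """
    return ", ".join(f"{name} {sql_type} {target}" for name, sql_type, target in columns)


def hbase_read_options(table, columns):
    """Options passed to the DataFrame reader for one HBase table."""
    return {
        "hbase.table": table,
        "hbase.columns.mapping": hbase_columns_mapping(columns),
    }


def register_hbase_view(spark, table, columns, view_name):
    """Expose an HBase table as a Spark SQL temp view.

    Needs a SparkSession and the connector jar on the classpath;
    it is not exercised by the pure-Python helpers above.
    """
    df = (
        spark.read.format("org.apache.hadoop.hbase.spark")
        .options(**hbase_read_options(table, columns))
        .load()
    )
    df.createOrReplaceTempView(view_name)
    return df
```

With a view registered this way, crawled pages stored in HBase could be queried with ordinary SQL, for example spark.sql("SELECT id, title FROM pages WHERE score > 0.8").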
Typical Business Scenario – Big‑Data Risk Control System
Spark supports both real-time (in-transaction) and offline risk analysis.
Seamless integration with HBase, RDS, MongoDB and other online stores.
Typical Business Scenario – Building a Data Warehouse (Recommendation & Risk Control)
Millisecond‑level detection and interception of fraudulent orders, handling tens of thousands of concurrent requests.
Spark with columnar Parquet storage is reported to deliver up to 10× the performance of Greenplum for large-scale analytics.
One‑stop solution: Spark reads HBase/Phoenix data via SQL.
The managed Spark service keeps jobs stable and reduces operational overhead, and the data workbench lowers management cost.
Principles and Best Practices
Spark API evolution: RDD → DataFrame → Dataset.
Spark Streaming uses a micro‑batch model for real‑time data.
Common performance issues (job backlog, high latency, insufficient concurrency) can be mitigated by increasing the number of Kafka partitions, tuning spark.streaming.blockInterval, and optimizing hot code paths (for example, using broadcast variables and refactoring code).
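The effect of these tuning knobs can be estimated with a little arithmetic. For receiver-based Spark Streaming, the number of tasks per batch is roughly the batch interval divided by the block interval, per receiver; with the Kafka direct approach, parallelism follows the number of Kafka partitions instead. The sketch below encodes that rule of thumb; the function names and sample numbers are hypothetical.

```python
def tasks_per_batch(batch_interval_ms, block_interval_ms, receivers=1):
    """Approximate tasks per batch for receiver-based streams.

    Each receiver cuts incoming data into one block per block interval,
    and each block becomes one task when the batch is processed.
    """
    return receivers * (batch_interval_ms // block_interval_ms)


def is_falling_behind(avg_processing_ms, batch_interval_ms):
    """A backlog builds up when batches take longer to process than they take to arrive."""
    return avg_processing_ms > batch_interval_ms


def streaming_conf(block_interval_ms):
    """Spark configuration entry for the chosen block interval."""
    return {"spark.streaming.blockInterval": f"{block_interval_ms}ms"}
```

For a 2-second batch, the default 200 ms block interval gives about 10 tasks per receiver; lowering it to 50 ms with two receivers gives about 80, which helps only if the cluster has cores to run them.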
Streaming data ingestion into HBase:
Micro-batch processing has a latency of roughly 100 ms; continuous processing can reach roughly 1 ms.
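The latency gap between the two modes follows from how micro-batching works: an event waits for the next trigger before it is processed at all. The toy model below makes that explicit and also shows the keyword arguments that Structured Streaming's DataStreamWriter.trigger accepts for each mode. The latency function is a simplification written for illustration, not a measurement, and the interval values are assumptions.

```python
def micro_batch_latency_ms(arrival_ms, trigger_interval_ms, processing_ms):
    """End-to-end latency of one event under a simple micro-batch model.

    The event waits until the next trigger boundary, then for the batch to finish.
    """
    next_trigger = (
        (arrival_ms + trigger_interval_ms - 1) // trigger_interval_ms
    ) * trigger_interval_ms
    return (next_trigger - arrival_ms) + processing_ms


def trigger_kwargs(mode):
    """Keyword arguments for DataStreamWriter.trigger in each mode.

    For continuous mode the value is the checkpoint interval, not the latency.
    """
    if mode == "micro-batch":
        return {"processingTime": "100 milliseconds"}
    if mode == "continuous":
        return {"continuous": "1 second"}
    raise ValueError(f"unknown mode: {mode}")
```

In this model an event arriving 30 ms after a trigger with a 100 ms interval waits 70 ms before processing even starts, whereas continuous processing handles records as they arrive.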
Optimizations for the Spark-HBase connector (the original illustration is not reproduced here).
The demo code (including Spark operations on HBase and Phoenix) is available at https://github.com/aliyun/aliyun-apsaradb-hbase-demo .
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.