Tagged articles
53 articles
Big Data Tech Team
Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data Warehouse · RDD · Spark
21 min read

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing
7 min read
Big Data Technology Architecture
Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.

Alibaba Cloud · Delta Lake · Real-time Processing
10 min read
58 Tech
Jun 10, 2020 · Big Data

Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

Big Data · Data Quality · Flink
9 min read
Big Data Technology & Architecture
Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · Kafka · RateController
12 min read
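As a rough illustration of the rate calculation this summary refers to, the sketch below implements a simplified PID-style estimator in Python. Backpressure itself is switched on with the `spark.streaming.backpressure.enabled` setting; the class name, field names, and gain values here are assumptions chosen for illustration, not Spark's actual source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PidRateEstimator:
    """Simplified PID controller suggesting an ingestion rate in records/sec.

    Illustrative only: names and default gains are assumptions.
    """
    batch_interval_ms: int
    proportional: float = 1.0
    integral: float = 0.2
    derivative: float = 0.0
    min_rate: float = 100.0
    latest_time_ms: int = -1
    latest_rate: float = -1.0
    latest_error: float = -1.0
    first_run: bool = True

    def compute(self, time_ms: int, num_elements: int,
                processing_delay_ms: int,
                scheduling_delay_ms: int) -> Optional[float]:
        """Return a new rate bound, or None when there is nothing to update."""
        if (time_ms <= self.latest_time_ms or num_elements <= 0
                or processing_delay_ms <= 0):
            return None
        # Rate at which the last batch was actually processed.
        processing_rate = num_elements / processing_delay_ms * 1000.0
        if self.first_run:
            # The first observation only seeds the controller state.
            self.first_run = False
            self.latest_time_ms = time_ms
            self.latest_rate = processing_rate
            self.latest_error = 0.0
            return None
        delay_since_update_s = (time_ms - self.latest_time_ms) / 1000.0
        # Proportional term: gap between the allowed rate and the achieved rate.
        error = self.latest_rate - processing_rate
        # Integral term: backlog implied by the scheduling delay.
        historical_error = (scheduling_delay_ms * processing_rate
                            / self.batch_interval_ms)
        # Derivative term: how fast the error is changing.
        d_error = (error - self.latest_error) / delay_since_update_s
        new_rate = max(
            self.latest_rate
            - self.proportional * error
            - self.integral * historical_error
            - self.derivative * d_error,
            self.min_rate,
        )
        self.latest_time_ms = time_ms
        self.latest_rate = new_rate
        self.latest_error = error
        return new_rate
```

When a batch takes longer than the batch interval, the achieved rate falls below the allowed rate and the estimator lowers the bound; the floor `min_rate` keeps ingestion from stalling entirely.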
dbaplus Community
Mar 9, 2020 · Artificial Intelligence

How LSTM‑Powered Real‑Time Alerting with Spark Streaming Boosts Ops Efficiency

This article details a deep‑learning‑driven, real‑time alert system that combines TensorFlow LSTM time‑series forecasting with Spark Streaming to achieve high‑coverage, low‑latency anomaly detection for large‑scale data‑ops environments, including data preprocessing, metric classification, model training, and deployment pipelines.

AI Ops · LSTM · Spark Streaming
18 min read
DataFunTalk
Mar 8, 2020 · Big Data

Real-Time Log Monitoring and Alerting System for iQIYI Membership Services

This article describes how iQIYI built a real‑time, multi‑dimensional log monitoring platform using Spark Streaming, Flink, Kafka and Druid to handle billions of logs, improve alerting accuracy, reduce incident response time, and outline future intelligent monitoring enhancements.

Druid · Flink · Log Analytics
10 min read
iQIYI Technical Product Team
Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform that gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts, achieving 80 % faster incident investigation, detecting 90 % of member complaints early, and generating more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics
11 min read
Youzan Coder
Mar 6, 2020 · Backend Development

Full-Link Tracing System: Architecture, Java Agent Integration, Multi-language Support, and Data Processing

Youzan’s full‑link tracing system combines a multi‑language SDK, Java Agent dynamic attachment, transparent upgrades, asynchronous context propagation, and a Spark‑based data pipeline that indexes traces in Elasticsearch and stores them in HBase, enabling real‑time diagnostics, log correlation, and future container‑level tracing expansion.

Distributed Tracing · Java Agent · Microservices
15 min read
Big Data Technology & Architecture
Nov 9, 2019 · Big Data

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data · Flink · Mini-Batch
17 min read
Big Data Technology & Architecture
Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic, a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Java · Kafka
8 min read
Beike Product & Technology
Sep 20, 2019 · Big Data

Understanding DStream Construction and Execution in Spark Streaming

This article explains how Spark Streaming's DStream abstraction is built from InputDStream through successive transform operators, details the internal ForEachDStream implementation, describes the job generation and scheduling workflow, and outlines how Beike's real‑time platform leverages these mechanisms for large‑scale streaming tasks.

Big Data · Dstream · Real-time Processing
10 min read
Big Data Technology & Architecture
Sep 11, 2019 · Big Data

Evolution of Zhihu's Real-Time Data Warehouse: From Spark Streaming 1.0 to Flink‑Based 2.0

This article details Zhihu's real‑time data warehouse evolution, describing the 1.0 Spark Streaming architecture, its limitations, and the 2.0 redesign that introduces Flink, layered data models, streaming and batch ETL, metric storage choices, and future roadmap for scalable, low‑latency analytics.

Flink · Lambda architecture · Spark Streaming
19 min read
58 Tech
Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL
11 min read
Xueersi Online School Tech Team
Sep 6, 2019 · Big Data

Real-Time Data Architecture, Evolution, and Applications at an Online School

The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.

Analytics · Big Data Architecture · Flink
14 min read
Big Data Technology & Architecture
Aug 7, 2019 · Big Data

Dynamic Variable Loading in Real-Time Stream Processing: Spark Streaming vs Flink Broadcast Mechanisms

Real-time streaming jobs require dynamic configuration loading without restarts, and this article compares two common approaches—polling pull and push control streams—examining Spark Streaming’s broadcast variables and Flink’s broadcast state, discussing their implementations, advantages, limitations, and practical considerations.

Broadcast Variable · Dynamic Configuration · Flink
10 min read
Big Data Technology & Architecture
Jun 24, 2019 · Big Data

Resetting Daily Page View Counters in Spark Streaming Using mapWithState and Timeout

The article explains how to use Spark Streaming's mapWithState operator to count product page views, addresses the issue of daily PV reset by presenting two solutions—restarting the streaming job via a scheduled script and configuring a StreamingContext timeout—plus an alternative approach using Redis for external state management.

Spark Streaming · daily reset · mapWithState
4 min read
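To make the daily-reset idea concrete, here is a minimal Python sketch of a per-key state update that restarts the count when the calendar day changes. It is a different technique from the restart script and timeout described in the summary, and it does not use the Spark API; the function name `update_pv` and the `(day, count)` state layout are assumptions for illustration.

```python
from datetime import datetime


def update_pv(state, event_time, views=1):
    """Return the new (day, count) state for one product.

    `state` is None for an unseen product, otherwise a (day, count) tuple.
    The count restarts whenever the event falls on a different calendar day.
    """
    day = event_time.date()
    if state is None or state[0] != day:
        # New product or new day: start counting from this event.
        return (day, views)
    return (day, state[1] + views)


# Usage: keep one state per product id and fold events into it.
pv_by_product = {}
events = [
    ("sku-1", datetime(2019, 6, 24, 23, 59, 0)),
    ("sku-1", datetime(2019, 6, 24, 23, 59, 30)),
    ("sku-1", datetime(2019, 6, 25, 0, 0, 1)),
]
for product_id, ts in events:
    pv_by_product[product_id] = update_pv(pv_by_product.get(product_id), ts)
```

Storing the day alongside the count means the reset happens as part of normal processing, at the cost of carrying one extra field per key.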
ITPUB
May 29, 2019 · Big Data

How to Build a Trillion-Scale Real-Time Data Platform: Lessons from DTCC 2019

In a DTCC 2019 keynote, Zhao Qun, director of big‑data platform at Percent Point, outlines the challenges of trillion‑scale real‑time analytics and presents a transparent, fine‑grained architecture built on Kafka, Spark Streaming, ClickHouse, HBase, Ceph and Elasticsearch, detailing design principles, component sizing, multi‑center deployment, performance testing and operational safeguards.

Big Data · Kafka · Real-time analytics
17 min read
Big Data Technology & Architecture
May 12, 2019 · Big Data

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

This article explains Spark Streaming’s architecture, core concepts such as DStream, windowing, and the two Kafka integration methods—Receiver-based and Direct approaches—detailing their configurations, memory implications, checkpointing, and best‑practice recommendations for reliable, high‑throughput real‑time data processing.

Big Data · Direct Approach · Receiver Approach
18 min read
Youzan Coder
Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Big Data · Flink · Spark Streaming
14 min read
DataFunTalk
Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data · Druid · Flink
9 min read
dbaplus Community
Feb 28, 2019 · Big Data

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.

ETL · Flink · Lambda architecture
19 min read
HomeTech
Jan 18, 2019 · Big Data

Data Mill: A Real‑Time Spark Streaming Framework for DSP Business Support

Data Mill is a Spark‑Streaming‑based real‑time computation framework that abstracts tasks as DataFrames, enables SQL‑driven development, and supports DSP business requirements by reducing latency to 15‑30 minutes while providing a scalable architecture, caching strategy, and automated fault handling.

Cache · DSP · Real‑Time Computing
10 min read
Didi Tech
Dec 18, 2018 · Big Data

Evolution and Architecture of Didi's Real-Time Computing Platform

From early self‑built Storm and Spark Streaming clusters to a unified YARN‑based Spark platform and finally a low‑latency Flink system with extended CEP and StreamSQL capabilities, Didi’s real‑time computing platform evolved through three stages, delivering multi‑tenant isolation, rich SQL processing, and dramatically reduced development costs.

Big Data · CEP · Flink
9 min read
DataFunTalk
Oct 14, 2018 · Big Data

Exploring Real-Time Data Warehouse Practices Based on HBase

The article details the evolution from an offline to a real‑time HBase data warehouse, covering business scenarios, the use of Maxwell for MySQL‑to‑Kafka ingestion, Phoenix for SQL access, CDH cluster tuning, monitoring, and several production case studies.

HBase · Kafka · Phoenix
14 min read
Tencent Cloud Developer
Sep 6, 2018 · Big Data

Real-Time Stream Computing: Concepts, Challenges, and Tencent Cloud Solutions

As mobile and IoT data surge, real-time stream computing—especially Flink’s low-latency, high-throughput, exactly-once engine—addresses challenges of latency, accuracy, and usability, and Tencent Cloud’s managed Flink service provides elastic, secure, integrated pipelines for applications ranging from online status monitoring to fraud detection and smart transportation.

Apache Storm · Big Data · Flink
30 min read
Meitu Technology
Aug 2, 2018 · Big Data

Spark Streaming vs Flink – Architecture, Scheduling & Fault Tolerance

This article compares Spark Streaming and Flink across runtime models, component roles, programming APIs, task scheduling, time semantics, dynamic Kafka partition detection, fault‑tolerance mechanisms, exactly‑once guarantees, and back‑pressure handling, providing code examples and practical insights for real‑time data processing.

Dynamic Partition Detection · Exactly-Once · Flink
23 min read
Beike Product & Technology
Jun 22, 2018 · Big Data

Beike Zhaofang's 秒X Real‑Time Analytics Platform: Architecture, Implementation, and Use Cases

The article details the design and deployment of the 秒X real‑time analytics platform at Beike Zhaofang, covering its background, Spark Streaming‑based architecture, fast configuration, data processing pipeline, monitoring, visualization, practical applications, and future development plans.

Druid · Elasticsearch · Real-time analytics
7 min read
Ctrip Technology
Jun 4, 2018 · Big Data

Real-Time Data Processing Frameworks and Kafka Practices at Ctrip Ticketing

This article examines Ctrip Ticketing's real-time data processing ecosystem, comparing batch and streaming frameworks such as Hadoop, Spark, Storm, Flink, and Spark Streaming, detailing Kafka deployment and configuration, and describing how these technologies are applied in production for log analysis, seat‑occupancy detection, and anti‑crawling.

Flink · Real-time Processing · Spark Streaming
12 min read
Beike Product & Technology
Mar 9, 2018 · Big Data

How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming

This article details Lianjia's journey of designing and implementing a low‑latency, stable real‑time computing platform using Spark Streaming on YARN, covering technical selection, architecture components, version compatibility challenges, exactly‑once semantics, graceful shutdown, Kafka tuning, and future enhancements.

Big Data · Exactly-Once · Kafka
11 min read
Ctrip Technology
Sep 20, 2017 · Big Data

Building a Real‑Time Computing Platform with Spark Streaming at Ctrip: Design, Implementation, and Lessons Learned

This article describes how Ctrip migrated its large‑scale real‑time platform from JStorm to Spark Streaming, detailing the architectural design, the Muise Spark Core encapsulation, operational metrics, encountered pitfalls, and future plans to adopt Flink and Beam for streaming workloads.

Big Data · Exactly-Once · Spark Streaming
22 min read
MaGe Linux Operations
Sep 11, 2017 · Big Data

How Big Data Can Revolutionize Operations Monitoring

This article explores applying big‑data thinking and platforms—such as Flume, Spark Streaming, and HBase—to operations monitoring, detailing data sources, metric categories, architecture design, implementation steps, and the benefits of a scalable, low‑code monitoring platform.

Big Data · Operations · Spark Streaming
10 min read
Qunar Tech Salon
Apr 21, 2017 · Big Data

Ensuring Exact‑Once Semantics in Spark Streaming with Kafka: Offline Repair and Data Deduplication Strategies

This article explains why Spark Streaming combined with Kafka can only guarantee at‑least‑once delivery, outlines the challenges of delayed and out‑of‑order events, and presents practical offline‑repair, deduplication, and output‑format techniques—including code examples—to achieve exact‑once semantics in big‑data pipelines.

Exact-Once · HBase · HDFS
11 min read
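One common way to get an exactly-once effect on top of at-least-once delivery is to make the write idempotent. The Python sketch below keys every record by its Kafka coordinates so that a replayed batch overwrites rather than duplicates. It is an illustration of the general deduplication idea, not the article's code; `IdempotentSink` and its in-memory dictionary are hypothetical stand-ins for a keyed store such as an HBase table.

```python
class IdempotentSink:
    """In-memory stand-in for a keyed store; illustrative only."""

    def __init__(self):
        self.rows = {}

    def write(self, topic, partition, offset, value):
        """Upsert one record; return True only the first time the key is seen."""
        key = (topic, partition, offset)
        is_new = key not in self.rows
        # Writing the same key again replaces the row instead of adding one.
        self.rows[key] = value
        return is_new


# Usage: replaying the same batch after a failure leaves the sink unchanged.
sink = IdempotentSink()
batch = [("orders", 0, 41, "a"), ("orders", 0, 42, "b"), ("orders", 1, 7, "c")]
first_pass = [sink.write(*record) for record in batch]
replay = [sink.write(*record) for record in batch]
```

The trade-off is that the sink must support keyed upserts; append-only outputs need a separate deduplication pass instead.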
Meituan Technology Team
Nov 4, 2016 · Big Data

Design and Implementation of a Low-Latency App Exception Monitoring Platform Using Spark Streaming, Kafka, and Elasticsearch

The paper presents a production‑grade, low‑cost mobile‑app exception monitoring platform built on Spark Streaming, Kafka, and Elasticsearch that achieves high availability through exactly‑once processing and checkpointing, minute‑level latency by decoupling raw and symbolized logs, high throughput via reservoir sampling, and dynamic scalability without code changes.

Big Data · Elasticsearch · Exception Monitoring
11 min read
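The reservoir sampling mentioned in this summary can be sketched in a few lines of Python. This is the textbook single-pass algorithm (often called Algorithm R), shown under the assumption that a fixed-size uniform sample of each batch is enough; the function name and parameters are illustrative and are not taken from the article.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Makes a single pass and keeps at most k items in memory, so it works
    for streams whose length is not known in advance.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: cap a burst of 100,000 log lines at 100 samples.
sample = reservoir_sample((f"log-{n}" for n in range(100_000)), 100,
                          random.Random(42))
```

Because memory use is bounded by `k`, a spike in log volume changes the sampling ratio rather than the load on downstream storage.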
dbaplus Community
Jul 5, 2016 · Operations

How to Transform Operations Monitoring with Big Data Thinking

This article explains how to apply big‑data concepts and platforms to operations monitoring, covering data sources, metric extraction from logs, architectural design with Flume, Spark Streaming and HBase, implementation steps, and the resulting benefits for scalability and rapid metric development.

Spark Streaming · log analysis
11 min read

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.

Data Platform · Hadoop · Spark
16 min read
21CTO
Sep 24, 2015 · Big Data

Comparing Apache Storm, Spark, and Samza: Which Real‑Time Stream Processor Fits Your Needs?

Apache Storm, Spark Streaming, and Samza are three open‑source, low‑latency, scalable distributed systems for real‑time data processing; this article outlines their architectures, key concepts, differences in data handling, state management, delivery guarantees, and typical use‑cases to help you choose the right framework.

Apache Samza · Apache Storm · Big Data
7 min read

Comparative Overview of Apache Storm, Spark Streaming, and Samza for Real-Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Samza, explains their architectures, common features, key differences such as delivery guarantees and state management, and provides guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Storm · Big Data · Comparison
8 min read