Tag

Spark Streaming

NetEase LeiHuo UX Big Data Technology
Aug 3, 2022 · Big Data

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing
7 min read

Big Data Technology Architecture
Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.

Alibaba Cloud · Data ingestion · Delta Lake
10 min read

58 Tech
Jun 10, 2020 · Big Data

Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

Big Data · Flink · Kafka
9 min read

DataFunTalk
Mar 8, 2020 · Big Data

Real-Time Log Monitoring and Alerting System for iQIYI Membership Services

This article describes how iQIYI built a real‑time, multi‑dimensional log monitoring platform using Spark Streaming, Flink, Kafka and Druid to handle billions of logs, improve alerting accuracy, reduce incident response time, and outline future intelligent monitoring enhancements.

Big Data · Druid · Flink
10 min read

iQIYI Technical Product Team
Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt its real‑time log monitoring platform. A Venus‑Agent gathers access, exception, Nginx, and front‑end logs and streams them through Kafka to Spark Streaming and Flink; metrics are stored in Druid and drive minute‑level host and business alerts. The platform made incident investigation 80% faster, detected 90% of member complaints early, and generated more than 4,800 actionable alerts.

Big Data · Druid · Flink
11 min read

Youzan Coder
Mar 6, 2020 · Backend Development

Full-Link Tracing System: Architecture, Java Agent Integration, Multi-language Support, and Data Processing

Youzan’s full‑link tracing system combines a multi‑language SDK, Java Agent dynamic attachment, transparent upgrades, asynchronous context propagation, and a Spark‑based data pipeline that indexes traces in Elasticsearch and stores them in HBase, enabling real‑time diagnostics, log correlation, and future container‑level tracing expansion.

Distributed Tracing · Java agent · OpenTracing
15 min read

360 Tech Engineering
Jan 16, 2020 · Big Data

Real-Time and Offline Integrated Solution for Channel Analysis Data Processing

This article presents an integrated real‑time and offline solution for a channel analysis system, detailing the challenges, the architecture, and an implementation built on Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and reporting minute‑level latency and high accuracy in its performance evaluations.

Big Data · Elasticsearch · Flink
10 min read

Beike Product & Technology
Sep 20, 2019 · Big Data

Understanding DStream Construction and Execution in Spark Streaming

This article explains how Spark Streaming's DStream abstraction is built from InputDStream through successive transform operators, details the internal ForEachDStream implementation, describes the job generation and scheduling workflow, and outlines how Beike's real‑time platform leverages these mechanisms for large‑scale streaming tasks.

DStream · Real-time Processing · Scala
10 min read

58 Tech
Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL
11 min read

Xueersi Online School Tech Team
Sep 6, 2019 · Big Data

Real-Time Data Architecture, Evolution, and Applications at an Online School

The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.

Flink · Real-time Streaming · Spark Streaming
14 min read

Big Data Technology Architecture
May 17, 2019 · Big Data

Optimizing Real-Time Kafka Writes in Spark Streaming Using a Broadcasted KafkaProducer

To speed up writing streaming data to Kafka, the article shows how to replace per‑partition KafkaProducer creation with a lazily initialized, broadcast producer in Scala, which cuts connection overhead and is reported to make writes tens of times faster.

Big Data · Broadcast Variable · Scala
3 min read

Youzan Coder
Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Flink · Spark Streaming · Youzan
14 min read

DataFunTalk
Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data · Druid · Flink
9 min read

Youzan Coder
Feb 1, 2019 · Big Data

Design and Implementation of Log Parsing for a Big Data Offline Task Platform

The article describes a log‑parsing feature for Youzan’s big‑data offline platform. It captures runtime logs from Hive, Spark, DataX, MapReduce, and HBase jobs, categorizes scheduling types, and extracts metrics such as read/write bytes, shuffle volume, and GC time. The logs are processed in real time by a Filebeat, Logstash, Kafka, and Spark Streaming pipeline, and the results are stored in Redis for monitoring, optimization, and resource‑usage ranking.

Resource Monitoring · Spark Streaming · YARN
7 min read

HomeTech
Jan 18, 2019 · Big Data

Data Mill: A Real‑Time Spark Streaming Framework for DSP Business Support

Data Mill is a Spark‑Streaming‑based real‑time computation framework that abstracts tasks as DataFrames, enables SQL‑driven development, and supports DSP business requirements by reducing latency to 15‑30 minutes while providing a scalable architecture, caching strategy, and automated fault handling.

Big Data · Cache · DSP
10 min read

Didi Tech
Dec 18, 2018 · Big Data

Evolution and Architecture of Didi's Real-Time Computing Platform

From early self‑built Storm and Spark Streaming clusters to a unified YARN‑based Spark platform and finally a low‑latency Flink system with extended CEP and StreamSQL capabilities, Didi’s real‑time computing platform evolved through three stages, delivering multi‑tenant isolation, rich SQL processing, and dramatically reduced development costs.

Big Data · CEP · Flink
9 min read

DataFunTalk
Oct 14, 2018 · Big Data

Exploring Real-Time Data Warehouse Practices Based on HBase

The article details the evolution from an offline to a real‑time HBase data warehouse, covering business scenarios, the use of Maxwell for MySQL‑to‑Kafka ingestion, Phoenix for SQL access, CDH cluster tuning, monitoring, and several production case studies.

Big Data · HBase · Kafka
14 min read

Tencent Cloud Developer
Sep 6, 2018 · Big Data

Real-Time Stream Computing: Concepts, Challenges, and Tencent Cloud Solutions

As mobile and IoT data surge, real-time stream computing—especially Flink’s low-latency, high-throughput, exactly-once engine—addresses challenges of latency, accuracy, and usability, and Tencent Cloud’s managed Flink service provides elastic, secure, integrated pipelines for applications ranging from online status monitoring to fraud detection and smart transportation.

Apache Storm · Big Data · Cloud Services
30 min read

Beike Product & Technology
Jun 22, 2018 · Big Data

Beike Zhaofang's 秒X Real‑Time Analytics Platform: Architecture, Implementation, and Use Cases

The article details the design and deployment of the 秒X real‑time analytics platform at Beike Zhaofang, covering its background, Spark Streaming‑based architecture, fast configuration, data processing pipeline, monitoring, visualization, practical applications, and future development plans.

Big Data · Druid · Elasticsearch
7 min read

Qunar Tech Salon
Jun 20, 2018 · Big Data

How Spark Streaming Submits Tasks: Internal Mechanics and Code Walkthrough

This article explains the internal workflow of Spark Streaming task submission, detailing how StreamingContext, DStream, receivers, and output operators are transformed into RDD jobs, and includes annotated Scala code examples that illustrate each step of the process.

Big Data · DStream · Real-time Processing
13 min read