Tagged articles

Spark Streaming

53 articles · Page 1 of 1

Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data Warehouse · RDD · Spark

0 likes · 21 min read

NetEase LeiHuo UX Big Data Technology

Aug 3, 2022 · Big Data

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing

0 likes · 7 min read

Big Data Technology & Architecture

Mar 31, 2021 · Big Data

Real-time MySQL Incremental Data Processing with Canal, Spark Streaming, and Kafka

This article explains how to use Alibaba's Canal to capture MySQL binlog changes, forward them to Kafka, and process the incremental data in real time with Spark Streaming, including installation, configuration, client development, Spark code, testing, and troubleshooting dependency conflicts.

Canal · Real-time Processing · Scala

0 likes · 21 min read

Big Data Technology Architecture

Nov 23, 2020 · Big Data

One‑Stop Data Lake Ingestion Solution with Alibaba Cloud Data Lake Formation (DLF)

The article describes Alibaba Cloud's Data Lake Formation service, presenting a unified, real‑time, and low‑latency solution for ingesting heterogeneous data sources—including RDS, DTS, TableStore, and SLS—into an OSS‑backed data lake using templates, a Spark‑based ingestion engine, and modern file formats such as Delta Lake.

Alibaba Cloud · Delta Lake · Real-time Processing

0 likes · 10 min read

dbaplus Community

Sep 14, 2020 · Operations

How iQIYI Scaled Real‑Time Log Monitoring for 100M+ Users with Spark, Flink and Druid

Facing a surge to over 100 million members, iQIYI rebuilt its monitoring stack by ingesting four log types, adopting Spark Streaming, Flink and Druid for real‑time analysis, and optimizing resource usage, which cut incident resolution time by more than 80% while supporting billion‑level data volumes.

Druid · Flink · Operations

0 likes · 12 min read

Big Data Technology & Architecture

Aug 18, 2020 · Big Data

End-to-End Real-Time Web Log Processing with Flume, Kafka, Spark Streaming, HBase, and Spring Boot

This tutorial demonstrates how to generate simulated web access logs in Python, schedule them with Crontab, collect them in real time using Flume, forward them to Kafka, process the streams with Spark Streaming, store results in HBase, and visualize the data via a Spring Boot application with ECharts.

Big Data · ECharts · Flume

0 likes · 36 min read

Big Data Technology & Architecture

Aug 12, 2020 · Big Data

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.

Big Data · Flume · Hadoop

0 likes · 9 min read

Big Data Technology & Architecture

Jun 13, 2020 · Big Data

Achieving Exactly-Once Semantics in Kafka and Spark Streaming

This article explains the three message delivery semantics in distributed stream processing, compares Kafka‑Spark Streaming integration methods (receiver vs direct stream), and details how to achieve exactly‑once guarantees through idempotent or transactional writes, including code examples.

@Transactional · Big Data · Exactly-once

0 likes · 8 min read

58 Tech

Jun 10, 2020 · Big Data

Real‑time Data Warehouse Practices at 58 Tongcheng Bao: From Spark Streaming 1.0 to Flink‑based 2.0

This article details the evolution of 58 Tongcheng Bao's real‑time data warehouse, describing the initial Spark‑Streaming architecture, its limitations, and the redesign using Flink with a layered ODS‑DWD‑DWS‑APP model, data‑quality monitoring, join techniques, and the resulting improvements in latency and accuracy.

Big Data · Data Quality · Flink

0 likes · 9 min read

Big Data Technology & Architecture

Apr 19, 2020 · Big Data

Understanding the Backpressure Mechanism in Spark Streaming

This article explains Spark Streaming's backpressure mechanism, detailing how batch intervals can cause data accumulation, the role of Receivers versus DirectKafkaInputDStream, configuration to enable backpressure, and the internal workings of RateController, ReceiverRateController, ReceiverSupervisor, BlockGenerator, and rate calculations for Kafka streams.

Big Data · RateController · Receiver

0 likes · 12 min read

dbaplus Community

Mar 9, 2020 · Artificial Intelligence

How LSTM‑Powered Real‑Time Alerting with Spark Streaming Boosts Ops Efficiency

This article details a deep‑learning‑driven, real‑time alert system that combines TensorFlow LSTM time‑series forecasting with Spark Streaming to achieve high‑coverage, low‑latency anomaly detection for large‑scale data‑ops environments, including data preprocessing, metric classification, model training, and deployment pipelines.

AI Ops · Anomaly Detection · LSTM

0 likes · 18 min read

DataFunTalk

Mar 8, 2020 · Big Data

Real-Time Log Monitoring and Alerting System for iQIYI Membership Services

This article describes how iQIYI built a real‑time, multi‑dimensional log monitoring platform using Spark Streaming, Flink, Kafka and Druid to handle billions of logs, improve alerting accuracy, reduce incident response time, and outline future intelligent monitoring enhancements.

Druid · Flink · Log Analytics

0 likes · 10 min read

iQIYI Technical Product Team

Mar 6, 2020 · Big Data

Real-Time Log Monitoring and Alerting for iQIYI Membership Services

To support over 100 million iQIYI members, the team rebuilt a real‑time log monitoring platform that gathers access, exception, Nginx and front‑end logs via a Venus‑Agent, streams them through Kafka to Spark Streaming and Flink, stores metrics in Druid, and provides minute‑level host and business alerts, achieving 80% faster incident investigation, detecting 90% of member complaints early, and generating more than 4,800 actionable alerts.

Big Data · Flink · Log Analytics

0 likes · 11 min read

Youzan Coder

Mar 6, 2020 · Backend Development

Full-Link Tracing System: Architecture, Java Agent Integration, Multi-language Support, and Data Processing

Youzan’s full‑link tracing system combines a multi‑language SDK, Java Agent dynamic attachment, transparent upgrades, asynchronous context propagation, and a Spark‑based data pipeline that indexes traces in Elasticsearch and stores them in HBase, enabling real‑time diagnostics, log correlation, and future container‑level tracing expansion.

Distributed Tracing · Java Agent · Microservices

0 likes · 15 min read

360 Tech Engineering

Jan 16, 2020 · Big Data

Real-Time and Offline Integrated Solution for Channel Analysis Data Processing

This article presents a comprehensive real‑time and offline integrated solution for a channel analysis system, detailing challenges, architecture, implementation using Flink, Spark Streaming, Kafka, Elasticsearch, and Hive, and demonstrating minute‑level latency and high accuracy through performance evaluations.

Big Data · Elasticsearch · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Nov 9, 2019 · Big Data

Comparative Study of Apache Flink and Spark Streaming at Xiaomi: Architecture, Performance, and Serialization

This article examines Xiaomi's migration from Spark Streaming to Apache Flink, comparing scheduling strategies, mini‑batch versus true streaming, resource utilization, latency, and serialization mechanisms, and concludes with practical insights and custom optimization techniques for large‑scale data processing.

Big Data · Flink · Mini-Batch

0 likes · 17 min read

Big Data Technology & Architecture

Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic, a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Cassandra · Java

0 likes · 8 min read

Beike Product & Technology

Sep 20, 2019 · Big Data

Understanding DStream Construction and Execution in Spark Streaming

This article explains how Spark Streaming's DStream abstraction is built from InputDStream through successive transform operators, details the internal ForEachDStream implementation, describes the job generation and scheduling workflow, and outlines how Beike's real‑time platform leverages these mechanisms for large‑scale streaming tasks.

Big Data · Dstream · Real-time Processing

0 likes · 10 min read

Big Data Technology & Architecture

Sep 11, 2019 · Big Data

Evolution of Zhihu's Real-Time Data Warehouse: From Spark Streaming 1.0 to Flink‑Based 2.0

This article details Zhihu's real‑time data warehouse evolution, describing the 1.0 Spark Streaming architecture, its limitations, and the 2.0 redesign that introduces Flink, layered data models, streaming and batch ETL, metric storage choices, and future roadmap for scalable, low‑latency analytics.

Flink · Lambda architecture · Real-Time Data Warehouse

0 likes · 19 min read

58 Tech

Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL

0 likes · 11 min read

Xueersi Online School Tech Team

Sep 6, 2019 · Big Data

Real-Time Data Architecture, Evolution, and Applications at an Online School

The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.

Analytics · Big Data Architecture · Flink

0 likes · 14 min read

Big Data Technology & Architecture

Aug 7, 2019 · Big Data

Dynamic Variable Loading in Real-Time Stream Processing: Spark Streaming vs Flink Broadcast Mechanisms

Real-time streaming jobs require dynamic configuration loading without restarts, and this article compares two common approaches—polling pull and push control streams—examining Spark Streaming’s broadcast variables and Flink’s broadcast state, discussing their implementations, advantages, limitations, and practical considerations.

Broadcast Variable · Flink · Spark Streaming

0 likes · 10 min read

Big Data Technology & Architecture

Jun 24, 2019 · Big Data

Resetting Daily Page View Counters in Spark Streaming Using mapWithState and Timeout

The article explains how to use Spark Streaming's mapWithState operator to count product page views, addresses the issue of daily PV reset by presenting two solutions—restarting the streaming job via a scheduled script and configuring a StreamingContext timeout—plus an alternative approach using Redis for external state management.

Redis · Spark Streaming · daily reset

0 likes · 4 min read

ITPUB

May 29, 2019 · Big Data

How to Build a Trillion-Scale Real-Time Data Platform: Lessons from DTCC 2019

In a DTCC 2019 keynote, Zhao Qun, director of big‑data platform at Percent Point, outlines the challenges of trillion‑scale real‑time analytics and presents a transparent, fine‑grained architecture built on Kafka, Spark Streaming, ClickHouse, HBase, Ceph and Elasticsearch, detailing design principles, component sizing, multi‑center deployment, performance testing and operational safeguards.

Big Data · Spark Streaming · architecture

0 likes · 17 min read

Big Data Technology Architecture

May 17, 2019 · Big Data

Optimizing Real-Time Kafka Writes in Spark Streaming Using a Broadcasted KafkaProducer

To improve the performance of writing streaming data to Kafka, the article demonstrates how to replace per-partition KafkaProducer creation with a lazily initialized, broadcasted producer in Scala, reducing overhead and achieving a speedup of several dozen times.

Broadcast Variable · Performance Optimization · Scala

0 likes · 3 min read

Big Data Technology & Architecture

May 12, 2019 · Big Data

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

This article explains Spark Streaming’s architecture, core concepts such as DStream, windowing, and the two Kafka integration methods—Receiver-based and Direct approaches—detailing their configurations, memory implications, checkpointing, and best‑practice recommendations for reliable, high‑throughput real‑time data processing.

Big Data · Direct Approach · Receiver Approach

0 likes · 18 min read

Youzan Coder

Mar 20, 2019 · Big Data

Evolution of Real-Time Computing at Youzan: From Storm to Flink and Future Directions

Youzan’s real‑time computing platform progressed from early Storm deployments through Spark Streaming to a Flink‑based architecture, adding unified task management, monitoring, and dedicated streaming clusters, while now pursuing SQL‑driven jobs, a Druid OLAP engine, and a future real‑time data warehouse.

Big Data · Flink · Spark Streaming

0 likes · 14 min read

DataFunTalk

Mar 7, 2019 · Big Data

Design and Evolution of Didi's Real‑Time Data Computing Platform

The article details how Didi built and iterated its real‑time data platform, describing the shift from MySQL‑based batch processing to a Kafka‑Samza‑Druid architecture with Spark Streaming and Flink, the challenges addressed, and the current capabilities and operational metrics.

Big Data · Druid · Flink

0 likes · 9 min read

dbaplus Community

Feb 28, 2019 · Big Data

How Zhihu Built a Real-Time Data Warehouse: From Spark Streaming to Flink

This article details Zhihu's evolution of its real-time data warehouse, covering the 1.0 version built on Spark Streaming, the 2.0 upgrade using Flink Streaming SQL, architectural layers, ETL processes, and future directions such as streaming SQL platformization and automated result validation.

ETL · Flink · Lambda architecture

0 likes · 19 min read

HomeTech

Jan 18, 2019 · Big Data

Data Mill: A Real‑Time Spark Streaming Framework for DSP Business Support

Data Mill is a Spark‑Streaming‑based real‑time computation framework that abstracts tasks as DataFrames, enables SQL‑driven development, and supports DSP business requirements by reducing latency to 15‑30 minutes while providing a scalable architecture, caching strategy, and automated fault handling.

Cache · DSP · Real-Time Computing

0 likes · 10 min read

dbaplus Community

Jan 9, 2019 · Big Data

How Didi Built a Scalable Real‑Time Computing Platform with Spark, Flink, and StreamSQL

This article outlines Didi's journey from fragmented, self‑built real‑time clusters to a unified, YARN‑managed platform that leverages Spark Streaming, Flink, and StreamSQL, detailing architectural choices, resource isolation, CEP enhancements, and the resulting impact on latency‑critical services.

CEP · Flink · Real-time Streaming

0 likes · 10 min read

Didi Tech

Dec 18, 2018 · Big Data

Evolution and Architecture of Didi's Real-Time Computing Platform

From early self‑built Storm and Spark Streaming clusters to a unified YARN‑based Spark platform and finally a low‑latency Flink system with extended CEP and StreamSQL capabilities, Didi’s real‑time computing platform evolved through three stages, delivering multi‑tenant isolation, rich SQL processing, and dramatically reduced development costs.

Big Data · CEP · Flink

0 likes · 9 min read

DataFunTalk

Oct 14, 2018 · Big Data

Exploring Real-Time Data Warehouse Practices Based on HBase

The article details the evolution from an offline to a real‑time HBase data warehouse, covering business scenarios, the use of Maxwell for MySQL‑to‑Kafka ingestion, Phoenix for SQL access, CDH cluster tuning, monitoring, and several production case studies.

HBase · Phoenix · Real-Time Data Warehouse

0 likes · 14 min read

Tencent Cloud Developer

Sep 6, 2018 · Big Data

Real-Time Stream Computing: Concepts, Challenges, and Tencent Cloud Solutions

As mobile and IoT data surge, real-time stream computing—especially Flink’s low-latency, high-throughput, exactly-once engine—addresses challenges of latency, accuracy, and usability, and Tencent Cloud’s managed Flink service provides elastic, secure, integrated pipelines for applications ranging from online status monitoring to fraud detection and smart transportation.

Apache Storm · Big Data · Flink

0 likes · 30 min read

Meitu Technology

Aug 2, 2018 · Big Data

Spark Streaming vs Flink – Architecture, Scheduling & Fault Tolerance

This article compares Spark Streaming and Flink across runtime models, component roles, programming APIs, task scheduling, time semantics, dynamic Kafka partition detection, fault‑tolerance mechanisms, exactly‑once guarantees, and back‑pressure handling, providing code examples and practical insights for real‑time data processing.

Dynamic Partition Detection · Exactly-once · Flink

0 likes · 23 min read

Beike Product & Technology

Jun 22, 2018 · Big Data

Beike Zhaofang's 秒X Real‑Time Analytics Platform: Architecture, Implementation, and Use Cases

The article details the design and deployment of the 秒X real‑time analytics platform at Beike Zhaofang, covering its background, Spark Streaming‑based architecture, fast configuration, data processing pipeline, monitoring, visualization, practical applications, and future development plans.

Druid · Elasticsearch · Spark Streaming

0 likes · 7 min read

Qunar Tech Salon

Jun 20, 2018 · Big Data

How Spark Streaming Submits Tasks: Internal Mechanics and Code Walkthrough

This article explains the internal workflow of Spark Streaming task submission, detailing how StreamingContext, DStream, receivers, and output operators are transformed into RDD jobs, and includes annotated Scala code examples that illustrate each step of the process.

Big Data · Dstream · Real-time Processing

0 likes · 13 min read

Ctrip Technology

Jun 4, 2018 · Big Data

Real-Time Data Processing Frameworks and Kafka Practices at Ctrip Ticketing

This article examines Ctrip Ticketing's real-time data processing ecosystem, comparing batch and streaming frameworks such as Hadoop, Spark, Storm, Flink, and Spark Streaming, detailing Kafka deployment and configuration, and describing how these technologies are applied in production for log analysis, seat‑occupancy detection, and anti‑crawling.

Flink · Real-time Processing · Spark Streaming

0 likes · 12 min read

Architecture Digest

May 28, 2018 · Big Data

Building a Real-Time Stream Processing Platform with Hadoop Ecosystem (Kafka, Spark Streaming, HBase)

This guide details how to construct a real-time data processing platform on CentOS 7 using the Hadoop ecosystem—installing and configuring Zookeeper, Maven, Hadoop, Kafka, HBase, Spark, and Flume—followed by a Spark Streaming job that consumes Kafka messages and writes them into HBase.

Big Data · Flume · HBase

0 likes · 14 min read

Beike Product & Technology

Mar 9, 2018 · Big Data

How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming

This article details Lianjia's journey of designing and implementing a low‑latency, stable real‑time computing platform using Spark Streaming on YARN, covering technical selection, architecture components, version compatibility challenges, exactly‑once semantics, graceful shutdown, Kafka tuning, and future enhancements.

Big Data · Exactly-once · Real-time Processing

0 likes · 11 min read

Suning Technology

Mar 9, 2018 · Big Data

How Suning Built a Scalable Real-Time Log Analysis Platform with Spark Streaming

Suning’s real‑time log analysis system integrates Flume, Kafka, Storm and Spark Streaming to collect, cleanse, and compute metrics like NDCG, ensuring low latency, high throughput, exactly‑once processing, and robust data safety while supporting multi‑dimensional analytics on massive online‑offline traffic.

Big Data · Data Quality · NDCG

0 likes · 12 min read

Ctrip Technology

Feb 28, 2018 · Big Data

Using Alluxio to Mitigate HDFS Maintenance Impact on Real-Time Jobs in Ctrip's Big Data Platform

The article explains how Ctrip's big‑data platform introduced Alluxio to isolate real‑time Spark Streaming jobs from HDFS NameNode maintenance, reduce NameNode pressure, improve Spark SQL performance, and provide a unified storage layer across multiple HDFS clusters.

Alluxio · Big Data · Data Lake

0 likes · 9 min read

Ctrip Technology

Sep 20, 2017 · Big Data

Building a Real‑Time Computing Platform with Spark Streaming at Ctrip: Design, Implementation, and Lessons Learned

This article describes how Ctrip migrated its large‑scale real‑time platform from JStorm to Spark Streaming, detailing the architectural design, the Muise Spark Core encapsulation, operational metrics, encountered pitfalls, and future plans to adopt Flink and Beam for streaming workloads.

Big Data · Exactly-once · Spark Streaming

0 likes · 22 min read

MaGe Linux Operations

Sep 11, 2017 · Big Data

How Big Data Can Revolutionize Operations Monitoring

This article explores applying big‑data thinking and platforms—such as Flume, Spark Streaming, and HBase—to operations monitoring, detailing data sources, metric categories, architecture design, implementation steps, and the benefits of a scalable, low‑code monitoring platform.

Big Data · Operations · Spark Streaming

0 likes · 10 min read

Qunar Tech Salon

Apr 21, 2017 · Big Data

Ensuring Exact‑Once Semantics in Spark Streaming with Kafka: Offline Repair and Data Deduplication Strategies

This article explains why Spark Streaming combined with Kafka can only guarantee at‑least‑once delivery, outlines the challenges of delayed and out‑of‑order events, and presents practical offline‑repair, deduplication, and output‑format techniques—including code examples—to achieve exact‑once semantics in big‑data pipelines.

Exact-Once · HBase · HDFS

0 likes · 11 min read

Meituan Technology Team

Nov 4, 2016 · Big Data

Design and Implementation of a Low-Latency App Exception Monitoring Platform Using Spark Streaming, Kafka, and Elasticsearch

The paper presents a production‑grade, low‑cost mobile‑app exception monitoring platform built on Spark Streaming, Kafka, and Elasticsearch that achieves high availability through exactly‑once processing and checkpointing, minute‑level latency by decoupling raw and symbolized logs, high throughput via reservoir sampling, and dynamic scalability without code changes.

Big Data · Elasticsearch · Exception Monitoring

0 likes · 11 min read

Liulishuo Tech Team

Oct 17, 2016 · Big Data

Practical Tips and Common Pitfalls for Tuning Apache Spark Performance

This article shares hands‑on experience from Spark Summit attendees, covering why Spark is powerful, common performance problems such as slow jobs, OOM, data skew, excessive partitions, and provides concrete tuning advice on executors, cores, memory, and debugging techniques.

Apache Spark · Big Data · Data Skew

0 likes · 11 min read

dbaplus Community

Jul 5, 2016 · Operations

How to Transform Operations Monitoring with Big Data Thinking

This article explains how to apply big‑data concepts and platforms to operations monitoring, covering data sources, metric extraction from logs, architectural design with Flume, Spark Streaming and HBase, implementation steps, and the resulting benefits for scalability and rapid metric development.

Spark Streaming · log analysis

0 likes · 11 min read

Art of Distributed System Architecture Design

Oct 29, 2015 · Big Data

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.

Data Platform · Hadoop · Spark

0 likes · 16 min read

21CTO

Sep 24, 2015 · Big Data

Comparing Apache Storm, Spark, and Samza: Which Real‑Time Stream Processor Fits Your Needs?

Apache Storm, Spark Streaming, and Samza are three open‑source, low‑latency, scalable distributed systems for real‑time data processing; this article outlines their architectures, key concepts, differences in data handling, state management, delivery guarantees, and typical use‑cases to help you choose the right framework.

Apache Samza · Apache Storm · Big Data

0 likes · 7 min read

Art of Distributed System Architecture Design

Sep 24, 2015 · Big Data

Comparative Overview of Apache Storm, Spark Streaming, and Samza for Real-Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Samza, explains their architectures, common features, key differences such as delivery guarantees and state management, and provides guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Storm · Big Data · Comparison

0 likes · 8 min read

Art of Distributed System Architecture Design

Apr 24, 2015 · Big Data

Pinterest Real-Time Data Pipeline Using Kafka, Spark, and MemSQL

Pinterest built a real‑time data pipeline that streams user engagement events through Apache Kafka into Spark Streaming, enriches them with location and category information, and persists the results in MemSQL to enable fast, SQL‑based analytics for its recommendation engine.

Big Data · MemSQL · Pinterest

0 likes · 3 min read

Qunar Tech Salon

Mar 16, 2015 · Big Data

Comparison of Apache Storm, Spark Streaming, and Samza for Real‑Time Data Processing

This article introduces Apache Storm, Spark Streaming, and Apache Samza, outlines their architectures, highlights commonalities and differences such as delivery guarantees and state management, and offers guidance on selecting the most suitable framework for various real‑time big‑data use cases.

Apache Samza · Apache Storm · Big Data

0 likes · 8 min read
