Tagged articles

Spark

623 articles · Page 5 of 7

Jul 10, 2020 · Big Data

Creating a Test Table in Phoenix/HBase and Implementing a Custom Bitmap Aggregation Function in Spark

This tutorial demonstrates how to create a VARBINARY test table in HBase using Phoenix, serialize its data with RoaringBitmap, implement a custom Spark aggregation function to merge bitmap values, and query the table via Spark SQL, showcasing a practical big-data processing workflow.

Big Data · HBase · Phoenix

0 likes · 6 min read

Big Data Technology & Architecture

Jul 8, 2020 · Big Data

Using Spark SQL User-Defined Functions, Aggregate Functions, and Window Functions

This article demonstrates how to create and register custom scalar UDFs, untyped and type‑safe aggregate functions (UDAF and Aggregator) in Spark SQL, and how to apply window functions such as ROW_NUMBER, providing complete Scala code examples and execution results.

Aggregator · Big Data · SQL

0 likes · 16 min read

Programmer DD

Jul 7, 2020 · Big Data

How to Choose a Worthwhile Technology: Depth, Ecosystem, and Evolution

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help engineers decide which big‑data or stream‑processing technology (such as Hadoop, Spark, or Flink) is worth investing time in, and provides practical tips like using Google Trends and GitHub awesome lists.

Big Data · Flink · Hadoop

0 likes · 12 min read

Big Data Technology & Architecture

Jul 7, 2020 · Big Data

Analysis of Apache Spark's Unified Memory Management Model (Spark 2.2.1)

This article analyzes Apache Spark's executor-side memory management model, focusing on the UnifiedMemoryManager in Spark 2.2.1, detailing on‑heap and off‑heap memory regions, dynamic execution/storage memory sharing, task memory allocation, and practical configuration examples.

Big Data · Executor · Memory Management

0 likes · 10 min read

Big Data Technology Architecture

Jul 7, 2020 · Big Data

Spark Thrift Server: Introduction, Deployment Guide, Architecture, and Comparison with HiveServer2

This article introduces Spark Thrift Server, explains how to deploy it by copying configuration files and required JARs, details its architecture and SQL execution flow, compares it with HiveServer2, and discusses its advantages, limitations, and practical suitability.

HiveServer2 · SQL · Spark

0 likes · 7 min read

Big Data Technology & Architecture

Jul 5, 2020 · Big Data

Understanding Spark Memory Management: On‑heap, Off‑heap, and Unified Memory

This article provides a comprehensive overview of Spark's memory management, covering executor memory architecture, the differences between on‑heap and off‑heap memory, static versus unified memory managers, storage and execution memory handling, and practical guidelines for optimizing Spark applications.

Big Data · Executor · Memory Management

0 likes · 21 min read

DataFunTalk

Jul 1, 2020 · Artificial Intelligence

Architecture and Implementation of Autohome's Machine Learning Platform

The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.

AutoML · Machine Learning Platform · Spark

0 likes · 19 min read

Big Data Technology & Architecture

Jul 1, 2020 · Big Data

Overview of Spark SQL Adaptive Execution Optimization Engine

This article explains Spark SQL's Adaptive Execution engine, covering its background, dynamic plan adjustments, shuffle partition tuning, data skew mitigation techniques, and the key configuration parameters needed to enable and fine‑tune adaptive query execution for improved performance.

Adaptive Execution · Big Data · Configuration

0 likes · 7 min read

Big Data Technology & Architecture

Jun 19, 2020 · Big Data

Comparison of Flink and Spark in Standalone and YARN Deployment Modes

This article compares Apache Flink and Apache Spark in both standalone and YARN deployment modes, detailing their architecture, job scheduling differences, and specific configurations such as Flink’s yarn‑cluster and yarn‑session modes versus Spark’s yarn‑client and yarn‑cluster modes.

Big Data · Comparison · Flink

0 likes · 4 min read

Big Data Technology Architecture

Jun 15, 2020 · Big Data

Apache Hudi Copy‑On‑Write Tutorial: Core Concepts and Hands‑On Spark Implementation

This article introduces Apache Hudi’s core concepts and demonstrates how to operate in Copy‑On‑Write mode on a Spark‑based data lake, covering prerequisites, table types, configuration properties, upsert, incremental queries, and record deletion with Scala code examples.

Apache Hudi · Copy-On-Write · Scala

0 likes · 14 min read

DataFunTalk

Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data · Data Engineering · Spark

0 likes · 19 min read

iQIYI Technical Product Team

Jun 12, 2020 · Artificial Intelligence

Deepthought: An End‑to‑End Machine Learning Platform at iQIYI

Deepthought is iQIYI’s end‑to‑end machine‑learning platform that unifies distributed frameworks, decouples pipeline stages, integrates with Tongtian Tower, and offers visual drag‑and‑drop configuration. It evolved from a fraud‑detection prototype into a generic system with real‑time inference, automated hyper‑parameter optimization, and support for large‑scale data across anti‑fraud, recommendation, and analytics workloads.

AI platform · AutoML · Data Engineering

0 likes · 13 min read

Big Data Technology & Architecture

Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big Data · Dstream · Scala

0 likes · 17 min read

Big Data Technology Architecture

May 31, 2020 · Big Data

Applying Apache Hudi in Medical Big Data: Architecture, Synchronization, Storage Choices, and Future Directions

This article examines the use of Apache Hudi for building a hospital‑wide medical big‑data platform, covering construction background, reasons for selecting Hudi, data synchronization methods, storage mode choices, query optimizations, and future development considerations.

Apache Hudi · Copy-on-Write · Data synchronization

0 likes · 7 min read

Big Data Technology & Architecture

May 28, 2020 · Big Data

From SQL to RDD: Understanding Spark's Internal Architecture

This article explains how Spark converts SQL queries into RDD operations by creating a SparkSession, registering temporary views, executing SQL, and then detailing the underlying InternalRow, TreeNode, and Expression structures that power the Catalyst optimizer.

Catalyst · InternalRow · RDD

0 likes · 5 min read

Big Data Technology Architecture

May 27, 2020 · Big Data

Why Spark Outperforms Hadoop MapReduce: In‑Memory Computing, Task Scheduling, and Execution Strategies

The article explains that Spark’s in‑memory processing, thread‑based task model, selective shuffle sorting, and flexible RDD/DAG architecture give it a significant performance advantage over Hadoop MapReduce’s disk‑heavy, process‑based batch execution.

Distributed Processing · MapReduce · Spark

0 likes · 4 min read

Architect

May 21, 2020 · Big Data

Parallel Execution of Multiple Spark Jobs to Optimize Resource Utilization and Reduce Parquet File Count

This article examines how to run several Spark jobs concurrently on a shared SparkContext, balancing full CPU‑vcore utilization with the need to generate fewer Parquet files, and presents practical experiments, scheduling strategies, and performance results.

Big Data · Job Scheduling · Parquet

0 likes · 12 min read

Big Data Technology & Architecture

May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a comprehensive step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, verification steps, and detailed solutions to common errors such as Zookeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop

0 likes · 26 min read

Big Data Technology Architecture

May 15, 2020 · Big Data

Performance Tuning of Hive on Spark in YARN Mode

This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.

Cluster Configuration · Hive · Performance Tuning

0 likes · 11 min read

Architect

May 12, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.

Apache Hudi · Big Data · Data Lake

0 likes · 11 min read

Big Data Technology Architecture

May 10, 2020 · Big Data

The Flourishing Big Data Ecosystem and the Rise of Delta Lake

The article reviews the evolution of the big‑data ecosystem from 2017 to 2019, highlights Spark’s dominance, examines storage‑layer challenges of traditional Hive‑based warehouses, and explains how Delta Lake’s metadata‑driven library simplifies architecture, adds ACID features, and competes with Hudi and Iceberg.

Delta Lake · Spark

0 likes · 8 min read

Top Architect

May 4, 2020 · Backend Development

Aloha: A Scala‑Based Distributed Task Scheduling Framework – Overview, Extensions, and Architecture

Aloha is a Scala‑implemented distributed task scheduling and management framework built on Spark that provides plug‑in extensions, high‑availability Master‑Worker architecture, custom event listeners, and a lightweight Scala‑based RPC layer for managing long‑running jobs such as Spark, Flink, and ETL tasks.

ALOHA · Distributed Scheduling · RPC

0 likes · 19 min read

21CTO

Apr 30, 2020 · Big Data

How to Choose a Worthwhile Technology: A Big Data Engineer’s 3‑Step Framework

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help professionals evaluate whether a technology is worth investing time in, illustrated with real‑world examples from Hadoop, Spark, and Flink.

Big Data · Career Advice · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Apr 28, 2020 · Big Data

Big Data Practice Exercises: Spark, Kafka, and MySQL Integration with Scala and Java

This article presents a series of hands‑on big‑data exercises, including Spark Scala data analysis, Kafka topic creation and custom partitioning, and MySQL table design with Scala‑based streaming calculations, providing complete source code and step‑by‑step solutions for each task.

Big Data · Scala · Spark

0 likes · 25 min read

Tencent Cloud Developer

Apr 28, 2020 · Big Data

Evolution of Ctrip Vacation Pricing Engine: Architecture, Challenges, and Optimizations

Ctrip’s vacation pricing engine evolved from a MySQL‑based synchronous queue to a Kafka‑driven, Spark‑parallelized architecture using HBase, cutting task generation time from five hours to 1.5 hours and raising price accuracy above 90% while handling billions of calculations and external API constraints.

Spark · distributed systems · kafka

0 likes · 18 min read

Python Programming Learning Circle

Apr 28, 2020 · Big Data

Multiple Ways to Create New Columns in PySpark DataFrames

This tutorial explains several techniques for adding new columns to PySpark DataFrames—including native Spark functions, user‑defined functions, RDD transformations, Pandas UDFs, and SQL queries—while demonstrating data loading, schema handling, and code examples for each method.

Big Data · Column Creation · PySpark

0 likes · 9 min read

Big Data Technology Architecture

Apr 24, 2020 · Big Data

Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture, reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines, auto‑tuning, global dictionary, fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Apache Kylin · Data Warehouse · Parquet

0 likes · 11 min read

Big Data Technology & Architecture

Apr 20, 2020 · Big Data

Using Window Functions in Spark SQL: Aggregation, Ranking, and Partitioning

This article introduces Spark SQL window functions, explains the difference between aggregation and window functions, and demonstrates how to use various ranking functions such as ROW_NUMBER, RANK, DENSE_RANK, and NTILE with practical Scala code examples and partitioning options.

Aggregation · Big Data · Ranking

0 likes · 9 min read

Python Programming Learning Circle

Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark

0 likes · 5 min read

Big Data Technology & Architecture

Apr 12, 2020 · Big Data

Understanding Spark and Flink RPC Implementations: A Code Reading Guide

This article explains how to read and compare the RPC implementations of Spark and Flink, covering Actor Model concepts, Akka integration, message handling, threading models, and practical code‑reading techniques while providing detailed code excerpts and architectural analysis.

Flink · RPC · Spark

0 likes · 32 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Spark Job Execution Principles and Parameter Tuning for Hive on Spark

This article explains how Spark jobs run on YARN, describes the impact of stages, shuffle and task parallelism, and provides detailed recommendations for tuning Spark executor, memory, core, and parallelism settings to dramatically improve Hive‑on‑Spark TPCx‑BB benchmark performance on large datasets.

Big Data · Hive · Spark

0 likes · 12 min read

Big Data Technology & Architecture

Mar 31, 2020 · Big Data

Comprehensive Spark Optimization Guide: Development, Resource, Skew, Shuffle, and Additional Tips

This article presents a detailed summary of Meituan's Spark optimization techniques, covering development‑level RDD tuning, resource parameter configuration, data‑skew mitigation, shuffle improvements, and the advantages of using DataFrame/Dataset APIs for better performance.

Big Data · Optimization · Performance Tuning

0 likes · 12 min read

58 Tech

Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Distributed Computing · Risk Detection

0 likes · 8 min read

dbaplus Community

Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop

0 likes · 19 min read

Big Data Technology Architecture

Mar 21, 2020 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Handling, and Shuffle Tuning

This article presents a complete guide to Spark performance optimization, covering development‑time best practices, resource‑parameter tuning, systematic detection and resolution of data skew, and detailed shuffle‑related parameter adjustments, all illustrated with Scala code examples.

Data Skew · Performance Optimization · Spark

0 likes · 67 min read

Big Data Technology Architecture

Mar 16, 2020 · Big Data

Understanding Apache Hudi: Concepts, Architecture, Usage, and Best Practices

This article introduces Apache Hudi, explains its architecture and storage models, describes how it enables upserts and incremental queries on Hadoop, provides step‑by‑step guidance for integrating Hudi with Apache Spark, and outlines best practices and comparisons with Apache Kudu.

Apache Hudi · Hadoop · Spark

0 likes · 10 min read

vivo Internet Technology

Mar 11, 2020 · Big Data

Understanding Spark Executor Memory Management and the Unified Memory Model

The article explains Spark’s executor memory layout under the UnifiedMemoryManager, detailing on‑heap and off‑heap divisions, the four memory regions, default fraction settings, how storage and execution memory share space, and provides heuristics and tuning tips for avoiding OOM and optimizing performance.

Executor · Memory Management · Performance Tuning

0 likes · 24 min read

Big Data Technology & Architecture

Mar 8, 2020 · Big Data

Hive on Spark Tuning Parameters and Best Practices

This article explains how to tune Hive on Spark by adjusting driver, executor, and Hive configuration parameters—including CPU cores, memory allocations, dynamic allocation, and join thresholds—to achieve optimal performance when running on YARN.

Big Data · Hive · Performance Tuning

0 likes · 7 min read

Qunar Tech Salon

Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Model Training

0 likes · 9 min read

Big Data Technology Architecture

Feb 19, 2020 · Big Data

Comparative Analysis of Hudi, Iceberg, and Delta Lake for Data Lake Storage

This article compares three open‑source data‑lake storage layers—Hudi, Iceberg, and Delta Lake—examining their shared reliance on meta‑files for schema and transaction handling, and detailing their differing designs for upserts, streaming support, query performance, and ecosystem integration.

Delta Lake · Hudi · Iceberg

0 likes · 13 min read

dbaplus Community

Feb 18, 2020 · Big Data

Building RAP: iQIYI’s Real‑Time Big Data Analytics Platform with Druid, Spark & Flink

The article details iQIYI’s RAP platform, describing its real‑time analytics requirements, architectural evolution from RAP 1.x to 2.x, core design steps, integration of Druid, Spark, Flink, and KIS, and showcases business use cases such as membership monitoring, recommendation evaluation, and smart‑TV alerting.

Druid · Flink · OLAP

0 likes · 14 min read

DataFunTalk

Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing

0 likes · 10 min read

Big Data Technology Architecture

Feb 1, 2020 · Big Data

Apache Hudi 0.5.1 Release Highlights and Upgrade Guide

The Apache Hudi 0.5.1 release introduces upgraded Spark, Avro, Parquet and Kafka dependencies, new Scala support, timeline layout changes, CLI enhancements, DeltaStreamer parameter updates, Kafka offset enum revisions, key‑generator package relocation, Hive sync options, dynamic Bloom filter, bulk‑insert support, and AWS cloud storage compatibility.

Apache Hudi · DeltaStreamer · Release

0 likes · 6 min read

Big Data Technology & Architecture

Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Performance Optimization

0 likes · 67 min read

Big Data Technology & Architecture

Jan 25, 2020 · Big Data

Spark Scala Example: Find the Most Frequent Visitor ID in a 500‑Million‑Record Dataset

This article demonstrates how to generate 500 million visitor IDs with Spark, use map‑reduce operations to count occurrences, and identify the ID with the highest visit count, while discussing performance considerations such as memory spilling and cluster resources.

Big Data · RDD · Scala

0 likes · 11 min read

Big Data Technology & Architecture

Jan 13, 2020 · Big Data

130 Essential Big Data and Distributed Systems Interview Questions

This article compiles 130 interview questions spanning big data technologies, distributed systems, and core computer science concepts to help candidates prepare for technical interviews, offering a comprehensive resource for self‑study and review.

Flink · Hadoop · Interview Questions

0 likes · 12 min read

DataFunTalk

Jan 10, 2020 · Big Data

Design and Evolution of iQIYI's Real-Time Analytics Platform (RAP)

The article details iQIYI's Real-Time Analysis Platform (RAP), describing its motivation, architecture evolution from RAP 1.x to 2.x, OLAP engine selection, product design workflow, integration of Druid KIS and Flink, enhanced diagnostics, and real-world applications in membership monitoring, recommendation evaluation, and smart TV alerting.

Druid · Flink · OLAP

0 likes · 12 min read

iQIYI Technical Product Team

Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache Druid · Big Data · Flink

0 likes · 13 min read

Big Data Technology & Architecture

Jan 7, 2020 · Big Data

Using HyperLogLog for High-Performance Pre-Aggregation in Big Data with Spark-Alchemy

The article explains how pre‑aggregation combined with the HyperLogLog algorithm and Spark‑Alchemy's native HLL functions can dramatically accelerate distinct‑count calculations in big‑data workloads while maintaining low error rates and cross‑system compatibility.

Approximate Distinct Count · Big Data · HyperLogLog

0 likes · 7 min read

Big Data Technology & Architecture

Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data · Spark · Streaming

0 likes · 42 min read

dbaplus Community

Jan 1, 2020 · Big Data

How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline

Facebook migrated a massive, multi‑stage Hive‑based entity ranking pipeline to a single Spark job, detailing the challenges of scaling to 20 TB inputs, the reliability fixes, performance optimizations, and the resulting 4‑6× CPU speedup and reduced latency.

Big Data · Hive · Reliability

0 likes · 16 min read

ITPUB

Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization

0 likes · 16 min read

vivo Internet Technology

Dec 25, 2019 · Big Data

Understanding and Mitigating Data Skew in Spark and Hadoop

Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.

Data Skew · Performance Optimization · Shuffle

0 likes · 18 min read

DataFunTalk

Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · Distributed Computing

0 likes · 13 min read

Big Data Technology & Architecture

Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management

0 likes · 11 min read

Youzan Coder

Dec 18, 2019 · Big Data

HBase Bulkload Practice at Youzan: From MapReduce to Spark Evolution

Youzan’s evolution of HBase bulk‑load—from manual MapReduce jobs to Hive‑SQL and finally Spark—demonstrates how generating HFiles on HDFS, partitioning by region, sorting keys, and handling serialization issues enables billions of records to be loaded efficiently without disrupting production clusters.

HBase · Hadoop · NoSQL

0 likes · 16 min read


Programmer DD

Dec 11, 2019 · Big Data

Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Big Data · Data Architecture · Data-Driven

0 likes · 12 min read


UCloud Tech

Dec 4, 2019 · Big Data

How to Evolve Big Data Architectures for ZB‑Scale Analytics and Real‑World Use Cases

This article reviews the challenges of handling Zettabyte‑scale data, outlines practical big‑data processing architectures, discusses storage‑compute separation, data‑driven workflows, risk‑control frameworks, and shares concrete Spark implementations at MobTech, offering actionable insights for modern data engineers.

Data Architecture · Spark · Storage Compute Separation

0 likes · 13 min read


Big Data Technology & Architecture

Dec 1, 2019 · Big Data

Dynamic Configuration Updates in Real-Time Streaming with Spark Broadcast Variables and Flink Broadcast State

This article explains how to dynamically update configuration data in real‑time Spark Streaming and Flink jobs using broadcast variables and broadcast state, providing Java code examples and discussing the limitations and practical considerations of each approach.

Big Data · Flink · Real-time Streaming

0 likes · 8 min read


Big Data Technology & Architecture

Nov 28, 2019 · Big Data

Resolving Unsupported Oracle Data Types in Spark SQL via Custom JdbcDialects

This article explains how to overcome Spark SQL's inability to handle certain Oracle data types, such as Timestamp with local timezone and FLOAT(126), by creating and registering a custom JdbcDialect that remaps unsupported types to compatible Spark types.

Big Data · Custom Dialect · ETL

0 likes · 8 min read


Meituan Technology Team

Nov 21, 2019 · Big Data

Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Big Data · Jupyter · Platform

0 likes · 19 min read


Big Data Technology & Architecture

Nov 17, 2019 · Big Data

Understanding Data Skew in Big Data Processing and Mitigation Strategies

Data skew, a common challenge in large-scale data processing where uneven key distribution leads to performance bottlenecks, is explored with examples from Hadoop, Spark, and Flink, alongside practical mitigation techniques such as hotspot key redesign, map‑side joins, and tuning framework parameters.

Flink · Hadoop · Spark

0 likes · 6 min read


DataFunTalk

Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Monitoring

0 likes · 14 min read


Architecture Digest

Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi

0 likes · 7 min read


Big Data Technology & Architecture

Nov 4, 2019 · Big Data

Understanding Spark Checkpoint: Purpose, Mechanism, and Best Practices

This article explains why Spark checkpoints are needed for large or complex RDD pipelines, how they work by persisting data to reliable storage such as HDFS, and outlines practical steps and best‑practice recommendations for using checkpoints effectively in production environments.

Big Data · Checkpoint · HDFS

0 likes · 6 min read


Big Data Technology & Architecture

Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle

0 likes · 7 min read


Big Data Technology & Architecture

Oct 28, 2019 · Big Data

Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing

This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.

Big Data · Data Architecture · HBase

0 likes · 7 min read


Big Data Technology & Architecture

Oct 20, 2019 · Big Data

Converting Spark RDD to DataSet/DataFrame: Two Methods and Handling Serialization Issues

This article explains two approaches—reflection‑based schema inference and programmatic schema definition—to transform a Spark RDD into a DataSet or DataFrame, demonstrates the required code, and discusses common Task‑not‑serializable errors with practical solutions.

Big Data · RDD · Scala

0 likes · 8 min read


Big Data Technology & Architecture

Oct 14, 2019 · Big Data

Optimizing Spark PageRank: Cache, Checkpoint, Data Skew, and Resource Utilization

This article presents a comprehensive analysis of Spark PageRank performance, detailing the algorithm's basics, the original example code, and four key optimizations—caching with checkpointing, memory‑efficient data structures, handling data skew, and maximizing executor and driver resource usage—backed by experimental results and practical recommendations.

Big Data · Cache · Checkpoint

0 likes · 18 min read


Ctrip Technology

Oct 11, 2019 · Artificial Intelligence

Intelligent Content Extraction and Generation Practices on Ctrip's Marco Polo AI Platform

This article details Ctrip's AI‑driven Marco Polo platform, describing how large‑scale NLP pipelines combine extraction, richness evaluation, semantic matching and deep‑learning generation (CopyNet, TA‑seq2seq) to produce high‑quality recommendation reasons across multiple product scenarios.

Content Extraction · NLP · Recommendation Systems

0 likes · 16 min read


58 Tech

Oct 10, 2019 · Big Data

Optimizing Real‑Time Feature Extraction at 58.com: Migrating from Spark Streaming to Flink

This article describes how 58.com’s commercial engineering team redesigned its real‑time feature‑mining pipeline—replacing a minute‑level Spark Streaming framework with Flink—to achieve sub‑second latency, higher throughput, stronger fault‑tolerance, and end‑to‑end exactly‑once semantics for user‑profile generation in the second‑hand‑car recommendation scenario.

Big Data · Exactly-once · Flink

0 likes · 14 min read


dbaplus Community

Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · HBase · Hadoop

0 likes · 17 min read


Big Data Technology & Architecture

Sep 6, 2019 · Big Data

Big Data Development Interview Guide and Skill Tree Overview

This article provides a comprehensive interview roadmap for big data developers, outlining essential Java fundamentals, JVM internals, Linux basics, distributed theory, core frameworks such as Hadoop, Spark, Flink, Kafka, Netty, HBase, Hive, and practical algorithm topics, while also offering resume and career advice for aspiring candidates.

Flink · Hadoop · Java

0 likes · 15 min read


360 Tech Engineering

Sep 4, 2019 · Big Data

XSQL: A Low‑Barrier, Stable Multi‑Data‑Source Distributed Query Engine

XSQL is an open‑source, low‑threshold, highly stable distributed query engine that supports federated queries across heterogeneous data sources, offering push‑down optimization, metadata decentralization, multi‑engine integration, and seamless deployment on Spark/YARN for real‑time big‑data analytics.

Big Data · Distributed Query · SQL Federation

0 likes · 14 min read


Tencent Cloud Developer

Aug 30, 2019 · Big Data

How Tencent Cloud Leverages Spark, ElasticSearch, and Flink for PB‑Scale Data Warehousing

The Cloud+ community and Kuaishou hosted a big‑data technology salon where experts detailed the evolution, architecture, and practical deployments of Spark‑based cloud data warehouses, ElasticSearch, YARN, and Flink, highlighting trends, optimization techniques, and future directions for enterprise data analytics.

Big Data · Cloud Computing · Data Warehouse

0 likes · 22 min read


Beike Product & Technology

Aug 29, 2019 · Big Data

TiSpark Integration with TiDB/TiKV for Efficient Data Synchronization and OLAP in the Databus Project

This article introduces TiSpark—an extension of Spark that tightly integrates with TiDB/TiKV to enable high‑performance, scalable data synchronization and OLAP queries, details its architecture, key configuration, performance advantages over Spark SQL and Sqoop, and outlines its role in the Databus data‑integration platform.

Big Data · Data Integration · Performance Optimization

0 likes · 10 min read


Huajiao Technology

Aug 27, 2019 · Artificial Intelligence

Mastering Collaborative Filtering: From Traditional Similarity to Deep Neural Models

This article provides a comprehensive technical overview of collaborative filtering, covering traditional user‑ and item‑based similarity methods, matrix‑factorization approaches for implicit feedback, various loss functions, and a suite of deep neural network models such as GMF, MLP, NeuMF, DMF, and ConvMF, together with implementation details, evaluation metrics, and practical deployment considerations.

Recommendation Systems · Spark · TensorFlow

0 likes · 29 min read


Meituan Technology Team

Aug 15, 2019 · Big Data

Inconsistent Predictions in XGBoost on Spark Due to Different Missing Value Handling

The discrepancy between XGBoost’s Java engine and Spark arose because XGBoost4j treats zero as the default missing value while Spark’s sparse vectors use NaN, causing inconsistent predictions. It was resolved by explicitly setting Float.NaN as the missing value or by converting sparse vectors to dense ones, so that both engines handle zeros uniformly.

Data Engineering · Spark · SparseVector

0 likes · 13 min read


Big Data Technology & Architecture

Aug 12, 2019 · Big Data

Spark SQL Parameter Tuning and Performance Optimization (Spark 2.3.2)

This article explains how to troubleshoot and tune Spark SQL configuration parameters—covering exception‑related settings such as spark.sql.hive.convertMetastoreParquet, file‑ignore options, and partition verification, as well as performance‑focused tweaks like broadcast join thresholds, adaptive execution, and parquet schema merging—while providing a comprehensive parameter reference table.

Big Data · Hive Migration · Performance Optimization

0 likes · 23 min read


DataFunTalk

Aug 9, 2019 · Big Data

Performance Optimization Techniques for Spark and Spark Streaming Applications

This article explains how to improve Spark and Spark Streaming performance by tuning serialization, broadcast variables, parallelism, batch intervals, memory usage, garbage collection, and Kafka integration, providing practical code examples and real‑world optimization results.

Broadcast Variables · Kryo · Memory optimization

0 likes · 32 min read


Big Data Technology & Architecture

Aug 3, 2019 · Big Data

Understanding SparkEnv Initialization: Components and Their Setup

This article walks through the SparkEnv initialization process in Apache Spark, detailing how the driver and executor environments are created, how key components such as SecurityManager, RpcEnv, SerializerManager, BroadcastManager, MapOutputTracker, ShuffleManager, MemoryManager, BlockManager, MetricsSystem, and OutputCommitCoordinator are instantiated, and how the final SparkEnv instance is assembled and stored.

Big Data · Distributed Computing · Scala

0 likes · 13 min read


dbaplus Community

Jul 30, 2019 · Big Data

Spark vs Flink: Which Real‑Time Engine Should You Choose for Kafka Streams?

With the surge in real‑time data from sensors and devices, choosing the right streaming engine is critical; this article compares Apache Spark and Apache Flink—examining their architectures, micro‑batch vs continuous processing, strengths, limitations, and use‑case suitability for Kafka‑driven pipelines.

Big Data · Flink · Spark

0 likes · 14 min read


dbaplus Community

Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · Data Engineering · ETL

0 likes · 15 min read


Tencent Cloud Developer

Jul 24, 2019 · Big Data

Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

The article explains how Tencent’s TGSpark leverages Spark DataSource V2 to create a custom source for TGMars storage, detailing shard‑aware design, push‑down of columns and filters, columnar batch loading, partition‑location reporting, and experimental results that show reduced shuffles and improved local computation when executor placement matches storage nodes.

Big Data · Column Pushdown · Custom Data Source

0 likes · 10 min read


Tencent Cloud Developer

Jul 18, 2019 · Big Data

Tencent iData Analysis Center: Why We Chose Spark as Our Computing Platform

Tencent’s iData analysis center selected Spark as its new computing platform because, unlike ElasticSearch, TiDB, and other MPP solutions, Spark offers iterative processing, shuffle support, robust SQL and DAG scheduling, and flexible SMP‑style data exchange, enabling efficient OLAP on billions of game‑user records.

Big Data · Data Platform · Distributed Computing

0 likes · 13 min read


Big Data Technology & Architecture

Jul 17, 2019 · Big Data

How to Write Spark DataFrames to Hive Tables and Partitions

This article explains how to persist Spark DataFrames into Hive tables and specific partitions, covering the relevant write APIs, the need to select a database, and providing step‑by‑step Scala code examples for both Spark 1.6 and Spark 2.x versions, along with Hive table creation syntax.

Big Data · Hive · SQL

0 likes · 10 min read


dbaplus Community

Jul 10, 2019 · Big Data

How Kuaishou Scales SQL on Hadoop: Architecture, Optimizations, and Lessons Learned

This article explains the SQL‑on‑Hadoop ecosystem—including Hive, Spark, Spark SQL, Presto and other solutions—then details Kuaishou's large‑scale platform architecture, performance bottlenecks, routing logic, high‑availability mechanisms, and a series of concrete optimizations that improve query speed, resource utilization, and operational stability.

High Availability · Hive · SQL on Hadoop

0 likes · 19 min read


58 Tech

Jul 2, 2019 · Artificial Intelligence

Magic Mirror: A Visual Data‑Intelligence Platform for Low‑Code Machine Learning

Magic Mirror is a big‑data‑based visual analytics platform that lowers the barrier of machine‑learning for non‑experts while accelerating expert workflows through visual UI, modular algorithms, distributed feature generation, and automated binary‑classification modeling.

Automated modeling · Big Data · Spark

0 likes · 9 min read


Big Data Technology & Architecture

Jun 22, 2019 · Backend Development

Understanding Back Pressure in Flink and Its Implementation

The article explains what back pressure is in Flink streaming jobs, why it occurs when data generation outpaces downstream consumption, how Flink monitors it via stack‑trace sampling, configurable parameters, Web UI visualization, and compares the approach with Spark Streaming's back pressure mechanism.

Flink · Spark · data pipelines

0 likes · 5 min read


Big Data Technology & Architecture

Jun 19, 2019 · Big Data

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

This article explains the design and implementation of Spark Structured Streaming's StateStore module, covering its distributed architecture, state sharding, versioning, batch read/write, migration, update/query APIs, maintenance compaction, and fault‑tolerance mechanisms that enable incremental continuous queries with exactly‑once guarantees.

Big Data · Spark · StateStore

0 likes · 8 min read


Big Data Technology & Architecture

Jun 9, 2019 · Big Data

Optimizing Spark Shuffle: Can Fetch, Efficient Fetch, and Reliable Fetch

This article analyzes three Spark shuffle bottlenecks—oversized partitions that exceed Netty's 2 GB limit, excessive retry latency caused by dead executors, and insufficient data‑corruption checks—and presents concrete configuration changes, new block identifiers, executor‑liveness checks, and CRC‑32 verification to improve fetchability, efficiency, and reliability at scale.

Big Data · Shuffle · Spark

0 likes · 18 min read


Big Data Technology & Architecture

Jun 5, 2019 · Big Data

Real-Time Advertising Click Counting with Spark Structured Streaming and Redis Streams

This article presents a complete solution for real‑time advertising click counting using Spark Structured Streaming combined with Redis Streams, detailing the business scenario, data flow, input/output formats, and step‑by‑step implementation including data extraction, processing, storage, and query via Spark‑SQL.

Big Data · Redis Stream · Scala

0 likes · 11 min read


DataFunTalk

Jun 3, 2019 · Big Data

Choosing a Real-Time Computing Engine Based on Kafka: Spark vs Flink

This article examines the need for real‑time computation, explains streaming versus real‑time concepts, and compares Apache Spark and Apache Flink—covering their architectures, micro‑batch and continuous processing, advantages, limitations, windowing, event‑time handling, and watermarks—to guide engine selection for Kafka‑driven workloads.

Flink · Spark · Streaming

0 likes · 15 min read


Big Data Technology & Architecture

Jun 1, 2019 · Big Data

Understanding Spark Executor Memory Management: On‑Heap, Off‑Heap, and Unified Memory

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap memory planning, static and unified memory managers, storage and execution memory allocation, RDD persistence, eviction policies, and shuffle memory usage, providing practical guidance for performance tuning.

Big Data · Executor · Memory Management

0 likes · 23 min read


Big Data Technology & Architecture

May 30, 2019 · Big Data

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, detection methods, and a comprehensive set of solutions—including Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies—to mitigate data skew in Spark jobs and improve performance.

Big Data · Data Skew · Shuffle

0 likes · 31 min read


Big Data Technology Architecture

May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · RDD

0 likes · 10 min read


Big Data Technology & Architecture

May 14, 2019 · Fundamentals

Zero‑Copy Data Transfer: Principles, Mechanisms, and Applications in Kafka and Spark

This article explains the traditional copy‑based data transmission process, introduces the zero‑copy technique—including basic sendfile(), scatter/gather DMA and mmap support—shows how it reduces context switches and copies, and demonstrates its practical use in Kafka and Spark for high‑throughput workloads.

Data Transfer · Java NIO · Spark

0 likes · 12 min read


Big Data Technology Architecture

Apr 27, 2019 · Big Data

Understanding Spark Memory Management: On‑Heap and Off‑Heap Planning and Allocation

This article explains Spark's memory management architecture, covering on‑heap and off‑heap memory planning, the MemoryManager interface, static versus unified memory allocation strategies, and how dynamic borrowing improves resource utilization for Spark executors.

Memory Management · Spark · Unified Memory Manager

0 likes · 11 min read


Big Data Technology Architecture

Apr 23, 2019 · Big Data

Understanding Spark Shuffle: Stages, Evolution, and Source Code Structure

This article explains the concept of Spark Shuffle, details its two-phase write and read processes, describes the evolution from Hash‑based to Sort‑based and Tungsten‑based shuffles across Spark versions, and outlines the relevant source‑code components in Spark 2.1.

Shuffle Evolution · Spark · Spark Internals

0 likes · 10 min read


Youzan Coder

Apr 12, 2019 · Industry Insights

How Youzan Scaled Its Log Platform to Handle Billions of Daily Logs

This article details Youzan's evolution from a simple Flume‑based log collector to a multi‑tenant, Kafka‑buffered, Spark‑processed, HBase‑backed logging architecture that now handles hundreds of billions of log entries per day, highlighting challenges, design decisions, and future improvements.

Elasticsearch · HBase · Spark

0 likes · 10 min read
