Tagged articles
Big Data Technology & Architecture
May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a comprehensive step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, verification steps, and detailed solutions to common errors such as Zookeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop
0 likes · 26 min read
Big Data Technology Architecture
May 15, 2020 · Big Data

Performance Tuning of Hive on Spark in YARN Mode

This article explains how to optimize Hive on Spark running on YARN, covering YARN node resource configuration, Spark executor and driver memory settings, dynamic allocation, parallelism, and key Hive parameters to achieve superior performance compared to Hive on MapReduce.

Cluster Configuration · Hive · Spark
0 likes · 11 min read
Architect
May 12, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Concepts, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark and Hadoop‑compatible storage to provide efficient ingestion, incremental processing, and multiple query modes such as snapshot, incremental, and read‑optimized for large analytical datasets.

Apache Hudi · Big Data · Data Lake
0 likes · 11 min read
Big Data Technology Architecture
May 10, 2020 · Big Data

The Flourishing Big Data Ecosystem and the Rise of Delta Lake

The article reviews the evolution of the big‑data ecosystem from 2017 to 2019, highlights Spark’s dominance, examines storage‑layer challenges of traditional Hive‑based warehouses, and explains how Delta Lake’s metadata‑driven library simplifies architecture, adds ACID features, and competes with Hudi and Iceberg.

Delta Lake · Spark
0 likes · 8 min read
Top Architect
May 4, 2020 · Backend Development

Aloha: A Scala‑Based Distributed Task Scheduling Framework – Overview, Extensions, and Architecture

Aloha is a Scala‑implemented distributed task scheduling and management framework built on Spark that provides plug‑in extensions, high‑availability Master‑Worker architecture, custom event listeners, and a lightweight Scala‑based RPC layer for managing long‑running jobs such as Spark, Flink, and ETL tasks.

ALOHA · Backend · Distributed Scheduling
0 likes · 19 min read
Tencent Cloud Developer
Apr 28, 2020 · Big Data

Evolution of Ctrip Vacation Pricing Engine: Architecture, Challenges, and Optimizations

Ctrip’s vacation pricing engine evolved from a MySQL‑based synchronous queue to a Kafka‑driven, Spark‑parallelized architecture using HBase, cutting task generation time from five hours to 1.5 hours and raising price accuracy above 90% while handling billions of calculations and external API constraints.

Distributed Systems · Kafka · Spark
0 likes · 18 min read
Big Data Technology Architecture
Apr 24, 2020 · Big Data

Kyligence Kylin on Parquet: Architecture, Engine Design, and Performance Evaluation

The article introduces Kyligence's Kylin on Parquet solution, explains its plug‑in architecture, reasons for replacing HBase with Parquet, details the new Spark‑based build and query engines, auto‑tuning, global dictionary, fault‑tolerance features, and presents performance comparisons with Kylin 3.0.

Apache Kylin · Data Warehouse · Parquet
0 likes · 11 min read
Python Programming Learning Circle
Apr 16, 2020 · Big Data

Getting Started with PySpark: Creating SparkContext, Parallelizing Data, and Basic DataFrame Operations

This tutorial demonstrates how to initialize a SparkContext in PySpark, perform simple parallel computations such as temperature conversion and reduction, create a SparkSession to read CSV data, and apply common DataFrame operations like selecting columns, adding new columns, filtering, grouping, and aggregating.

Big Data · PySpark · Spark
0 likes · 5 min read
58 Tech
Mar 26, 2020 · Big Data

LPA-Detector: Distributed Label Propagation with Confidence Weights for Large‑Scale Graph Risk Detection

The article introduces LPA-Detector, an open‑source project that redesigns the Label Propagation Algorithm using Spark GraphX to add node confidence weights and relationship influence, achieving significant improvements in execution efficiency and detection accuracy for massive graph data in risk‑control scenarios.

Big Data · Risk Detection · Spark
0 likes · 8 min read
dbaplus Community
Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop
0 likes · 19 min read
Big Data Technology Architecture
Mar 21, 2020 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Handling, and Shuffle Tuning

This article presents a complete guide to Spark performance optimization, covering development‑time best practices, resource‑parameter tuning, systematic detection and resolution of data skew, and detailed shuffle‑related parameter adjustments, all illustrated with Scala code examples.

Data Skew · Performance Optimization · Spark
0 likes · 67 min read
Qunar Tech Salon
Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Model Training
0 likes · 9 min read
DataFunTalk
Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing
0 likes · 10 min read
Big Data Technology Architecture
Feb 1, 2020 · Big Data

Apache Hudi 0.5.1 Release Highlights and Upgrade Guide

The Apache Hudi 0.5.1 release introduces upgraded Spark, Avro, Parquet and Kafka dependencies, new Scala support, timeline layout changes, CLI enhancements, DeltaStreamer parameter updates, Kafka offset enum revisions, key‑generator package relocation, Hive sync options, dynamic Bloom filter, bulk‑insert support, and AWS cloud storage compatibility.

Apache Hudi · DeltaStreamer · Kafka
0 likes · 6 min read
Big Data Technology & Architecture
Jan 30, 2020 · Big Data

Comprehensive Guide to Spark Performance Optimization (Development, Resource, Data Skew, and Shuffle Tuning)

This article provides an in‑depth, step‑by‑step guide to optimizing Spark jobs, covering development‑time best practices, resource‑parameter tuning, data‑skew detection and mitigation techniques, and shuffle‑stage performance tweaks, complete with Scala code examples and practical recommendations.

Big Data · Data Skew · Performance Optimization
0 likes · 67 min read
DataFunTalk
Jan 10, 2020 · Big Data

Design and Evolution of iQIYI's Real-Time Analytics Platform (RAP)

The article details iQIYI's Real-Time Analysis Platform (RAP), describing its motivation, architecture evolution from RAP 1.x to 2.x, OLAP engine selection, product design workflow, integration of Druid KIS and Flink, enhanced diagnostics, and real-world applications in membership monitoring, recommendation evaluation, and smart TV alerting.

Druid · Flink · OLAP
0 likes · 12 min read
iQIYI Technical Product Team
Jan 9, 2020 · Big Data

Design and Evolution of iQIYI Real-Time Analysis Platform (RAP)

iQIYI’s Real‑Time Analysis Platform (RAP) combines Apache Druid with Spark/Flink to deliver minute‑level, low‑latency multidimensional analytics via a web wizard, supporting hundreds of streaming tasks and thousands of reports across membership, recommendation, and TV monitoring, while simplifying development and maintenance.

Apache Druid · Big Data · Flink
0 likes · 13 min read
Big Data Technology & Architecture
Jan 2, 2020 · Big Data

Structured Streaming: Design, Challenges, Programming Model, and Performance Evaluation

This article provides a comprehensive overview of Apache Spark Structured Streaming, describing its declarative API, the challenges of stream processing, the programming model with code examples, query planning, execution modes, production use cases, and performance benchmarks compared with other streaming systems.

Big Data · Spark · Streaming
0 likes · 42 min read
ITPUB
Dec 27, 2019 · Big Data

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

Big Data · Hive · Performance Optimization
0 likes · 16 min read
vivo Internet Technology
Dec 25, 2019 · Big Data

Understanding and Mitigating Data Skew in Spark and Hadoop

Data skew in Spark and Hadoop occurs when a few keys dominate shuffle traffic, causing slow tasks, OOM errors, and job failures; the article describes how to detect skew via UI metrics or sampling and offers mitigation tactics such as filtering keys, increasing shuffle partitions, custom partitioners, broadcast joins, salted keys, and Hadoop‑specific settings.

Data Skew · Partitioning · Performance Optimization
0 likes · 18 min read
DataFunTalk
Dec 24, 2019 · Big Data

Deep Dive into PySpark Implementation: Multi‑Process Architecture, Java Integration, RDD/SQL Interfaces, Executor Communication, and Pandas UDF

This article explains PySpark's multi‑process architecture, how the Python driver uses Py4J to call Java/Scala APIs, the implementation of RDD and DataFrame interfaces, executor‑side process communication and serialization with Arrow, and the design of Pandas UDFs, while also discussing current limitations and future directions.

Arrow · Big Data · PySpark
0 likes · 13 min read
Big Data Technology & Architecture
Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management
0 likes · 11 min read
Youzan Coder
Dec 18, 2019 · Big Data

HBase Bulkload Practice at Youzan: From MapReduce to Spark Evolution

Youzan’s evolution of HBase bulk‑load—from manual MapReduce jobs to Hive‑SQL and finally Spark—demonstrates how generating HFiles on HDFS, partitioning by region, sorting keys, and handling serialization issues enables billions of records to be loaded efficiently without disrupting production clusters.

HBase · Hadoop · NoSQL
0 likes · 16 min read
Programmer DD
Dec 11, 2019 · Big Data

Big Data Architecture Secrets: Storage-Compute Separation & Spark in Action

This article explores how enterprises can tackle the explosive growth of data by adopting modern big‑data architectures, including storage‑compute separation, data‑driven workflows, risk‑control frameworks, and real‑world Spark optimizations, offering practical guidance for scalable, high‑performance analytics.

Big Data · Data Architecture · Data-driven
0 likes · 12 min read
UCloud Tech
Dec 4, 2019 · Big Data

How to Evolve Big Data Architectures for ZB‑Scale Analytics and Real‑World Use Cases

This article reviews the challenges of handling Zettabyte‑scale data, outlines practical big‑data processing architectures, discusses storage‑compute separation, data‑driven workflows, risk‑control frameworks, and shares concrete Spark implementations at MobTech, offering actionable insights for modern data engineers.

Data Architecture · Spark · Storage Compute Separation
0 likes · 13 min read
Meituan Technology Team
Nov 21, 2019 · Big Data

Designing a Platformized Jupyter Service Integrated with Spark for Meituan

Meituan Homestay created a platform‑wide Jupyter service built on JupyterHub and Kubernetes that integrates Spark, scheduling, documentation and storage, providing seamless, reproducible notebooks with custom extensions, magics and container isolation to unify data analysis, model training and production workflows.

Big Data · Jupyter · Kubernetes
0 likes · 19 min read
DataFunTalk
Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Kafka
0 likes · 14 min read
Architecture Digest
Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi
0 likes · 7 min read
Big Data Technology & Architecture
Nov 3, 2019 · Big Data

Understanding Spark Shuffle and Smart Shuffle: Design, Implementation, and Performance Analysis

This article explains the evolution of Spark Shuffle from hash‑based to sort‑based, introduces the Smart Shuffle optimization, details their implementations and configurations, and presents performance comparisons using TPC‑DS benchmarks, highlighting significant speedups and reduced I/O overhead.

Big Data · Shuffle · Smart Shuffle
0 likes · 7 min read
Big Data Technology & Architecture
Oct 28, 2019 · Big Data

Big Data Technology and Architecture: Leveraging Spark and HBase for Real‑Time and Offline Processing

This article outlines the challenges of various big‑data scenarios such as financial risk control, recommendation systems, and social feeds, explains why Spark is chosen over alternatives, describes a one‑stop data platform architecture with Spark‑HBase integration, and shares best‑practice tips and case studies.

Big Data · Data Architecture · HBase
0 likes · 7 min read
Big Data Technology & Architecture
Oct 14, 2019 · Big Data

Optimizing Spark PageRank: Cache, Checkpoint, Data Skew, and Resource Utilization

This article presents a comprehensive analysis of Spark PageRank performance, detailing the algorithm's basics, the original example code, and four key optimizations—caching with checkpointing, memory‑efficient data structures, handling data skew, and maximizing executor and driver resource usage—backed by experimental results and practical recommendations.

Big Data · Cache · Checkpoint
0 likes · 18 min read
Ctrip Technology
Oct 11, 2019 · Artificial Intelligence

Intelligent Content Extraction and Generation Practices on Ctrip's Marco Polo AI Platform

This article details Ctrip's AI‑driven Marco Polo platform, describing how large‑scale NLP pipelines combine extraction, richness evaluation, semantic matching and deep‑learning generation (CopyNet, TA‑seq2seq) to produce high‑quality recommendation reasons across multiple product scenarios.

Content Extraction · NLP · Recommendation Systems
0 likes · 16 min read
58 Tech
Oct 10, 2019 · Big Data

Optimizing Real‑Time Feature Extraction at 58.com: Migrating from Spark Streaming to Flink

This article describes how 58.com’s commercial engineering team redesigned its real‑time feature‑mining pipeline—replacing a minute‑level Spark Streaming framework with Flink—to achieve sub‑second latency, higher throughput, stronger fault‑tolerance, and end‑to‑end exactly‑once semantics for user‑profile generation in the second‑hand‑car recommendation scenario.

Big Data · Exactly-Once · Flink
0 likes · 14 min read
dbaplus Community
Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · Cluster Management · HBase
0 likes · 17 min read
Big Data Technology & Architecture
Sep 6, 2019 · Big Data

Big Data Development Interview Guide and Skill Tree Overview

This article provides a comprehensive interview roadmap for big data developers, outlining essential Java fundamentals, JVM internals, Linux basics, distributed theory, core frameworks such as Hadoop, Spark, Flink, Kafka, Netty, HBase, Hive, and practical algorithm topics, while also offering resume and career advice for aspiring candidates.

Flink · Hadoop · Java
0 likes · 15 min read
360 Tech Engineering
Sep 4, 2019 · Big Data

XSQL: A Low‑Barrier, Stable Multi‑Data‑Source Distributed Query Engine

XSQL is an open‑source, low‑threshold, highly stable distributed query engine that supports federated queries across heterogeneous data sources, offering push‑down optimization, metadata decentralization, multi‑engine integration, and seamless deployment on Spark/YARN for real‑time big‑data analytics.

Big Data · Distributed Query · SQL Federation
0 likes · 14 min read
Tencent Cloud Developer
Aug 30, 2019 · Big Data

How Tencent Cloud Leverages Spark, ElasticSearch, and Flink for PB‑Scale Data Warehousing

The cloud+ community and Kuaishou hosted a big‑data technology salon where experts detailed the evolution, architecture, and practical deployments of Spark‑based cloud data warehouses, ElasticSearch, Yarn, and Flink, highlighting trends, optimization techniques, and future directions for enterprise data analytics.

Big Data · Data Warehouse · Elasticsearch
0 likes · 22 min read
Beike Product & Technology
Aug 29, 2019 · Big Data

TiSpark Integration with TiDB/TiKV for Efficient Data Synchronization and OLAP in the Databus Project

This article introduces TiSpark—an extension of Spark that tightly integrates with TiDB/TiKV to enable high‑performance, scalable data synchronization and OLAP queries, details its architecture, key configuration, performance advantages over Spark SQL and Sqoop, and outlines its role in the Databus data‑integration platform.

Big Data · Data Integration · Performance Optimization
0 likes · 10 min read
Huajiao Technology
Aug 27, 2019 · Artificial Intelligence

Mastering Collaborative Filtering: From Traditional Similarity to Deep Neural Models

This article provides a comprehensive technical overview of collaborative filtering, covering traditional user‑ and item‑based similarity methods, matrix‑factorization approaches for implicit feedback, various loss functions, and a suite of deep neural network models such as GMF, MLP, NeuMF, DMF, and ConvMF, together with implementation details, evaluation metrics, and practical deployment considerations.

Deep Learning · Recommendation Systems · Spark
0 likes · 29 min read
Meituan Technology Team
Aug 15, 2019 · Big Data

Inconsistent Predictions in XGBoost on Spark Due to Different Missing Value Handling

The discrepancy between XGBoost’s Java engine and Spark arose because XGBoost4j treats zero as the default missing value while Spark’s sparse vectors use NaN, causing inconsistent predictions, and was resolved by explicitly setting Float.NaN as the missing value or converting sparse vectors to dense so both engines handle zeros uniformly.

Spark · SparseVector · XGBoost
0 likes · 13 min read
Big Data Technology & Architecture
Aug 12, 2019 · Big Data

Spark SQL Parameter Tuning and Performance Optimization (Spark 2.3.2)

This article explains how to troubleshoot and tune Spark SQL configuration parameters—covering exception‑related settings such as spark.sql.hive.convertMetastoreParquet, file‑ignore options, and partition verification, as well as performance‑focused tweaks like broadcast join thresholds, adaptive execution, and parquet schema merging—while providing a comprehensive parameter reference table.

Big Data · Hive Migration · Parameter Tuning
0 likes · 23 min read
DataFunTalk
Aug 9, 2019 · Big Data

Performance Optimization Techniques for Spark and Spark Streaming Applications

This article explains how to improve Spark and Spark Streaming performance by tuning serialization, broadcast variables, parallelism, batch intervals, memory usage, garbage collection, and Kafka integration, providing practical code examples and real‑world optimization results.

Broadcast Variables · Kryo · Memory Optimization
0 likes · 32 min read
Big Data Technology & Architecture
Aug 3, 2019 · Big Data

Understanding SparkEnv Initialization: Components and Their Setup

This article walks through the SparkEnv initialization process in Apache Spark, detailing how the driver and executor environments are created, the key components such as SecurityManager, RpcEnv, SerializerManager, BroadcastManager, MapOutputTracker, ShuffleManager, MemoryManager, BlockManager, MetricsSystem, and OutputCommitCoordinator are instantiated, and how the final SparkEnv instance is assembled and stored.

Big Data · Scala · Spark
0 likes · 13 min read
dbaplus Community
Jul 30, 2019 · Big Data

Spark vs Flink: Which Real‑Time Engine Should You Choose for Kafka Streams?

With the surge in real‑time data from sensors and devices, choosing the right streaming engine is critical; this article compares Apache Spark and Apache Flink—examining their architectures, micro‑batch vs continuous processing, strengths, limitations, and use‑case suitability for Kafka‑driven pipelines.

Big Data · Flink · Kafka
0 likes · 14 min read
dbaplus Community
Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · ETL · Hadoop
0 likes · 15 min read
Tencent Cloud Developer
Jul 24, 2019 · Big Data

Implementing Custom Data Sources in Spark: TGSpark Data Source V2 Practice

The article explains how Tencent’s TGSpark leverages Spark DataSource V2 to create a custom source for TGMars storage, detailing shard‑aware design, push‑down of columns and filters, columnar batch loading, partition‑location reporting, and experimental results that show reduced shuffles and improved local computation when executor placement matches storage nodes.

Big Data · Column Pushdown · Custom Data Source
0 likes · 10 min read
Tencent Cloud Developer
Jul 18, 2019 · Big Data

Tencent iData Analysis Center: Why We Chose Spark as Our Computing Platform

Tencent’s iData analysis center selected Spark as its new computing platform because, unlike ElasticSearch, TiDB, and other MPP solutions, Spark offers iterative processing, shuffle support, robust SQL and DAG scheduling, and flexible SMP‑style data exchange, enabling efficient OLAP on billions of game‑user records.

Big Data · Data Platform · MPP
0 likes · 13 min read
dbaplus Community
Jul 10, 2019 · Big Data

How Kuaishou Scales SQL on Hadoop: Architecture, Optimizations, and Lessons Learned

This article explains the SQL‑on‑Hadoop ecosystem—including Hive, Spark, SparkSQL, Presto and other solutions—then details Kuaishou's large‑scale platform architecture, performance bottlenecks, routing logic, high‑availability mechanisms, and a series of concrete optimizations that improve query speed, resource utilization, and operational stability.

Hive · SQL on Hadoop · Spark
0 likes · 19 min read
58 Tech
Jul 2, 2019 · Artificial Intelligence

Magic Mirror: A Visual Data‑Intelligence Platform for Low‑Code Machine Learning

Magic Mirror is a big‑data‑based visual analytics platform that lowers the barrier of machine‑learning for non‑experts while accelerating expert workflows through visual UI, modular algorithms, distributed feature generation, and automated binary‑classification modeling.

Automated Modeling · Big Data · Spark
0 likes · 9 min read
Big Data Technology & Architecture
Jun 22, 2019 · Backend Development

Understanding Back Pressure in Flink and Its Implementation

The article explains what back pressure is in Flink streaming jobs, why it occurs when data generation outpaces downstream consumption, how Flink monitors it via stack‑trace sampling, configurable parameters, Web UI visualization, and compares the approach with Spark Streaming's back pressure mechanism.

Flink · Spark · data pipelines
0 likes · 5 min read
Big Data Technology & Architecture
Jun 19, 2019 · Big Data

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

This article explains the design and implementation of Spark Structured Streaming's StateStore module, covering its distributed architecture, state sharding, versioning, batch read/write, migration, update/query APIs, maintenance compaction, and fault‑tolerance mechanisms that enable incremental continuous queries with exactly‑once guarantees.

Big Data · Spark · StateStore
0 likes · 8 min read
Big Data Technology & Architecture
Jun 9, 2019 · Big Data

Optimizing Spark Shuffle: Can Fetch, Efficient Fetch, and Reliable Fetch

This article analyzes three Spark shuffle bottlenecks—oversized partitions that exceed Netty's 2 GB limit, excessive retry latency caused by dead executors, and insufficient data‑corruption checks—and presents concrete configuration changes, new block identifiers, executor‑liveness checks, and CRC‑32 verification to improve fetchability, efficiency, and reliability at scale.

Big Data · Shuffle · Spark
0 likes · 18 min read
Big Data Technology & Architecture
Jun 5, 2019 · Big Data

Real-Time Advertising Click Counting with Spark Structured Streaming and Redis Streams

This article presents a complete solution for real‑time advertising click counting using Spark Structured Streaming combined with Redis Streams, detailing the business scenario, data flow, input/output formats, and step‑by‑step implementation including data extraction, processing, storage, and query via Spark‑SQL.

Big Data · Redis Stream · Scala
0 likes · 11 min read
DataFunTalk
Jun 3, 2019 · Big Data

Choosing a Real-Time Computing Engine Based on Kafka: Spark vs Flink

This article examines the need for real‑time computation, explains streaming versus real‑time concepts, and compares Apache Spark and Apache Flink—covering their architectures, micro‑batch and continuous processing, advantages, limitations, windowing, event‑time handling, and watermarks—to guide engine selection for Kafka‑driven workloads.

Flink · Kafka · Spark
0 likes · 15 min read
Big Data Technology & Architecture
Jun 1, 2019 · Big Data

Understanding Spark Executor Memory Management: On‑Heap, Off‑Heap, and Unified Memory

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap memory planning, static and unified memory managers, storage and execution memory allocation, RDD persistence, eviction policies, and shuffle memory usage, providing practical guidance for performance tuning.

Big Data · Executor · Memory Management
0 likes · 23 min read
Big Data Technology & Architecture
May 30, 2019 · Big Data

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, and detection of data skew in Spark jobs and presents a comprehensive set of solutions (Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies) to mitigate the skew and improve performance.

Big Data · Data Skew · Shuffle
0 likes · 31 min read
Big Data Technology Architecture
May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · Kafka
0 likes · 10 min read
Big Data Technology & Architecture
May 14, 2019 · Fundamentals

Zero‑Copy Data Transfer: Principles, Mechanisms, and Applications in Kafka and Spark

This article explains the traditional copy‑based data transmission process, introduces the zero‑copy technique—including basic sendfile(), scatter/gather DMA and mmap support—shows how it reduces context switches and copies, and demonstrates its practical use in Kafka and Spark for high‑throughput workloads.

Data Transfer · Java NIO · Spark
0 likes · 12 min read
Youzan Coder
Apr 12, 2019 · Industry Insights

How Youzan Scaled Its Log Platform to Handle Billions of Daily Logs

This article details Youzan's evolution from a simple Flume‑based log collector to a multi‑tenant, Kafka‑buffered, Spark‑processed, HBase‑backed logging architecture that now handles hundreds of billions of log entries per day, highlighting challenges, design decisions, and future improvements.

Distributed Systems · Elasticsearch · HBase
0 likes · 10 min read
Alibaba Cloud Native
Apr 9, 2019 · Big Data

How Compute‑Storage Separation Cuts Costs and Boosts Performance for Big Data on Kubernetes

This article examines the challenges of big‑data storage in containerized environments, compares compute‑storage‑separated architectures with traditional setups, presents performance and cost benchmarks of Alibaba Cloud ECS instances, and outlines practical storage options such as OSS, NAS, and DFS for Spark workloads on Kubernetes.

Cloud Native · Compute-Storage Separation · Kubernetes
0 likes · 14 min read
Architecture Digest
Mar 28, 2019 · Backend Development

Aloha: A Scala‑Based Distributed Task Scheduling and Management Framework

Aloha is a Scala‑implemented distributed scheduling framework built on Spark that provides extensible plugins, high‑availability master/worker architecture, REST submission, custom application interfaces, event listeners, and a Scala‑based RPC system for managing long‑running tasks such as Spark, Flink, and ETL jobs.

Backend · Distributed Scheduling · RPC
0 likes · 17 min read
dbaplus Community
Mar 21, 2019 · Big Data

How Real-Time Data Platforms Evolve: From Storm to Flink and Kubernetes

This article summarizes Wang Xinchun's 2018 DAMS China Data Asset Management Summit talk, detailing the current state, core services, responsibilities, evolution, architecture, challenges, and future directions of a large‑scale real‑time data platform built on Storm, Spark, Flink, and Kubernetes, including a unified data management approach.

Data Platform · Flink · Kubernetes
0 likes · 22 min read
58 Tech
Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

JOIN · Shuffle · Spark
0 likes · 6 min read
Youzan Coder
Mar 8, 2019 · Big Data

Why Spark Shuffle Often Runs Out of Memory and How to Fix It

This article examines Spark's memory management and the shuffle process, identifies the components that consume the most memory during shuffle write and read, analyzes common OOM scenarios such as task concurrency and data skew, and offers configuration tips to prevent out‑of‑memory failures.

MemoryManagement · OutOfMemory · Shuffle
0 likes · 14 min read
dbaplus Community
Mar 5, 2019 · Databases

How HTAP and DRDS HTAP Enable Real‑Time OLTP/OLAP Integration

This article explains the concepts of OLTP, OLAP and HTAP, describes the DRDS HTAP architecture—including its engine and storage layers, Fireworks Spark‑based engine, optimizer stages, and streaming capabilities—and demonstrates cross‑database MPP queries and streaming joins while outlining suitable use cases and limitations.

DRDS · Database Architecture · HTAP
0 likes · 17 min read
Beike Product & Technology
Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hive
0 likes · 13 min read
Big Data Technology & Architecture
Feb 15, 2019 · Big Data

Big Data Mastery Roadmap

This article outlines a comprehensive series of over 500 planned tutorials covering Java advanced features, distributed theory, Hadoop, Spark, Flink, and various big‑data storage and processing technologies, designed to guide engineers transitioning into big‑data development from fundamentals to expert level.

Distributed Systems · Flink · Hadoop
0 likes · 4 min read
Sohu Tech Products
Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Shuffle · Shuffle Writer
0 likes · 13 min read
JD Tech
Jan 18, 2019 · Big Data

Technical Overview of JD's New Business Intelligence Platform: Offline OLAP, Real‑time Data, and Visualization Solutions

The article details JD's 2018 upgrade of its Business Intelligence platform, describing how unified offline OLAP with ClickHouse, Spark, and Scala, timeliness optimizations, and a React‑based visualization component library together improve data consistency, performance, and user experience for merchants.

ClickHouse · Data visualization · OLAP
0 likes · 7 min read
JD Tech
Jan 11, 2019 · Big Data

Spark Memory Management and Tuning Practices for Large-Scale Billing Systems

This article explains how Spark's memory management models and configuration parameters can be tuned to handle massive billing data efficiently, covering StaticMemoryManager vs UnifiedMemoryManager, storage and shuffle memory fractions, common OOM and file‑not‑found issues, and practical performance‑optimisation tips.

Memory Management · Spark · distributed computing
0 likes · 9 min read
Big Data Technology & Architecture
Jan 2, 2019 · Big Data

Optimizing Spark Direct Kafka Consumption: Subpartition Concurrency and Repartition Strategies

To address the long processing time caused by uneven Spark partitions when reading Kafka via the Direct approach, this article explains the SPARK‑22056 solution that modifies KafkaRDD.getPartitions to support a configurable 'topic.partition.subconcurrency' parameter, discusses its trade‑offs, and presents alternative repartition and multithreading techniques.

Big Data · Partitioning · Scala
0 likes · 6 min read