Tagged articles

Spark

623 articles · Page 4 of 7

Big Data Technology & Architecture

Oct 13, 2021 · Big Data

God of Big Data: A Comprehensive Learning Path and Systematic Resources for Big Data Engineers

Launched in 2019, the "God of Big Data" project offers a detailed learning roadmap and systematic column resources covering Hadoop, Spark, Kafka, and more. It invites engineers moving from backend to big‑data development to follow its curated articles, GitHub code, and CSDN tutorials.

Data Engineering · Hadoop · Spark

0 likes · 6 min read

Java High-Performance Architecture

Oct 12, 2021 · Big Data

Unpacking the Core Technologies Behind Modern Big Data Platforms

This article breaks down a typical big data platform architecture into its four layers—data collection, storage and analysis, sharing, and real‑time computation—detailing the essential tools such as Flume, HDFS, Hive, Spark, DataX, and task scheduling systems that enable scalable, low‑latency data processing and delivery.

Big Data · Data Architecture · DataX

0 likes · 8 min read

Architecture Digest

Oct 11, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

This article explains the typical architecture of a big‑data platform, detailing its four core layers—data collection, storage & analysis, data sharing, and application—and describing the key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and task scheduling components.

Big Data · Data Architecture · DataX

0 likes · 8 min read

Big Data Technology Architecture

Sep 28, 2021 · Big Data

Integrating Apache Kyuubi with CDH 6 and Spark 3: Deployment, Configuration, and Performance Tuning

This guide explains how to deploy Apache Kyuubi on a CDH 6 cluster, replace HiveServer2 with Kyuubi, integrate Spark 3, apply necessary patches, configure environment and Spark settings, and optimize engine sharing for various workloads, providing complete code snippets and step‑by‑step instructions.

CDH · HiveServer2 · Kyuubi

0 likes · 19 min read

Sep 24, 2021 · Big Data

How Didi Scaled Real‑Time Funnel Analysis with StarRocks: Architecture, Design, and Performance Tips

Didi's data architecture team migrated high‑volume, real‑time funnel analysis from ClickHouse to StarRocks, built a multi‑layer pipeline with Kafka, Flink/Spark, Hive, and materialized views, and achieved sub‑3‑second query times on billions of rows, while outlining future enhancements.

Big Data · Funnel Analysis · Hive

0 likes · 14 min read

Big Data Technology & Architecture

Sep 23, 2021 · Big Data

Handling Non‑Splittable gzip Files in Hadoop and Spark: MapReduce Splits and Performance Considerations

This article explains how a 10 GB gzip file is stored and processed on HDFS, details the MapReduce split calculation using GzipCodec, and discusses why Spark reads such non‑splittable files with a single task, recommending file splitting or format conversion for better performance.

Data Splits · Hadoop · MapReduce

0 likes · 8 min read
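
The single-task behaviour summarized in the gzip entry above can be illustrated without a cluster. The sketch below is not taken from the article: it uses Python's standard `gzip` module, and the payload, `try_read_from`, and `split_into_parts` are hypothetical names chosen for illustration. It shows that a gzip stream can only be decoded from its first byte, and that writing several independent gzip files restores the ability to read the data in pieces.

```python
import gzip
import zlib

# Hypothetical payload standing in for a large log file.
lines = [f"record-{i}\n".encode() for i in range(10_000)]
payload = b"".join(lines)

# One gzip stream: a reader has to start at byte 0.
single = gzip.compress(payload)


def try_read_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decompression can start at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


# Starting in the middle of the stream fails, which is why a framework
# cannot hand different byte ranges of one .gz file to different tasks.
can_start_mid = try_read_from(single, len(single) // 2)


def split_into_parts(rows, parts):
    """Write the rows as several independent gzip blobs (one per 'file')."""
    size = -(-len(rows) // parts)  # ceiling division
    return [
        gzip.compress(b"".join(rows[i:i + size]))
        for i in range(0, len(rows), size)
    ]


# Each part is a complete gzip stream, so each can be read on its own.
parts = split_into_parts(lines, 4)
restored = b"".join(gzip.decompress(p) for p in parts)
```

Each element of `parts` could be handled by a separate task, which is the idea behind the file-splitting recommendation; converting to a splittable format is the other route the summary mentions.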

Big Data Technology & Architecture

Sep 13, 2021 · Big Data

Understanding Bytecode, Code Generation, Serialization, and Data Processing Techniques in Spark and Flink

This article explains how bytecode and code‑generation improve Spark SQL performance, compares Java I/O and MapReduce InputFormats, reviews serialization choices in Spark and Flink, and describes reflection‑based DataFrame creation, storage‑memory eviction, fail‑fast design, and ConcurrentHashMap usage in big‑data frameworks.

Flink · Java · Spark

0 likes · 11 min read

Ctrip Technology

Sep 9, 2021 · Big Data

Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Big Data · Hive · JanusGraph

0 likes · 16 min read

IT Architects Alliance

Sep 5, 2021 · Big Data

Big Data Platform Architecture: Core Layers, Technologies, and Practices

This article outlines a typical big data platform architecture, detailing its core layers—data acquisition, storage and analysis, sharing, application, real‑time computation, and task scheduling—while introducing key technologies such as Flume, HDFS, Hive, Spark, DataX, and monitoring considerations.

Big Data · Data Platform · Hadoop

0 likes · 9 min read

Architects' Tech Alliance

Sep 2, 2021 · Big Data

Core Technologies and Architecture of a Big Data Platform

The article outlines a typical big data platform architecture, detailing its core layers—data collection, storage and analysis, sharing, application, real-time computation, and task scheduling—while describing key technologies such as Flume, DataX, HDFS, Hive, Spark, Spark Streaming, and Redis.

Data Architecture · Data Integration · Hadoop

0 likes · 9 min read

Qunar Tech Salon

Aug 26, 2021 · Big Data

Comprehensive Introduction to Apache Spark: History, Core Concepts, Architecture, and Performance Optimization

This article provides a thorough overview of Apache Spark, covering its origins, comparison with MapReduce, core concepts such as RDD, DAG, Jobs, Stages, and Tasks, the submission process, Web UI, and detailed performance tuning techniques including data skew mitigation.

Big Data · Data Skew · MapReduce

0 likes · 15 min read

Big Data Technology & Architecture

Aug 26, 2021 · Big Data

Interview Questions and Reflections on Java, JVM, Spark, and System Design

This article records an interview experience, presenting core questions on Java memory allocation, JVM parameters, Spark and MapReduce execution models, data skew causes and mitigation, real‑time framework scheduling, and system design for massive task scheduling, followed by analysis and learning recommendations.

JVM · Java · Spark

0 likes · 5 min read

Big Data Technology Architecture

Aug 24, 2021 · Big Data

Comprehensive Guide to Spark Performance Optimization, Data Skew Mitigation, and Troubleshooting

This article presents a detailed collection of Spark performance‑tuning techniques—including submit‑script parameters, RDD and operator optimizations, parallelism and memory settings, broadcast variables, Kryo serialization, locality wait adjustments—as well as systematic methods for detecting and resolving data skew and common runtime issues such as shuffle failures, serialization errors, and JVM memory problems.

Data Skew · JVM Tuning · Shuffle

0 likes · 21 min read

Big Data Technology Architecture

Aug 12, 2021 · Big Data

Enterprise Data Lake Architecture, Delta Lake Core Capabilities, and Stream‑Batch Integrated Analytics on Alibaba Cloud

This article explains the rapid growth of data, the limitations of traditional warehouses, and how a cloud‑based data lake built on object storage with Delta Lake format provides low‑cost, flexible, and ACID‑compliant analytics, followed by a step‑by‑step guide to ingest, manage, and analyze data using Alibaba Cloud DLF and Databricks DDI with Spark streaming and batch jobs.

Alibaba Cloud · Delta Lake · Spark

0 likes · 19 min read

Big Data Technology & Architecture

Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data · Flink · Hadoop

0 likes · 22 min read

Big Data Technology Architecture

Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase

0 likes · 9 min read

TAL Education Technology

Jul 22, 2021 · Big Data

Real-Time Monitoring Dashboard Solution in Future Cloud – Architecture, Technical Challenges, and Product Insights

This article presents the Future Cloud Business Monitoring real-time dashboard solution, detailing its technical architecture, key challenges in massive log processing, storage choices, product considerations, experience sharing, future plans, and concrete case studies such as live classroom monitoring.

ClickHouse · Spark · dashboard

0 likes · 15 min read

Big Data Technology & Architecture

Jul 15, 2021 · Big Data

Understanding Hive Architecture, Execution Flow, and the Shift to Tez and Spark

This article explains Hive's core components, execution architecture, how HiveQL is transformed into MapReduce jobs, the advantages of Tez over MapReduce in Hive 3.0+, and the integration of Spark with Hive for modern big‑data processing.

Data Warehouse · Hive · MapReduce

0 likes · 9 min read

Big Data Technology Architecture

Jul 15, 2021 · Big Data

Resolving Spark Task Not Serializable Errors: Causes, Code Examples, and Best Practices

This article analyzes why Spark tasks fail with a "Task not serializable" exception when closures reference class members, demonstrates the issue with Scala code examples, and provides practical solutions such as using @transient annotations, moving functions to objects, and ensuring proper class serialization.

Scala · Spark · Task Not Serializable

0 likes · 12 min read

Big Data Technology & Architecture

Jul 10, 2021 · Big Data

Comprehensive Big Data Learning Path and Interview Knowledge Map

This extensive guide outlines a modern big‑data learning roadmap, covering essential programming languages, Linux, databases, distributed system theory, networking, offline and real‑time computation, message queues, data warehouses, algorithms, backend skills, interview preparation, and practical advice for building a personal knowledge system.

Flink · Hadoop · Spark

0 likes · 24 min read

TAL Education Technology

Jul 1, 2021 · Big Data

Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi‑stage optimization of an A/B testing metric system, describing its product architecture, Spark‑based computation engine, ClickHouse OLAP layer, cumulative calculation improvements, and batch processing techniques that reduced processing time from hours to a few minutes for hundreds of experiments and metrics.

A/B testing · Big Data · ClickHouse

0 likes · 8 min read

Big Data Technology Architecture

Jun 29, 2021 · Big Data

Implementing and Registering a Custom SparkListener in Apache Spark

This article explains how to create a custom SparkListener in Apache Spark, provides Scala code examples for the listener and a main application, and details two registration approaches—via Spark configuration or SparkContext—along with a comprehensive list of listener event methods.

Apache Spark · Scala · Spark

0 likes · 5 min read

dbaplus Community

Jun 23, 2021 · Big Data

How Ctrip Finance Built a Real‑Time Binlog‑Based Data Lake with MySQL‑Hive Sync

This article details Ctrip Finance's end‑to‑end data‑foundation architecture that uses MySQL binlog collection via Canal, Kafka streaming, Spark‑Streaming persistence to HDFS, and a merge process to produce timely MySQL‑Hive snapshots, addressing performance, consistency, and delete‑handling challenges.

Binlog · Hive · Real-time Data

0 likes · 17 min read

Big Data Technology & Architecture

Jun 21, 2021 · Big Data

Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases

This article provides an in‑depth overview of Apache Kylin, covering its history, mission, core MOLAP principles, technical architecture, step‑by‑step installation (Docker and Hadoop), performance tuning, advanced cube settings, and detailed case studies from major companies such as Baidu, Lianjia, and Didi.

Apache Kylin · Cube · Docker

0 likes · 53 min read

Big Data Technology & Architecture

Jun 16, 2021 · Big Data

Practical Experience and Optimizations of Apache Iceberg in Tencent’s Big Data Ecosystem

This article reviews the advantages of Apache Iceberg for data lake storage, details Tencent’s custom optimizations and integration with Flink and Spark, and shares multiple real‑world implementations that demonstrate how Iceberg improves data consistency, reduces small‑file overhead, and enables near‑real‑time analytics in large‑scale big‑data environments.

Apache Iceberg · Data Lake · Flink

0 likes · 18 min read

Jun 11, 2021 · Big Data

Comprehensive Guide to Fast and Stable Hive‑to‑HBase Data Transfer Using Bulkload, MapReduce, and Spark

This article explains how to efficiently move large volumes of data from Hive to HBase by leveraging HBase's bulkload mechanism, detailing the original MapReduce workflow, its performance bottlenecks, and a rewritten Spark‑based solution that simplifies ETL, improves partitioning, and achieves several‑fold speedup.

Big Data · ETL · HBase

0 likes · 17 min read

Big Data Technology Architecture

Jun 10, 2021 · Big Data

Understanding Apache Iceberg: Design, Architecture, and Its Application at NetEase Cloud Music

This article explains Apache Iceberg’s table‑format design, compares it with Hive’s limitations, details its snapshot‑based architecture and metadata handling, and describes how NetEase Cloud Music leveraged Iceberg to dramatically improve large‑scale log processing performance and stability.

Apache Iceberg · Spark · metadata management

0 likes · 12 min read

Big Data Technology & Architecture

Jun 4, 2021 · Big Data

Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

Optimization · RDD · Shuffle

0 likes · 21 min read

dbaplus Community

Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, bringing Spark's share of tasks to 85%, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively. The systematic migration process addressed syntax, UDF, performance, and functional differences between the two engines.

Big Data · Hive · Performance Optimization

0 likes · 20 min read

Qunar Tech Salon

Jun 1, 2021 · Big Data

Integrating TensorFlow for Java with Spark‑Scala for Distributed Machine Learning Prediction

This article shares practical experience of building a high‑performance distributed prediction service by combining TensorFlow for Java with Spark‑Scala, covering framework selection, performance comparison, model training, loading, inference, deployment, and optimization techniques for large‑scale data processing.

Big Data · Java · Performance Optimization

0 likes · 16 min read

NetEase Game Operations Platform

May 22, 2021 · Big Data

Comprehensive Overview and Source Code Analysis of NetEase Spark Kyuubi

This article systematically introduces NetEase Kyuubi, an open‑source high‑performance JDBC and SQL execution engine built on Apache Spark, covering its background, core architecture, service discovery, session and operation management, startup processes, and key source‑code implementations with detailed code examples.

Apache Thrift · Big Data · Distributed Computing

0 likes · 47 min read

JD Retail Technology

May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data · ETL · High Availability

0 likes · 16 min read

Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi · Big Data · CDC

0 likes · 21 min read

Apr 26, 2021 · Big Data

Detailed Design and Practical Application of Apache Iceberg at NetEase Cloud Music

This article explains the motivations behind Apache Iceberg, its design principles such as snapshot and MVCC, compares it with Hive, and describes how NetEase Cloud Music adopted Iceberg to improve metadata handling, query performance, and operational stability for massive daily log data.

Apache Iceberg · Big Data · Data Lake

0 likes · 13 min read

Big Data Technology & Architecture

Apr 16, 2021 · Big Data

Spark Job Execution Architecture: From Submission to Shuffle and Task Processing

This article explains how Spark coordinates master, worker, driver, and executor components to generate, submit, and run jobs, detailing the creation of logical and physical execution graphs, task allocation, result handling, and the shuffle read process with code examples and diagrams.

Job Execution · Shuffle · Spark

0 likes · 14 min read

dbaplus Community

Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data · Error handling · Shuffle

0 likes · 25 min read

Big Data Technology & Architecture

Apr 14, 2021 · Big Data

Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce

This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.

MapReduce · MemoryManagement · RDD

0 likes · 21 min read

Big Data Technology & Architecture

Apr 13, 2021 · Big Data

Spark Job Generation and Execution: From Logical DAG to Physical Stages and Tasks

This article explains how Spark transforms a logical execution graph into a physical job by partitioning stages, applying pipeline concepts, and generating tasks, while illustrating the process with detailed code examples and the internal workflow of job submission.

Job Scheduling · RDD · Scala

0 likes · 15 min read

Big Data Technology & Architecture

Apr 11, 2021 · Big Data

Understanding Spark RDD Logical Execution Graph and Dependency Types

This article explains how Spark builds the logical execution graph for RDDs, describes the four-step job processing pipeline, details the various dependency types such as NarrowDependency and ShuffleDependency, and reviews common transformations and their data‑flow characteristics.

RDD · Shuffle · Spark

0 likes · 19 min read

Big Data Technology & Architecture

Apr 10, 2021 · Big Data

Understanding Spark Cache and Checkpoint Mechanisms

This article explains Spark's cache and checkpoint mechanisms, detailing when to use each, how they are implemented internally, how cached and checkpointed RDDs are stored and retrieved, and the differences between caching, persisting, and checkpointing for reliable big‑data processing.

Cache · Checkpoint · RDD

0 likes · 13 min read

iQIYI Technical Product Team

Apr 9, 2021 · Big Data

Real-Time Data Warehouse at iQIYI Video Production Using Spark and ClickHouse

To meet iQIYI video production's requirements for thousands of QPS, petabyte-scale and frequently updated data, and large-table joins, the team built a real-time warehouse on Spark and ClickHouse that streams changes from Kafka, joins HBase dimensions, and writes to ClickHouse, cutting report development time from days to hours while supporting both offline and real-time analytics.

ClickHouse · Data Engineering · HBase

0 likes · 12 min read

Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data · Data Skew · Performance Tuning

0 likes · 47 min read
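
Of the skew remedies listed in the entry above, two-stage aggregation is the easiest to show in isolation. The following is a minimal sketch in plain Python rather than Spark code, and it is not taken from the article; `two_stage_count`, `salt_buckets`, and the sample keys are hypothetical. Stage one attaches a random salt to each key so that a hot key is spread across several groups and pre-aggregated there; stage two drops the salt and merges the partial results.

```python
import random
from collections import Counter


def two_stage_count(keys, salt_buckets=8, seed=0):
    """Count keys in two stages so no single group holds all of a hot key."""
    rng = random.Random(seed)

    # Stage 1: salt each key, then aggregate per (salt, key) group.
    # In a distributed job these groups would land on different tasks.
    partial = Counter()
    for key in keys:
        partial[(rng.randrange(salt_buckets), key)] += 1

    # Stage 2: remove the salt and merge the much smaller partial counts.
    final = Counter()
    for (_salt, key), count in partial.items():
        final[key] += count
    return partial, final


# Hypothetical skewed input: one key dominates.
keys = ["hot"] * 1000 + ["a", "b", "c"] * 5
partial, final = two_stage_count(keys)
```

The final counts match a direct count, while the largest stage-one group for the hot key holds only a fraction of its records. The approach applies to aggregations that can be merged from partial results, such as sums and counts.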

Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · RDD

0 likes · 33 min read

Big Data Technology Architecture

Apr 1, 2021 · Big Data

Spark Adaptive Execution: Dynamic Shuffle Partition, Broadcast Join, and Skew Handling

The article explains the limitations of static shuffle partitions, execution‑plan estimation, and data skew in Spark SQL, and describes how Spark Adaptive Execution can automatically adjust shuffle partition numbers, switch join strategies, and mitigate skew through configurable parameters and code examples.

Adaptive Execution · Broadcast Join · Data Skew

0 likes · 11 min read

Big Data Technology & Architecture

Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR, detailing the motivations, a comparative analysis of Delta, Hudi, and Iceberg, the implementation architecture, the issues encountered, such as data skew and schema evolution, and the solutions adopted to improve performance and reliability.

Data Lake · Data Skew · Delta Lake

0 likes · 13 min read

Big Data Technology Architecture

Mar 13, 2021 · Big Data

Understanding mapPartitions vs map in Apache Spark: Performance, Pitfalls, and Proper Usage

This article examines why many developers favor Spark's mapPartitions over map, analyzes the underlying source code, highlights common pitfalls such as complexity and OOM risks, and provides practical guidelines and code examples for correctly using mapPartitions in both simple and advanced scenarios.

Iterator · Scala · Spark

0 likes · 9 min read

Big Data Technology Architecture

Mar 10, 2021 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Solutions, and Shuffle Tuning

This guide presents a complete Spark performance optimization handbook covering development‑time best practices, resource‑parameter tuning, detailed data‑skew detection and mitigation techniques, advanced shuffle‑engine configurations, and practical code examples to help engineers build faster, more reliable Spark jobs.

Data Skew · Resource Tuning · Shuffle

0 likes · 69 min read

Big Data Technology Architecture

Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta Lake · Hive · Spark

0 likes · 14 min read

Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data · Performance Optimization · Resource Management

0 likes · 27 min read

Feb 26, 2021 · Big Data

Migrating Spark Offline Computing to Kubernetes: Architecture, Optimizations, and Lessons Learned

Youzan migrated its large‑scale offline Spark workloads from YARN to a cloud‑native Kubernetes architecture, separating storage and compute with Ceph FS, adding dynamic executor allocation and remote shuffle services, and applying numerous Spark and deployment tweaks that yielded elastic scaling, higher resource utilization, reduced costs, and valuable operational lessons.

Cloud Native · Performance Optimization · Spark

0 likes · 24 min read

Feb 8, 2021 · Big Data

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

This article presents JD's self‑developed Remote Shuffle Service for Spark, detailing its architecture, goals, implementation details, performance benchmarks, and real‑world production case studies that demonstrate its impact on shuffle efficiency and system stability in large‑scale data processing.

Etcd · Remote Shuffle Service · Shuffle Optimization

0 likes · 17 min read

Big Data Technology & Architecture

Feb 2, 2021 · Big Data

An Introduction to Apache Iceberg: Features, Spark & Flink Integration, and Real‑World Use Cases

This article provides a comprehensive overview of Apache Iceberg, covering its origins, key features, practical Spark and Flink code examples, notable deployments at Alibaba and Tencent, and its future role as a universal table format for big‑data analytics.

Apache Iceberg · Data Lake · Flink

0 likes · 9 min read

Big Data Technology & Architecture

Jan 24, 2021 · Big Data

Comprehensive Interview Preparation Guide and Common Questions for Big Data Technologies

This article shares a non‑CS graduate's interview experience, study methods, and a detailed list of common interview questions covering Java fundamentals, data‑warehouse concepts, Spark, Kafka, Zookeeper, HBase, and Elasticsearch, along with personal reflections on advanced interview expectations.

Data Warehousing · Spark

0 likes · 6 min read

JD Retail Technology

Jan 19, 2021 · Big Data

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

Cloud Native · Remote Shuffle Service · Shuffle Optimization

0 likes · 17 min read


Big Data Technology & Architecture

Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data · Data Platform · Hadoop

0 likes · 14 min read


Jan 15, 2021 · Big Data

Optimizing Apache Kylin for Meituan's Sales OLAP: From MapReduce to Spark and Resource Tuning

This article presents a detailed case study of how Meituan's in‑store dining sales team identified severe efficiency issues in their Apache Kylin‑based OLAP system, dissected the construction process, and applied a step‑by‑step optimization roadmap—including engine migration, dimension pruning, resource configuration, and Spark‑based layered building—to boost query performance and achieve near‑perfect SLA.

Apache Kylin · Big Data · Meituan

0 likes · 16 min read


Big Data Technology & Architecture

Jan 13, 2021 · Big Data

My Month-Long Alibaba Mama Interview Experience: Spark, Kafka, and Big Data Technical Rounds

The author recounts a month‑long, four‑round technical interview at Alibaba Mama, detailing phone, on‑site, and HR stages, with deep discussions on Spark, Kafka, Hadoop, platform design, and backend fundamentals, while sharing resource links for big‑data interview preparation.

Alibaba · Data Engineering · Hadoop

0 likes · 7 min read


Big Data Technology & Architecture

Jan 12, 2021 · Big Data

Design and Implementation of Hourly Feature Coverage Metrics Using Spark and Elasticsearch

This article describes a high‑throughput solution for calculating hourly feature coverage, positive‑sample ratio and negative‑sample ratio on billions of records by streaming data with Spark, indexing per experiment‑hour in Elasticsearch, and executing parallel aggregation tasks with Java code.

Elasticsearch · Java · Spark

0 likes · 7 min read


Big Data Technology & Architecture

Jan 5, 2021 · Big Data

Setting Up Apache Spark Standalone with Docker and Using Apache Zeppelin for Data Processing

This guide demonstrates how to build a Docker‑based Spark standalone environment, configure Apache Zeppelin to connect to it, and perform data analysis on local CSV files, HDFS, and streaming sources such as Twitter and Kafka, with complete code examples.

Apache Zeppelin · Docker · Scala

0 likes · 10 min read


Big Data Technology & Architecture

Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster and explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Parquet · Performance Optimization · Scheduler

0 likes · 13 min read


Big Data Technology & Architecture

Dec 29, 2020 · Big Data

Spark Performance Tuning: Common Parameters, Programming Tips, Shuffle and Join Optimization

This article provides a comprehensive guide to Spark performance tuning, covering essential configuration parameters, best‑practice programming recommendations, detailed shuffle mechanics, join optimization strategies, and common error troubleshooting for big‑data workloads.

JOIN · Optimization · Performance Tuning

0 likes · 20 min read


Big Data Technology & Architecture

Dec 27, 2020 · Big Data

Understanding and Solving the Small File Problem in Big Data Systems

This article examines the pervasive small‑file issue in big‑data environments, explains its impact on storage and processing performance, and presents a comprehensive set of solutions—including file merging, Hadoop archives, SequenceFiles, HBase, CombineFileInputFormat, and Spark/Flink strategies—to mitigate metadata overhead and improve I/O efficiency.

Flink · Hadoop · NameNode

0 likes · 41 min read


JD Retail Technology

Dec 24, 2020 · Databases

Applying ClickHouse for Offline and Real‑Time Data Analysis in JD's Golden Eye Business

This article details JD's Golden Eye business's adoption of ClickHouse for offline and real‑time traffic data analysis, covering system architecture, data ingestion pipelines, high‑availability design, monitoring, performance optimizations, and practical trade‑offs, offering insights for large‑scale analytical database deployments.

ClickHouse · Data Warehouse · OLAP

0 likes · 17 min read


Big Data Technology & Architecture

Dec 20, 2020 · Big Data

Getting Started with Apache Zeppelin: Installation, Core Features, and Integration with JDBC, Spark, and Flink

This tutorial introduces Apache Zeppelin, explains REPL and Jupyter concepts, outlines its core features and project structure, and provides step‑by‑step instructions for installing Zeppelin, creating notebooks, and connecting to databases, Spark, and Flink with practical code examples.

Apache Zeppelin · Flink · Installation

0 likes · 11 min read


Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Distributed Computing · Hadoop · Spark

0 likes · 13 min read


Big Data Technology & Architecture

Dec 6, 2020 · Big Data

Integrating Spark with MongoDB: Architecture, Use Cases, and Code Samples

This article explains how Spark can be combined with MongoDB for large‑scale data processing, covering Spark fundamentals, comparisons with HDFS, practical integration patterns, performance benefits, real‑world case studies, and detailed code examples for deployment and analytics.

Data Integration · Distributed Computing · MongoDB

0 likes · 18 min read


Big Data Technology Architecture

Nov 25, 2020 · Big Data

Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

This article explains the concept and benefits of data lakes, outlines the storage and acceleration challenges they pose, presents an ideal checklist for selecting a data lake solution, and evaluates Alibaba Cloud's JindoFS against that checklist, highlighting its capabilities for big‑data and AI workloads.

Alibaba Cloud · Big Data · Data Lake

0 likes · 9 min read


Big Data Technology Architecture

Nov 24, 2020 · Big Data

Using DeltaLake for Industrial Data Platforms: Distributed Stream Processing, Batch‑Stream Fusion, and Transactional Support

This article shares practical experiences of building an industrial data middle‑platform with DeltaLake, covering heterogeneous distributed stream handling, batch‑stream unified analytics, and transactional/algorithm support to improve data timeliness, reliability, and operational efficiency in manufacturing environments.

Batch-Stream Fusion · Big Data · DeltaLake

0 likes · 11 min read


Big Data Technology & Architecture

Nov 21, 2020 · Big Data

Big Data Performance Testing: Objectives, Timing, Steps, Tools, and Optimization

This article outlines the purpose, timing, procedures, tools, and optimization techniques for big data performance testing, providing detailed guidance on test planning, execution, metric collection, and analysis to ensure reliable and efficient big data system deployments.

Big Data · Hadoop · Spark

0 likes · 7 min read


Meituan Technology Team

Nov 19, 2020 · Big Data

Optimizing Apache Kylin for High‑Performance OLAP in Meituan's Sales System

Meituan’s sales system “Qingtian” boosted OLAP performance by migrating Apache Kylin’s build engine from MapReduce to Spark, consolidating Hive files, refining dictionary creation, applying a by‑layer cubing algorithm, and bulk‑loading cuboid files to HBase, cutting resource consumption and halving build time, ultimately reaching a 100% SLA.

Apache Kylin · Big Data · Meituan

0 likes · 15 min read


Big Data Technology & Architecture

Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data · Hadoop · Spark

0 likes · 13 min read


Big Data Technology & Architecture

Nov 16, 2020 · Big Data

Understanding Spark Streaming Backpressure Mechanism and Source Code Analysis

This article explains why Spark Streaming introduced backpressure, how the dynamic rate‑control mechanism works, and provides a detailed walkthrough of the relevant source code, including the RateController class, its registration, and the execution flow that adjusts ingestion rates to match processing capacity.

RateController · RateLimiter · Spark

0 likes · 14 min read


Nov 9, 2020 · Big Data

Trajectory-Based Population Flow Analysis for COVID‑19 Prevention Using HBase and Spark

The article presents a comprehensive big‑data solution that stores massive GPS trajectory records in HBase, processes them with Spark to identify individuals who visited a pandemic source region, and visualizes their spatio‑temporal distribution in target cities to support precise epidemic control measures.

Big Data · COVID-19 · HBase

0 likes · 8 min read


Big Data Technology & Architecture

Nov 9, 2020 · Big Data

Understanding the Actor Model and Akka in Big Data RPC Systems

This article introduces the Actor model, its fundamental rules, and how it underpins Flink and Spark RPC mechanisms, then explains the Akka framework, its actor hierarchy, supervision, lifecycle, and dispatcher, providing a concise foundation for distributed big‑data processing.

Akka · Flink · Spark

0 likes · 10 min read


Big Data Technology & Architecture

Oct 29, 2020 · Fundamentals

Zero-Copy Data Transfer Mechanism: Principles, Implementations, and Applications in Java, Kafka, and Spark

This article explains the zero‑copy data transfer technique, compares it with traditional read/write approaches, shows Java NIO code examples, and discusses its use in high‑performance systems such as Kafka and Spark, highlighting the reductions in context switches and memory copies.

Data Transfer · Java NIO · Spark

0 likes · 16 min read


Big Data Technology & Architecture

Oct 23, 2020 · Big Data

Overview of Real-Time Big Data Processing: Spark Structured Streaming, CarbonData, Flink, and Cloud Stream

This article provides a comprehensive overview of modern real‑time big‑data solutions, detailing Spark Structured Streaming capabilities, CarbonData’s storage architecture, Meituan’s Flink deployments, and Huawei Cloud Stream’s unified streaming service, highlighting their features, challenges, and future directions.

CarbonData · Flink · Spark

0 likes · 17 min read


Big Data Technology & Architecture

Oct 21, 2020 · Big Data

An Introduction to Apache Hudi: Concepts, Design Principles, and Architecture

This article introduces Apache Hudi, explaining its core concepts, design principles, table architecture, write and compaction mechanisms, and the three query modes that enable efficient batch and incremental processing on modern data lakes.

Apache Hudi · Big Data · Data Lake

0 likes · 21 min read


Tencent Cloud Developer

Oct 19, 2020 · Big Data

Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR

The Tencent Cloud EMR case parallelizes Spark’s driver‑side commit, trash, and move phases, which were previously single‑threaded operations that caused costly copy‑on‑rename when writing massive files to object storage. The change yielded a more than tenfold (1,100%) speedup, making object storage a viable alternative to HDFS.

Big Data · Distributed Computing · EMR

0 likes · 8 min read


Alibaba Cloud Developer

Sep 27, 2020 · Big Data

Why Spark on Kubernetes Needs a Remote Shuffle Service—and How It Boosts Performance

This article examines the challenges of running Spark on Kubernetes, introduces the Remote Shuffle Service architecture to overcome shuffle bottlenecks, details EMR on ACK integration, showcases performance gains with Terasort benchmarks, and outlines future cloud‑native big‑data strategies such as mixed‑cluster and serverless deployments.

EMR · Remote Shuffle Service · Spark

0 likes · 13 min read


Sep 25, 2020 · Big Data

Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap

The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.

Big Data · Data Governance · ETL

0 likes · 22 min read


Big Data Technology & Architecture

Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read


Big Data Technology & Architecture

Aug 31, 2020 · Big Data

Integration Methods of Hive and Spark SQL (Potential Interview Topics)

This article provides a comprehensive guide on integrating Hive with Spark SQL, covering Hive‑on‑Spark and Spark‑on‑Hive setups, spark‑shell and spark‑sql usage, HiveServer2 with Beeline, Scala scripts for reading and writing Hive tables, and partition handling for aggregated results.

Big Data · Data Integration · Hive

0 likes · 7 min read


Aug 30, 2020 · Big Data

Large-Scale Recommendation System Feature Engineering and Optimization with Spark and FESQL

This article explains how large-scale recommendation systems rely on efficient feature engineering, describes the three-layer architecture (offline, stream, online), and details how Spark SQL and the LLVM‑optimized FESQL engine improve performance and ensure offline‑online feature consistency.

Big Data · FESQL · LLVM

0 likes · 13 min read


Big Data Technology & Architecture

Aug 23, 2020 · Big Data

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Apache Hudi · Big Data · Data Lake

0 likes · 18 min read


Big Data Technology & Architecture

Aug 22, 2020 · Big Data

Integrating Kerberos with Spark on CDH: Configuration, Deployment, and Troubleshooting Guide

This guide explains how to prepare a CDH‑based Spark environment for Kerberos authentication, covering prerequisite knowledge, classpath adjustments, HBase configuration files, Spark‑Env settings, user permission grants, Spark‑Submit execution, and common troubleshooting steps.

Big Data · CDH · HBase

0 likes · 12 min read


Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Practical Guide to Building an Advertising Project with Spark and Kudu

This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, Yarn execution, and crontab scheduling for daily automated runs.

Big Data · Hadoop · Kudu

0 likes · 11 min read


Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Data Statistics and Processing Guide

This article demonstrates how to implement an advertising business data statistics pipeline using Spark and Kudu, detailing metric requirements, Scala processing code, complex SQL aggregations, schema design, and data sinking for verification.

Big Data · Kudu · SQL

0 likes · 7 min read


Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Business Project: Step-by-Step Implementation

This article walks through the complete implementation of an advertising statistics pipeline using Spark and Kudu, covering requirement analysis, Scala code development, SQL queries, schema definition, and data sinking, with full code snippets and execution results.

Big Data · Kudu · SQL

0 likes · 7 min read


Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Refactoring, Scala Traits, ETL Processor, and Project Entry

This article walks through a Spark and Kudu advertising project, explaining the refactoring approach, Scala trait usage, implementation of ETL and province‑city statistics processors, and shows the complete Spark application entry point with full code examples.

Big Data · ETL · Kudu

0 likes · 7 min read


Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Spark + Kudu Advertising Project: Province‑City Statistics and Data Persistence

This tutorial walks through a Spark‑Kudu advertising project that computes province‑city distribution statistics using SQL, defines the necessary schema, and demonstrates how to write the aggregated results back to a Kudu table for persistent storage, complete with Scala code examples.

Big Data · Data Engineering · Kudu

0 likes · 4 min read


Big Data Technology & Architecture

Aug 19, 2020 · Big Data

Big Data ETL Project: Parsing Advertising JSON with Spark, IP Lookup, and Storing into Kudu

This tutorial describes how to place advertising JSON data on HDFS, use Spark for ETL and analysis, enrich logs with IP lookup, and persist the results into Kudu with daily scheduling, including code examples and schema definitions.

Big Data · ETL · IP lookup

0 likes · 17 min read


Beike Product & Technology

Aug 17, 2020 · Big Data

Bitmap-Based User Segmentation in a DMP Platform Using ClickHouse

This article describes how a data management platform (DMP) at Beike leverages ClickHouse bitmap structures and Spark pipelines to generate global numeric user IDs, design tag-specific bitmap rules for enum, continuous, and date attributes, handle boundary cases, and produce high‑performance bitmap SQL for real‑time user group estimation and complex segment logic.

Big Data · ClickHouse · DMP

0 likes · 17 min read


Big Data Technology & Architecture

Aug 11, 2020 · Big Data

Consuming Kerberos‑Protected Kafka Data with Spark Streaming and Storing into Kudu

This guide demonstrates how to configure a Spark Streaming application running on YARN in cluster mode to securely consume Kerberos‑protected Kafka topics and write the processed data into Kudu tables, including necessary Java code, Kerberos keytab setup, Kafka client configuration, and spark‑submit commands.

Big Data · Java · Kerberos

0 likes · 11 min read


Big Data Technology & Architecture

Aug 4, 2020 · Big Data

Manual Kafka Offset Management in Spark Streaming using createDirectStream (Java & Scala)

This article explains how to use Spark Streaming's Direct Approach with Kafka, manually manage offsets, and provides complete Java and Scala implementations—including a JavaKafkaManager class, a demo application, and a Scala KafkaManager—illustrating the creation of DirectKafkaInputDStream, offset handling, and integration with Spark.

Java · Offset Management · Scala

0 likes · 14 min read


Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data · Data Warehouse · ETL

0 likes · 22 min read


Jul 24, 2020 · Artificial Intelligence

DLFlow: An End-to-End Deep Learning Solution for Big Data Offline Tasks

DLFlow, an end‑to‑end framework from Didi’s user‑profile team, merges Spark and TensorFlow to automate feature preprocessing, large‑scale distributed training, and massive prediction for big‑data offline tasks, offering configuration‑driven pipelines, task scheduling, and easy deployment that dramatically speeds model development.

Distributed Computing · Model Development · Spark

0 likes · 9 min read


Big Data Technology & Architecture

Jul 18, 2020 · Big Data

Common Spark SQL, Spark Core, PySpark, and Streaming Issues and Their Solutions

This article compiles frequent Spark SQL, Spark Core, PySpark, and Streaming problems—such as filesystem errors, configuration pitfalls, memory limits, shuffle failures, and version incompatibilities—along with concise explanations of their causes and step‑by‑step remediation methods for big‑data environments.

Big Data · PySpark · SQL

0 likes · 14 min read


Full-Stack Internet Architecture

Jul 17, 2020 · Big Data

Understanding Apache Spark Unified (Dynamic) Memory Management

This article explains Spark's transition from static to unified memory management, detailing on‑heap and off‑heap memory regions, key configuration parameters, dynamic allocation behavior, and legacy mode, helping users optimize executor memory usage for both batch and streaming workloads.

Executor · Memory Management · Spark

0 likes · 7 min read


Big Data Technology & Architecture

Jul 16, 2020 · Big Data

Spark Configuration Parameters and Performance Tuning Guidelines

This article explains the purpose, default values, and practical tuning recommendations for common Spark submit options such as executor counts, memory settings, shuffle behavior, speculation, and various Spark SQL configurations to help users optimize big‑data workloads.

Big Data · Configuration · Executor

0 likes · 14 min read


Tencent Cloud Developer

Jul 13, 2020 · Big Data

Building MVP: A Lightweight Big Data Analysis System for Product Growth

The article describes how a lightweight big‑data analysis platform called MVP was built from scratch—using a User‑Event‑Config model, HDFS + ClickHouse + Spark, and four modules for metric monitoring, root‑cause alerts, deep growth analysis, and A/B testing—enabling real‑time insights in seconds instead of days and dramatically accelerating product‑growth operations.

AARRR Model · ClickHouse · HDFS

0 likes · 9 min read
