Tagged articles

RDD

34 articles · Page 1 of 1

May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · Big Data · DataFrames

0 likes · 5 min read

Big Data Tech Team

Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

Data Warehouse · RDD · Spark

0 likes · 21 min read

DataFunSummit

Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.

Apache Spark · Big Data · Distributed Computing

0 likes · 16 min read

Model Perspective

Jan 6, 2024 · Fundamentals

Unlock Causal Insights: How Regression Discontinuity Design Works

Regression Discontinuity Design (RDD) leverages a predefined cutoff to compare individuals on either side, mimicking random assignment and allowing researchers to infer causal effects when randomized experiments are infeasible, with applications ranging from education scholarships to tax policies.

RDD · policy analysis · regression discontinuity

0 likes · 7 min read

JavaEdge

Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD

0 likes · 7 min read

Big Data Technology & Architecture

Dec 1, 2021 · Big Data

Understanding Spark Core, RDD, and Scheduler Components: A Practical Guide

This article introduces Spark's core concepts, explains the RDD abstraction and its four main properties, and details the roles of DAGScheduler, SchedulerBackend, TaskScheduler, and ExecutorBackend, providing practical insights for beginners and intermediate users in big‑data processing.

Big Data · DAGScheduler · RDD

0 likes · 9 min read

Qunar Tech Salon

Aug 26, 2021 · Big Data

Comprehensive Introduction to Apache Spark: History, Core Concepts, Architecture, and Performance Optimization

This article provides a thorough overview of Apache Spark, covering its origins, comparison with MapReduce, core concepts such as RDD, DAG, Jobs, Stages, and Tasks, the submission process, Web UI, and detailed performance tuning techniques including data skew mitigation.

Big Data · Data Skew · MapReduce

0 likes · 15 min read

Big Data Technology & Architecture

Jul 4, 2021 · Big Data

Comprehensive Guide to Learning Apache Spark: Background, Core Concepts, Modules, Resources, and Optimization

This article provides a thorough learning roadmap for Apache Spark, covering its background papers, core concepts such as RDD and fault tolerance, module breakdown, recommended books and repositories, source‑code reading tips, hands‑on projects, and interview‑oriented optimization guidance.

Apache Spark · Performance Optimization · RDD

0 likes · 15 min read

Big Data Technology & Architecture

Jun 4, 2021 · Big Data

Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

Optimization · RDD · Shuffle

0 likes · 21 min read

Big Data Technology & Architecture

Apr 14, 2021 · Big Data

Understanding Spark Shuffle: Write and Read Mechanisms Compared to Hadoop MapReduce

This article explains how Spark implements shuffle write and shuffle read, compares its high‑level and low‑level processes with Hadoop MapReduce, and details the internal data structures, memory‑disk trade‑offs, and configuration options that affect performance.

MapReduce · MemoryManagement · RDD

0 likes · 21 min read

Big Data Technology & Architecture

Apr 13, 2021 · Big Data

Spark Job Generation and Execution: From Logical DAG to Physical Stages and Tasks

This article explains how Spark transforms a logical execution graph into a physical job by partitioning stages, applying pipeline concepts, and generating tasks, while illustrating the process with detailed code examples and the internal workflow of job submission.

Job Scheduling · RDD · Scala

0 likes · 15 min read

Big Data Technology & Architecture

Apr 11, 2021 · Big Data

Understanding Spark RDD Logical Execution Graph and Dependency Types

This article explains how Spark builds the logical execution graph for RDDs, describes the four-step job processing pipeline, details the various dependency types such as NarrowDependency and ShuffleDependency, and reviews common transformations and their data‑flow characteristics.

RDD · Shuffle · Spark

0 likes · 19 min read

Big Data Technology & Architecture

Apr 10, 2021 · Big Data

Understanding Spark Cache and Checkpoint Mechanisms

This article explains Spark's cache and checkpoint mechanisms, detailing when to use each, how they are implemented internally, how cached and checkpointed RDDs are stored and retrieved, and the differences between caching, persisting, and checkpointing for reliable big‑data processing.

Cache · Checkpoint · Performance

0 likes · 13 min read

Architect

Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · Performance

0 likes · 33 min read

Tencent Cloud Developer

Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG scheduler

0 likes · 17 min read

Big Data Technology & Architecture

May 28, 2020 · Big Data

From SQL to RDD: Understanding Spark's Internal Architecture

This article explains how Spark converts SQL queries into RDD operations, walking through creating a SparkSession, registering temporary views, and executing SQL, and then details the underlying InternalRow, TreeNode, and Expression structures that power the Catalyst optimizer.

Catalyst · InternalRow · RDD

0 likes · 5 min read

Big Data Technology & Architecture

Feb 8, 2020 · Big Data

A Practical Guide to Reading Apache Spark Source Code and Understanding Its Core Design

This article explains why Spark is a mature big‑data framework, recommends which Spark versions to study, lists essential research papers, describes how to set up the development environment, and outlines the key components of Spark’s core architecture for effective source‑code exploration.

Apache Spark · Big Data · RDD

0 likes · 6 min read

Big Data Technology & Architecture

Jan 25, 2020 · Big Data

Spark Scala Example: Find the Most Frequent Visitor ID in a 500‑Million‑Record Dataset

This article demonstrates how to generate 500 million visitor IDs with Spark, use map‑reduce operations to count occurrences, and identify the ID with the highest visit count, while discussing performance considerations such as memory spilling and cluster resources.

Big Data · RDD · Scala

0 likes · 11 min read

Big Data Technology & Architecture

Nov 4, 2019 · Big Data

Understanding Spark Checkpoint: Purpose, Mechanism, and Best Practices

This article explains why Spark checkpoints are needed for large or complex RDD pipelines, how they work by persisting data to reliable storage such as HDFS, and outlines practical steps and best‑practice recommendations for using checkpoints effectively in production environments.

Big Data · Checkpoint · HDFS

0 likes · 6 min read

Big Data Technology & Architecture

Oct 20, 2019 · Big Data

Converting Spark RDD to DataSet/DataFrame: Two Methods and Handling Serialization Issues

This article explains two approaches—reflection‑based schema inference and programmatic schema definition—to transform a Spark RDD into a DataSet or DataFrame, demonstrates the required code, and discusses common Task‑not‑serializable errors with practical solutions.

Big Data · RDD · Scala

0 likes · 8 min read

Big Data Technology Architecture

Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark, an open‑source unified analytics engine from UC Berkeley’s AMP Lab, is the leading platform for large‑scale batch and streaming data processing, featuring components such as Spark SQL, Streaming, GraphX, MLlib, and core modules like DAGScheduler, TaskScheduler and BlockManager.

Apache Spark · BlockManager · DAGScheduler

0 likes · 4 min read

Big Data Technology Architecture

May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · Kafka

0 likes · 10 min read

dbaplus Community

Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast Variables · Kryo Serialization · Performance Tuning

0 likes · 32 min read

ITPUB

Jun 2, 2018 · Big Data

Mastering Spark: Core Concepts, Architecture, Streaming & Performance Tuning

This comprehensive guide explains Spark's ecosystem, execution principles, key features, deployment architectures, core concepts like RDD, Transformations, Actions, Jobs, Stages, Shuffle and Cache, as well as Spark Streaming mechanics and practical resource‑tuning tips for optimal big‑data processing.

Big Data · Performance Tuning · RDD

0 likes · 15 min read

Architect

Apr 17, 2017 · Big Data

Understanding Apache Spark Architecture: RDD, Computation Model, Cluster Modes, RPC, and Core Components

This article provides a comprehensive overview of Apache Spark's architecture, covering its RDD abstraction, computation model, various cluster deployment modes, RPC communication layer, startup procedures, core components, interaction flows, and block management for broadcast variables.

Apache Spark · Big Data · Cluster Mode

0 likes · 15 min read

ITPUB

Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · Distributed Computing · MapReduce

0 likes · 25 min read

Liulishuo Tech Team

Oct 17, 2016 · Big Data

Practical Tips and Common Pitfalls for Tuning Apache Spark Performance

This article shares hands‑on experience from Spark Summit attendees, covering why Spark is powerful and common performance problems such as slow jobs, OOM, data skew, and excessive partitions, and provides concrete tuning advice on executors, cores, memory, and debugging techniques.

Apache Spark · Big Data · Data Skew

0 likes · 11 min read

High Availability Architecture

May 19, 2016 · Big Data

Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

DataFrames · Performance Optimization · RDD

0 likes · 20 min read

21CTO

Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

Distributed Computing · RDD · Spark

0 likes · 17 min read

Architecture Digest

Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · Distributed Computing · RDD

0 likes · 14 min read

21CTO

Mar 30, 2016 · Big Data

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

Apache Spark · RDD · SparkSQL

0 likes · 18 min read

dbaplus Community

Nov 27, 2015 · Big Data

Why Spark Is the Next Big Thing in Big Data: Core Concepts Explained

This article provides a comprehensive overview of Apache Spark, covering its origins, core concepts such as RDDs, transformations, actions, dependencies, execution modes, and key components like Spark SQL, Streaming, MLlib, and GraphX, while also offering practical code examples and visual illustrations.

DataFrames · GraphX · MLlib

0 likes · 18 min read

Qunar Tech Salon

Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine‑learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib

0 likes · 5 min read

MaGe Linux Operations

Feb 3, 2015 · Big Data

Why Spark Beats Hadoop: Exploring RDDs, In‑Memory Computing, and Fault Tolerance

This article explains how Apache Spark improves on Hadoop MapReduce by keeping intermediate data in memory, introduces the core RDD abstraction, compares Spark’s transformations and actions with Hadoop, and shows how Spark can run on Standalone, YARN, and various programming languages such as Scala, Java, and Python.

Big Data · Java · RDD

0 likes · 20 min read
