Tag: RDD

Python Programming Learning Circle
May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · Big Data · DataFrames
0 likes · 5 min read
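The RDD programming model this article introduces reduces to a few functional primitives. A pure-Python stand-in mirrors the classic word-count dataflow — `flat_map` and `reduce_by_key` here are hypothetical local helpers, not the real PySpark API, so no Spark installation is assumed:

```python
from collections import defaultdict
from functools import reduce

def flat_map(func, data):
    """Mimic RDD.flatMap: apply func to each element and flatten the results."""
    return [item for element in data for item in func(element)]

def reduce_by_key(func, pairs):
    """Mimic RDD.reduceByKey: merge all values sharing a key with func."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

lines = ["spark makes big data simple", "big data needs spark"]
words = flat_map(str.split, lines)                  # flatMap: line -> words
pairs = [(word, 1) for word in words]               # map: word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # reduceByKey: sum per word
print(counts["spark"], counts["data"])  # 2 2
```

In real PySpark the same pipeline runs distributed across a cluster, but the per-record logic is identical.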
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the project.

Apache Spark · Big Data · Data Processing
0 likes · 16 min read
Model Perspective
Jan 6, 2024 · Fundamentals

Unlock Causal Insights: How Regression Discontinuity Design Works

Regression Discontinuity Design (RDD) leverages a predefined cutoff to compare individuals on either side, mimicking random assignment and allowing researchers to infer causal effects when randomized experiments are infeasible, with applications ranging from education scholarships to tax policies.

RDD · causal inference · policy analysis
0 likes · 7 min read
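The cutoff-comparison logic the summary describes can be sketched in a few lines of Python — a toy simulation with made-up numbers (cutoff 50, true jump 5.0), not the article's data. Units just below and just above the cutoff are treated as comparable, mimicking random assignment:

```python
import random

# Synthetic sharp RDD: outcome rises smoothly with the running variable
# (score), plus a discrete jump for units at or above the cutoff.
random.seed(0)
CUTOFF = 50.0
TRUE_EFFECT = 5.0

data = []
for _ in range(5000):
    score = random.uniform(0, 100)
    treated = score >= CUTOFF
    outcome = 0.1 * score + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 1)
    data.append((score, outcome))

# Compare mean outcomes within a narrow bandwidth on each side of the cutoff.
BANDWIDTH = 2.0
below = [y for x, y in data if CUTOFF - BANDWIDTH <= x < CUTOFF]
above = [y for x, y in data if CUTOFF <= x <= CUTOFF + BANDWIDTH]
effect = (sum(above) / len(above)) - (sum(below) / len(below))
print(f"estimated jump at cutoff: {effect:.2f}")
```

Real analyses fit local regressions on each side rather than simple window means (to absorb the slope), but the identifying idea — a discontinuity at the cutoff reveals the causal effect — is the same.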
Architect
Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · Optimization · RDD
0 likes · 33 min read
Tencent Cloud Developer
Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG Scheduler
0 likes · 17 min read
Big Data Technology Architecture
Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark, an open‑source unified analytics engine from UC Berkeley’s AMPLab, is a leading platform for large‑scale batch and streaming data processing, featuring components such as Spark SQL, Streaming, GraphX, and MLlib, and core modules like DAGScheduler, TaskScheduler, and BlockManager.

Apache Spark · BlockManager · DAGScheduler
0 likes · 4 min read
Big Data Technology Architecture
May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map‑side and reduce‑side shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark Streaming integration with Kafka and data‑loss prevention.

Big Data · HDFS · Hadoop
0 likes · 10 min read
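The reduceByKey-vs-groupByKey comparison mentioned in the summary comes down to map-side combining. A pure-Python sketch — local lists standing in for distributed partitions, not Spark's actual API — shows why reduceByKey ships fewer records across the shuffle:

```python
from collections import Counter

# Two simulated map-task partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey-style: every record crosses the shuffle unchanged.
shuffled_without_combine = sum(len(part) for part in partitions)

# reduceByKey-style: sum values locally per partition first, so at most
# one record per key per partition crosses the shuffle.
locally_combined = []
for part in partitions:
    local = Counter()
    for key, value in part:
        local[key] += value
    locally_combined.append(list(local.items()))
shuffled_with_combine = sum(len(part) for part in locally_combined)

# Final counts are identical either way; only shuffle volume differs.
final = Counter()
for part in locally_combined:
    for key, value in part:
        final[key] += value

print(shuffled_without_combine)  # 7 records shuffled
print(shuffled_with_combine)     # 4 records shuffled
print(dict(final))               # {'a': 4, 'b': 3}
```

With many values per key the gap grows dramatically, which is why reduceByKey (and aggregateByKey/combineByKey) is the standard advice over groupByKey for aggregations.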
Liulishuo Tech Team
Oct 17, 2016 · Big Data

Practical Tips and Common Pitfalls for Tuning Apache Spark Performance

This article shares hands‑on experience from Spark Summit attendees, covering why Spark is powerful and common performance problems such as slow jobs, out‑of‑memory (OOM) errors, data skew, and excessive partitions, and provides concrete tuning advice on executors, cores, memory, and debugging techniques.

Apache Spark · Big Data · Executor Configuration
0 likes · 11 min read
High Availability Architecture
May 19, 2016 · Big Data

Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

Big Data · DataFrames · RDD
0 likes · 20 min read
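The transformation-vs-action distinction covered here is Spark's lazy-evaluation model: transformations only record a plan, and nothing executes until an action forces it. A toy illustration — `LazyRDD` is a hypothetical local class, not Spark's API:

```python
class LazyRDD:
    """A tiny local stand-in for an RDD with lazy transformations."""

    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # recorded, not-yet-executed steps

    def map(self, func):                 # transformation: just record it
        return LazyRDD(self._data, self._plan + [("map", func)])

    def filter(self, pred):              # transformation: just record it
        return LazyRDD(self._data, self._plan + [("filter", pred)])

    def collect(self):                   # action: now execute the whole plan
        items = iter(self._data)
        for kind, func in self._plan:
            if kind == "map":
                items = map(func, items)
            else:
                items = filter(func, items)
        return list(items)

rdd = LazyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No computation has happened yet; collect() triggers the pipeline end to end.
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

Deferring execution like this is what lets Spark's scheduler see the whole DAG before running it — fusing stages, pipelining narrow dependencies, and (in Spark 2.0's whole-stage code generation) compiling a chain of operators into a single loop.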
Architecture Digest
Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · RDD · Spark
0 likes · 14 min read
Qunar Tech Salon
Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Spark Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib
0 likes · 5 min read