Tagged articles
34 articles
Big Data Tech Team
Apr 17, 2025 · Big Data

Essential Spark Interview Q&A: Master Data Warehouse Engineer Questions

This article compiles a comprehensive set of Spark interview questions frequently asked by leading tech companies, providing detailed explanations of Spark’s performance mechanisms, architecture, RDD persistence, checkpointing, streaming, dependency types, HA setup, and practical coding examples to help data warehouse engineers prepare effectively.

RDD · Spark · Spark Streaming
0 likes · 21 min read
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.

Apache Spark · Big Data · RDD
0 likes · 16 min read
Model Perspective
Jan 6, 2024 · Fundamentals

Unlock Causal Insights: How Regression Discontinuity Design Works

Regression Discontinuity Design (RDD) leverages a predefined cutoff to compare individuals just above and just below it, mimicking random assignment near the threshold and allowing researchers to infer causal effects when randomized experiments are infeasible, with applications ranging from education scholarships to tax policies.

RDD · policy analysis · regression discontinuity
0 likes · 7 min read
JavaEdge
Apr 17, 2022 · Big Data

Why Spark Overtook MapReduce: Core Advantages and RDD Programming Model

The article explains how Spark, developed by UC Berkeley's AMP Lab, quickly surpassed MapReduce by offering faster execution, a simpler Scala‑based programming model, lazy RDD transformations, a rich ecosystem including SQL, Streaming, MLlib and GraphX, and practical code examples such as a three‑line WordCount.

Big Data · MapReduce · RDD
0 likes · 7 min read
Big Data Technology & Architecture
Jul 4, 2021 · Big Data

Comprehensive Guide to Learning Apache Spark: Background, Core Concepts, Modules, Resources, and Optimization

This article provides a thorough learning roadmap for Apache Spark, covering its background papers, core concepts such as RDD and fault tolerance, module breakdown, recommended books and repositories, source‑code reading tips, hands‑on projects, and interview‑oriented optimization guidance.

Apache Spark · Learning Path · Performance Optimization
0 likes · 15 min read
Big Data Technology & Architecture
Jun 4, 2021 · Big Data

Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

RDD · Shuffle · Spark
0 likes · 21 min read
Big Data Technology & Architecture
Apr 10, 2021 · Big Data

Understanding Spark Cache and Checkpoint Mechanisms

This article explains Spark's cache and checkpoint mechanisms, detailing when to use each, how they are implemented internally, how cached and checkpointed RDDs are stored and retrieved, and the differences between caching, persisting, and checkpointing for reliable big‑data processing.

Cache · Checkpoint · RDD
0 likes · 13 min read
Architect
Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data · RDD · Resource Tuning
0 likes · 33 min read
Tencent Cloud Developer
Nov 13, 2020 · Big Data

Apache Spark Core: Architecture, Components, and Execution Flow

Apache Spark Core is a high‑performance, fault‑tolerant engine that abstracts distributed computation through SparkContext, DAG and Task schedulers, supports in‑memory and disk storage, runs on various cluster managers (YARN, Kubernetes, etc.), and unifies batch, streaming, ML and graph processing via its rich ecosystem.

Apache Spark · Big Data · DAG scheduler
0 likes · 17 min read
Big Data Technology Architecture
Jul 10, 2019 · Big Data

Introduction to Apache Spark and Its Core Components

Apache Spark, an open‑source unified analytics engine from UC Berkeley’s AMP Lab, is a widely used platform for large‑scale batch and streaming data processing, featuring components such as Spark SQL, Streaming, GraphX, MLlib, and core modules like DAGScheduler, TaskScheduler and BlockManager.

Apache Spark · BlockManager · DAGScheduler
0 likes · 4 min read
Big Data Technology Architecture
May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · Kafka
0 likes · 10 min read
dbaplus Community
Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast Variables · Kryo Serialization · RDD
0 likes · 32 min read
ITPUB
Jun 2, 2018 · Big Data

Mastering Spark: Core Concepts, Architecture, Streaming & Performance Tuning

This comprehensive guide explains Spark's ecosystem, execution principles, key features, deployment architectures, core concepts like RDD, Transformations, Actions, Jobs, Stages, Shuffle and Cache, as well as Spark Streaming mechanics and practical resource‑tuning tips for optimal big‑data processing.

Big Data · Cluster · RDD
0 likes · 15 min read
ITPUB
Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · MapReduce · RDD
0 likes · 25 min read
High Availability Architecture
May 19, 2016 · Big Data

Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

DataFrames · Performance Optimization · RDD
0 likes · 20 min read
21CTO
Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL
0 likes · 17 min read
21CTO
Mar 30, 2016 · Big Data

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

Apache Spark · RDD · SparkSQL
0 likes · 18 min read
dbaplus Community
Nov 27, 2015 · Big Data

Why Spark Is the Next Big Thing in Big Data: Core Concepts Explained

This article provides a comprehensive overview of Apache Spark, covering its origins, core concepts such as RDDs, transformations, actions, dependencies, execution modes, and key components like Spark SQL, Streaming, MLlib, and GraphX, while also offering practical code examples and visual illustrations.

DataFrames · GraphX · MLlib
0 likes · 18 min read
Qunar Tech Salon
Aug 18, 2015 · Big Data

Overview of Spark Big Data Analytics Framework Components

Spark’s big‑data analytics ecosystem comprises core components such as the in‑memory RDD data structure, Streaming for real‑time processing, GraphX for graph analytics, MLlib for machine‑learning, Spark SQL for querying, the Tachyon file system, and SparkR, each enabling scalable, distributed computation.

Big Data · GraphX · MLlib
0 likes · 5 min read