Tagged articles

Spark

623 articles

Apr 9, 2019 · Big Data

How Compute‑Storage Separation Cuts Costs and Boosts Performance for Big Data on Kubernetes

This article examines the challenges of big‑data storage in containerized environments, compares compute‑storage‑separated architectures with traditional setups, presents performance and cost benchmarks of Alibaba Cloud ECS instances, and outlines practical storage options such as OSS, NAS, and DFS for Spark workloads on Kubernetes.

Cloud Native · Compute-Storage Separation · Spark

0 likes · 14 min read

Alibaba Cloud Native

Apr 2, 2019 · Big Data

Inside Spark Operator: How Kubernetes Manages Spark Jobs End‑to‑End

This article explains the internal architecture of Spark Operator, covering Kubernetes operator fundamentals, CRD definitions, code layout, job submission flow, state machine handling, monitoring integration, and troubleshooting techniques for reliable Spark workloads on Kubernetes.

Big Data · CRD · Go

0 likes · 11 min read

Architecture Digest

Mar 28, 2019 · Backend Development

Aloha: A Scala‑Based Distributed Task Scheduling and Management Framework

Aloha is a Scala‑implemented distributed scheduling framework built on Spark that provides extensible plugins, high‑availability master/worker architecture, REST submission, custom application interfaces, event listeners, and a Scala‑based RPC system for managing long‑running tasks such as Spark, Flink, and ETL jobs.

Distributed Scheduling · RPC · Scala

0 likes · 17 min read

dbaplus Community

Mar 21, 2019 · Big Data

How Real-Time Data Platforms Evolve: From Storm to Flink and Kubernetes

This article summarizes Wang Xinchun's 2018 DAMS China Data Asset Management Summit talk, detailing the current state, core services, responsibilities, evolution, architecture, challenges, and future directions of a large‑scale real‑time data platform built on Storm, Spark, Flink, and Kubernetes, including a unified data management approach.

Data Platform · Flink · Real-time Streaming

0 likes · 22 min read

58 Tech

Mar 15, 2019 · Big Data

Optimizing Spark Join Operations in Spark Core and Spark SQL

This article explains how to improve Spark join performance by reducing shuffle, using appropriate partitioners, applying broadcast hash joins for small tables, and selecting the optimal join strategy (broadcast, shuffle hash, or sort‑merge) in both Spark Core and Spark SQL.

JOIN · Optimization · Shuffle

0 likes · 6 min read

dbaplus Community

Mar 14, 2019 · Operations

How Top Internet Companies Scale Spark CI/CD Across Tens of Thousands of Nodes

This article details a practical, production‑grade Spark CI/CD workflow using GitLab and Jenkins, covering source management, multi‑branch release strategies, automated testing, gray‑release, hot‑fix handling, and rollback mechanisms for large‑scale deployments.

Big Data · CI/CD · Continuous Delivery

0 likes · 17 min read

Youzan Coder

Mar 8, 2019 · Big Data

Why Spark Shuffle Often Runs Out of Memory and How to Fix It

This article examines Spark's memory management and the shuffle process, identifies the components that consume the most memory during shuffle write and read, analyzes common OOM scenarios such as task concurrency and data skew, and offers configuration tips to prevent out‑of‑memory failures.

MemoryManagement · OutOfMemory · Shuffle

0 likes · 14 min read

dbaplus Community

Mar 5, 2019 · Databases

How HTAP and DRDS HTAP Enable Real‑Time OLTP/OLAP Integration

This article explains the concepts of OLTP, OLAP and HTAP, describes the DRDS HTAP architecture—including its engine and storage layers, Fireworks Spark‑based engine, optimizer stages, and streaming capabilities—and demonstrates cross‑database MPP queries and streaming joins while outlining suitable use cases and limitations.

DRDS · Database Architecture · HTAP

0 likes · 17 min read

DataFunTalk

Mar 1, 2019 · Big Data

Renrenche Mobile Data Platform: Architecture, Real‑Time Computing, and BI Solutions

The article presents Renrenche’s end‑to‑end mobile data platform, detailing its overall architecture, real‑time Spark‑based computation engine, Web IDE, metadata management, BI reporting built on ClickHouse, and how data‑driven practices empower both online and offline business operations.

BI reporting · Big Data · ClickHouse

0 likes · 15 min read

Beike Product & Technology

Feb 21, 2019 · Big Data

DATABUS Data Integration Platform: Architecture, Capabilities, and TiDB Ecosystem

The article presents an in‑depth overview of the DATABUS data integration platform, detailing its background, current challenges, core capabilities such as data syncing, metadata automation, real‑time subscriptions, and its reliance on TiDB, TiSpark, Hudi, and related big‑data technologies to enable near‑real‑time data warehousing.

Big Data · Data Integration · Hive

0 likes · 13 min read

Big Data Technology & Architecture

Feb 15, 2019 · Big Data

Big Data Mastery Roadmap

This article outlines a comprehensive series of over 500 planned tutorials covering Java advanced features, distributed theory, Hadoop, Spark, Flink, and various big‑data storage and processing technologies, designed to guide engineers transitioning into big‑data development from fundamentals to expert level.

Data Engineering · Flink · Hadoop

0 likes · 4 min read

Sohu Tech Products

Feb 13, 2019 · Big Data

Evolution and Implementation Details of Spark Shuffle Mechanisms

This article examines the historical evolution of Spark's shuffle implementations—from early Hash‑Based Shuffle to modern SortShuffleWriter, BypassMergeSortShuffleWriter, and UnsafeShuffleWriter—explaining their design choices, selection criteria, and the corresponding shuffle reader architecture in a production‑grade Spark 2.1.1 environment.

Big Data · Distributed Computing · Shuffle

0 likes · 13 min read

JD Tech

Jan 18, 2019 · Big Data

Technical Overview of JD's New Business Intelligence Platform: Offline OLAP, Real‑time Data, and Visualization Solutions

The article details JD's 2018 upgrade of its Business Intelligence platform, describing how unified offline OLAP with ClickHouse, Spark, and Scala, timeliness optimizations, and a React‑based visualization component library together improve data consistency, performance, and user experience for merchants.

ClickHouse · Data Visualization · OLAP

0 likes · 7 min read

JD Tech

Jan 11, 2019 · Big Data

Spark Memory Management and Tuning Practices for Large-Scale Billing Systems

This article explains how Spark's memory management models and configuration parameters can be tuned to handle massive billing data efficiently, covering StaticMemoryManager vs UnifiedMemoryManager, storage and shuffle memory fractions, common OOM and file‑not‑found issues, and practical performance‑optimisation tips.

Distributed Computing · Memory Management · Performance Tuning

0 likes · 9 min read

360 Quality & Efficiency

Jan 4, 2019 · Big Data

Overview of Big Data Processing Engines: MapReduce, Tez, Spark, and Flink

This article reviews the evolution and characteristics of major big‑data processing engines—from first‑generation Hadoop MapReduce to second‑generation DAG‑based Tez, third‑generation in‑memory Spark, and fourth‑generation real‑time Flink—highlighting their batch and streaming use cases.

Big Data · Flink · MapReduce

0 likes · 9 min read

Big Data Technology & Architecture

Jan 2, 2019 · Big Data

Optimizing Spark Direct Kafka Consumption: Subpartition Concurrency and Repartition Strategies

To address the long processing time caused by uneven Spark partitions when reading Kafka via the Direct approach, this article explains the SPARK‑22056 solution that modifies KafkaRDD.getPartitions to support a configurable 'topic.partition.subconcurrency' parameter, discusses its trade‑offs, and presents alternative repartition and multithreading techniques.

Big Data · Scala · Spark

0 likes · 6 min read

Big Data Technology & Architecture

Jan 2, 2019 · Big Data

Understanding Spark Streaming Backpressure Mechanism

The article explains how Spark Streaming backpressure, introduced in version 1.5, automatically adjusts data ingestion rates based on processing delays, replaces manual rate limits, and details its architecture, configuration parameters, and usage for preventing data backlog and executor OOM.

Big Data · Rate Control · Spark

0 likes · 6 min read

Big Data Technology & Architecture

Jan 1, 2019 · Big Data

Insights from the Real-Time Big Data Meetup: Spark Structured Streaming Overview

The meetup on September 8, co‑hosted by InfoQ and Huawei Cloud, featured Databricks engineer Tathagata Das explaining Spark Structured Streaming’s concepts, fault‑tolerance, performance, event‑time handling, and real‑world use cases such as Apple’s security platform, highlighting its scalability and integration with various data sources.

Big Data · Spark · Structured Streaming

0 likes · 8 min read

Big Data Technology & Architecture

Dec 31, 2018 · Big Data

Overview of the Big Data Ecosystem and Core Technologies

This article provides a comprehensive overview of the big data ecosystem, explaining key components such as Hadoop, HDFS, Spark, Hive, Pig, HBase, and related tools, and describes how they work together to store, process, and analyze massive datasets efficiently.

Big Data · Hadoop · Hive

0 likes · 16 min read

Architects Research Society

Dec 30, 2018 · Big Data

Overview of Major Apache Big Data Processing Frameworks

This article provides a concise overview of numerous Apache open‑source projects—including Ignite, MapReduce, Pig, JAQL, Spark, Storm, Flink, Apex, REEF, Twill, and Beam—that enable distributed in‑memory storage, real‑time and batch processing, and advanced analytics for large‑scale data workloads.

Big Data · Distributed Computing · Flink

0 likes · 22 min read

21CTO

Nov 20, 2018 · Big Data

What Languages and Tools Do Big Data Experts Use? Insights from 31 IT Leaders

Based on interviews with 31 IT leaders from 28 organizations, this article reveals the most popular programming languages, frameworks, and platforms—such as Python, Scala, Spark, Kafka, TensorFlow, and Tableau—currently driving big‑data extraction, analysis, and reporting, and highlights emerging trends and tool preferences.

Big Data · Python · Spark

0 likes · 12 min read

dbaplus Community

Nov 4, 2018 · Databases

How Spark Turns Traditional Databases into Powerful OLAP Engines

This article examines why traditional relational databases like MySQL struggle with analytical workloads, compares ROLAP and MOLAP approaches, explains Spark’s architecture and its advantages for OLAP, and details how Alibaba Cloud’s DRDS HTAP leverages a Spark‑based engine to deliver real‑time distributed query processing.

Data Warehouse · Databases · HTAP

0 likes · 11 min read

Tencent Cloud Developer

Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Cloud Data Warehouse · Data Lake

0 likes · 30 min read

360 Quality & Efficiency

Oct 15, 2018 · Big Data

An Introduction to Big Data Concepts, Hadoop Ecosystem, and Common Frameworks

This article provides a comprehensive overview of big data fundamentals, including the 4V characteristics, the Hadoop 2.0 layered architecture, a comparison between Hadoop and Spark, classification of common big‑data tools, and the typical offline and real‑time data processing workflows.

ETL · Hadoop · Spark

0 likes · 6 min read

Java Captain

Oct 1, 2018 · Big Data

What Is a Big Data Development Engineer? Roles, Skills, and Differences from Traditional Development

The article explains what a big data development engineer does, the tools and skills required such as Hadoop, Hive, Spark and Kafka, how they process massive logs to compute metrics like PV and UV, and compares this role with conventional business system development.

Data Engineering · Hadoop · Spark

0 likes · 9 min read

dbaplus Community

Aug 21, 2018 · Big Data

Master Spark Performance: Practical Development and Resource Tuning Guide

This article explains why Spark needs careful performance tuning, then details concrete development‑level optimizations (RDD reuse, persistence, shuffle avoidance, broadcast variables, Kryo serialization, data‑structure choices) and resource‑level settings (executor count, memory, cores, parallelism, memory fractions) with code examples and practical recommendations.

Broadcast Variables · Kryo Serialization · Performance Tuning

0 likes · 32 min read

Big Data and Microservices

Aug 21, 2018 · Big Data

How to Build a Scalable Hadoop‑Spark Big Data Analytics Platform

This article explains why BI is essential for big data platforms, outlines the value hierarchy of data, details the Hadoop‑based analysis workflow, and provides step‑by‑step guidance for constructing both pure Hadoop and hybrid Hadoop‑Spark analytics architectures.

BI · Big Data Architecture · Data Lake

0 likes · 12 min read

High Availability Architecture

Aug 21, 2018 · Databases

Nebula: A Scalable Versioned Data Storage Platform for Airbnb Search Backends

Nebula is a schema‑less, versioned data storage service built by Airbnb that unifies real‑time random access and offline batch processing, supporting low‑latency queries, incremental updates, and scalable snapshots using DynamoDB, HFileService, Spark pipelines, and Kafka streams.

Airbnb · DynamoDB · Nebula

0 likes · 13 min read

Alibaba Cloud Developer

Aug 13, 2018 · Big Data

How Ele.me Evolved Its Real‑Time Engine: From Storm to Flink

This article examines Ele.me’s big‑data platform evolution, comparing Storm, Spark Streaming, Structured Streaming, and Flink, detailing their architectures, consistency semantics, performance trade‑offs, and why Flink became the preferred real‑time computation engine for the company.

Big Data · Flink · Spark

0 likes · 15 min read

dbaplus Community

Jun 14, 2018 · Big Data

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.

Big Data · Data Architecture · Data Lake

0 likes · 10 min read

ITPUB

Jun 14, 2018 · Big Data

Why Suning.com Sticks with Hadoop: Insights into China’s Big Data Platform Choices

Amid declining Hadoop usage reports, Suning.com’s 2018‑2020 big‑data platform case study reveals why the retailer still relies on Hadoop’s mature ecosystem, how it integrates HDFS, HBase, YARN, Hive, Spark, Flink and emerging tools, and what future resource‑management plans it envisions.

Data Platform · Flink · Hadoop

0 likes · 11 min read

Liulishuo Tech Team

Jun 12, 2018 · Big Data

Highlights from Spark+AI Summit 2018: Hydrogen, MLflow, Delta, Spark 2.3, and Shuffle Optimization

The 2018 Spark+AI Summit in San Francisco showcased Spark's evolution toward unified AI and big‑data processing, introducing the Hydrogen project with gang scheduling, the open‑source MLflow platform, the Delta unified analytics engine, Spark 2.3 enhancements, and Facebook's shuffle I/O optimizations.

AI · Delta Lake · Hydrogen

0 likes · 8 min read

ITPUB

Jun 10, 2018 · Big Data

13 Must‑Know Open‑Source Tools in the Hadoop Ecosystem

This article introduces Hadoop’s origins and core challenges, then presents thirteen essential open‑source tools spanning resource scheduling, real‑time query engines, and additional processing frameworks, detailing each project's purpose, key features, and repository locations to help practitioners choose the right component for big‑data workloads.

Hadoop · Impala · Spark

0 likes · 12 min read

ITPUB

Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Gartner · Hadoop

0 likes · 9 min read

ITPUB

Jun 3, 2018 · Big Data

Spark vs Hadoop: Which Distributed System Fits Your Data Needs?

An in‑depth comparison of Hadoop and Spark examines their architectures, performance, cost, security, and machine‑learning capabilities, helping readers decide which open‑source distributed processing platform best matches their batch, streaming, and analytical workloads.

Big Data · Hadoop · Spark

0 likes · 13 min read

ITPUB

Jun 2, 2018 · Big Data

Mastering Spark: Core Concepts, Architecture, Streaming & Performance Tuning

This comprehensive guide explains Spark's ecosystem, execution principles, key features, deployment architectures, core concepts like RDD, Transformations, Actions, Jobs, Stages, Shuffle and Cache, as well as Spark Streaming mechanics and practical resource‑tuning tips for optimal big‑data processing.

Big Data · Performance Tuning · RDD

0 likes · 15 min read

ITPUB

May 31, 2018 · Big Data

Mastering Spark on DataMagic: Fast‑Track Your Big Data Skills

This article explains Spark's role in the DataMagic platform, outlines four practical steps to quickly master Spark, details key configuration and parallelism settings, shows how to modify Spark code, and provides operational tips for cluster management and job troubleshooting.

Big Data · Configuration · DataMagic

0 likes · 10 min read

dbaplus Community

May 30, 2018 · Big Data

Understanding Spark Executor Memory Management: On‑Heap, Off‑Heap, and Unified Strategies

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap allocation, static versus unified memory managers, storage and execution memory handling, RDD persistence levels, eviction policies, and shuffle memory usage, providing practical formulas and configuration tips for optimal performance.

Big Data · Executor · Memory Management

0 likes · 23 min read

Tencent Cloud Developer

Apr 12, 2018 · Big Data

Spark Usage in DataMagic Platform: A Practical Guide

This guide explains how DataMagic leverages Spark on YARN for fast, scalable offline analytics—covering Spark’s core role, four steps to master its terminology, configurations, parallelism, and code modification, plus practical deployment scripts, dynamic resource tuning, MongoDB export, job troubleshooting, and cluster upkeep for trillion‑record workloads.

DataMagic · Spark · Spark optimization

0 likes · 11 min read

Qunar Tech Salon

Apr 9, 2018 · Big Data

Analysis of Apache Spark 2.2.1 Memory Management Model

This article examines Spark's unified memory manager in version 2.2.1, detailing on‑heap and off‑heap memory regions, the four on‑heap memory pools, dynamic execution‑storage memory sharing, task memory accounting, and provides concrete calculation examples to explain UI discrepancies and runtime memory limits.

Big Data · Executor · Memory Management

0 likes · 13 min read

ITFLY8 Architecture Home

Feb 25, 2018 · Big Data

Building Scalable Data Platforms with SMACK: Spark, Mesos, Akka, Cassandra & Kafka

Learn how to construct a scalable data processing platform using the SMACK stack—Spark, Mesos, Akka, Cassandra, and Kafka—covering storage design, processing workflows, resource management, deployment options, and fault‑tolerant task execution for both batch and streaming workloads.

Akka · Cassandra · Mesos

0 likes · 14 min read

Huawei Cloud Developer Alliance

Jan 16, 2018 · Artificial Intelligence

How to Build a Scalable Spark-Based Text Sentiment Analysis System

This article walks through constructing a Spark-powered text sentiment analysis pipeline—from crawling movie reviews, preprocessing and feature extraction with jieba and TF‑IDF, to training Naive Bayes and SVM classifiers—while discussing Spark's advantages and ways to improve model accuracy.

Big Data · NLP · Python

0 likes · 19 min read

MaGe Linux Operations

Jan 14, 2018 · Artificial Intelligence

7 Essential Python Tools Every Data Scientist Must Master

This article introduces seven must‑know Python tools—including IPython, GraphLab Create, Pandas, PuLP, Matplotlib, Scikit‑Learn, and Spark—explaining their key features and how they empower data scientists to work efficiently in production environments.

GraphLab · IPython · Pandas

0 likes · 9 min read

Meituan Technology Team

Oct 12, 2017 · Big Data

Spark Aggregation Operations Deep Dive

The seminar provides professionals with a comprehensive deep dive into Spark’s aggregation mechanisms, examining memory overhead, performance bottlenecks, and practical optimization techniques for large‑scale distributed data processing, enabling attendees to tackle real‑world big‑data challenges more efficiently.

Optimization · Spark · Technical Seminar

0 likes · 1 min read

dbaplus Community

Sep 26, 2017 · Big Data

How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

Accumulator · Memory Management · Performance Tuning

0 likes · 15 min read

21CTO

Sep 26, 2017 · Big Data

How NTE Algorithm Accelerates New Common‑Friend Discovery in Billion‑Scale Graphs

Introducing the NTE (New Triangle Enumeration) algorithm, a divide‑and‑conquer approach that transforms the computation of newly added common friends in massive social graphs into efficient triangle enumeration tasks, with detailed implementations using GraphX‑based GTE, join‑based JTE, and sort‑based STE methods.

GraphX · Social Network Analysis · Spark

0 likes · 12 min read

Qunar Tech Salon

Sep 25, 2017 · Big Data

Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Big Data · Data Warehouse · Hive

0 likes · 21 min read

Hujiang Technology

Sep 15, 2017 · Fundamentals

Technical Salon – Evolving Architecture Practices (Shanghai, Sep 24)

The event showcases three technical talks covering the evolution of Hujiang's storage architecture, the transformation of the Dianrong payment system into a public service, and an introduction to the TiSpark project that integrates TiDB with Spark, highlighting design choices, trade‑offs, and future directions.

Ceph · Spark · TiDB

0 likes · 3 min read

Architecture Digest

Sep 3, 2017 · Big Data

An Overview of Big Data Processing Frameworks: Batch, Stream, and Hybrid Systems

This article introduces the evolution of big‑data processing from Google’s MapReduce concept to modern open‑source frameworks, defines big data and its 3V characteristics, outlines typical processing pipelines, and compares batch, stream, and hybrid systems such as Hadoop, Storm, Samza, Spark, and Flink.

Batch Processing · Big Data · Flink

0 likes · 20 min read

dbaplus Community

Aug 21, 2017 · Big Data

How to Tackle Spark Data Skew: Practical Solutions and Real‑World Examples

This article explains what Spark data skew is, why it hurts performance, and presents six practical mitigation techniques—including adjusting parallelism, custom partitioners, map‑side joins, and adding random prefixes—backed by detailed experiments, code snippets, and performance comparisons.

Data Skew · Map-side Join · Partitioner

0 likes · 18 min read

21CTO

Aug 13, 2017 · Artificial Intelligence

How Distributed Machine Learning Platforms Compare: Spark, PMLS, TensorFlow

This article surveys distributed machine‑learning platforms, classifies them into basic data‑flow, parameter‑server, and advanced data‑flow models, examines Spark, PMLS (Petuum), TensorFlow and MXNet, presents performance comparisons on EC2 instances, and discusses bottlenecks, fault tolerance, and future research directions.

Parameter Server · Spark · TensorFlow

0 likes · 12 min read

High Availability Architecture

Aug 2, 2017 · Artificial Intelligence

A Comparative Study of Distributed Machine Learning Platforms: Design Methods and Evaluation

This article surveys design approaches for distributed machine learning platforms, classifies them into basic dataflow, parameter‑server, and advanced dataflow models, examines examples such as Spark, PMLS, TensorFlow and MXNet, and presents performance evaluations and future research directions.

Parameter Server · Spark · TensorFlow

0 likes · 10 min read

High Availability Architecture

Jul 19, 2017 · Artificial Intelligence

Weiflow: A Scalable Machine Learning Workflow Framework for Sina Weibo

The article introduces Weiflow, a dual‑layer DAG‑based machine‑learning workflow framework designed for Sina Weibo, and explains how its modular XML configuration, Scala implementation, and integration with Spark, TensorFlow, Hive, Storm, and Flink improve development efficiency, scalability, and execution performance across the entire ML pipeline.

Big Data · DAG · Scala

0 likes · 16 min read

High Availability Architecture

Jul 12, 2017 · Artificial Intelligence

Machine Learning Platform and Risk‑Control Applications at DianRong Net

The article presents a comprehensive overview of DianRong Net's in‑house machine‑learning platform built on Spark, its workflow, pain points it addresses, risk‑control case studies using graph mining, and practical tips for improving model performance through data, algorithms, hyper‑parameter tuning and ensemble methods.

Big Data · Model Optimization · Spark

0 likes · 14 min read

21CTO

Jun 9, 2017 · Big Data

From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

Big Data · Data Engineering · Hive

0 likes · 20 min read

Architecture Digest

Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Hive · Spark

0 likes · 17 min read

Liulishuo Tech Team

Jun 8, 2017 · Big Data

Highlights from Spark Summit 2017: New Features in Spark 2.2, Deep Learning Integration, and Structured Streaming

The article recaps Liulishuo engineers' experience at Spark Summit 2017, covering Spark 2.2's cost‑based optimizer, production‑ready Structured Streaming, deep‑learning support via UDFs, live demos recognizing James Bond, and insights from vendor booths and industry case studies.

SQL · Spark · Spark 2.2

0 likes · 5 min read

MaGe Linux Operations

May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · Key-Value Store

0 likes · 9 min read

Suning Technology

May 18, 2017 · Big Data

Why Apache Flink Beats Spark and Storm in Stream Processing

This article examines Apache Flink's stream‑processing architecture, compares its native streaming model, fault‑tolerance, performance and SQL capabilities with Spark and Storm, and concludes that Flink offers a more powerful and efficient solution despite some maturity gaps.

Apache Flink · Spark · Storm

0 likes · 12 min read

ITPUB

May 8, 2017 · Big Data

Master Spark Performance: Practical Tuning Tips and Real‑World Examples

This article explains essential Spark concepts, illustrates common performance bottlenecks, and provides concrete tuning strategies for memory, CPU, serialization, data locality, file I/O, and shuffle reduction, backed by real‑world examples and visual metrics.

Big Data · CPU optimization · Configuration

0 likes · 19 min read

Architects' Tech Alliance

May 7, 2017 · Big Data

Building a Complete Big Data Platform: From Hadoop Basics to Real‑Time Analytics

This guide walks beginners through the entire big‑data ecosystem—explaining the 4V characteristics, listing essential open‑source components, teaching Hadoop setup, Hive and SparkSQL usage, data ingestion with Sqoop, Flume and Kafka, task scheduling with Oozie, and real‑time processing with Storm and Spark Streaming.

Big Data · Hadoop · Hive

0 likes · 20 min read

MaGe Linux Operations

May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive

0 likes · 13 min read

Architecture Digest

Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop

0 likes · 11 min read

ITPUB

Apr 11, 2017 · Big Data

Understanding Spark Executor Memory Management: On‑heap, Off‑heap, and Unified Approaches

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap allocation, static versus unified memory managers, storage and execution memory handling, RDD persistence, eviction policies, and the role of Tungsten's page‑based management in optimizing performance.

Big Data · Executor · Memory Management

0 likes · 23 min read

Qunar Tech Salon

Apr 11, 2017 · Big Data

Implementing Dynamic Scaling for Spark on Mesos Using Marathon and Docker

This article describes how a team migrated Spark 1.6.x running on Mesos to a Marathon‑Docker based architecture that provides dynamic executor scaling, resolves configuration and resource‑allocation issues, and improves monitoring, fault‑tolerance, and upgrade processes for large‑scale streaming workloads.

Docker · Dynamic Scaling · Marathon

0 likes · 17 min read

ITFLY8 Architecture Home

Mar 26, 2017 · Big Data

How to Build Scalable Log Monitoring and Analytics with ELK, Kafka, and Spark

This article explains various enterprise log types, recommends monitoring tools like Cacti, Zabbix, Splunk, and the ELK stack, and details architectures for handling server, application, and user‑click logs using technologies such as Logstash, Elasticsearch, Kibana, Kafka, Flume, and Spark.

Analytics · Big Data · ELK

0 likes · 26 min read

ITPUB

Mar 22, 2017 · Big Data

Why Spark Beats MapReduce: The RDD Story and Spark SQL Evolution

This article walks through Spark’s origins, its core RDD concept, how it improves on Hadoop’s MapReduce, the role of in‑memory processing, functional programming support, and the emergence of Spark SQL with DataFrames and the Catalyst optimizer.

Big Data · Distributed Computing · MapReduce

0 likes · 25 min read

Qunar Tech Salon

Mar 1, 2017 · Big Data

Building Prism: Qunar’s Real‑Time Data Platform and DevOps Journey

The article describes how Qunar designed and evolved its Prism real‑time data platform—leveraging ELK, Kafka, Spark, Docker, and Mesos—to improve data collection, monitoring, and analysis, reduce deployment time, and support scalable DevOps operations across the company.

Big Data · ELK · Real-time Data

0 likes · 11 min read

ITFLY8 Architecture Home

Feb 24, 2017 · Big Data

How ELK, Kafka, and Spark Streaming Revolutionize Log Management in Big Data Environments

This article explores the evolution of log processing in the big‑data era, detailing how ELK Stack, Kafka, and Spark Streaming work together to provide scalable, real‑time log collection, analysis, and visualization for modern cloud‑native operations.

Big Data · ELK · Log Processing

0 likes · 12 min read

Efficient Ops

Feb 23, 2017 · Operations

How Qunar Built Prism: A Real‑Time Data Platform That Halves Deployment Time

This article describes how Qunar’s Prism platform combines ELK, Kafka, Spark, Docker and other open‑source tools to create a real‑time data pipeline that speeds up problem localization, reduces deployment time, and improves resource utilization across development and operations teams.

Docker · ELK · Real-time Data

0 likes · 14 min read

Architecture Digest

Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL

0 likes · 5 min read

Qunar Tech Salon

Jan 24, 2017 · Artificial Intelligence

Practical Approaches to Deploying Machine Learning Models: Real‑time SOA, PMML, Rserve, and Spark

This article shares practical engineering experiences for deploying machine learning models in various scenarios—real‑time low‑volume predictions via Rserve or Python‑httpserve, high‑throughput real‑time serving with PMML‑wrapped Java classes, and offline batch predictions using simple shell scripts—detailing tools, performance considerations, and implementation steps.

Model Deployment · PMML · Python

0 likes · 11 min read

Ctrip Technology

Jan 5, 2017 · Artificial Intelligence

Practical Approaches to Deploying Machine Learning Models: PMML, Rserve, and Spark in Production

This article shares practical engineering experiences for deploying machine learning models in production, covering three typical scenarios—real‑time small data, real‑time large data, and offline predictions—and detailing how to use PMML, Rserve, Spark, shell scripts, and related tools to meet performance and operational requirements.

Model Deployment · PMML · Rserve

0 likes · 12 min read

dbaplus Community

Dec 18, 2016 · Big Data

How DWS Uses Log‑Based Architecture for Real‑Time Data Integration

This article explains the design and implementation of the DWS platform, detailing its log‑driven architecture with Dbus, Wormhole, and Swifts, the technical choices behind real‑time data extraction, transformation, and delivery, and real‑world use cases in finance.

CDC · Canal · Real-time Streaming

0 likes · 22 min read

Java High-Performance Architecture

Dec 13, 2016 · Big Data

What Is Apache Beam and How Does It Simplify Distributed Data Processing?

Apache Beam is an open‑source, unified programming model for distributed data processing that lets developers write pipelines once and run them on multiple execution engines such as Spark, Flink, or Dataflow, simplifying code reuse and easing migration between frameworks.

Apache Beam · Distributed Computing · Java

0 likes · 5 min read

Architects' Tech Alliance

Dec 6, 2016 · Big Data

How Hulu’s Segmentation System Powers Big Data Marketing at Scale

At the 2016 WOT Big Data Technology Summit, Hulu’s senior R&D manager Zhao Kunliang presented the company’s Segmentation system, detailing its Hadoop‑based architecture, Spark and Spark Streaming processing, the custom Nesto query engine, and the challenges and innovations involved in supporting large‑scale marketing and advertising analytics.

Hadoop · Marketing Analytics · Nesto

0 likes · 5 min read

StarRing Big Data Open Lab

Nov 18, 2016 · Big Data

Unveiling Modern Big Data Architecture: Key Technologies and Trends

This article reviews a comprehensive big‑data lecture covering traditional databases, Hadoop ecosystems, commercial big‑data platforms, computing models, analysis techniques, visualization, and leading vendors, highlighting how these technologies shape today’s data‑driven enterprises.

Big Data · Data Architecture · Hadoop

0 likes · 14 min read

StarRing Big Data Open Lab

Nov 11, 2016 · Big Data

Why SQL Still Rules Big Data—and How NoSQL & NewSQL Fit In

The article explores the evolution of data processing from Hadoop and Spark to modern SQL, NoSQL, and NewSQL solutions, comparing their architectures, performance trade‑offs, and use‑cases, while illustrating concepts with examples like MapReduce, Hive, Impala, and streaming platforms such as Storm.

Big Data · Hadoop · NewSQL

0 likes · 14 min read

StarRing Big Data Open Lab

Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Distributed Computing · Hadoop

0 likes · 21 min read

GF Securities FinTech

Sep 21, 2016 · Big Data

How GF Securities Leverages Lambda/Kappa Architectures for Real-Time Stock Analytics

This article explains how GF Securities built a customized Lambda/Kappa‑style big‑data platform that integrates CEP, Spark, Flink and Kafka to deliver low‑latency stock price alerts, real‑time news, and capital‑flow trading strategies for the finance industry.

CEP · Lambda architecture · Spark

0 likes · 18 min read

Architecture Digest

Sep 17, 2016 · Big Data

Spark Introduction and Integration with MongoDB: Architecture, Use Cases, and Code Samples

This article introduces Apache Spark as a fast, general‑purpose big‑data engine, explains its ecosystem, compares HDFS with MongoDB, and demonstrates how Spark can be combined with MongoDB through the Mongo‑Spark connector, including real‑world case studies and sample code.

Big Data · Connector · MongoDB

0 likes · 18 min read

Ctrip Technology

Aug 19, 2016 · Big Data

Ctrip's Big Data Architecture and Personalized Recommendation System

This article describes how Ctrip transformed its traditional application architecture into a high‑concurrency, big‑data‑driven platform, detailing storage, compute, and business‑layer redesigns that enable massive data ingestion, real‑time user‑intent services, and a scalable personalized recommendation system.

Big Data · Ctrip · Hadoop

0 likes · 14 min read

MaGe Linux Operations

Aug 11, 2016 · Big Data

Essential MapReduce, HBase, and Spark Configuration Parameters for Faster, More Stable Jobs

This article compiles the most frequently used configuration parameters for MapReduce, HBase, and Spark, explaining their purposes and recommended settings to improve job performance, reliability, and resource utilization in big‑data environments.

Big Data · Configuration · HBase

0 likes · 8 min read

ITPUB

Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · Data Warehouse

0 likes · 13 min read

Architecture Digest

Jul 16, 2016 · Big Data

Building a Closed-Loop Data Platform: Architecture, Technologies, and Case Studies

This article describes how to design and implement a closed‑loop data platform using Python, Java, and Spark stacks, covering data acquisition, structuring, mining, visualization, real‑time processing, and deployment with Docker, ELK, Kafka, and cloud services, illustrated by three industry case studies.

Docker · ELK · Spark

0 likes · 13 min read

dbaplus Community

Jul 4, 2016 · Databases

Why Oracle’s Demise Doesn’t Signal the End of SQL – Insights from a NoSQL Migration

The article explains how a company is retiring Oracle due to cost and scalability, outlines a staged move to NoSQL and cloud storage, argues that SQL remains vital, and shares practical examples of Spark SQL rewrites and JDBC‑based Hive integration.

Databases · NoSQL · Oracle

0 likes · 9 min read

Efficient Ops

Jun 30, 2016 · Big Data

How Spark Enables Real‑Time Microservice Performance Profiling

This article explains how IBM Research and Cloudinsight use Apache Spark to capture, analyze, and visualize microservice communication in real time, addressing challenges of observability, bottleneck detection, and latency attribution in large‑scale cloud environments.

Operational Monitoring · Spark · performance profiling

0 likes · 10 min read

iFlytek Mobile Internet Technology Team

Jun 14, 2016 · Big Data

How BitMap Accelerates Active-Day Distribution Calculations in Big Data

BitMap, a space‑saving bit‑array structure, can replace costly I/O‑heavy Spark jobs for computing user active‑day distributions by converting joins and distinct operations into fast bitwise logic, enabling efficient 30‑day rolling metrics with minimal memory and superior performance, as demonstrated by real‑world benchmarks.

Active Days · Big Data · Spark

0 likes · 8 min read

dbaplus Community

Jun 7, 2016 · Big Data

What Is Big Data? Value, Platforms, and How to Harness Its Power

This article explains what big data is, where its value lies, how to design and build a big data platform, and the essential steps to turn massive data into actionable business insights while addressing technical and operational challenges.

BI · Big Data · Data Value

0 likes · 16 min read

360 Quality & Efficiency

Jun 6, 2016 · Big Data

Spark and MongoDB Tutorial: Daily Active User Statistics with Scala

This tutorial guides readers through using Apache Spark and MongoDB to compute daily active user statistics, covering Spark fundamentals, a Spark‑vs‑Hadoop comparison, MongoDB use cases, environment setup, Scala code workflow, Maven compilation, and job submission on a YARN cluster.

Big Data · MongoDB · Scala

0 likes · 11 min read

Architecture Digest

May 25, 2016 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article provides a comprehensive guide on tackling Spark performance bottlenecks by diagnosing data skew, locating the offending stages and operators, and applying a range of practical solutions—including Hive pre‑processing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, and combined strategies—followed by an in‑depth discussion of shuffle manager evolution and key configuration parameters for fine‑tuning.

Big Data · Data Skew · Performance Tuning

0 likes · 35 min read

High Availability Architecture

May 19, 2016 · Big Data

Comprehensive Overview of Apache Spark: Architecture, RDD Principles, Execution Modes, and Spark 2.0 Features

This article provides an in‑depth technical overview of Apache Spark, covering its core concepts such as RDDs, transformation and action operations, execution models, Spark 2.0 enhancements like unified DataFrames/Datasets, whole‑stage code generation, Structured Streaming, and practical performance‑tuning guidance.

DataFrames · Performance Optimization · RDD

0 likes · 20 min read

Meituan Technology Team

May 13, 2016 · Big Data

Spark Performance Optimization Guide: Data Skew and Shuffle Tuning

This advanced Spark performance guide explains how data skew arises during shuffles and presents eight practical solutions—including Hive preprocessing, key filtering, increased shuffle parallelism, two‑stage aggregation, map joins, sampling, random prefixes, and combined strategies—while also detailing key shuffle‑tuning parameters such as spark.shuffle.file.buffer, spark.reducer.maxSizeInFlight, and spark.shuffle.manager to improve memory usage and execution speed.

Big Data · Data Skew · Performance Optimization

0 likes · 33 min read

Architecture Digest

May 4, 2016 · Big Data

Upgrading Spark from 1.4.1 to 1.6.1: Memory, Storage, and Operational Challenges

The article details the author’s experience upgrading a production Spark cluster from version 1.4.1 to 1.6.1, exposing memory‑spill, unified memory, BlockManager deadlock, Yarn‑kill, UI quirks, and Spark‑SQL compatibility issues, and proposes concrete code‑level fixes for each problem.

Big Data · Distributed Computing · Memory Management

0 likes · 14 min read

Meituan Technology Team

Apr 29, 2016 · Big Data

Introduction to Spark in Big Data

Apache Spark, a versatile big‑data platform supporting batch processing, SQL queries, real‑time streaming, and machine‑learning workloads, dramatically accelerates data‑intensive jobs, as demonstrated by Meituan‑Dianping, where its high‑performance engine reduces execution times and enhances scalability across diverse analytical and operational pipelines.

Batch Processing · Big Data · Spark

0 likes · 1 min read

Architecture Digest

Apr 25, 2016 · Big Data

Curated Learning Resources for Spark and Scala Beginners

This article compiles a comprehensive list of tutorials, books, online courses, and tools to help beginners get started with Apache Spark and the Scala programming language, including setup instructions, code snippets, and links to free and paid learning materials.

Big Data · Learning Resources · Scala

0 likes · 7 min read

21CTO

Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

Distributed Computing · RDD · Spark

0 likes · 17 min read

Java High-Performance Architecture

Apr 18, 2016 · Big Data

Why Spark Is Outpacing Hadoop: Speed, Real‑Time Processing, and ML Advantages

The article explains how Spark has become the leading open‑source big‑data platform, highlighting its superior speed, in‑memory processing, real‑time streaming, and built‑in machine‑learning library compared with Hadoop’s slower, disk‑based MapReduce approach and reliance on external storage and ML tools.

Big Data · Hadoop · Real-time Processing

0 likes · 5 min read

Architecture Digest

Apr 18, 2016 · Big Data

Introduction to Apache Spark: Architecture, RDD, Spark on YARN, and SparkSQL

This article introduces Apache Spark’s core architecture, explains how Spark runs on YARN, details driver and executor roles, describes RDD concepts and dependencies, and outlines SparkSQL’s schema‑based query processing, providing code examples for HiveContext and JDBC integration.

Big Data · Distributed Computing · RDD

0 likes · 14 min read
