Tagged articles
156 articles
Big Data Technology & Architecture
Dec 22, 2019 · Big Data

Dynamic Resource Allocation in Spark Streaming: Problems, Mechanisms, and Practical Guidelines

The article explains Spark's default static resource allocation, analyzes the limitations of its Dynamic Resource Allocation (DRA) for streaming workloads, describes the internal Spark components and code paths involved, and proposes concrete design and configuration recommendations for implementing more responsive executor scaling.

Big Data · Dynamic Resource Allocation · Executor Management
11 min read
DataFunTalk
Nov 21, 2019 · Big Data

Evolution of 58.com Real-Time Computing Platform and the One-Stop Streaming Data Processing System Wstream

The article details the technical evolution of 58.com’s real-time computing platform—from Storm and Spark Streaming to a Flink‑based one‑stop solution called Wstream—covering use cases, architecture, stability measures, migration from Storm, operational diagnostics, and future development plans.

Big Data · Flink · Real-time Streaming
11 min read
Big Data Technology & Architecture
Sep 19, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

This article presents a comprehensive analysis of Meituan's Hadoop YARN fair scheduler, detailing its architecture, resource abstractions, scheduling workflow, performance bottlenecks, fine‑grained metrics, and a series of optimization techniques—including sorting improvements, job‑skip reduction, parallel queue sorting, and robust rollout strategies—to achieve high‑throughput, low‑latency scheduling for large‑scale offline, streaming, and machine‑learning workloads.

Big Data · Fair Scheduler · Resource Management
24 min read
Tencent Cloud Developer
Sep 11, 2019 · Big Data

YARN Practice and Technical Evolution at Kuaishou

Jiaoxiao Fang’s talk details Kuaishou’s YARN deployment, covering its architecture, support for offline, real‑time and ML workloads, and recent enhancements such as event‑handling stability, refined preemption, high‑throughput parallel scheduling, shuffle‑caching for small I/O, plus plans for job protection and multi‑cluster resource utilization.

Big Data · Cluster Optimization · Distributed Systems
16 min read
Tencent Cloud Developer
Aug 30, 2019 · Big Data

How Tencent Cloud Leverages Spark, ElasticSearch, and Flink for PB‑Scale Data Warehousing

The cloud+ community and Kuaishou hosted a big‑data technology salon where experts detailed the evolution, architecture, and practical deployments of Spark‑based cloud data warehouses, ElasticSearch, YARN, and Flink, highlighting trends, optimization techniques, and future directions for enterprise data analytics.

Big Data · Elasticsearch · Flink
22 min read
Qunar Tech Salon
Aug 22, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

This article details Meituan's experience optimizing the Hadoop YARN fair scheduler, covering background challenges, architectural components, resource abstractions, scheduling flow, performance metrics, a series of code‑level optimizations, stability strategies for production rollout, and future directions for large‑scale cluster scheduling.

Big Data · Fair Scheduler · Load Simulation
23 min read
Meituan Technology Team
Aug 1, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

Meituan improved its custom Hadoop YARN Fair Scheduler by pre‑computing resource usage, filtering zero‑demand jobs, and parallelizing queue sorting, which reduced sorting time from 30 s to 5 s per minute, boosted container‑per‑second throughput to 50 k, enabled live roll‑backs, and prepared the system for clusters up to 10 k nodes and future scaling to hundreds of thousands.

Big Data · Fair Scheduler · Hadoop
24 min read
21CTO
Jun 28, 2019 · Big Data

Master Hadoop High Availability: A Complete Step‑by‑Step HA HDFS & YARN Guide

This article provides a comprehensive, language‑agnostic tutorial on building a highly available Hadoop cluster, covering HDFS and YARN HA architectures, QJM shared storage, required components, configuration files, installation commands, startup procedures, verification steps, and troubleshooting references.

Cluster Setup · HDFS · Hadoop
20 min read
DataFunTalk
Jun 17, 2019 · Big Data

Understanding Hadoop’s Core Competitiveness in the Trillion‑Scale Data Era

This article explores Hadoop’s role in the big‑data era, detailing its architecture, core components such as HDFS, YARN, MapReduce, Ozone and Submarine, the challenges of trillion‑scale data, and why its scalability, cost efficiency, and mature ecosystem give it a competitive edge.

Data Lake · Distributed Systems · Hadoop
11 min read
Big Data Technology & Architecture
Apr 16, 2019 · Big Data

Features, Configuration Parameters, and Implementation Details of Hadoop Capacity Scheduler

The article provides a comprehensive overview of Hadoop's Capacity Scheduler, describing its resource‑allocation features, configurable XML parameters, queue access controls, dynamic configuration updates, and the internal workflow of application initialization and resource scheduling within YARN.

CapacityScheduler · Hadoop · ResourceManagement
13 min read
Big Data Technology & Architecture
Feb 26, 2019 · Big Data

Deploying Apache Flink Clusters: Standalone and YARN Modes

This guide explains how to set up an Apache Flink cluster on CentOS 7 using three deployment methods—Local, Standalone, and Flink on YARN/Kubernetes—including host configuration, SSH setup, package distribution, configuration file editing, cluster start/stop commands, YARN resource manager concepts, session commands, job submission, fault‑tolerance settings, and log inspection.

Big Data · Cluster Deployment · Configuration
11 min read
Youzan Coder
Feb 1, 2019 · Big Data

Design and Implementation of Log Parsing for a Big Data Offline Task Platform

The article describes a log‑parsing feature for Youzan’s big‑data offline platform that captures runtime logs from Hive, Spark, DataX, MapReduce and HBase jobs, categorizes scheduling types, extracts metrics such as read/write bytes, shuffle volume and GC time, and processes them in real time via a Filebeat‑Logstash‑Kafka‑Spark‑Streaming pipeline storing results in Redis for monitoring, optimization and resource‑usage ranking.

Big Data · Resource Monitoring · YARN
7 min read
Youzan Coder
Jan 16, 2019 · Big Data

How Youzan Scaled Real‑Time Analytics with Flink: Architecture, Pitfalls, and Lessons

This article walks through Youzan's real‑time platform architecture, explains why Flink was chosen over Spark Structured Streaming, details practical challenges such as container over‑provisioning and monitoring overhead, shares solutions for Spring integration and async caching, and outlines future directions for SQL‑based streaming and scheduler improvements.

Big Data · Flink · Real-time Streaming
19 min read
JD Tech
Jul 9, 2018 · Big Data

JD's Large‑Scale Hadoop Cluster Resource Management and Scheduling Architecture

This article describes how JD built a multi‑regional, ten‑thousand‑node Hadoop ecosystem, unified resource management with YARN, introduced a three‑level Router scheduling layer, optimized performance, and integrated deep‑learning frameworks to achieve high availability, cost efficiency, and scalable big‑data processing.

Distributed Scheduling · Hadoop · JD.com
12 min read
ITPUB
Jun 10, 2018 · Big Data

13 Must‑Know Open‑Source Tools in the Hadoop Ecosystem

This article introduces Hadoop’s origins and core challenges, then presents thirteen essential open‑source tools spanning resource scheduling, real‑time query engines, and additional processing frameworks, detailing each project's purpose, key features, and repository locations to help practitioners choose the right component for big‑data workloads.

Hadoop · Impala · Spark
12 min read
ITPUB
Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 claim that Hadoop is nearing the end of its production maturity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Ecosystem · Gartner
9 min read
ITPUB
May 31, 2018 · Big Data

Mastering Spark on DataMagic: Fast‑Track Your Big Data Skills

This article explains Spark's role in the DataMagic platform, outlines four practical steps to quickly master Spark, details key configuration and parallelism settings, shows how to modify Spark code, and provides operational tips for cluster management and job troubleshooting.

Big Data · Cluster Management · Configuration
10 min read
21CTO
May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Hadoop · Shuffle · YARN
12 min read
Architects' Tech Alliance
May 14, 2018 · Big Data

Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization

This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.

Big Data · Hadoop · MapReduce
13 min read
Tencent Cloud Developer
Apr 12, 2018 · Big Data

Spark Usage in DataMagic Platform: A Practical Guide

This guide explains how DataMagic leverages Spark on YARN for fast, scalable offline analytics—covering Spark’s core role, four steps to master its terminology, configurations, parallelism, and code modification, plus practical deployment scripts, dynamic resource tuning, MongoDB export, job troubleshooting, and cluster upkeep for trillion‑record workloads.

DataMagic · Spark · Spark optimization
11 min read
dbaplus Community
Apr 7, 2018 · Cloud Native

What Makes Distributed Schedulers Tick? Patterns from YARN to Kubernetes

This article surveys the architecture of cluster resource managers and task schedulers—covering definitions, design principles, and three main categories (centralized, two‑level, and shared‑state) with concrete examples such as Hadoop YARN, Mesos, Spark Drizzle, Borg and Kubernetes—while highlighting their trade‑offs in scalability, fault‑tolerance, and flexibility.

Kubernetes · Mesos · Omega
27 min read
ITPUB
Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce
15 min read
Beike Product & Technology
Mar 9, 2018 · Big Data

How Lianjia Built a Low‑Latency Real‑Time Data Platform with Spark Streaming

This article details Lianjia's journey of designing and implementing a low‑latency, stable real‑time computing platform using Spark Streaming on YARN, covering technical selection, architecture components, version compatibility challenges, exactly‑once semantics, graceful shutdown, Kafka tuning, and future enhancements.

Big Data · Exactly-Once · Kafka
11 min read
58 Tech
Feb 7, 2018 · Frontend Development

ArthurCI: Accelerating Frontend Continuous Integration with Stable Infrastructure

The article introduces ArthurCI, a front‑end continuous‑integration platform developed by 58, detailing its design, performance optimizations such as Yarn caching and parallel webpack compression, ease‑of‑use integration steps, stability features, and future data‑driven enhancements, while comparing it with tools like TravisCI.

CI · DevOps · YARN
9 min read
Full-Stack DevOps & Kubernetes
Oct 21, 2017 · Big Data

Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS

This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.

Big Data · CDH · CentOS
18 min read
dbaplus Community
Sep 26, 2017 · Big Data

How to Avoid Common Spark SQL Pitfalls and Boost Performance

This article shares a comprehensive set of practical tips and solutions for common Spark SQL issues—including out‑of‑memory errors, UDF‑induced GC, thread blocking, system‑property initialization, speculation side‑effects, accumulator traps, concurrent job scheduling, and excessive logging—helping engineers improve stability and efficiency of their Spark‑based financial systems.

Accumulator · Memory Management · Spark
15 min read
Ctrip Technology
Sep 20, 2017 · Big Data

Building a Real‑Time Computing Platform with Spark Streaming at Ctrip: Design, Implementation, and Lessons Learned

This article describes how Ctrip migrated its large‑scale real‑time platform from JStorm to Spark Streaming, detailing the architectural design, the Muise Spark Core encapsulation, operational metrics, encountered pitfalls, and future plans to adopt Flink and Beam for streaming workloads.

Big Data · Exactly-Once · Spark Streaming
22 min read
Node Underground
Jul 20, 2017 · Frontend Development

How to Build a Minimal Package Manager from Scratch

This article explains why package managers are essential, showcases Yarn's step‑by‑step tutorial for creating a simple package manager, and highlights how the resulting tool handles classic challenges like circular dependencies and file‑structure optimization.

Software Development · YARN · dependency resolution
2 min read
Tencent Music Tech Team
Jun 23, 2017 · Backend Development

New Features and Changes in npm@5: Detailed Overview and Comparison with Yarn

npm 5 introduces automatic package‑lock generation, default --save, enhanced Git and file‑dependency handling, new prepack/postpack scripts, stronger integrity checks, a fully managed cache and registry tweaks, while narrowing Yarn’s speed advantage despite early bugs, making it a compelling alternative for npm‑centric workflows.

YARN · dependency management · npm
15 min read
Node Underground
Jun 22, 2017 · Backend Development

8 Essential Node.js Practices Every Backend Developer Should Follow

This article presents eight practical recommendations for Node.js developers, covering dependency locking, lifecycle scripts, modern JavaScript, promises with async/await, code formatting with Prettier, continuous integration testing, security headers via Helmet, and serving over HTTPS.

HTTPS · Node.js · Prettier
4 min read
Qunar Tech Salon
Mar 14, 2017 · Backend Development

Node.js 2016 Review, Applications, and 2017 Outlook

This article reviews the major Node.js events of 2016—including version updates, the left‑pad controversy, Yarn, Chrome DevTools debugging, and ecosystem tools—describes common application scenarios and framework selection criteria, and offers predictions for Node.js development in 2017.

Async · Backend · Framework
17 min read
CSS Magic
Oct 13, 2016 · Frontend Development

Yarn Explained: Facebook’s Faster, Safer JavaScript Package Manager

The article details how Facebook built Yarn to overcome npm’s consistency, security, and speed limitations, describing the evolution of their package‑management workflow, Yarn’s lockfile architecture, parallel installation process, additional features, production adoption, and simple commands to get started.

JavaScript · YARN · frontend
13 min read
MaGe Linux Operations
Aug 23, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop Cluster on CentOS 6.5

This article provides a comprehensive, hands‑on tutorial for setting up a Hadoop 2.6.4 cluster on a CentOS 6.5 development server, covering SSH password‑less login, user/group creation, DNS configuration, JDK installation, environment variables, Hadoop installation, HDFS and YARN configuration, and troubleshooting native library warnings.

Big Data · CentOS · Cluster Setup
12 min read
Hulu Beijing
May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data · HDFS · Hadoop
5 min read
21CTO
Apr 18, 2016 · Big Data

How Spark Runs on YARN: From Client Submission to Executor Execution

This article explains the end‑to‑end workflow of Spark on YARN, covering client initialization, ApplicationMaster actions, driver and executor roles, RDD fundamentals, SparkSQL processing, and practical code examples for building and tuning distributed Spark jobs.

RDD · Spark · SparkSQL
17 min read
21CTO
Mar 30, 2016 · Big Data

Unveiling Spark on YARN: From RDD Basics to Cluster Execution

This article explains Apache Spark’s core concepts, the RDD programming model, how Spark runs on YARN with driver and executor nodes, the distinction between transformations and actions, partitioning strategies, and an overview of SparkSQL processing.

Apache Spark · RDD · SparkSQL
18 min read
ITPUB
Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Cluster Management · Hadoop
7 min read

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.

Data Platform · Hadoop · Spark
16 min read
Hulu Beijing
Aug 14, 2015 · Big Data

How Voidbox Bridges Docker and YARN for Scalable Big Data Workloads

Voidbox integrates Docker containers with YARN to simplify distributed application development, improve deployment, boost cluster efficiency, and provide fault‑tolerant, DAG‑based execution modes, enabling seamless resource management for Hadoop‑based big data jobs.

Big Data · Cluster Computing · DAG
17 min read