Tagged articles

Hadoop

413 articles · Page 3 of 5

Sep 6, 2019 · Big Data

Big Data Development Interview Guide and Skill Tree Overview

This article provides a comprehensive interview roadmap for big data developers. It outlines essential Java fundamentals, JVM internals, Linux basics, distributed theory, core frameworks (Hadoop, Spark, Flink, Kafka, Netty, HBase, and Hive), and practical algorithm topics, and offers resume and career advice for aspiring candidates.

Flink · Hadoop · Kafka

0 likes · 15 min read

Architects' Tech Alliance

Aug 24, 2019 · Big Data

Reimagining Big Data in a Post‑Hadoop World

The article analyzes the decline of Hadoop as the dominant big‑data platform, explains how cloud‑based services are replacing its complex on‑premises architecture, and outlines the lessons and future directions for enterprises navigating a post‑Hadoop landscape.

Big Data · Hadoop · data lakes

0 likes · 12 min read

360 Tech Engineering

Aug 22, 2019 · Big Data

Design and Implementation of XStore: A Hadoop‑Based Sample Storage System

This article details the design, architecture, and operational experience of XStore, a Hadoop‑backed sample storage system that handles billions of APK and other binary samples, addressing functional and non‑functional requirements such as real‑time upload, large‑scale storage, high‑performance reads, and disaster recovery.

HBase · HDFS · Hadoop

0 likes · 11 min read

360 Zhihui Cloud Developer

Aug 22, 2019 · Big Data

Mastering HDFS: Architecture, Read/Write, and Best Practices Explained

This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write processes, component roles, command-line tools, replica placement strategies, and the advantages and disadvantages of using Hadoop's distributed file system for large-scale data storage.

Data Replication · Distributed File System · HDFS

0 likes · 11 min read

Architects' Tech Alliance

Aug 20, 2019 · Big Data

Current State and Future Trends of Hadoop in the Big Data Landscape

Despite recent market turbulence and negative headlines, Hadoop's revenue continues to grow, driven by cloud migration, evolving storage solutions, and increasing adoption of related projects like Spark and Kafka, positioning it as a leading data‑lake technology.

Apache Spark · Big Data · Data Lake

0 likes · 8 min read

dbaplus Community

Aug 19, 2019 · Big Data

Automating Fault Recovery in 5,000‑Node Hadoop Clusters with Fabric & CM_API

This article explains how a large‑scale Hadoop environment can automatically detect common failures (swap usage, clock drift, agent crashes, role outages, and disk imbalance) and recover from them using Prometheus alerts, Fabric/Paramiko remote execution, and Cloudera Manager APIs, complete with code examples and step‑by‑step commands.

Big Data Operations · CM_API · Cluster Automation

0 likes · 12 min read

Big Data Technology Architecture

Aug 16, 2019 · Big Data

In‑Depth Overview of HBase Architecture

This article provides a comprehensive, illustrated explanation of Apache HBase's architecture, covering its master‑slave components, region management, ZooKeeper coordination, data flow for reads and writes, storage structures, compaction processes, fault recovery, and the system's strengths and limitations within the Hadoop ecosystem.

Compaction · HBase · Hadoop

0 likes · 21 min read

Big Data Technology Architecture

Aug 5, 2019 · Big Data

Zookeeper in Distributed Systems: Roles in Kafka, Hadoop, HBase, and Solr

This article explains ZooKeeper’s core concepts and its ZAB consensus protocol, and surveys its essential roles in major big‑data components such as Kafka, Hadoop, HBase, and Solr, illustrating how it provides configuration, naming, coordination, leader election, and high‑availability services across distributed architectures.

HBase · Hadoop · Kafka

0 likes · 5 min read

Big Data Technology Architecture

Aug 2, 2019 · Big Data

The Rise and Decline of Hadoop: Market Shifts, Ecosystem Evolution, and Future Outlook

This article examines Hadoop’s historical development, the recent financial troubles of its three major vendors, the impact of public‑cloud services, competition from technologies like MongoDB and Elasticsearch, and how the evolving ecosystem and hybrid cloud strategies shape Hadoop’s relevance today.

Cloud Computing · Hadoop · Hadoop Vendors

0 likes · 23 min read

Meituan Technology Team

Aug 1, 2019 · Big Data

Performance Optimization Practices for Meituan's Hadoop YARN Fair Scheduler

Meituan improved its custom Hadoop YARN Fair Scheduler by pre‑computing resource usage, filtering zero‑demand jobs, and parallelizing queue sorting, which reduced sorting time from 30 s to 5 s per minute, raised scheduling throughput to 50 k containers per second, enabled live roll‑backs, and prepared the system for clusters up to 10 k nodes and future scaling to hundreds of thousands.

Big Data · Fair Scheduler · Hadoop

0 likes · 24 min read

vivo Internet Technology

Jul 29, 2019 · Big Data

Is Hadoop Dead? An Analysis of Cloudera’s Move Toward an Enterprise Data Cloud

While Hadoop remains a powerful but complex batch‑processing engine, Cloudera’s merger with Hortonworks and its pivot toward an enterprise data cloud—offering hybrid, multi‑cloud analytics, security, and governance—signals a strategic shift that keeps Hadoop relevant yet no longer central amid rising competitors like MongoDB and Elasticsearch.

Cloud Computing · Cloudera · Data Governance

0 likes · 10 min read

dbaplus Community

Jul 24, 2019 · Big Data

Essential Open-Source Tools Every Big Data Engineer Should Know

This article compiles a comprehensive list of common open‑source tools for big data platforms—covering programming languages, data collection, ETL, storage, analysis, query, management, and monitoring—to help learners and practitioners quickly locate and understand the technologies they need.

Big Data · Data Engineering · ETL

0 likes · 15 min read

Full-Stack Internet Architecture

Jul 24, 2019 · Big Data

Eight Strategies for Handling Massive Data in Internet Applications

The article outlines eight practical techniques—including caching, page staticization, database optimization, hot‑data separation, operation merging, read‑write splitting, distributed databases, and the use of NoSQL and Hadoop—to efficiently store and serve massive data volumes in large‑scale internet services.

Caching · Hadoop · NoSQL

0 likes · 7 min read

System Architect Go

Jul 19, 2019 · Big Data

Introduction to HBase: Architecture, Data Model, and Operations

This article provides a comprehensive overview of HBase, covering its distributed column‑oriented architecture, data model components, storage mechanisms, read/write processes, WAL lifecycle, MemStore flushing, region splitting and merging, and failure recovery within the Hadoop ecosystem.

Big Data · Distributed storage · HBase

0 likes · 20 min read

Big Data Technology & Architecture

Jun 30, 2019 · Big Data

Curated Collection of Big Data, Flink, Hadoop and Real‑Time Computing Articles from the “Big Data Technology and Architecture” Series

This article presents a carefully organized catalogue of over a hundred technical posts covering Flink source‑code analysis, fundamental and advanced big‑data structures, Hadoop ecosystem components, real‑time streaming with Spark and Kafka, as well as system design guidelines and miscellaneous insights, each linked to its original publication for easy reference.

Big Data · Flink · Hadoop

0 likes · 6 min read

21CTO

Jun 28, 2019 · Big Data

Master Hadoop High Availability: A Complete Step‑by‑Step HA HDFS & YARN Guide

This article provides a comprehensive, language‑agnostic tutorial on building a highly available Hadoop cluster, covering HDFS and YARN HA architectures, QJM shared storage, required components, configuration files, installation commands, startup procedures, verification steps, and troubleshooting references.

Cluster Setup · HDFS · Hadoop

0 likes · 20 min read

Beike Product & Technology

Jun 28, 2019 · Big Data

Hadoop NameNode Performance Bottlenecks and Solutions: Federation, ViewFS, FastCopy, Balance & Mover

This article analyzes the performance and stability bottlenecks of a Hadoop 2.7.3 NameNode caused by memory limits, RPC QPS, and long restart times, and presents a comprehensive solution stack—including HDFS federation, ViewFS, FastCopy, and tuned Balance/Mover tools—to improve scalability and reduce downtime.

Balance · FastCopy · Federation

0 likes · 11 min read

Architecture Digest

Jun 26, 2019 · Big Data

Guide to Setting Up Hadoop High Availability (HA) Cluster with HDFS and YARN

This article provides a step‑by‑step tutorial on configuring Hadoop high availability, covering HDFS HA architecture, Quorum Journal Manager synchronization, NameNode failover, YARN HA, required pre‑conditions, cluster planning, configuration files, service startup, and verification procedures.

Big Data · Cluster Setup · HDFS

0 likes · 16 min read

Didi Tech

Jun 22, 2019 · Big Data

Analysis of Hadoop RPC Architecture and Implementation

The article examines Hadoop’s RPC framework—detailing its client‑server workflow, core classes (RPC, Client, Server), dynamic proxy handling, NIO‑based server threading, configurable concurrency controls such as FairCallQueue, and a practical HDFS mkdir command example, illustrating high‑performance distributed communication.

Big Data · Hadoop · NIO

0 likes · 17 min read

DataFunTalk

Jun 17, 2019 · Big Data

Understanding Hadoop’s Core Competitiveness in the Trillion‑Scale Data Era

This article explores Hadoop’s role in the big‑data era, detailing its architecture, core components such as HDFS, YARN, MapReduce, Ozone and Submarine, the challenges of trillion‑scale data, and why its scalability, cost efficiency, and mature ecosystem give it a competitive edge.

Data Lake · Hadoop · MapReduce

0 likes · 11 min read

Full-Stack Internet Architecture

Jun 8, 2019 · Big Data

The Story of Doug Cutting: From Stanford to Hadoop and Beyond

This article chronicles Doug Cutting's journey from his humble beginnings at Stanford through his pioneering work on Lucene, Nutch, and Hadoop, highlighting how his innovations in search and distributed computing reshaped the big data landscape and led to the rise of Cloudera.

Big Data · Cloudera · Doug Cutting

0 likes · 8 min read

Big Data Technology & Architecture

May 22, 2019 · Big Data

Key Changes and New Features in Apache Flink 1.8.0 Release

Apache Flink 1.8.0 introduces incremental state cleanup with TTL, updates Hadoop support, deprecates TableEnvironment static methods, adds new Kafka deserialization schema, modifies Maven dependencies, and provides several configuration and Table API enhancements for better stream‑processing performance and compatibility.

Apache Flink · Hadoop · Table API

0 likes · 7 min read

Big Data Technology Architecture

May 18, 2019 · Big Data

Key Concepts of Kafka, Hadoop Shuffle, Spark Cluster Modes, HDFS I/O, and Spark RDD Operations

This article explains Kafka message structure and offset retrieval, details Hadoop's map and reduce shuffle processes, outlines Spark's deployment modes, describes HDFS read/write mechanisms, compares reduceByKey and groupByKey performance, and discusses Spark streaming integration with Kafka and data loss prevention.

HDFS · Hadoop · Kafka

0 likes · 10 min read

dbaplus Community

May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS

0 likes · 13 min read

Big Data Technology & Architecture

Apr 28, 2019 · Databases

Introduction to HBase: Architecture, Concepts, and Common Commands

This article introduces HBase, a distributed column‑oriented NoSQL database built on Hadoop, explains its architecture, data model, key concepts such as rowkeys, column families, timestamps, regions, and ZooKeeper, outlines its main features and typical use cases, and provides common HBase shell commands with examples.

Big Data · Database · Hadoop

0 likes · 21 min read

dbaplus Community

Apr 25, 2019 · Big Data

Cutting Hadoop Storage Costs: Replication, Compression, Tiering & Erasure Coding

This article shares practical strategies used in a multi‑petabyte Hadoop environment to slash storage expenses, covering reduced replication, selective compression formats, tiered storage policies, and erasure coding, while weighing trade‑offs in reliability, performance, and operational complexity.

HDFS · Hadoop · Storage Optimization

0 likes · 10 min read

Big Data Technology & Architecture

Apr 24, 2019 · Big Data

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big Data · Hadoop · Hive

0 likes · 11 min read

Big Data Technology & Architecture

Apr 20, 2019 · Big Data

Weekly Hadoop Knowledge Points: Compression Formats, MapReduce Join, Hive Setup, and YARN Capacity Scheduler

This weekly bulletin summarizes four Hadoop knowledge points—compression formats, MapReduce join techniques, Hive installation, and YARN Capacity Scheduler—while also sharing personal updates about a PhD graduation, the upcoming May Day holiday, and a request for likes and shares.

Big Data · Hadoop · Hive

0 likes · 2 min read

Alibaba Cloud Developer

Apr 19, 2019 · Databases

Mastering HBase: From Basics to Architecture and Cluster Design

This article introduces HBase, its origins from Google Bigtable, core concepts such as RowKey, Column Family, and Versioning, and explains its logical and physical table views, storage mechanisms, and cluster architecture within the Hadoop ecosystem.

Bigtable · Database · Distributed storage

0 likes · 8 min read

Big Data Technology & Architecture

Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Hive

0 likes · 8 min read

Big Data Technology & Architecture

Apr 16, 2019 · Big Data

Features, Configuration Parameters, and Implementation Details of Hadoop Capacity Scheduler

The article provides a comprehensive overview of Hadoop's Capacity Scheduler, describing its resource‑allocation features, configurable XML parameters, queue access controls, dynamic configuration updates, and the internal workflow of application initialization and resource scheduling within YARN.

CapacityScheduler · Hadoop · ResourceManagement

0 likes · 13 min read

Big Data Technology & Architecture

Apr 15, 2019 · Big Data

Map‑Side Join and Reduce‑Side Join Examples in Hadoop MapReduce (Java)

This article provides two reusable Java code samples that demonstrate how to perform a map‑side join and a reduce‑side join in Hadoop MapReduce, enabling efficient joining of a large dataset with a smaller reference table.

Big Data · Hadoop · JOIN

0 likes · 8 min read

Big Data Technology & Architecture

Apr 12, 2019 · Big Data

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions

This weekly note shares personal updates and a concise technical overview covering YARN's resource scheduling, Hadoop's rack‑aware architecture, HDFS data flow, and practical solutions to the HDFS small‑file problem, along with links to further reading and upcoming work plans.

Big Data · HDFS · Hadoop

0 likes · 5 min read

Architecture Digest

Apr 11, 2019 · Big Data

Understanding Hadoop and HBase: Installation, Configuration, and Basic Operations

This guide introduces Hadoop and HBase fundamentals, explains their architectures and advantages, and provides step‑by‑step instructions for setting up a multi‑node Hadoop cluster, configuring core services, installing HBase, and performing basic HBase shell operations.

Big Data · Configuration · HBase

0 likes · 18 min read

Big Data Technology & Architecture

Apr 10, 2019 · Big Data

Understanding Hadoop DistributedCache: Concepts, API Usage, and Example

This article explains Hadoop's DistributedCache mechanism, its APIs for adding cache files and archives, common use cases, important considerations, the basic workflow, and provides a complete Java Map-side join example demonstrating how to distribute and access cached data in MapReduce jobs.

DistributedCache · Hadoop · MapReduce

0 likes · 10 min read

Big Data Technology & Architecture

Apr 8, 2019 · Big Data

Understanding HDFS Data Blocks, Rack Awareness, and Dynamic Node Addition

This article explains how HDFS stores files in replicated data blocks, implements rack awareness to improve reliability and performance, shows the necessary configuration in core-site.xml, provides sample scripts, and demonstrates how to add new DataNode machines without restarting the NameNode.

Big Data · Data Block · Dynamic Node Addition

0 likes · 10 min read

Big Data Technology & Architecture

Apr 7, 2019 · Big Data

Understanding YARN: Background, Architecture, and Execution Process

This article explains why YARN was created to overcome the limitations of MapReduce 1.x, describes its architecture—including ResourceManager, NodeManager, ApplicationMaster, Container, and Client—and outlines the step‑by‑step execution flow that enables multiple computation frameworks to run on Hadoop.

Big Data · Distributed Computing · Hadoop

0 likes · 11 min read

Big Data Technology & Architecture

Apr 4, 2019 · Big Data

Weekly Knowledge Points: Interview Reflections, Hadoop Introduction, MapReduce and HDFS Overview

This weekly briefing shares five curated resources covering interview reflections, a concise Hadoop introduction, the principles of MapReduce, an overview of HDFS, and upcoming plans to study Hive and HBase, emphasizing the distributed nature of big‑data processing.

Big Data · HDFS · Hadoop

0 likes · 3 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Distributed Computing · Hadoop

0 likes · 10 min read

Big Data Technology & Architecture

Apr 1, 2019 · Big Data

Comprehensive Overview of Hadoop: Core Modules, HDFS Architecture, MapReduce, YARN, and a Scala WordCount Example

This article provides a detailed introduction to Hadoop's ecosystem—including its core modules (Common, HDFS, YARN, MapReduce), the design of a high‑availability HDFS cluster, the principles of distributed file systems, and a complete Scala WordCount MapReduce program—offering a solid foundation for big‑data practitioners.

Big Data · HDFS · Hadoop

0 likes · 15 min read

dbaplus Community

Mar 27, 2019 · Big Data

How eBay Cut Hadoop Job Runtime by 60%: Real‑World CAL Log Optimization

This article explains how eBay's CAL team reduced Hadoop MapReduce job execution time and resource consumption by over 60% through targeted GC tuning, data‑skew mitigation, and algorithmic improvements, boosting job success rates to nearly 100% while handling petabyte‑scale log data.

Big Data · Data Skew · GC Tuning

0 likes · 12 min read

Architects' Tech Alliance

Mar 18, 2019 · Big Data

Understanding HDFS Architecture, NameNode HA, and Read/Write Processes

This article explains the concepts and architecture of HDFS, the high‑availability mechanisms of NameNode including quorum‑based shared storage, the detailed read and write workflows of the distributed file system, and discusses its typical use cases and limitations.

Big Data · HA · HDFS

0 likes · 16 min read

dbaplus Community

Mar 13, 2019 · Operations

How We Upgraded a 100‑Node Hadoop Cluster with Ansible and Ambari

This article details the step‑by‑step process of modernizing a large‑scale Hadoop deployment—identifying legacy pain points, evaluating three migration strategies, selecting an in‑place upgrade using Ambari‑managed HDP, and automating the entire workflow with Ansible to minimize downtime and operational risk.

Ambari · Ansible · Hadoop

0 likes · 13 min read

Big Data Technology & Architecture

Feb 15, 2019 · Big Data

Big Data Mastery Roadmap

This article outlines a comprehensive series of over 500 planned tutorials covering Java advanced features, distributed theory, Hadoop, Spark, Flink, and various big‑data storage and processing technologies, designed to guide engineers transitioning into big‑data development from fundamentals to expert level.

Data Engineering · Flink · Hadoop

0 likes · 4 min read

Didi Tech

Jan 31, 2019 · Big Data

Router-Based Federation in Hadoop: Architecture, Components, and Didi’s Deployment

Router‑Based Federation replaces Hadoop’s single‑point HDFS bottleneck with a server‑side global namespace managed by Routers and a State Store, enabling scalable, highly available sub‑clusters; Didi back‑ported the feature, deployed five Routers, fixed numerous bugs, and contributed patches to improve stability and functionality.

Big Data · HDFS · Hadoop

0 likes · 11 min read

58 Tech

Jan 14, 2019 · Operations

Analysis of a Linux Kernel Futex Bug Causing Java and Xtrabackup Hang on Hadoop Nodes

A detailed investigation reveals that a futex bug in Linux kernel 2.6.32-504 causes Java programs on Hadoop and xtrabackup processes to hang, and the issue can be resolved by upgrading to a newer kernel version.

Futex · Hadoop · Linux

0 likes · 12 min read

Big Data Technology & Architecture

Jan 3, 2019 · Big Data

Deploying Apache Flink on YARN and Running Flink Jobs

This tutorial explains how to deploy Apache Flink on a Hadoop YARN cluster, covering both YARN session mode and direct job submission, and demonstrates running the built‑in WordCount example with command‑line options for input, output, and resource configuration.

Apache Flink · Big Data · Flink Deployment

0 likes · 8 min read

Big Data Technology & Architecture

Dec 31, 2018 · Big Data

Overview of the Big Data Ecosystem and Core Technologies

This article provides a comprehensive overview of the big data ecosystem, explaining key components such as Hadoop, HDFS, Spark, Hive, Pig, HBase, and related tools, and describes how they work together to store, process, and analyze massive datasets efficiently.

Big Data · Hadoop · Hive

0 likes · 16 min read

Python Crawling & Data Mining

Dec 5, 2018 · Big Data

Prepare Offline CDH 5.14 Installation Files on CentOS 6.7

This guide details the required system environment, download links, and offline file list for setting up Cloudera CDH 5.14 on a CentOS 6.7 server, including JDK, parcels, manager package, and MySQL connector, and explains how to upload them via FileZilla.

Big Data · CDH · CentOS

0 likes · 4 min read

Tencent Cloud Developer

Oct 30, 2018 · Big Data

Big Data Technology Trends and Cloud Data Warehouse Architecture Practices

The article reviews recent big-data trends—from Hadoop’s evolution and Spark’s in-memory advances to emerging storage like Ozone—while detailing data-warehouse models, query-optimizer techniques, and cloud-native architectures that integrate diverse data sources, enabling scalable, AI-ready analytics and modern data-lake capabilities.

Big Data · Cloud Data Warehouse · Data Lake

0 likes · 30 min read

360 Quality & Efficiency

Oct 15, 2018 · Big Data

An Introduction to Big Data Concepts, Hadoop Ecosystem, and Common Frameworks

This article provides a comprehensive overview of big data fundamentals, including the 4V characteristics, the Hadoop 2.0 layered architecture, a comparison between Hadoop and Spark, classification of common big‑data tools, and the typical offline and real‑time data processing workflows.

ETL · Hadoop · Spark

0 likes · 6 min read

Java Captain

Oct 1, 2018 · Big Data

What Is a Big Data Development Engineer? Roles, Skills, and Differences from Traditional Development

The article explains what a big data development engineer does, the tools and skills required such as Hadoop, Hive, Spark and Kafka, how they process massive logs to compute metrics like PV and UV, and compares this role with conventional business system development.

Data Engineering · Hadoop · Spark

0 likes · 9 min read

JD Tech

Sep 20, 2018 · Big Data

Optimizing Local Storage Systems for Large‑Scale Hadoop HDFS Clusters

This article explains the architecture of Hadoop HDFS, identifies performance bottlenecks in page cache and metadata handling on DataNodes, and presents four practical optimization techniques—including cache‑buffer separation, barrier disabling, directory restructuring, and real‑time monitoring—demonstrating significant throughput and latency improvements in large‑scale clusters.

HDFS · Hadoop · Linux kernel

0 likes · 14 min read

Big Data and Microservices

Sep 4, 2018 · Big Data

Exploring Five Big Data Architectures—from Traditional to Unified AI Designs

The article examines the evolution of big‑data processing by comparing five prevalent architectures—traditional Hadoop‑based stacks, streaming‑only designs, Kappa, Lambda, and the unified Unifield model—highlighting their strengths, weaknesses, and suitable scenarios while discussing the limitations of classic BI systems and the role of distributed storage, computation, and machine‑learning integration.

Big Data · Data Architecture · Hadoop

0 likes · 14 min read

Big Data and Microservices

Aug 21, 2018 · Big Data

How to Build a Scalable Hadoop‑Spark Big Data Analytics Platform

This article explains why BI is essential for big data platforms, outlines the value hierarchy of data, details the Hadoop‑based analysis workflow, and provides step‑by‑step guidance for constructing both pure Hadoop and hybrid Hadoop‑Spark analytics architectures.

BI · Big Data Architecture · Data Lake

0 likes · 12 min read

Architects Research Society

Jul 27, 2018 · Big Data

Overview of Apache Hive Features, Usage, and Management

Apache Hive is an open‑source data‑warehouse system built on Hadoop that enables users to read, write, and manage large distributed datasets using SQL‑like queries, offering features such as ETL support, various file‑format connectors, extensible UDFs, and integration with tools like Tez, Spark, and MapReduce.

Apache Hive · Big Data · Data Warehouse

0 likes · 5 min read

iQIYI Technical Product Team

Jul 26, 2018 · Big Data

Gear: An Internal Workflow Scheduling System for Hadoop at iQIYI

Gear is iQIYI’s internal, high‑availability workflow scheduler built on Apache Oozie and extended with a YAML‑based definition language, GitLab‑driven submission, and a web UI, enabling thousands of daily Hadoop/Spark jobs, complex dependencies, retries, and monitoring, and evolving from SSH‑centric 1.x to feature‑rich 2.x.

Hadoop · Job Management · Oozie

0 likes · 14 min read

Big Data and Microservices

Jul 24, 2018 · Big Data

Why Hadoop Still Leads Big Data Processing: Core Advantages Explained

This article introduces Hadoop’s open‑source big‑data framework, explains its core components HDFS and MapReduce, and outlines four key advantages—ease of deployment, robustness, scalability, and simplicity—while also covering HBase as the Hadoop‑based column‑oriented database.

Big Data · Distributed Computing · HBase

0 likes · 4 min read

Meitu Technology

Jul 24, 2018 · Big Data

Exploring Big Data Cluster Security: Evaluation of Kerberos, Apache Sentry, and Apache Ranger

The article evaluates Kerberos, Apache Sentry, and Apache Ranger for securing Meitu’s large‑scale Hadoop ecosystem, highlighting Ranger’s comprehensive, fine‑grained, policy‑based authorization across HDFS, HBase, Hive, YARN, Storm, and Kafka, and describing its configuration, LDAP integration, and custom SDK implementation.

Access Control · Apache Ranger · Apache Sentry

0 likes · 12 min read

JD Tech

Jul 10, 2018 · Big Data

Deploying Hadoop KMS for Transparent HDFS Encryption: A Step‑by‑Step Guide

This article details a complete, hands‑on deployment of Hadoop KMS on a CentOS‑based Hadoop 2.6.1 cluster, covering environment setup, configuration file changes, key generation, service startup, encryption‑zone creation, user permission tuning, verification procedures, and common troubleshooting tips.

HDFS · Hadoop · KMS

0 likes · 19 min read

JD Tech

Jul 9, 2018 · Big Data

JD's Large‑Scale Hadoop Cluster Resource Management and Scheduling Architecture

This article describes how JD built a multi‑regional, ten‑thousand‑node Hadoop ecosystem, unified resource management with YARN, introduced a three‑level Router scheduling layer, optimized performance, and integrated deep‑learning frameworks to achieve high availability, cost efficiency, and scalable big‑data processing.

Distributed Scheduling · Hadoop · JD.com

0 likes · 12 min read

UCloud Tech

Jul 7, 2018 · Big Data

How UMStor and HAdapter Power Big Data Cloud Migration with Superior Performance

The article reports on a talk by UCloud's subsidiary at ArchSummit 2018 in Shenzhen, covering the shift to the digital era, the challenges of PB‑scale data storage, and a solution that combines NFS‑Ganesha, HAdapter, and UMStor to deliver efficient big‑data‑on‑cloud performance and a data‑lake model.

Data Lake · Distributed storage · Hadoop

0 likes · 10 min read

JD Tech

Jul 4, 2018 · Big Data

ClickHouse Overview: Features, Performance, Engines, and Comparison with Hadoop

This article introduces ClickHouse as a high‑performance, column‑oriented database designed for real‑time big‑data analytics, outlines its key features, performance characteristics, supported interfaces, differences from Hadoop, and explains its main storage engines—MergeTree and Distributed—while also noting its current limitations.

ClickHouse · Columnar Database · Hadoop

0 likes · 11 min read

360 Quality & Efficiency

Jun 28, 2018 · Big Data

An Introduction to Apache Hive: Architecture, Workflow, Storage, Advantages, and Comparison with Traditional Databases

This article provides a concise overview of Apache Hive, covering its definition, Hadoop background, architecture, query workflow, storage model, advantages, disadvantages, and a comparison with traditional relational databases, helping readers understand how Hive enables SQL-like queries on data stored in HDFS.

Data Warehouse · Hadoop · Hive

0 likes · 5 min read

Architecture Digest

Jun 22, 2018 · Databases

Distributed Databases for OLAP: MPP, Hadoop Ecosystem, and Like‑Mesa (ClickHouse/Palo) Overview

This article examines the evolution and classification of distributed databases for OLAP workloads, comparing traditional RDBMS, MPP solutions such as Teradata and Greenplum, Hadoop‑based ecosystems, and newer architectures like ClickHouse and Palo, while highlighting their architectural traits, strengths, and limitations.

ClickHouse · Hadoop · MPP

0 likes · 17 min read

ITPUB

Jun 19, 2018 · Big Data

Is Hadoop Still Relevant? Comparing Hadoop, PostgreSQL, and Storm

The article examines Hadoop's relevance by contrasting it with PostgreSQL and Storm, discussing when each technology fits big‑data challenges such as volume, velocity, and variety, and highlighting cost, complexity, and use‑case considerations for enterprises.

Batch Processing · Hadoop · PostgreSQL

0 likes · 8 min read

dbaplus Community

Jun 14, 2018 · Big Data

Designing Scalable Hadoop‑Based Data Analytics Platforms: Architecture & Best Practices

This article explains how enterprises can build a scalable data analytics platform on Hadoop by outlining the multi‑layer architecture, storage options, data synchronization methods, and ETL/offline computation techniques, while highlighting practical component choices such as Hive, HBase, Spark, and Oozie.

Big Data · Data Architecture · Data Lake

0 likes · 10 min read

Full-Stack Internet Architecture

Jun 14, 2018 · Big Data

What Is Big Data? Definitions, Technologies, Skills, and Use Cases

This article explains the definition of big data, its characteristic 3Vs, common data sources, supporting IT infrastructure, key technologies such as Hadoop and Spark, specialized databases, required skills, and several practical business use cases.

Apache Spark · Data Lake · Hadoop

0 likes · 8 min read

ITPUB

Jun 14, 2018 · Big Data

Why Suning.com Sticks with Hadoop: Insights into China’s Big Data Platform Choices

Amid reports of declining Hadoop usage, Suning.com’s 2018‑2020 big‑data platform case study reveals why the retailer still relies on Hadoop’s mature ecosystem, how it integrates HDFS, HBase, YARN, Hive, Spark, Flink, and emerging tools, and what resource‑management plans it envisions for the future.

Data Platform · Flink · Hadoop

0 likes · 11 min read

ITPUB

Jun 10, 2018 · Big Data

13 Must‑Know Open‑Source Tools in the Hadoop Ecosystem

This article introduces Hadoop’s origins and core challenges, then presents thirteen essential open‑source tools spanning resource scheduling, real‑time query engines, and additional processing frameworks, detailing each project's purpose, key features, and repository locations to help practitioners choose the right component for big‑data workloads.

Hadoop · Impala · Open-source

0 likes · 12 min read

ITPUB

Jun 4, 2018 · Big Data

Is Hadoop Really Declining? Expert Insights Show Why the Ecosystem Stays Strong

Despite Gartner's 2017 assessment that Hadoop distributions would become obsolete before reaching the plateau of productivity, a series of interviews with Chinese big‑data experts reveals that Hadoop's ecosystem remains robust, with core components like HDFS, YARN, Spark, and HBase continuing to dominate the market.

Big Data · Gartner · Hadoop

0 likes · 9 min read

ITPUB

Jun 3, 2018 · Big Data

Spark vs Hadoop: Which Distributed System Fits Your Data Needs?

An in‑depth comparison of Hadoop and Spark examines their architectures, performance, cost, security, and machine‑learning capabilities, helping readers decide which open‑source distributed processing platform best matches their batch, streaming, and analytical workloads.

Big Data · Hadoop · Performance

0 likes · 13 min read

Beike Product & Technology

Jun 1, 2018 · Big Data

Design and Evolution of Lianjia's Big Data Platform: Architecture, Challenges, and Solutions

This article details Lianjia's journey from a Hadoop‑based 0.0 data platform to a sophisticated 2.0 architecture, describing the three‑layer design, OLAP engine choices, transparent compression techniques, operational challenges, and practical recommendations for building and maintaining large‑scale big data systems.

Hadoop · Kylin · OLAP

0 likes · 15 min read

DataFunTalk

May 29, 2018 · Big Data

Design, Challenges, and Evolution of Lianjia's Big Data Platform Architecture (0 → 1.0 → 2.0)

This article details the evolution of Lianjia's massive‑data platform from its initial 0 version through 1.0 and 2.0, describing architectural challenges, a three‑layer redesign, data‑engine selection (ROLAP, MOLAP, Kylin), transparent compression techniques, and practical lessons for large‑scale data systems.

Big Data · Hadoop · Kylin

0 likes · 14 min read

Architecture Digest

May 28, 2018 · Big Data

Building a Real-Time Stream Processing Platform with Hadoop Ecosystem (Kafka, Spark Streaming, HBase)

This guide details how to construct a real-time data processing platform on CentOS 7 using the Hadoop ecosystem—installing and configuring Zookeeper, Maven, Hadoop, Kafka, HBase, Spark, and Flume—followed by a Spark Streaming job that consumes Kafka messages and writes them into HBase.

Big Data · Flume · HBase

0 likes · 14 min read

21CTO

May 17, 2018 · Big Data

Understanding Hadoop MapReduce and YARN: Architecture, Shuffle, and Scaling

This article explains Hadoop's core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker architecture to improve scalability and resource utilization in large‑scale data processing clusters.

Distributed Computing · Hadoop · Shuffle

0 likes · 12 min read

Architects' Tech Alliance

May 14, 2018 · Big Data

Understanding Hadoop MapReduce Architecture and YARN: Components, Workflow, and Optimization

This article explains Hadoop's distributed storage and processing framework, details the MapReduce programming model, describes the classic JobTracker/TaskTracker architecture, outlines the shuffle and combine phases, and introduces YARN as a scalable replacement with its ResourceManager, ApplicationMaster, and NodeManager components.

Big Data · Hadoop · MapReduce

0 likes · 13 min read

Python Crawling & Data Mining

Apr 22, 2018 · Big Data

How to Set Up CDH 5.14 on CentOS 6.7: Complete Offline Installation Guide

This guide details the required system environment, download sources, and step‑by‑step offline file preparation for installing Cloudera's CDH 5.14 on a CentOS 6.7 server, including JDK, parcels, and MySQL connector setup.

Big Data · CDH · CentOS

0 likes · 4 min read

ITPUB

Mar 29, 2018 · Big Data

Demystifying Hadoop: MapReduce, Shuffle, and YARN Architecture

This article explains Hadoop’s core components, the MapReduce programming model, the detailed shuffle and merge processes, and how YARN replaces the classic JobTracker/TaskTracker design to improve scalability and resource utilization in large‑scale data processing clusters.

Big Data · Hadoop · MapReduce

0 likes · 15 min read

Snowball Engineer Team

Mar 23, 2018 · Big Data

Redesigning Snowball's Log Collection Architecture During Hadoop Cluster Expansion

The article details Snowball's challenges with a saturated CDH Hadoop cluster, outlines the limitations of the original Kafka‑based log pipeline, and explains how a comprehensive redesign using FlumeNG, Spillable Memory Channels, and custom HDFS sinks resolves latency, data loss, and high‑load issues while supporting future growth.

Cluster Migration · FlumeNG · Hadoop

0 likes · 6 min read

Beike Product & Technology

Mar 9, 2018 · Big Data

Design and Implementation of Transparent Compression for Hadoop Using ZFS

The article presents a comprehensive solution for reducing Hadoop cluster storage consumption by applying ZFS‑based transparent compression and data‑governance techniques, detailing the technical background, design choices, implementation steps, performance optimizations, and observed storage savings.

Big Data · Data Governance · Hadoop

0 likes · 12 min read

dbaplus Community

Mar 7, 2018 · Big Data

Taming Massive HDFS Data Growth: Monitoring, Capacity Planning & Hive Optimization

The article outlines a systematic approach for large‑scale Hadoop clusters to monitor daily data growth, identify abnormal paths, manage rapid expansion, clean unused cold data, and implement capacity forecasts, while providing concrete daily and quarterly actions, Hive‑specific strategies, and practical examples to keep storage under control.

Big Data · Data Growth · HDFS

0 likes · 17 min read

Huawei Cloud Developer Alliance

Feb 27, 2018 · Big Data

Master Vehicle IoT in 5 Minutes: Data Open & Collection Explained

This article introduces vehicle networking concepts, explains Huawei's OceanConnect solution, and details how data open and data collection are implemented using platforms like DAP, Kafka, and Hadoop to provide reliable, real‑time vehicle information for various applications.

Data Open · Hadoop · Huawei OceanConnect

0 likes · 6 min read

Architecture Digest

Feb 1, 2018 · Fundamentals

How Search Engines Work: Building Inverted Indexes

This article explains the core of search engine technology by describing what an inverted index is, how it is built using single‑pass memory and multi‑way merge methods, how indexes can be partitioned and incrementally updated, and how Hadoop can be used for large‑scale indexing.

Big Data · Hadoop · Indexing

0 likes · 10 min read

dbaplus Community

Jan 17, 2018 · Big Data

Mastering Hadoop YARN: CPU & Memory Management Strategies for Large‑Scale Clusters

This article explores Hadoop YARN’s evolution, multi‑tenant design, queue and node‑label scheduling, real‑world resource allocation challenges, and data‑driven tools that automate diagnostics and visualizations to optimize CPU and memory usage across massive clusters.

CPU · Hadoop · Node Labels

0 likes · 16 min read

StarRing Big Data Open Lab

Dec 1, 2017 · Big Data

Master Inceptor: Essential Q&A for Getting Started with This Big Data Engine

This guide answers the most common questions about Inceptor, covering its purpose, installation, command‑line interaction, table creation, partitioning, execution modes, error handling, column alteration, query planning, data migration, and CSV import settings.

Beeline · Data Migration · Hadoop

0 likes · 13 min read

ITPUB

Nov 23, 2017 · Big Data

7 Typical Big Data Projects Every Hadoop Engineer Should Know

The article outlines seven common big‑data initiatives—data integration, specialized analytics, Hadoop‑as‑a‑service, stream processing, complex event handling, ETL pipelines, and SAS replacement—explaining their goals, typical technologies such as HDFS, Hive, Spark, Storm, Kafka, and practical considerations for enterprises adopting Hadoop ecosystems.

Data Integration · Hadoop · project types

0 likes · 8 min read

Full-Stack DevOps & Kubernetes

Oct 21, 2017 · Big Data

Deploy Hadoop CDH5.4 on CentOS 6: Install HDFS, YARN, and WebHDFS

This guide walks through preparing three CentOS 6.9 nodes, configuring hostnames, time sync, password‑less SSH, disabling IPv6, installing JDK, downloading CDH 5.4, setting up core‑site and hdfs‑site XML files, formatting the NameNode, starting HDFS services, configuring YARN and MapReduce, and verifying the installations via the Web UI.

Big Data · CDH · CentOS

0 likes · 18 min read

37 Interactive Technology Team

Oct 19, 2017 · Big Data

Ambari Technical Practice for Managing Hadoop Big Data Platforms

The team adopted Apache Ambari to streamline deployment, scaling, monitoring, and upgrades of their Hadoop‑centric big‑data platform. They overcame the challenges of taking over an existing HA cluster and integrating a custom Hive 2.1 through a three‑phase rollout (test, gray‑scale, production), improving management efficiency and reducing O&M costs.

Ambari · HDP · Hadoop

0 likes · 18 min read

21CTO

Sep 25, 2017 · Big Data

How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

Big Data · Data Platform · Hadoop

0 likes · 16 min read

21CTO

Sep 5, 2017 · Big Data

Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.

Big Data · Gnuplot · Hadoop

0 likes · 10 min read

Architecture Digest

Sep 3, 2017 · Big Data

An Overview of Big Data Processing Frameworks: Batch, Stream, and Hybrid Systems

This article introduces the evolution of big‑data processing from Google’s MapReduce concept to modern open‑source frameworks, defines big data and its 3V characteristics, outlines typical processing pipelines, and compares batch, stream, and hybrid systems such as Hadoop, Storm, Samza, Spark, and Flink.

Batch Processing · Big Data · Flink

0 likes · 20 min read

Meituan Technology Team

Aug 25, 2017 · Big Data

Data Platform Integration and Multi‑Data‑Center Architecture at Meituan‑Dianping

After Meituan merged with Dianping, engineers unified two massive Hadoop ecosystems across Beijing and Shanghai by breaking the project into four phases—unify, copy, switch, fuse—standardizing versions, implementing zone‑aware transfers, cross‑realm Kerberos, and federated metadata to achieve a single, reliable multi‑data‑center platform.

Big Data · Cluster Fusion · Data Platform

0 likes · 32 min read

21CTO

Aug 21, 2017 · Big Data

Rethinking Hadoop: When to Use It and How Cloud Computing Changes the Game

This article reviews when Hadoop is appropriate, outlines its core features and limitations, explains cloud computing concepts and service models, and highlights the benefits of pre‑built Hadoop images for accelerating big‑data projects.

Big Data · Hadoop · Pre-built Images

0 likes · 13 min read

Architects' Tech Alliance

Aug 17, 2017 · Big Data

Big Data Overview: Market Demand, Core Technologies, Learning Path, and Salary Landscape

This article examines the booming demand for big data professionals, explains the 4V characteristics, lists essential open‑source tools, outlines learning routes and required skills, and presents salary data for various big‑data roles in China.

Career · Hadoop · Salary

0 likes · 8 min read

ITFLY8 Architecture Home

Jul 26, 2017 · Big Data

Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”

This article details Taobao’s multi‑layer massive data platform, covering its five‑tier architecture, the 1500‑node Hadoop “Cloud Ladder” for batch processing, the low‑latency “Galaxy” stream engine, MySQL‑based MyFOX, HBase‑based Prom storage, the glider middle‑layer, and sophisticated caching strategies that together support petabytes of data and millions of daily queries.

Big Data · Caching · HBase

0 likes · 16 min read

Architecture Digest

Jul 21, 2017 · Big Data

Step-by-Step Guide to Building a High-Availability Hadoop HDFS and YARN Cluster

This article provides a comprehensive, step-by-step tutorial for setting up a high‑availability Hadoop cluster, covering user creation, JDK installation, host configuration, SSH setup, firewall and SELinux adjustments, Zookeeper deployment, HDFS and YARN HA configuration, essential XML files, and failover testing.

Cluster Setup · HDFS · Hadoop

0 likes · 20 min read

Architects' Tech Alliance

Jul 11, 2017 · Big Data

Understanding HDFS Architecture and Its Integration with NFS and Various Storage Solutions

This article reviews the fundamental concepts of HDFS, explains its master‑slave architecture with NameNode and DataNode, describes block replication, and discusses various implementations—including native HDFS, NetApp/Lustre, GPFS/Ceph, and Isilon—as well as HDFS‑to‑NFS gateway integration.

Big Data · Distributed File System · HDFS

0 likes · 7 min read

StarRing Big Data Open Lab

Jun 16, 2017 · Big Data

How TDH Dominated the TPCx‑HS 10TB Benchmark: Strategies and Results

The article details how Transwarp and Cisco’s joint TPCx‑HS 10TB benchmark placed the TDH platform at the top of the performance ranking, explains the test setup, describes the pre‑ and post‑optimization strategies for TeraGen and TeraSort, and outlines the hardware configuration and key tuning parameters.

Big Data · Hadoop · Performance Optimization

0 likes · 10 min read

Architecture Digest

Jun 15, 2017 · Big Data

Implementing a Distributed Stepwise Queue with Zookeeper for Hadoop Profit Calculation

This article demonstrates how to use Zookeeper as a distributed stepwise queue to coordinate multiple Hadoop MapReduce jobs for purchase, sales, and other cost calculations, automatically triggering a profit computation once all tasks complete, and provides full Java code examples and deployment instructions.

Hadoop · MapReduce · Zookeeper

0 likes · 21 min read
