Tagged articles
407 articles
Big Data Technology & Architecture
Jul 10, 2021 · Big Data

Comprehensive Big Data Learning Path and Interview Knowledge Map

This extensive guide outlines a modern big‑data learning roadmap, covering essential programming languages, Linux, databases, distributed system theory, networking, offline and real‑time computation, message queues, data warehouses, algorithms, backend skills, interview preparation, and practical advice for building a personal knowledge system.

Flink · Hadoop · Learning Path
24 min read

Laravel Tech Community
Jun 25, 2021 · Big Data

Apache Kudu 1.15.0 – New Features and Improvements

Apache Kudu 1.15.0 adds experimental multi‑row transaction support (currently INSERT and INSERT_IGNORE), Raft‑based master configuration tools, table comment synchronization with Hive Metastore, per‑table size and row‑count limits configurable via flags or the kudu table set_limit tool, a customizable Kerberos principal flag, and TLS v1.3 with optional cipher‑suite selection, collectively enhancing low‑latency random access and analytical capabilities in the Hadoop ecosystem.

Apache Kudu · Big Data · Hadoop
3 min read

iQIYI Technical Product Team
Jun 11, 2021 · Big Data

Becoming an Apache Hadoop Committer: The Journey of iQIYI’s Zhu Qi and Open‑Source Insights

Zhu Qi, iQIYI's first Hadoop Committer, went from chemical‑materials graduate to leading Apache Hadoop contributor through a decade‑long record of code contributions, a deep understanding of distributed computing, and the company's open‑source culture. The article also outlines the Committer role and a three‑stage roadmap for aspiring contributors.

Committer · Hadoop · career advice
9 min read

58 Tech
May 28, 2021 · Big Data

Practical Upgrade Experience of Hadoop 3.2.1 in 58.com Data Platform: HDFS, YARN, and MR3

This article details the end‑to‑end upgrade of a 5000‑node Hadoop 2.6.0 cluster to Hadoop 3.2.1 at 58.com, covering HDFS migration, RBF and EC adoption, YARN federation and rolling upgrades, MR3 integration, extensive compatibility testing, and operational lessons learned for large‑scale big‑data platforms.

Big Data · Cluster Upgrade · HDFS
19 min read

UCloud Tech
May 21, 2021 · Big Data

How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance

This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.

Big Data · Cache · Hadoop
13 min read

Tencent Cloud Developer
May 19, 2021 · Industry Insights

How Cloud‑Native Principles Transform Big Data Infrastructure

The article analyzes how cloud‑native concepts such as DevOps, micro‑services, continuous delivery, and containerization can be applied to big‑data foundations, outlining four guiding principles—industrialized delivery, cost quantification, load‑adaptive scaling, and data‑centric design—and describing concrete Hadoop‑based architectures and Tencent Cloud solutions that lower cost while boosting performance.

Big Data · Cost Optimization · Data Infrastructure
22 min read

Architecture Digest
May 17, 2021 · Big Data

Technical Architecture Overview of Toutiao: Data Pipeline, User Modeling, Recommendation System, and Microservices

The article provides a comprehensive technical overview of Toutiao's rapid growth, detailing its massive user base, data collection and processing pipelines, user modeling, cold‑start strategies, recommendation engines, storage solutions, push notification mechanisms, and the underlying microservice and PaaS architecture.

Big Data · Hadoop · Kafka
8 min read

Big Data Technology & Architecture
Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Hadoop · MapReduce · data-warehouse
41 min read

Big Data Technology Architecture
Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Small Files
12 min read

dbaplus Community
Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop
11 min read

dbaplus Community
Mar 17, 2021 · Big Data

How We Cut PBs of Waste and Optimized HDFS with Tiered Storage and Cloud Migration

This article summarizes a three‑part technical talk covering cost governance for offline Hadoop clusters, a large‑scale data‑center migration with architecture upgrades, and a tiered storage strategy that uses EC and COS to reduce storage costs and improve performance in a cloud‑native big‑data environment.

Big Data Migration · COS · Cloud Native
10 min read

Big Data Technology Architecture
Mar 2, 2021 · Big Data

Understanding and Managing Small Files in Hadoop HDFS

This article explains what small files are in Hadoop HDFS, how they degrade NameNode memory, RPC performance, and application throughput, and provides practical strategies—including detection, configuration, and merging techniques—to mitigate their impact on storage and processing layers.

HDFS · Hadoop
12 min read

Big Data Technology & Architecture
Jan 22, 2021 · Big Data

Key New Features and Improvements in Hadoop 3.x

Hadoop 3.x upgrades the platform to JDK 1.8 and introduces a range of enhancements across common components, HDFS, YARN, and MapReduce, including erasure coding, multi‑NameNode high availability, cgroup‑based resource isolation, native map‑output collectors, and split client libraries, while also adding support for Azure and Aliyun distributed file systems.

HDFS · Hadoop · MapReduce
7 min read

Big Data Technology & Architecture
Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data · Data Platform · Hadoop
14 min read

dbaplus Community
Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big Data · Camus · Hadoop
16 min read

Big Data Technology & Architecture
Dec 27, 2020 · Big Data

Understanding and Solving the Small File Problem in Big Data Systems

This article examines the pervasive small‑file issue in big‑data environments, explains its impact on storage and processing performance, and presents a comprehensive set of solutions—including file merging, Hadoop archives, SequenceFiles, HBase, CombineFileInputFormat, and Spark/Flink strategies—to mitigate metadata overhead and improve I/O efficiency.

Flink · Hadoop · NameNode
41 min read

dbaplus Community
Dec 22, 2020 · Big Data

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

Big Data · Distcp · HDFS
16 min read

Architect
Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Hadoop · Spark · distributed computing
13 min read

Big Data Technology & Architecture
Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

Hadoop · Small Files · compression
19 min read

Practical DevOps Architecture
Nov 27, 2020 · Big Data

Step-by-Step Guide to Install and Configure a Hadoop 2.8.2 Cluster

This tutorial provides a complete walkthrough for downloading Hadoop 2.8.2, setting up a three‑node master‑slave cluster, configuring core, HDFS, MapReduce and YARN settings, creating required directories, distributing the installation, starting the services, verifying the cluster status, and finally shutting it down.

Big Data · Cluster Setup · HDFS
5 min read

DataFunTalk
Nov 26, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology

This article details the evolution of 58.com’s commercial data warehouse across three phases—1.0, 2.0, and 3.0—covering its scale, four‑layer architecture, migration from legacy Hadoop‑MapReduce pipelines to Flume/Kafka and Flink streaming, code optimizations, monitoring, and productization for real‑time business insights.

Big Data · ETL · Hadoop
9 min read

ITPUB
Nov 19, 2020 · Fundamentals

How HugePages Boost Database and Hadoop Performance on Linux

This article explains Linux HugePages, how to view and configure them, demonstrates code and Kubernetes examples, and details how larger memory pages reduce management overhead and lock memory to improve performance for memory‑intensive services like databases and Hadoop.

Hadoop · Linux · Memory Management
10 min read

Big Data Technology & Architecture
Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data · Hadoop · Spark
13 min read

DataFunSummit
Nov 15, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Using Hadoop, Flume, Kafka, Spark, and Flink

This article details the three‑stage evolution of 58.com’s commercial data warehouse, describing its massive scale, four‑layer architecture, technical challenges, migrations from MapReduce to Hive and Flink, real‑time streaming upgrades, and the resulting improvements in stability, accuracy, and timeliness.

Big Data · Data Architecture · Flink
10 min read

Ctrip Technology
Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for massive daily TB‑scale data.

Big Data · Camus · Data Governance
15 min read

MaGe Linux Operations
Sep 7, 2020 · Databases

Step-by-Step Guide to Installing an HBase Cluster on Hadoop

This article explains what HBase is, describes its Master, RegionServer, and ZooKeeper components, and provides detailed environment preparation and configuration steps—including host setup, SSH key distribution, JDK installation, HBase deployment, configuration file edits, and cluster startup—so you can run HBase on a Hadoop cluster.

HBase · Hadoop · bigdata
8 min read

Big Data Technology & Architecture
Sep 1, 2020 · Big Data

Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop by installing the Twitter‑provided hadoop‑lzo library, updating core‑site.xml, synchronizing files across nodes, creating LZO indexes, and running a WordCount MapReduce job with LZO‑compressed output.

Big Data · Configuration · Hadoop
6 min read

Big Data Technology & Architecture
Aug 16, 2020 · Big Data

Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data · CLI · HA
10 min read

Programmer DD
Jul 7, 2020 · Big Data

How to Choose a Worthwhile Technology: Depth, Ecosystem, and Evolution

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help engineers decide which big‑data or stream‑processing technology (such as Hadoop, Spark, or Flink) is worth investing time in, and provides practical tips like using Google Trends and GitHub awesome lists.

Big Data · Flink · Hadoop
12 min read

DataFunTalk
Jul 5, 2020 · Big Data

ByteDance’s Optimizations to Hadoop YARN: Enhancing Utilization, Multi‑Load Scenarios, Stability, and Multi‑Region Active‑Active

This article describes ByteDance’s four‑year series of customizations to Hadoop YARN—covering utilization improvements, multi‑load scenario optimizations, stability enhancements, and multi‑region active‑active deployment—along with practical production experiences, architectural details, and future work directions.

ByteDance · Cluster Optimization · Hadoop
12 min read

dbaplus Community
Jun 18, 2020 · Databases

How a Hybrid Data Warehouse Transformed Banking Data Services

This article details the 2015 hybrid data‑warehouse design implemented at Guangdong Huaxing Bank, explaining its real‑time, historical, and archival layers, the data‑bus concept, and how mixing in‑memory, relational, and Hadoop technologies addressed modern banking data‑volume, latency, and unstructured‑data challenges.

Banking · Big Data · Hadoop
20 min read

Big Data Technology & Architecture
Jun 18, 2020 · Big Data

CPU Resource Isolation in YARN with Linux cgroups

This article introduces Linux cgroups, explains their CPU subsystem files and parameters, demonstrates how to create and configure cgroups, and details how YARN leverages cgroups for CPU resource isolation through configuration settings and code implementations, comparing soft and hard limit approaches.

Hadoop · Linux · YARN
10 min read

Huawei Cloud Developer Alliance
Jun 5, 2020 · Big Data

Why Serverless Big Data Is the Future of Scalable Analytics

The article traces the evolution from on‑premise relational databases to self‑built Hadoop clusters, cloud‑hosted Hadoop, and finally to semi‑managed and serverless big‑data services, highlighting their advantages, challenges, and the four key pillars—security, elasticity, intelligence, and usability—that will shape the future of serverless big‑data analytics.

Data Analytics · Hadoop · cloud computing
10 min read

Big Data Technology Architecture
Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop
23 min read

Big Data Technology & Architecture
May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce
11 min read

Big Data Technology & Architecture
May 26, 2020 · Information Security

Step-by-Step Guide to Integrating Kerberos Authentication with the Cloudera Platform

This article provides a comprehensive tutorial on Kerberos fundamentals, its authentication workflow, and detailed procedures for installing, configuring, and enabling Kerberos security on a Cloudera (Hadoop) cluster running on CentOS, including code snippets, configuration files, and post‑deployment testing steps.

Authentication · Big Data · Cloudera
17 min read

Big Data Technology & Architecture
May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, and verification steps. It also gives detailed solutions to common errors such as ZooKeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop
26 min read

Big Data Technology & Architecture
May 13, 2020 · Big Data

Analysis of Hadoop HDFS Data Read and Write Process

This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with the NameNode and DataNodes, the role of FSDataInputStream and FSDataOutputStream, block location retrieval, pipeline replication, and file closure steps.

Big Data · Data Read · Data Write
8 min read

Big Data Technology Architecture
May 6, 2020 · Big Data

Ozone vs HDFS: Why Ozone Cannot Replace Hadoop’s Core Storage

In this article, senior Alibaba engineer Zheng Kai analyzes Ozone’s role in the Hadoop ecosystem, arguing that despite its usefulness, Ozone cannot solve Hadoop’s core challenges of complexity, cost, and performance, and that Hadoop must focus on storage innovation, compute‑storage separation, and cloud integration to stay relevant.

HDFS · Hadoop · Ozone
14 min read

Big Data Technology & Architecture
May 6, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring a Hadoop Cluster on Three Virtual Machines

This article provides a comprehensive, hands‑on tutorial for preparing three VMs, installing JDK and Hadoop, configuring core‑site.xml, hdfs‑site.xml, mapred‑site.xml, yarn‑site.xml, setting environment variables, distributing the package, starting HDFS and YARN, and verifying the cluster via web UI and jps commands.

Big Data · Cluster Setup · HDFS
14 min read

Architecture Digest
May 4, 2020 · Databases

HBase Overview, Architecture, Installation, and Basic Shell Operations

This article provides a comprehensive introduction to HBase, covering its origins, key characteristics, architecture components, installation steps, basic shell commands for table management, data structures, read/write processes, and high‑availability configuration within the Hadoop ecosystem.

Big Data · HBase · Hadoop
14 min read

Big Data Technology Architecture
Apr 20, 2020 · Big Data

Introduction to HDFS: Architecture, Features, Replication, Rack Awareness, and Metadata Management

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), covering its streaming data access model, key characteristics, master‑slave architecture, block storage and replication mechanisms, rack‑aware placement strategy, and how the NameNode manages metadata and checkpoints.

Distributed File System · HDFS · Hadoop
7 min read

dbaplus Community
Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · HDFS
19 min read

dbaplus Community
Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop
19 min read

Open Source Linux
Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup
13 min read

ITPUB
Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · HBase
23 min read

Yanxuan Tech Team
Feb 17, 2020 · Big Data

Why Data Warehouses Matter: From Basics to the Hadoop Ecosystem

This article explains the purpose of data as a strategic asset, compares traditional databases with data warehouses, outlines key characteristics and related concepts of data warehouses, and introduces the Hadoop ecosystem components that support large‑scale data storage and analysis.

Analytics · ETL · Hadoop
14 min read

Big Data Technology & Architecture
Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · Hadoop
11 min read

Big Data Technology & Architecture
Jan 23, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Apache Hudi · Hadoop · Incremental Processing
17 min read

Didi Tech
Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. They resolved EditLog, Fsimage, StringTable, and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode‑NameNode‑DataNode procedure and was validated with rehearsals and a custom trash‑management tool, achieving uninterrupted service along with improved stability, performance, and cost efficiency.

Big Data · Cluster Migration · HDFS
11 min read

Youzan Coder
Dec 18, 2019 · Big Data

HBase Bulkload Practice at Youzan: From MapReduce to Spark Evolution

Youzan’s evolution of HBase bulk‑load—from manual MapReduce jobs to Hive‑SQL and finally Spark—demonstrates how generating HFiles on HDFS, partitioning by region, sorting keys, and handling serialization issues enables billions of records to be loaded efficiently without disrupting production clusters.

HBase · Hadoop · NoSQL
16 min read

Architecture Digest
Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi
7 min read

dbaplus Community
Oct 28, 2019 · Big Data

Quickly Analyze Hadoop NameNode RPC with ELK and Grafana

This guide shows how to reduce excessive NameNode RPC calls caused by frequent HDFS directory listings and demonstrates a complete ELK pipeline—Filebeat, Kafka/Logstash, Elasticsearch, and Kibana—plus Grafana dashboards for real‑time monitoring of Hadoop RPC operations.

ELK · Grafana · Hadoop
9 min read

Big Data Technology Architecture
Oct 15, 2019 · Big Data

Introduction to Apache Kylin: A Fast Big Data OLAP Engine

Apache Kylin is an open‑source, Hadoop‑based OLAP engine that provides sub‑second, multi‑dimensional SQL queries on massive datasets, with features such as cube pre‑computation, real‑time analytics, and seamless BI tool integration. Its latest release, v2.6.4, adds numerous fixes and improvements.

Apache Kylin · BI Integration · Hadoop
0 likes · 4 min read
dbaplus Community
Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · Cluster Management · HBase
0 likes · 17 min read
Big Data Technology & Architecture
Sep 23, 2019 · Big Data

Applying Apache Kylin for Large‑Scale OLAP at Meituan: Architecture, Challenges, and Performance Evaluation

This article describes Meituan’s large‑scale OLAP requirements, how Apache Kylin was integrated to meet them, the architectural solutions, performance benchmarks against other engines, and future work, providing practical insights for building stable, precise, and high‑performance analytics platforms.

Apache Kylin · Big Data · Hadoop
0 likes · 20 min read
360 Tech Engineering
Sep 19, 2019 · Big Data

Understanding HDFS: Architecture, Read/Write Operations, Component Roles, Commands, and Pros & Cons

This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write mechanisms, replication strategies, component responsibilities, common command‑line tools, and the advantages and disadvantages of using Hadoop Distributed File System for large‑scale data storage.

Distributed File System · HDFS · Hadoop
0 likes · 10 min read
Big Data Technology & Architecture
Sep 11, 2019 · Big Data

Big Data Technology and Architecture: Case Studies of Taobao, Didi, and Meituan

This article reviews the evolution and key components of big data platforms at leading Chinese internet companies—Taobao, Didi, and Meituan—detailing their data sources, synchronization tools, storage layers, processing engines, and scheduling systems to provide practical guidance for building robust big data infrastructures.

Big Data · Data Platform · ETL
0 likes · 9 min read
Tencent Cloud Developer
Sep 11, 2019 · Big Data

YARN Practice and Technical Evolution at Kuaishou

Jiaoxiao Fang’s talk details Kuaishou’s YARN deployment, covering its architecture, support for offline, real‑time and ML workloads, and recent enhancements such as event‑handling stability, refined preemption, high‑throughput parallel scheduling, shuffle‑caching for small I/O, plus plans for job protection and multi‑cluster resource utilization.

Big Data · Cluster Optimization · Distributed Systems
0 likes · 16 min read
Big Data Technology & Architecture
Sep 6, 2019 · Big Data

Big Data Development Interview Guide and Skill Tree Overview

This article provides a comprehensive interview roadmap for big data developers, outlining essential Java fundamentals, JVM internals, Linux basics, distributed theory, core frameworks such as Hadoop, Spark, Flink, Kafka, Netty, HBase, Hive, and practical algorithm topics, while also offering resume and career advice for aspiring candidates.

Flink · Hadoop · Kafka
0 likes · 15 min read
Architects' Tech Alliance
Aug 24, 2019 · Big Data

Reimagining Big Data in a Post‑Hadoop World

The article analyzes the decline of Hadoop as the dominant big‑data platform, explains how cloud‑based services are replacing its complex on‑premises architecture, and outlines the lessons and future directions for enterprises navigating a post‑Hadoop landscape.

Big Data · Distributed Systems · Hadoop
0 likes · 12 min read
360 Tech Engineering
Aug 22, 2019 · Big Data

Design and Implementation of XStore: A Hadoop‑Based Sample Storage System

This article details the design, architecture, and operational experience of XStore, a Hadoop‑backed sample storage system that handles billions of APK and other binary samples, addressing functional and non‑functional requirements such as real‑time upload, large‑scale storage, high‑performance reads, and disaster recovery.

HBase · HDFS · Hadoop
0 likes · 11 min read
dbaplus Community
Aug 19, 2019 · Big Data

Automating Fault Recovery in 5,000‑Node Hadoop Clusters with Fabric & CM_API

This article explains how a large‑scale Hadoop environment can automatically detect common failures—such as swap usage, clock drift, agent crashes, role outages, and disk imbalance—and recover them using Prometheus alerts, Fabric/Paramiko remote execution, and Cloudera Manager APIs, complete with code examples and step‑by‑step commands.

Big Data Operations · CM_API · Cluster Automation
0 likes · 12 min read