Tagged articles

Hadoop

413 articles · Page 2 of 5

Aug 4, 2021 · Big Data

Essential HDFS Shell Commands for Managing Hadoop Files

This guide explains how to use the HDFS shell (preferably invoked as hdfs dfs) to list, copy, move, delete, and snapshot files in a Hadoop cluster, detailing command syntax, URI handling, generic options, and practical examples for each operation.

Big Data · HDFS · Hadoop

0 likes · 9 min read

The Dominant Programmer

Aug 2, 2021 · Big Data

How to Build a Beginner Hadoop Cluster on CentOS 7

This article introduces Apache Hadoop’s open‑source framework, explains its core and ecosystem components such as HDFS, MapReduce, ZooKeeper, HBase, Hive, Pig, Mahout, Sqoop, Flume, Chukwa, Oozie, Ambari and YARN, and outlines the steps to set up a beginner‑level Hadoop cluster on CentOS 7.

Big Data · CentOS 7 · HBase

0 likes · 11 min read

Big Data Technology & Architecture

Aug 2, 2021 · Big Data

Comprehensive Big Data Interview Question Guide for Major Tech Companies

This article compiles extensive interview questions and topics covering Hadoop, Spark, Flink, Hive, Kafka, MySQL, Redis, Java fundamentals, and algorithms, organized by companies such as Xiaomi, ByteDance, Alibaba, Shopee, Tencent, Meituan, NetEase, and Baidu, to help candidates prepare effectively for big‑data engineering roles.

Big Data · Flink · Hadoop

0 likes · 22 min read

Big Data Technology Architecture

Jul 27, 2021 · Big Data

Key Components of the Big Data Ecosystem: Hadoop, Hive, HBase, Spark, Kafka, and Elasticsearch

This article introduces the most important and still mainstream components of the big data ecosystem—including Hadoop’s storage and compute framework, Hive data warehouse, HBase NoSQL database, Spark unified engine, Kafka messaging platform, and Elasticsearch search engine—explaining their core concepts, architectures, and typical use cases.

Big Data · Elasticsearch · HBase

0 likes · 9 min read

Big Data Technology & Architecture

Jul 19, 2021 · Big Data

Understanding Hadoop: MapReduce, HDFS, YARN, and Core Big Data Concepts

This article provides a comprehensive overview of Hadoop’s core components (the MapReduce programming model, the HDFS storage architecture, and YARN resource management), discusses common challenges like data skew and small files, and offers learning resources for aspiring big‑data engineers.

Data Skew · HDFS · Hadoop

0 likes · 9 min read

UCloud Tech

Jul 13, 2021 · Big Data

Step‑by‑Step Guide to Deploy UCloud’s Free USDP Big Data Platform on CentOS

This article walks you through the complete installation and configuration of UCloud's free USDP (UCloud Data Platform) on a three‑node CentOS 7.2‑7.6 cluster, covering environment preparation, package download, repair scripts, MySQL setup, service startup, web UI activation, monitoring, and a quick Hive query example.

CentOS · Cluster Deployment · Hadoop

0 likes · 19 min read

Big Data Technology & Architecture

Jul 10, 2021 · Big Data

Comprehensive Big Data Learning Path and Interview Knowledge Map

This extensive guide outlines a modern big‑data learning roadmap, covering essential programming languages, Linux, databases, distributed system theory, networking, offline and real‑time computation, message queues, data warehouses, algorithms, backend skills, interview preparation, and practical advice for building a personal knowledge system.

Flink · Hadoop · Spark

0 likes · 24 min read

Laravel Tech Community

Jun 25, 2021 · Big Data

Apache Kudu 1.15.0 – New Features and Improvements

Apache Kudu 1.15.0 adds experimental multi‑row transaction support (currently INSERT and INSERT_IGNORE), Raft‑based master configuration tools, table comment synchronization with Hive Metastore, per‑table size and row‑count limits configurable via flags or the kudu table set_limit tool, a customizable Kerberos principal flag, and TLS v1.3 with optional cipher‑suite selection, collectively enhancing low‑latency random access and analytical capabilities in the Hadoop ecosystem.

Apache Kudu · Big Data · Hadoop

0 likes · 3 min read

Tencent Cloud Developer

Jun 21, 2021 · Industry Insights

How Hadoop YARN on Kubernetes Pods Supercharge Resource Utilization and Cut Costs

This article explains how Tencent Cloud EMR integrated Hadoop YARN with Kubernetes Pods to create a hybrid online‑offline deployment, implement elastic autoscaling and multi‑label resource allocation, and achieve several‑hundred‑percent improvements in CPU utilization while preserving cluster stability.

Big Data · Cloud Native · Hadoop

0 likes · 11 min read

iQIYI Technical Product Team

Jun 11, 2021 · Big Data

Becoming an Apache Hadoop Committer: The Journey of iQIYI’s Zhu Qi and Open‑Source Insights

Zhu Qi, the first iQIYI Hadoop Committer, illustrates how a decade‑long record of code contributions, a deep understanding of distributed computing, and a company’s open‑source culture can transform a chemical‑materials graduate into a leading Apache Hadoop contributor, while the article outlines the Committer role and a three‑stage roadmap for aspiring contributors.

Career Advice · Committer · Hadoop

0 likes · 9 min read

58 Tech

May 28, 2021 · Big Data

Practical Upgrade Experience of Hadoop 3.2.1 in 58.com Data Platform: HDFS, YARN, and MR3

This article details the end‑to‑end upgrade of a 5000‑node Hadoop 2.6.0 cluster to Hadoop 3.2.1 at 58.com, covering HDFS migration, RBF and EC adoption, YARN federation and rolling upgrades, MR3 integration, extensive compatibility testing, and operational lessons learned for large‑scale big‑data platforms.

Big Data · HDFS · Hadoop

0 likes · 19 min read

Big Data Technology & Architecture

May 23, 2021 · Big Data

Comprehensive Guide to Hive: Fundamentals, SQL Syntax, Performance Tuning, and Interview Preparation

This extensive article introduces Hive as a Hadoop‑based data warehouse, explains its architecture, core concepts, DDL/DML syntax, functions, performance‑optimization techniques, data‑skew handling, and provides a collection of common interview questions for Hive practitioners.

Data Warehouse · Hadoop · Hive

0 likes · 66 min read

UCloud Tech

May 21, 2021 · Big Data

How US3 Hadoop Adapter Cuts Big Data Storage Costs and Boosts Performance

This article explains how UCloud's US3 object storage, combined with a custom Hadoop adapter, separates compute and storage, optimizes file system operations, and leverages caching and specialized APIs to dramatically reduce storage costs and improve read/write performance for large‑scale Hadoop workloads.

Big Data · Cache · Hadoop

0 likes · 13 min read

Tencent Cloud Developer

May 19, 2021 · Industry Insights

How Cloud‑Native Principles Transform Big Data Infrastructure

The article analyzes how cloud‑native concepts such as DevOps, micro‑services, continuous delivery, and containerization can be applied to big‑data foundations, outlining four guiding principles—industrialized delivery, cost quantification, load‑adaptive scaling, and data‑centric design—and describing concrete Hadoop‑based architectures and Tencent Cloud solutions that lower cost while boosting performance.

Big Data · Data Infrastructure · Hadoop

0 likes · 22 min read

Architecture Digest

May 17, 2021 · Big Data

Technical Architecture Overview of Toutiao: Data Pipeline, User Modeling, Recommendation System, and Microservices

The article provides a comprehensive technical overview of Toutiao's rapid growth, detailing its massive user base, data collection and processing pipelines, user modeling, cold‑start strategies, recommendation engines, storage solutions, push notification mechanisms, and the underlying microservice and PaaS architecture.

Big Data · Hadoop · Kafka

0 likes · 8 min read

Practical DevOps Architecture

Apr 28, 2021 · Big Data

Step-by-Step Hadoop Environment Setup and Configuration on Three Linux Servers

This guide walks through preparing three Linux servers, installing JDK 1.8, configuring Hadoop core, HDFS, MapReduce, and YARN XML files, setting Java environment variables, formatting HDFS, and starting all services to access the Hadoop web UI.

Big Data · Configuration · HDFS

0 likes · 4 min read

Big Data Technology & Architecture

Apr 25, 2021 · Big Data

Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

The article explains the concept and advantages of data lakes, outlines the major storage and acceleration challenges they face, provides a checklist for ideal data‑lake solutions, and details how Alibaba Cloud's JindoFS addresses those challenges with object‑storage‑based, high‑performance, scalable features.

Alibaba Cloud · Data Lake · Hadoop

0 likes · 9 min read

Big Data Technology & Architecture

Apr 15, 2021 · Big Data

Hive and Hadoop Interview Questions and Answers

This article provides a comprehensive collection of interview-style questions and detailed answers covering Hive concepts, Hadoop architecture, MapReduce mechanics, HDFS operations, and performance optimization techniques for big‑data processing environments.

Data Warehouse · Hadoop · Hive

0 likes · 41 min read

Big Data Technology Architecture

Apr 8, 2021 · Big Data

Managing Small Files in Spark SQL: Causes, Impact, and Practical Solutions

This article explains the small‑file problem in Spark SQL on HDFS, its impact on NameNode memory and query performance, describes how dynamic partition inserts and shuffle settings generate many files, and presents practical solutions such as partition‑based distribution, random bucketing and adaptive query execution to control file count.

Big Data · Hadoop · Performance

0 likes · 12 min read

dbaplus Community

Mar 20, 2021 · Big Data

How a Bank Boosted Data Ingestion Speed 50% Using Sqoop Direct Mode on Hadoop

This article details how a bank transformed its retail system data pipeline from a monolithic DB2 setup to a distributed Oracle‑Hadoop architecture, evaluated five extraction tools, selected Sqoop direct mode, and implemented customizations to achieve over 50% performance gains and reliable incremental data capture.

Big Data · Direct Mode · Hadoop

0 likes · 11 min read

dbaplus Community

Mar 17, 2021 · Big Data

How We Cut PBs of Waste and Optimized HDFS with Tiered Storage and Cloud Migration

This article summarizes a three‑part technical talk covering cost governance for offline Hadoop clusters, a large‑scale data‑center migration with architecture upgrades, and a tiered storage strategy using EC and COS to reduce storage costs and improve performance in a cloud‑native big‑data environment.

Big Data Migration · COS · Cloud Native

0 likes · 10 min read

Big Data Technology Architecture

Mar 2, 2021 · Big Data

Understanding and Managing Small Files in Hadoop HDFS

This article explains what small files are in Hadoop HDFS, how they degrade NameNode memory, RPC performance, and application throughput, and provides practical strategies—including detection, configuration, and merging techniques—to mitigate their impact on storage and processing layers.

HDFS · Hadoop

0 likes · 12 min read

Liangxu Linux

Feb 22, 2021 · Information Security

Mastering Apache Ranger: Architecture, Workflow, and Batch Policy Automation

This article explains Apache Ranger's role as a centralized security framework for Hadoop, detailing its key features, architecture, policy workflow, practical administration examples, and how to automate bulk policy management with Java and REST APIs.

Access Control · Apache Ranger · Hadoop

0 likes · 11 min read

Full-Stack Internet Architecture

Jan 27, 2021 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and YARN Overview

This article provides a comprehensive overview of Hadoop, covering its origins, core components such as HDFS, MapReduce, and YARN, their architectures, data storage and processing mechanisms, fault‑tolerance features, scheduling strategies, and practical optimization techniques for large‑scale distributed computing.

Big Data · Distributed Computing · HDFS

0 likes · 33 min read

Big Data Technology & Architecture

Jan 22, 2021 · Big Data

Key New Features and Improvements in Hadoop 3.x

Hadoop 3.x raises the minimum supported Java version to JDK 1.8 and introduces a range of enhancements across common components, HDFS, YARN, and MapReduce, including erasure coding, multi‑NameNode high availability, cgroup‑based resource isolation, native map‑output collectors, and split client libraries, while also adding support for Azure and Aliyun distributed file systems.

HDFS · Hadoop · MapReduce

0 likes · 7 min read

Big Data Technology & Architecture

Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data · Data Platform · Hadoop

0 likes · 14 min read

Big Data Technology & Architecture

Jan 13, 2021 · Big Data

My Month-Long Alibaba Mama Interview Experience: Spark, Kafka, and Big Data Technical Rounds

The author recounts a month‑long, four‑round technical interview at Alibaba Mama, detailing phone, on‑site, and HR stages, with deep discussions on Spark, Kafka, Hadoop, platform design, and backend fundamentals, while sharing resource links for big‑data interview preparation.

Alibaba · Data Engineering · Hadoop

0 likes · 7 min read

Big Data Technology & Architecture

Jan 12, 2021 · Big Data

Hadoop Interview Questions and Topics – HDFS, MapReduce, YARN, and Optimization

This article compiles a comprehensive set of Hadoop interview questions covering HDFS write and read processes, architecture, fault‑tolerance, NameNode metadata management, MapReduce scheduling, combiner and partition roles, YARN scheduling strategies, and various optimization techniques for both MapReduce and HDFS.

HDFS · Hadoop · MapReduce

0 likes · 5 min read

dbaplus Community

Jan 5, 2021 · Big Data

How Ctrip Built a Scalable Unified Log Framework for Payment Data

Facing massive, heterogeneous logs from numerous payment services, Ctrip’s data team designed a unified logging framework that extends log4j2, streams logs via Kafka to HDFS using a customized Camus pipeline, partitions and stores data in ORC for efficient Hive analysis, while addressing format, storage, and performance challenges.

Big Data · Camus · Hadoop

0 likes · 16 min read

Big Data Technology & Architecture

Dec 27, 2020 · Big Data

Understanding and Solving the Small File Problem in Big Data Systems

This article examines the pervasive small‑file issue in big‑data environments, explains its impact on storage and processing performance, and presents a comprehensive set of solutions—including file merging, Hadoop archives, SequenceFiles, HBase, CombineFileInputFormat, and Spark/Flink strategies—to mitigate metadata overhead and improve I/O efficiency.

Flink · Hadoop · NameNode

0 likes · 41 min read

dbaplus Community

Dec 22, 2020 · Big Data

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

Big Data · Distcp · HDFS

0 likes · 16 min read

Architect

Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Distributed Computing · Hadoop · Spark

0 likes · 13 min read

Big Data Technology & Architecture

Dec 9, 2020 · Big Data

Handling Small Files in Hive: Configuration, Compression, and File Format Optimization

The article explains why Hive tables generate many small files on HDFS, describes the performance impact on NameNode and MapReduce, and provides detailed configuration steps and compression techniques—including input and output file merging, various Hive file formats, and partition optimization—to efficiently manage storage and resource consumption in big‑data environments.

Hadoop · Hive · Small Files

0 likes · 19 min read

Practical DevOps Architecture

Nov 27, 2020 · Big Data

Step-by-Step Guide to Install and Configure a Hadoop 2.8.2 Cluster

This tutorial provides a complete walkthrough for downloading Hadoop 2.8.2, setting up a three‑node master‑slave cluster, configuring core, HDFS, MapReduce and YARN settings, creating required directories, distributing the installation, starting the services, verifying the cluster status, and finally shutting it down.

Big Data · Cluster Setup · HDFS

0 likes · 5 min read

DataFunTalk

Nov 26, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Architecture and Technology

This article details the evolution of 58.com’s commercial data warehouse across three phases—1.0, 2.0, and 3.0—covering its scale, four‑layer architecture, migration from legacy Hadoop‑MapReduce pipelines to Flume/Kafka and Flink streaming, code optimizations, monitoring, and productization for real‑time business insights.

Big Data · ETL · Hadoop

0 likes · 9 min read

Big Data Technology Architecture

Nov 25, 2020 · Big Data

Data Lake Storage Architecture Selection and JindoFS on Alibaba Cloud

This article explains the concept and benefits of data lakes, outlines the storage and acceleration challenges they pose, presents an ideal checklist for selecting a data lake solution, and evaluates Alibaba Cloud's JindoFS against that checklist, highlighting its capabilities for big‑data and AI workloads.

Alibaba Cloud · Big Data · Data Lake

0 likes · 9 min read

Big Data Technology & Architecture

Nov 21, 2020 · Big Data

Big Data Performance Testing: Objectives, Timing, Steps, Tools, and Optimization

This article outlines the purpose, timing, procedures, tools, and optimization techniques for big data performance testing, providing detailed guidance on test planning, execution, metric collection, and analysis to ensure reliable and efficient big data system deployments.

Benchmark · Big Data · Hadoop

0 likes · 7 min read

ITPUB

Nov 19, 2020 · Fundamentals

How HugePages Boost Database and Hadoop Performance on Linux

This article explains Linux HugePages, how to view and configure them, demonstrates code and Kubernetes examples, and details how larger memory pages reduce management overhead and lock memory to improve performance for memory‑intensive services like databases and Hadoop.

Hadoop · HugePages · Linux

0 likes · 10 min read

Big Data Technology & Architecture

Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data · Hadoop · Spark

0 likes · 13 min read

DataFunSummit

Nov 15, 2020 · Big Data

Evolution of 58.com Commercial Data Warehouse: From 0‑1 to 3.0 Using Hadoop, Flume, Kafka, Spark, and Flink

This article details the three‑stage evolution of 58.com’s commercial data warehouse, describing its massive scale, four‑layer architecture, technical challenges, migrations from MapReduce to Hive and Flink, real‑time streaming upgrades, and the resulting improvements in stability, accuracy, and timeliness.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

Ctrip Technology

Sep 10, 2020 · Big Data

Design and Implementation of a Unified Log Framework for Ctrip Payment Center

The article describes the design, architecture, and operational details of a unified logging framework at Ctrip's payment center, covering log production via a Log4j2 extension, Kafka‑Camus collection, Hive/ORC storage, MapReduce parsing optimizations, and governance strategies for massive daily TB‑scale data.

Big Data · Camus · Data Governance

0 likes · 15 min read

MaGe Linux Operations

Sep 7, 2020 · Databases

Step-by-Step Guide to Installing an HBase Cluster on Hadoop

This article explains what HBase is, describes its Master, RegionServer, and ZooKeeper components, and provides detailed environment preparation and configuration steps (host setup, SSH key distribution, JDK installation, HBase deployment, configuration file edits, and cluster startup) so you can run HBase on a Hadoop cluster.

HBase · Hadoop · bigdata

0 likes · 8 min read

Big Data Technology & Architecture

Sep 1, 2020 · Big Data

Configuring Hadoop to Support LZO Compression

This guide explains how to enable LZO compression in Hadoop by installing the Twitter‑provided hadoop‑lzo library, updating core‑site.xml, synchronizing files across nodes, creating LZO indexes, and running a WordCount MapReduce job with LZO‑compressed output.

Big Data · Configuration · Hadoop

0 likes · 6 min read

Laravel Tech Community

Aug 24, 2020 · Backend Development

Apache Calcite 1.25.0 Released with Spatial Functions and SQL Interval Support

Apache Calcite 1.25.0, a dynamic data management framework for Hadoop, has been released, introducing support for spatial functions and SQL interval expressions, removing deprecated methods from the previous version, and providing updated Maven dependency coordinates for the core library.

Apache Calcite · Data Management · Hadoop

0 likes · 2 min read

Big Data Technology & Architecture

Aug 21, 2020 · Big Data

Practical Guide to Building an Advertising Project with Spark and Kudu

This article provides a step‑by‑step tutorial on deploying a Spark‑based advertising data pipeline using Kudu, covering Hadoop setup, HDFS data loading, Spark application refactoring, Maven packaging, YARN execution, and crontab scheduling for daily automated runs.

Big Data · Hadoop · Kudu

0 likes · 11 min read

HaoDF Tech Team

Aug 19, 2020 · Big Data

Practical Guide to Apache Sentry and Kerberos Integration for Hadoop Access Control

This article explains the principles, architecture, features, and step‑by‑step deployment of Apache Sentry with Kerberos to provide role‑based access control across Hadoop components such as Hive, Impala, and HDFS, including command‑line examples and visual diagrams.

Access Control · Apache Sentry · Hadoop

0 likes · 13 min read

Big Data Technology & Architecture

Aug 16, 2020 · Big Data

Comprehensive Overview of HDFS: Architecture, Advantages, Limitations, Commands, and Advanced Features

This article provides a detailed introduction to HDFS, covering its application scenarios, core architecture, fault‑tolerance benefits, drawbacks such as high latency and small‑file inefficiency, essential shell and API commands, cluster management procedures, and newer Hadoop 2.0 features like HA, Federation, snapshots, ACLs, and heterogeneous storage.

Big Data · CLI · HA

0 likes · 10 min read

Big Data Technology & Architecture

Aug 13, 2020 · Big Data

Configuring Kerberos‑Enabled HDFS Access with Maven in a Hadoop Cluster

This guide walks through setting up a Maven project, adding Hadoop dependencies, configuring Kerberos (krb5.conf and keytab), loading core‑site.xml, and providing Java utility classes to initialize the HDFS client and list files in an HA‑enabled Hadoop cluster.

Big Data · HDFS · Hadoop

0 likes · 5 min read

Big Data Technology & Architecture

Aug 12, 2020 · Big Data

Real‑time User Behavior Collection Using Flume, Kafka, and Spark Streaming on Hadoop

This guide explains how to continuously collect web‑service user behavior logs, route them through Flume agents to Kafka, and finally ingest them with Spark Streaming into HDFS, covering environment preparation, configuration files, deployment steps, and verification procedures.

Big Data · Flume · Hadoop

0 likes · 9 min read

Youzan Coder

Jul 29, 2020 · Big Data

How We Migrated a 200‑Node Hadoop Cluster Across Data Centers: Lessons and Strategies

This article presents a comprehensive case study of migrating a 200‑plus node Hadoop offline platform across data centers, covering background, architecture, solution evaluation, detailed implementation steps, consistency checks, operational safeguards, encountered issues, and future recommendations.

Big Data · DP Platform · Data Consistency

0 likes · 21 min read

Big Data Technology & Architecture

Jul 28, 2020 · Big Data

Enabling CGroup in Hadoop Yarn NodeManager to Limit Container CPU Resources

This article explains how to enable Linux CGroup support in the Hadoop YARN NodeManager to limit container CPU usage, detailing required configuration properties, hierarchy setup, CPU limit parameters, and a critical kernel version caveat.

Big Data · CPU · Hadoop

0 likes · 7 min read

Big Data Technology & Architecture

Jul 27, 2020 · Big Data

How to View Hadoop/YARN Application Logs via History Server and Yarn Commands

This guide explains how to retrieve Hadoop/YARN application logs using the History Server UI, YARN command‑line tools, and direct HDFS log access, including commands for listing applications, fetching specific logs, and locating the remote log directory.

Big Data · CLI · HDFS

0 likes · 4 min read

Big Data Technology & Architecture

Jul 10, 2020 · Big Data

Understanding Namenode Metadata Persistence: FsImage, EditLog, and SecondaryNamenode

This article explains how Hadoop's Namenode persists metadata using FsImage and EditLog, describes the checkpoint process during startup, and details the role of SecondaryNamenode in merging these files for efficient recovery.

EditLog · FsImage · Hadoop

0 likes · 4 min read

Programmer DD

Jul 7, 2020 · Big Data

How to Choose a Worthwhile Technology: Depth, Ecosystem, and Evolution

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help engineers decide which big‑data or stream‑processing technology (such as Hadoop, Spark, or Flink) is worth investing time in, and provides practical tips like using Google Trends and GitHub awesome lists.

Big Data · Flink · Hadoop

0 likes · 12 min read

DataFunTalk

Jul 5, 2020 · Big Data

ByteDance’s Optimizations to Hadoop YARN: Enhancing Utilization, Multi‑Load Scenarios, Stability, and Multi‑Region Active‑Active

This article describes ByteDance’s four‑year series of customizations to Hadoop YARN—covering utilization improvements, multi‑load scenario optimizations, stability enhancements, and multi‑region active‑active deployment—along with practical production experiences, architectural details, and future work directions.

ByteDance · Cluster Optimization · Hadoop

0 likes · 12 min read

dbaplus Community

Jun 18, 2020 · Databases

How a Hybrid Data Warehouse Transformed Banking Data Services

This article details the 2015 hybrid data‑warehouse design implemented at Guangdong Huaxing Bank, explaining its real‑time, historical, and archival layers, the data‑bus concept, and how mixing in‑memory, relational, and Hadoop technologies addressed modern banking data‑volume, latency, and unstructured‑data challenges.

Big Data · Data Warehouse · Hadoop

0 likes · 20 min read

Big Data Technology & Architecture

Jun 18, 2020 · Big Data

CPU Resource Isolation in YARN with Linux cgroups

This article introduces Linux cgroups, explains their CPU subsystem files and parameters, demonstrates how to create and configure cgroups, and details how YARN leverages cgroups for CPU resource isolation through configuration settings and code implementations, comparing soft and hard limit approaches.

Hadoop · Linux · YARN

0 likes · 10 min read

Huawei Cloud Developer Alliance

Jun 5, 2020 · Big Data

Why Serverless Big Data Is the Future of Scalable Analytics

The article traces the evolution from on‑premise relational databases to self‑built Hadoop clusters, cloud‑hosted Hadoop, and finally to semi‑managed and serverless big‑data services, highlighting their advantages, challenges, and the four key pillars—security, elasticity, intelligence, and usability—that will shape the future of serverless big‑data analytics.

Cloud Computing · Hadoop · data analytics

0 likes · 10 min read

Big Data Technology Architecture

Jun 4, 2020 · Big Data

58.com Big Data Offline Computing Platform: Architecture, Scaling, Optimization, and Cross‑Data‑Center Migration

This article presents a comprehensive case study of 58.com’s massive Hadoop‑based offline computing platform, detailing its architecture, scaling challenges, performance‑tuning measures, YARN and SparkSQL upgrades, and the systematic cross‑data‑center migration of thousands of nodes and petabytes of data.

Big Data · Data Migration · Hadoop

0 likes · 23 min read

Big Data Technology Architecture

May 29, 2020 · Big Data

An Overview of Apache Avro: Schema, Serialization Formats, Container Files, and RPC Usage

Apache Avro is a high‑performance binary data serialization system originating from Hadoop that uses JSON‑defined schemas to enable compact storage, efficient network transfer, container file formats for MapReduce, and RPC communication without requiring code generation or explicit field numbers.

Avro · Data Serialization · Hadoop

0 likes · 9 min read

Big Data Technology & Architecture

May 28, 2020 · Big Data

Hadoop System Bottleneck Detection and MapReduce Optimization Guide

This article provides a comprehensive guide on detecting Hadoop system bottlenecks, analyzing resource constraints, and applying practical MapReduce performance tuning techniques—including baseline creation, counter analysis, combiner usage, compression, and proper Writable types—to achieve optimal big‑data processing efficiency.

Big Data · Hadoop · MapReduce

0 likes · 11 min read

Big Data Technology & Architecture

May 26, 2020 · Information Security

Step-by-Step Guide to Integrating Kerberos Authentication with the Cloudera Platform

This article provides a comprehensive tutorial on Kerberos fundamentals, its authentication workflow, and detailed procedures for installing, configuring, and enabling Kerberos security on a Cloudera (Hadoop) cluster running on CentOS, including code snippets, configuration files, and post‑deployment testing steps.

Big Data · Cloudera · Hadoop

0 likes · 17 min read

Big Data Technology Architecture

May 24, 2020 · Big Data

HBase Region State Machine and Transition Details

The article explains how HBase tracks each region's lifecycle states in hbase:meta and ZooKeeper, lists all possible states with their color codes, and describes the master‑region server interactions for opening, closing, splitting, and merging regions.

Database · HBase · Hadoop

0 likes · 7 min read

Big Data Technology Architecture

May 21, 2020 · Big Data

Near Real-Time Ingestion, Analysis, Incremental Pipelines, and Data Distribution with Apache Hudi

The article explains how Apache Hudi enables near‑real‑time data ingestion from various sources, supports low‑latency analytics, provides incremental processing pipelines, and simplifies data distribution on Hadoop, improving efficiency and reducing operational complexity.

Apache Hudi · Big Data · Hadoop

0 likes · 6 min read

Big Data Technology & Architecture

May 16, 2020 · Big Data

Apache Kylin Single‑Node Installation Guide and Troubleshooting

This article provides a comprehensive step‑by‑step guide for installing Apache Kylin on a single machine, covering required software versions, environment variable configuration, Spark dependency handling, main Kylin properties, verification steps, and detailed solutions to common errors such as Zookeeper host issues, HTTP 404, Jackson conflicts, MapReduce jobhistory problems, missing Spark classes, HiveConf errors, and YARN shuffle service configuration.

Apache Kylin · Big Data · Hadoop

0 likes · 26 min read

Big Data Technology & Architecture

May 13, 2020 · Big Data

Analysis of Hadoop HDFS Data Read and Write Process

This article explains the underlying principles of Hadoop HDFS read and write operations, detailing how the client interacts with NameNode and DataNodes, the role of FsDataInputStream and FsDataOutputStream, block location retrieval, pipeline replication, and file closure steps.

Big Data · Data Read · Data Write

0 likes · 8 min read

Big Data Technology Architecture

May 10, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi provides an incremental processing framework that enables efficient, low‑latency data ingestion, storage, and query capabilities on Hadoop, detailing its architecture, storage layout, compaction, write and read paths, and support for real‑time and batch analytics.

Hadoop · Hudi · data ingestion

0 likes · 15 min read

Big Data Technology Architecture

May 6, 2020 · Big Data

Ozone vs HDFS: Why Ozone Cannot Replace Hadoop’s Core Storage

In this article, senior Alibaba engineer Zheng Kai analyzes Ozone’s role in the Hadoop ecosystem, arguing that despite its usefulness, Ozone cannot solve Hadoop’s core challenges of complexity, cost, and performance, and that Hadoop must focus on storage innovation, compute‑storage separation, and cloud integration to stay relevant.

Cloud · HDFS · Hadoop

0 likes · 14 min read

Big Data Technology & Architecture

May 6, 2020 · Big Data

Step-by-Step Guide to Installing and Configuring a Hadoop Cluster on Three Virtual Machines

This article provides a comprehensive, hands‑on tutorial for preparing three VMs, installing JDK and Hadoop, configuring core‑site.xml, hdfs‑site.xml, mapred‑site.xml, yarn‑site.xml, setting environment variables, distributing the package, starting HDFS and YARN, and verifying the cluster via web UI and jps commands.

Big Data · Cluster Setup · HDFS

0 likes · 14 min read

Architecture Digest

May 4, 2020 · Databases

HBase Overview, Architecture, Installation, and Basic Shell Operations

This article provides a comprehensive introduction to HBase, covering its origins, key characteristics, architecture components, installation steps, basic shell commands for table management, data structures, read/write processes, and high‑availability configuration within the Hadoop ecosystem.

Big Data · HBase · Hadoop

0 likes · 14 min read

21CTO

Apr 30, 2020 · Big Data

How to Choose a Worthwhile Technology: A Big Data Engineer’s 3‑Step Framework

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help professionals evaluate whether a technology is worth investing time in, illustrated with real‑world examples from Hadoop, Spark, and Flink.

Big Data · Career Advice · Flink

0 likes · 10 min read

Big Data Technology Architecture

Apr 20, 2020 · Big Data

Introduction to HDFS: Architecture, Features, Replication, Rack Awareness, and Metadata Management

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), covering its streaming data access model, key characteristics, master‑slave architecture, block storage and replication mechanisms, rack‑aware placement strategy, and how the NameNode manages metadata and checkpoints.

Distributed File System · HDFS · Hadoop

0 likes · 7 min read

dbaplus Community

Apr 15, 2020 · Big Data

How Ctrip Scaled Hadoop Across Data Centers: Architecture and Lessons

This article details Ctrip's Hadoop evolution, the challenges of expanding across multiple data centers, the evaluation of multi‑cluster versus single‑cluster designs, and the concrete architectural changes, migration tools, bandwidth monitoring, and future plans that enabled a stable cross‑datacenter big‑data platform.

Big Data · Cross-DataCenter · Distributed storage

0 likes · 19 min read

Big Data Technology & Architecture

Apr 15, 2020 · Big Data

Understanding HDFS SecondaryNameNode and the Checkpoint Process

This article explains the role of HDFS SecondaryNameNode, the structure of fsimage and edits files, how checkpointing works—including configuration parameters and steps—and how the process changes when NameNode high availability is enabled.

Big Data · Checkpoint · Filesystem

0 likes · 6 min read

DataFunTalk

Apr 9, 2020 · Big Data

Scaling and Optimizing 58.com’s Hadoop‑Based Offline Computing Platform: Architecture, Challenges, and Solutions

This article details how 58.com built a massive Hadoop‑based offline computing platform with over 4,000 servers and hundreds of petabytes of storage, addressing scaling, stability, GC, YARN scheduling, SparkSQL migration, storage operations, and a large‑scale cross‑datacenter migration.

Big Data · Data Migration · Hadoop

0 likes · 24 min read

Big Data Technology & Architecture

Apr 9, 2020 · Big Data

Optimizing Hadoop and Hive Jobs with Filters, Projections, and Predicate Pushdown

The article explains how applying filters, projections, and predicate pushdown in Hadoop and Hive reduces data volume, speeds up MapReduce jobs, and improves performance, while also covering join limitations and providing a Java Mapper example for practical implementation.

Big Data · Hadoop · Hive

0 likes · 4 min read

Big Data Technology Architecture

Mar 28, 2020 · Big Data

Apache Kylin: From Extreme OLAP Engine to an Analytical Data Warehouse for Big Data

The article chronicles Apache Kylin's evolution from an Apache incubator OLAP engine to a comprehensive analytical data warehouse, highlighting its five‑year growth, extensive enterprise adoption, core data‑warehouse features, and the community’s rebranding to better reflect its big‑data capabilities.

Analytics · Apache Kylin · Data Warehouse

0 likes · 7 min read

dbaplus Community

Mar 23, 2020 · Big Data

How to Detect and Resolve Data Skew in Spark and Hadoop

This article explains what data skew is in distributed big‑data systems like Spark and Hadoop, why it hurts performance, how to spot it using the Web UI or key statistics, and presents eight practical mitigation techniques ranging from filtering and shuffle parallelism to custom partitioners and broadcast joins.

Broadcast Join · Data Skew · Hadoop

0 likes · 19 min read

Big Data Technology Architecture

Mar 16, 2020 · Big Data

Understanding Apache Hudi: Concepts, Architecture, Usage, and Best Practices

This article introduces Apache Hudi, explains its architecture and storage models, describes how it enables upserts and incremental queries on Hadoop, provides step‑by‑step guidance for integrating Hudi with Apache Spark, and outlines best practices and comparisons with Apache Kudu.

Apache Hudi · Hadoop · Spark

0 likes · 10 min read

Open Source Linux

Mar 12, 2020 · Big Data

Step-by-Step Guide to Build a Hadoop 2.9.2 Cluster on CentOS 7.5

This tutorial walks you through setting up a three‑node Hadoop 2.9.2 cluster on CentOS 7.5, covering environment preparation, password‑less SSH, user creation, JDK installation, Hadoop extraction, configuration file edits, directory setup, ownership changes, service startup, and verification via web UIs.

Big Data · CentOS · Cluster Setup

0 likes · 13 min read

ITPUB

Mar 2, 2020 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article explains ZooKeeper’s architecture, key concepts such as roles, sessions, ZNodes, versioning, ACLs, and watchers, and demonstrates how it powers essential big‑data components like Hadoop’s ResourceManager and HBase’s master election, naming service, and distributed locking.

Big Data · Distributed Coordination · Distributed Lock

0 likes · 23 min read

Big Data Technology & Architecture

Feb 24, 2020 · Big Data

Apache Ozone: Architecture, Design Principles, and Deployment Guide

This article introduces Apache Ozone, a scalable distributed object storage system for Hadoop, covering its background, core components, design principles, architecture, deployment steps, configuration examples, and basic command‑line operations for managing volumes, buckets, and keys.

Big Data · CLI · Deployment

0 likes · 18 min read

Yanxuan Tech Team

Feb 17, 2020 · Big Data

Why Data Warehouses Matter: From Basics to the Hadoop Ecosystem

This article explains the purpose of data as a strategic asset, compares traditional databases with data warehouses, outlines key characteristics and related concepts of data warehouses, and introduces the Hadoop ecosystem components that support large‑scale data storage and analysis.

Analytics · ETL · Hadoop

0 likes · 14 min read

Big Data Technology & Architecture

Feb 13, 2020 · Big Data

Optimizing Hadoop MapReduce Jobs for eBay CAL System to Reduce Execution Time and Resource Usage

This article describes how eBay's Central Application Logging (CAL) system generates massive daily logs, the challenges of Hadoop MapReduce job performance and resource consumption, and the step‑by‑step optimizations—reducing GC time, mitigating data skew, and improving algorithms—that cut execution time by over 60%, lowered cluster resource usage, and raised job success rates to nearly 100%.

Big Data · Data Skew · GC

0 likes · 11 min read

Big Data Technology & Architecture

Feb 9, 2020 · Big Data

Understanding Hadoop's Circular Buffer in the Shuffle Phase

This article explains how Hadoop's MapReduce shuffle uses a circular buffer to store serialized key/value pairs and their metadata, detailing its structure, initialization, write path, spill logic, and the background thread that sorts and writes data to disk.

Big Data · Hadoop · MapReduce

0 likes · 24 min read

Big Data Technology & Architecture

Jan 23, 2020 · Big Data

Understanding Apache Hudi: Incremental Processing and Low‑Latency Data Management on Hadoop

This article explains how Apache Hudi enables efficient, low‑latency incremental data ingestion and processing on Hadoop by providing a unified service layer, describing its motivation, architecture, storage components, write and read paths, compaction, fault recovery, and incremental query capabilities.

Apache Hudi · Hadoop · Incremental Processing

0 likes · 17 min read

Big Data Technology & Architecture

Jan 13, 2020 · Big Data

130 Essential Big Data and Distributed Systems Interview Questions

This article compiles 130 interview questions spanning big data technologies, distributed systems, and core computer science concepts to help candidates prepare for technical interviews, offering a comprehensive resource for self‑study and review.

Flink · Hadoop · Interview Questions

0 likes · 12 min read

Didi Tech

Jan 5, 2020 · Big Data

Rolling Upgrade of HDFS from 2.7 to 3.2: Experience, Issues and Solutions

The team performed a rolling upgrade of HDFS from 2.7 to 3.2 on large clusters. They resolved EditLog, Fsimage, StringTable, and authentication incompatibilities by omitting EC data, using fallback images, rolling back commits, and first upgrading to the latest 2.x release. The upgrade followed a staged JournalNode‑NameNode‑DataNode procedure and was validated with rehearsals and a custom trash‑management tool; it achieved uninterrupted service along with improved stability, performance, and cost efficiency.

Big Data · Cluster Migration · HDFS

0 likes · 11 min read

Huawei Cloud Developer Alliance

Dec 27, 2019 · Big Data

How to Compile and Install CDH Hadoop on Kunpeng Cloud: Step‑by‑Step Guide

This article walks through the full‑stack process of migrating and compiling the CDH Hadoop distribution on Kunpeng cloud servers, covering environment setup, dependency installation, source code adjustments, common build errors, and final packaging for a production‑ready big‑data platform.

Big Data · CDH · Compilation

0 likes · 14 min read

Youzan Coder

Dec 18, 2019 · Big Data

HBase Bulkload Practice at Youzan: From MapReduce to Spark Evolution

Youzan’s evolution of HBase bulk‑load—from manual MapReduce jobs to Hive‑SQL and finally Spark—demonstrates how generating HFiles on HDFS, partitioning by region, sorting keys, and handling serialization issues enables billions of records to be loaded efficiently without disrupting production clusters.

HBase · Hadoop · NoSQL

0 likes · 16 min read

Big Data Technology & Architecture

Nov 17, 2019 · Big Data

Understanding Data Skew in Big Data Processing and Mitigation Strategies

Data skew, a common challenge in large-scale data processing where uneven key distribution leads to performance bottlenecks, is explored with examples from Hadoop, Spark, and Flink, alongside practical mitigation techniques such as hotspot key redesign, map‑side joins, and tuning framework parameters.

Flink · Hadoop · Spark

0 likes · 6 min read

Architecture Digest

Nov 5, 2019 · Big Data

Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms

This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.

Big Data · Data Architecture · Didi

0 likes · 7 min read

DataFunTalk

Oct 30, 2019 · Big Data

Choosing an IoT Big Data Platform: Hadoop vs TDengine and Other Time‑Series Databases

This article examines the challenges of selecting an IoT big‑data platform, compares traditional real‑time databases, Hadoop‑based solutions, and modern time‑series databases such as TDengine, InfluxDB and ClickHouse, and provides practical case studies and criteria for making an informed choice.

Hadoop · IoT · TDengine

0 likes · 18 min read

dbaplus Community

Oct 28, 2019 · Big Data

Quickly Analyze Hadoop NameNode RPC with ELK and Grafana

This guide shows how to reduce excessive NameNode RPC calls caused by frequent HDFS directory listings and demonstrates a complete ELK pipeline—Filebeat, Kafka/Logstash, Elasticsearch, and Kibana—plus Grafana dashboards for real‑time monitoring of Hadoop RPC operations.

ELK · Grafana · Hadoop

0 likes · 9 min read

Big Data Technology Architecture

Oct 15, 2019 · Big Data

Introduction to Apache Kylin: A Fast Big Data OLAP Engine

Apache Kylin is an open‑source, Hadoop‑based OLAP engine that provides sub‑second, multi‑dimensional SQL queries on massive datasets, with features such as cube pre‑computation, real‑time analytics, and seamless BI tool integration, and its latest v2.6.4 release adds numerous fixes and improvements.

Apache Kylin · BI Integration · Hadoop

0 likes · 4 min read

dbaplus Community

Oct 8, 2019 · Big Data

How to Master Large-Scale Cluster Management: 10 Real-World Troubleshooting Cases

This article shares a senior data‑platform engineer's hands‑on experience managing dozens of thousand‑node clusters, detailing nine common cluster problems and step‑by‑step solutions—including performance tuning, RPC fixes, HDFS cleanup, Hive metadata repair, Spark shuffle optimization, HBase region recovery, and Kafka bottleneck mitigation.

Big Data · HBase · Hadoop

0 likes · 17 min read

Big Data Technology & Architecture

Sep 23, 2019 · Big Data

Applying Apache Kylin for Large‑Scale OLAP at Meituan: Architecture, Challenges, and Performance Evaluation

This article describes Meituan’s large‑scale OLAP requirements, how Apache Kylin was integrated to meet them, the architectural solutions, performance benchmarks against other engines, and future work, providing practical insights for building stable, precise, and high‑performance analytics platforms.

Apache Kylin · Big Data · Data Warehouse

0 likes · 20 min read

360 Tech Engineering

Sep 19, 2019 · Big Data

Understanding HDFS: Architecture, Read/Write Operations, Component Roles, Commands, and Pros & Cons

This article provides a comprehensive overview of HDFS, covering its purpose, architecture, read/write mechanisms, replication strategies, component responsibilities, common command‑line tools, and the advantages and disadvantages of using Hadoop Distributed File System for large‑scale data storage.

Distributed File System · HDFS · Hadoop

0 likes · 10 min read

Big Data Technology & Architecture

Sep 11, 2019 · Big Data

Big Data Technology and Architecture: Case Studies of Taobao, Didi, and Meituan

This article reviews the evolution and key components of big data platforms at leading Chinese internet companies—Taobao, Didi, and Meituan—detailing their data sources, synchronization tools, storage layers, processing engines, and scheduling systems to provide practical guidance for building robust big data infrastructures.

Big Data · Data Platform · ETL

0 likes · 9 min read

Tencent Cloud Developer

Sep 11, 2019 · Big Data

YARN Practice and Technical Evolution at Kuaishou

Jiaoxiao Fang’s talk details Kuaishou’s YARN deployment, covering its architecture, support for offline, real‑time and ML workloads, and recent enhancements such as event‑handling stability, refined preemption, high‑throughput parallel scheduling, shuffle‑caching for small I/O, plus plans for job protection and multi‑cluster resource utilization.

Big Data · Cluster Optimization · Hadoop

0 likes · 16 min read
