Tagged articles
407 articles
MaGe Linux Operations
May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · MapReduce · Spark
0 likes · 9 min read
dbaplus Community
May 16, 2017 · Big Data

Master HDFS: Theory, Shell Commands, and Java API Hands‑On Guide

This comprehensive tutorial explains HDFS fundamentals, its metadata management and advantages, then walks you through setting up a Hadoop environment, executing core shell commands, and using the Java API with complete code examples, enabling you to confidently operate HDFS in practice.

Distributed File System · Hadoop · Java API
0 likes · 15 min read
ITFLY8 Architecture Home
May 10, 2017 · Big Data

How Hadoop Implements Distributed File Systems: From GFS Theory to Practice

This article explains the fundamentals of distributed file systems by linking Google’s GFS, MapReduce, and BigTable concepts to Hadoop’s open‑source implementation, covering terminology, architecture, server roles, data distribution, RPC protocols, file operations, fault recovery, consistency, load balancing, and garbage collection.

GFS · HDFS · Hadoop
0 likes · 34 min read
MaGe Linux Operations
May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · MapReduce
0 likes · 13 min read
Efficient Ops
May 2, 2017 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article introduces ZooKeeper’s fundamental architecture, explains its key concepts such as cluster roles, sessions, ZNodes, watches, and ACLs, and then details how it powers essential distributed coordination tasks—including configuration management, naming services, master election, and distributed locks—in large‑scale Hadoop and HBase ecosystems.

Distributed Coordination · Distributed Locks · HBase
0 likes · 25 min read
Architecture Digest
Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop
0 likes · 11 min read
Meituan Technology Team
Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation
0 likes · 22 min read
Tongcheng Travel Technology Center
Apr 14, 2017 · Information Security

Implementing a Lightweight User Authentication Mechanism for Hadoop at Tongcheng Travel

This article describes the design, implementation, and deployment of a custom Hadoop security solution that introduces username‑password authentication via RPC, integrates a new protobuf protocol, modifies NameNode behavior, and provides rollout tools to secure a large‑scale shared Hadoop cluster without service interruption.

Authentication · Hadoop · Kerberos
0 likes · 9 min read
Java High-Performance Architecture
Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Hadoop · MapReduce · Tutorial
0 likes · 3 min read
Meituan Technology Team
Mar 17, 2017 · Big Data

Optimizing Hadoop NameNode Restart in HA with QJM

By applying a series of JIRA patches and configuration tweaks (shrinking the fsLock scope, increasing checkpoint transaction thresholds, off‑loading quota calculations, simplifying BlockReport handling, and asynchronously processing mis‑replicated blocks), the Hadoop HA NameNode restart time in a 540 MB metadata cluster drops from roughly 4000 seconds to about 2000 seconds, cutting total downtime to around 35 minutes and greatly improving cluster availability.

HA · HDFS · Hadoop
0 likes · 18 min read
Nightwalker Tech
Feb 27, 2017 · Big Data

Community Discussion on Learning Paths, Tools, and Applications in Big Data

A diverse group of practitioners share recommendations for books, technologies, real‑world use cases, and practical challenges when learning and applying big‑data processing, covering Hadoop, Spark, data visualization, ETL, and the relationship between data, algorithms, and business value.

Big Data · Hadoop · data analysis
0 likes · 16 min read
Architecture Digest
Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL
0 likes · 5 min read
Efficient Ops
Feb 9, 2017 · Big Data

Mastering HDFS Disk Balancer: Optimize DataNode Storage in Hadoop 3

This article explains the new HDFS disk balancer feature introduced in Hadoop 3, covering its purpose, supported volume‑selection policies, step‑by‑step usage, planning and execution commands, and how it helps maintain balanced storage across DataNode disks.

Disk Balancer · HDFS · Hadoop
0 likes · 8 min read
StarRing Big Data Open Lab
Jan 26, 2017 · Information Security

Why Hadoop Clusters Need Strong Security and How Kerberos Protects Them

This article explains the security risks facing Hadoop clusters, outlines common attack methods, introduces Kerberos authentication, and describes Transwarp Data Hub's multi‑layer security architecture—including Guardian, KRB5LDAP, and authorization controls—to help administrators secure their big‑data environments.

Hadoop · Kerberos · TDH
0 likes · 11 min read
Huawei Cloud Developer Alliance
Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · HDFS · Hadoop
0 likes · 18 min read
ITFLY8 Architecture Home
Dec 13, 2016 · Big Data

Umeng’s Mobile Big Data Platform: Architecture, Challenges & Insights

The article details Umeng’s mobile big‑data platform architecture, describing its Lambda‑style hybrid design, data ingestion pipeline with dual Kafka clusters, offline and real‑time processing using Hadoop, Spark, Storm, and storage layers such as HDFS, HBase, MongoDB and Elasticsearch, while also discussing challenges in data collection, cleaning, computation, security, and value‑added services.

Data Architecture · Hadoop · Kafka
0 likes · 13 min read
Hulu Beijing
Nov 29, 2016 · Big Data

How Hulu’s Segmentation System Powers Big Data Marketing at Scale

At the 2016 WOT Big Data Technology Summit, Hulu’s senior R&D manager Zhao Kunliang presented the company’s Segmentation system, detailing its Hadoop‑based architecture, Spark and Spark Streaming processing, the custom Nesto query engine, and the challenges and innovations involved in supporting large‑scale marketing and advertising analytics.

Hadoop · Nesto · Segmentation system
0 likes · 5 min read
Weidian Tech Team
Nov 28, 2016 · Big Data

How We Built the Mars Big Data Platform to Boost Development Efficiency

The article explains why Weidian needed a new big data development platform, outlines the functional features of the Mars system, describes its architecture, scheduling mechanisms, task execution flow, and discusses remaining challenges and future enhancements.

Distributed Systems · Hadoop · platform architecture
0 likes · 11 min read
StarRing Big Data Open Lab
Nov 22, 2016 · Big Data

Boost Hadoop SQL Performance: Reduce I/O, Network, and CPU Overhead

This article explains how to quickly locate SQL performance bottlenecks on Hadoop by understanding hardware metrics, then presents four practical optimization strategies (cutting data access, shrinking result sets, minimizing interactions, and lowering CPU load) that rely on filters, selective columns, batch operations, and stored procedures.

Hadoop
0 likes · 11 min read
Architecture Digest
Nov 16, 2016 · Big Data

A Decade of Hadoop: History, Architecture, Ecosystem, and Future Outlook

This article chronicles Hadoop’s ten‑year evolution from its early HDFS and MapReduce roots to a mature big‑data platform, detailing its historical milestones, architectural layers, ecosystem components, industry adoption, and future trends in storage, processing, security, and cloud integration.

Ecosystem · Hadoop · distributed computing
0 likes · 36 min read
StarRing Big Data Open Lab
Nov 14, 2016 · Operations

Master Real-Time Hadoop Alerts with Transwarp Manager

Deploying the Transwarp Manager alert system within Hadoop clusters enables operators to monitor resource shortages, failures, and health issues in real time, offering alert browsing, configurable thresholds, and instant email or script notifications to quickly identify and resolve problems before they impact services.

Alert Monitoring · Hadoop · Operations
0 likes · 9 min read
StarRing Big Data Open Lab
Nov 11, 2016 · Big Data

Why SQL Still Rules Big Data—and How NoSQL & NewSQL Fit In

The article explores the evolution of data processing from Hadoop and Spark to modern SQL, NoSQL, and NewSQL solutions, comparing their architectures, performance trade‑offs, and use‑cases, while illustrating concepts with examples like MapReduce, Hive, Impala, and streaming platforms such as Storm.

Big Data · Hadoop · NewSQL
0 likes · 14 min read
Architecture Digest
Nov 6, 2016 · Big Data

Evolution of Taobao’s Big Data Platform: From RAC to MaxCompute

The article chronicles Taobao’s 13‑year evolution of its big data platform, detailing three phases—from a single‑node Oracle setup and the Tianwang scheduler, through a Hadoop‑based “Cloud Ladder 1” architecture with real‑time analytics, to the current MaxCompute/ODPS era with cross‑region projects and advanced data services.

Big Data · Data Platform · Hadoop
0 likes · 11 min read
ITFLY8 Architecture Home
Oct 27, 2016 · Big Data

Inside Taobao’s Massive Data Architecture: How 1.5 PB Daily Is Processed and Served

The article explains Taobao’s five‑layer data product architecture—covering data sources, compute, storage, query, and product layers—and describes how massive volumes of data are ingested, processed in batch and streaming, stored in MySQL and HBase clusters, and served efficiently through a unified middle‑layer and sophisticated caching mechanisms.

Big Data · Distributed Systems · HBase
0 likes · 15 min read
StarRing Big Data Open Lab
Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Hadoop · Real-time analytics
0 likes · 21 min read
Java High-Performance Architecture
Sep 27, 2016 · Big Data

Build a Hadoop Cluster with Docker: Step‑by‑Step Guide

Learn how to quickly set up a multi‑node Hadoop cluster on a single machine using Docker containers, covering image preparation, SSH configuration, fixed IP assignment with pipework, and building custom Hadoop images, enabling a lightweight, cost‑effective big‑data environment for development and testing.

Big Data · CentOS · Cluster
0 likes · 9 min read
MaGe Linux Operations
Aug 23, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop Cluster on CentOS 6.5

This article provides a comprehensive, hands‑on tutorial for setting up a Hadoop 2.6.4 cluster on a CentOS 6.5 development server, covering SSH password‑less login, user/group creation, DNS configuration, JDK installation, environment variables, Hadoop installation, HDFS and YARN configuration, and troubleshooting native library warnings.

Big Data · CentOS · Cluster Setup
0 likes · 12 min read
Ctrip Technology
Aug 19, 2016 · Big Data

Ctrip's Big Data Architecture and Personalized Recommendation System

This article describes how Ctrip transformed its traditional application architecture into a high‑concurrency, big‑data‑driven platform, detailing storage, compute, and business‑layer redesigns that enable massive data ingestion, real‑time user‑intent services, and a scalable personalized recommendation system.

Big Data · Ctrip · Hadoop
0 likes · 14 min read
Qunar Tech Salon
Aug 16, 2016 · Big Data

Exploring OLAP Engine with Apache Kylin: Architecture, Theory, and Applications in Qunar's Big Data Platform

This article presents Qunar's experience transitioning from MySQL‑based OLAP to Apache Kylin, detailing the performance challenges, required features, Kylin's architecture and theory, cube construction process, storage mechanisms, real‑world applications, and the pitfalls and optimization practices discovered along the way.

Apache Kylin · Cube · HBase
0 likes · 6 min read
ITPUB
Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · ETL
0 likes · 13 min read
Architecture Digest
Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins from Google’s pioneering papers, explains its architecture and ecosystem, evaluates its strengths such as scalability and benchmarks, discusses current limitations like single‑point failures and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Future · HDFS
0 likes · 14 min read
ITPUB
Jun 29, 2016 · Big Data

Why OLTP Falls Short for Big Data: OLAP, Hadoop & MPP Explained

The article explains how traditional OLTP systems cannot satisfy modern big‑data analytics needs and compares OLAP, Hadoop, and MPP architectures, highlighting their data processing models, scalability, cloud‑based managed services, and practical recommendations for building effective data warehouses.

Big Data · Cloud Services · Hadoop
0 likes · 21 min read
ITPUB
Jun 26, 2016 · Big Data

How to Combine R with Hadoop for Petabyte-Scale Data Processing

This article explains three practical approaches—Streaming APIs, the Rhipe package, and RHadoop—to integrate R with Hadoop, enabling R to process petabyte-scale datasets, compares their setup complexity, capabilities, and trade‑offs, and highlights key conclusions for choosing the right method.

Hadoop · R · RHadoop
0 likes · 4 min read
ITPUB
Jun 18, 2016 · Big Data

5 Essential Steps to Maximize Hadoop Value for Enterprise Projects

Enterprises can unlock Hadoop's full potential by following five strategic steps—from defining high‑impact use cases and assessing architectural fit to managing data, integrating systems, and addressing skill gaps—ensuring measurable business value and competitive advantage.

Data Management · Enterprise Analytics · Hadoop
0 likes · 7 min read
Hulu Beijing
May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data · HDFS · Hadoop
0 likes · 5 min read
dbaplus Community
May 26, 2016 · Big Data

Mastering Apache Parquet: Columnar Storage, Nested Data, and Performance Gains

This article explains Apache Parquet’s columnar storage format, its support for nested data models, the underlying striping/assembly algorithm, file structure, push‑down optimizations, and performance advantages within the Hadoop ecosystem, providing a comprehensive guide for big‑data practitioners.

Apache Parquet · Big Data · Hadoop
0 likes · 22 min read
Qunar Tech Salon
May 13, 2016 · Big Data

Overview and Architecture of Hadoop Distributed File System (HDFS)

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), detailing its design goals, architecture components such as NameNode, DataNode and SecondaryNameNode, data block handling, replication strategies, communication protocols, and the read, write, and delete processes.

Big Data · Distributed File System · HDFS
0 likes · 18 min read
Architect
May 11, 2016 · Big Data

Comprehensive Guide to Hadoop MapReduce Job Execution, Scheduling, and Optimization

This article provides an in‑depth explanation of Hadoop MapReduce architecture, covering the roles of JobClient, JobTracker, TaskTracker and HDFS, the complete job lifecycle from submission to completion, scheduling strategies, shuffle and sort mechanisms, fault tolerance, and performance tuning techniques.

Big Data · Hadoop · JobTracker
0 likes · 20 min read
ITPUB
Apr 24, 2016 · Big Data

12 Essential Hive Performance Tips for Faster Hadoop Queries

This guide presents twelve practical Hive tuning techniques—including avoiding MapReduce, limiting string concatenation, steering clear of subqueries, choosing the right file formats, managing vectorization, sizing containers, enabling statistics, and optimizing joins—to dramatically improve query speed on Hadoop.

Big Data · Hadoop · hive
0 likes · 7 min read
Java High-Performance Architecture
Apr 18, 2016 · Big Data

Why Spark Is Outpacing Hadoop: Speed, Real‑Time Processing, and ML Advantages

The article explains how Spark has become the leading open‑source big‑data platform, highlighting its superior speed, in‑memory processing, real‑time streaming, and built‑in machine‑learning library compared with Hadoop’s slower, disk‑based MapReduce approach and reliance on external storage and ML tools.

Big Data · Hadoop · Real-time Processing
0 likes · 5 min read
21CTO
Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data · Data Infrastructure · ETL
0 likes · 15 min read
21CTO
Mar 31, 2016 · Big Data

Why Hadoop Isn’t the Silver Bullet for Big Data: Insights from Facebook

The article examines common misconceptions about Hadoop, compares it with relational databases, and shares Facebook's data‑analysis practices, highlighting when Hadoop is appropriate and the broader considerations of using open‑source big‑data frameworks.

Hadoop · MapReduce · Relational Databases
0 likes · 8 min read
Architecture Digest
Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · Kafka
0 likes · 11 min read
Architect
Mar 10, 2016 · Big Data

Analysis and Practice of a Real-Time Hadoop Data Security Solution

The article presents a detailed technical overview of Apache Eagle's real-time Hadoop data security architecture, covering distributed data collection, stream processing, metadata‑driven policy enforcement, machine‑learning‑based anomaly detection, and integration with Hadoop ecosystem components such as HBase, Kafka, and Storm.

Apache Eagle · Big Data · Hadoop
0 likes · 25 min read
ITPUB
Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Cluster Management · Hadoop
0 likes · 7 min read
ITPUB
Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing and its future outlook.

Big Data · Doug Cutting · Hadoop
0 likes · 15 min read
Baidu Maps Tech Team
Feb 3, 2016 · Big Data

How Baidu Maps Powers Its Open Platform with Big Data Architecture

This article explains how Baidu Maps’ open platform handles massive daily location data through real‑time and offline pipelines, Hadoop‑based offline computing, stream processing, and query engines built on MySQL, Redis, and Apache Kylin, while outlining future big‑data enhancements.

Apache Kylin · Baidu Maps · Hadoop
0 likes · 7 min read
Qunar Tech Salon
Jan 11, 2016 · Big Data

Architecture of Taobao's Massive Data Products: From Data Sources to the Glider Middleware

The article details Taobao's massive data product architecture, describing a five‑layer system that processes billions of daily records using Hadoop, real‑time streams, distributed MySQL and HBase clusters, and a middleware layer called Glider that unifies queries, caching, and front‑end integration.

Big Data · Data Architecture · Distributed Systems
0 likes · 16 min read
Baidu Maps Tech Team
Jan 6, 2016 · Big Data

How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform using Apache Kylin, detailing the challenges of multi‑dimensional analysis on billions of rows, the architecture, custom extensions for task, resource, and monitoring management, and performance optimizations that achieve millisecond‑level SQL responses.

Apache Kylin · Big Data · Hadoop
0 likes · 21 min read
Efficient Ops
Jan 5, 2016 · Information Security

How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection

Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop that combines stream‑processing, scalable policy enforcement, and machine‑learning user profiling to protect massive data assets across eBay’s production clusters.

Apache Eagle · Big Data · Hadoop
0 likes · 19 min read
Architect
Jan 5, 2016 · Big Data

Apache Eagle: eBay’s Open‑Source Real‑Time Hadoop Data Security Platform

The article provides a comprehensive technical overview of Apache Eagle, an open‑source, distributed, real‑time security monitoring and alerting platform for Hadoop developed by eBay, covering its motivation, architecture, core components, machine‑learning based detection, typical use cases, and future development directions.

Apache Eagle · Big Data · Hadoop
0 likes · 15 min read
21CTO
Dec 30, 2015 · Big Data

Mastering Massive Data: MapReduce, Hadoop, and Taobao’s Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their roles in large‑scale data processing, and then examines Taobao’s massive‑data product architecture—including its data source, compute, storage, query, and product layers, as well as the MyFOX, Prom, and Glider components and caching strategies.

Data Architecture · Distributed Systems · Hadoop
0 likes · 16 min read
21CTO
Dec 9, 2015 · Big Data

Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their components such as HDFS, MapReduce, and HBase, and then examines Taobao’s large‑scale data product architecture—including storage, computation, query, and caching layers—to illustrate practical big‑data processing techniques.

Data Architecture · Distributed Systems · Hadoop
0 likes · 17 min read
21CTO
Dec 3, 2015 · Big Data

How Netflix Scales Its Hadoop Data Warehouse on AWS with Genie PaaS

This article explains how Netflix leverages Amazon S3 and Elastic MapReduce to build a virtually unlimited, dynamically scalable Hadoop data warehouse in the cloud, and introduces Genie—a Hadoop platform‑as‑a‑service that abstracts job submission, resource management, and cluster orchestration.

AWS · Elastic MapReduce · Genie
0 likes · 15 min read
Architect
Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Hadoop
0 likes · 10 min read
21CTO
Nov 26, 2015 · Big Data

Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

Data Architecture · Hadoop · MapReduce
0 likes · 7 min read

Design and Implementation of Alibaba Cloud's Cross‑Data‑Center Hadoop Cluster

In 2013 Alibaba Cloud ran out of rack capacity in a single IDC, prompting the development of a multi‑NameNode, cross‑data‑center Hadoop solution that addresses NameNode scalability, inter‑site bandwidth limits, data placement, job scheduling, massive data migration, and user transparency.

Cross‑Data‑Center · Federation · Hadoop
0 likes · 14 min read
21CTO
Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop
0 likes · 17 min read

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.

Data Platform · Hadoop · Spark
0 likes · 16 min read
Architect
Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Hadoop · Spark
0 likes · 12 min read
Efficient Ops
Oct 14, 2015 · Big Data

Spark vs Hadoop, Flink, HBase/Cassandra, Kafka & Tachyon: Expert Q&A

During a lively “Sit and Discuss” session, experts compared Spark and Hadoop, evaluated Flink against Spark, contrasted HBase with Cassandra, explained why Kafka (and sometimes Flink) is preferred for distributed messaging, and shared insights on Tachyon’s role in modern big‑data ecosystems.

Flink · HBase · Hadoop
0 likes · 10 min read
21CTO
Sep 27, 2015 · Big Data

How Weidian Built a Scalable Big Data Platform for Mobile Commerce

This article outlines the design and implementation of Weidian’s end‑to‑end big data processing platform, covering dataset definition, data collection via Flume‑based DataAgent, transmission through Databus, storage options such as HDFS, Kafka and Elasticsearch, and the monitoring and resource‑integration strategies that support massive mobile commerce logs.

Elasticsearch · Flume · Hadoop
0 likes · 18 min read
MaGe Linux Operations
Aug 20, 2015 · Big Data

15 Must‑Try Resources to Master Hadoop Quickly

This article explains what Hadoop is, outlines its key features, and presents a curated list of 15 high‑quality tutorials, video courses, and books to help beginners and professionals efficiently learn Hadoop and its MapReduce ecosystem.

Hadoop · Learning Resources · MapReduce
0 likes · 12 min read
Qunar Tech Salon
Aug 17, 2015 · Big Data

Comprehensive Overview of Open‑Source Big Data Tools and Platforms

This article presents a detailed, categorized catalogue of more than fifty open‑source big‑data projects—including Hadoop‑related utilities, analytics platforms, databases, BI solutions, data‑mining packages, query engines, programming languages, search tools, and in‑memory technologies—highlighting their primary functions, supported operating systems, and official links.

Analytics · Hadoop · In-Memory
0 likes · 31 min read
Qunar Tech Salon
Jul 8, 2015 · Big Data

Understanding Logs: The Foundation of Distributed Systems, Data Integration, and Stream Processing

This article explains how logs—simple, append‑only, time‑ordered records—serve as the core abstraction behind databases, distributed systems, data integration pipelines, and modern stream‑processing platforms such as Kafka and Hadoop, illustrating their design, scalability, and practical challenges.

Big Data · Data Integration · Distributed Systems
0 likes · 45 min read
Architect
Jul 6, 2015 · Big Data

Understanding Logs: The Core of Distributed Systems and Data Integration

This article explains how logs—simple, append‑only, time‑ordered records—serve as the fundamental abstraction behind databases, distributed systems, data integration pipelines, and stream‑processing platforms like Kafka and Hadoop, illustrating their role in ordering, replication, scalability, and real‑time analytics.

Data Integration · Distributed Systems · Hadoop
0 likes · 48 min read
Efficient Ops
Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Hadoop
0 likes · 21 min read

Design Choices for Distributed Storage Metadata: Comparing GlusterFS, Hadoop, GridFS, HBase, and FastDFS

The article examines various distributed storage design approaches—decentralized (GlusterFS), centralized (Hadoop), database‑based (GridFS and HBase), and metadata‑bypassing (FastDFS)—detailing their advantages, drawbacks, and practical considerations for cloud storage systems.

FastDFS · GlusterFS · GridFS
0 likes · 17 min read
ITPUB
May 26, 2015 · Big Data

Step-by-Step Guide to Quickly Install and Configure Hive on Hadoop

This article provides a concise, practical walkthrough for installing and configuring Apache Hive on a Hadoop cluster, covering prerequisite HDFS and MapReduce setup, downloading Hive, extracting files, setting environment variables, configuring XML files, starting Hive, and verifying the installation with simple commands.

Configuration · ETL · HQL
0 likes · 4 min read
Suning Technology
May 22, 2015 · Big Data

Suning’s Big Data Platform Evolution: From SAP BW to Real‑Time Streaming

This article chronicles Suning’s journey from early SAP‑based data warehouses to a modern, open‑source big data platform featuring real‑time collection, Hadoop‑Hive offline processing, Storm‑based streaming, and a visual development environment, highlighting how each layer addresses growing data volume, variety, and business demands.

Data Architecture · Hadoop · Real-time Processing
0 likes · 5 min read
MaGe Linux Operations
Apr 28, 2015 · Databases

Choosing the Right Database: From RDBMS to NoSQL, NewSQL, and Hadoop

The article examines the evolution of database technologies—from traditional relational databases and their ACID guarantees to NoSQL, NewSQL, and Hadoop—illustrating how a gaming company can combine these solutions to handle massive online traffic, ensure data integrity, and enable advanced analytics.

Data Analytics · Hadoop · NewSQL
0 likes · 6 min read
MaGe Linux Operations
Apr 7, 2015 · Big Data

How Hadoop’s Tiered Storage Optimizes Data Based on Temperature

This article explains Hadoop’s tiered storage concept, describing how data is classified by temperature—hot, warm, cold, frozen—and automatically moved across disk and archive layers to optimize cost and performance, with examples from Hadoop versions and eBay’s large‑scale deployment.

Big Data · Data Temperature · HDFS
0 likes · 9 min read
MaGe Linux Operations
Feb 25, 2015 · Big Data

Do You Really Need Hadoop? 10 Alternatives to Consider First

This article explains why many companies over‑invest in Hadoop, outlines how to evaluate data size, growth, and relevance, and presents practical alternatives such as archiving, data sampling, database sharding, and hiring business‑savvy analysts before committing to a Hadoop deployment.

Big Data Alternatives · Data Architecture · Hadoop
0 likes · 9 min read