Tagged articles

Hadoop

413 articles · Page 4 of 5

Jun 14, 2017 · Big Data

How Apache Kylin Supercharges Big Data Analytics with Pre‑Computed Cubes

Apache Kylin is an open-source, distributed OLAP engine built on Hadoop that uses pre-computed cubes to deliver sub-second, high-concurrency SQL queries on massive datasets. It integrates with popular BI tools and offers a modular architecture, recent 1.5.x enhancements, and extensive deployment options.

Apache Kylin · Hadoop · OLAP

0 likes · 17 min read

37 Interactive Technology Team

Jun 13, 2017 · Big Data

MapReduce Principles and Hadoop Execution Process with WordCount Example

The article explains MapReduce’s divide‑and‑conquer model and Hadoop’s execution pipeline—including map, partition, spill, merge, shuffle, and reduce phases—illustrated with a WordCount example that shows how mappers emit word‑1 pairs and reducers aggregate counts to produce final frequencies on HDFS.

Distributed Computing · Hadoop · MapReduce

0 likes · 7 min read

StarRing Big Data Open Lab

Jun 9, 2017 · Big Data

Secure HDFS with Guardian 5.0: Complete Permission and Quota Guide

This article explains why Hadoop security is critical, introduces Guardian 5.0’s unified authentication and authorization framework, and provides step‑by‑step instructions for configuring HDFS permissions and quotas through its web UI, helping administrators protect massive data assets efficiently.

Guardian5.0 · HDFS · Hadoop

0 likes · 9 min read

Architecture Digest

Jun 9, 2017 · Big Data

A Comprehensive Guide for Big Data Beginners: From Hadoop Fundamentals to Machine Learning

This guide walks beginners through the entire big‑data ecosystem, covering the 4V characteristics, core open‑source frameworks, Hadoop setup, Hive and SQL on Hadoop, data ingestion and export tools, task scheduling, real‑time processing with Kafka, Storm and Spark Streaming, and an introduction to machine‑learning applications.

Hadoop · Hive · Kafka

0 likes · 17 min read

dbaplus Community

Jun 7, 2017 · Big Data

Master MapReduce: From Fundamentals to Real‑World Hadoop Projects

This comprehensive guide walks you through MapReduce fundamentals, the complete execution flow, and seven hands‑on Hadoop projects—including WordCount, custom serialization, custom partitioning, grouping comparators, file merging, multiple outputs, join operations, and friend‑graph analysis—while providing environment setup steps, Maven commands, and Hadoop CLI examples.

CLI · Distributed Computing · Hadoop

0 likes · 28 min read

dbaplus Community

May 24, 2017 · Operations

How to Replace a ZooKeeper Node in a 5‑Node Cluster Without Downtime

This guide details the step‑by‑step process for replacing a faulty ZooKeeper node (myid 5) in a five‑node cluster, covering configuration updates in zoo.cfg, Hadoop’s hdfs‑site.xml, yarn‑site.xml, HBase‑site.xml, and the required service restarts to ensure continuous high‑availability.

Configuration · HBase · Hadoop

0 likes · 10 min read

MaGe Linux Operations

May 24, 2017 · Big Data

Demystifying Big Data: From HDFS to Spark, Hive, and Real‑Time Streaming

This article explains how big data challenges traditional storage, introduces HDFS for distributed file management, describes parallel processing frameworks like MapReduce, Tez, and Spark, compares higher‑level tools such as Hive and Pig, and explores real‑time streaming and key‑value stores for low‑latency analytics.

Hadoop · Hive · Key-Value Store

0 likes · 9 min read

dbaplus Community

May 16, 2017 · Big Data

Master HDFS: Theory, Shell Commands, and Java API Hands‑On Guide

This comprehensive tutorial explains HDFS fundamentals, its metadata management and advantages, then walks you through setting up a Hadoop environment, executing core shell commands, and using the Java API with complete code examples, enabling you to confidently operate HDFS in practice.

Distributed File System · Hadoop · Java API

0 likes · 15 min read

StarRing Big Data Open Lab

May 12, 2017 · Big Data

How to Master Hadoop Performance: A Real-World TPCx-HS Tuning Case Study

This article walks through a detailed Hadoop performance tuning case using the TPCx-HS benchmark, explaining the bottlenecks in TeraGen and TeraSort, the optimization strategies applied, hardware considerations, and the resulting improvements in CPU and network utilization.

Cluster Optimization · Hadoop · MapReduce

0 likes · 9 min read

ITFLY8 Architecture Home

May 10, 2017 · Big Data

How Hadoop Implements Distributed File Systems: From GFS Theory to Practice

This article explains the fundamentals of distributed file systems by linking Google’s GFS, MapReduce, and BigTable concepts to Hadoop’s open‑source implementation, covering terminology, architecture, server roles, data distribution, RPC protocols, file operations, fault recovery, consistency, load balancing, and garbage collection.

GFS · HDFS · Hadoop

0 likes · 34 min read

Architects' Tech Alliance

May 7, 2017 · Big Data

Building a Complete Big Data Platform: From Hadoop Basics to Real‑Time Analytics

This guide walks beginners through the entire big‑data ecosystem—explaining the 4V characteristics, listing essential open‑source components, teaching Hadoop setup, Hive and SparkSQL usage, data ingestion with Sqoop, Flume and Kafka, task scheduling with Oozie, and real‑time processing with Storm and Spark Streaming.

Big Data · Hadoop · Hive

0 likes · 20 min read

MaGe Linux Operations

May 3, 2017 · Big Data

From Storage to Real‑Time: The Evolution of Big Data Technologies

This article outlines the three historical stages of big data technology—from early storage and batch processing, through market‑driven integration with Hive, to today’s focus on speed with Spark, Impala and streaming—while detailing the Hadoop ecosystem components such as HDFS, MapReduce, KV stores and emerging solutions like YDB.

HDFS · Hadoop · Hive

0 likes · 13 min read

Efficient Ops

May 2, 2017 · Big Data

Mastering ZooKeeper: Core Concepts and Real-World Big Data Applications

This article introduces ZooKeeper’s fundamental architecture, explains its key concepts such as cluster roles, sessions, ZNodes, watches, and ACLs, and then details how it powers essential distributed coordination tasks—including configuration management, naming services, master election, and distributed locks—in large‑scale Hadoop and HBase ecosystems.

Distributed Coordination · Distributed Locks · HBase

0 likes · 25 min read

360 Quality & Efficiency

Apr 24, 2017 · Big Data

Introduction to Hadoop: Architecture, HDFS, MapReduce, and Common Commands

This article introduces Hadoop as a widely used big‑data framework, explains its core components HDFS and MapReduce, describes the cluster node roles, presents typical command‑line usage and a sample MapReduce workflow, and offers guidance for further learning.

Distributed Computing · HDFS · Hadoop

0 likes · 5 min read

Architecture Digest

Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop

0 likes · 11 min read

Meituan Technology Team

Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation

0 likes · 22 min read

Tongcheng Travel Technology Center

Apr 14, 2017 · Information Security

Implementing a Lightweight User Authentication Mechanism for Hadoop at Tongcheng Travel

This article describes the design, implementation, and deployment of a custom Hadoop security solution that introduces username‑password authentication via RPC, integrates a new protobuf protocol, modifies NameNode behavior, and provides rollout tools to secure a large‑scale shared Hadoop cluster without service interruption.

Hadoop · Kerberos · UserGroupInformation

0 likes · 9 min read

Java High-Performance Architecture

Apr 4, 2017 · Big Data

Master MapReduce: Principles, Process, and 7 Hands‑On Examples

This tutorial quickly introduces the MapReduce model, explains its core principles and execution flow, and guides you through seven practical examples—from basic WordCount to custom serialization, partitioning, joins, and friend‑recommendation—while providing test data and an optional ready‑made Hadoop environment for hands‑on practice.

Distributed Computing · Hadoop · MapReduce

0 likes · 3 min read

Java High-Performance Architecture

Mar 23, 2017 · Big Data

Master HDFS: From Basics to Hands‑On Java API and Shell Operations

This tutorial guides you through HDFS fundamentals, explaining its purpose and mechanisms, demonstrates command‑line and Java API operations, and walks you through the complete read/write workflow, while providing a ready‑to‑use practice environment for hands‑on learning.

Distributed File System · HDFS · Hadoop

0 likes · 2 min read

Meituan Technology Team

Mar 17, 2017 · Big Data

Optimizing Hadoop NameNode Restart in HA with QJM

By applying a series of JIRA patches and configuration tweaks—such as shrinking the fsLock scope, increasing checkpoint transaction thresholds, off‑loading quota calculations, simplifying BlockReport handling, and async processing of mis‑replicated blocks—the Hadoop HA NameNode restart time in a 540 MB metadata cluster drops from roughly 4000 seconds to about 2000 seconds, cutting total downtime to around 35 minutes and greatly improving cluster availability.

HA · HDFS · Hadoop

0 likes · 18 min read

Efficient Ops

Mar 8, 2017 · Big Data

Inside iQIYI’s Massive Hadoop Platform: Architecture, Ops, and the Gear Workflow Engine

This article traces iQIYI's Hadoop platform, built since 2010 and now spanning over a thousand nodes and 60 PB of storage, covering its architectural evolution, operational management practices, the challenges encountered, and the custom Gear workflow system that streamlines job scheduling, dependencies, and alerts for large-scale data processing.

Gear · Hadoop · YARN

0 likes · 19 min read

Tencent Cloud Developer

Mar 8, 2017 · Big Data

HBase Data Migration from Version 0.94.15 to 1.2.1: Issues and Solutions

The migration of 500 GB HBase and 5 TB Solr data from version 0.94.15 to 1.2.1 required fixing hardware clock drift, DNS hostname issues, and missing Snappy support, and demonstrated that a brute‑force HDFS transfer is more reliable than import/export when handling deprecated parameters.

Data Migration · HBase · Hadoop

0 likes · 9 min read

Nightwalker Tech

Feb 27, 2017 · Big Data

Community Discussion on Learning Paths, Tools, and Applications in Big Data

A diverse group of practitioners share recommendations for books, technologies, real‑world use cases, and practical challenges when learning and applying big‑data processing, covering Hadoop, Spark, data visualization, ETL, and the relationship between data, algorithms, and business value.

Big Data · Hadoop · data analysis

0 likes · 16 min read

Architecture Digest

Feb 11, 2017 · Big Data

LeKe Sports Big Data Platform Evolution: From Early ETL Reporting to 2.0 Streaming Architecture

The article describes how LeKe Sports built and continuously upgraded its Hadoop‑based big data platform—from a manual ETL‑to‑Elasticsearch reporting system to a 2.0 architecture featuring Spark Streaming, SQL‑based query layers, Elasticsearch indexing, and cloud‑native storage and backup solutions—to meet rapidly growing PB‑scale data demands.

Big Data · Data Platform · ETL

0 likes · 5 min read

Efficient Ops

Feb 9, 2017 · Big Data

Mastering HDFS Disk Balancer: Optimize DataNode Storage in Hadoop 3

This article explains the new HDFS disk balancer feature introduced in Hadoop 3, covering its purpose, supported volume‑selection policies, step‑by‑step usage, planning and execution commands, and how it helps maintain balanced storage across DataNode disks.

Disk Balancer · HDFS · Hadoop

0 likes · 8 min read

StarRing Big Data Open Lab

Jan 26, 2017 · Information Security

Why Hadoop Clusters Need Strong Security and How Kerberos Protects Them

This article explains the security risks facing Hadoop clusters, outlines common attack methods, introduces Kerberos authentication, and describes Transwarp Data Hub's multi‑layer security architecture—including Guardian, KRB5LDAP, and authorization controls—to help administrators secure their big‑data environments.

Hadoop · Kerberos · TDH

0 likes · 11 min read

Huawei Cloud Developer Alliance

Jan 25, 2017 · Big Data

Top 10 Hadoop Data Security Practices Every Enterprise Should Follow

This article outlines ten essential Hadoop data‑security measures, describes the eight‑layer Hadoop ecosystem, presents real‑world Hadoop case studies, and discusses the platform's development roadmap and future trends, offering a comprehensive guide for big‑data professionals.

Big Data · Case Studies · Data Security

0 likes · 17 min read

Huawei Cloud Developer Alliance

Jan 24, 2017 · Big Data

Why Hadoop Remains the Backbone of Big Data: Core Modules, Tools, and Trends

This article provides a comprehensive overview of Hadoop as the leading open‑source platform for big‑data processing, detailing its core components HDFS and MapReduce, the evolution to Hadoop 2.0/YARN, and the extensive ecosystem of tools and commercial solutions that enable scalable storage, analysis, and machine‑learning on massive data sets.

Big Data · Distributed Computing · HDFS

0 likes · 18 min read

Art of Distributed System Architecture Design

Dec 31, 2016 · Big Data

Understanding Hadoop: Architecture, HDFS, and MapReduce

This article explains Hadoop as an Apache‑managed open‑source platform for storing massive data on distributed clusters and running robust, efficient analytics via its two core components—HDFS for storage and the Java‑based MapReduce framework for processing—highlighting modularity, high availability, and common tooling.

Distributed Computing · HDFS · Hadoop

0 likes · 6 min read

dbaplus Community

Dec 22, 2016 · Big Data

What’s New in the Big Data Ecosystem? Hadoop 3.0 Alpha, Druid 0.9.2, Kudu 1.1 and More

This article summarizes the latest releases and feature updates in the big data ecosystem—including Hadoop 3.0 Alpha, Druid 0.9.2, Apache Kudu 1.1.0, HAWQ 2.1.0 enterprise—as well as a brief overview of Docker’s 2015‑2016 version history and its adoption status in China.

Big Data · Druid · HAWQ

0 likes · 18 min read

ITFLY8 Architecture Home

Dec 13, 2016 · Big Data

Umeng’s Mobile Big Data Platform: Architecture, Challenges & Insights

The article details Umeng’s mobile big‑data platform architecture, describing its Lambda‑style hybrid design, data ingestion pipeline with dual Kafka clusters, offline and real‑time processing using Hadoop, Spark, Storm, and storage layers such as HDFS, HBase, MongoDB and Elasticsearch, while also discussing challenges in data collection, cleaning, computation, security, and value‑added services.

Data Architecture · Hadoop · Kafka

0 likes · 13 min read

Architects' Tech Alliance

Dec 6, 2016 · Big Data

How Hulu’s Segmentation System Powers Big Data Marketing at Scale

At the 2016 WOT Big Data Technology Summit, Hulu’s senior R&D manager Zhao Kunliang presented the company’s Segmentation system, detailing its Hadoop‑based architecture, Spark and Spark Streaming processing, the custom Nesto query engine, and the challenges and innovations involved in supporting large‑scale marketing and advertising analytics.

Hadoop · Marketing Analytics · Nesto

0 likes · 5 min read

Weidian Tech Team

Nov 28, 2016 · Big Data

How We Built the Mars Big Data Platform to Boost Development Efficiency

The article explains why Weidian needed a new big data development platform, outlines the functional features of the Mars system, describes its architecture, scheduling mechanisms, task execution flow, and discusses remaining challenges and future enhancements.

Hadoop · Task scheduling · distributed systems

0 likes · 11 min read

StarRing Big Data Open Lab

Nov 22, 2016 · Big Data

Boost Hadoop SQL Performance: Reduce I/O, Network, and CPU Overhead

This article explains how to quickly locate SQL performance bottlenecks on Hadoop by understanding hardware metrics and then applies four practical optimization strategies—cutting data access, shrinking result sets, minimizing interactions, and lowering CPU load—using filters, selective columns, batch operations, and stored procedures.

Hadoop

0 likes · 11 min read

StarRing Big Data Open Lab

Nov 18, 2016 · Big Data

Unveiling Modern Big Data Architecture: Key Technologies and Trends

This article reviews a comprehensive big‑data lecture covering traditional databases, Hadoop ecosystems, commercial big‑data platforms, computing models, analysis techniques, visualization, and leading vendors, highlighting how these technologies shape today’s data‑driven enterprises.

Big Data · Data Architecture · Hadoop

0 likes · 14 min read

Architecture Digest

Nov 16, 2016 · Big Data

A Decade of Hadoop: History, Architecture, Ecosystem, and Future Outlook

This article chronicles Hadoop’s ten‑year evolution from its early HDFS and MapReduce roots to a mature big‑data platform, detailing its historical milestones, architectural layers, ecosystem components, industry adoption, and future trends in storage, processing, security, and cloud integration.

Distributed Computing · Hadoop · ecosystem

0 likes · 36 min read

StarRing Big Data Open Lab

Nov 14, 2016 · Operations

Master Real-Time Hadoop Alerts with Transwarp Manager

Deploying the Transwarp Manager alert system within Hadoop clusters enables operators to monitor resource shortages, failures, and health issues in real time. It offers alert browsing, configurable thresholds, and instant email or script notifications, so problems can be identified and resolved before they impact services.

Alert Monitoring · Hadoop · Operations

0 likes · 9 min read

StarRing Big Data Open Lab

Nov 11, 2016 · Big Data

Why SQL Still Rules Big Data—and How NoSQL & NewSQL Fit In

The article explores the evolution of data processing from Hadoop and Spark to modern SQL, NoSQL, and NewSQL solutions, comparing their architectures, performance trade‑offs, and use‑cases, while illustrating concepts with examples like MapReduce, Hive, Impala, and streaming platforms such as Storm.

Big Data · Hadoop · NewSQL

0 likes · 14 min read

Architecture Digest

Nov 6, 2016 · Big Data

Evolution of Taobao’s Big Data Platform: From RAC to MaxCompute

The article chronicles Taobao’s 13‑year evolution of its big data platform, detailing three phases—from a single‑node Oracle setup and the Tianwang scheduler, through a Hadoop‑based “Cloud Ladder 1” architecture with real‑time analytics, to the current MaxCompute/ODPS era with cross‑region projects and advanced data services.

Big Data · Data Platform · Data Warehouse

0 likes · 11 min read

Architects' Tech Alliance

Oct 31, 2016 · Big Data

A Decade of Hadoop: History, Architecture, Industry Impact and Future Outlook

This article chronicles Hadoop’s ten‑year evolution from its early Nutch roots to a mature big‑data platform, detailing its technical architecture, ecosystem growth, industry adoption, application scenarios, and future challenges in storage, resource management, compute engines, security, and analytics.

Cloud Computing · Data Platform · Hadoop

0 likes · 35 min read

ITFLY8 Architecture Home

Oct 27, 2016 · Big Data

Inside Taobao’s Massive Data Architecture: How 1.5 PB Daily Is Processed and Served

The article explains Taobao’s five‑layer data product architecture—covering data sources, compute, storage, query, and product layers—and describes how massive volumes of data are ingested, processed in batch and streaming, stored in MySQL and HBase clusters, and served efficiently through a unified middle‑layer and sophisticated caching mechanisms.

Big Data · Caching · HBase

0 likes · 15 min read

Java High-Performance Architecture

Oct 17, 2016 · Databases

How Does HBase Store Massive Tables? Inside Its Architecture

HBase stores huge tables by splitting them into regions, distributing these across region servers managed by a master, and further dividing each region into column-family stores, memstores, and StoreFiles, forming a layered architecture built on Hadoop’s distributed storage.

Distributed storage · HBase · Hadoop

0 likes · 2 min read

StarRing Big Data Open Lab

Oct 8, 2016 · Big Data

Evolving Data Warehouses with Hadoop & Spark: Core Technologies

Data warehouses centralize and transform enterprise data for multidimensional analysis, and modern demands have spawned four types—traditional, real‑time, associative discovery, and data marts—each with distinct technical requirements, while Hadoop‑based solutions like Transwarp Data Hub address challenges of scale, variety, latency, and security.

Big Data · Distributed Computing · Hadoop

0 likes · 21 min read

Java High-Performance Architecture

Sep 27, 2016 · Big Data

Build a Hadoop Cluster with Docker: Step‑by‑Step Guide

Learn how to quickly set up a multi‑node Hadoop cluster on a single machine using Docker containers, covering image preparation, SSH configuration, fixed IP assignment with pipework, and building custom Hadoop images, enabling a lightweight, cost‑effective big‑data environment for development and testing.

Big Data · CentOS · Docker

0 likes · 9 min read

Java High-Performance Architecture

Sep 24, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop 2.7.3 Cluster on Three Servers

This tutorial walks you through preparing three Linux servers, configuring password‑less SSH, installing Hadoop 2.7.3, editing core XML files, distributing the installation, starting the services, and verifying HDFS and MapReduce functionality with practical commands and screenshots.

Big Data · Cluster Setup · HDFS

0 likes · 10 min read

MaGe Linux Operations

Aug 23, 2016 · Big Data

Step-by-Step Guide to Building a Hadoop Cluster on CentOS 6.5

This article provides a comprehensive, hands‑on tutorial for setting up a Hadoop 2.6.4 cluster on a CentOS 6.5 development server, covering SSH password‑less login, user/group creation, DNS configuration, JDK installation, environment variables, Hadoop installation, HDFS and YARN configuration, and troubleshooting native library warnings.

Big Data · CentOS · Cluster Setup

0 likes · 12 min read

Ctrip Technology

Aug 19, 2016 · Big Data

Ctrip's Big Data Architecture and Personalized Recommendation System

This article describes how Ctrip transformed its traditional application architecture into a high‑concurrency, big‑data‑driven platform, detailing storage, compute, and business‑layer redesigns that enable massive data ingestion, real‑time user‑intent services, and a scalable personalized recommendation system.

Big Data · Ctrip · Hadoop

0 likes · 14 min read

Qunar Tech Salon

Aug 16, 2016 · Big Data

Exploring OLAP Engine with Apache Kylin: Architecture, Theory, and Applications in Qunar's Big Data Platform

This article presents Qunar's experience transitioning from MySQL‑based OLAP to Apache Kylin, detailing the performance challenges, required features, Kylin's architecture and theory, cube construction process, storage mechanisms, real‑world applications, and the pitfalls and optimization practices discovered along the way.

Apache Kylin · Cube · HBase

0 likes · 6 min read

MaGe Linux Operations

Aug 4, 2016 · Big Data

How Hadoop 2.0 Collects and Manages Job Logs with YARN

This article explains Hadoop 2.0's built‑in MRv2 log collection mechanism, detailing job‑run and task‑run logs, their generation steps, log aggregation, and the role of the JobHistory Server for centralized analysis.

Big Data · Hadoop · JobHistory

0 likes · 8 min read

ITPUB

Jul 19, 2016 · Big Data

From Traditional Data Warehouses to Big Data: Practical Techniques and Migration Insights

The talk shares hands‑on experiences and best‑practice methods for traditional data‑warehouse processing, public and behavioral data handling in big‑data environments, and practical guidance for migrating legacy warehouses to modern Hadoop‑based platforms, emphasizing data governance, security, and performance optimization.

Big Data · Data Governance · Data Warehouse

0 likes · 13 min read

Architecture Digest

Jul 5, 2016 · Big Data

Why Map‑Reduce Is Not the Solution to Your Big Data Problem – A Critical Look at Hadoop

The article reviews Hadoop’s origins from Google’s pioneering papers, explains its architecture and ecosystem, evaluates its strengths such as scalability and benchmarks, discusses current limitations like single‑point failures and complex programming, and outlines upcoming improvements including HDFS Federation and next‑generation MapReduce.

Big Data · Distributed Computing · Future

0 likes · 14 min read

ITPUB

Jun 29, 2016 · Big Data

Why OLTP Falls Short for Big Data: OLAP, Hadoop & MPP Explained

The article explains how traditional OLTP systems cannot satisfy modern big‑data analytics needs and compares OLAP, Hadoop, and MPP architectures, highlighting their data processing models, scalability, cloud‑based managed services, and practical recommendations for building effective data warehouses.

Big Data · Data Warehouse · Hadoop

0 likes · 21 min read

ITPUB

Jun 26, 2016 · Big Data

How to Combine R with Hadoop for Petabyte-Scale Data Processing

This article explains three practical approaches—Streaming APIs, the Rhipe package, and RHadoop—to integrate R with Hadoop, enabling R to process petabyte-scale datasets, compares their setup complexity, capabilities, and trade‑offs, and highlights key conclusions for choosing the right method.

Hadoop · R · RHadoop

0 likes · 4 min read

ITPUB

Jun 18, 2016 · Big Data

5 Essential Steps to Maximize Hadoop Value for Enterprise Projects

Enterprises can unlock Hadoop's full potential by following five strategic steps—from defining high‑impact use cases and assessing architectural fit to managing data, integrating systems, and addressing skill gaps—ensuring measurable business value and competitive advantage.

Data Management · Enterprise Analytics · Hadoop

0 likes · 7 min read

dbaplus Community

Jun 7, 2016 · Big Data

What Is Big Data? Value, Platforms, and How to Harness Its Power

This article explains what big data is, where its value lies, how to design and build a big data platform, and the essential steps to turn massive data into actionable business insights while addressing technical and operational challenges.

BI · Big Data · Data Value

0 likes · 16 min read

Hulu Beijing

May 31, 2016 · Big Data

What’s New in Hadoop 3.0? Key Features and Improvements Explained

Hadoop 3.0, built on JDK 1.8, adds erasure‑coded HDFS, multi‑NameNode support, native MapReduce task optimizations, cgroup‑based YARN memory and disk isolation, and container resizing, with an alpha slated for summer and a GA release expected in November or December.

Big Data · HDFS · Hadoop

0 likes · 5 min read

dbaplus Community

May 26, 2016 · Big Data

Mastering Apache Parquet: Columnar Storage, Nested Data, and Performance Gains

This article explains Apache Parquet’s columnar storage format, its support for nested data models, the underlying striping/assembly algorithm, file structure, push‑down optimizations, and performance advantages within the Hadoop ecosystem, providing a comprehensive guide for big‑data practitioners.

Apache Parquet · Big Data · Hadoop

0 likes · 22 min read

Qunar Tech Salon

May 13, 2016 · Big Data

Overview and Architecture of Hadoop Distributed File System (HDFS)

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), detailing its design goals, architecture components such as NameNode, DataNode and SecondaryNameNode, data block handling, replication strategies, communication protocols, and the read, write, and delete processes.

Big Data · Data Replication · Distributed File System

0 likes · 18 min read

Architect

May 11, 2016 · Big Data

Comprehensive Guide to Hadoop MapReduce Job Execution, Scheduling, and Optimization

This article provides an in‑depth explanation of Hadoop MapReduce architecture, covering the roles of JobClient, JobTracker, TaskTracker and HDFS, the complete job lifecycle from submission to completion, scheduling strategies, shuffle and sort mechanisms, fault tolerance, and performance tuning techniques.

Big Data · Hadoop · JobTracker

0 likes · 20 min read

ITPUB

Apr 24, 2016 · Big Data

12 Essential Hive Performance Tips for Faster Hadoop Queries

This guide presents twelve practical Hive tuning techniques—including avoiding MapReduce, limiting string concatenation, steering clear of subqueries, choosing the right file formats, managing vectorization, sizing containers, enabling statistics, and optimizing joins—to dramatically improve query speed on Hadoop.

Big Data · Hadoop · Hive

0 likes · 7 min read

Java High-Performance Architecture

Apr 18, 2016 · Big Data

Why Spark Is Outpacing Hadoop: Speed, Real‑Time Processing, and ML Advantages

The article explains how Spark has become the leading open‑source big‑data platform, highlighting its superior speed, in‑memory processing, real‑time streaming, and built‑in machine‑learning library compared with Hadoop’s slower, disk‑based MapReduce approach and reliance on external storage and ML tools.

Big Data · Hadoop · Real-time Processing

0 likes · 5 min read

21CTO

Apr 4, 2016 · Big Data

How Asana Scaled Its Data Infrastructure: From MySQL to Redshift & Hadoop

This article details Asana's evolution from a simple Python‑MySQL setup to a robust, scalable data platform using Redshift, Hadoop, Luigi, and modern BI tools, highlighting challenges, solutions, and lessons learned for building reliable data pipelines in fast‑growing startups.

Big Data · Data Infrastructure · ETL

0 likes · 15 min read

Art of Distributed System Architecture Design

Mar 31, 2016 · Big Data

Airbnb’s Big Data Platform Architecture: Design, Evolution, and Lessons Learned

Airbnb’s engineering team outlines the evolution and design of its massive big‑data platform—detailing the dual “gold” and “silver” Hive clusters, use of Kafka, Presto, Airflow, Mesos, and Spark, along with performance gains, cost reductions, and open‑source contributions.

Airbnb · Airflow · Big Data

0 likes · 13 min read

21CTO

Mar 31, 2016 · Big Data

Why Hadoop Isn’t the Silver Bullet for Big Data: Insights from Facebook

The article examines common misconceptions about Hadoop, compares it with relational databases, and shares Facebook's data‑analysis practices, highlighting when Hadoop is appropriate and the broader considerations of using open‑source big‑data frameworks.

Hadoop · MapReduce · Relational Databases

0 likes · 8 min read

Architecture Digest

Mar 28, 2016 · Big Data

Overview of the Hadoop Ecosystem and Modern Big Data Technologies

This article provides a comprehensive overview of Hadoop and its surrounding ecosystem, detailing core components, storage principles, key algorithms, and a wide range of modern big‑data technologies such as Spark, Flink, Kafka, NoSQL databases, and cloud‑based processing platforms.

Big Data · Hadoop · Kafka

0 likes · 11 min read

Architect

Mar 10, 2016 · Big Data

Analysis and Practice of a Real-Time Hadoop Data Security Solution

The article presents a detailed technical overview of Apache Eagle's real-time Hadoop data security architecture, covering distributed data collection, stream processing, metadata‑driven policy enforcement, machine‑learning‑based anomaly detection, and integration with Hadoop ecosystem components such as HBase, Kafka, and Storm.

Apache Eagle · Big Data · Data Security

0 likes · 25 min read

ITPUB

Feb 24, 2016 · Big Data

How Pepperdata Optimizes Hadoop Cluster Resources and Improves Performance

The article explains how Hadoop clusters suffer from resource contention among multiple users, why YARN alone often fails to prioritize workloads, and how Pepperdata provides deeper visibility and automatic adjustments that reduce low‑priority usage, cut node count, and lower cloud costs.

Big Data · Hadoop · Pepperdata

0 likes · 7 min read

ITPUB

Feb 20, 2016 · Big Data

Doug Cutting’s Journey: How Hadoop Shaped the Big Data Era

The article chronicles Doug Cutting’s path from his Stanford studies and early Xerox work through the creation of Lucene, Nutch, and Hadoop, highlighting how open‑source innovations and Google’s technologies propelled Hadoop to become a cornerstone of modern big‑data processing and its future outlook.

Big Data · Distributed Computing · Doug Cutting

0 likes · 15 min read

Baidu Maps Tech Team

Feb 3, 2016 · Big Data

How Baidu Maps Powers Its Open Platform with Big Data Architecture

This article explains how Baidu Maps’ open platform handles massive daily location data through real‑time and offline pipelines, Hadoop‑based offline computing, stream processing, and query engines built on MySQL, Redis, and Apache Kylin, while outlining future big‑data enhancements.

Apache Kylin · Baidu Maps · Hadoop

0 likes · 7 min read

Java High-Performance Architecture

Jan 24, 2016 · Big Data

MapReduce Explained: From Library Book Counting to Word Count in Big Data

This article introduces the MapReduce parallel processing model, illustrates its core map and reduce operations with a library‑shelf analogy and a classic word‑count example, and walks through each processing stage using clear diagrams to show how massive data is aggregated efficiently.

Big Data · Hadoop · MapReduce

0 likes · 5 min read

Qunar Tech Salon

Jan 11, 2016 · Big Data

Architecture of Taobao's Massive Data Products: From Data Sources to the Glider Middleware

The article details Taobao's massive data product architecture, describing a five‑layer system that processes billions of daily records using Hadoop, real‑time streams, distributed MySQL and HBase clusters, and a middleware layer called Glider that unifies queries, caching, and front‑end integration.

Big Data · Data Architecture · Hadoop

0 likes · 16 min read

Baidu Maps Tech Team

Jan 6, 2016 · Big Data

How Baidu Maps Scales Billion‑Row OLAP Queries with Apache Kylin

Baidu Maps’ Data Intelligence team built a large‑scale OLAP platform using Apache Kylin, detailing the challenges of multi‑dimensional analysis on billions of rows, the architecture, custom extensions for task, resource, and monitoring management, and performance optimizations that achieve millisecond‑level SQL responses.

Apache Kylin · Big Data · Data Warehouse

0 likes · 21 min read

Efficient Ops

Jan 5, 2016 · Information Security

How Apache Eagle Secures Hadoop: Real‑Time Big Data Threat Detection

Apache Eagle is an open‑source, distributed, real‑time security monitoring platform for Hadoop that combines stream‑processing, scalable policy enforcement, and machine‑learning user profiling to protect massive data assets across eBay’s production clusters.

Apache Eagle · Big Data · Hadoop

0 likes · 19 min read

Architect

Jan 5, 2016 · Big Data

Apache Eagle: eBay’s Open‑Source Real‑Time Hadoop Data Security Platform

The article provides a comprehensive technical overview of Apache Eagle, an open‑source, distributed, real‑time security monitoring and alerting platform for Hadoop developed by eBay, covering its motivation, architecture, core components, machine‑learning based detection, typical use cases, and future development directions.

Apache Eagle · Big Data · Data Security

0 likes · 15 min read

21CTO

Dec 30, 2015 · Big Data

Mastering Massive Data: MapReduce, Hadoop, and Taobao’s Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their roles in large‑scale data processing, and then examines Taobao’s massive‑data product architecture—including its data source, compute, storage, query, and product layers, as well as the MyFOX, Prom, and Glider components and caching strategies.

Data Architecture · Hadoop · MapReduce

0 likes · 16 min read

Qunar Tech Salon

Dec 13, 2015 · Big Data

Introduction to Distributed Computing: Sharding, Message Queues, Hadoop and MapReduce

This article explains the fundamentals of distributed computing, covering sharding algorithms, message‑queue based task distribution, an overview of Hadoop and its MapReduce model, and the characteristics of offline batch processing for large‑scale data workloads.

Distributed Computing · Hadoop · MapReduce

0 likes · 11 min read

Efficient Ops

Dec 9, 2015 · Big Data

Big Data Lessons from Baidu: Pitfalls, Language Choices, and NewSQL Insights

In this expert Q&A, Baidu’s senior big-data specialists reveal common project pitfalls, argue for Java in Hadoop-style systems, discuss MongoDB deployment, outline criteria for choosing open-source versus self-built solutions, and evaluate the viability of NewSQL/Spanner-type startups.

Data Engineering · Hadoop · MongoDB

0 likes · 8 min read

21CTO

Dec 9, 2015 · Big Data

Mastering Hadoop: From MapReduce Basics to Taobao’s Massive Data Architecture

This article introduces the fundamental MapReduce model and Hadoop framework, explains their components such as HDFS, MapReduce, and HBase, and then examines Taobao’s large‑scale data product architecture—including storage, computation, query, and caching layers—to illustrate practical big‑data processing techniques.

Data Architecture · Hadoop · MapReduce

0 likes · 17 min read

21CTO

Dec 3, 2015 · Big Data

How Netflix Scales Its Hadoop Data Warehouse on AWS with Genie PaaS

This article explains how Netflix leverages Amazon S3 and Elastic MapReduce to build a virtually unlimited, dynamically scalable Hadoop data warehouse in the cloud, and introduces Genie—a Hadoop platform‑as‑a‑service that abstracts job submission, resource management, and cluster orchestration.

AWS · Data Warehouse · Elastic MapReduce

0 likes · 15 min read

Architect

Dec 2, 2015 · Big Data

Designing an Agile Data Warehouse Architecture for Internet Companies

The article outlines a practical, end‑to‑end data platform architecture for internet businesses, covering data collection, storage and analysis, sharing, real‑time processing, task scheduling, and the importance of simplicity and agility in building an agile data warehouse.

Big Data · Data Architecture · Data Warehouse

0 likes · 10 min read

21CTO

Nov 26, 2015 · Big Data

Understanding Big Data: 4V Traits, Google’s Distributed Computing, and Hadoop Ecosystem

This article explores the 4V characteristics of big data, real‑world data growth examples, historical analogies, Google’s GFS‑MapReduce‑BigTable model, Hadoop’s architecture and HDFS processes, HBase components, NoSQL alternatives, and practical big‑data applications at Tencent and beyond.

Data Architecture · Distributed Computing · Hadoop

0 likes · 7 min read

Art of Distributed System Architecture Design

Nov 20, 2015 · Big Data

Design and Implementation of Alibaba Cloud's Cross‑Data‑Center Hadoop Cluster

In 2013 Alibaba Cloud faced full rack capacity in a single IDC, prompting the development of a multi‑NameNode, cross‑data‑center Hadoop solution that overcomes NameNode scalability, inter‑site bandwidth limits, data placement, job scheduling, massive data migration, and user transparency challenges.

Cloud · Cross‑Data‑Center · Distributed storage

0 likes · 14 min read

21CTO

Nov 19, 2015 · Big Data

Beyond Hadoop: Modern Big Data Platforms and Technologies Explained

This article surveys the evolution of Hadoop and its ecosystem, explains core storage and processing concepts, and introduces contemporary big‑data technologies such as Spark, Flink, Kafka, Lambda architecture, NoSQL databases, and cloud‑native solutions, highlighting their roles and trade‑offs.

Big Data · Flink · Hadoop

0 likes · 17 min read

Art of Distributed System Architecture Design

Oct 29, 2015 · Big Data

TalkingData’s Journey to Building a Mobile Big Data Platform with Spark and YARN

This article recounts how TalkingData progressively introduced Spark into its Hadoop‑YARN based mobile big‑data platform, detailing early architectures, migration challenges, performance gains, the fully Spark‑centric redesign with Kafka and Spark Streaming, encountered pitfalls, and future plans for further optimization.

Data Platform · Hadoop · Spark

0 likes · 16 min read

Architect

Oct 17, 2015 · Big Data

Designing an Agile Data Warehouse and Data Platform for Internet Companies

The article outlines the purposes, architecture, data ingestion, storage, analysis, sharing, application, real‑time processing, scheduling, monitoring, and best‑practice recommendations for building a fast, flexible, and reliable big‑data platform in the fast‑changing internet industry.

Big Data · Data Warehouse · Hadoop

0 likes · 12 min read

Efficient Ops

Oct 14, 2015 · Big Data

Spark vs Hadoop, Flink, HBase/Cassandra, Kafka & Tachyon: Expert Q&A

During a lively “Sit and Discuss” session, experts compared Spark and Hadoop, evaluated Flink against Spark, contrasted HBase with Cassandra, explained why Kafka (and sometimes Flink) is preferred for distributed messaging, and shared insights on Tachyon’s role in modern big‑data ecosystems.

Cassandra · Flink · HBase

0 likes · 10 min read

Art of Distributed System Architecture Design

Oct 10, 2015 · Artificial Intelligence

Integrating Deep Learning with Apache Hadoop: Caffe-on-Spark on GPU‑Enhanced Clusters

This article describes how Yahoo integrated deep learning into its massive Hadoop ecosystem by adding GPU nodes, using YARN and Spark to run Caffe at scale, and presents performance results on AlexNet and GoogLeNet alongside open‑source contributions.

Big Data · Caffe · GPU

0 likes · 9 min read

21CTO

Sep 28, 2015 · Cloud Computing

How Airbnb Scales on AWS: Cloud Architecture, Big Data, and Machine Learning Insights

Airbnb leverages AWS, Hadoop, Presto, Airflow, and custom machine‑learning tools to power its global marketplace, optimizing search, pricing, and data pipelines while achieving significant cost savings and operational efficiency.

AWS · Airflow · Big Data

0 likes · 7 min read

21CTO

Sep 27, 2015 · Big Data

How Weidian Built a Scalable Big Data Platform for Mobile Commerce

This article outlines the design and implementation of Weidian’s end‑to‑end big data processing platform, covering dataset definition, data collection via Flume‑based DataAgent, transmission through Databus, storage options such as HDFS, Kafka and Elasticsearch, and the monitoring and resource‑integration strategies that support massive mobile commerce logs.

Elasticsearch · Flume · Hadoop

0 likes · 18 min read

Efficient Ops

Aug 30, 2015 · Databases

Oracle’s Future: Cloud Migration, Big Data Integration, and the Post‑IOE Era

In a lively Q&A session, Oracle experts discuss how China’s “post‑IOE” shift, cloud migration, big‑data collaboration with Hadoop, and the strengths of In‑Memory, TimesTen, and Exadata shape the future direction of Oracle databases.

Database · Exadata · Hadoop

0 likes · 12 min read

MaGe Linux Operations

Aug 20, 2015 · Big Data

15 Must‑Try Resources to Master Hadoop Quickly

This article explains what Hadoop is, outlines its key features, and presents a curated list of 15 high‑quality tutorials, video courses, and books to help beginners and professionals efficiently learn Hadoop and its MapReduce ecosystem.

Data Engineering · Hadoop · Learning Resources

0 likes · 12 min read

Qunar Tech Salon

Aug 17, 2015 · Big Data

Comprehensive Overview of Open‑Source Big Data Tools and Platforms

This article presents a detailed, categorized catalogue of more than fifty open‑source big‑data projects—including Hadoop‑related utilities, analytics platforms, databases, BI solutions, data‑mining packages, query engines, programming languages, search tools, and in‑memory technologies—highlighting their primary functions, supported operating systems, and official links.

Analytics · Databases · Hadoop

0 likes · 31 min read

Qunar Tech Salon

Jul 8, 2015 · Big Data

Understanding Logs: The Foundation of Distributed Systems, Data Integration, and Stream Processing

This article explains how logs—simple, append‑only, time‑ordered records—serve as the core abstraction behind databases, distributed systems, data integration pipelines, and modern stream‑processing platforms such as Kafka and Hadoop, illustrating their design, scalability, and practical challenges.

Big Data · Data Integration · Hadoop

0 likes · 45 min read

Architect

Jul 6, 2015 · Big Data

Understanding Logs: The Core of Distributed Systems and Data Integration

This article explains how logs—simple, append‑only, time‑ordered records—serve as the fundamental abstraction behind databases, distributed systems, data integration pipelines, and stream‑processing platforms like Kafka and Hadoop, illustrating their role in ordering, replication, scalability, and real‑time analytics.

Data Integration · Hadoop · Kafka

0 likes · 48 min read

Efficient Ops

Jun 25, 2015 · Big Data

Inside Baidu’s 8‑Year Evolution of Hadoop and Distributed Computing

This article chronicles Baidu’s eight‑year journey from early Hadoop adoption to advanced MPI, DAG engines, and real‑time streaming platforms, detailing architectural milestones, performance optimizations, and practical lessons for large‑scale offline and online data processing.

Baidu · DAG · Distributed Computing

0 likes · 21 min read

Art of Distributed System Architecture Design

Jun 21, 2015 · Big Data

Design Choices for Distributed Storage Metadata: Comparing GlusterFS, Hadoop, GridFS, HBase, and FastDFS

The article examines various distributed storage design approaches—decentralized (GlusterFS), centralized (Hadoop), database‑based (GridFS and HBase), and metadata‑bypassing (FastDFS)—detailing their advantages, drawbacks, and practical considerations for cloud storage systems.

Distributed storage · FastDFS · GlusterFS

0 likes · 17 min read

Art of Distributed System Architecture Design

Jun 1, 2015 · Big Data

Overview of Big Data Technologies and Architectures

This article provides a comprehensive overview of major big‑data platforms such as Hadoop, Spark, Flink, Kafka, and related ecosystem components, explaining their core concepts, storage models, processing frameworks, and architectural patterns for handling massive, distributed datasets.

Hadoop · Kafka · NoSQL

0 likes · 18 min read

ITPUB

May 26, 2015 · Big Data

Step-by-Step Guide to Quickly Install and Configure Hive on Hadoop

This article provides a concise, practical walkthrough for installing and configuring Apache Hive on a Hadoop cluster, covering prerequisite HDFS and MapReduce setup, downloading Hive, extracting files, setting environment variables, configuring XML files, starting Hive, and verifying the installation with simple commands.

Configuration · ETL · HQL

0 likes · 4 min read

Suning Technology

May 22, 2015 · Big Data

Suning’s Big Data Platform Evolution: From SAP BW to Real‑Time Streaming

This article chronicles Suning’s journey from early SAP‑based data warehouses to a modern, open‑source big data platform featuring real‑time collection, Hadoop‑Hive offline processing, Storm‑based streaming, and a visual development environment, highlighting how each layer addresses growing data volume, variety, and business demands.

Data Architecture · Hadoop · Real-time Processing

0 likes · 5 min read
