Tagged articles

3675 articles

Jul 5, 2019 · Big Data

How SnapshotScanMR Doubles HBase‑to‑Hive ETL Speed and Relieves Cluster Load

This article explains how leveraging HBase's SnapshotScanMR feature to create a custom hbase2hiveBySnapshot task dramatically reduces region server pressure, halves ETL execution time, and improves cluster stability for large‑scale data back‑fill operations.

Big Data, ETL, HBase

0 likes · 6 min read

Big Data Technology & Architecture

Jul 2, 2019 · Big Data

Integrating Apache Flink with Apache Pulsar for Scalable Elastic Data Processing

This article explains how Apache Pulsar and Apache Flink can be combined to provide a unified, scalable, and fault‑tolerant data processing platform, covering Pulsar's architecture, its differences from other messaging systems, various integration patterns, and concrete code examples for stream and batch workloads.

Apache Flink, Apache Pulsar, Big Data

0 likes · 13 min read

21CTO

Jul 2, 2019 · Operations

How to Build Ultra‑Reliable Systems: Multi‑Level Caching, Isolation, and Monitoring Strategies

This article outlines practical techniques for achieving high system availability, covering multi‑level caching, dynamic group switching, database and service isolation across data centers, concurrency control, gray‑release deployment, comprehensive monitoring, graceful degradation, and data consistency models, with insights on leveraging big‑data pipelines for intelligent logistics.

Big Data, caching, canary release

0 likes · 10 min read

ITPUB

Jul 2, 2019 · Databases

How ClickHouse Powers Ctrip’s Hotel Data Platform for Billions of Daily Updates

This article explains how Ctrip’s hotel data intelligence platform handles over ten billion daily data updates and nearly a million queries by adopting ClickHouse, detailing the system's background, the reasons for choosing ClickHouse over other solutions, the data ingestion pipelines, monitoring strategies, operational practices, and performance outcomes.

Big Data, ClickHouse, Real-time analytics

0 likes · 13 min read

DataFunTalk

Jul 2, 2019 · Artificial Intelligence

From Zero to Autonomous Driving: Pony.ai’s Technical Journey

The article traces the evolution of autonomous driving from early concepts to modern implementations, highlighting Pony.ai’s technical innovations in sensor fusion, high‑definition mapping, simulation, data processing, software iteration, and the challenges of scaling vehicle fleets for commercial deployment.

AI, Big Data, Pony.ai

0 likes · 12 min read

58 Tech

Jul 2, 2019 · Artificial Intelligence

Magic Mirror: A Visual Data‑Intelligence Platform for Low‑Code Machine Learning

Magic Mirror is a big‑data‑based visual analytics platform that lowers the barrier to machine learning for non‑experts and accelerates expert workflows through a visual UI, modular algorithms, distributed feature generation, and automated binary‑classification modeling.

Automated Modeling, Big Data, Spark

0 likes · 9 min read

Alibaba Cloud Developer

Jul 1, 2019 · Big Data

Why Lambda, Kappa, and Lambda+ Are Shaping Modern Big Data Architecture

This article examines the technical challenges of large‑scale data processing, compares the classic Lambda and Kappa architectures, introduces the unified stream‑batch Lambda+ design built on Tablestore and Blink, and outlines suitable scenarios and practical solutions for modern big‑data systems.

Big Data, Cloud Computing, Kappa architecture

0 likes · 16 min read

Big Data Technology & Architecture

Jun 30, 2019 · Big Data

Curated Collection of Big Data, Flink, Hadoop and Real‑Time Computing Articles from the “Big Data Technology and Architecture” Series

This article presents a carefully organized catalogue of over a hundred technical posts covering Flink source‑code analysis, fundamental and advanced big‑data structures, Hadoop ecosystem components, real‑time streaming with Spark and Kafka, as well as system design guidelines and miscellaneous insights, each linked to its original publication for easy reference.

Big Data, Distributed Systems, Flink

0 likes · 6 min read

Big Data Technology & Architecture

Jun 29, 2019 · Big Data

Apache Flink 1.9 Feature Overview – Beijing Meetup (June 29)

On June 29, the Apache Flink Beijing Meetup presented a comprehensive analysis of Flink 1.9’s major architectural changes, new Table API & SQL capabilities, runtime and core enhancements, and future roadmap, with slides and resources made available for download.

Apache Flink, Big Data, Flink 1.9

0 likes · 2 min read

21CTO

Jun 28, 2019 · Fundamentals

Beijing’s Software Industry Surpasses Trillion-Yuan Mark: 2019 Report Highlights

The 2019 Beijing Software and Information Service Industry Development Report reveals that the sector’s scale exceeded one trillion yuan, with double‑digit growth in cloud computing, big data, AI and cybersecurity, while talent, investment, and regional collaboration propelled the city to a leading national position.

Beijing, Big Data, Information Security

0 likes · 9 min read

Architecture Digest

Jun 26, 2019 · Big Data

Guide to Setting Up Hadoop High Availability (HA) Cluster with HDFS and YARN

This article provides a step‑by‑step tutorial on configuring Hadoop high availability, covering HDFS HA architecture, Quorum Journal Manager synchronization, NameNode failover, YARN HA, required pre‑conditions, cluster planning, configuration files, service startup, and verification procedures.

Big Data, Cluster Setup, HDFS

0 likes · 16 min read

Big Data Technology & Architecture

Jun 24, 2019 · Big Data

Hive Optimization Techniques: Column/Partition Pruning, Predicate Pushdown, Join Strategies, and MapReduce Tuning

This article provides a comprehensive guide to improving Hive query performance by covering column and partition pruning, predicate pushdown, replacing ORDER BY with SORT BY, using GROUP BY instead of DISTINCT, fine‑tuning join operations, and optimizing MapReduce parameters such as mapper/reducer counts, file merging, compression, JVM reuse, parallel execution, strict mode, and storage formats.

Big Data, JOIN optimization, MapReduce

0 likes · 19 min read

Didi Tech

Jun 22, 2019 · Big Data

Analysis of Hadoop RPC Architecture and Implementation

The article examines Hadoop’s RPC framework—detailing its client‑server workflow, core classes (RPC, Client, Server), dynamic proxy handling, NIO‑based server threading, configurable concurrency controls such as FairCallQueue, and a practical HDFS mkdir command example, illustrating high‑performance distributed communication.

Big Data, Hadoop, RPC

0 likes · 17 min read

dbaplus Community

Jun 20, 2019 · Big Data

How Kafka Hits Million‑Message Throughput Using Page Cache and Zero‑Copy

Kafka achieves its ultra‑high throughput and low latency by writing data to the OS page cache, performing sequential disk writes, and employing zero‑copy techniques that eliminate unnecessary data copies during consumption, enabling tens of thousands to millions of messages per second.

Big Data, High Throughput, Kafka

0 likes · 8 min read

Big Data Technology & Architecture

Jun 20, 2019 · Big Data

Comprehensive Guide to Flink SQL: Background, New Features, Programming Model, Operators, Functions, and a Practical NBA Scoring Leader Example

This article provides an in‑depth overview of Flink SQL, covering its origins, the latest 1.7.0 and 1.8.0 enhancements, the underlying programming model, common operators and built‑in functions, and a complete end‑to‑end example that analyzes NBA scoring‑leader data using Flink SQL.

Apache Flink, Big Data, Flink SQL

0 likes · 27 min read

Suning Technology

Jun 20, 2019 · Fundamentals

How Suning’s Digital Transformation Is Shaping the Future of Retail

Suning’s senior tech leader explains how the company leveraged AI, big data, cloud computing and IoT to drive a digital‑first retail ecosystem, illustrating the broader shift toward intelligent, data‑driven retail operations in a rapidly changing market.

Artificial Intelligence, Big Data, Cloud Computing

0 likes · 4 min read

Big Data Technology & Architecture

Jun 19, 2019 · Big Data

Understanding Spark Structured Streaming StateStore: Architecture, Operations, and Fault Recovery

This article explains the design and implementation of Spark Structured Streaming's StateStore module, covering its distributed architecture, state sharding, versioning, batch read/write, migration, update/query APIs, maintenance compaction, and fault‑tolerance mechanisms that enable incremental continuous queries with exactly‑once guarantees.

Big Data, Spark, StateStore

0 likes · 8 min read

21CTO

Jun 17, 2019 · Big Data

Why Data Middle Platforms May Be the Biggest Opportunity of the Next 20 Years

The article explores the rapid rise of data middle platforms in China, tracing their historical roots, explaining their core purpose of unifying data across legacy and new systems, showcasing Shulan Technology’s real‑world implementations, and analyzing market dynamics and future opportunities for enterprises and startups alike.

Big Data, Data Middle Platform, data strategy

0 likes · 25 min read

Big Data Technology & Architecture

Jun 16, 2019 · Big Data

Understanding Data Warehouse Terminology: DB, DW, ODS, OLTP, OLAP, BI, and Data Mining

This article explains core data‑warehouse concepts—including DB, DW, ODS, OLTP, OLAP, BI, and the differing meanings of DM—as well as their relationships, integration examples, and why OLAP cannot replace data mining, providing a concise reference for beginners in data analytics.

BI, Big Data, OLAP

0 likes · 9 min read

AI Large-Model Wave and Transformation Guide

Jun 16, 2019 · Artificial Intelligence

Understanding AI's Four Core Elements: Data, Compute, Algorithms, and Scenarios

The article breaks down artificial intelligence into four essential components—massive data, powerful compute, effective algorithms, and real‑world scenarios—explaining each element with concrete analogies, hardware benchmarks, algorithm classifications, and a list of typical AI applications.

AI fundamentals, AI use cases, Big Data

0 likes · 5 min read

Programmer DD

Jun 15, 2019 · Big Data

How to Sync MySQL Data to Elasticsearch with Logstash: Step‑by‑Step Guide

This guide walks you through installing JDK, Logstash, Ruby, and required plugins, configuring Logstash to pull data from a MySQL table, and sending it to Elasticsearch, including code snippets, configuration files, and troubleshooting tips for a smooth data synchronization.

Big Data, Data Integration, Elasticsearch

0 likes · 6 min read

Xianyu Technology

Jun 14, 2019 · Big Data

Xianyu IFTTT: Scalable Real-Time User Relationship Platform

Xianyu IFTTT is a scalable real-time user-relationship platform that enriches metadata, enables bidirectional buyer-seller interactions, integrates quickly via SLS logs, uses a chain-of-responsibility for customizable lists, processes push actions with fatigue filtering, and stores TB-scale data in Lindorm, delivering billions of daily records and more than double the click-through rate of offline pushes.

Big Data, IFTTT, Real-Time

0 likes · 9 min read

Full-Stack Internet Architecture

Jun 14, 2019 · Big Data

How I Prepared for ByteDance (TouTiao) Interviews: Study Plan, Interview Experiences, and Practical Tips

An in‑depth personal account details how the author prepared for ByteDance’s (TouTiao) recruitment, outlining a month‑by‑month study schedule covering Java, big‑data technologies, algorithms, and system fundamentals, describing each interview round, sharing successful test strategies, and offering practical advice for landing offers at top tech firms.

Algorithms, Big Data, ByteDance

0 likes · 11 min read

Big Data Technology & Architecture

Jun 12, 2019 · Big Data

Comprehensive Guide to FlinkCEP: API Overview, Pattern Definitions, Quantifiers, Conditions, and Usage Examples

This article provides a detailed introduction to FlinkCEP, covering how to add the library, define simple and composite patterns, use quantifiers and conditions, handle skip strategies, time constraints, and select results, with complete Java and Scala code examples for complex event processing.

Big Data, CEP, Flink

0 likes · 27 min read

Dada Group Technology

Jun 11, 2019 · Big Data

Building and Evolving the Dada‑JD Daojia Big Data Platform: Architecture, Strategies, and Lessons Learned

This article presents a comprehensive case study of the Dada‑JD Daojia big data platform, detailing its evolution from a MySQL‑based warehouse to a multi‑layered One Data, One Platform, One Service, Many Apps architecture, the technical challenges faced, and the strategic approaches adopted to ensure coverage, accuracy, stability, and scalability.

Big Data, Data Governance, Data Platform

0 likes · 14 min read

Architecture Digest

Jun 11, 2019 · Databases

Database Optimization for Billion‑Scale Data: Partitioning, Sharding, and Vertical Splitting in MySQL

This article explains how a high‑traffic messaging platform with tens of millions of users and billions of daily records can be optimized using MySQL partitioning, sharding (both client‑side and proxy‑side), vertical database splitting, and practical migration scripts to maintain performance and availability.

Big Data, Database Optimization, MySQL

0 likes · 15 min read

Big Data Technology & Architecture

Jun 10, 2019 · Big Data

Understanding Spark SQL: Origin, Features, and Columnar Storage

This article explains the evolution of Spark SQL from Shark, describes its key features such as SchemaRDD and in‑memory columnar storage, compares row‑based and column‑based storage, and provides practical Scala code examples for creating DataFrames and loading data from various sources.

Big Data, JDBC, Parquet

0 likes · 16 min read

360 Tech Engineering

Jun 10, 2019 · Information Security

Design and Practice of Big Data Platform Security: Insights from 360’s Data Center Technical Director

In this interview, 360’s Big Data Center Technical Director Xu Hao discusses the critical data security challenges faced by enterprises, outlines regulatory, system‑level, and managerial risks, and shares practical strategies for building robust security governance, platform architecture, permission controls, and cloud‑based data protection.

Big Data, cloud security, data security

0 likes · 13 min read

Big Data Technology & Architecture

Jun 9, 2019 · Big Data

Optimizing Spark Shuffle: Can Fetch, Efficient Fetch, and Reliable Fetch

This article analyzes three Spark shuffle bottlenecks—oversized partitions that exceed Netty's 2 GB limit, excessive retry latency caused by dead executors, and insufficient data‑corruption checks—and presents concrete configuration changes, new block identifiers, executor‑liveness checks, and CRC‑32 verification to improve fetchability, efficiency, and reliability at scale.

Big Data, Shuffle, Spark

0 likes · 18 min read

Full-Stack Internet Architecture

Jun 8, 2019 · Big Data

The Story of Doug Cutting: From Stanford to Hadoop and Beyond

This article chronicles Doug Cutting's journey from his humble beginnings at Stanford through his pioneering work on Lucene, Nutch, and Hadoop, highlighting how his innovations in search and distributed computing reshaped the big data landscape and led to the rise of Cloudera.

Big Data, Cloudera, Doug Cutting

0 likes · 8 min read

21CTO

Jun 7, 2019 · Big Data

How to Build a Real-Time Big Data Sentiment Analysis Platform Using Lambda & Kappa

This article explores the design of a large‑scale, real‑time sentiment analysis system, detailing the data ingestion, processing, and storage requirements, comparing Lambda and Kappa architectures, and presenting an Alibaba Cloud solution that combines Tablestore and Blink for unified batch‑and‑stream processing.

Big Data, Kappa architecture, Lambda architecture

0 likes · 18 min read

Full-Stack Internet Architecture

Jun 7, 2019 · Backend Development

Comprehensive Guide to Autumn Recruitment: Strategies, Case Studies, and Interview Topics for Java and Big Data Positions

This article provides a detailed roadmap for autumn campus recruitment, covering the significance of the hiring season, tailored preparation strategies for different skill levels, multiple case studies, extensive interview question collections across Java, JVM, big data, and system fundamentals, as well as practical tips for resume polishing and interview mindset.

Algorithms, Big Data, career advice

0 likes · 18 min read

Tencent Cloud Developer

Jun 6, 2019 · Big Data

2019 Big Data Industry Summit Highlights and Outcomes

From June 4‑5, 2019, the China‑hosted Big Data Industry Summit gathered more than 4,000 attendees and 60,000 online viewers to present award winners, release multiple whitepapers and standards, and hold six thematic forums and two roundtables that examined data platforms, asset management, security, law, and emerging technologies, outlining current opportunities and future challenges for big‑data growth.

Big Data, China, Data Asset Management

0 likes · 14 min read

360 Quality & Efficiency

Jun 6, 2019 · Big Data

An Overview of Kafka: Introduction, Design Principles, and Common Issues

This article introduces Kafka, explains its core concepts and design principles, outlines typical use cases, and discusses common operational problems and troubleshooting tips for this high‑throughput distributed messaging system.

Big Data, Distributed Systems, Kafka

0 likes · 9 min read

Big Data Technology & Architecture

Jun 5, 2019 · Big Data

Real-Time Advertising Click Counting with Spark Structured Streaming and Redis Streams

This article presents a complete solution for real‑time advertising click counting using Spark Structured Streaming combined with Redis Streams, detailing the business scenario, data flow, input/output formats, and step‑by‑step implementation including data extraction, processing, storage, and query via Spark‑SQL.

Big Data, Redis Stream, Scala

0 likes · 11 min read

360 Zhihui Cloud Developer

Jun 4, 2019 · Big Data

Why Flink Outperforms Storm: Deep Dive into Stream Processing Performance

This article compares Apache Storm and Apache Flink for stream processing in terms of data transmission and reliability. It presents the benchmark design, test environment, and results for synthetic and Kafka data, and offers practical recommendations such as operator chaining, object reuse, and checkpoint strategies to maximize Flink performance.

Big Data, Flink, Performance Testing

0 likes · 13 min read

Architecture Digest

Jun 4, 2019 · Big Data

Overview of Taobao Cloud Computing Architecture and Data Synchronization Solutions

This article presents a comprehensive overview of Taobao's cloud computing architecture, detailing system components, various data synchronization methods such as TimeTunnel, Dbsync, and DataX, the scheduling system design, and metadata-driven analysis platforms for performance optimization and monitoring.

Big Data, Distributed Systems, Scheduling

0 likes · 11 min read

Big Data Technology & Architecture

Jun 3, 2019 · Big Data

Design and Implementation of Alibaba Cloud's 10PB+ Daily Log Service

This article presents an in‑depth interview with Alibaba Cloud senior expert Sun Tingtao, detailing the architecture, core features, design challenges, and operational strategies of the Alibaba Cloud Log Service that handles over 10 PB of daily log data for massive, diverse production workloads.

Alibaba Cloud, Big Data, Distributed Systems

0 likes · 12 min read

360 Tech Engineering

Jun 3, 2019 · Big Data

Performance Comparison of Apache Storm and Apache Flink from Data Transmission and Reliability Perspectives

This article presents a detailed performance benchmark comparing Apache Storm and Apache Flink in stream processing, focusing on data transmission methods, reliability mechanisms, operator chaining, and both self‑generated and Kafka‑sourced workloads, and provides practical optimization recommendations based on the results.

Big Data, Data Transmission, Flink

0 likes · 10 min read

Big Data Technology & Architecture

Jun 2, 2019 · Big Data

Tencent's Oceanus Real-Time Stream Computing Platform and Flink Optimizations

The article presents Tencent's evolution of real‑time stream processing using Flink, the design of the Oceanus one‑stop visual platform, and a series of deep extensions and optimizations—including UI redesign, JobManager failover, checkpoint handling, enhanced windows, LocalKeyBy, idle detection, and log isolation—aimed at supporting petabyte‑scale data workloads.

Big Data, Flink, Oceanus

0 likes · 16 min read

Java Captain

Jun 2, 2019 · Big Data

Comprehensive Guide to Autumn Recruitment: Strategies, Learning Paths, and Interview Questions for Java and Big Data Positions

This article provides a detailed roadmap for candidates preparing for the autumn recruitment season, covering interview experience sharing, systematic learning routes, project preparation, essential Java and big‑data technologies, core algorithms, and practical interview question collections to help readers avoid common pitfalls and succeed in securing offers.

Algorithms, Autumn Recruitment, Big Data

0 likes · 18 min read

Big Data Technology & Architecture

Jun 1, 2019 · Big Data

Understanding Spark Executor Memory Management: On‑Heap, Off‑Heap, and Unified Memory

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap memory planning, static and unified memory managers, storage and execution memory allocation, RDD persistence, eviction policies, and shuffle memory usage, providing practical guidance for performance tuning.

Big Data, Executor, Memory Management

0 likes · 23 min read

Big Data Technology Architecture

Jun 1, 2019 · Big Data

Impact of Excessive HBase Partitions and How to Calculate Reasonable Region Numbers

The article explains how excessive HBase partitions can cause frequent flushes, compaction storms, high memory usage, long master assignment times, and reduced MapReduce concurrency, and provides formulas and guidelines for calculating a reasonable number of regions per RegionServer.

Big Data, HBase, cluster stability

0 likes · 8 min read

Architecture Digest

May 31, 2019 · Operations

Running a 400+ Node Elasticsearch Cluster: Architecture, Scaling, and Performance Tuning

Meltwater details how it processes millions of daily media posts using a custom‑tuned Elasticsearch 1.7.6 cluster of over 400 nodes on AWS, covering data volume, query complexity, node configuration, indexing strategy, performance optimizations, and lessons learned for large‑scale search deployments.

AWS, Big Data, Elasticsearch

0 likes · 12 min read

Big Data Technology & Architecture

May 30, 2019 · Big Data

Data Skew Optimization Techniques in Spark

This article explains the phenomenon, causes, detection methods, and a comprehensive set of solutions—including Hive preprocessing, key filtering, shuffle parallelism, two‑stage aggregation, map‑join, sampling, random prefixing, and combined strategies—to mitigate data skew in Spark jobs and improve performance.

Big Data, Data Skew, Shuffle

0 likes · 31 min read

Big Data Technology & Architecture

May 29, 2019 · Cloud Native

Real-Time Computing Solutions with Flink and HBase: Architecture, Market Analysis, and Use Cases

The article presents Alibaba Cloud's real-time computing solution based on Flink and HBase, covering market competition, open‑source ecosystem, containerized architecture on Kubernetes, and typical applications such as online education video analysis, city‑brain traffic management, and fraud detection.

Big Data, Cloud Native, Flink

0 likes · 12 min read

ITPUB

May 29, 2019 · Big Data

How to Build a Trillion-Scale Real-Time Data Platform: Lessons from DTCC 2019

In a DTCC 2019 keynote, Zhao Qun, director of big‑data platform at Percent Point, outlines the challenges of trillion‑scale real‑time analytics and presents a transparent, fine‑grained architecture built on Kafka, Spark Streaming, ClickHouse, HBase, Ceph and Elasticsearch, detailing design principles, component sizing, multi‑center deployment, performance testing and operational safeguards.

Architecture, Big Data, Kafka

0 likes · 17 min read

dbaplus Community

May 28, 2019 · Big Data

Mastering Kafka: Deep Dive into Architecture, Production, Consumption, and Transactions

This article provides a comprehensive technical guide to Kafka, covering its distributed architecture, producer and consumer workflows, partition and leader management, message delivery semantics, exactly‑once guarantees, transaction handling, file organization, and key configuration parameters.

Big Data, Kafka, message queues

0 likes · 18 min read

Big Data Technology & Architecture

May 28, 2019 · Big Data

Optimizing Flink Shuffle: New Flow‑Control Mechanism, Serialization Improvements, and Architecture Refactoring

The article explains how Flink's shuffle pipeline—from upstream data serialization to downstream consumption—is optimized through a credit‑based flow‑control mechanism, zero‑copy network buffers, broadcast serialization reduction, external shuffle service, and a plugin‑based shuffle manager, resulting in significant performance gains for both streaming and batch jobs.

Big Data, Flink, Flow Control

0 likes · 15 min read

MaGe Linux Operations

May 28, 2019 · Big Data

Recreating Google Ngram Trends with Python, PyTubes, and NumPy

This article demonstrates how to download the Google 1‑gram dataset, load and filter billions of rows with the PyTubes library, compute yearly word frequencies using NumPy, and reproduce the classic Python usage trend chart while discussing performance considerations and future improvements.

Big Data, Google Ngram, NumPy

0 likes · 9 min read

Big Data Technology & Architecture

May 26, 2019 · Big Data

Apache Flink at Didi: Platformization, Production Practices, and StreamSQL

This article describes how Didi adopted Apache Flink for its real‑time data streams, detailing the platformized architecture, production use cases such as ETL, monitoring and CEP, the evolution of StreamSQL, and the engineering improvements made to support large‑scale, low‑latency processing.

Big Data, Didi, Flink

0 likes · 14 min read

21CTO

May 24, 2019 · Operations

How Meituan’s R&D Team Cut Tens of Millions in Resource Costs: A Practical Guide

This article details Meituan's R&D team's systematic PDCA‑based approach to resource cost optimization, covering methodology definition, planning, execution, checking, and iterative improvement across infrastructure, big‑data, and shared services, ultimately saving tens of millions of yuan.

Big Data, Cost Optimization, Operations

0 likes · 22 min read

dbaplus Community

May 21, 2019 · Big Data

How to Supercharge Elasticsearch Queries on Billions of Records

This article explains why Elasticsearch can be slow on massive datasets, then details practical techniques—leveraging filesystem cache, pre‑heating hot data, separating hot and cold indices, designing lean document models, and avoiding deep pagination—to achieve sub‑second query performance at billions‑scale.

Big Data, Elasticsearch, data modeling

0 likes · 11 min read

Big Data Technology & Architecture

May 19, 2019 · Big Data

Implementing End-to-End Exactly-Once Semantics in Apache Flink with Apache Kafka Using Two-Phase Commit Sink

This article explains how Apache Flink’s TwoPhaseCommitSinkFunction, introduced in version 1.4, enables end-to-end exactly-once semantics when integrated with Apache Kafka, detailing the checkpoint mechanism and the two-phase commit protocol that ensures reliable data processing.

Apache Flink · Apache Kafka · Big Data

0 likes · 4 min read

Qunar Tech Salon

May 16, 2019 · Big Data

Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar

This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that reduces multi‑petabyte migration time from hours to a few.

Big Data · Data Migration · FastCopy

0 likes · 8 min read

dbaplus Community

May 13, 2019 · Big Data

Tackling HDFS Performance Bottlenecks: Real‑World Optimizations from VIP.com

This article examines the performance challenges encountered after upgrading a large‑scale HDFS cluster at VIP.com, explains the root causes of NameNode RPC latency, and presents concrete solutions—including delayed block reports, configurable block deletion, federation redesign, client monitoring, temp‑directory sharding, and small‑file handling—along with configuration snippets and real‑world results.

Big Data · Federation · HDFS

0 likes · 13 min read

DataFunTalk

May 13, 2019 · Artificial Intelligence

Financial Risk Management: Business Requirements and Technical Solutions

This article presents a comprehensive overview of financial risk management, detailing business challenges such as identity verification and fraud, and describing technical solutions including feature engineering, sample handling, model optimization, and online validation, emphasizing the integration of data-driven AI techniques throughout the process.

Big Data · Risk management · financial modeling

0 likes · 13 min read

Big Data Technology & Architecture

May 12, 2019 · Big Data

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

This article explains Spark Streaming’s architecture, core concepts such as DStream, windowing, and the two Kafka integration methods—Receiver-based and Direct approaches—detailing their configurations, memory implications, checkpointing, and best‑practice recommendations for reliable, high‑throughput real‑time data processing.

Big Data · Direct Approach · Receiver Approach

0 likes · 18 min read

Architecture Digest

May 11, 2019 · Cloud Native

Ant Financial’s Fifteen‑Year Technology Architecture Evolution and the Future of FinTech

In a QCon 2019 talk, Ant Financial’s deputy CTO Hu Xi outlines the company’s fifteen‑year journey reshaping payments and micro‑loans through blockchain, AI, security, IoT and cloud computing, and details the emerging cloud‑native, high‑availability, data‑intelligent architecture that will underpin the next generation of financial technology.

Artificial Intelligence · Big Data · Blockchain

0 likes · 16 min read

DataFunTalk

May 10, 2019 · Artificial Intelligence

Pony.ai Infrastructure Overview: Vehicle Systems, Simulation Platform, and Data Architecture

The article presents a comprehensive overview of Pony.ai's autonomous driving infrastructure, covering the core infrastructure team’s responsibilities, vehicle onboard systems, simulation platform, data architecture, and supporting services, while discussing the technical challenges and engineering practices employed to achieve scalability, reliability, and high performance.

AI · Big Data · Infrastructure

0 likes · 14 min read

Alibaba Cloud Developer

May 10, 2019 · Cloud Native

How Ant Group Built a Cloud‑Native, Financial‑Grade Architecture Over 15 Years

Ant Group’s former CTO Hu Xi outlines the 15‑year evolution of its fintech architecture, highlighting the five BASIC technologies—blockchain, AI, security, IoT, and cloud computing—while detailing the shift to cloud‑native, distributed middleware, OceanBase, service mesh, risk‑auto‑recovery, and open‑intelligent data platforms.

Big Data · Blockchain · Distributed Systems

0 likes · 18 min read

AntTech

May 9, 2019 · Cloud Native

Ant Financial’s Fifteen‑Year Technology Architecture Evolution and the Future of FinTech

The article reviews Ant Financial’s fifteen‑year journey reshaping payments and micro‑loans through blockchain, AI, security, IoT and cloud computing, explains how distributed middleware, OceanBase, service‑mesh‑based cloud‑native infrastructure and open intelligent computing architectures enable high‑availability, scalable financial services, and introduces the BASIC College talent program.

Artificial Intelligence · Big Data · Blockchain

0 likes · 16 min read

dbaplus Community

May 7, 2019 · Big Data

Why Kafka Achieves Million‑Level Throughput: Sequential Writes, mmap, and Zero‑Copy

This article explains how Kafka attains high throughput by using sequential disk writes, memory‑mapped files, sendfile zero‑copy, and batch compression, detailing both write and read path optimizations and their impact on performance.

Batch Compression · Big Data · High Throughput

0 likes · 8 min read

DataFunTalk

May 7, 2019 · Databases

Time Series Data Platform: Business Scenarios, Architecture, and Core Technologies

This article introduces the main business scenarios, system architecture, and core technologies of a time‑series data platform, covering data collection, processing, storage, analysis, and the specific features required for high‑performance, scalable, and reliable time‑series data management.

Architecture · Big Data · TSDB

0 likes · 10 min read

Big Data Technology & Architecture

May 5, 2019 · Databases

Designing Effective RowKeys in HBase

This article explains why HBase rowkey design is critical for performance, outlines common interview expectations, and provides visual guidelines to help developers create efficient rowkeys for production workloads, including best‑practice tips on key length, salting, and ordering to avoid hotspotting.

Big Data · Database design · rowKey

0 likes · 1 min read

Didi Tech

May 1, 2019 · Artificial Intelligence

New Generation AI Empowering the Era of Smart Mobility – Insights from Didi’s Chief Scientist Tang Jian

Chief Scientist Tang Jian explains how Didi leverages next‑generation AI—big‑data, hybrid‑augmented, autonomous, and collective intelligence—to transform smart mobility through advanced dispatch, safety systems, in‑car perception, traffic‑signal optimization, and global collaborations, while confronting challenges of model scale, computing power, and safety assurance.

Artificial Intelligence · Big Data · Didi

0 likes · 11 min read

21CTO

Apr 29, 2019 · Big Data

How EasyScheduler Powers Scalable Big Data Workflow Management

EasyScheduler is an open‑source big‑data workflow scheduler that uses a decentralized architecture with Master and Worker nodes coordinated via ZooKeeper, supporting DAG‑based task definitions, various task types, fault tolerance, priority handling, distributed locks, and remote logging, all illustrated with detailed component diagrams.

Big Data · DAG · Distributed Systems

0 likes · 17 min read

Big Data Technology & Architecture

Apr 29, 2019 · Big Data

Understanding Retract Updates in FlinkSQL: Append vs Retract Modes

FlinkSQL's retract updates allow handling of data modifications in streaming queries by using toRetractStream, contrasting with the append-only toAppendStream mode, and this article explains the differences, when each mode applies, and provides illustrative examples and visual diagrams.

Append Mode · Big Data · FlinkSQL

0 likes · 3 min read

Youzan Coder

Apr 29, 2019 · Big Data

Optimizing Flink Sliding Windows for Super Long Time Ranges

To overcome severe performance degradation of Flink sliding windows over very long time ranges, Youzan engineers applied time‑slicing based on the greatest common divisor of window length and slide step, reducing state writes and timers, which yielded 3‑8× speedups in production.

Big Data · Flink · Real-time Processing

0 likes · 18 min read

Big Data Technology & Architecture

Apr 28, 2019 · Databases

Introduction to HBase: Architecture, Concepts, and Common Commands

This article introduces HBase, a distributed column‑oriented NoSQL database built on Hadoop, explains its architecture, data model, key concepts such as rowkeys, column families, timestamps, regions, and ZooKeeper, outlines its main features and typical use cases, and provides common HBase shell commands with examples.

Big Data · Hadoop · NoSQL

0 likes · 21 min read

Big Data Technology & Architecture

Apr 24, 2019 · Big Data

Hive SQL Optimization Techniques and Best Practices

This article provides a comprehensive guide to Hive SQL performance tuning, covering optimization goals, common pitfalls, execution flow, table and job settings, map, shuffle, reduce, and query-level improvements such as join, bucket join, group‑by, and count‑distinct optimizations.

Big Data · Hadoop · hive

0 likes · 11 min read

Efficient Ops

Apr 23, 2019 · Information Security

How Situational Awareness Transforms Modern Cybersecurity Defense

The article explains how situational awareness—covering pre‑attack, during‑attack, and post‑attack stages—leverages big data, AI, threat intelligence, UEBA and visualization to turn security platforms into proactive “security brains,” while also critiquing current product implementations and market practices.

Big Data · Threat Intelligence · UEBA

0 likes · 14 min read

Big Data Technology & Architecture

Apr 23, 2019 · Databases

Implementing Row-to-Column Pivot in Hive: Traditional and Map Approaches

This article explains how to perform row-to-column transformations (pivot) in Hive using two methods: a traditional SQL approach mimicking Oracle/SQL Server pivot syntax and a more concise map-based technique, comparing their syntax, performance, and memory considerations.

Big Data · MAP · Pivot

0 likes · 3 min read

Didi Tech

Apr 23, 2019 · Big Data

Travel Time Index (TTI): Evaluation Methods, Calculation, and Validation Using Didi Trajectory Data

The Travel Time Index (TTI) quantifies urban congestion by comparing actual travel time to free‑flow conditions, and this study details domestic and international evaluation methods, free‑flow speed estimation, weight calculation, link extraction via PostGIS, system architecture, and validation using massive Didi trajectory data to support city traffic management.

Big Data · GIS · PostGIS

0 likes · 9 min read

Youku Technology

Apr 22, 2019 · Artificial Intelligence

Exploring the Construction of an Entertainment Brain: AI and Big Data Practices in the Fish Brain Platform

The talk introduces Alibaba’s Fish Brain platform, an AI‑powered decision‑support system for entertainment that combines a three‑layer data‑model, AI‑processed basic data, and application models, leveraging NLP, computer‑vision, custom embeddings, loss functions and predictive hybrid networks to analyze content, user behavior, and forecast performance.

AI · Big Data · Embedding

0 likes · 12 min read

Big Data Technology & Architecture

Apr 21, 2019 · Big Data

Overview of Hive Data Warehouse, Its Architecture, Query Processing, and Comparison with Impala

This article provides a comprehensive overview of Hive as a Hadoop‑based data warehouse, explains its architecture, query‑to‑MapReduce translation, high‑availability design, and compares its batch‑oriented processing with Impala's low‑latency SQL engine for big data analytics.

Big Data · Impala · MapReduce

0 likes · 15 min read

Big Data Technology & Architecture

Apr 20, 2019 · Big Data

Weekly Hadoop Knowledge Points: Compression Formats, MapReduce Join, Hive Setup, and YARN Capacity Scheduler

This weekly bulletin summarizes four Hadoop knowledge points—compression formats, MapReduce join techniques, Hive installation, and YARN Capacity Scheduler—while also sharing personal updates about a PhD graduation, the upcoming May Day holiday, and a request for likes and shares.

Big Data · Hadoop · MapReduce

0 likes · 2 min read

Didi Tech

Apr 18, 2019 · Big Data

Big Data-Driven Smart Transportation Lecture by Didi and High Education Community

Didi’s vice‑president and chief scientist of smart transportation, Professor Henry Liu, delivered the “Big Data‑Driven Smart Transportation” lecture—part of the AI Industry Applications course—on China University MOOC, teaching students fundamental concepts, real‑world cases, and future prospects of big‑data and AI in traffic management.

Artificial Intelligence · Big Data · Didi

0 likes · 3 min read

Alibaba Cloud Developer

Apr 18, 2019 · Big Data

How MaxCompute Evolved: 10 Years of Big Data Innovation at Alibaba

This article reviews a decade of MaxCompute development, covering its origins, core technologies, performance gains, ecosystem integration, intelligent features, competitive positioning, and commercialization, while highlighting the platform's role as Alibaba's central big‑data compute engine.

AI integration · Big Data · MaxCompute

0 likes · 21 min read

Big Data Technology & Architecture

Apr 17, 2019 · Big Data

Step-by-Step Guide to Installing Hive 2.1.0 on a Hadoop 2.7.1 Cluster (Ubuntu 14.04)

This tutorial provides a comprehensive, step-by-step procedure for setting up Hive 2.1.0 on a Hadoop 2.7.1 cluster running Ubuntu 14.04, covering environment preparation, Hive installation, configuration of environment variables, MySQL metastore integration, client setup, service startup, and basic verification commands.

Big Data · Hadoop · Installation

0 likes · 8 min read

DataFunTalk

Apr 17, 2019 · Artificial Intelligence

Evolution of Ctrip Financial Risk Control Models: From Data Platform to AI‑Driven Scoring and Anti‑Fraud Systems

This report details Ctrip Financial's end‑to‑end risk control development, covering business overview, a three‑layer data platform, the progression of credit scoring and anti‑fraud models from rule‑based to advanced AI techniques, and the evaluation, monitoring, and social‑network‑based fraud detection strategies employed.

Big Data · Financial AI · anti-fraud

0 likes · 16 min read

dbaplus Community

Apr 16, 2019 · Big Data

Scaling Elasticsearch for Billions of Daily Events: Cluster Planning, Routing & Hot‑Warm Tips

This article explains how to handle a real‑time OLAP monitoring platform processing 10‑12 billion daily events and 400 billion yearly records by optimizing Elasticsearch 5.3.3 through cluster planning, storage strategies, index sharding, compression, hot‑warm architecture, routing, index templates, rollover, and cross‑cluster search, providing concrete configurations and code examples.

Big Data · Cluster Planning · Elasticsearch

0 likes · 23 min read

Big Data Technology & Architecture

Apr 15, 2019 · Big Data

Map‑Side Join and Reduce‑Side Join Examples in Hadoop MapReduce (Java)

This article provides two reusable Java code samples that demonstrate how to perform a map‑side join and a reduce‑side join in Hadoop MapReduce, enabling efficient joining of a large dataset with a smaller reference table.

Big Data · Hadoop · JOIN

0 likes · 8 min read

21CTO

Apr 15, 2019 · Big Data

Mastering High‑Concurrency Big Data: Sharding, Partitioning, and Index Strategies

This article explores practical techniques for handling massive, high‑concurrency data workloads, covering relational database limits, read/write separation, vertical and horizontal sharding, key selection, archival to NoSQL stores, and the use of heterogeneous index tables to maintain performance.

Big Data · Partitioning · database scaling

0 likes · 6 min read

Alibaba Cloud Developer

Apr 15, 2019 · Artificial Intelligence

Why Deep Learning Finally Succeeded and What Challenges Lie Ahead

This article reviews Jia Yangqing’s insights on why deep learning finally succeeded—highlighting the roles of big data and high‑performance computing—while examining its current limitations, emerging challenges, and future opportunities across AI engineering, AutoML, and hardware‑software co‑design.

AI Challenges · AI Engineering · AutoML

0 likes · 9 min read

Big Data Technology & Architecture

Apr 12, 2019 · Big Data

Weekly Knowledge Summary: Yarn Resource Scheduler, Hadoop Rack Awareness, HDFS Data Flow, and Small File Solutions

This weekly note shares personal updates and a concise technical overview covering Yarn's resource scheduling, Hadoop's rack‑aware architecture, HDFS data flow, and practical solutions to the HDFS small‑file problem, along with links to further reading and upcoming work plans.

Big Data · HDFS · Hadoop

0 likes · 5 min read

System Architect Go

Apr 11, 2019 · Big Data

Introduction to Apache Kafka: Core Concepts, Message Delivery, Partition Storage, and Consumption

This article introduces Apache Kafka as a distributed streaming platform, explaining its three core capabilities, key concepts such as producers, topics, brokers, partitions and consumers, and detailing how messages are delivered, stored in partitions, and consumed by consumer groups.

Big Data · Distributed Streaming · Kafka

0 likes · 8 min read

Architecture Digest

Apr 11, 2019 · Big Data

Understanding Hadoop and HBase: Installation, Configuration, and Basic Operations

This guide introduces Hadoop and HBase fundamentals, explains their architectures and advantages, and provides step‑by‑step instructions for setting up a multi‑node Hadoop cluster, configuring core services, installing HBase, and performing basic HBase shell operations.

Big Data · HBase · Hadoop

0 likes · 18 min read

JD Retail Technology

Apr 10, 2019 · Databases

HBase at JD.com: Architecture, Use Cases, and Evolution

This article explains how JD.com leverages the open‑source HBase database for massive, low‑latency data storage across various business lines, detailing its architecture, multi‑tenant isolation, disaster‑recovery mechanisms, and integration with Phoenix SQL for OLTP workloads.

Big Data · Database Architecture · HBase

0 likes · 13 min read

Java Captain

Apr 9, 2019 · Big Data

Kafka FAQs: Zookeeper Dependency, Retention Policies, Cleanup Rules, Performance Bottlenecks, and Cluster Best Practices

This article answers common Kafka questions, explaining why Kafka cannot operate without Zookeeper, describing its two retention strategies based on time and size, detailing how simultaneous time‑ and size‑based cleanup works, listing performance bottlenecks, and offering practical guidelines for sizing and configuring Kafka clusters.

Big Data · Cluster Design · Kafka

0 likes · 2 min read

Big Data Technology & Architecture

Apr 8, 2019 · Big Data

Understanding HDFS Data Blocks, Rack Awareness, and Dynamic Node Addition

This article explains how HDFS stores files in replicated data blocks, implements rack awareness to improve reliability and performance, shows the necessary configuration in core-site.xml, provides sample scripts, and demonstrates how to add new DataNode machines without restarting the NameNode.

Big Data · Data Block · Dynamic Node Addition

0 likes · 10 min read

Big Data Technology & Architecture

Apr 7, 2019 · Big Data

Understanding YARN: Background, Architecture, and Execution Process

This article explains why YARN was created to overcome the limitations of MapReduce 1.x, describes its architecture—including ResourceManager, NodeManager, ApplicationMaster, Container, and Client—and outlines the step‑by‑step execution flow that enables multiple computation frameworks to run on Hadoop.

Big Data · Hadoop · YARN

0 likes · 11 min read

Youzan Coder

Apr 7, 2019 · Industry Insights

How Youzan Scaled Order Search: Hot‑State Indexing and AKF Expansion

This article reviews the evolution of Youzan's order search architecture over two years, detailing challenges from data growth, the creation of a hot‑state index covering half of search traffic, time‑sharded indexes, and the AKF expansion cube that guides multi‑axis scalability.

Big Data · Elasticsearch · Scalability

0 likes · 10 min read

Big Data Technology & Architecture

Apr 4, 2019 · Big Data

Weekly Knowledge Points: Interview Reflections, Hadoop Introduction, MapReduce and HDFS Overview

This weekly briefing shares five curated resources covering interview reflections, a concise Hadoop introduction, the principles of MapReduce, an overview of HDFS, and upcoming plans to study Hive and HBase, emphasizing the distributed nature of big‑data processing.

Big Data · HDFS · Hadoop

0 likes · 3 min read

Big Data Technology & Architecture

Apr 3, 2019 · Big Data

Understanding RAID and Its Role in HDFS Architecture

This article explains the storage challenges of big data, introduces RAID technologies and their variants, and shows how the principles of RAID are applied in the Hadoop Distributed File System (HDFS) to achieve scalable, reliable, and high‑performance data storage and processing.

Big Data · HDFS · RAID

0 likes · 10 min read

Alibaba Cloud Developer

Apr 3, 2019 · Cloud Computing

What Alibaba Cloud’s New President Reveals About the Future of Cloud Computing

In a candid interview, Alibaba Cloud’s new president discusses how pricing is just a starting point, the shift from open‑source to self‑developed data platforms, the rapid growth of hybrid cloud, security priorities, the role of AI, the evolution of the middle‑platform concept, ecosystem integration, and the strategic focus on scaling, public‑cloud share, and partner collaboration to drive Alibaba Cloud’s future growth.

AI · Alibaba Cloud · Big Data

0 likes · 31 min read

Alibaba Cloud Developer

Apr 3, 2019 · Cloud Computing

What’s Next for Cloud Computing? Insights from Alibaba Cloud’s New President

In a detailed interview, Alibaba Cloud’s new president discusses the future of cloud computing, emphasizing the shift from price competition to core value, the importance of hybrid cloud, data processing platforms, open‑source challenges, AI integration, ecosystem strategy, and the evolving role of the cloud as a platform and integrated service.

Alibaba Cloud · Artificial Intelligence · Big Data

0 likes · 28 min read

Big Data Technology & Architecture

Apr 2, 2019 · Big Data

Understanding Hadoop MapReduce: Programming Model, WordCount Example, and Job Execution Mechanism

The article explains Hadoop's MapReduce framework as both a programming model and execution engine, detailing its map and reduce phases, the WordCount example code, job startup components, data shuffling, partitioning, and how large‑scale distributed computations are orchestrated across a cluster.

Big Data · Hadoop · MapReduce

0 likes · 10 min read

Alibaba Cloud Native

Apr 2, 2019 · Big Data

Inside Spark Operator: How Kubernetes Manages Spark Jobs End‑to‑End

This article explains the internal architecture of Spark Operator, covering Kubernetes operator fundamentals, CRD definitions, code layout, job submission flow, state machine handling, monitoring integration, and troubleshooting techniques for reliable Spark workloads on Kubernetes.

Big Data · CRD · Go

0 likes · 11 min read

Programmer DD

Apr 2, 2019 · Backend Development

From Freshman to Senior Engineer: A Developer’s Journey Through Java, Spring, and Big Data

This article chronicles a Chinese computer science graduate’s step‑by‑step evolution from learning basic C and Java in university to building campus apps, winning software contests, mastering Spring, Hadoop, Elasticsearch, and Neo4j, and ultimately landing offers from top tech firms, illustrating the challenges and perseverance required for a successful software engineering career.

Big Data · career · java

0 likes · 13 min read
