Tagged articles

3675 articles

Oct 15, 2017 · Big Data

How JD Built a Scalable Seller Log Platform with Kafka, Storm, ES & HBase

This article details JD's end‑to‑end seller log system architecture, explaining why Kafka, Storm, Elasticsearch and HBase were chosen, the challenges faced during scaling, and the practical solutions implemented to achieve a unified, high‑throughput logging platform for merchants and operations.

Big Data, Elasticsearch, HBase

0 likes · 13 min read

Alibaba Cloud Developer

Oct 15, 2017 · Information Security

How Alibaba’s Data Security Maturity Model (DSMM) Is Shaping China’s Data Protection Landscape

The article explains Alibaba's Data Security Maturity Model (DSMM), its partnership program, the involvement of 17 leading security firms, and how the model aims to improve data security capabilities across industries by establishing standardized assessment criteria and fostering ecosystem collaboration.

Alibaba, Big Data, DSMM

0 likes · 10 min read

ITFLY8 Architecture Home

Oct 12, 2017 · Backend Development

How Taobao Scaled Its Backend Architecture Over Time

This article lists its learning objectives, traces the evolution of Taobao's backend architecture from V1.0 to V3.0, highlights the technical challenges faced at each stage, and explains the architectural decisions (modularization, service‑oriented frameworks, distributed storage, and large‑scale monitoring) that enabled massive scalability, reliability, and performance improvements.

Architecture, Backend, Big Data

0 likes · 6 min read

Baidu Intelligent Testing

Oct 9, 2017 · Big Data

User Behavior Analysis: From Data Acquisition to Funnel Insights

The article explains how to move beyond macro app metrics by collecting offline and real‑time user data, storing it in HDFS, processing it with Spark, visualizing behavior paths as state‑machine trees, and performing branch‑funnel analysis to uncover conversion bottlenecks and improve product quality.

Analytics, Big Data, Funnel Analysis

0 likes · 5 min read

ITPUB

Sep 30, 2017 · Big Data

Designing Scalable Open‑Source ETL Systems: Lessons from Baidu Waimai

This talk details Baidu Waimai's end‑to‑end ETL design, covering demand sources, data flow patterns, multi‑stage system evolution, storage choices, scheduling architecture, configuration‑driven processing, quality monitoring, and how data lineage enables transparent, self‑service data delivery.

Big Data, Data Quality, ETL

0 likes · 25 min read

Tongcheng Travel Technology Center

Sep 29, 2017 · Big Data

Evolution of Monitoring Architecture and Traffic Alert Algorithms at Tongcheng Travel

This article describes how Tongcheng Travel’s monitoring system evolved from a monolithic design to a distributed and big‑data‑based architecture, introducing real‑time processing with Storm, machine‑learning‑enhanced alerts, and a multivariate linear regression model that dramatically improves traffic anomaly detection accuracy.

Big Data, Real-time Processing, architecture evolution

0 likes · 10 min read

ITPUB

Sep 29, 2017 · Big Data

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

In this talk, a Baidu Waimai engineer explains the motivations, requirements, and architectural choices behind their open‑source ETL platform, covering data flow patterns, logical mappings, storage options, scheduling, metadata management, and quality monitoring to achieve scalable, transparent, and explainable data delivery.

Big Data, ETL, Scheduling

0 likes · 26 min read

21CTO

Sep 25, 2017 · Big Data

How Meitu Scaled Its Billion-User Data Analytics: Architecture Evolution and Lessons

This article explains how Meitu built and evolved a large‑scale data statistics platform to handle billions of users, detailing the challenges of growing data volume, the architectural shifts from simple scripts to Hadoop, and the design of modular components for job management, scheduling, execution, and future expansion.

Big Data, Data Platform, Hadoop

0 likes · 16 min read

Qunar Tech Salon

Sep 25, 2017 · Big Data

Comprehensive Guide to Spark Ecosystem: Data Warehouse, Machine Learning, Streaming, and Enterprise Use Cases

This article provides an extensive overview of Apache Spark’s ecosystem—including its data‑warehouse capabilities, ML/MLlib libraries, streaming with Spark Streaming, external frameworks, and real‑world enterprise case studies—while also noting a promotional announcement for a React Native conference.

Big Data, Kafka, Spark

0 likes · 21 min read

ITPUB

Sep 22, 2017 · Big Data

How Baidu Waimai Scaled Traffic Analysis with Apache Kylin: A Deep Dive

This article presents a detailed case study of Baidu Waimai's traffic analysis platform, outlining the data challenges of high dimensionality and volume, the evaluation of OLAP engines, the adoption of Apache Kylin for pre‑computation, the end‑to‑end data modeling, cube construction, incremental builds, and integration with Saiku‑Mondrian reporting, while sharing practical lessons and performance gains.

Apache Kylin, Big Data, OLAP

0 likes · 29 min read

Meituan Technology Team

Sep 21, 2017 · Big Data

Feature Production Scheduling: Architecture Evolution and Core Technologies

Using Meituan‑Dianping’s hospitality online feature system as a case study, the article describes how feature production scheduling evolved from offline batch ETL to automated, metadata‑driven pipelines and sub‑second streaming, detailing the underlying architecture, incremental updates, storage abstraction, write‑shaving, atomicity, and recovery mechanisms.

Big Data, Real-time Processing, System Architecture

0 likes · 23 min read

Ctrip Technology

Sep 20, 2017 · Big Data

Building a Real‑Time Computing Platform with Spark Streaming at Ctrip: Design, Implementation, and Lessons Learned

This article describes how Ctrip migrated its large‑scale real‑time platform from JStorm to Spark Streaming, detailing the architectural design, the Muise Spark Core encapsulation, operational metrics, encountered pitfalls, and future plans to adopt Flink and Beam for streaming workloads.

Big Data, Exactly-Once, Spark Streaming

0 likes · 22 min read

Alibaba Cloud Developer

Sep 19, 2017 · Artificial Intelligence

Inside Alibaba’s 2017 Tech Forum: AI, Big Data, and Cloud Innovations Unveiled

At the inaugural 2017 Alibaba Technology Forum held at Hong Kong University of Science and Technology, senior executives highlighted Alibaba’s cutting‑edge AI, machine learning, big‑data, and cloud breakthroughs, showcasing how data‑driven technologies power billions of users across e‑commerce, finance, logistics, healthcare, and entertainment.

Big Data, Cloud Computing

0 likes · 6 min read

Architecture Digest

Sep 11, 2017 · Big Data

Architecture and Data Flow of the Chinese Almanac Headline Recommendation System

The article describes the design, storage, update mechanisms, and optimization strategies of a headline recommendation platform that aggregates various data types using algorithms, MySQL, Redis, and a modular data‑fetching framework to achieve scalable and efficient content delivery.

Big Data, Data Architecture, MySQL

0 likes · 12 min read

MaGe Linux Operations

Sep 11, 2017 · Big Data

How Big Data Can Revolutionize Operations Monitoring

This article explores applying big‑data thinking and platforms—such as Flume, Spark Streaming, and HBase—to operations monitoring, detailing data sources, metric categories, architecture design, implementation steps, and the benefits of a scalable, low‑code monitoring platform.

Architecture, Big Data, Operations

0 likes · 10 min read

Architecture Digest

Sep 7, 2017 · Big Data

Design and Implementation of Bilibili's Lancer Log Collection System

The article presents the architecture, component design, optimizations, and reliability guarantees of Bilibili's Lancer log collection system, a Flume‑based distributed pipeline that handles both real‑time and offline data streams for billions of events daily.

Big Data, Distributed Systems, Flume

0 likes · 13 min read

21CTO

Sep 5, 2017 · Big Data

Build a PHP Word Count with Hadoop MapReduce: Step-by-Step Guide

This article explains what MapReduce is, when to use it, and how to implement a PHP word‑count and a gold‑price average calculation on an Apache Hadoop cluster, covering installation hints, mapper and reducer scripts, testing commands, and visualizing results with gnuplot.

Big Data, Gnuplot, Hadoop

0 likes · 10 min read

MaGe Linux Operations

Sep 4, 2017 · Fundamentals

The Ultimate Technical Knowledge Map: 50+ Skill Charts for Architects & Developers

This article presents a comprehensive collection of technical knowledge maps compiled over years, covering architecture, Java, microservices, consistency, big data, cloud computing, mobile development, front‑end, back‑end, DevOps, and more, aiming to help engineers and architects master essential skills and best practices.

Architecture, Big Data, Cloud Computing

0 likes · 6 min read

Tencent IMWeb Frontend Team

Sep 3, 2017 · Frontend Development

What’s Hot This Week in Web Tech? Apple Event, KSQL, Polymer 3, and More

This week’s IMWeb Frontend Community roundup highlights the Apple September event details, introduces KSQL for Apache Kafka, previews Polymer 3.0’s shift to ES6 modules, discusses the Ayo.js Node.js fork, ASP.NET Core 2 Razor pages, VS 2017 preview, container adoption trends, and Oracle’s cloud database innovations.

Big Data, Frontend, Technology News

0 likes · 6 min read

Architecture Digest

Sep 3, 2017 · Big Data

An Overview of Big Data Processing Frameworks: Batch, Stream, and Hybrid Systems

This article introduces the evolution of big‑data processing from Google’s MapReduce concept to modern open‑source frameworks, defines big data and its 3V characteristics, outlines typical processing pipelines, and compares batch, stream, and hybrid systems such as Hadoop, Storm, Samza, Spark, and Flink.

Batch Processing, Big Data, Flink

0 likes · 20 min read

BiCaiJia Technology Team

Sep 2, 2017 · Big Data

How to Install and Test Kafka on CentOS: A Step‑by‑Step Guide

This guide walks you through installing Zookeeper and Kafka on a CentOS server, configuring essential settings, creating topics, and running producers and consumers, while highlighting common pitfalls and providing the exact commands needed for a successful deployment.

Big Data, CentOS, Installation

0 likes · 6 min read

Architecture Digest

Sep 2, 2017 · Big Data

Designing a High‑Availability, High‑Efficiency Distributed Scheduling Platform for Big Data

This article examines the principles, features, and implementation details of distributed scheduling for big‑data ETL pipelines, covering decentralised schedulers, host selection strategies, fault‑tolerance, operator abstraction, elasticity, trigger mechanisms, visual monitoring, alarm handling, data fan‑in/fan‑out, parameter consistency, real‑time quality checks, lineage tracking, and field‑level traceability.

Big Data, Data Lineage, Distributed Scheduling

0 likes · 23 min read

StarRing Big Data Open Lab

Sep 1, 2017 · Information Security

How Guardian 5.0 Reinvents Big Data Security with an Enhanced ARBAC Model

Guardian 5.0 introduces a comprehensive security solution for the TDH big‑data platform, featuring LDAP/Kerberos authentication, unified ARBAC authorization, quota management, and multi‑layer architecture, while enhancing data protection and simplifying operations across cloud‑based deployments.

ARBAC, Authentication, Authorization

0 likes · 9 min read

Tongcheng Travel Technology Center

Aug 31, 2017 · Big Data

Evolution and Architecture of the Transportation Division Data Warehouse

The article details how the Transportation Division’s data warehouse grew from a simple SQL‑based solution to a multi‑layer, big‑data platform handling petabyte‑scale data with daily 10 TB increments, describing the technical and business architecture, ETL strategies, and future roadmap.

Big Data, Data Architecture, ETL

0 likes · 10 min read

Tongcheng Travel Technology Center

Aug 29, 2017 · Big Data

How to Become a Data Mining Engineer: A Year‑Long Journey and Practical Guide

This article recounts a year-long journey to become a data mining engineer, explaining the role’s value, required skills, tools such as Excel, Tableau, SQL, Python, Scala, Spark, and machine‑learning techniques, and offers practical steps for aspiring professionals.

Big Data, Python, Tableau

0 likes · 11 min read

21CTO

Aug 27, 2017 · Big Data

Uncovering Ghost Bikes: How to Crawl and Analyze Mobike Data in Chengdu

This article details the process of capturing Mobike's public API data, building a high‑performance Python crawler with proxy rotation, storing the results in databases, and performing large‑scale analysis to reveal stationary bikes, travel distances, usage frequency, and urban development patterns in Chengdu.

Big Data, Bike Sharing, Mobike

0 likes · 13 min read

Meituan Technology Team

Aug 25, 2017 · Big Data

Data Platform Integration and Multi‑Data‑Center Architecture at Meituan‑Dianping

After Meituan merged with Dianping, engineers unified two massive Hadoop ecosystems across Beijing and Shanghai by breaking the project into four phases—unify, copy, switch, fuse—standardizing versions, implementing zone‑aware transfers, cross‑realm Kerberos, and federated metadata to achieve a single, reliable multi‑data‑center platform.

Big Data, Cluster Fusion, Data Platform

0 likes · 32 min read

21CTO

Aug 21, 2017 · Big Data

Rethinking Hadoop: When to Use It and How Cloud Computing Changes the Game

This article reviews when Hadoop is appropriate, outlines its core features and limitations, explains cloud computing concepts and service models, and highlights the benefits of pre‑built Hadoop images for accelerating big‑data projects.

Big Data, Hadoop, Pre-built Images

0 likes · 13 min read

Architects' Tech Alliance

Aug 20, 2017 · Artificial Intelligence

Understanding AIOps: Gartner’s AI‑Driven IT Operations Platform and Its Key Drivers

Based on Gartner research, this article explains what AIOps is, how digital transformation drives its emergence, the platform’s big‑data and machine‑learning components, the factors and elements shaping it, and practical considerations for implementing AI‑powered IT operations.

Big Data, Digital Transformation, Gartner

0 likes · 7 min read

Architecture Digest

Aug 15, 2017 · Artificial Intelligence

Why AI Engineers Must Understand Basic Infrastructure: From Big Data to Deep Learning

The article explains why AI engineers need foundational infrastructure knowledge—covering big‑data processing, cloud services, containerization, MapReduce, and deep‑learning platforms—to effectively solve real‑world problems, collaborate with teams, and build scalable, maintainable AI solutions.

AI Infrastructure, Big Data, Cloud Computing

0 likes · 14 min read

21CTO

Aug 14, 2017 · Big Data

Unveiling Flink’s Multi‑Layer Execution Graph: From StreamGraph to Physical Deployment

This article explains Flink’s architecture, detailing the roles of Client, JobManager and TaskManager, walks through a SocketTextStreamWordCount example, and clarifies the four‑layer graph model—StreamGraph, JobGraph, ExecutionGraph, and the physical execution graph—highlighting why each layer exists.

Big Data, Execution Graph, Flink

0 likes · 9 min read

Alibaba Cloud Developer

Aug 10, 2017 · Big Data

Alibaba’s HBase Innovations: Powering Big Data at Scale – HBaseCon 2017 Asia Insights

At HBaseCon 2017 Asia, Alibaba showcased a series of groundbreaking HBase enhancements—including strong synchronous replication, SQL-on-HBase capabilities, cross‑cluster range data copy, and read/write path optimizations—that dramatically improve performance, reliability, and usability for large‑scale big‑data storage.

Big Data, HBase, Performance

0 likes · 10 min read

High Availability Architecture

Aug 8, 2017 · Big Data

Practical Big Data Architecture Evolution and Lessons Learned

The article reviews the evolution of big‑data architectures from a simple RDB‑centric pipeline to a SaaS‑based solution, highlighting common bottlenecks such as scaling, integration, cost, and operational complexity, and shares practical experiences and best‑practice recommendations for building efficient, maintainable data platforms.

Architecture, Big Data, SaaS

0 likes · 12 min read

ITFLY8 Architecture Home

Jul 26, 2017 · Big Data

Inside Taobao’s Massive Data Architecture: From Hadoop “Cloud Ladder” to Real‑Time “Galaxy”

This article details Taobao’s multi‑layer massive data platform, covering its five‑tier architecture, the 1500‑node Hadoop “Cloud Ladder” for batch processing, the low‑latency “Galaxy” stream engine, MySQL‑based MyFOX, HBase‑based Prom storage, the glider middle‑layer, and sophisticated caching strategies that together support petabytes of data and millions of daily queries.

Big Data, Distributed Systems, HBase

0 likes · 16 min read

21CTO

Jul 22, 2017 · Big Data

Why Every Company Needs a Chief Data Officer to Unlock Data Value

The article explains the strategic importance of the Chief Data Officer role, outlining how CDOs drive data‑driven innovation through a four‑stage data supply chain—data supply, logistics, science, and execution—to create competitive advantage and business growth.

Big Data, Chief Data Officer, Data Governance

0 likes · 14 min read

Architecture Digest

Jul 22, 2017 · Big Data

Popular Big Data Tools and Their Descriptions

This article provides an extensive overview of more than ninety open‑source and commercial big‑data tools—including ETL platforms, resource managers, storage systems, messaging queues, processing engines, and visualization libraries—detailing their core functions, typical use cases, and notable adopters.

Analytics, Big Data, Data Integration

0 likes · 26 min read

High Availability Architecture

Jul 19, 2017 · Artificial Intelligence

Weiflow: A Scalable Machine Learning Workflow Framework for Sina Weibo

The article introduces Weiflow, a dual‑layer DAG‑based machine‑learning workflow framework designed for Sina Weibo, and explains how its modular XML configuration, Scala implementation, and integration with Spark, TensorFlow, Hive, Storm, and Flink improve development efficiency, scalability, and execution performance across the entire ML pipeline.

Big Data, DAG, Scala

0 likes · 16 min read

ITFLY8 Architecture Home

Jul 17, 2017 · Big Data

Mastering Data Sync, Real‑Time Processing, and Scalable Storage for Modern Systems

This article explores practical techniques for synchronizing heterogeneous data sources, performing batch and incremental analytics with Hadoop and Spark, designing low‑latency real‑time computation pipelines, implementing push notifications, and choosing appropriate storage solutions—from in‑memory caches to distributed databases—while addressing performance, reliability, and scalability challenges.

Big Data, Distributed Systems, Real-time Processing

0 likes · 25 min read

Alibaba Cloud Developer

Jul 17, 2017 · Artificial Intelligence

How Alibaba Turns Big Data into ‘Data New Energy’ with Automated Tagging and Distributed Knowledge Graphs

Alibaba's senior algorithm expert Yang Hongxia explains how the company fuses massive, heterogeneous data sources into a unified platform, builds automated tag‑production pipelines and large‑scale distributed knowledge graphs, and applies these technologies to drive smarter business decisions and AI‑enabled services.

Alibaba, Big Data, Data Platform

0 likes · 14 min read

Efficient Ops

Jul 16, 2017 · Cloud Computing

Why PB‑Level Object Storage Is Essential and How to Choose the Right Solution

With data volumes soaring to petabyte scales, the article explains why object storage is the only viable solution for massive storage needs, outlines procurement considerations, design principles, and operational challenges, and offers practical guidance for building, evaluating, and scaling PB‑level storage systems.

Big Data, Cloud Computing, Storage Architecture

0 likes · 38 min read

Architecture Digest

Jul 13, 2017 · Operations

Comprehensive Architecture and DevOps Tool Knowledge Map

This article compiles an extensive collection of architecture knowledge maps and a detailed overview of DevOps tools, categorizing them by development, deployment, and maintenance functions while also presenting related big‑data and cloud‑computing skill maps for engineers seeking a holistic view of modern software infrastructure.

Architecture, Big Data, Cloud Computing

0 likes · 9 min read

High Availability Architecture

Jul 12, 2017 · Artificial Intelligence

Machine Learning Platform and Risk‑Control Applications at DianRong Net

The article presents a comprehensive overview of DianRong Net's in‑house machine‑learning platform built on Spark, its workflow, pain points it addresses, risk‑control case studies using graph mining, and practical tips for improving model performance through data, algorithms, hyper‑parameter tuning and ensemble methods.

Big Data, Model Optimization, Spark

0 likes · 14 min read

Architects' Tech Alliance

Jul 11, 2017 · Big Data

Understanding HDFS Architecture and Its Integration with NFS and Various Storage Solutions

This article reviews the fundamental concepts of HDFS, explains its master‑slave architecture with NameNode and DataNode, describes block replication, and discusses various implementations—including native HDFS, NetApp/Lustre, GPFS/Ceph, and Isilon—as well as HDFS‑to‑NFS gateway integration.

Big Data, Distributed File System, HDFS

0 likes · 7 min read

dbaplus Community

Jul 10, 2017 · Big Data

Master Apache Storm: Real‑Time Stream Processing from Basics to Word‑Count and Call‑Log Examples

This tutorial explains Apache Storm’s core principles, architecture, and development workflow, covering its relationship with Hadoop, key concepts such as spouts, bolts, tuples, and topologies, and provides step‑by‑step code examples for a word‑count program and a call‑log analysis application.

Apache Storm, Big Data, Real-time Processing

0 likes · 14 min read

Efficient Ops

Jul 9, 2017 · Cloud Native

How Goldwind Accelerated Wind Energy Management with Cloud‑Native Microservices

Goldwind transformed its global wind‑farm operations by adopting a cloud‑native, container‑based microservice architecture that tackles iteration speed, hybrid‑cloud deployment, and IoT big‑data challenges, enabling faster development, cost reduction, and advanced energy‑forecasting capabilities.

Big Data, DevOps, IoT

0 likes · 8 min read

21CTO

Jul 7, 2017 · Big Data

How to Kickstart Your Big Data Career: A Complete Learning Roadmap

This guide walks beginners through the vast big data landscape, helping them choose the right role, understand essential terminology, plan a learning path, and access curated resources for becoming a data engineer or analyst, all illustrated with clear diagrams.

Big Data, Learning Path, big data technologies

0 likes · 16 min read

Meituan Technology Team

Jul 6, 2017 · Backend Development

Online Feature System: Architecture, Storage, and High‑Concurrency Techniques

Using Meituan’s hotel‑travel platform as a case study, the article details a scalable online feature system architecture that combines layered storage, efficient compression, and robust synchronization to meet extreme concurrency, throughput, terabyte‑scale data, and sub‑10 ms latency demands for AI‑driven strategy services.

Big Data, data compression, distributed storage

0 likes · 23 min read

Alibaba Cloud Developer

Jul 5, 2017 · Artificial Intelligence

Is This the New Golden Age of Visual AI? Insights from Alibaba Cloud

The article reviews the three historic AI booms, explains why today’s cloud‑based visual intelligence represents a distinct era, outlines five key factors for successful visual AI, and showcases real‑world Alibaba Cloud applications such as product search, city‑wide monitoring, medical diagnosis, and visual advertising.

AI applications, Alibaba Cloud, Big Data

0 likes · 18 min read

Tencent Advertising Technology

Jul 3, 2017 · Artificial Intelligence

Tencent Social Advertising College Algorithm Contest

Tencent's social advertising team hosts an algorithm contest for college students, leveraging big data and machine learning to develop innovative solutions for social advertising scenarios, inviting participants to submit algorithmic approaches to real-world advertising challenges.

Academic Competition, Algorithm Contest, Big Data

0 likes · 2 min read

21CTO

Jul 3, 2017 · Big Data

Inside the World’s Best Data Architectures: Netflix, Facebook, Airbnb, Pinterest

This article explores the cutting‑edge data pipelines of Netflix, Facebook, Airbnb and Pinterest, detailing the massive event volumes they handle, the core technologies such as Kafka, Spark, Presto and Hadoop, and how these giants design scalable, real‑time analytics infrastructures.

Airbnb, Big Data, Data Architecture

0 likes · 6 min read

21CTO

Jul 1, 2017 · Operations

How Ctrip Scales Its Architecture: Ops, Release, and Big Data Insights

This article outlines Ctrip’s evolving architecture—covering its operational backbone, framework components, release system, configuration management, SOA evolution, and the massive UserProfile big‑data platform—offering practical insights from a senior developer on how the company achieves high availability and scalability.

Architecture, Big Data, Operations

0 likes · 12 min read

Java High-Performance Architecture

Jun 29, 2017 · Big Data

Master Apache Storm: Core Concepts, Real‑Time Word Count & Call Log Analytics

This tutorial introduces Apache Storm’s fundamental principles and development workflow, providing a PDF guide and source code for two practical examples—real‑time word‑count and call‑record aggregation—while covering its definition, use cases, relationship with Hadoop, core concepts, cluster architecture, and step‑by‑step usage.

Apache Storm, Big Data, Real-time Processing

0 likes · 1 min read

Efficient Ops

Jun 27, 2017 · Big Data

How a Leading Bank Evolved Its Big Data Platform Architecture

This talk outlines how China’s Guangfa Bank built, refined, and scaled its big‑data platform since 2014, covering data positioning, system architecture optimization, delivery model improvements, team restructuring, and real‑world use cases that demonstrate the platform’s impact on risk control, marketing and operational efficiency.

Banking, Big Data, Microservices

0 likes · 14 min read

dbaplus Community

Jun 27, 2017 · Big Data

Why Time‑Series Databases Are Essential for Modern Monitoring: Fundamentals and Live‑Streaming Use Cases

This article introduces the fundamentals of time‑series databases, compares them with traditional databases, surveys industry adoption, and details how Huya Live leverages OpenTSDB, Grafana, and Bosun to build a scalable monitoring system for live‑streaming services.

Big Data, Grafana, OpenTSDB

0 likes · 13 min read

Baidu Intelligent Testing

Jun 20, 2017 · Big Data

Design and Challenges of Web Crawlers and Link Scheduling for Knowledge Graph Construction

The article explains how web crawlers (spiders) collect data for knowledge graphs, covering core tasks, major challenges, crawler features, new‑link expansion, storage design, link‑selection scheduling strategies, and the role of large‑scale data mining and machine learning in optimizing crawl efficiency.

Big Data, Spider, Web Crawling

0 likes · 17 min read

21CTO

Jun 20, 2017 · Artificial Intelligence

How Toutiao’s AI Powers Personalized News Recommendations

This article examines Toutiao’s rapid rise as a personalized news platform, detailing its AI‑driven recommendation pipeline, web‑crawling infrastructure, similarity‑matrix algorithms, A/B testing, and the role of human moderation in delivering highly targeted content to billions of users.

A/B testing · AI · Big Data

0 likes · 16 min read

Alibaba Cloud Developer

Jun 19, 2017 · Cloud Computing

How Alibaba Built a Cloud‑Native HR System That Cut Costs 100× and Boosted Speed 6×

This article details Alibaba's migration from Oracle PeopleSoft HCM to a self‑developed, cloud‑native eHR platform, describing the technical challenges, phased development using Groovy and MaxCompute, and the resulting six‑fold speed increase, hundred‑fold cost reduction, and enhanced employee experience.

Big Data · Cloud Computing · Groovy

0 likes · 11 min read

ITFLY8 Architecture Home

Jun 18, 2017 · Cloud Computing

Inside Alibaba’s Middleware: Career Paths, Tech Stack, and Architecture Challenges

This article explores why Alibaba's middleware is dubbed the architect's cradle, outlines career development routes within the team, details the extensive technology stack, and examines the major technical challenges such as massive data processing, real‑time analytics, and large‑scale deployment during peak events.

Big Data · Career Development · Cloud Computing

0 likes · 25 min read

StarRing Big Data Open Lab

Jun 16, 2017 · Big Data

How TDH Dominated the TPCx‑HS 10TB Benchmark: Strategies and Results

The article details how a joint TPCx‑HS 10TB benchmark run by Transwarp and Cisco placed the TDH platform at the top of the performance ranking, explains the test setup, describes the pre‑ and post‑optimization strategies for TeraGen and TeraSort, and outlines the hardware configuration and key tuning parameters.

Big Data · Hadoop · Performance Optimization

0 likes · 10 min read

Ctrip Technology

Jun 13, 2017 · Operations

Evolution and Architecture of Ctrip's System: Operations, Frameworks, and Big Data

This article presents a comprehensive overview of Ctrip's evolving system architecture, detailing its operational strategies, framework components such as SOA and release systems, and the large‑scale UserProfile big‑data platform, illustrating how each iteration addressed prior challenges while introducing new capabilities.

Big Data · Ctrip · Operations

0 likes · 13 min read

21CTO

Jun 9, 2017 · Big Data

From Hadoop to Spark: A Complete Roadmap to Becoming a Big Data Architect

This guide walks beginners through the essential big‑data ecosystem—from understanding Hadoop’s core components and mastering MapReduce, to using Hive, SparkSQL, Kafka, and real‑time frameworks like Storm, while also covering data ingestion, export, scheduling, and introductory machine‑learning techniques.

Big Data · Spark · data engineering

0 likes · 20 min read

Suning Technology

Jun 9, 2017 · Big Data

How Suning’s AI‑Powered Smart Replenishment Turns Retail from B2C to C2B

Suning’s smart replenishment system showcased at CES Asia 2017 leverages big‑data analytics and machine‑learning models—linear regression, random forest, and XGBoost—to predict sales, optimize inventory across multiple warehouses, and shift retail from traditional B2C to a data‑driven C2B approach.

Big Data · inventory optimization · machine learning

0 likes · 5 min read

Alibaba Cloud Developer

Jun 8, 2017 · Big Data

Flink Forward 2017: Stream Processing Insights from Alibaba, Uber & Netflix

The article recounts the 2017 Flink Forward conference in San Francisco, highlighting key sessions from DataArtisans, Uber, Netflix and Alibaba, and discusses real‑time stream processing use cases, large‑scale deployments, runtime and TableAPI/SQL improvements, and the growing adoption of Flink in the industry.

Apache Flink · Big Data · Flink

0 likes · 16 min read

Architects' Tech Alliance

Jun 7, 2017 · Databases

What Is SAP HANA? Architecture, Deployment Models, and Use Cases Explained

This article provides a comprehensive overview of SAP HANA, covering its purpose as an in‑memory database platform, key architectural components, various deployment options such as appliances and TDI, and typical application scenarios ranging from OLAP analytics to OLTP workloads.

Architecture · Big Data · Deployment

0 likes · 11 min read

StarRing Big Data Open Lab

May 27, 2017 · Big Data

Simplify Big Data Governance with Data Lineage & Impact Analysis

Enterprise big‑data platforms face massive scale and complex metadata relationships, but using Transwarp Governor’s data lineage and impact analysis graphs enables precise tracing of data origins, rapid error localization, and prediction of downstream effects, dramatically improving data quality and governance efficiency.

Big Data · Data Governance · Data Lineage

0 likes · 8 min read

MaGe Linux Operations

May 26, 2017 · Big Data

How Big Data Transforms Everyday Life: From Finance to Healthcare

This article explains what big data is, outlines its 5V characteristics, and showcases numerous real‑world applications such as personal finance monitoring, tax fraud detection, healthcare prediction, public opinion tracking, precise marketing, product development, traffic planning, strategic decision‑making, and credit scoring.

Applications · Big Data · Data Analytics

0 likes · 4 min read

Architecture Digest

May 25, 2017 · Big Data

Designing Data Warehouse Layers: Principles, Models, and Practical Practices

This article explains why data warehouses should be layered, describes the classic ODS‑DW‑APP model, details each layer’s purpose and implementation techniques, presents an improved layering scheme with dimension and temporary tables, and answers common questions about parallel DWS and DWD processing.

Big Data · Data Architecture · ETL

0 likes · 17 min read

Alibaba Cloud Developer

May 25, 2017 · Big Data

How Alibaba’s Blink Engine Redefines Real‑Time Big Data Processing

This article explains how Alibaba’s Blink, built on Apache Flink, transforms batch‑oriented big‑data platforms into a unified, high‑performance real‑time computing engine, detailing its architecture, state management, checkpointing, and successful deployment in e‑commerce, search, recommendation, and online machine‑learning scenarios.

Alibaba · Big Data · Flink

0 likes · 17 min read

Alibaba Cloud Developer

May 20, 2017 · Artificial Intelligence

How Alibaba’s AI‑Driven Information Retrieval Is Shaping E‑Commerce Futures

The second “Frontiers and Future of Information Retrieval” forum, co‑hosted by the Chinese Computer Society, Alibaba and academic committees, showcased how massive, structured e‑commerce data and AI algorithms are revolutionizing search, customer service, and research collaborations across the industry.

Alibaba · Big Data · e‑commerce

0 likes · 4 min read

Alibaba Cloud Developer

May 17, 2017 · Databases

How Alibaba Tackles the Massive Challenges of Time‑Series Data Storage

This article details Alibaba's middleware team's exploration of time‑series data characteristics, real‑world monitoring scenarios, the limitations of traditional databases, and the evolution of their custom HiTSDB solution that combines inverted indexing, high‑compression algorithms, and distributed aggregation to meet massive write and query demands.

Alibaba · Big Data · HiTSDB

0 likes · 25 min read

MaGe Linux Operations

May 17, 2017 · Big Data

How Big Data Turns Raw Information into Resource Optimization

The article explains that the ultimate value of big data lies in optimizing resource allocation by first crowdsourcing massive data, then fully mining it to uncover truth, and finally using those insights across industries such as transportation, advertising, finance, and more.

Big Data · Resource Optimization · crowdsourcing

0 likes · 7 min read

Baidu Waimai Technology Team

May 16, 2017 · Big Data

Analysis of OLTP/OLAP Integrated Solutions: Apache Phoenix, Apache Trafodion, and Splice Machine

This article examines the convergence of OLTP and OLAP by introducing Apache Phoenix, Apache Trafodion, and Splice Machine, compares their technical features, and describes how Baidu Waimai adopted a Phoenix‑based solution to address scalability and performance challenges in its operational data store.

Apache Phoenix · Apache Trafodion · Big Data

0 likes · 12 min read

Qunar Tech Salon

May 16, 2017 · Artificial Intelligence

Personalized Recommendation Systems: Applications, User Profiling, Algorithms, and Optimization

This article presents a comprehensive overview of personalized recommendation systems, covering their application scenarios and value, user profiling, core algorithms such as content‑based and collaborative filtering, system architecture, performance and effect optimization techniques, and practical Q&A insights.

AI · Big Data · collaborative filtering

0 likes · 18 min read

MaGe Linux Operations

May 15, 2017 · Databases

Top 10 Must‑Know Data Storage Tools for Java Developers

Facing ever‑growing complexity, Java developers can streamline their projects by mastering a curated list of essential data storage and processing tools—including MongoDB, Elasticsearch, Cassandra, Redis, Hazelcast, EHCache, Hadoop, Solr, Spark, and Memcached—each offering distinct strengths for modern big‑data applications.

Big Data · NoSQL · data-processing

0 likes · 8 min read

Architecture Digest

May 14, 2017 · Big Data

Handling Transactions, Failover, and Exactly‑Once Semantics in Distributed Systems

This article explores practical techniques for handling node liveness, failover, recovery, and exactly‑once transaction semantics in distributed systems, illustrating implementations with Zookeeper, Kafka, Storm, and database sharding while addressing big‑data reach calculations and performance trade‑offs.

Big Data · Distributed Systems · Exactly-Once

0 likes · 15 min read

MaGe Linux Operations

May 10, 2017 · Big Data

What Defines Big Data? Core Concepts, Challenges, and Future Directions

This article outlines the fundamental definition of big data, its acquisition, transmission, usability, common algorithmic issues, security concerns, potential reforms in hardware and software, and the major computational and interdisciplinary challenges that must be addressed.

Algorithms · Big Data · Data Transmission

0 likes · 3 min read

Baidu Waimai Technology Team

May 9, 2017 · Industry Insights

Building Baidu Waimai’s Big Data Platform: Governance & Team Insights

This article examines how Baidu Waimai designed and evolved its big data platform, comparing traditional BI to modern 4V‑driven architectures, detailing database choices, OLTP/OLAP trade‑offs, data‑analysis team structures, and the essential steps for data governance and platformization.

Big Data · Data Architecture · Data Governance

0 likes · 19 min read

Qunar Tech Salon

May 9, 2017 · Artificial Intelligence

Ctrip CTO Gan Quan on Building a Data‑Driven Personalized Recommendation System

The article details how Ctrip’s CTO Gan Quan has leveraged big‑data platforms, deep‑learning algorithms, cross‑screen user tracking, and rapid AB testing to create a real‑time, personalized recommendation engine that shortens travel decision cycles and drives significant revenue growth.

AB testing · AI · Big Data

0 likes · 14 min read

ITPUB

May 8, 2017 · Big Data

Master Spark Performance: Practical Tuning Tips and Real‑World Examples

This article explains essential Spark concepts, illustrates common performance bottlenecks, and provides concrete tuning strategies for memory, CPU, serialization, data locality, file I/O, and shuffle reduction, backed by real‑world examples and visual metrics.

Big Data · CPU optimization · Memory Management

0 likes · 19 min read

Architects' Tech Alliance

May 7, 2017 · Big Data

Building a Complete Big Data Platform: From Hadoop Basics to Real‑Time Analytics

This guide walks beginners through the entire big‑data ecosystem—explaining the 4V characteristics, listing essential open‑source components, teaching Hadoop setup, Hive and SparkSQL usage, data ingestion with Sqoop, Flume and Kafka, task scheduling with Oozie, and real‑time processing with Storm and Spark Streaming.

Big Data · Hadoop · Kafka

0 likes · 20 min read

MaGe Linux Operations

May 7, 2017 · Artificial Intelligence

Big Data & Machine Learning: Core Definitions and Essential Algorithms

This article explains what big data and machine learning are, their interrelationship, various big‑data analysis approaches, core machine‑learning concepts, and details ten fundamental algorithms—including regression, neural networks, SVM, clustering, dimensionality reduction, and recommendation—while highlighting their roles in modern data‑driven applications.

Big Data · Neural Networks · clustering

0 likes · 24 min read

MaGe Linux Operations

May 5, 2017 · Big Data

Essential Big Data Glossary: Key Terms Every Data Professional Should Know

This article presents an A‑to‑Z glossary of common big‑data terminology, offering concise definitions for concepts such as aggregation, algorithms, analytics, AI, cloud computing, databases, machine learning, and more, to help readers quickly grasp the core vocabulary of the big‑data ecosystem.

Analytics · Big Data · Glossary

0 likes · 30 min read

MaGe Linux Operations

May 4, 2017 · Big Data

How to Process 100GB Logs and Massive Datasets with Hash Partitioning and Bloom Filters

This article explains the definition and 4V characteristics of big data and presents practical algorithms—including hash partitioning, min‑heap top‑K selection, bitmap extensions, and Bloom filter techniques—to efficiently handle ultra‑large log files, integer sets, and keyword searches within strict memory limits.

Big Data · Bitmap · Hash Partitioning

0 likes · 12 min read

Efficient Ops

May 3, 2017 · Operations

How Tencent Scales NBA Live Streams to Millions: Behind the Tech and Operations

This article details Tencent's large‑scale live streaming architecture for NBA games, covering the rapid growth of live video, key technical features, network transmission challenges, multi‑angle production, CDN deployment, monitoring, big‑data processing, and strategies for ensuring low latency and high reliability for millions of concurrent viewers.

Big Data · CDN · Operations

0 likes · 25 min read

Baidu Waimai Technology Team

Apr 28, 2017 · Big Data

Recap of Baidu Waimai Tech Team’s “Code Talk” Session on Data Platform Architecture and Big Data Practices

The article summarizes Baidu Waimai’s recent “Code Talk” event, highlighting the speaker’s overview of the company’s big‑data platform evolution, its technical architecture, practical challenges such as data security and accuracy, and a lively Q&A covering Storm, high availability, and metric management.

Baidu Waimai · Big Data · Data Platform

0 likes · 6 min read

Architects' Tech Alliance

Apr 27, 2017 · Big Data

Curated List of Big Data Learning Resources from w3cschool

This article presents a comprehensive, Chinese‑language collection of big‑data resources—including relational databases, distributed file systems, key‑value stores, distributed programming tools, file data models, and key‑map frameworks—compiled by w3cschool to help programmers deepen their understanding of big data technologies.

Big Data · Distributed Systems · Resources

0 likes · 6 min read

Architecture Digest

Apr 24, 2017 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical symptoms, and presents practical strategies—including business‑level adjustments, code tweaks, and platform‑specific tuning—to mitigate and resolve skew in big‑data processing.

Big Data · Data Skew · Hadoop

0 likes · 11 min read

StarRing Big Data Open Lab

Apr 22, 2017 · Big Data

How to Harness Event‑Driven StreamSQL for Low‑Latency Real‑Time Analytics

This article explains how StreamSQL runs on the Slipstream engine in event‑driven mode, shows how to enable the mode, and provides step‑by‑step code examples for low‑latency stream processing, window aggregation, and joining multiple window streams.

Big Data · Event-driven · Low latency

0 likes · 9 min read

21CTO

Apr 21, 2017 · R&D Management

How to Turn Technical Experience into Personal Value: Lessons from Outsourcing to Big Data

The author shares a candid journey from low‑paid outsourcing coding to senior roles in design, analysis, and big‑data architecture, revealing how understanding value networks, leveraging cloud and data trends, and expanding beyond pure coding can dramatically increase a technologist’s personal and market value.

Big Data · Career Development · Cloud Computing

0 likes · 34 min read

Alibaba Cloud Developer

Apr 21, 2017 · Big Data

How Alibaba Tackles Real-Time Stream and Graph Computing at Scale

In his ASPLOS keynote, Alibaba’s Vice President Zhou Jingren detailed the company’s large‑scale stream and graph computing platforms, highlighting fault‑tolerance innovations, real‑time data challenges, and upcoming advances in graph analytics and massive machine‑learning workloads.

AI · Alibaba · Big Data

0 likes · 7 min read

Baidu Waimai Technology Team

Apr 20, 2017 · Databases

Greenplum (GPDB) Architecture, Features, and Operational Tools Overview

This article explains Greenplum's MPP architecture, master‑segment design, high‑availability, interconnect network, rich management tools, parallel query planning, data loading techniques, and additional capabilities such as LDAP authentication and resource queues, demonstrating why it is a strong next‑generation big‑data query engine.

Big Data · Greenplum · MPP

0 likes · 15 min read

Baidu Waimai Technology Team

Apr 18, 2017 · Industry Insights

Baidu Waimai’s Cloud Migration, AI Logistics, and Architecture – QCon 2017

At QCon Beijing 2017, three senior Baidu Waimai engineers detailed the company’s year‑long migration from IDC to cloud using custom operation platforms, described the AI‑driven, data‑rich logistics scheduling system that outperforms manual dispatch, and shared architectural evolutions that enabled rapid, zero‑downtime scaling of the fast‑growing delivery business.

AI logistics · Big Data · Operations

0 likes · 5 min read

Architect

Apr 17, 2017 · Big Data

Understanding Apache Spark Architecture: RDD, Computation Model, Cluster Modes, RPC, and Core Components

This article provides a comprehensive overview of Apache Spark's architecture, covering its RDD abstraction, computation model, various cluster deployment modes, RPC communication layer, startup procedures, core components, interaction flows, and block management for broadcast variables.

Apache Spark · Big Data · Cluster Mode

0 likes · 15 min read

Meituan Technology Team

Apr 14, 2017 · Big Data

Practical Experience of HDFS Federation at Meituan: Challenges, Improvements, and Automation

Meituan‑Dianping migrated its 2,000‑node HDFS cluster to Federation by fixing ViewFs compatibility, simplifying mount points, leveraging FastCopy for massive data moves, improving token handling, and automating split‑workflow steps, thereby overcoming single‑NameNode bottlenecks and providing a practical blueprint for large‑scale Hadoop deployments.

Big Data · FastCopy · Federation

0 likes · 22 min read

MaGe Linux Operations

Apr 13, 2017 · Big Data

How to Choose the Right Language for Your Big Data Project

This article compares R, Python, Scala, and Java for big‑data projects, outlining each language’s strengths and weaknesses, and offers guidance on selecting the most suitable language based on project requirements, team expertise, and production needs.

Big Data · Python · R

0 likes · 8 min read

ITPUB

Apr 11, 2017 · Big Data

Understanding Spark Executor Memory Management: On‑heap, Off‑heap, and Unified Approaches

This article explains Spark's executor memory architecture, covering on‑heap and off‑heap allocation, static versus unified memory managers, storage and execution memory handling, RDD persistence, eviction policies, and the role of Tungsten's page‑based management in optimizing performance.

Big Data · Executor · Memory Management

0 likes · 23 min read

ITFLY8 Architecture Home

Apr 9, 2017 · Fundamentals

Understanding Bloom Filters: Fast, Space-Efficient Membership Tests

Bloom filters are highly space-efficient probabilistic data structures that quickly test set membership using multiple hash functions, guaranteeing no false negatives while allowing a small false positive rate, making them ideal for large-scale applications such as email blacklists and massive URL deduplication.
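As a rough illustration of the mechanism summarized above, here is a minimal sketch in Python. The class name `BloomFilter`, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions made for this example; they are not taken from the linked article.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: one bit array plus k salted hash functions."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _positions(self, item: str):
        # Derive k bit positions by salting a single hash with the index i
        # (an illustrative choice, not a tuned hash family).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        # Set every bit the item hashes to.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True only if every bit is set. An added item always passes this
        # check (no false negatives); an item that was never added may also
        # pass by coincidence (a false positive).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


blacklist = BloomFilter()
blacklist.add("spam@example.com")
print("spam@example.com" in blacklist)  # True: added items are always found
```

A lookup that returns `False` is definitive, while `True` only means "probably present"; a larger bit array or a better-matched number of hashes lowers the false positive rate at the cost of memory or CPU.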

Big Data · bloom-filter · membership testing

0 likes · 5 min read

21CTO

Apr 4, 2017 · Artificial Intelligence

How Vipshop Evolved Its Real-Time Personalized Recommendation Engine

This article recounts Wu Guanlin’s presentation on the evolution of Vipshop’s personalized recommendation system, detailing the technical challenges of real‑time predictions, the three generations of architecture, the four‑stage recommendation engine, and the VRE platform’s design for scalability and low latency.

Big Data · System Architecture · machine learning

0 likes · 10 min read

Baidu Waimai Technology Team

Mar 31, 2017 · Databases

Intelligent Resource Optimization and Risk Management in Baidu Waimai's Database Operations

The article examines how Baidu Waimai, positioning itself as a big‑data company, built an intelligent system for database resource optimization and risk management, outlines preventive measures for database reliability, and discusses the rapid growth and challenges of cloud databases.

Big Data · Cloud Databases · Database Management

0 likes · 7 min read

ITFLY8 Architecture Home

Mar 26, 2017 · Big Data

How to Build Scalable Log Monitoring and Analytics with ELK, Kafka, and Spark

This article explains various enterprise log types, recommends monitoring tools like Cacti, Zabbix, Splunk, and the ELK stack, and details architectures for handling server, application, and user‑click logs using technologies such as Logstash, Elasticsearch, Kibana, Kafka, Flume, and Spark.

Analytics · Big Data · ELK

0 likes · 26 min read
