Tagged articles

data pipeline

240 articles · Page 3 of 3

May 16, 2019 · Backend Development

Design and Implementation of a Configurable, Extensible Content Processing System (Apollo)

Apollo is a configurable, extensible content‑processing platform that models each step as a node defined in a configuration file and supports multiple implementations for A/B testing. It decouples producers and consumers via Kafka, provides fault‑tolerant retries and replay, captures fine‑grained metrics through Canal‑to‑TiDB pipelines, and cuts new‑type development effort to roughly ten percent of the original cost while delivering high‑quality data to downstream teams.

TiDB · Workflow Engine · backend-architecture

0 likes · 9 min read

HomeTech

Jan 18, 2019 · Big Data

Data Mill: A Real‑Time Spark Streaming Framework for DSP Business Support

Data Mill is a Spark‑Streaming‑based real‑time computation framework that abstracts tasks as DataFrames, enables SQL‑driven development, and supports DSP business requirements by reducing latency to 15‑30 minutes while providing a scalable architecture, caching strategy, and automated fault handling.

Cache · DSP · Real-Time Computing

0 likes · 10 min read

DataFunTalk

Dec 20, 2018 · Artificial Intelligence

How to Build World-Class Visual AI Technology

This presentation outlines the fundamentals of computer vision, discusses key factors such as algorithm research, large‑scale training platforms, intelligent data processing, and hardware optimization, and shares practical experiences from DeepGlint on building a world‑class visual AI system and its real‑world applications.

computer vision · data pipeline · hardware optimization

0 likes · 23 min read

JD Tech

Oct 10, 2018 · Backend Development

Design and Architecture of JD's Virtual Order Center (Hamal)

The article explains the architecture and core mechanisms of JD's Virtual Order Center, describing how the Hamal service leverages MySQL binlog listening, Zookeeper coordination, fast TCP‑based consumption, read‑write separation, and multi‑level search to reliably process billions of virtual orders.

Binlog · High Availability · backend

0 likes · 7 min read

21CTO

Sep 28, 2018 · Artificial Intelligence

Inside E‑Commerce Recommendation Systems: From Data Collection to Real‑Time Personalization

This article explains how e‑commerce recommendation systems work, covering regular and personalized recommendation types, the challenges of user profiling and data handling, the three‑stage recommendation pipeline, and the overall system architecture that powers real‑time, AI‑driven product suggestions.

AI · data pipeline · e-commerce

0 likes · 17 min read

Big Data and Microservices

Sep 3, 2018 · Big Data

From Raw Data to Business Impact: A Complete Data Analyst Skill Guide

The article outlines a comprehensive data‑analyst competency framework, covering data collection, storage, extraction, mining, analysis, visualization, and practical application, and provides concrete questions, techniques, and tool recommendations to help analysts turn raw data into actionable business insights.

Business Intelligence · Data Visualization · data analysis

0 likes · 9 min read

Ctrip Technology

Jul 24, 2018 · Backend Development

Design and Implementation of CTran V3: A Multilingual Translation Platform for Ctrip International Business

This article presents a comprehensive case study of CTran V3, a redesigned multilingual translation platform for Ctrip's international business, detailing its architecture, data flow, job scheduling, translation engine, real‑time services, and lessons learned to guide similar large‑scale content localization projects.

Job Scheduling · backend · content management

0 likes · 21 min read

Dada Group Technology

Jul 24, 2018 · Operations

Building a Scalable Growth Operations Platform: User Grouping, Dynamic Queries, and Automation

The article describes how a growth operations team can improve efficiency by designing a flexible user‑grouping system, dynamic query generation, and automated rule execution, while addressing data latency, real‑time processing, and scalability challenges through a Lambda‑style architecture.

Automation · Dynamic Query · Lambda architecture

0 likes · 14 min read

Efficient Ops

Jun 6, 2018 · Big Data

How Tencent’s Multi‑Dimensional Monitoring Turns Big Data Into Real‑Time Business Insights

This article explains how Tencent’s ZhiYun multi‑dimensional monitoring system evolves from the Mobile Monitor platform, outlines its design principles, data‑factory capabilities, storage choices, and intelligent features, and demonstrates how it enables real‑time, multi‑dimensional analysis and alerting for large‑scale business operations.

Big Data · Druid · Storm

0 likes · 11 min read

Java Captain

May 24, 2018 · Big Data

Debugging a Kafka Data Drop: A Step‑by‑Step Troubleshooting Case Study

After a recent feature release caused a sharp decline in a key data metric, the team followed a systematic, fourteen‑step troubleshooting process—including verification, code review, DBA involvement, local debugging, environment comparison, logging, packet capture, service restarts, request mode changes, load testing, and partition resizing—to identify and resolve a Kafka‑related throughput bottleneck.

Performance debugging · asynchronous vs synchronous · data pipeline

0 likes · 8 min read

21CTO

Apr 9, 2018 · Artificial Intelligence

How E‑Commerce Platforms Build Effective Product Recommendation Systems

This article explains the fundamentals and advanced techniques of e‑commerce product recommendation systems, covering conventional and personalized approaches, user profiling, data collection, storage, modeling, the three‑stage pipeline of preprocessing, recall and ranking, as well as system architecture, challenges, and key algorithms such as LR and GBDT.

data pipeline · e-commerce · machine learning

0 likes · 17 min read

Snowball Engineer Team

Mar 23, 2018 · Big Data

Redesigning Snowball's Log Collection Architecture During Hadoop Cluster Expansion

The article details Snowball's challenges with a saturated CDH Hadoop cluster, outlines the limitations of the original Kafka‑based log pipeline, and explains how a comprehensive redesign using FlumeNG, Spillable Memory Channels, and custom HDFS sinks resolves latency, data loss, and high‑load issues while supporting future growth.

Cluster Migration · FlumeNG · Hadoop

0 likes · 6 min read

Suning Technology

Mar 9, 2018 · Big Data

How Suning Built a Scalable Real-Time Log Analysis Platform with Spark Streaming

Suning’s real‑time log analysis system integrates Flume, Kafka, Storm and Spark Streaming to collect, cleanse, and compute metrics like NDCG, ensuring low latency, high throughput, exactly‑once processing, and robust data safety while supporting multi‑dimensional analytics on massive online‑offline traffic.

Big Data · Data Quality · NDCG

0 likes · 12 min read

Meitu Technology

Dec 19, 2017 · Industry Insights

Inside Meitu’s In‑House Log Collection System Arachnia: Design, Challenges, and Core Mechanisms

This article introduces Meitu’s self‑developed log collection system Arachnia, explaining why a custom solution was needed for massive server‑side user‑behavior logs, the key requirements such as reliability and real‑time throughput, and the core architectural mechanisms that address those challenges.

Arachnia · Big Data · Meitu

0 likes · 2 min read

Efficient Ops

Dec 18, 2017 · Operations

How WiFi Key Built a Million‑User Monitoring Platform: Architecture and Best Practices

This article describes how WiFi Master Key (WiFi 万能钥匙) designed and implemented the Roma monitoring platform to handle billions of daily requests, covering background challenges, architectural principles, component design, data collection, transmission, storage, alerting, and future directions for large‑scale observability.

Microservices · Monitoring · Observability

0 likes · 16 min read

Qunar Tech Salon

Dec 7, 2017 · Big Data

User Behavior Data Collection and Real-Time Processing Architecture at Qunar

This article describes Qunar's end‑to‑end user behavior data pipeline, covering offline and real‑time ETL processes, system architecture, Dubbo service interfaces, monitoring, optimizations, and the numerous product applications that leverage the unified behavior dataset.

ETL · Recommendation Systems · data pipeline

0 likes · 15 min read

Meituan Technology Team

Dec 1, 2017 · Big Data

Metric Logic Tree: Automated Anomaly Analysis for Business Metrics

The Metric Logic Tree automates business metric anomaly analysis by integrating heterogeneous data sources (Kylin, MySQL, Elasticsearch, Druid) with a three‑layer architecture—metric calculation, algorithmic analysis (waterfall and Gini‑coefficient methods), and a master‑worker computation service—that parallelizes queries, delivers immediate conclusions, and shortens decision cycles, as demonstrated in Meituan‑Dianping’s hotel‑travel operations.

Anomaly Detection · Big Data · algorithm

0 likes · 7 min read

Alibaba Cloud Developer

Nov 24, 2017 · Databases

How Alibaba Cloud Combines RDS PostgreSQL & HybridDB for Real‑Time HTAP Analytics

This article explains how Alibaba Cloud uses RDS PostgreSQL together with HybridDB for PostgreSQL and OSS to handle hundreds of thousands of transactions per second, merge order feeds into a wide table, and provide minute‑level latency with millisecond‑level real‑time analytics for e‑commerce platforms.

HTAP · HybridDB · PostgreSQL

0 likes · 14 min read

Efficient Ops

Nov 15, 2017 · Big Data

How Tencent Built a 10 TB‑Per‑Day Full‑Link Log Monitoring Platform

This article explains how Tencent's ZhiYun full‑link log monitoring platform handles massive daily logs, overcomes challenges of diverse log formats, high throughput, fault‑tolerant design, and provides scalable storage, query, and alerting capabilities for distributed micro‑service environments.

Big Data · Log Monitoring · data pipeline

0 likes · 10 min read

ITPUB

Sep 30, 2017 · Big Data

Designing Scalable Open‑Source ETL Systems: Lessons from Baidu Waimai

This talk details Baidu Waimai's end‑to‑end ETL design, covering demand sources, data flow patterns, multi‑stage system evolution, storage choices, scheduling architecture, configuration‑driven processing, quality monitoring, and how data lineage enables transparent, self‑service data delivery.

Big Data · Data Quality · Data Warehouse

0 likes · 25 min read

ITPUB

Sep 29, 2017 · Big Data

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

In this talk, a Baidu Waimai engineer explains the motivations, requirements, and architectural choices behind their open‑source ETL platform, covering data flow patterns, logical mappings, storage options, scheduling, metadata management, and quality monitoring to achieve scalable, transparent, and explainable data delivery.

Big Data · Data Engineering · ETL

0 likes · 26 min read

Meituan Technology Team

Sep 21, 2017 · Big Data

Feature Production Scheduling: Architecture Evolution and Core Technologies

Using Meituan‑Dianping’s hospitality online feature system as a case study, the article describes how feature production scheduling evolved from offline batch ETL to automated, metadata‑driven pipelines and sub‑second streaming, detailing the underlying architecture, incremental updates, storage abstraction, write‑shaving, atomicity, and recovery mechanisms.

Big Data · Real-time Processing · data pipeline

0 likes · 23 min read

Architecture Digest

Sep 15, 2017 · Artificial Intelligence

Overview of Recommendation Systems: Goals, Methods, Architecture, and Practical Considerations

This article explains the objectives of recommendation systems, compares popular recommendation approaches, details the components and algorithms of personalized recommendation pipelines, and discusses practical challenges such as real‑time processing, freshness, cold‑start, diversity, content quality, and surprise handling.

Evaluation · Real-time · cold-start

0 likes · 15 min read

Architecture Digest

Sep 7, 2017 · Big Data

Design and Implementation of Bilibili's Lancer Log Collection System

The article presents the architecture, component design, optimizations, and reliability guarantees of Bilibili's Lancer log collection system, a Flume‑based distributed pipeline that handles both real‑time and offline data streams for billions of events daily.

Big Data · Flume · data pipeline

0 likes · 13 min read

Architecture Digest

Sep 2, 2017 · Big Data

Designing a High‑Availability, High‑Efficiency Distributed Scheduling Platform for Big Data

This article examines the principles, features, and implementation details of distributed scheduling for big‑data ETL pipelines, covering decentralised schedulers, host selection strategies, fault‑tolerance, operator abstraction, elasticity, trigger mechanisms, visual monitoring, alarm handling, data fan‑in/fan‑out, parameter consistency, real‑time quality checks, lineage tracking, and field‑level traceability.

Big Data · Distributed Scheduling · ETL

0 likes · 23 min read

21CTO

Aug 23, 2017 · Big Data

How to Build a Real-Time Customer Behavior Collection System with Storm and NSQ

This article explains the design of a real-time customer behavior collection platform that uses NSQ for messaging, Storm for streaming processing, and HBase for storage, covering architecture, data flow, reliability guarantees, and deployment considerations.

HBase · NSQ · Real-time Streaming

0 likes · 11 min read

21CTO

Aug 18, 2017 · Big Data

How Ctrip Builds a Scalable User Profile Platform for Personalized Travel

This article explains why Ctrip creates user profiles, describes the product and technical architectures, and details the data collection, computation, storage, high‑availability querying, and monitoring components that power its personalized travel recommendations and services.

Ctrip · Real-time Processing · data pipeline

0 likes · 8 min read

High Availability Architecture

Aug 8, 2017 · Big Data

Practical Big Data Architecture Evolution and Lessons Learned

The article reviews the evolution of big‑data architectures from a simple RDB‑centric pipeline to a SaaS‑based solution, highlighting common bottlenecks such as scaling, integration, cost, and operational complexity, and shares practical experiences and best‑practice recommendations for building efficient, maintainable data platforms.

Big Data · Logging · Monitoring

0 likes · 12 min read

StarRing Big Data Open Lab

Jul 28, 2017 · Big Data

How Transwarp Transporter Enables Near‑Real‑Time ETL in Big Data Pipelines

The article introduces Transwarp Transporter, a near‑real‑time ETL tool for TDH 5.x, explains its architecture, visual dashboard, drag‑and‑drop data‑flow design, debugging features, parameter management, and highlights how it empowers business users to achieve fast, reliable data migration in big‑data environments.

Data Integration · ETL · Transwarp

0 likes · 7 min read

Architecture Digest

Jul 18, 2017 · Backend Development

Design and Implementation of Ctrip Real‑Time User Data Collection System

This article describes the design, technology selection, and performance evaluation of Ctrip's real‑time user behavior data collection platform, covering Netty‑based network handling, Kafka/Hermes messaging, encryption, compression, Avro backup, and related analytics products, with detailed feasibility analysis and benchmark results.

Netty · backend-architecture · data pipeline

0 likes · 17 min read

Qunar Tech Salon

Jul 4, 2017 · Big Data

Design and Evolution of Airbnb's Log Data Storage and Query Platform

The article describes how Airbnb's data infrastructure team built a next‑generation log storage and query platform to improve data quality, timeliness, flexibility, and anomaly detection, outlining the system architecture, key requirements, five improvement areas, and the resulting benefits.

Airbnb · Monitoring · data pipeline

0 likes · 7 min read

Architects' Tech Alliance

May 7, 2017 · Big Data

Building a Complete Big Data Platform: From Hadoop Basics to Real‑Time Analytics

This guide walks beginners through the entire big‑data ecosystem—explaining the 4V characteristics, listing essential open‑source components, teaching Hadoop setup, Hive and SparkSQL usage, data ingestion with Sqoop, Flume and Kafka, task scheduling with Oozie, and real‑time processing with Storm and Spark Streaming.

Big Data · Hadoop · Hive

0 likes · 20 min read

Baidu Intelligent Testing

Mar 14, 2017 · Big Data

Baidu's Third‑Party Sentiment Feedback System: Architecture, Data Capture, Cleaning, and Output

The article presents Baidu's end‑to‑end third‑party sentiment feedback solution, detailing its evolution from waterfall to agile and lean development, and describing the three‑stage pipeline of data acquisition, cleansing, and warehousing that enables real‑time product quality loops.

Baidu · Big Data · Sentiment Analysis

0 likes · 14 min read

Meituan Technology Team

Mar 2, 2017 · Big Data

Meituan Waimai Feature Archive Platform: Architecture, Tag System, and Data Processing

Meituan Waimai’s Feature Archive platform processes billions of daily orders by managing ~200 user and 400 merchant tags through a three‑layer architecture—Hive, Elasticsearch, HBase, and MySQL—offering visual tag selection, instant self‑service queries, full data extraction, and a predicate‑logic query language, while supporting future extensibility.

Big Data · Elasticsearch · HBase

0 likes · 14 min read

21CTO

Nov 6, 2016 · Artificial Intelligence

How to Build a Scalable AI-Powered Recommendation System with SOA

This article outlines a service‑oriented architecture for a high‑availability personalized recommendation platform, detailing the front‑end, back‑end, crawler, user‑profile modeling, data collection from logs and client events, and processing pipelines using technologies such as Node.js, Python, RabbitMQ/Kafka, MongoDB and TensorFlow.

Full-Stack · SOA · TensorFlow

0 likes · 5 min read

Meituan Technology Team

Aug 5, 2016 · Big Data

Design and Implementation of a Large-Scale User Behavior Analytics Platform

The article outlines Meituan‑Dianping’s “Sensors Analytics” platform, a privately‑deployed, open‑PaaS solution that collects full‑stack user events from iOS, Android, Web and WeChat, maps IDs in near real‑time, stores detailed records in Kudu (real‑time) and Parquet (offline), and serves low‑latency queries via Impala, addressing the architectural and operational challenges of high‑throughput ingestion and data‑security requirements.

Impala · Kudu · User Behavior Analytics

0 likes · 8 min read

Baidu Maps Tech Team

Feb 3, 2016 · Big Data

How Baidu Maps Powers Its Open Platform with Big Data Architecture

This article explains how Baidu Maps’ open platform handles massive daily location data through real‑time and offline pipelines, Hadoop‑based offline computing, stream processing, and query engines built on MySQL, Redis, and Apache Kylin, while outlining future big‑data enhancements.

Apache Kylin · Baidu Maps · Hadoop

0 likes · 7 min read

21CTO

Nov 29, 2015 · Backend Development

Designing High‑Performance E‑Commerce Search Engines: Architecture, Scaling, and Reliability

This article explores the unique characteristics of e‑commerce search engines, their specialized architecture, core modules, data update processes, and practical solutions for bugs, high concurrency, caching, and cold‑start challenges, offering a comprehensive guide for building robust search systems.

Caching · E-commerce Search · Indexing

0 likes · 12 min read

21CTO

Nov 21, 2015 · Big Data

Why Build a Kafka System? Core Use Cases and Design Principles

This article explains why Kafka is essential for activity and operational data pipelines, outlines key use cases such as news feeds, relevance ranking, security, monitoring, and reporting, and details its deployment topology, design decisions, and message persistence strategies.

Distributed Messaging · Real-time Processing · data pipeline

0 likes · 14 min read

Art of Distributed System Architecture Design

Apr 24, 2015 · Big Data

Pinterest Real-Time Data Pipeline Using Kafka, Spark, and MemSQL

Pinterest built a real‑time data pipeline that streams user engagement events through Apache Kafka into Spark Streaming, enriches them with location and category information, and persists the results in MemSQL to enable fast, SQL‑based analytics for its recommendation engine.

Big Data · MemSQL · Pinterest

0 likes · 3 min read