Tagged articles

data pipeline

240 articles · Page 2 of 3

Aug 29, 2022 · Backend Development

Optimizing Front‑Back End Collaboration with Interface Platform and Data Direct Access at Baidu

Baidu’s commercial front‑end team integrated an interface platform with a data‑direct capability—leveraging BFF layers, Redis‑based offline data injection, stub services, data grading, and fragment‑based batch editing—to enable true parallel front‑back end development, eliminate separate test environments, and cut average project delivery time by more than half across thousands of projects.

BFF · Platform Engineering · backend

0 likes · 9 min read

37 Interactive Technology Team

Aug 23, 2022 · Big Data

Optimizing Game Event Reporting with Stream Processing to Overcome ClickHouse Performance Bottlenecks

Faced with ClickHouse query times ballooning to over an hour for massive game‑event data, the team replaced the DB‑pull model with a stream‑processing pipeline that evaluates trigger rules in real time, cuts batch queries by 60 %, and brings reporting latency down to minutes.

ClickHouse · Game Analytics · Performance Optimization

0 likes · 6 min read

DataFunSummit

Aug 22, 2022 · Big Data

Design and Practice of 360ShuKe Risk Control System Architecture

This presentation details 360ShuKe's risk control system architecture, covering its layered design, credit data lifecycle management, real‑time indicator computation, feature platform evolution, and solutions to challenges such as data loss, rapid model iteration, and feature drift.

Credit Scoring · data pipeline · feature engineering

0 likes · 12 min read

Volcano Engine Developer Services

Aug 15, 2022 · Big Data

How ByteDance Scales Event Tracking: Inside a Billion‑Events‑Per‑Second Data Pipeline

This article explains how ByteDance’s event‑tracking (埋点) data flow handles billions of events per second using Flink‑based real‑time ETL, dynamic rule engines, data sharding, and multi‑datacenter disaster‑recovery to ensure stability, low latency, and cost‑effective processing for diverse downstream services.

Big Data · Flink · data pipeline

0 likes · 16 min read

DataFunTalk

Jul 26, 2022 · Big Data

Feature Platform Architecture and Stream‑Batch Integrated Solutions

This talk presents Shuhe Technology’s feature platform, detailing its four‑layer architecture, feature storage services, stream‑batch integrated processing, event‑center design, consistency models, and four model‑strategy invocation schemes, illustrating data flows from MySQL through Sqoop, Kafka, Flink, HBase and ClickHouse.

Big Data · ClickHouse · Flink

0 likes · 17 min read

GuanYuan Data Tech Team

Jul 21, 2022 · Operations

Mastering Data Workflows with DAGs: Scheduling, Configurable UI, and Visual Design

This article explains how to abstract repetitive data‑report tasks into a standardized workflow, describes the core capabilities of scheduling and configuration, shows how to implement DAG‑based visual editors, and compares similar platforms such as n8n and Orange, offering practical code examples and design insights.

DAG · Scheduling · backend

0 likes · 20 min read

Continuous Delivery 2.0

Jun 24, 2022 · Operations

The Three Core Pipelines for Digital Transformation: Code, Data, and Experiment Pipelines

The article explains why modern enterprises must build three fundamental pipelines—code, data, and experiment—to achieve successful digital transformation, outlines their purposes, discusses construction strategies, and highlights challenges specific to each pipeline.

code pipeline · data pipeline · digital transformation

0 likes · 5 min read

IT Architects Alliance

Jun 5, 2022 · Big Data

Real-Time Data and User Profiling Practices at Zhihu: Architecture, Challenges, and Solutions

This article presents a comprehensive case study of Zhihu's data empowerment team, detailing the design of a real‑time data platform and user profiling system, the challenges faced in scalability, latency, and data quality, and the practical solutions and architectural choices implemented to drive business value.

Data Quality · Lambda architecture · Real-time Data

0 likes · 22 min read

DataFunSummit

May 21, 2022 · Big Data

Tencent News Massive Log Processing Architecture and Data Applications

The article presents Tencent News' comprehensive massive log processing solution, covering background, overall architecture, data collection, real-time and offline computation layers, data quality assurance, and practical examples such as Flink CDC for database synchronization, illustrating how large‑scale data is managed and applied.

Flink · Log Processing · Tencent

0 likes · 10 min read

Weimob Technology Center

May 20, 2022 · Big Data

Mastering Data Projects: From Collection to Modeling in the Big Data Era

This article walks through the four essential stages of building a data project—data collection, modeling, analysis, and application—explaining key principles, common models such as 3NF, star/snowflake, cube, and wide tables, and comparing offline versus real‑time pipelines.

data collection · data modeling · data pipeline

0 likes · 10 min read

Qunar Tech Salon

Apr 13, 2022 · Big Data

Design and Implementation of a High‑Performance, Scalable MySQL Log Analysis System

This document describes the background, problems, requirements, analysis, and detailed design of a high‑performance, highly available, and scalable log‑analysis pipeline for MySQL that leverages Kafka, ClickHouse, MySQL, and custom services to aggregate, enrich, and visualize massive query logs.

ClickHouse · data pipeline · high performance

0 likes · 17 min read

High Availability Architecture

Apr 11, 2022 · Big Data

Ensuring Data Accuracy and Reliability in Baidu Log Platform: Architecture, Challenges, and Solutions

This article introduces the current state of Baidu's log platform, explains its lifecycle from data collection to downstream applications, analyzes the challenges of achieving near‑zero duplication and loss, and presents architectural optimizations and best‑practice recommendations to improve data stability and accuracy across the system.

Big Data · Data Reliability · data pipeline

0 likes · 19 min read

Python Programming Learning Circle

Apr 6, 2022 · Backend Development

Scrapy‑Based Zhihu User Follow/Followers Crawler with MongoDB Storage

This tutorial demonstrates how to build a Scrapy spider that crawls Zhihu user follow and follower data via Zhihu’s public APIs, handles request headers, parses JSON responses, paginates results, and stores the extracted information into MongoDB using a custom item pipeline.

API · MongoDB · Python

0 likes · 11 min read

Baidu Geek Talk

Apr 6, 2022 · Big Data

Baidu Log Platform: Ensuring Data Accuracy with No-Duplication and No-Loss Architecture

Baidu’s logging platform centralizes data collection, transmission, management, and analysis for billions of daily logs, employing a layered architecture with priority persistence, service decomposition, stream computing, and client‑side optimizations to guarantee no duplication, no loss, and 99.99%+ stability.

Baidu · Data Accuracy · architecture design

0 likes · 11 min read

DataFunSummit

Apr 4, 2022 · Big Data

User Portrait Scenarios and Technical Implementation Solutions

This article presents a comprehensive overview of user portrait applications across various industries, detailing common scenarios, product functionalities, and a step‑by‑step technical solution that includes data collection, tag management, ETL pipelines, and service architecture for real‑time and offline processing.

ETL · SCRM · Tag Management

0 likes · 18 min read

DataFunTalk

Mar 31, 2022 · Artificial Intelligence

Comprehensive Guide to TensorFlow: Modeling, Deployment, and Operations

This article provides an in‑depth overview of the TensorFlow ecosystem, covering Keras modeling productivity tools, classic model examples, AutoKeras and KerasTuner for automated search, data preprocessing pipelines, performance profiling, model optimization, and multiple deployment strategies for servers, browsers, and edge devices.

AutoML · Keras · Model Deployment

0 likes · 20 min read

NetEase Smart Enterprise Tech+

Mar 29, 2022 · Big Data

Automating Consumer Insight Testing with Spark, Hive, and ClickHouse

This article explains how to build a big‑data consumer insight platform using Spark applications, Hive, MySQL and ClickHouse, and how to automate data validation and algorithm testing to improve coverage, efficiency, and reliability of insight services.

Big Data · ClickHouse · Spark

0 likes · 8 min read

DeWu Technology

Mar 21, 2022 · Big Data

Real-time Customer Service Dashboard: Architecture and Implementation with Flink and ClickHouse

The article describes a real‑time customer‑service dashboard built on Flink for streaming MySQL changes captured via Kafka, which cleans and aggregates ~60 operational metrics before writing them to ClickHouse’s MergeTree/ReplacingMergeTree tables, enabling sub‑second queries and exactly‑once guarantees while separating offline and live pipelines.

ClickHouse · Flink · dashboard

0 likes · 18 min read

dbaplus Community

Feb 23, 2022 · Big Data

Inside OPPO’s Real‑Time Computing Platform: Architecture, Practices, and Future Roadmap

This article details OPPO’s real‑time computing platform, covering its business scope, big‑data architecture built on Flink, Spark and Trino, the end‑to‑end job development lifecycle, SQL IDE features, diagnostic and monitoring mechanisms, link latency tracking, SLA guarantees, practical use cases, and upcoming lakehouse and cloud‑native evolution.

Flink · Real-Time Computing · big data platform

0 likes · 23 min read

IT Architects Alliance

Feb 8, 2022 · Backend Development

Designing a Daily Million-Transaction Payment Reconciliation System

This article explains how to architect a payment reconciliation system that can reliably process tens of millions of transactions per day, covering the underlying logic, scalability challenges, data collection methods, big‑data integration, and step‑by‑step processing flows to ensure accurate financial matching.

Big Data · Hive · Spark

0 likes · 32 min read

dbaplus Community

Jan 21, 2022 · Big Data

Turning Data into Oil: Building Scalable Big Data Pipelines and Secure Insights

The talk outlines how to treat raw data like crude oil—using a robust big‑data platform, entity‑resolution techniques, secure data governance, and visualisation tools to transform disparate sources into reusable assets that drive rapid business insights and operational efficiency.

Cloud Data Platform · Data Security · Data Visualization

0 likes · 16 min read

Code DAO

Dec 20, 2021 · Artificial Intelligence

Building Efficient Data Pipelines with TensorFlow’s tf.data API

This article explains how to use TensorFlow’s tf.data API to construct high‑performance, flexible data pipelines—from loading images or tensors, applying transformations and data augmentation, to batching, shuffling, caching, prefetching, and feeding the pipeline directly into model.fit for training.

Prefetch · Python · TensorFlow

0 likes · 9 min read

Code DAO

Dec 12, 2021 · Artificial Intelligence

Lightning Flash 0.3 Introduces New Tasks, Visualization Tools, Data Pipelines, and Registry API

Lightning Flash 0.3 expands the PyTorch Lightning ecosystem with eight new computer‑vision and NLP tasks, modular API design, integrated model hubs, visualisation callbacks, customizable data‑source hooks, and a central registry for model backbones, all illustrated with concrete code examples.

Lightning Flash · PyTorch Lightning · computer vision

0 likes · 7 min read

DataFunSummit

Dec 6, 2021 · Big Data

Design and Performance Optimization of a Real‑Time Billion‑Scale Data Processing Pipeline

This article reviews the background, architecture, and a series of performance‑optimizing techniques—including consumption, batch, storage, and execution‑engine tweaks—applied to a real‑time pipeline that processes hundreds of billions of records daily, and presents the resulting resource savings and latency improvements.

Performance Optimization · Real-time Processing · SparkSQL

0 likes · 9 min read

Baidu Intelligent Testing

Oct 12, 2021 · Artificial Intelligence

Full‑Link Consistency Testing for Click‑Through Rate Models in Large‑Scale Machine Learning

The article describes a comprehensive full‑link consistency testing framework for click‑through‑rate models, defining consistency issues, outlining data and logic consistency goals, and presenting a multi‑stage technical solution—including online data capture, offline data stitching, q‑value comparison, and reporting—to ensure model stability and performance.

DNN · click-through rate · data pipeline

0 likes · 18 min read

StarRocks

Sep 24, 2021 · Big Data

How Didi Scaled Real‑Time Funnel Analysis with StarRocks: Architecture, Design, and Performance Tips

Didi's data architecture team migrated high‑volume, real‑time funnel analysis from ClickHouse to StarRocks, built a multi‑layer pipeline with Kafka, Flink/Spark, Hive, and materialized views, and achieved sub‑3‑second query times on billions of rows, while outlining future enhancements.

Big Data · Funnel Analysis · Hive

0 likes · 14 min read

dbaplus Community

Sep 6, 2021 · Frontend Development

Building a Scalable Frontend Performance Monitoring System at 哈啰

This article details 哈啰's front‑end performance monitoring architecture, covering the background of rapid growth, a three‑step optimization workflow, data collection, cleaning, aggregation, visualization, and practical techniques like pre‑rendering and offline packages to dramatically improve page load metrics.

Metrics · Optimization · data pipeline

0 likes · 30 min read

Xianyu Technology

Aug 31, 2021 · Big Data

Xianyu SPU System Architecture and Data Pipeline Overview

Xianyu built a custom SPU system and data pipeline that cleans Alibaba’s raw SPU data, defines key, binding, sales and product attributes, stores enriched records in MySQL, syncs to OpenSearch, and supports diverse business scenarios such as inspection, search publishing, and worry‑free purchase.

OpenSearch · Product Modeling · SPU

0 likes · 8 min read

DataFunTalk

Jul 26, 2021 · Big Data

Accelerating Hive Daily Tables with Flink: A SmartNews Case Study

This article describes how SmartNews integrated Flink into its Airflow‑driven Hive batch pipeline to cut the actions table generation latency from three hours to about thirty‑four minutes, detailing the technical challenges, design decisions, and production results.

AWS · Big Data · Flink

0 likes · 12 min read

Didi Tech

Jul 1, 2021 · Big Data

Full-Chain Traffic Data Detection in DiDi's Omega Platform

DiDi’s Omega platform provides an end‑to‑end traffic‑data pipeline, from SDK collection through real‑time and offline ETL to storage and analysis, augmented by a detection service that measures loss, duplication and accuracy. The service achieves sub‑1% SDK loss and provides integrity tagging and comprehensive monitoring dashboards. The article closes with a hiring call for senior data engineers.

Data Quality · Omega Platform · data pipeline

0 likes · 9 min read

Java High-Performance Architecture

Jun 14, 2021 · Big Data

How NetEase Games Built a Scalable Flink‑Based Streaming ETL Platform

This article explains how NetEase Games engineers designed and operated a Flink‑driven streaming ETL system, covering business background, log classification, dedicated and generic ETL services, architecture evolution, Python UDF integration, runtime optimizations, tuning practices, fault‑tolerance mechanisms, and future roadmap.

Flink · Game Analytics · data pipeline

0 likes · 22 min read

Architecture Digest

Jun 10, 2021 · Big Data

NetEase Game Streaming ETL Architecture and Practices Based on Flink

This article presents NetEase Game's streaming ETL solution built on Flink, covering business background, log characteristics, specialized and generic ETL services, architectural evolution, Python UDF integration, runtime optimizations, fault‑tolerance mechanisms, and future roadmap for unified real‑time and offline data warehouses.

Big Data · Flink · Log Processing

0 likes · 19 min read

IT Architects Alliance

Jun 8, 2021 · Industry Insights

Inside Toutiao’s 11B Daily‑Active‑User Architecture: Data, Recommendations & Scaling

This article dissects Toutiao’s rapid growth from a small startup to a platform with over 500 million registered users, detailing its data collection pipeline, user‑modeling techniques, recommendation engine, micro‑service architecture, PaaS infrastructure, storage strategies, and push‑notification system.

Recommendation Engine · Toutiao · data pipeline

0 likes · 9 min read

Xianyu Technology

Jun 8, 2021 · Big Data

Longgong Data Analysis Platform: Architecture and Solutions for Large‑Scale Structured Data

The Longgong Data Analysis Platform enables Xianyu to capture, store, and analyze billions of structured product attributes in real time across more than 8,000 categories, using TableStore, MySQL, ODPS, and a distributed scheduler to achieve over 50% query speedup, 80% category coverage, and rapid support for search and recommendation teams.

Alibaba · Big Data · Data Platform

0 likes · 9 min read

Architecture Digest

May 17, 2021 · Big Data

Technical Architecture Overview of Toutiao: Data Pipeline, User Modeling, Recommendation System, and Microservices

The article provides a comprehensive technical overview of Toutiao's rapid growth, detailing its massive user base, data collection and processing pipelines, user modeling, cold‑start strategies, recommendation engines, storage solutions, push notification mechanisms, and the underlying microservice and PaaS architecture.

Big Data · Hadoop · Microservices

0 likes · 8 min read

DataFunTalk

May 14, 2021 · Big Data

Real‑time Billion‑Scale Data Transmission and AI Pipeline Architecture at Bilibili

This article presents a technical deep‑dive into Bilibili’s evolution from offline to real‑time data processing, describing the challenges of timeliness, ETL, AI feature engineering, and the design of a Flink‑on‑YARN incremental pipeline that supports trillion‑scale message throughput and AI‑driven real‑time applications.

AI · Big Data · Flink

0 likes · 27 min read

HelloTech

May 14, 2021 · Big Data

User Behavior Analysis System: Architecture, ClickHouse Cluster Deployment, and Analytical Techniques

The article describes a real‑time user behavior analysis platform built on a ClickHouse cluster, detailing its architecture, Hive‑to‑ClickHouse data ingestion with user‑ID routing, table designs for behavior and group data, and five analytical methods—event, funnel, path, retention, and attribution—leveraging shard‑level parallelism and custom functions for high efficiency.

Analytics · Big Data · ClickHouse

0 likes · 20 min read

Big Data Technology Architecture

May 12, 2021 · Big Data

End-to-End Tutorial: Sync MySQL Binlog to Kafka and Consume with Flink Using TiDB

This article provides a step‑by‑step guide to build a data pipeline that captures MySQL binlog, streams it through Canal into Kafka, processes it with Flink, and finally writes the results into TiDB, covering environment setup, component deployment, configuration, and verification.

Canal · Flink · TiDB

0 likes · 31 min read

Spring Full-Stack Practical Cases

May 10, 2021 · Backend Development

How to Sync Oracle Data to Elasticsearch with Logstash: Step‑by‑Step Guide

This article walks through three data‑sync strategies for Elasticsearch, then details a complete Logstash JDBC configuration to pull Oracle records into an ES index, including setup, parameter explanations, startup commands, and verification via Kibana.

ELK · JDBC · Logstash

0 likes · 5 min read

DeWu Technology

May 7, 2021 · Big Data

Unified Semantic Layer for Data Development: Addressing Pain Points and Optimizing Queries

A unified semantic layer for data development creates a consistent, multi‑view representation of metrics that buffers logical changes, lets downstream applications use metric names only, and enables analysts and developers to select optimal query objects, thereby reducing misunderstandings, cutting rework, and improving query performance and maintainability.

OLAP · data pipeline

0 likes · 5 min read

IT Architects Alliance

Apr 23, 2021 · Industry Insights

Inside Toutiao’s Massive Scale: How the News App Handles Billions of Requests

This article provides an in‑depth technical overview of Toutiao’s rapid growth, data collection pipelines, user modeling, cold‑start strategies, recommendation engine architecture, storage solutions, push notification system, microservice design, and its three‑layer PaaS platform, illustrating how the news app serves hundreds of millions of users daily.

Big Data · Industry insight · Toutiao

0 likes · 8 min read

ITFLY8 Architecture Home

Apr 22, 2021 · Big Data

Inside Toutiao’s Massive Big Data & Recommendation Architecture

This article examines Toutiao’s rapid growth from a small startup to a platform serving over 500 million users, detailing its data collection, user modeling, cold‑start handling, recommendation engines, storage solutions, messaging push system, micro‑service design, and virtualized PaaS infrastructure that enable high‑throughput, personalized news delivery.

Cloud Computing · Microservices · User Modeling

0 likes · 9 min read

Xianyu Technology

Apr 22, 2021 · Big Data

Real-time Performance Optimization of the Mahé Selection and Delivery System

By classifying data streams, aggregating large‑scale T+1 records in six‑hour windows, encoding attributes with multi‑value mappings, storing compressed rule‑hit backups, and synchronizing recall tables in real time, Mahé’s selection‑and‑delivery pipeline cut end‑to‑end latency from minutes to seconds, achieving robust second‑level responsiveness.

Big Data · Performance Optimization · Real-time

0 likes · 12 min read

Full-Stack Internet Architecture

Apr 20, 2021 · Big Data

Building Near Real-Time Elasticsearch Indexes for PB‑Scale Data

This article explains how to construct near real‑time Elasticsearch indexes for petabyte‑level datasets by comparing MySQL limitations, describing Elasticsearch fundamentals, and detailing a pipeline that uses Hive, wide tables, MySQL binlog, Canal, and Otter to achieve second‑level index updates.

Big Data · Canal · Elasticsearch

0 likes · 18 min read

TAL Education Technology

Apr 15, 2021 · Big Data

Global Feature Pool Architecture and Workflow for Data‑Driven Growth

The article describes a unified global feature pool architecture that standardizes offline and real‑time feature production, management, and service layers using Hive, Spark, Flink, Kafka, MySQL, and Hologres to break data silos, improve algorithm development efficiency, and boost growth business performance.

Data Platform · data pipeline · feature engineering

0 likes · 7 min read

Top Architect

Apr 9, 2021 · Big Data

Technical Architecture and Data Processing of Toutiao News Feed System

This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.

Big Data · Microservices · Toutiao

0 likes · 8 min read

Ctrip Technology

Apr 1, 2021 · Big Data

Design and Implementation of a Binlog‑Based Real‑Time Data Foundation Layer for Ctrip Finance

This article describes how Ctrip Finance built a unified financial data center by collecting MySQL binlog streams with Canal, transporting them via Kafka, persisting to HDFS with Spark‑Streaming, and merging into Hive tables, while addressing performance, idempotency, delete handling, and data‑quality checks.

Big Data · Binlog · Real-time

0 likes · 14 min read

Meituan Technology Team

Mar 4, 2021 · Artificial Intelligence

How Meituan Waimai Scaled Feature Engineering for Billions of Requests

This article details Meituan Waimai's evolution from a simple feature framework to a sophisticated, configurable platform that handles massive feature production, multi‑task scheduling, dynamic protobuf storage, and a model‑feature description language (MFDL) to enable efficient online retrieval, high‑performance computation, and consistent training‑sample generation for its recommendation, advertising, and search services.

MFDL · Machine Learning Platform · Meituan

0 likes · 31 min read

dbaplus Community

Feb 23, 2021 · Big Data

How NetEase Game Teams Built a Scalable Flink‑Based Streaming ETL Platform

This article explains how NetEase games collect heterogeneous logs, design a Flink‑driven streaming ETL pipeline, handle schema‑free sources, implement Python UDFs with Jython, optimize HDFS writes, manage real‑time and offline warehouses, and share practical tuning and fault‑tolerance techniques.

ETL · Flink · Hive

0 likes · 22 min read

NetEase Smart Enterprise Tech+

Feb 4, 2021 · Operations

How NetEase Cloud Communication Builds a Real-Time Service Monitoring Platform

NetEase Cloud Communication’s service monitoring platform leverages data collection, preprocessing, alerting, and visualization pipelines—using HTTP APIs, Kafka, custom scripts, and NTSDB—to provide real-time insights, ensure stability, and support scalable, high‑throughput audio‑video services.

Operations · cloud communication · data pipeline

0 likes · 11 min read

Top Architect

Jan 17, 2021 · Big Data

Migrating LinkedIn’s Who Viewed Your Profile System from Lambda Architecture to a Lambda‑less Architecture

This article describes how LinkedIn’s Who Viewed Your Profile feature was originally built on a Lambda architecture, the operational challenges it caused, and the step‑by‑step migration to a streamlined, Samza‑driven, Lambda‑less design that improves performance, reduces maintenance overhead, and retains essential batch capabilities.

Lambda architecture · LinkedIn · Pinot

0 likes · 11 min read

Liulishuo Tech Team

Jan 12, 2021 · Big Data

Design and Implementation of CDC‑Based Real‑Time Data Ingestion with Delta Lake on Alibaba Cloud EMR

This article describes how Liulishuo replaced a DataX master‑slave pipeline with a CDC‑plus‑Delta Lake solution on Alibaba Cloud EMR, detailing architecture choices, streaming SQL merge logic, monitoring, challenges, and the resulting cost and latency improvements.

CDC · Delta Lake · EMR

0 likes · 17 min read

Laiye Technology Team

Dec 18, 2020 · Big Data

Comprehensive Overview of Laiye Technology's Business Intelligence Ecosystem

This article provides a detailed, end‑to‑end description of Laiye Technology's BI ecosystem, covering its background, development stages, data acquisition, transmission, transformation, loading, modeling, storage layers, statistical analysis, real‑time metrics, visualization, and future challenges, illustrating how the company builds a scalable, cloud‑native data‑driven platform.

Analytics · BI · Big Data

0 likes · 22 min read

DataFunTalk

Nov 27, 2020 · Big Data

Evolution of Kafka‑Based Data Pipeline at Chehaoduo Group: Architecture, Scaling, and Best Practices

This article chronicles the four‑year evolution of Chehaoduo Group’s Kafka ecosystem—from its initial role as a simple data‑ingestion layer to becoming the core of the company’s large‑scale data pipeline—detailing cluster management, upgrade strategies, multi‑cluster deployment, AVRO schema handling, SDK development, and operational lessons learned.

Avro · Schema Registry · cluster management

0 likes · 21 min read

58 Tech

Nov 25, 2020 · Databases

Design and Implementation of a Financial Fraud Detection Graph Network Using JanusGraph

This article presents a comprehensive overview of building a financial fraud detection graph network, covering background challenges, graph schema design, a four‑layer architecture with JanusGraph, data import pipelines, quality assurance, performance optimizations, and practical applications such as risk scoring, association analysis, and id‑mapping.

JanusGraph · Risk analysis · data pipeline

0 likes · 22 min read

Xianyu Technology

Nov 17, 2020 · Big Data

Xianyu Premium Product Library: Architecture and Implementation

Xianyu’s premium‑product library combines interpretable, multi‑dimensional metric models built from structured product and user attributes with real‑time and offline pipelines to systematically tag high‑quality items, delivering services via HSF and a message bus, and has driven over 20% click‑through growth and nearly doubled conversion rates.

Real-time Processing · data pipeline · feature engineering

0 likes · 7 min read

Ctrip Technology

Nov 12, 2020 · Artificial Intelligence

Ctrip Machine Translation Platform: Architecture, Data Construction, Algorithm Design, and Performance Optimization

This article presents a comprehensive overview of Ctrip's multilingual machine translation platform, detailing demand analysis, system architecture, data pipeline, algorithmic innovations such as task‑space fusion and term‑translation interventions, as well as extensive performance optimizations for low‑resource languages.

AI · Ctrip · Machine Translation

0 likes · 20 min read

Big Data Technology & Architecture

Nov 2, 2020 · Big Data

Log Collection and Processing Architecture with Flume and Kafka for Big Data Platforms

This article explains how to design a scalable log collection system for big‑data platforms by combining Flume for data ingestion, Kafka for buffering and high‑throughput transport, and downstream processing components, providing configuration examples and best‑practice recommendations.

Big Data · Flume · Real-time Processing

0 likes · 9 min read

System Architect Go

Nov 1, 2020 · Big Data

Introduction to Logstash: Basics, Installation, Configuration, and Plugins

This article introduces Logstash as an open‑source data‑pipeline tool, explains why it simplifies data ingestion, filtering and output, walks through installation and a first‑pipeline example, and provides a comprehensive overview of its input, filter, and output plugins with configuration snippets.

Configuration · ELK · Logstash

0 likes · 10 min read

Tencent Cloud Developer

Oct 29, 2020 · Cloud Computing

Distributed Atmospheric Monitoring System – Cloud Architecture, Module Implementation, and Cost Analysis

The paper describes Tencent’s community‑driven distributed atmospheric monitoring platform, detailing its multi‑layer cloud architecture, data ingestion and aggregation modules built with API Gateway, Serverless Functions, MySQL, and Cloud Map, and compares Phase II and Phase III operational costs while outlining future enhancements.

Distributed Monitoring · IoT · Serverless

0 likes · 11 min read

Xianyu Technology

Oct 15, 2020 · Industry Insights

Cutting Data Dashboard Development Time from Days to Hours: Xianyu’s 3‑Layer Serverless Solution

Xianyu transformed its slow, manual data‑analysis workflow (plagued by BI bottlenecks, slow SQL, and cumbersome front‑end integration) into a three‑layer, serverless architecture that abstracts SQL into reusable atoms, automates data pipelines, and delivers smart, seconds‑level visual dashboards, slashing development effort from five days to half a day.

Data Visualization · SQL abstraction · Serverless

0 likes · 12 min read

MaGe Linux Operations

Sep 20, 2020 · Big Data

Mastering Alibaba Canal: Step‑by‑Step Setup for Real‑Time MySQL Binlog Sync

This guide explains what Canal is, its key features and limitations, the underlying binlog replication principle, and provides detailed, step‑by‑step instructions for downloading, configuring, and launching both Canal Server and Canal Adapter to achieve high‑performance real‑time data synchronization.

Binlog · Canal · Docker

0 likes · 10 min read

ITPUB

Sep 14, 2020 · Big Data

How Alibaba’s DChain Data Converger Auto‑Generates Real‑Time Wide Tables with SQL Pipelines

This article explains how the ADC (Alibaba DChain Data Converger) project automatically creates large real‑time tables by letting users configure metrics on the front‑end, then generating and publishing SQL through a pipeline that leverages design patterns, priority queues, and tree‑based data structures for efficient cross‑database processing.

Flink · SQL Generation · data pipeline

0 likes · 15 min read

Big Data Technology & Architecture

Aug 19, 2020 · Big Data

Big Data ETL Project: Parsing Advertising JSON with Spark, IP Lookup, and Storing into Kudu

This tutorial describes how to place advertising JSON data on HDFS, use Spark for ETL and analysis, enrich logs with IP lookup, and persist the results into Kudu with daily scheduling, including code examples and schema definitions.

Big Data · ETL · IP lookup

0 likes · 17 min read

Tencent Cloud Middleware

Aug 12, 2020 · Big Data

How Serverless Functions Can Replace Traditional Kafka Data Pipelines for Lower Cost and Easier Scaling

This article explains how Tencent Cloud CKafka works, describes the challenges of traditional open‑source data‑flow solutions, and demonstrates a Serverless Function approach—complete with architecture diagrams and code examples—to achieve low‑cost, auto‑scaling Kafka‑to‑Elasticsearch pipelines.

Big Data · CKafka · Elasticsearch

0 likes · 12 min read

Efficient Ops

Jul 28, 2020 · Operations

How to Turn Ops Data into Business Value: A Practical Guide

This article explores the evolution and monetization of operations data, outlines a four‑stage management process—from data discovery to modeling, ingestion, and monetization—highlights key scenarios such as intelligent monitoring and root‑cause analysis, and offers practical recommendations for building an effective ops data platform.

AI · Data Management · Data Monetization

0 likes · 15 min read

WecTeam

Jul 23, 2020 · Backend Development

How We Reduced WebMonitor Latency from Minutes to Seconds – Architecture & Performance Secrets

This article chronicles the evolution of the WebMonitor front‑end monitoring system, detailing its three‑tier stack, data pipeline upgrades from raw disk sampling to HDFS and Elasticsearch, extensive collector‑side optimizations, Jetty thread and timeout tuning, and the resulting performance gains that lowered response times from minutes to sub‑second levels.

Java · Jetty · Monitoring

0 likes · 15 min read

Beike Product & Technology

Jul 16, 2020 · Backend Development

Kafka Connect: Introduction and Concepts for Data Pipelines

This article introduces Kafka Connect, a framework for building scalable data pipelines between Kafka and other systems, covering its architecture, key concepts like connectors and tasks, and practical deployment examples.

Backend Development · Big Data · ETL

0 likes · 20 min read

Ctrip Technology

Jul 16, 2020 · Big Data

Design and Architecture of the User Profiling System at Ctrip Business Travel

This article describes the concept, tag taxonomy, data flow architecture, and Lambda‑based query service design of Ctrip Business Travel's user profiling system, highlighting how batch and real‑time processing with Spark, Flink, Hive, MongoDB and Redis enable precise marketing, risk control and personalized services.

Big Data · Ctrip · data pipeline

0 likes · 12 min read

58 Tech

Jul 10, 2020 · Artificial Intelligence

Tag Mining for Used‑Car Business: NLP, Word2Vec, and Retrieval Pipeline

This article details the end‑to‑end process of extracting and leveraging tags for used‑car listings, covering data collection, segmentation, NLP‑based tokenization, word‑vector generation, tag‑library construction, and online retrieval flow to improve personalized recall and CTR.

Information Retrieval · NLP · Tagging

0 likes · 19 min read

dbaplus Community

Jul 7, 2020 · Big Data

How Flink + ClickHouse Power Real‑Time Analytics at Scale

This article explains how Qutoutiao builds a high‑performance real‑time analytics pipeline using Flink, Hive, and ClickHouse, covering business scenarios, hour‑level and second‑level Flink‑to‑Hive architectures, streaming file sink mechanics, multi‑user permissions, ClickHouse performance tricks, and future roadmap for unified stream‑batch storage.

Big Data · ClickHouse · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Jul 2, 2020 · Big Data

KSQL Quick Start: Deploying and Querying Kafka Data with Streaming SQL

This article introduces KSQL as a lightweight streaming SQL engine for Apache Kafka, explains its architecture and core concepts of streams and tables, and provides step‑by‑step deployment instructions, command‑line examples for creating streams/tables, querying data, and managing persistent queries.

Apache Kafka · Big Data · KSQL

0 likes · 10 min read

Ctrip Technology

Jun 29, 2020 · Backend Development

Optimizing Ctrip’s Vacation Search Engine: From Search 1.0 to 5.5

This article details the evolution and optimization of Ctrip’s vacation search engine, covering business challenges, indexing redesign, data collection pipelines, write‑path improvements, compression techniques, query performance enhancements, deployment strategies, and the resulting gains in storage, latency, and stability.

Index Optimization · backend · data pipeline

0 likes · 14 min read

DataFunTalk

Jun 18, 2020 · Big Data

Real-time Data Processing at QuTouTiao: Flink + ClickHouse Architecture and Practices

QuTouTiao uses Flink and ClickHouse to build a real‑time analytics platform that supports hourly Hive pipelines and ClickHouse queries, answering 80% of requests in under a second through streaming ingestion, exactly‑once semantics, multi‑cluster coordination, and optimized ClickHouse storage and connector designs.

Big Data · ClickHouse · Flink

0 likes · 16 min read

DataFunTalk

Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data · Data Engineering · Spark

0 likes · 19 min read

21CTO

May 12, 2020 · Big Data

Inside Toutiao’s Massive Data Pipeline: Architecture, Recommendation & Scaling

This article details Toutiao’s rapid growth and its large‑scale data pipeline, covering article crawling, user modeling, recommendation engines, storage solutions, push notifications, micro‑service architecture, and the underlying virtualization PaaS platform that powers its personalized news service.

Microservices · Toutiao · data pipeline

0 likes · 8 min read

Tencent Advertising Technology

May 2, 2020 · Artificial Intelligence

How to Use TI-ONE Built‑in Operators for the 2020 Tencent Advertising Algorithm Competition

This tutorial walks you through creating a TI‑ONE project, ingesting competition data, configuring and training a decision‑tree model with built‑in operators, running the workflow, and downloading and uploading the result files for the 2020 Tencent Advertising Algorithm Competition.

Decision Tree · Model Training · TI-ONE

0 likes · 7 min read

Taobao Frontend Technology

Apr 22, 2020 · Frontend Development

How Pipcook Leverages TensorFlow.js to Bring AI to Front‑End Development

This article explains how Pipcook combines TensorFlow.js with a JavaScript‑friendly pipeline to enable front‑end engineers to process data, train models, and deploy AI solutions, while comparing its approach to TFX and outlining future development and contribution opportunities.

data pipeline · machine learning

0 likes · 11 min read

Alibaba Terminal Technology

Apr 21, 2020 · Frontend Development

Boost Front-End Efficiency with Pipcook: Harnessing TensorFlow.js for AI Pipelines

This article explains how Pipcook leverages TensorFlow.js to create a JavaScript‑friendly machine‑learning pipeline for front‑end engineers, addressing skill gaps, data handling, model training, deployment options, and future roadmap to accelerate intelligent front‑end development.

AI · data pipeline · frontend

0 likes · 10 min read

ITPUB

Apr 12, 2020 · Big Data

Inside Toutiao’s Massive Data Pipeline and Real‑Time Recommendation Engine

This article details how Toutiao processes billions of daily page views, builds user models with Hadoop and Storm, runs real‑time recommendation and cold‑start personalization, and scales its microservice‑based architecture using Kafka, MySQL, MongoDB, Redis and a high‑throughput push system.

data pipeline · recommendation system

0 likes · 10 min read

Architecture Digest

Mar 15, 2020 · Big Data

Quick Guide to Deploying Alibaba Canal for Real‑Time MySQL Binlog Synchronization with Kafka and Zookeeper

This article provides a step‑by‑step tutorial on building a small‑scale data platform by installing MySQL, Zookeeper, Kafka and the open‑source Canal middleware, configuring Canal to capture MySQL binlog events, and forwarding the structured data to Kafka for downstream processing.

Canal · Real-time Sync · Zookeeper

0 likes · 20 min read

dbaplus Community

Mar 3, 2020 · Big Data

How MaFengWo Scaled Kafka for Real‑Time Big Data: Lessons and Best Practices

This article details MaFengWo's practical experience with Kafka in its big‑data platform, covering three core usage scenarios, a four‑stage evolution roadmap—including version upgrades, resource isolation, security and monitoring—and future plans such as transaction‑based deduplication and consumer throttling.

Big Data · Resource Isolation · data pipeline

0 likes · 17 min read

Qunar Tech Salon

Feb 21, 2020 · Artificial Intelligence

Building an End‑to‑End Data‑Model Loop for Alibaba XiaoMi AI Services

The article describes how Alibaba's XiaoMi AI platform constructs a closed‑loop pipeline—from data collection and annotation to model training, evaluation, and real‑time deployment—using multi‑dimensional data processing, visualization, and Spark‑based engines to accelerate iterative improvements and address operational pain points.

AI · Big Data · Model Training

0 likes · 9 min read

DataFunTalk

Feb 17, 2020 · Artificial Intelligence

Building a Closed‑Loop AI System: From Data Collection to Model Deployment in Alibaba’s XiaoMi

This article explains how Alibaba’s XiaoMi team constructs a full‑cycle AI pipeline—covering real‑time and offline data processing, high‑dimensional visualization, model training, iterative feedback, and Spark‑based deployment—to accelerate intelligent product iteration while addressing common engineering pain points.

AI · Big Data · Real-time Processing

0 likes · 10 min read

Big Data Technology & Architecture

Feb 16, 2020 · Big Data

Implementing MySQL Binlog Synchronization to HDFS Using Canal

This article details a step‑by‑step guide for deploying Canal to capture MySQL binlog events, configure HA with ZooKeeper, design a client that parses binlog into JSON, asynchronously acknowledges messages, archive data to local files for batch upload to HDFS, and monitor latency for alerts.

Big Data · Binlog · Canal

0 likes · 10 min read

dbaplus Community

Jan 12, 2020 · Big Data

How Xiaomi Achieved Real‑Time MySQL‑to‑Kudu Sync with Binlog and Talos

Facing MySQL performance bottlenecks at massive scale, Xiaomi built the LCSBinlog service that captures MySQL binlog events, streams them through the Talos platform, and writes to Kudu for real‑time BI, detailing architecture, job scheduling, consistency guarantees, use cases, and troubleshooting lessons.

Binlog · CDC · Kudu

0 likes · 13 min read

Top Architect

Jan 7, 2020 · Big Data

Technical Architecture Overview of Toutiao: Data Processing, User Modeling, and Recommendation System

This article provides a comprehensive overview of Toutiao's rapid growth and technical architecture, detailing its massive user base, data collection pipelines, user modeling, recommendation engines, storage solutions, message push mechanisms, micro‑service design, and virtualization PaaS platform.

Big Data · Microservices · Toutiao

0 likes · 8 min read

dbaplus Community

Nov 21, 2019 · Databases

How to Build a Real‑Time MySQL Statistics Platform with ClickHouse

This article explains how a growing company designed, optimized, and deployed a comprehensive MySQL monitoring and analysis pipeline—moving from Flume‑HDFS‑Hive to ClickTail‑ClickHouse, enriching SQL parsing, and applying practical methods for state statistics, trend analysis, permission management, and data‑skew detection.

DBA · Database Monitoring · SQL Analytics

0 likes · 16 min read

DataFunTalk

Nov 7, 2019 · Big Data

Real-Time Computing Engine at Beike: Architecture, Practices, and Future Plans

This article details Beike's real‑time computing engine, covering its background, streaming platform built on Spark Streaming and Flink, data ingestion via Kafka, metadata handling, SQL‑based task development, monitoring, storage solutions, and future roadmap for resource management and AI‑enhanced monitoring.

Big Data · Flink · Monitoring

0 likes · 14 min read

Big Data Technology & Architecture

Oct 30, 2019 · Big Data

Building a Real‑Time Data Processing Pipeline with Apache Kafka, Spark Streaming, and Cassandra

This tutorial explains how to create a highly scalable, fault‑tolerant real‑time data processing platform by configuring a Kafka topic, a Cassandra keyspace, adding Spark and connector dependencies, developing a Java‑based Spark Streaming pipeline, enabling checkpoints, and deploying the application with spark‑submit.

Big Data · Cassandra · Java

0 likes · 8 min read

Alibaba Cloud Developer

Oct 30, 2019 · Big Data

How Real-Time Big Data Pipelines Detect E‑Commerce Ad Misplacements

This article explains how a large‑scale e‑commerce search advertising system uses real‑time big‑data pipelines, log synchronization, NoSQL storage, and proactive verification to automatically discover and correct ad placement errors across the entire data processing chain, protecting both advertisers and the platform.

Big Data · ad verification · data pipeline

0 likes · 13 min read

58 Tech

Sep 6, 2019 · Big Data

Architecture and Technical Implementation of the WMDA Data Analytics Platform

The article details WMDA's end‑to‑end data analytics architecture, covering zero‑event data collection, real‑time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components collaborate to deliver comprehensive user behavior analysis.

Big Data · Druid · ETL

0 likes · 11 min read

Xueersi Online School Tech Team

Sep 6, 2019 · Big Data

Real-Time Data Architecture, Evolution, and Applications at an Online School

The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.

Analytics · Big Data Architecture · Flink

0 likes · 14 min read

Ctrip Technology

Sep 4, 2019 · Artificial Intelligence

Design and Implementation of Ctrip's User Precise Marketing System

This article details the design goals, architecture, core functionalities, and optimization strategies of Ctrip's user precise marketing system, which leverages RESTful integration, flexible rule-based and machine‑learning models, real‑time monitoring, and AB testing to improve traffic utilization and conversion rates.

AB testing · Ctrip · Marketing

0 likes · 11 min read

Xianyu Technology

Aug 28, 2019 · Big Data

Unified Search System Architecture and Automation for Multiple Business Scenarios

To avoid building separate search services for each Xianyu business, the team created a unified, generic search architecture based on Alibaba’s HA3 engine and a control layer that automates data dumping, indexing, query translation, and result ranking across five subsystems, enabling new services to be onboarded in minutes instead of weeks.

Automation · Big Data · Indexing

0 likes · 18 min read

HomeTech

Aug 15, 2019 · Big Data

Real‑Time Data Warehouse Development with Flink: Architecture, Implementation, and Lessons Learned

This article describes the motivation, technology selection, implementation details, and practical challenges of building a real‑time data warehouse using Flink, covering stream ingestion, data cleaning, dimension‑table joins, state backend choices, and operational lessons for large‑scale streaming pipelines.

Flink · Real-Time Data Warehouse · State Backend

0 likes · 8 min read

Youzan Coder

Aug 14, 2019 · Big Data

Comprehensive Guide to Data Collection, Event Modeling, and Tracking in Big Data Platforms

The guide explains how comprehensive data collection in big‑data platforms relies on a standardized event model, passive and code‑based embedding, multi‑platform SDKs, a log‑middleware layer, precise location tracking, and an embedding management platform that supports workflow, testing, quality monitoring, and scalable infrastructure for future enhancements.

Analytics · Big Data · Log Processing

0 likes · 19 min read

ITPUB

Jul 2, 2019 · Databases

How ClickHouse Powers Ctrip’s Hotel Data Platform for Billions of Daily Updates

This article explains how Ctrip’s hotel data intelligence platform handles over ten billion daily data updates and nearly a million queries by adopting ClickHouse, detailing the system's background, the reasons for choosing ClickHouse over other solutions, the data ingestion pipelines, monitoring strategies, operational practices, and performance outcomes.

Big Data · ClickHouse · Monitoring

0 likes · 13 min read

Ctrip Technology

Jun 26, 2019 · Databases

Applying ClickHouse for a High‑Performance Hotel Data Intelligence Platform

This article describes how Ctrip Hotel's data intelligence platform leverages ClickHouse to achieve real‑time analytics on billions of daily updates and millions of queries, detailing the system architecture, data ingestion pipelines, monitoring, and operational lessons learned for large‑scale, high‑availability data services.

Data Warehouse · data pipeline · hotel platform

0 likes · 12 min read

58 Tech

May 31, 2019 · Artificial Intelligence

Summary of 58 Group Technical Salon: Recommendation System Architecture and Search Ranking Algorithm Practices

The article summarizes the 58 Group technical salon where experts presented the microservice‑based recommendation system architecture, data and strategy layers, and the internally built search ranking platform covering sampling, feature engineering, and model training, highlighting practical implementations and lessons learned.

AI · Microservices · data pipeline

0 likes · 7 min read