Tagged articles

data pipeline

240 articles · Page 1 of 3

Jun 22, 2026 · Artificial Intelligence

Building DataFlow: An Industrial‑Grade LLM Data Pipeline from Documents to Training

The article presents DataFlow, an open‑source, GPU‑centric data‑engineering framework that tackles LLM data‑preparation bottlenecks with a two‑level operator taxonomy, an LLM‑driven WebAgent for automatic crawling, the MinerU PDF‑to‑Markdown converter, a Ray‑based distributed runtime, and extensive multimodal extensions. Quantitative experiments validate the design, showing significant quality gains across math, code, and reasoning benchmarks.

DataFlow · LLM · Multimodal

0 likes · 14 min read

PMTalk Product Manager Community

Jun 11, 2026 · Product Management

Three High‑Paying Skills Every AI Product Manager Needs

In the AI boom, product managers who can coordinate front‑end, back‑end, algorithm, data‑cleaning, and compute resources, and who master reverse‑engineering, rapid execution, and patient problem‑solving, command six‑figure salaries, as illustrated by a refund‑strategy redesign, a custom AI customer‑service deployment, and complex 3D point‑cloud labeling pipelines.

AI product management · AI workflow · LLM

0 likes · 10 min read

Top Architecture Tech Stack

Jun 4, 2026 · Artificial Intelligence

Why OpenHuman’s Architecture Beats Its 118 Integrations

OpenHuman’s Memory Tree architecture separates hot and cold data paths, uses content‑addressed IDs, and builds layered summaries, offering low‑latency queries and robust idempotency for AI agents that need continuous background learning.

Content Addressing · LLM · Layered Summaries

0 likes · 7 min read

StarRocks

May 28, 2026 · Industry Insights

How Fresha Built a Modern Real‑Time Analytics Stack with AutoMQ and StarRocks

Fresha replaced its Postgres‑Snowflake‑MSK pipeline with an AutoMQ‑based Diskless Kafka message layer and StarRocks for real‑time analytics, cutting storage costs 17‑20×, dropping query latency from seconds to sub‑second, and migrating ~1,000 topics in a week with zero downtime.

AutoMQ · Cloud Migration · Postgres

0 likes · 24 min read

Xiaohongshu Tech REDtech

May 19, 2026 · Artificial Intelligence

Agent‑Driven R&D Efficiency: Exploration and Practice at QECon Shenzhen 2026

At QECon Shenzhen 2026, Xiaohongshu's tech team will present five technical talks that showcase how AI agents are applied to architecture risk analysis, change automation, large‑model load‑testing data construction, end‑to‑end testing, and client‑side performance, illustrating concrete engineering solutions and measurable productivity gains.

AI Agent · Automation · LLM

0 likes · 13 min read

StarRocks

May 8, 2026 · Big Data

Scaling Real‑Time Analytics at KaptureCX: Best Practices with RisingWave and StarRocks

KaptureCX migrated its core analytics from ClickHouse to StarRocks, introduced RisingWave and Kafka for CDC, and achieved millisecond‑level query latency, a reporting cycle cut from weeks to one day, and a solid data foundation for AI‑driven services.

CDC · MVP · RisingWave

0 likes · 11 min read

Lao Guo's Learning Space

Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink

0 likes · 16 min read

Machine Heart

Apr 28, 2026 · Artificial Intelligence

Why a 7‑Month‑Old Startup Claims Human‑Like Robots Are Key to General Embodied Intelligence

The article details KAI, a 173 cm, 115‑DOF humanoid robot with tactile skin and a custom battery, and explains how its ultra‑human form, massive first‑person data collection, and three‑stage training pipeline are intended to enable a world‑model‑driven embodied AI system, while also acknowledging the engineering and market challenges ahead.

Embodied AI · Humanoid Robot · data pipeline

0 likes · 13 min read

Alibaba Cloud Big Data AI Platform

Apr 22, 2026 · Artificial Intelligence

How to Build an End‑to‑End Hand‑Video to VLA Data Pipeline on Alibaba Cloud PAI with Data‑Juicer

This article details a step‑by‑step, distributed pipeline built on Alibaba Cloud PAI using Data‑Juicer and Ray that transforms raw egocentric hand videos into LeRobot v2.0‑compatible Vision‑Language‑Action (VLA) training data, covering video splitting, frame extraction, camera calibration, 3D hand reconstruction, pose estimation, action captioning, and export, with code snippets, performance numbers, and references.

Data-Juicer · Distributed Computing · Embodied AI

0 likes · 29 min read

Alibaba Cloud Big Data AI Platform

Apr 13, 2026 · Artificial Intelligence

How to Build a Scalable Multimodal Data Pipeline with Alibaba Cloud PAI and DataJuicer

This article details a step‑by‑step guide for constructing a high‑performance multimodal data pipeline—covering video segmentation, duration filtering, frame extraction, safety and aesthetic scoring, and caption generation—using Alibaba Cloud PAI, Paimon, DataJuicer, and distributed frameworks like Ray and Daft, with real‑world performance metrics.

AI · Alibaba Cloud · Daft

0 likes · 30 min read

AI Large-Model Wave and Transformation Guide

Apr 11, 2026 · Artificial Intelligence

How to Engineer Reliable AI Models: From Infrastructure to Deployment

This article presents a comprehensive, step‑by‑step framework for turning laboratory AI models into production‑ready systems, covering capability mapping, technology stack choices, model selection, prompt engineering, data pipelines, training strategies, and cross‑team collaboration to ensure stability, observability, and trustworthiness.

AI model engineering · Model Deployment · Model Monitoring

0 likes · 14 min read

AI Large-Model Wave and Transformation Guide

Apr 11, 2026 · Artificial Intelligence

How to Build a Full‑Cycle Model Engineering System for Scalable AI

This article outlines a comprehensive, six‑part model engineering framework that transforms AI capabilities into reusable business functions, defines a stable technical stack, establishes model selection and architecture guidelines, implements rigorous control, data, and training processes, and explains how these layers synergize for reliable, scalable deployment.

AI Deployment · Model Training · Operations

0 likes · 27 min read

AI Engineer Programming

Apr 9, 2026 · Artificial Intelligence

Why Powerful AI Models Still Fail: The Real Infrastructure Challenges of Agents

Despite ever‑more capable large language models, AI agents frequently stumble because enterprise data is messy, pipelines introduce errors, RAG lacks timeliness and conflict resolution, and context assembly requires dedicated ingestion, resolution, selection, decay, and inference layers, plus a harness to manage execution and governance.

AI Agents · Enterprise AI · Harness

0 likes · 19 min read

Open Source Tech Hub

Apr 8, 2026 · Backend Development

Master Efficient PHP Data Pipelines with the Low‑Memory Flow Framework

This article introduces the Flow PHP data‑processing framework, highlights its ultra‑low memory footprint and extensible pipeline capabilities, and provides step‑by‑step installation and code examples for handling in‑memory arrays and CSV files in ETL workflows.

ETL · PHP · backend

0 likes · 4 min read

PaperAgent

Mar 3, 2026 · Artificial Intelligence

How CharacterFlywheel Scales Engaging LLMs: 15 Iterations of Production Optimization

The article presents CharacterFlywheel, a 15‑generation flywheel methodology that iteratively improves social‑dialogue LLMs in production using data‑driven reward models, rejection sampling, and a mix of SFT, DPO, and RL, with detailed experiments and best‑practice insights.

AI safety · LLM Optimization · Reward Modeling

0 likes · 12 min read

dbaplus Community

Jan 26, 2026 · Cloud Native

How Starbucks China Revamped Its Log Platform: From VMs to Cloud‑Native Kubernetes with 80% Faster Queries

Starbucks China’s logging team migrated several petabytes of logs from legacy VM‑based Elasticsearch clusters to a cloud‑native bare‑metal Kubernetes platform, upgrading ES from 7.x to 8.x, containerizing components, optimizing storage and Kafka, and achieving up to 80% query speed gains, 30% CPU reduction, and 200% write‑throughput improvement.

Performance Optimization · data pipeline · log platform

0 likes · 25 min read

StarRocks

Dec 18, 2025 · Databases

How Fresha Scaled Real‑Time Analytics with StarRocks: A Deep Dive into Their Hybrid Architecture

Facing Postgres overload and costly Snowflake queries, Fresha rebuilt its analytics platform by introducing StarRocks as a unified SQL entry point, combining federated lakehouse queries with high‑performance internal tables, which reduced homepage query latency to around 200 ms and achieved minute‑level data freshness across real‑time, historical, and search workloads.

Compute-Storage Separation · Hybrid Architecture · Lakehouse

0 likes · 20 min read

Baobao Algorithm Notes

Nov 13, 2025 · Artificial Intelligence

Introducing UNO‑Bench: The First Unified Omni‑Modal LLM Evaluation Suite

UNO‑Bench, an open‑source benchmark from Meituan’s LongCat team, provides the first high‑quality, low‑redundancy unified evaluation framework for omni‑modal large language models, featuring 1,250 manually annotated cross‑modal samples and 2,480 enhanced single‑modal samples covering 44 fine‑grained tasks and five modality combinations.

AI Scaling Law · benchmark · data pipeline

0 likes · 15 min read

Alibaba Cloud Developer

Nov 7, 2025 · Big Data

Unlock Enterprise‑Grade Data Pipelines with DMS Airflow: Features, Integration & Code Samples

This article introduces DMS Airflow, an enterprise‑level data workflow orchestration platform built on Apache Airflow, covering its advanced DAG capabilities, deep DMS integration, scheduling, task dependency management, dynamic task generation, resource scaling, security features, and practical code examples for SQL, Spark, DTS, and Notebook tasks.

Airflow · Big Data · DMS

0 likes · 20 min read

IT Services Circle

Oct 7, 2025 · Fundamentals

Unlock Python Dictionaries with __missing__: Transform Missing Keys into Smart Logic

This article explores Python's __missing__ dunder method, showing how it surpasses traditional approaches like if/else, .get() and defaultdict by enabling dynamic, self‑healing dictionary behavior, and demonstrates advanced real‑world applications such as smart counters, infinite nested dicts, API caching, and automatic data pipelines.

API cache · Python · __missing__

0 likes · 13 min read

DataFunSummit

Oct 5, 2025 · Artificial Intelligence

How Baidu’s AI‑Powered Code Assistant Is Revolutionizing Software Development

In this detailed presentation, Baidu’s engineering manager Yang Jingwei explains the current landscape, emerging trends, key challenges, data pipelines, model training, prompt engineering, multi‑platform support, and future outlook of Baidu’s intelligent code assistant and AI IDE, illustrating practical solutions and real‑world impact.

AI code assistant · Model Training · Prompt Engineering

0 likes · 26 min read

StarRocks

Sep 23, 2025 · Databases

How Zepto Scaled Real‑Time Brand Analytics with StarRocks: From Postgres MVP to Sub‑Second Queries

Zepto transformed its brand‑analytics platform from a Postgres MVP into a production‑grade, sub‑second real‑time analytics solution by adopting StarRocks, redesigning its data pipeline with Databricks, Kafka, and Flink, and choosing a storage‑compute architecture that supports massive joins and rapid insights.

Databricks · Flink · OLAP

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Aug 21, 2025 · Big Data

How Hypergryph Built a High‑Performance Real‑Time Analytics Platform with StarRocks

This case study details how Hypergryph leveraged Alibaba Cloud EMR Serverless StarRocks, Flink, and Kafka to replace a ClickHouse data warehouse with a high‑performance, elastic, and easy‑to‑operate real‑time analytics platform that dramatically improved query speed, stability, operational efficiency, and cost for their gaming business.

Cloud Computing · Flink · StarRocks

0 likes · 8 min read

Youzan Coder

Aug 4, 2025 · Artificial Intelligence

How to Quantify a “Good Image” for AI‑Generated E‑Commerce Graphics?

This article explains how to define and objectively evaluate the quality of AI‑generated product images for e‑commerce by decoupling assessment from the generation pipeline, selecting quantifiable metrics such as CLIPScore and Inception Score, building a lightweight evaluation system, cleaning and labeling data, and validating the approach with real‑world business and model datasets.

AI image evaluation · CLIPScore · Inception Score

0 likes · 26 min read

Efficient Ops

Jul 16, 2025 · Operations

Why Vector Is the High‑Performance Alternative to Logstash and Fluentd

This article introduces Vector, an open‑source, Rust‑based observability data pipeline that outperforms traditional tools like Logstash and Fluentd, covering its core features, concepts, installation script, minimal configuration example, and how it handles events, logs, metrics, and traces.

Configuration · Installation · Vector

0 likes · 5 min read

Alibaba Cloud Observability

Jul 14, 2025 · Operations

How New SPL Operators Supercharge Log Processing Performance

The latest SPL update introduces powerful operators like pack-fields, log-to-metric, and metric-to-metric, delivering dramatic performance gains, richer data transformation capabilities, and enhanced observability for cloud‑native log processing pipelines.

Log Processing · Performance Optimization · SPL

0 likes · 8 min read

Alibaba Cloud Developer

Jun 6, 2025 · Big Data

Why Observability 2.0 and SLS Data Pipelines Are Revolutionizing Log Analytics

This article explains how Observability 2.0 reshapes log, metric and trace management by unifying health views, introduces the evolution of Alibaba Cloud's SLS data pipeline, compares its three service modes, and demonstrates performance, cost and integration benefits for large‑scale, real‑time log processing.

Big Data · Observability · SLS

0 likes · 11 min read

Alibaba Cloud Native

May 20, 2025 · Cloud Native

How Observability 2.0 Redefines Cloud‑Native Log Pipelines and Cuts Costs by 66%

Observability 2.0 unifies logs, metrics and traces into a single platform, introduces event‑centric Wide Events, and drives a complete redesign of Alibaba Cloud's SLS data pipeline that delivers higher performance, lower latency, richer low‑code SPL processing, and up to a 66.7% reduction in processing costs.

Observability · SPL · cost optimization

0 likes · 12 min read

Big Data Tech Team

Apr 26, 2025 · Big Data

Mastering the Data Development Roadmap: From Infrastructure to AI Integration

This guide outlines a comprehensive data development roadmap, covering infrastructure setup, governance frameworks, automated pipelines, BI and analytics tools, AI/ML integration, cultural adoption, and continuous performance monitoring to enable intelligent business transformation.

AI integration · Analytics · Big Data

0 likes · 5 min read

DaTaobao Tech

Apr 9, 2025 · Operations

Proactive Alerting System for Taobao Special Edition: Design, Scope, and Solutions

The article outlines the design and implementation of a proactive alerting system for Taobao Special Edition, covering five alert categories—slot expiration, rights issues, configuration platforms, experiment audience expiration, and public‑opinion problems—detailing data‑driven rule engines, flexible integration, and successful 24‑hour inventory alerts while planning minute‑level rapid‑consumption warnings.

Risk Management · alert system · data pipeline

0 likes · 7 min read

JD Tech Talk

Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Monitoring · Stability

0 likes · 11 min read

Bilibili Tech

Mar 4, 2025 · Artificial Intelligence

Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) on Bilibili TTV Team

The Bilibili TTV team optimized OpenSora and CogVideoX text‑to‑video models by redesigning data storage with Alluxio, parallelizing VAE encoding, applying dynamic sequence‑parallel and DeepSpeed‑Ulysses attention, adapting GPU code for NPU execution, leveraging profiling‑driven kernel fusion, FlashAttention, and expandable memory to dramatically increase training efficiency and frame throughput, while outlining future pipeline‑parallel and ZeRO‑3 scaling plans.

FlashAttention · NPU · data pipeline

0 likes · 26 min read

Su San Talks Tech

Dec 8, 2024 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive datasets, introduces Elasticsearch’s inverted‑index architecture, and details a practical pipeline using Hive, wide tables, binlog, Canal, and Otter to achieve near real‑time indexing for petabyte‑level data.

Canal · Hive · Otter

0 likes · 19 min read

Chen Tian Universe

Dec 5, 2024 · Operations

How to Build a Scalable Reconciliation System for High‑Volume Transactions

This article explains the concepts, models, architecture, data management, project configuration, engine design, and error‑handling procedures needed to implement an automated, systematic reconciliation system that can handle large‑scale online transaction volumes with high accuracy and efficiency.

Automation · Error handling · Reconciliation

0 likes · 27 min read

Test Development Learning Exchange

Dec 1, 2024 · Big Data

How to Install Apache Airflow and Build a Simple Data Processing Pipeline

This tutorial guides you through installing Apache Airflow, initializing its database, starting the web server and scheduler, creating a Python DAG that reads, cleans, groups, and saves CSV data, configuring the DAG directory, and monitoring the pipeline via the Airflow web UI.

Apache Airflow · DAG · ETL

0 likes · 6 min read

Architect

Nov 15, 2024 · Frontend Development

How Bilibili Built a Scalable Front‑End Error Monitoring System from Scratch

This article details Bilibili's end‑to‑end front‑end error monitoring solution, covering the custom SDK, error capture and classification, unique ID generation, filtering, white‑screen detection, data pipelines, APM visualisation, lifecycle plugins, one‑click alerts, and future roadmap, all backed by real‑world metrics and code examples.

APM · Alerting · Bilibili

0 likes · 34 min read

DaTaobao Tech

Nov 15, 2024 · Big Data

Engineering Practices for a Billion‑Scale Image Asset Platform

The article recounts how the author built a billion‑scale AI image‑asset library by replacing a week‑long import with a clustered‑table, sharded pipeline, MD5‑based unique keys, a custom DataWorks task scheduler, and multi‑engine query layers, sharing practical engineering practices learned through successive iterations.

Big Data · Hashing · Image processing

0 likes · 14 min read

Baobao Algorithm Notes

Nov 14, 2024 · Artificial Intelligence

How I Built a 1B‑Parameter Chinese LLM on a Single A100: Lessons Learned

This article details the end‑to‑end process of pre‑training, fine‑tuning, and evaluating a 1‑billion‑parameter Chinese LLM named Steel‑LLM on limited hardware, covering data collection, pipeline design, training framework choices, architectural tweaks, performance results, and practical lessons for resource‑constrained developers.

LLM · Training Optimization · data pipeline

0 likes · 18 min read

Mingyi World Elasticsearch

Nov 9, 2024 · Big Data

AI‑Powered JD.com Review Collection, Indexing, and Kibana Visualization with Elasticsearch

The author builds a fully AI‑driven pipeline that scrapes JD.com comments about an Elasticsearch book, processes the data through cleaning and preprocessing, indexes it into Elasticsearch, and creates a series of Kibana visualizations, while reflecting on model selection and practical challenges.

AI Automation · ChatGPT · Elasticsearch

0 likes · 9 min read

Sohu Tech Products

Nov 6, 2024 · Operations

Design and Implementation of a Business Operation Log Management System Using Canal and Elasticsearch

The article presents a decoupled business operation log management architecture that uses Alibaba’s Canal to capture MySQL binlog changes, streams them through Kafka, and stores structured before‑and‑after records in Elasticsearch with nested mappings, enabling multi‑table correlation via transaction IDs, visual querying, and reliable rollback without modifying application code.

Canal · Elasticsearch · MySQL binlog

0 likes · 12 min read

Mingyi World Elasticsearch

Nov 3, 2024 · Big Data

How to Build a Scalable Business Operation Log System with Canal and Elasticsearch

This article walks through the design and implementation of a decoupled, high‑performance business operation log solution that captures MySQL binlog changes via Canal, streams them through Kafka, and stores and queries them in Elasticsearch, addressing challenges such as batch operations, multi‑table transactions, and non‑business data filtering.

Binlog · Canal · Elasticsearch

0 likes · 13 min read

Ctrip Technology

Sep 23, 2024 · Frontend Development

Intelligent Alert Attribution System for Ctrip Hotel Frontend: Design, Implementation, and Outcomes

This article details the design and deployment of an intelligent alert attribution system for Ctrip Hotel's front‑end, describing the background challenges, the unified data pool, weighted alert rules, three attribution algorithms, achieved improvements in accuracy and troubleshooting speed, and future enhancement plans.

Alert · Monitoring · attribution

0 likes · 18 min read

DevOps Operations Practice

Sep 1, 2024 · Operations

Understanding Logstash: Core Syntax, Filters, and Advanced Configuration

This article introduces Logstash’s core configuration syntax, explains key filter plugins such as grok, mutate, date, ruby, and aggregate, demonstrates conditional processing and multi‑event handling, and provides practical code examples to help readers efficiently parse, transform, and route log data.

Configuration · ELK · Filters

0 likes · 6 min read

Ctrip Technology

Aug 22, 2024 · Backend Development

Evolution of Ctrip Vacation Product Log System: From Single‑Table DB to ES + HBase Platform

This article details the three‑stage evolution of Ctrip's vacation product log system—from a simple single‑table DB approach, through a platform‑based ES + HBase solution, to a scalable V3.0 architecture that improves storage, search, and business empowerment while handling billions of log entries.

Elasticsearch · backend · data pipeline

0 likes · 16 min read

Top Architect

Aug 10, 2024 · Big Data

Design and Implementation of a Scalable Real-Time Log Monitoring Platform at Baidu

This article introduces Baidu's log platform that handles billions of daily events, explains UBC logging concepts and monitoring requirements, and details a low‑cost, high‑accuracy architecture using real‑time streaming, dimension mapping, watermarking, and time‑window aggregation to achieve reliable, scalable event monitoring.

Big Data · Log Monitoring · Real-time Streaming

0 likes · 14 min read

Smart Era Software Development

Jul 3, 2024 · Artificial Intelligence

Deploying Domain Models with Open-Source LLMs: Lessons from SECon 2024

The article analyzes the rapid rise of open‑source large language models, explains how Llama 3 serves as a strong base for domain‑specific models, details a data‑driven pipeline, fine‑tuning, reinforcement learning, engineering optimizations, and a comprehensive evaluation framework, and showcases the XuanYuan series that outperforms GPT‑4 on several finance benchmarks.

Llama 3 · data pipeline · domain model

0 likes · 12 min read

JD Cloud Developers

Jul 3, 2024 · Big Data

How to Build a High‑Availability Real‑Time Logistics Dashboard with Flink and ClickHouse

This article details the design and implementation of a high‑availability, real‑time logistics supply‑chain dashboard, covering Flink‑based data pipelines, ClickHouse OLAP storage, metric consistency, stability measures, extensible configuration, and comprehensive monitoring to ensure accurate, scalable performance during major promotions.

Big Data · ClickHouse · Flink

0 likes · 9 min read

JD Tech Talk

Jul 3, 2024 · Big Data

Real-time Monitoring Dashboard for Logistics Supply Chain: Architecture, Data Processing, and Stability Practices

This article describes the design and implementation of a high‑availability, real‑time logistics supply‑chain dashboard using Flink and ClickHouse, covering data processing pipelines, metric consistency, stability mechanisms, extensible configurations, and monitoring techniques to guide similar large‑screen projects.

ClickHouse · Flink · Stability

0 likes · 9 min read

JD Tech

Jul 2, 2024 · Big Data

Real‑Time Monitoring Dashboard for Logistics Supply Chain: Architecture, Data Modeling, and Stability Design

This article presents the design and implementation of a high‑availability, real‑time logistics supply‑chain monitoring dashboard, covering its data processing pipeline with Flink, storage choices between Elasticsearch and ClickHouse, multi‑layer architecture, metric consistency, stability mechanisms, extensibility configurations, and monitoring practices.

Big Data · ClickHouse · Elasticsearch

0 likes · 11 min read

iQIYI Technical Product Team

Jun 28, 2024 · Artificial Intelligence

Feature Center Overview in iQIYI's Opal Machine Learning Platform

The Feature Center in iQIYI’s Opal platform centralizes feature creation, storage, and real‑time access through a drag‑and‑drop DAG workflow and DSL‑driven transformations, handling massive QPS and low‑latency demands while enabling fast business iteration, cross‑team reuse, and monitoring for advertising, recommendation, and risk‑control applications.

Opal · data pipeline · real-time features

0 likes · 13 min read

Baidu Geek Talk

Jun 17, 2024 · Industry Insights

How Baidu Scales Real‑Time Event Monitoring for Billions of Log Events

This article explains Baidu's log platform architecture, the UBC event‑tracking protocol, monitoring requirements, and the low‑cost, high‑accuracy solutions—including dimension mapping, watermark handling, data trimming, and time‑window aggregation—that enable real‑time, customizable monitoring of petabyte‑scale log streams.

Log Monitoring · UBC · cost optimization

0 likes · 13 min read

Zhuanzhuan Tech

May 23, 2024 · Backend Development

Design and Implementation of a Channel Reconciliation System for Zhuanzhuan Payments

This article details the background, architecture, data preparation methods, massive‑data handling strategies, verification processes, and error‑handling mechanisms of Zhuanzhuan's channel reconciliation system, highlighting design choices such as binlog ingestion, task‑driven bill downloads, sharding with Hive archiving, and MQ‑based reconciliation to ensure financial data consistency and safety.

Hive · MQ · Reconciliation

0 likes · 11 min read

DataFunTalk

May 21, 2024 · Big Data

Applying Alluxio to Autonomous Driving Model Training: Deployment, Performance, and Operational Insights

This article details how Alluxio was adopted to replace NAS in autonomous driving model training, describing the data closed‑loop workflow, the challenges of the previous system, Alluxio's architectural benefits, deployment strategies across single and multiple data centers, functional and performance testing, operational tuning, and the resulting cost and efficiency gains.

Alluxio · Distributed storage · Model Training

0 likes · 15 min read

Alibaba Cloud Developer

Apr 25, 2024 · Big Data

What Happens Behind the Scenes When a SQL Query Runs in a Big Data Platform?

This article walks through the end‑to‑end lifecycle of a SQL task in a big‑data environment, covering creation, scheduling metadata, instance generation, resource allocation, ODPS execution, and final processing on the Fuxi distributed engine.

Fuxi · ODPS · SQL

0 likes · 11 min read

NetEase Cloud Music Tech Team

Apr 11, 2024 · Backend Development

Design and Implementation of an Online Configurable Data Consumption Service for NetEase Cloud Music Frontend Performance Monitoring (Corona)

The article details NetEase Cloud Music’s end‑to‑end, online‑configurable data‑consumption service and schema‑driven visualization platform that transform raw client logs into ClickHouse records, automatically generate tables and dashboards, and provide observability, dramatically reducing manual effort while supporting over twenty performance metrics for frontend monitoring.

ClickHouse · data pipeline · frontend

0 likes · 17 min read

DataFunSummit

Mar 22, 2024 · Artificial Intelligence

Risk Control Model Construction for Online Small Loans: Pre‑loan, In‑loan, Post‑loan and Monitoring

This article presents a comprehensive overview of risk control model building for online small‑loan scenarios, covering pre‑loan, in‑loan and post‑loan stages, the associated data pipelines, model deployment strategies, optimization attempts, and monitoring frameworks to ensure accuracy, stability and effectiveness.

Credit Scoring · Monitoring · data pipeline

0 likes · 16 min read

Bilibili Tech

Mar 22, 2024 · Backend Development

Design and Evolution of Incremental Indexing for Advertising Retrieval Systems

The article describes how an advertising retrieval system evolved from serial to parallel full builds and finally to a hybrid incremental indexing approach that records direct entity relationships during assembly, enabling fast reverse‑lookup of changed units via inverted indexes, reducing database load, latency, and rebuild overhead.

Backend Development · advertising system · data pipeline

0 likes · 20 min read

DataFunSummit

Mar 11, 2024 · Big Data

Evolution of iQIYI's Event Tracking System and Its Data Processing Pipeline

This article outlines the importance of event tracking for data, describes iQIYI's five‑stage tracking system evolution, analyzes the challenges of the self‑service phase, presents the middle‑platform improvements, explains the migration strategy, and details the downstream data lake, real‑time stream, and data‑warehouse processing workflows.

Data Engineering · data pipeline · iQIYI

0 likes · 13 min read

ByteDance Data Platform

Jan 24, 2024 · Big Data

How ByteDance Cut Billions in Event‑Tracking Costs with Smart Data Governance

This article details ByteDance's end‑to‑end event‑tracking cost governance, covering background, strategies, large‑scale data pipelines, resource challenges, control mechanisms, automated and supervised governance modes, and the substantial savings achieved through point filtering, tiered prioritization, and sampling.

cost governance · data pipeline · event tracking

0 likes · 16 min read

Alibaba Cloud Native

Dec 9, 2023 · Cloud Native

How Serverless Function Compute Transformed Log Processing for a FinTech Firm

Shuhe Technology replaced a cumbersome Kafka‑to‑ECS/K8s log‑processing pipeline with Alibaba Cloud Function Compute, achieving faster, more elastic, and cost‑effective handling of massive application logs while reducing operational overhead and simplifying maintenance.

Alibaba Cloud · Function Compute · Serverless

0 likes · 10 min read

JavaEdge

Nov 24, 2023 · Backend Development

Why Kafka Is the Ultimate Backbone for Modern Backend Systems

This article explores how Kafka serves as a versatile backbone for messaging, durable storage, log aggregation, monitoring, commit logs, recommendation pipelines, stream processing, CDC, system migration, and event sourcing, highlighting its performance, reliability, and practical deployment patterns.

Message Queue · Streaming · backend

0 likes · 10 min read

Ctrip Technology

Nov 23, 2023 · Big Data

Optimizing Data Warehouse Timeliness Using Metadata Lineage

This article presents a metadata‑driven approach to improve data warehouse timeliness by extracting upstream lineage, identifying over‑layered, duplicate, and critical‑path tasks, and applying targeted scheduling and code‑level optimizations, demonstrated with a hotel order wide‑table case study.

DAG · Data Warehouse · Lineage

0 likes · 7 min read

dbaplus Community

Nov 19, 2023 · Big Data

How Agoda Scales Apache Kafka: Two‑Step Logging, Monitoring, and Cost Attribution

This article details Agoda's evolution of Apache Kafka usage—from a two‑step logging architecture that separates developer concerns, through cluster layout, scaling metrics, monitoring and audit pipelines, to cost attribution, authentication, ACLs, and automation tools—highlighting trade‑offs and operational lessons learned.

Apache Kafka · cost management · data pipeline

0 likes · 17 min read

DataFunTalk

Nov 19, 2023 · Big Data

Design and Evolution of Zhihu's Event‑Tracking (埋点) System

This article presents a comprehensive overview of Zhihu's event‑tracking system, covering its evolution from early Hadoop‑based pipelines to cloud‑native architectures, detailing toolsets for requirement management, validation, data collection, querying, and service design, and concluding with a practical Q&A on best practices and optimization.

Data Quality · data pipeline · event tracking

0 likes · 12 min read

JD Retail Technology

Nov 16, 2023 · Big Data

Dada Platform Data Collection Migration: Value, Process, Architecture, and Technical Highlights

This report details the migration of Dada's unified data‑collection platform across 43 sites, outlining the achieved cost reductions, data‑analysis benefits, migration workflow, architectural design, technical highlights, and the challenges and solutions encountered during the project.

Analytics · Data Migration · data pipeline

0 likes · 6 min read

DataFunSummit

Nov 6, 2023 · Big Data

Building and Managing Huolala's User Event Tracking System: Architecture, Governance, and Monitoring

This article details Huolala's user event tracking (埋点) system, covering its background, challenges, the construction of a four‑module management platform, backend SDK design, monitoring and quality assurance mechanisms, and future plans for service integration, data lineage, and governance optimization.

Data Governance · Monitoring · backend SDK

0 likes · 16 min read

ByteDance Data Platform

Oct 11, 2023 · Backend Development

How Volcano Engine Rebuilt Its Ad‑Testing Platform for Scalability and Reliability

This article explains how Volcano Engine identified the tangled authorization, data‑fetching, and performance problems of its advertising AB‑testing platform and refactored it by splitting services, redesigning the data model with MySQL and ClickHouse, applying DAG scheduling, time‑wheel algorithms, Domain‑Driven Design, and rigorous unit testing to achieve a more stable, extensible backend solution.

AB testing · Advertising · DAG

0 likes · 16 min read

DataFunSummit

Oct 10, 2023 · Big Data

Real-Time Risk Insight: Architecture Evolution and Future Outlook

This article presents a comprehensive overview of the challenges, architectural evolution from version 1.0 to 3.0, core components, key technologies, and future directions of JD's real‑time risk insight platform, highlighting data integration, streaming processing, plugin mechanisms, and intelligent anomaly detection.

Anomaly Detection · architecture · data pipeline

0 likes · 18 min read

Bilibili Tech

Oct 10, 2023 · Backend Development

Design and Implementation of a Scalable Live‑Streaming Full‑Stream Data System

The article details a scalable live‑stream full‑stream data system that replaces a tightly‑coupled legacy architecture with a producer‑consumer model using a custom key‑value store, bucket sharding, gRPC server‑streaming, versioned caching, and comprehensive observability, achieving sub‑second queries, horizontal scalability, and reliable support for thousands of downstream services.

Live Streaming · Observability · data pipeline

0 likes · 18 min read

Code Ape Tech Column

Sep 21, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a detailed step‑by‑step deployment guide for ZooKeeper, Kafka, Filebeat, and ClickHouse, including common issues and their solutions.

Big Data · ClickHouse · Elasticsearch

0 likes · 14 min read

Bilibili Tech

Sep 1, 2023 · Big Data

Design and Implementation of Session‑Based User Engagement Tracking for Cloud TV Application

The Cloud Vision TV app implements a session‑id and placement‑id driven tracking pipeline that generates, collects, and processes lifecycle data across server and client layers, enabling fine‑grained engagement strategies, scene reconstruction via Aho‑Corasick (AC) automata, and actionable BI dashboards to improve user retention and personalization.

BI visualization · Mobile App · OLAP

0 likes · 14 min read

Smart Era Software Development

Aug 25, 2023 · Big Data

Evolving Enterprise Data Architecture for the Large‑Model Era: Practices and Case Studies

The article analyzes how enterprise data systems must be re‑engineered for large‑model applications, outlines the three‑stage data pipeline (ingestion, orchestration, interaction), introduces data‑virtualization techniques with virtual tables and intelligent materialization, and validates the approach with two banking case studies.

Big Data · Case Study · Data Architecture

0 likes · 14 min read

Huolala Tech

Aug 3, 2023 · Big Data

Building a Scalable Ad Attribution Platform: Architecture & Real‑Time Data Flow

This article explains how to design and implement a scalable ad attribution platform, covering data collection, real‑time processing with Kafka, storage in HBase, deduplication strategies, attribution models, and configurable media integration to maximize ROI for marketers.

Ad Attribution · HBase · Marketing Analytics

0 likes · 25 min read

JD Retail Technology

Jul 27, 2023 · Big Data

Big Data Dual-Stream Construction and High-Fidelity Pressure Testing Guidelines

This document outlines the standards, evaluation dimensions, and implementation process for dual‑stream construction in big‑data pipelines, describes high‑fidelity pressure‑testing methods and objectives, and provides migration procedures for business units not participating in the tests.

High Availability · data pipeline · dual-stream

0 likes · 8 min read

DataFunTalk

Jul 23, 2023 · Backend Development

Rearchitecting the Advertising AB Testing Platform: Service Decomposition, Data Modeling, DAG Scheduling, and DDD Practices

The article describes how Volcano Engine's DataTester team refactored the advertising AB testing platform by splitting services, redesigning the data model with MySQL and ClickHouse, introducing DAG‑based scheduling and a time‑wheel algorithm, and applying domain‑driven design and rigorous unit testing to improve stability, scalability, and maintainability.

AB testing · DAG scheduling · Domain-Driven Design

0 likes · 16 min read

DataFunTalk

Jul 15, 2023 · Big Data

Standardizing Event Tracking (埋点) at Bilibili: Practices, Challenges, and Applications

This article explains Bilibili's comprehensive approach to event‑tracking (埋点) standardization, covering the definition, data pipeline, common business issues, metadata‑driven design strategies, efficiency gains in accuracy, storage and querying, and future directions for automated data flow.

Bilibili · Standardization · data pipeline

0 likes · 21 min read

DataFunSummit

Jul 8, 2023 · Big Data

Data Preparation Practices at Douyin Group for Diverse Application Scenarios

This article explains Douyin Group's large‑scale data applications, introduces the concept and architecture of data preparation, details its four subsystems and modular capabilities, and showcases how these are applied in BI, CDP, and custom scenarios within the Volcano Engine ecosystem.

BI · Big Data · CDP

0 likes · 16 min read

Inke Technology

Jun 28, 2023 · Big Data

Extending Apache SeaTunnel for Trino and Kyuubi Integration: A Practical Guide

This article outlines the challenges of scaling data integration platforms, proposes a comprehensive solution using Apache SeaTunnel and Dinky, details the implementation of Trino and Kyuubi JDBC support, and describes the platform's architecture, task publishing workflow, logging, monitoring, resource management, and future enhancements.

Apache SeaTunnel · Data Integration · Kyuubi

0 likes · 16 min read

JD Tech

Jun 15, 2023 · Big Data

Event Bus: Architecture, Technical Challenges, and Solutions for High‑Throughput Data Standardization

This article introduces the event bus as a data pipeline for risk insight, explains its source‑transform‑sink architecture, outlines key technical challenges such as data heterogeneity and high‑throughput parsing, and presents solutions including standardized data models, plugin extensibility, low‑code hot‑loading, dynamic grouping, one‑click degradation, and traffic monitoring.

Standardization · data pipeline · event bus

0 likes · 14 min read

Huolala Tech

May 25, 2023 · Big Data

How Huolala Solved HBase Bulkload Challenges: A Practical Guide

This article details Huolala’s experience building a unified Hive‑to‑HBase pipeline, addressing low development efficiency, lack of monitoring, and HBase instability by evaluating two architectures, implementing a generic Transform tool, optimizing compaction and DistCp, and establishing stability and data‑validation mechanisms.

Compaction · Distcp · HBase

0 likes · 12 min read

Architecture Digest

May 11, 2023 · Backend Development

Design and Evolution of Vivo's Points Task System

This article details the conception, architectural evolution, and technical implementation of Vivo's points task system, covering its business model, Fogg behavior model, multi‑stage development, behavior SDK, data collection, rule engine, system stability measures, and future enhancements.

Points System · behavior SDK · data pipeline

0 likes · 14 min read

ITPUB

Apr 25, 2023 · Big Data

Top 8 Open‑Source ETL Tools for Data Migration and Integration

This article reviews eight widely used ETL and data‑migration tools—including Kettle, DataX, DataPipeline, Talend, DataStage, Sqoop, FineDataLink, and Canal—detailing their core features, architectures, supported data sources, and typical usage scenarios to help practitioners choose the right solution.

Big Data · Data Integration · Data Migration

0 likes · 13 min read

Qunar Tech Salon

Apr 17, 2023 · Mobile Development

Quantifying Mobile App User Experience Value: Design, Metrics, and Technical Implementation

This article presents a comprehensive approach to measuring and visualizing user experience value in a mobile app, covering background challenges, metric definitions, data infrastructure, technical solutions, platform construction, analysis results, and a repeatable SOP for continuous improvement.

Mobile App · React Native · data pipeline

0 likes · 21 min read

ITPUB

Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, improving latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing

0 likes · 19 min read

Alimama Tech

Mar 22, 2023 · Big Data

Intelligent Merchant‑Side Diagnostic System: Architecture, Rule Engine, and Data Center

The article describes an intelligent merchant‑side diagnostic platform that unifies ad‑operation data in a centralized lake, uses a low‑code rule engine with arithmetic, code, and Java class modes to orchestrate reusable SOPs, and employs an acceleration layer for fast large‑scale queries, achieving over 90% coverage and outlining future expansion.

Rule Engine · SQL · data pipeline

0 likes · 12 min read

JD Tech

Mar 7, 2023 · Product Management

Design and Future Planning of the New Membership Badge System

This article provides a comprehensive overview of the new membership badge system, detailing its background, the shortcomings of the previous model, the redesigned badge hierarchy, business value, product architecture, dynamic business-line integration, core data interaction, periodic evaluation processes, SaaS capabilities, and future development plans.

SaaS · User Segmentation · badge system

0 likes · 14 min read

DataFunSummit

Feb 4, 2023 · Artificial Intelligence

Walle: An End‑to‑End, General‑Purpose, Scalable Edge‑Cloud Collaborative Machine Learning System

The article introduces Walle, Alibaba's four‑year‑old edge‑cloud collaborative machine‑learning platform that unifies compute containers, data pipelines, and a deployment platform to enable low‑latency, privacy‑preserving, and high‑throughput AI services across billions of mobile devices, and presents its architecture, design challenges, and evaluation results.

Cloud Computing · data pipeline · edge computing

0 likes · 25 min read

IT Architects Alliance

Jan 27, 2023 · Big Data

Technical Architecture Overview of Toutiao (Jinri Toutiao) News Platform

The article provides a comprehensive technical overview of Toutiao's growth, data collection, user modeling, recommendation engine, storage solutions, message push system, and its micro‑service and virtualized PaaS architecture, highlighting the massive scale and engineering practices behind the platform.

Microservices · Toutiao · architecture

0 likes · 8 min read

ITPUB

Jan 19, 2023 · Big Data

How Real‑Time Data Warehouses Power Advertising: Architecture, Standards, and Best Practices

This article summarizes Liu Chong's DTCC2022 talk on building a real‑time advertising data warehouse, covering business context, layered model design, development technologies such as Flink and Kafka, full‑link quality assurance, practical implementation details, and future architectural directions.

advertising analytics · data pipeline · kafka

0 likes · 21 min read

Baidu Intelligent Cloud Tech Hub

Jan 18, 2023 · Artificial Intelligence

How Baidu’s AI Cloud Powers Scalable Autonomous Driving Solutions

This article outlines Baidu Intelligent Cloud’s end‑to‑end autonomous driving platform, detailing its AI foundation, massive cloud‑based data and compute requirements, flexible deployment strategies for various manufacturers, and comprehensive toolchains for data collection, annotation, training, simulation, and compliance.

AI platform · Baidu · Cloud Computing

0 likes · 12 min read

NetEase Cloud Music Tech Team

Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · Monitoring · baseline governance

0 likes · 11 min read

Code Ape Tech Column

Jan 3, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using ZooKeeper, Kafka, Filebeat, and ClickHouse, along with common issues and their solutions.

Big Data · ClickHouse · Elasticsearch

0 likes · 15 min read

DataFunTalk

Nov 30, 2022 · Big Data

Design and Practice of Yanxuan A/B Scientific Experiment Platform

The article presents the design, scientific methodology, system architecture, and case studies of Yanxuan's A/B testing platform, detailing how statistical principles, automated tracking, traffic allocation models, and unified reporting accelerate decision‑making and reduce development effort in e‑commerce experiments.

A/B testing · Automation · data pipeline

0 likes · 15 min read

DataFunTalk

Nov 25, 2022 · Operations

Overview of Volcano Engine A/B Experiment System Platform

This article presents a comprehensive overview of Volcano Engine's A/B testing platform, detailing its four core stages—reliable experiment system, efficient data construction, scientific statistical analysis, and fine-grained governance—while explaining execution components, data pipelines, statistical methods, and operational best practices for large‑scale experimentation.

A/B testing · Big Data · Experiment Platform

0 likes · 16 min read

vivo Internet Technology

Nov 16, 2022 · Big Data

Vivo Hawking A/B Experiment Platform: Architecture, Practices, and Solutions

The Vivo Hawking platform provides a company‑wide, one‑stop A/B testing solution with a layered architecture, covariate‑balanced split algorithms, real‑time monitoring, and unified SDKs for Android, Java and H5, enabling thousands of daily experiments, automated analysis, and rapid product iteration across multiple departments.

Covariate balancing · Experiment Platform · Hive

0 likes · 22 min read

Liulishuo Tech Team

Oct 18, 2022 · Big Data

How to Build a Near‑Real‑Time Metric Management System with Flink, Kafka, and Trino

This article outlines the design and implementation of a near‑real‑time metric management platform at Liulishuo, detailing its data flow—from Kafka ingestion through Flink‑SQL processing into Hudi tables, Trino querying, metric configuration, lineage, visualization, alerting, scheduling, and future optimization plans.

Flink · Hudi · Trino

0 likes · 7 min read

58 Tech

Oct 11, 2022 · Operations

Design and Implementation of the “Sentinel” Monitoring System for Enterprise Data Reporting

The article details the background, five‑layer architecture, core modules, data model, processing, storage, and alert strategies of the Sentinel monitoring system built on Nebula Graph and integrated with Enterprise WeChat, highlighting its real‑time monitoring, task tracing, and the resulting improvements in reporting timeliness and reliability.

Enterprise WeChat · Monitoring · Nebula Graph

0 likes · 13 min read

DevOps Cloud Academy

Sep 15, 2022 · Big Data

Understanding Apache Airflow DAGs and Best Practices

This article explains what Apache Airflow DAGs are, describes their architecture and how they model data pipelines as directed acyclic graphs, and provides practical best‑practice guidelines for writing clean, reproducible, and resource‑efficient workflows.

Apache Airflow · DAG · best practices

0 likes · 10 min read

Tencent Cloud Middleware

Sep 6, 2022 · Cloud Computing

Quickly Set Up One‑Click Data Ingestion Pipelines in Tencent Cloud Elasticsearch

This guide explains how to use Tencent Cloud Elasticsearch Service’s one‑click data‑link visual integration with CKafka to create end‑to‑end pipelines—covering source selection, component configuration, data collection, caching, processing, and destination setup—for both CVM and TKE environments, while reducing operational overhead.

CKafka · CVM · Cloud Computing

0 likes · 9 min read

Bilibili Tech

Sep 6, 2022 · Big Data

Lancer: Evolution of Bilibili's Real-Time Streaming Architecture

Lancer, Bilibili’s real‑time streaming backbone, has evolved from a monolithic Flume pipeline to a log‑id‑isolated, Kubernetes‑native architecture where Go edge agents feed synchronous Kafka‑proxied gateways into per‑logid topics processed by dedicated Flink‑SQL jobs, delivering exactly‑once, back‑pressured, highly scalable data ingestion for billions of daily requests.

Big Data · Flink · Lancer

0 likes · 29 min read
