Tagged articles
3675 articles
DataFunTalk
Mar 1, 2023 · Databases

Evolution and Optimization of Tencent Music Content Library Data Platform: From Architecture 1.0 to 4.0

This article details the evolution of Tencent Music's content library data platform from version 1.0 to 4.0, describing business requirements, architectural redesigns—including migration from ClickHouse to Apache Doris, introduction of a semantic layer, and extensive write, query, and cost optimizations—while sharing practical lessons and future directions.

Apache Doris · Big Data · Flink
0 likes · 21 min read
macrozheng
Feb 28, 2023 · Big Data

How Tencent Music Scaled Its Content Data Platform with Apache Doris: From ClickHouse to 4.0 Architecture

This article details the evolution of Tencent Music's content data platform from version 1.0 to 4.0, describing the migration from ClickHouse to Apache Doris, the introduction of a semantic layer, and optimizations to data ingestion, query performance, and cost that dramatically improved data timeliness, operational efficiency, and storage costs.

Apache Doris · Big Data · Data Architecture
0 likes · 23 min read
DataFunTalk
Feb 27, 2023 · Big Data

Comprehensive Overview of Data Middle Platform Architecture and Its Core Frameworks

This article provides a detailed overview of data middle platform concepts, describing a decoupled six‑subsystem architecture—including storage, collection, processing, governance, security, and operation frameworks—while illustrating typical enterprise implementations, industry‑specific solutions, and best‑practice considerations for building scalable, secure, and value‑driven data platforms.

Big Data · Data Governance · Data Integration
0 likes · 25 min read
Programmer DD
Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data · Hadoop · Performance
0 likes · 16 min read

How NetEase Yanxuan Migrated from Lambda to Iceberg for Real‑Time Batch‑Stream Integration

This article details how NetEase Yanxuan transformed its data platform from a dual Lambda architecture to a unified batch‑stream solution built on Apache Iceberg, covering the original challenges, the evaluation of Iceberg versus Hudi and Delta Lake, implementation of stream‑batch pipelines, message ordering fixes, snapshot generation, and extensive table‑governance optimizations.

Apache Flink · Apache Spark · Batch-Stream Integration
0 likes · 14 min read
DataFunTalk
Feb 26, 2023 · Big Data

Design, Optimization, and Use Cases of Data Lineage in ByteDance's DataLeap Platform

This article presents an in‑depth overview of DataLeap's data lineage capabilities, covering the challenges, multi‑layer model design, implementation with Apache Atlas and JanusGraph, performance optimizations, diverse use cases across asset, development, governance and security domains, and future trends for lineage technology.

Apache Atlas · Big Data · Data Governance
0 likes · 19 min read
21CTO
Feb 25, 2023 · Big Data

Which IT Skills Earn Over $140K? 2023’s Top-Paying Tech Expertise Revealed

Based on Dice’s 2023 Tech Salary Report, the article lists the ten highest‑earning IT skill sets in the U.S., detailing average salaries—often exceeding $140,000—and explains why expertise in areas such as containers, Kubernetes, PaaS, Redis, Teradata, Kafka, Elasticsearch, and Go commands premium pay.

2023 · Big Data · Cloud Computing
0 likes · 10 min read
DataFunTalk
Feb 25, 2023 · Big Data

T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices

This article details T3 Travel’s exploration of the Modern Data Stack, describing its four‑point overview, business scenarios, the initial MDS implementation using Apache Hudi and Kyuubi, and the design of a feature platform that integrates Metricflow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.

Apache Hudi · Big Data · Data Lake
0 likes · 22 min read
DeWu Technology
Feb 24, 2023 · Big Data

Real-Time Data Architecture Evolution for a Complex Supply Chain

The article traces Dewu’s supply‑chain data platform from slow MySQL reporting through early CDC‑based wide tables to a Flink‑Kafka‑ClickHouse 1.0 design, then to a more scalable Flink‑Kafka‑Hologres 2.0 architecture that solves upsert and compute‑storage separation, while detailing key operational tricks, code‑generation tools, and future plans for lake‑house integration.

Big Data · ClickHouse · Flink
0 likes · 10 min read
StarRing Big Data Open Lab
Feb 24, 2023 · Big Data

What Makes MPP Databases the Powerhouse Behind Modern Data Analytics?

MPP (Massively Parallel Processing) databases, designed for large‑scale analytical workloads, use distributed, shared‑nothing architectures with multiple control and compute nodes, offering high scalability, diverse data‑sharding strategies, and powerful SQL compatibility, as illustrated by vendors like Teradata, Vertica, Greenplum, and emerging open‑source solutions.

Big Data · Greenplum · MPP
0 likes · 15 min read
Big Data Technology & Architecture
Feb 24, 2023 · Big Data

Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeout, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data · Checkpoint · Flink
0 likes · 21 min read
DataFunTalk
Feb 21, 2023 · Databases

Building a Stream‑Batch Integrated Data Architecture with Apache Doris at SelectDB

This article details how SelectDB’s data technology architect designed and implemented a new stream‑batch unified data platform using Apache Doris, covering the shortcomings of the early CDH‑based architecture, the selection process, data modeling, ingestion pipelines, performance testing, operational optimizations, and future plans.

Apache Doris · Batch Processing · Big Data
0 likes · 17 min read
ITPUB
Feb 20, 2023 · Databases

Why Teradata Is Leaving China and What It Means for the Domestic Data Warehouse Market

Teradata's withdrawal from China, driven by geopolitical tensions and the rise of mature domestic data‑warehouse solutions, prompts a detailed look at its MPP architecture, the three main Chinese warehouse designs, Gartner market positioning, and migration tools for alternatives like GBase 8a and GaussDB DWS.

Big Data · GBase · GaussDB
0 likes · 9 min read
DataFunSummit
Feb 20, 2023 · Product Management

Evaluating the Value of Data Products: Scenarios, Frameworks, and Improvement Methods

This article explains why data product value assessment is essential, outlines common usage scenarios and a DBA evaluation framework, describes quantitative methods such as usage, business, and data‑driven metrics, and offers practical ways to enhance data product value through metric optimization, high‑value direction selection, and resource allocation.

Big Data · Data Product · Metrics
0 likes · 13 min read
DataFunTalk
Feb 20, 2023 · Big Data

Understanding Data Lakes and Their Application at iQIYI: Concepts, Scenarios, and Iceberg Implementation

This article explains the definition of data lakes (public‑cloud and non‑public‑cloud), outlines their key characteristics, presents three typical business scenarios—real‑time event analysis, change‑data analysis, and stream‑batch integration—summarizes required product features, evaluates open‑source lake formats, and details iQIYI's adoption of Apache Iceberg across multiple services to achieve low‑latency, large‑scale, cost‑effective analytics.

Big Data · Data Lake · Iceberg
0 likes · 23 min read
Alibaba Cloud Big Data AI Platform
Feb 20, 2023 · Big Data

How Alibaba’s DataWorks Transforms Data Governance for Efficiency, Security, and Cost Savings

This article explores Alibaba's DataWorks platform and its comprehensive data governance practices, covering application efficiency, security controls, cost optimization, organizational structure, and cultural initiatives that together enable scalable, secure, and cost‑effective data management across the enterprise.

Big Data · Cost Optimization · Data Governance
0 likes · 31 min read
DataFunTalk
Feb 18, 2023 · Big Data

Xiaomi Data Governance Evolution: Cost Governance Practices for HDFS and HBase

The article outlines Xiaomi's data governance journey, focusing on storage‑service cost governance, describing the transition from simple cost‑centered governance to big‑data‑driven asset management, and detailing concrete HDFS and HBase practices that achieved significant resource and cost reductions.

Big Data · Data Governance · HBase
0 likes · 15 min read
DataFunSummit
Feb 17, 2023 · Big Data

Data Governance Practices and Platform Construction with Alibaba DataWorks

Alibaba’s DataWorks team shares extensive experiences in building and operating a large‑scale data platform, covering data governance across stages—from data stability and quality to security, cost control, and organizational culture—illustrating how systematic practices and tools drive efficiency, reliability, and value for enterprises.

Big Data · Cost Optimization · Data Governance
0 likes · 55 min read
DataFunTalk
Feb 17, 2023 · Big Data

Tencent Alluxio (DOP) Deployment and Optimization in Financial Data Analytics

This article describes how Tencent's Alluxio-based Data Orchestration Platform (DOP) was applied to financial analytics, detailing the business background, challenges of large‑scale OLAP workloads, the Alluxio architecture and usage modes, performance results, and the series of optimizations and tuning performed to achieve significant speedups.

Alluxio · Big Data · Data Orchestration
0 likes · 15 min read
Tencent Advertising Technology
Feb 17, 2023 · Big Data

Cost Optimization and Mixed‑Resource Deployment in Tencent's Taiji Machine Learning Platform

The article details how Tencent's Taiji machine‑learning platform reduces training costs and improves efficiency for large‑scale advertising models by leveraging cloud‑native mixed‑resource strategies—including online idle, offline elastic, and compute‑resource sharing—while maintaining high service stability through advanced scheduling, fault‑tolerance, and resource‑prediction techniques.

Big Data · Cloud Native · Machine Learning Platform
0 likes · 16 min read
DataFunSummit
Feb 16, 2023 · Artificial Intelligence

Curated Collection of Articles on AI‑Powered Smart Medicine

This guide introduces the challenges in healthcare, explains how artificial intelligence is already reshaping the field, and provides a curated list of recent articles on smart medicine for readers to explore the emerging AI‑healthcare integration.

AI · Big Data · Healthcare
0 likes · 4 min read
DataFunSummit
Feb 16, 2023 · Big Data

JD Real-Time Data Product Practice: Overview, Low‑Code Platform, Stream‑Batch Integration, and Operations

This article summarizes JD's real‑time data product practice, covering product overview, low‑code real‑time platform construction, stream‑batch integrated architecture, and the three‑layer operational defense model, while highlighting challenges, evolution, user distribution, and future directions.

Big Data · Low‑code platform · real-time data
0 likes · 13 min read
Kuaishou Big Data
Feb 15, 2023 · Big Data

Kuaishou’s Data Application Factory: Boosting BI with Low‑Code & Unified Queries

This article details how Kuaishou’s Data Application Factory tackles the challenges of rapid BI delivery, data accuracy, and service stability by leveraging low‑code development, unified query services, standardized configurations, and service isolation to achieve efficient, high‑quality data products across multiple business lines.

BI · Big Data · Unified query
0 likes · 16 min read
Alimama Tech
Feb 15, 2023 · Big Data

Dolphin: Alibaba's Hyper‑Converged Multi‑Modal Big Data Engine Overview

Dolphin, Alibaba’s hyper‑converged multi‑modal big‑data engine, unifies OLAP, AI, streaming, and batch workloads on a decoupled compute‑storage MPP foundation, offering a Dolphin SQL layer, advanced bitmap/GroupTable/AFile indexes, intelligent materialization, and one‑write‑multiple‑read storage that cuts costs over 70% while delivering sub‑millisecond queries on trillion‑row datasets.

AI · Big Data · OLAP
0 likes · 14 min read
DataFunTalk
Feb 15, 2023 · Big Data

Alluxio Deployment at Ant Group: Stability Building, Performance Optimization, and Scale‑up for Large‑Scale Model Training

This article summarizes how Ant Group introduced Alluxio to address storage I/O, capacity, and latency challenges in large‑scale model training, detailing stability improvements through worker‑register follower and master migration, performance gains via follower‑only reads, and horizontal scaling using metadata sharding and multi‑cluster deployment.

Alluxio · Big Data · Model Training
0 likes · 15 min read
ByteDance Data Platform
Feb 15, 2023 · Databases

How ByteHouse Powers Real‑Time Data Warehousing at Scale

ByteHouse, a cloud‑native data warehouse built on ClickHouse, delivers ultra‑fast real‑time and massive offline analytics with elastic scaling, addressing business needs in ByteDance and the financial sector through optimized architecture, ROI‑driven monitoring, and comprehensive operational tools.

Big Data · ByteHouse · ClickHouse
0 likes · 16 min read
Data Thinking Notes
Feb 14, 2023 · Big Data

How Cloud Music Turned 60k Tables into Valuable Data Assets

This article details Cloud Music's year‑long data assetization journey, covering the background, practical achievements, governance methods, and future roadmap for turning massive data warehouses into high‑value, well‑governed assets that drive cost reduction and business insight.

Big Data · Data Governance · Data Platform
0 likes · 10 min read
Alibaba Terminal Technology
Feb 14, 2023 · Artificial Intelligence

How ChatGPT Is Reshaping Front‑End Development and Data Engineering

This article reflects on the rapid rise of ChatGPT, reviews key AI concepts and high‑quality external resources, analyzes its current limitations, and explores how the technology is transforming front‑end development, big‑data workflows, and engineers' daily practices, offering practical advice for adapting to the AI‑driven future.

Big Data · Productivity
0 likes · 18 min read
DataFunTalk
Feb 12, 2023 · Big Data

Optimizing Bilibili Presto Cluster Query Performance with Alluxio and Local Cache

This article presents a comprehensive technical overview of Bilibili's Presto cluster architecture, the challenges of query performance on Hadoop, and the systematic optimizations—including Alluxio integration, local cache mechanisms, multi‑active coordinators, label‑based scheduling, and real‑time penalties—that together improve availability, stability, and latency for large‑scale analytics workloads.

Alluxio · Big Data · Cache
0 likes · 23 min read
Sohu Tech Products
Feb 8, 2023 · Big Data

Design and Implementation of a General H5 User Behavior Tracking and Data Warehouse Model

This article presents a comprehensive H5 (HTML5) tracking solution that details the planning of event‑collection points, the full data‑warehouse modeling process—including schema design, retention calculations, and SQL implementations—and the automatic data‑capture mechanisms needed to improve user‑behavior analysis efficiency across the product lifecycle.

Big Data · H5 analytics · data-warehouse
0 likes · 17 min read
Architects' Tech Alliance
Feb 8, 2023 · Artificial Intelligence

Computing‑in‑Memory (CiM) Technology: Concepts, History, Advantages, Classifications and Application Scenarios

This article provides a comprehensive overview of Computing‑in‑Memory technology, covering its definition, historical evolution, performance advantages over traditional von Neumann architectures, various technical classifications, storage‑media choices, market drivers, and its pivotal role in AI and big‑data workloads across edge, cloud and automotive domains.

AI acceleration · Big Data · Memory Architecture
0 likes · 17 min read
DataFunSummit
Feb 8, 2023 · Product Management

Content‑Driven Data Product Management: Challenges, Governance Frameworks, and Implementation Strategies

This article shares practical insights from a data product expert on the problems faced by content‑oriented data products, outlines a comprehensive governance methodology—including DAMA, Huawei, and Alibaba frameworks—and demonstrates how to operationalize these ideas through concrete examples such as event‑tracking and metric governance.

Big Data · Data Governance · Data Product Management
0 likes · 16 min read
StarRing Big Data Open Lab
Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · MapReduce · Spark
0 likes · 12 min read
Alibaba Cloud Big Data AI Platform
Feb 8, 2023 · Big Data

How Alibaba Cloud EMR 2.0 Redefines Open‑Source Big Data Platforms

This article summarizes Alibaba Cloud senior product expert He Yuan's presentation on EMR 2.0, outlining the challenges of open‑source big data, the evolution of EMR, and the new features—including cloud‑native architecture, enhanced performance, diverse resource models, and expanded analysis scenarios—aimed at reducing cost and complexity.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 11 min read
Youzan Coder
Feb 7, 2023 · Big Data

Automated Offline Data Cost Optimization in Youzan's Data Platform

Youzan built an automated offline data cost‑optimization platform that gathers accurate metadata, mines unused or failing tables and tasks, and safely decommissions them through a backend‑frontend workflow with owner validation, notifications, rollback safeguards, and plans to extend lineage coverage and real‑time asset handling.

Big Data · Cost reduction · Data Governance
0 likes · 11 min read
Data Thinking Notes
Feb 6, 2023 · Big Data

How Tencent Tackles Data Governance Challenges with the WeData Platform

This article outlines Tencent's data governance challenges, its internal three‑stage practice, detailed case studies such as Tencent News and PCG cost governance, and introduces the WeData platform's architecture and tools for standardization, quality, security, and metadata management, concluding with a Q&A session.

Big Data · Data Governance · Data Platform
0 likes · 17 min read
Big Data Technology & Architecture
Feb 6, 2023 · Big Data

Real-Time Data Warehouse Solutions with Hudi: Scenarios, Challenges, and Optimizations

This article presents an in‑depth overview of real‑time data‑warehouse scenarios, discusses challenges such as timeliness, update efficiency, and resource consumption, and details practical solutions using Apache Hudi, Flink, Presto, and related optimizations for ingestion, indexing, compaction, and query performance.

Big Data · Data Lake · Flink
0 likes · 17 min read
Big Data Technology & Architecture
Feb 4, 2023 · Big Data

Apache Linkis Graduates to Top-Level Project – Overview, Core Features, Roadmap, and Ecosystem

The article announces Apache Linkis’s graduation to an Apache top‑level project, explains its role as a computing middleware linking applications to engines like Spark, Hive, and Flink, details its core capabilities, roadmap, ecosystem integrations, and provides official resources for the community.

Apache · Big Data · Computing Middleware
0 likes · 8 min read
DataFunTalk
Feb 4, 2023 · Big Data

Design and Practice of Tencent Lighthouse Fusion Analysis Engine

This article presents the design and implementation of Tencent Lighthouse's Fusion Analysis Engine, covering its background, challenges, fusion architecture, kernel optimizations, acceleration techniques, practical outcomes, and future evolution directions for high‑performance data access.

Big Data · Fusion Engine · Lighthouse
0 likes · 12 min read
Kuaishou Big Data
Feb 3, 2023 · Big Data

Inside Kuaishou’s Company‑Wide Metric Platform: Architecture, Lessons & Best Practices

This article details Kuaishou’s three‑year evolution of its metric middle platform, covering the data infrastructure, key challenges of data inconsistency and low analysis efficiency, the enterprise‑level OneMetric solution, architectural design, development phases, practical lessons, system implementation, and real‑world applications.

Big Data · Kuaishou · data engineering
0 likes · 23 min read
DataFunTalk
Feb 2, 2023 · Big Data

SeaTunnel: Design Goals, Current Status, Architecture, and Future Roadmap

This article provides a comprehensive overview of Apache SeaTunnel, covering its design objectives, current capabilities such as multi‑engine support and extensive connector ecosystem, detailed architecture including engine‑independent APIs and execution flows, and outlines the upcoming roadmap to expand connectors, launch a visual web UI, and introduce a dedicated SeaTunnel Engine.

Apache · Batch Processing · Big Data
0 likes · 12 min read
DataFunTalk
Jan 31, 2023 · Big Data

Tencent's Data Governance Practices and Technical Implementation

This article presents Tencent's comprehensive data governance framework, covering its definition, objectives, challenges, methodology, organizational structure, metadata management, data asset lifecycle, security measures, and technical implementation details such as microservice architecture, data collection, lineage analysis, and storage solutions.

Big Data · Data Governance · Metadata Management
0 likes · 19 min read
DataFunTalk
Jan 31, 2023 · Big Data

SPI Refactoring Practice in Apache InLong Manager to Reduce Maintenance Cost and Enhance Extensibility

This article presents the SPI-based refactoring of Apache InLong Manager, describing the project's background, existing maintenance challenges, the concept of Java Service Provider Interface, the concrete implementation steps, code restructuring, and the resulting benefits such as higher code reuse, easier extension, and reduced DDL changes.

Apache InLong · Big Data · Code Refactoring
0 likes · 10 min read
Bilibili Tech
Jan 31, 2023 · Big Data

Design and Optimization of Real-Time Data Quality Control (DQC) Platform on Bilibili's Big Data System

Bilibili redesigned its real-time data-quality control platform by replacing per-rule Flink jobs with a unified, dynamically-configured architecture that classifies Kafka topics, aggregates via InfluxDB full-table and continuous queries, mitigates data inflation, adds a high-performance proxy, and implements robust monitoring and recovery to ensure scalable, reliable data quality for its big-data services.

Big Data · DQC · Flink
0 likes · 22 min read
DataFunTalk
Jan 30, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

The article explains why data governance is essential for high‑quality data in big‑data organizations, outlines narrow and broad governance scopes, presents strategic principles, and shares eight detailed case studies from leading Chinese tech companies illustrating practical implementation and lessons learned.

Big Data · Data Governance
0 likes · 7 min read
Data Thinking Notes
Jan 29, 2023 · Big Data

How to Turn Data Assets into Business Value: A Roadmap for Enterprises

Enterprises must shift their perception of data assets and embed data‑value into every digital process, establishing governance, unified asset catalogs, operational metrics, security controls, integration, services, and visualization to transform raw data into strategic business outcomes.

Big Data · Data Governance · Data Integration
0 likes · 12 min read
DataFunSummit
Jan 29, 2023 · Big Data

Data Serviceization at JD: From Zero to One and Beyond

This article presents JD's data service platform, describing its origin, performance optimizations, flexible API generation, caching strategies, service orchestration, and governance, and includes a Q&A that addresses security, performance, and multi‑source data handling challenges.

API · Big Data · Data Service
0 likes · 11 min read
DataFunTalk
Jan 28, 2023 · Big Data

Data Lake vs Data Warehouse: Differences, Evolution, and Integrated Lakehouse Design

This article explores the ongoing debate between data lakes and data warehouses, clarifies their distinct purposes and technologies, discusses how they can coexist or complement each other, and introduces the concept of an integrated lakehouse architecture while promoting a comprehensive data intelligence knowledge map.

Big Data · Data Lake · Lakehouse
0 likes · 5 min read
DataFunSummit
Jan 27, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Case Studies

The article explains the importance of data governance, distinguishes narrow and broad governance, outlines strategic principles such as systemic engineering and prioritization, and presents eight case studies from leading Chinese tech companies illustrating practical implementations and effective strategies.

Big Data · Data Governance · Data Management
0 likes · 8 min read
Tencent Cloud Developer
Jan 26, 2023 · Operations

Technical Article Digest: Operations, AI, Web3, Rust, Big Data, and More

This technical digest surveys Tencent’s health‑code operations architecture, dissects ChatGPT’s training pipeline, contrasts Web 2.0 and Web 3.0 on Ethereum, explains AI‑generated art, details WeChat’s overload controls and QQ Music’s high‑availability design, examines the rapid scaling of the “Sheep Sheep” mini‑game, introduces Rust for front‑end developers, showcases big‑data football prediction models, and outlines common C++ pitfalls and best‑practice recommendations.

Big Data · C++ · Rust
0 likes · 7 min read
DataFunTalk
Jan 26, 2023 · Big Data

Tencent Data Governance Practices and the WeData Platform

This article outlines Tencent's data governance challenges, internal practices across three maturity stages, and introduces the WeData platform that provides comprehensive capabilities for data assetization, cost control, quality assurance, security, and metadata management to support large‑scale big‑data operations.

Big Data · Data Governance · Tencent
0 likes · 15 min read
DataFunTalk
Jan 26, 2023 · Big Data

Data Governance Strategies: Principles, Practices, and Real‑World Case Studies

This article explains why data is a company's most valuable asset, distinguishes narrow and broad data‑governance approaches, outlines strategic design principles, and presents eight detailed case studies from leading Chinese tech firms illustrating practical governance implementations and lessons learned.

Big Data · Data Governance
0 likes · 8 min read
DataFunSummit
Jan 23, 2023 · Big Data

Design and Practice of the 58 Agile BI System (Starfire)

This article presents a comprehensive overview of the 58 Agile BI platform called Starfire, covering its background, technical architecture, core permission and query engine challenges, MPP cache acceleration, visualization resource library, developer services, and future development directions.

Architecture · BI · Big Data
0 likes · 13 min read
DataFunSummit
Jan 22, 2023 · Big Data

Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned

This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.

Big Data · Cluster Migration · Deployment Modes
0 likes · 16 min read
DataFunSummit
Jan 21, 2023 · Big Data

Building and Evolving Data Management Systems: From IT to DT Era, Standards, Models, and Marketization

This article outlines the evolution of data management in the big‑data era, covering the history of the industry, key governance frameworks such as DMBOK, DCMM and DMM, the steps to construct a data‑management system, the requirements for a data‑factor market, and an introduction to the DataEasy company and its services.

Big Data · DCMM · DMBOK
0 likes · 15 min read
DataFunTalk
Jan 20, 2023 · Big Data

Introduction to Flink CDC: Incremental Snapshot Algorithm and Framework

This article introduces Flink CDC, explains its incremental snapshot algorithm and the 2.0 framework design, compares it with traditional CDC pipelines, discusses the core API and dialect concept, and outlines community growth and future plans, providing a comprehensive technical overview for data engineers.

Apache Flink · Big Data · Change Data Capture
0 likes · 13 min read
DataFunTalk
Jan 19, 2023 · Big Data

Tencent Alluxio: Accelerating the Next Generation of Big Data and AI

This article presents a comprehensive overview of Tencent's Alluxio project, covering the evolution of big‑data architecture, recent Alluxio research progress, typical deployment cases, and future work, while highlighting performance improvements, integration with cloud and AI workloads, and community contributions.

AI · Alluxio · Big Data
0 likes · 21 min read
NetEase Cloud Music Tech Team
Jan 17, 2023 · Big Data

How NetEase Cloud Music Cut Data Pipeline Delays by 60% with Full‑Link Baseline Governance

This case study details NetEase Cloud Music's full‑link baseline governance initiative, outlining the challenges of massive data pipelines, the metrics used to measure success, the three‑pronged action plan (infrastructure, task optimization, and standards), and the resulting improvements in availability, resource utilization, and monitoring accuracy.

Big Data · Resource Optimization · baseline governance
0 likes · 11 min read
Huolala Tech
Jan 16, 2023 · Big Data

How Leading Logistics Companies Master Data Governance for Cost and Stability

At the 2022 DataFun Summit, data governance experts from Huolala, Zhongtong, and SF Express shared comprehensive practices—including governance drivers, quality monitoring, model management, master data processes, platform architecture, cost control, and stability measures—illustrating how large logistics firms implement end‑to‑end data governance to boost efficiency, compliance, and business value.

Big Data · Cost Management · Data Governance
0 likes · 13 min read
JD Tech
Jan 13, 2023 · Big Data

UData: Solving the Last Mile of Data Usage – Architecture, Query Engine Design, and Federated Query Enhancements

This article introduces the UData platform, explains its data‑integration architecture, details the StarRocks‑based query engine workflow from SQL parsing to distributed execution, and describes recent optimizations such as computation push‑down, support for JSF/HTTP/ClickHouse external tables, and a proxy‑based federated query framework.

Big Data · Data Integration · Query Engine
0 likes · 20 min read
DataFunSummit
Jan 12, 2023 · Big Data

Industrial IoT Data Collection Platform: Neuron v2.0 Architecture, Design, and Case Studies

This article presents a comprehensive overview of EMQ's Neuron industrial IoT data collection platform, detailing the lessons learned from version 1.x, the redesigned v2.0 architecture, core modules, plugin mechanisms, data‑tag management, eKuiper integration, and two real‑world case studies in oil‑field and smart‑factory environments.

Big Data · IoT · data collection
0 likes · 16 min read
Ctrip Technology
Jan 12, 2023 · Big Data

Evolution of Ctrip's Log System: From Elasticsearch to ClickHouse and Log 3.0

This article details the evolution of Ctrip's log infrastructure, describing the shift from fragmented departmental logging to a unified Elasticsearch-based platform, the migration to ClickHouse for cost‑effective, high‑performance storage, and the subsequent Log 3.0 redesign that leverages Kubernetes, sharding, and a unified query governance layer to handle petabyte‑scale data.

Big Data · ClickHouse · Cloud Native
0 likes · 16 min read
Alibaba Cloud Big Data AI Platform
Jan 12, 2023 · Operations

What Is DataOps and How Can It Transform Your Data Management?

DataOps, the data‑centric counterpart of DevOps, combines agile principles, standardized tools, and cross‑team collaboration to manage the full data lifecycle—from integration and development to storage, governance, and service—enabling organizations to handle massive, diverse datasets efficiently, reduce silos, and turn data into actionable value.

Big Data · Data Governance · Data Integration
0 likes · 15 min read
vivo Internet Technology
Jan 11, 2023 · Cloud Native

Practices of Distributed Message Middleware at vivo: From RocketMQ to Kafka and Pulsar

vivo’s Internet Storage team details how it operates RocketMQ for low‑latency online services and Kafka for massive big‑data pipelines, outlines resource isolation, traffic balancing, intelligent throttling, and governance practices, and describes its migration from RabbitMQ and planned shift from Kafka to cloud‑native Pulsar.

Big Data · Cloud Native · Kafka
0 likes · 22 min read
Data Thinking Notes
Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Big Data · Data Quality · Root Cause Analysis
0 likes · 21 min read
dbaplus Community
Jan 10, 2023 · Big Data

Choosing the Right OLAP Engine: Druid vs ClickHouse and Optimization Tips

This article introduces OLAP concepts, compares major OLAP solutions such as Druid, Kylin, Doris, and ClickHouse, outlines their features and suitable scenarios, and shares practical optimization techniques—including materialized views, caching, node tiering, and query tuning—to improve performance for high‑concurrency analytical workloads.

Big Data · ClickHouse · Druid
0 likes · 16 min read
DataFunSummit
Jan 10, 2023 · Big Data

Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans

This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering its architectural overview, key features such as Git‑style data management, real‑time processing, acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.

Big Data · Data Lake · Flink
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Jan 10, 2023 · Big Data

How Alibaba’s Dolphin Engine Uses Flink + Hologres for Real‑Time Big Data

The Dolphin engine, built by Alibaba’s Data Engine team, combines Flink and Hologres to deliver ultra‑large‑scale OLAP, streaming, batch, and AI capabilities for real‑time advertising analytics, offering smart materialization, intelligent indexing, and vector recall while supporting millions of advertisers and petabyte‑level data.

AI · Big Data · Flink
0 likes · 13 min read
DataFunSummit
Jan 9, 2023 · Big Data

JD Data‑Driven Business Development: Building a Business Metric Data System and Marketplace Governance

The article outlines JD's data‑driven business development strategy, describing the current challenges of its business data marketplace, the governance framework—including layered architecture, standardization, ClickHouse dictionary refresh, and optimization measures—and the resulting performance improvements and future outlook.

Big Data · ClickHouse · Data Governance
0 likes · 13 min read
DataFunTalk
Jan 8, 2023 · Big Data

ByteDance Event‑Tracking Data Cost Governance Practices

This article describes ByteDance's comprehensive approach to managing the massive volume of event‑tracking data, detailing the background, cost‑reduction strategies, experience review, future plans, and a Q&A session that together illustrate how systematic data governance can dramatically cut storage and processing expenses.

Big Data · ByteDance · Data Governance
0 likes · 18 min read
DataFunSummit
Jan 7, 2023 · Big Data

Redefining the Customer Data Platform (CDP) for New Energy Vehicle Companies

This article explores why the automotive industry's shift to new energy vehicles necessitates a redefinition of the Customer Data Platform (CDP), detailing the changing traffic structure, varied departmental demands, CDP typologies, implementation strategies, and the benefits of a unified, extensible CDP architecture for marketing, sales, and after‑sales.

Big Data · CDP · Data Platform
0 likes · 13 min read
Data Thinking Notes
Jan 5, 2023 · Big Data

Why Data Lakes Are Outshining Traditional Data Warehouses: A Deep Dive

This comprehensive guide explains the evolution from traditional data warehouses to modern data lakes, detailing concepts, architectures, differences, implementation steps, and real‑world case studies, while also comparing major cloud providers' solutions and highlighting how data platforms support digital transformation and analytics.

Analytics · Big Data · Data Lake
0 likes · 97 min read
StarRing Big Data Open Lab
Jan 4, 2023 · Big Data

Choosing the Right Data Architecture: Warehouse, Mart, or Lake?

Understanding enterprise data platforms requires grasping the differences between data warehouses, data marts, and data lakes, their architectures, use cases, and key capabilities such as integration, real‑time processing, governance, and cost control, to guide organizations in building scalable, flexible data solutions.

Big Data · Data Mart
0 likes · 15 min read
DataFunSummit
Jan 4, 2023 · Big Data

Data Intelligence Expert Interview – Maturity, Trends, and Practices of Data Middle Platforms

The interview gathers insights from data‑platform experts on the maturity stages, technology trends, implementation methodologies, open‑source ecosystems, system architectures, governance, security, and assessment criteria of modern data middle platforms, offering a comprehensive guide for practitioners.

Big Data · Data Governance · Data Observability
0 likes · 28 min read
Data Thinking Notes
Jan 3, 2023 · Big Data

How a Scalable Data Service Platform Transforms Big Data into APIs

This article outlines the design and implementation of a unified data service platform that standardizes data access, accelerates model processing, provides flexible API construction, and ensures high availability through gateway, caching, and monitoring, ultimately reducing cost and improving efficiency for both C‑end and B‑end applications.

Big Data · Data Platform · Service Architecture
0 likes · 25 min read
Tencent Cloud Developer
Jan 3, 2023 · Big Data

How Tencent’s Cloud‑Native Lakehouse Tackles PB‑Scale Performance Challenges

This article analyzes Tencent Cloud’s DLC lakehouse solution, explaining the unified data lake‑warehouse architecture, the performance hurdles of object‑storage‑based analytics, and the multi‑dimensional caching, virtual‑cluster elasticity, and advanced filter techniques that enable second‑level analysis on petabyte‑scale data while reducing costs.

Big Data · DLC · Lakehouse
0 likes · 13 min read
ITPUB
Jan 3, 2023 · Databases

How DragonF MPP DB Redefines Cloud‑Native Data Warehousing at Massive Scale

The article details the design, core features, and real‑world performance of the DragonF MPP DB, a cloud‑native, compute‑storage‑separated database that overcomes traditional MPP limitations, supports millions of daily jobs, and outlines its future roadmap for ultra‑large‑scale data platforms.

Big Data · Cloud Native · MPP
0 likes · 11 min read
Big Data Technology & Architecture
Jan 3, 2023 · Big Data

Migrating Hive SQL Jobs to Flink Using the SQL Gateway

This article explains how to use Apache Flink 1.16's SQL Gateway to migrate Hive SQL tasks to Flink, covering the underlying Hive‑on‑Flink architecture, dialect compatibility, streaming and batch demos, configuration details, and practical tips for developers and platform engineers.

Batch Processing · Big Data · Flink
0 likes · 19 min read
DataFunTalk
Jan 3, 2023 · Big Data

Tencent Unified Big Data Scheduling Platform – Architecture, Design, and Operations

The article presents an in‑depth overview of Tencent's self‑developed Unified Scheduling Platform, detailing its system architecture, design challenges, performance optimizations, resource‑fair scheduling mechanisms, operational metrics, future roadmap, and a Q&A session that together illustrate how the platform enables massive offline data processing at scale.

Big Data · Distributed Systems · Performance Optimization
0 likes · 18 min read
Code Ape Tech Column
Jan 3, 2023 · Big Data

Elasticsearch vs ClickHouse: Performance, Cost, and Deployment Guide

This article compares Elasticsearch and ClickHouse in terms of write throughput, query speed, and server cost, then provides a step‑by‑step deployment guide for a private data pipeline using Zookeeper, Kafka, FileBeat, and ClickHouse, along with common issues and their solutions.

Big Data · ClickHouse · Deployment
0 likes · 15 min read