Tagged articles
3675 articles
Data Thinking Notes
Oct 22, 2023 · Big Data

Boosting Big Data Governance Capabilities for Digital Transformation

This article outlines how enterprises can enhance their big data governance capabilities during digital transformation, covering the background and challenges of data governance, the emergence of data capability as a core competency with implementation paths, and practical suggestions for governance projects, illustrated with national-level examples.

Big Data · Data Governance · Digital Transformation
0 likes · 3 min read
DataFunSummit
Oct 22, 2023 · Big Data

How Kuaishou E‑commerce Leverages OLAP and a Unified Data Architecture to Solve Business Data Challenges

This article explains how Kuaishou's e‑commerce team built a unified OLAP‑based data platform—covering data ingestion, consistent dimensional and fact layers, metric management, and real‑time services—to address rapid growth, metric inconsistency, and operational inefficiencies across multiple business scenarios.

Big Data · Data Architecture · E‑commerce
0 likes · 20 min read
DataFunTalk
Oct 22, 2023 · Operations

Bilibili Data Quality Assurance System: Architecture, Practices, and Case Study

This article presents Bilibili's data quality assurance system, detailing its evolution across four data platform stages, the multi‑layer architecture, core capabilities such as a quality data warehouse, digital‑driven continuous optimization, and efficient incident handling, and concludes with a real‑world case study and future outlook.

Big Data · Quality assurance · data-warehouse
0 likes · 21 min read
dbaplus Community
Oct 18, 2023 · Databases

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.

Big Data · ClickHouse · Performance
0 likes · 15 min read
DataFunSummit
Oct 18, 2023 · Big Data

Kuaishou Data Lake Construction with Apache Hudi: Architecture, Challenges, and Solutions

This article explains why Kuaishou built a data lake, outlines the shortcomings of its previous Lambda architecture, describes the adoption of Apache Hudi for unified batch‑stream processing, and details the five major technical challenges and the corresponding solutions implemented to improve performance, consistency, and operational reliability.

Apache Hudi · Big Data · Data Architecture
0 likes · 17 min read
DataFunSummit
Oct 16, 2023 · Big Data

Bilibili's Iceberg‑Based Lakehouse Platform: Technical Practices for Sub‑Second Query Response

This article details Bilibili's implementation of an Iceberg‑based lakehouse platform that unifies storage and analytics, addressing Hive’s performance and latency issues through multidimensional sorting, various file‑level indexes, cube pre‑aggregation, star‑tree structures, and an automated Magnus service for intelligent optimization, achieving near‑second query responses.

Big Data · Iceberg · Lakehouse
0 likes · 14 min read
DataFunSummit
Oct 16, 2023 · Big Data

Elegant Dimensional Modeling and Multi‑Dimensional Analysis Design Practice

In this presentation, Qiu Shengchang shares his 13‑year experience designing elegant data‑warehouse architectures, detailing a highly generic dimensional model, extreme partitioned tables, and a universal multi‑dimensional analysis framework that enables rapid, comprehensive reporting on massive datasets.

Big Data · Multi-dimensional Analysis · data-warehouse
0 likes · 3 min read
DataFunSummit
Oct 15, 2023 · Big Data

Construction and Architecture of JD One-Service Data Service System

This article details JD's three‑stage evolution of its data service platform, explains thematic (topic‑based) data services, introduces the One‑Service unified architecture, and outlines future plans for standardization, low‑code front‑end, and operational improvements.

Big Data · Data Platform · Data Service
0 likes · 13 min read
dbaplus Community
Oct 14, 2023 · Big Data

What Is a Data Warehouse? From Basics to Modern Practices

This article explains what a data warehouse is, contrasts it with traditional databases, outlines the evolution from classic to internet‑scale warehouses, details modeling approaches and layered architectures, discusses KPI dictionaries, date dimensions, naming standards, data governance, incremental loading techniques, and upstream/downstream coordination.

Big Data · Data Governance · ETL
0 likes · 25 min read
DataFunSummit
Oct 13, 2023 · Big Data

Practical Experience of Flink on Kubernetes at Kuaishou

This article presents Kuaishou's comprehensive journey of adopting Flink on Kubernetes, covering its background, evolution, architecture, production migration, observability, testing, and future plans, and demonstrates how large‑scale streaming workloads are transformed to a cloud‑native environment.

Big Data · Flink · Kubernetes
0 likes · 14 min read
DataFunTalk
Oct 13, 2023 · Big Data

Design Principles, Architecture, and Applications of the Open‑Source LakeSoul Lakehouse Framework

This article provides a comprehensive technical overview of LakeSoul, an open‑source, cloud‑native lakehouse framework, covering its design philosophy, core features, architecture, performance benchmarks, real‑time ingestion, incremental computation, multi‑stream joining, security, community progress, and future roadmap.

Big Data · Data Lakehouse · Flink
0 likes · 16 min read
Data Thinking Notes
Oct 11, 2023 · Big Data

How ByteDance Optimized Its E‑Commerce Data Lake to Cut Costs and Boost Real‑Time Accuracy

ByteDance revamped its traditional Lambda architecture for e‑commerce traffic data by introducing a new lake ingestion solution that reduces development and operational costs and keeps data timely and stable. The article covers the business background, ODS lake design, archiving tags, delayed data handling, real‑time stability, and future plans.

Big Data · Data Lake · Flink
0 likes · 7 min read
政采云技术
Oct 10, 2023 · Artificial Intelligence

Predicting Membership Purchase with Logistic Regression: Feature Engineering, Model Training, Evaluation, and Deployment

This article presents a complete workflow for predicting whether users will purchase a membership using logistic regression, covering data collection, feature selection, handling imbalanced samples, model training, hyper‑parameter tuning, threshold optimization, evaluation metrics such as accuracy, precision, recall, AUC, lift, and finally deployment on a big‑data platform with PySpark.

Big Data · Model Evaluation · feature engineering
0 likes · 17 min read
Alibaba Cloud Big Data AI Platform
Oct 9, 2023 · Big Data

How We Cut MaxCompute Costs Using Information Schema Insights

This article details how a fast‑growing HR SaaS company analyzed MaxCompute billing spikes, identified five key cost drivers, leveraged tenant‑level Information Schema to extract task metadata, applied SQL‑based cost formulas, and implemented targeted optimizations that stabilized their cloud data‑processing expenses.

Big Data · Cost Optimization · Information Schema
0 likes · 10 min read
MaGe Linux Operations
Oct 8, 2023 · Big Data

Understanding Kafka: Core Concepts, Architecture, and Performance Secrets

This article explains Kafka’s fundamental role as a message system, detailing topics, partitions, producers, consumers, replica management, consumer groups, the controller, Zookeeper coordination, and performance optimizations such as sequential writes, zero‑copy, log segmentation, and network design, providing a comprehensive overview for big‑data practitioners.

Big Data · Distributed Systems · Kafka
0 likes · 11 min read
DataFunTalk
Oct 8, 2023 · Big Data

Full-Process DataOps Practices for Large-Scale Business Data Reporting at Baidu

This article reveals how Baidu implements end‑to‑end DataOps for its commercial data products, covering challenges of massive report generation, the design of a layered data architecture, platform‑wide automation, serverless deployment, risk control, monitoring, and optimization to achieve scalable, reliable data pipelines.

Big Data · DataOps · Serverless
0 likes · 13 min read
Efficient Ops
Oct 7, 2023 · Big Data

Master Kafka Basics: Topics, Partitions, Producers, and Cluster Architecture

This article explains Kafka's role as a messaging system, covering core concepts such as topics, partitions, producers, consumers, messages, cluster architecture, replicas, consumer groups, controller coordination with Zookeeper, and performance optimizations like sequential writes and zero‑copy networking.

Big Data · Distributed Systems · Kafka
0 likes · 11 min read
DataFunTalk
Oct 7, 2023 · Big Data

Alibaba DataWorks Data Stability Governance: Challenges, Solutions, and Practices

This article presents Alibaba's experience in addressing large‑scale data stability challenges by outlining common problems, governance principles, baseline monitoring, team collaboration methods, practical implementations, and proactive measures to ensure reliable and accurate data production on the DataWorks platform.

Alibaba · Big Data · Data Governance
0 likes · 12 min read
Efficient Ops
Oct 6, 2023 · Operations

How China Post’s Next‑Gen IT Monitoring Platform Drives Smart Operations

The article details China Post’s new generation IT infrastructure intelligent operation monitoring platform, highlighting its architecture, data collection, stream‑batch processing, AI‑driven algorithms, and one‑stop portal, and explains how the solution exemplifies cutting‑edge digital transformation practices showcased at the 2023 China International Service Trade Fair.

AI · Big Data · Digital Transformation
0 likes · 9 min read
DataFunTalk
Oct 5, 2023 · Big Data

Building a Unified Streaming‑Batch Lakehouse with Amoro Mixed Iceberg

This article describes how Shanghai Steel Union leveraged Amoro Mixed Iceberg on top of Apache Iceberg to create a unified streaming‑batch lakehouse, addressing small‑file and upsert challenges, simplifying architecture, improving data freshness, and providing a scalable solution for real‑time and batch analytics.

Amoro · Apache Iceberg · Big Data
0 likes · 13 min read
ITPUB
Oct 4, 2023 · Backend Development

How to Speed Up Slow Elasticsearch Aggregations with execution_hint "map"

In a high‑traffic e‑commerce system where sharding makes cross‑shop queries inefficient, adding terms aggregations in Elasticsearch pushed query times to dozens of seconds; setting the "execution_hint":"map" option on the aggregation dramatically reduces that latency.

Big Data · Elasticsearch · Performance Optimization
0 likes · 7 min read
DataFunTalk
Oct 4, 2023 · Big Data

Understanding Power Law Distributions in Content Ecosystems: Data Science Insights and Applications

This article explores how data scientists at Tencent analyze and model the shape of data in content ecosystems, focusing on normal and power‑law distributions, their prevalence, theoretical mechanisms, practical implications for traffic and compensation strategies, and methods such as integer programming, graph analysis, and causal inference.

Big Data · Power Law · Statistical Distribution
0 likes · 19 min read
DataFunSummit
Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big Data · Data Lake · Flink
0 likes · 22 min read
DataFunTalk
Sep 30, 2023 · Big Data

Building a Marketing‑Oriented Data Middle Platform: Concepts and Practices

This article outlines how a marketing‑focused data middle platform can be constructed by integrating online and offline behavior data, business data, and third‑party sources, then applying data integration, modeling, processing, and application capabilities to enable data‑driven user journeys and personalized marketing strategies.

Big Data · Data Integration · data modeling
0 likes · 13 min read
ITPUB
Sep 29, 2023 · Big Data

How Vivo Scaled Hive Metastore Using TiDB: A Deep Dive into Big Data Metadata

This article recounts Vivo’s journey to horizontally scale its Hive Metastore service by evaluating MySQL sharding, the open‑source Waggle‑Dance gateway, and ultimately selecting TiDB, detailing the migration process, configuration tweaks, performance benchmarks, encountered issues such as primary‑key conflicts, index choices, memory spikes, and the solutions implemented to ensure stable, high‑performance metadata storage for massive data volumes.

Big Data · Hive Metastore · Performance Optimization
0 likes · 22 min read
DataFunSummit
Sep 28, 2023 · Big Data

Real‑time Risk Control Practices at NetEase Games Using Apache Flink

The article details NetEase Games' challenges in payment‑environment risk control and explains how they transformed a T+1 batch workflow into a fully real‑time risk‑control system with Apache Flink, describing the platform architecture, data modeling, session windows, joins, and future development plans.

Big Data · Flink · Real-time Risk Control
0 likes · 19 min read
vivo Internet Technology
Sep 27, 2023 · Big Data

Horizontal Scaling of Hive Metastore Service at Vivo: Evaluation, TiDB Migration, and Lessons Learned

Vivo’s big‑data team horizontally scaled its Hive Metastore by evaluating MySQL sharding (Waggle‑Dance) against a TiDB migration, ultimately adopting TiDB, which after a synchronized cut‑over delivered ~15% faster queries, 80% DDL latency reduction, linear scaling, low resource use, and valuable operational lessons.

Big Data · Hive Metastore · TiDB
0 likes · 19 min read
DataFunTalk
Sep 25, 2023 · Big Data

Tag System Construction Practice at 58: Pain Points, Solutions, Architecture, and Management Platform

This article details the practical implementation of a tag system at 58, covering business stages that require tagging, common challenges and solutions, a three‑layer architecture, lifecycle management, evaluation metrics, and a unified tag management platform to support scalable, scenario‑driven data products.

Big Data · Label Architecture · Tag Management
0 likes · 17 min read
Huolala Tech
Sep 21, 2023 · Big Data

How We Built a Scalable Data Migration Framework for Billions of Transactions

This article details the design and implementation of a custom, high‑throughput data migration framework that handles petabyte‑scale transaction data, supports heterogeneous source/target schemas, ensures zero‑downtime operation, and provides robust scheduling, checkpointing, and fault‑tolerance mechanisms.

Big Data · Data Migration · Distributed Systems
0 likes · 17 min read
Architect
Sep 19, 2023 · Big Data

How Tianyan Beats ELK: Inside a High‑Performance Distributed Log Service

This article analyzes the challenges of logging in distributed services, compares the traditional ELK stack with Baidu's Tianyan platform, and details Tianyan's architecture, data collection, high‑throughput transmission, storage, retrieval, resource isolation, dynamic cleanup, and best‑practice recommendations, complete with code examples and performance insights.

Big Data · Distributed Systems · ELK
0 likes · 30 min read
DataFunTalk
Sep 16, 2023 · Big Data

StarRocks Data Lake Analysis, Materialized Views, and Lakehouse Architecture

This article explains how StarRocks 3.0 extends real‑time data‑warehouse capabilities to support data‑lake analysis, external catalog integration, Trino compatibility, extensive I/O optimizations, and powerful materialized‑view features that together enable a unified, cloud‑native Lakehouse solution with high performance and flexible resource isolation.

Big Data · Data Lake · Lakehouse
0 likes · 20 min read
Bilibili Tech
Sep 15, 2023 · Big Data

Introducing Bilibili's SQLScan: Architecture, Key Technologies, and Production Impact

Bilibili's SQLScan is a static‑code analysis tool that parses Hive, Spark, Presto and Flink SQL via Antlr4, builds a unified AST, applies engine‑specific metadata plugins for rule enforcement, provides field‑lineage and cost‑analysis services, and has processed hundreds of thousands of daily queries, intercepting thousands of problematic statements to improve data quality and operational efficiency.

Big Data · Bilibili · Data Lineage
0 likes · 11 min read
Programmer DD
Sep 15, 2023 · Big Data

How Alluxio Manages Massive Metadata: Inode, Block, MountTable, and Worker Insights

This article examines Alluxio's open-source distributed file system, detailing the core types of metadata—inode, block, mount table, and worker—along with the mechanisms for their storage, management, and optimization in both HEAP and ROCKS modes, and provides practical configuration guidance for scaling large-scale data environments.

Alluxio · Big Data · Distributed File System
0 likes · 15 min read
DataFunTalk
Sep 13, 2023 · Big Data

Design and Implementation of a Lakehouse Data Platform Based on Apache Hudi at Taikang Life Insurance

This article details Taikang Life Insurance's end‑to‑end technical selection, architecture design, implementation, and custom enhancements of an Apache Hudi‑driven lakehouse platform for large‑scale health‑insurance data, covering background, component evaluation, performance benchmarking, multi‑layer architecture, and real‑world results.

Apache Hudi · Big Data · Data Governance
0 likes · 44 min read
DataFunTalk
Sep 12, 2023 · Big Data

Building an Intelligent Data Governance Platform at NetEase Cloud Music: Architecture, Practices, and Future Plans

This article presents a comprehensive case study of NetEase Cloud Music’s metadata‑driven intelligent governance platform, detailing its scale, construction background, modular architecture, rule‑based automation, practical deployment, and future roadmap for sustainable data ecosystem management.

Big Data · Data Governance · automation
0 likes · 22 min read
DataFunTalk
Sep 10, 2023 · Big Data

Ping An Life Insurance’s Data Middle Platform Construction Practice

The presentation details Ping An Life’s four‑stage data middle‑platform initiative—defining data capability as the foundation of digital transformation, outlining the platform’s architecture and governance, showcasing business‑value applications, and discussing talent and cultural considerations—to illustrate how a large insurer builds a scalable, real‑time data ecosystem.

Big Data · Data Governance · Digital Transformation
0 likes · 9 min read
DataFunTalk
Sep 9, 2023 · Big Data

Presto + Tencent DOP (Alluxio) Architecture and Optimization Practices for Financial OLAP

This article presents the practical implementation of Presto combined with Tencent DOP (Alluxio) in a financial OLAP scenario, detailing background and architectural evolution, the Presto‑Alluxio design, optimization techniques for caching, storage scalability, ORC handling, and performance results, followed by conclusions and future directions.

Alluxio · Big Data · OLAP
0 likes · 15 min read
21CTO
Sep 8, 2023 · Big Data

Why Real-Time Data Processing Is the Next Frontier for Data Engineers

Real-time data processing transforms traditional batch pipelines by delivering fresh, low‑latency data to millions of concurrent users, leveraging event‑driven architectures, streaming engines, and real‑time databases, with use cases ranging from fraud detection to personalized e‑commerce and operational dashboards, and includes reference architectures and tool recommendations.

Architecture · Big Data · Real-time Processing
0 likes · 16 min read
DataFunSummit
Sep 8, 2023 · Big Data

Tianqiong OLAP Real‑time Lakehouse Fusion Platform Architecture Practice

This article explains why lake‑warehouse fusion is needed, describes the challenges of integrating real‑time data warehouses with data lakes, introduces a new StarRocks‑based architecture that supports real‑time ingestion, cooling, offline loading, and adaptive hot‑cold query rewriting, and outlines future plans and Q&A.

Big Data · Data Integration · Lakehouse
0 likes · 21 min read
Huolala Tech
Sep 7, 2023 · Big Data

How Huolala Ensures Doris Stability: Real-World Big Data Practices

This article details Huolala's big‑data architecture and the practical measures—ranging from background analysis and stability challenges to case studies, discovery mechanisms, capacity planning, high‑availability, and automation—that the company employs to guarantee Doris's reliability and performance across its rapidly growing logistics platform.

Big Data · OLAP · capacity planning
0 likes · 15 min read
StarRocks
Sep 6, 2023 · Big Data

How Paimon + StarRocks Revolutionize Lakehouse Analytics

This article reviews traditional Lambda and Kappa data‑warehouse architectures, then details four Paimon‑StarRocks lakehouse solutions—including a data‑lake center, accelerated query with materialized views, hot‑cold data separation, and the JNI connector—while also outlining StarRocks’ future roadmap for lakehouse analytics.

Big Data · Lakehouse · Paimon
0 likes · 11 min read
Xiaohongshu Tech REDtech
Sep 6, 2023 · Databases

REDck: A Cloud‑Native Real‑Time OLAP Data Warehouse Built on ClickHouse

REDck is a cloud‑native, real‑time OLAP data warehouse built on ClickHouse that adds elastic compute and storage scaling, object‑storage optimizations, multi‑level caching, and exactly‑once ingestion, delivering petabyte‑scale interactive analytics with ten‑fold CPU efficiency, ten‑fold cost reduction, and 99.9% availability.

Big Data · ClickHouse · Real-time OLAP
0 likes · 21 min read
JD Retail Technology
Sep 4, 2023 · Big Data

JD Mini Program Data Center: Architecture, Milestones, and Real‑time Analytics Solutions

The article details the JD Mini Program platform, its data‑center development milestones, comprehensive business panorama, technical architecture, data collection, storage, and analysis pipelines—including Flink‑based real‑time monitoring, ClickHouse custom analytics, and Elasticsearch user‑behavior insights—while outlining current challenges and future AI‑driven enhancements.

Big Data · ClickHouse · Elasticsearch
0 likes · 16 min read
Data Thinking Notes
Sep 3, 2023 · Big Data

How to Build an Effective Data Governance Framework: Steps & Best Practices

This article outlines a comprehensive data governance framework for Chinese enterprises, covering organizational structures, data asset inventory, six‑stage methodology, and the creation of unified data standards and quality rules to support effective digital transformation and data‑driven decision making.

Big Data · Data Governance · Data Management
0 likes · 13 min read
dbaplus Community
Sep 3, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration

This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.

Batch-Stream Integration · Big Data · Flink
0 likes · 14 min read
DataFunTalk
Sep 3, 2023 · Big Data

Evolution of OLAP at Xingyun Retail Credit Using Apache Doris

This article details how Xingyun Retail Credit transitioned from traditional data warehouses to an Apache Doris‑based OLAP solution, covering data demand generation, OLAP engine selection challenges, multi‑stage implementation, performance optimizations, data‑warehouse construction, real‑world use cases, and future roadmap.

Apache Doris · Big Data · ETL
0 likes · 16 min read
DataFunSummit
Sep 2, 2023 · Big Data

Practical Experience of Bilibili's Big Data Cluster Mixed Deployment Architecture

This article details Bilibili's offline big‑data cluster challenges, the mixed‑deployment architecture that combines offline and online resources, the Amiya service's over‑commit and eviction mechanisms, performance optimizations, monitoring strategies, and future plans to further improve resource utilization and scheduling.

Amiya · Big Data · Bilibili
0 likes · 14 min read
DataFunTalk
Aug 30, 2023 · Big Data

Design and Implementation of Baidu Cloud Block Storage EC System for Large‑Scale Data

This article presents Baidu Cloud's block storage architecture, comparing replication and erasure‑coding fault‑tolerance methods, detailing the challenges of applying EC to mutable block data, and describing a two‑layer append‑engine solution with selective 3‑replica caching, cost‑benefit compaction, and performance optimizations for low‑cost, high‑throughput storage.

Big Data · Storage Architecture · append engine
0 likes · 14 min read
ByteDance Data Platform
Aug 30, 2023 · Big Data

How We Cut Offline Data Warehouse SLA Delay from 13 Days to Zero with DataLeap

The article details how the "Xingfu Li" real‑estate platform tackled a 13‑day offline data‑warehouse SLA delay by adopting Volcano Engine's DataLeap suite, outlining the challenges, the three‑step governance process, and the measurable improvements achieved across task coverage, alert reduction, and data stability.

Big Data · Data Governance · DataLeap
0 likes · 10 min read
JD Tech
Aug 30, 2023 · Databases

A Comprehensive Overview of Database Evolution, Types, and Data Structure Design Techniques

This article explains key database terminology, traces the history of database technologies, compares relational, NoSQL, NewSQL, OLTP/OLAP, columnar, time‑series and graph databases, and demonstrates practical data‑structure designs such as zipper tables, bit operations, bitmaps, bloom filters, and ring queues for software development.

Big Data · Data Structures · NoSQL
0 likes · 27 min read
Alibaba Cloud Big Data AI Platform
Aug 30, 2023 · Big Data

How Transaction Table2.0 Cuts Data Deduplication Costs by 98% in MaxCompute

This article explains how Renliji's data warehouse team leveraged MaxCompute's Transaction Table2.0 to dramatically reduce incremental data deduplication costs and execution time, while also introducing efficient small‑file merging, time‑travel queries, and future data‑sync strategies for a high‑growth HR SaaS platform.

Big Data · Cost Optimization · MaxCompute
0 likes · 11 min read
DataFunTalk
Aug 29, 2023 · Big Data

MaxCompute Incremental Update, Processing Architecture, and Intelligent Data Warehouse Optimizations

This article presents a comprehensive overview of MaxCompute's incremental update and processing architecture, the design of intelligent materialized views, and the engine's adaptive execution optimizations, detailing the integrated near‑real‑time and batch pipelines, transactional table 2.0, and practical Q&A.

Big Data · MaxCompute · data-warehouse
0 likes · 21 min read
DataFunSummit
Aug 28, 2023 · Big Data

Building Data Production Pipelines with DataOps: Concepts, Practices, and a Six‑Stage Workflow

This article introduces DataOps, outlines its background and the problems it addresses, describes NetEase’s big‑data product ecosystem, and details a six‑stage data production pipeline (coding, orchestration, testing, code review, release approval, and deployment), along with insights from two pipeline explorations.

Big Data · Data Quality · DataOps
0 likes · 15 min read
DataFunTalk
Aug 28, 2023 · Big Data

Practical Experience of an E‑commerce Platform’s Offline and Real‑time Data Warehouse

This article shares the practical architecture, technology selection, implementation details, and evolution of an e‑commerce platform’s offline and real‑time data warehouses, covering data modeling, processing pipelines, system components such as Hive, Spark, Flink, ClickHouse, Doris, and Hudi, and the lessons learned from multiple production deployments.

Big Data · ClickHouse · E‑commerce
0 likes · 18 min read
Volcano Engine Developer Services
Aug 25, 2023 · Cloud Native

How ByteDance Scaled with Multi‑Cloud: Lessons from Their Cloud‑Native Journey

ByteDance’s multi‑cloud evolution, driven by rapid business growth, cost control, and compliance needs, showcases a distributed cloud‑native platform built on open‑source orchestration, unified resource management, and advanced data‑lake solutions, while addressing operational complexity, interoperability, and emerging AI‑driven challenges.

AI · Big Data · Kubernetes
0 likes · 14 min read
iQIYI Technical Product Team
Aug 25, 2023 · Big Data

Venus Log Platform Architecture Evolution: From ELK to Data Lake

The Venus log platform at iQIYI migrated from an Elasticsearch‑Kibana architecture to an Iceberg‑based data lake with Trino, cutting storage and compute costs by over 70%, boosting stability by 85%, and efficiently supporting billions of daily logs in a write‑heavy, low‑query workload.

Big Data · Elasticsearch · Iceberg
0 likes · 22 min read
Tencent Cloud Developer
Aug 23, 2023 · Big Data

WeChat Experiment Platform: Architecture Design and Iceberg Lakehouse Optimization

The WeChat Experiment Platform migrated its data pipeline (60,000 metrics, 200,000 cores, more than 30 PB) to an Iceberg‑based lakehouse, leveraging three‑layer metadata, fine‑grained partitioning, MERGE INTO writes, time‑travel snapshots, and skew‑handling UDFs, which cut core time by 69%, saved ~100 PB of storage, and reduced latency by up to 70%.

Big Data · Iceberg · Lakehouse
0 likes · 18 min read
Big Data Technology & Architecture
Aug 22, 2023 · Big Data

DataOps Practices and Challenges at ByteDance: From Model to Productization

The article summarizes ByteDance's DataOps journey, covering its mid‑platform tool and Data BP model; core performance metrics; challenges in quality, hardware efficiency, and human efficiency; its concrete DataOps implementation; productization through DataLeap; best‑practice promotion; and the future outlook for data‑driven business value.

Big Data · ByteDance · Data Governance
0 likes · 17 min read
JD Retail Technology
Aug 21, 2023 · Artificial Intelligence

ChatGPT-4 Enhances Data Analysis Efficiency and Insight Across Big Data Scenarios

This article examines how ChatGPT-4, as an advanced natural‑language‑processing model, can streamline data analysis tasks—from generating Hive table definitions and sample data to crafting complex HiveSQL queries, visualizing results, and implementing ClickHouse and Flink solutions—thereby improving efficiency, insight, and problem‑solving in big‑data environments.

Artificial Intelligence · Big Data · ChatGPT-4
0 likes · 7 min read
DataFunTalk
Aug 21, 2023 · Databases

Case Study: Building a Real‑Time Log Data Analysis Platform with Apache Doris at China Unicom

This article describes how China Unicom’s Western Innovation Research Institute designed and deployed a centralized, real‑time log analytics platform using Apache Doris, detailing the migration from Hive and ClickHouse, performance optimizations, storage cost reductions, and the resulting improvements in data ingestion, query speed, and operational efficiency.

Apache Doris · Big Data · Cold‑Hot Data Management
0 likes · 18 min read
DataFunSummit
Aug 20, 2023 · Big Data

Kuaishou Data Service System: Modeling, Architecture, and Future Directions

This article presents Kuaishou's comprehensive data service system, covering its domain modeling, evolution from custom to unified services, the Octo query engine and data preparation platform architecture, the dual data API and analysis services, and future plans for intelligence and serverless high‑performance capabilities.

Big Data · Data Platform · Data Service
0 likes · 16 min read
DataFunTalk
Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDB · Big Data · Data Lake
0 likes · 19 min read
Model Perspective
Aug 19, 2023 · Artificial Intelligence

Unlocking Hidden Patterns: How Tensor Decomposition Powers Modern AI

This article introduces tensors and tensor decomposition, explains core operations, explores CP and other factorization methods, and demonstrates Python implementations for music and movie recommendation systems, highlighting how these techniques reveal hidden structures in large‑scale data.

Big Data · CP decomposition · Python
0 likes · 15 min read
ITPUB
Aug 18, 2023 · Databases

Key Takeaways from DTCC2023: Vector Databases, Data Privacy, and Intelligent Ops

The 14th China Database Technology Conference (DTCC2023) showcased cutting‑edge advances in vector databases, data privacy, MySQL security, and AI‑driven intelligent operations, featuring insights from industry leaders at Huawei, Tencent, eBay, Bilibili and more.

AI · Big Data · Database Security
0 likes · 10 min read

How Lakehouse Architecture is Transforming Hadoop: A Deep Dive into Hudi, Iceberg, and Delta Lake

This article analyzes the rise of lake‑house architecture in the Hadoop ecosystem, compares the technical capabilities of Hudi, Iceberg and Delta Lake, details implementation enhancements such as MOR and multi‑writer support, showcases Flink integration, presents a real‑time marketing use case, and outlines future development directions.

Big Data · Data Governance · Delta Lake
0 likes · 14 min read
Sohu Tech Products
Aug 16, 2023 · Big Data

Understanding HBase Compaction: Principles, Process, Throttling Strategies and Real‑World Optimizations

This article explains the fundamentals of HBase’s LSM‑Tree compaction, including minor and major compaction triggers, file‑selection policies, and dynamic throughput throttling, and presents practical tuning examples that show how adjusting size limits, thread pools, and off‑peak settings can dramatically improve read latency and cluster stability.

Big Data · HBase · Performance Tuning
0 likes · 35 min read
dbaplus Community
Aug 15, 2023 · Databases

Why ClickHouse Outperforms MySQL, Elasticsearch, and HBase for Massive Event Data

This article examines the massive data storage and real‑time analysis needs of an activity platform, evaluates MySQL, sharded MySQL, Elasticsearch and HBase, and explains why ClickHouse—with its columnar storage, MergeTree engine, vectorized execution, and distributed architecture—offers the best balance of write performance, query speed, and scalability for billions of records.

Big Data · ClickHouse · Columnar Database
0 likes · 31 min read
DataFunTalk
Aug 14, 2023 · Big Data

Data Warehouse Modeling Platform: Exploration and Practice at NetEase Yanxuan

This article details NetEase Yanxuan’s exploration and practice of a data warehouse modeling platform, covering background, current challenges, a comprehensive solution, step‑by‑step implementation, and the resulting improvements in model standardization, automation, and business value.

Big Data · Modeling · automation
0 likes · 18 min read
Data Thinking Notes
Aug 13, 2023 · Big Data

How to Successfully Deliver a Data Governance Project: Step‑by‑Step Guide

This article outlines a comprehensive methodology for delivering a data governance project, covering planning, blueprint design, implementation, and acceptance phases, with detailed guidance on team formation, stakeholder roles, requirement analysis, platform architecture, management processes, and post‑deployment operations.

Big Data · Data Governance · Data Platform
0 likes · 12 min read
DataFunSummit
Aug 13, 2023 · Big Data

KwaiBI: Evolution of Kuaishou’s One‑Stop Business Intelligence Platform from 1.0 to 2.0

The article details Kuaishou’s KwaiBI business intelligence platform evolution, covering its 1.0 tool‑based implementation, the 2.0 standardized architecture built on an indicator middle‑platform, core processes, data integration, self‑service features, and future directions for self‑service and intelligent analytics.

BI · Big Data · Data Integration
0 likes · 22 min read
Youzan Coder
Aug 8, 2023 · Big Data

Kylin4 Deployment and Performance Optimizations at Youzan

Since 2018, Youzan has migrated all of its online services to Kylin4. To address long cube rebuilds, a single‑point cache, CPU spikes, and throttling gaps, the team added batch segment builds, low‑priority concurrency controls, Redis‑based query caching, Parquet skew mitigation, range‑query acceleration, and class‑loader optimizations, which together doubled capacity to 150 queries per second, cut latency by up to 50%, and reduced CPU usage.

Big Data · Cube · Kylin
0 likes · 17 min read
DataFunSummit
Aug 8, 2023 · Artificial Intelligence

Xiaomi’s Experience in Deploying Intelligent Analytics: Productization, Challenges, and Future Plans

The article shares Xiaomi’s practical experience in building and productizing intelligent analytics, explaining why it is needed, how it integrates with BI, the essential prerequisites, staged implementation, technical challenges, and future roadmap including smart alerts, automated insights, and data Q&A.

AI · BI Integration · Big Data
0 likes · 15 min read
Selected Java Interview Questions
Aug 8, 2023 · Big Data

Processing 10GB Age Data on a 4GB Memory Machine Using Java: Single‑Threaded and Multi‑Threaded Solutions

This article demonstrates how to generate, read, and analyze a 10 GB file of age statistics on a 4 GB RAM, 2‑core machine using Java, comparing a single‑threaded counting method with a producer‑consumer multi‑threaded approach that dramatically improves CPU utilization and reduces processing time.

Big Data · Memory Management · Performance Optimization
0 likes · 11 min read
DataFunTalk
Aug 5, 2023 · Big Data

Apache Celeborn (Incubating): Design, Performance, Stability, and Elasticity of a Remote Shuffle Service

This article reviews the limitations of traditional Spark shuffle, introduces Apache Celeborn (Incubating) as a remote shuffle service, and details its design for performance, stability, and elasticity, including push shuffle, partition splitting, columnar shuffle, multi‑layer storage, congestion control, and real‑world evaluation.

Apache Spark · Big Data · Performance
0 likes · 19 min read
DataFunSummit
Aug 4, 2023 · Big Data

LakeSoul: An Open‑Source Real‑Time Data Lakehouse Framework – Design, Architecture, Benchmarks and Future Roadmap

This article introduces LakeSoul, an open‑source end‑to‑end real‑time lakehouse framework, detailing its design philosophy, key technologies such as ELT, metadata management, upsert and merge‑on‑read capabilities, performance benchmarks, real‑world use cases, and the roadmap for future enhancements.

Big Data · Data Lakehouse · ELT
0 likes · 18 min read