Tagged articles
103 articles
DataFunSummit
Apr 20, 2026 · Industry Insights

How Apache Gravitino Solves Data Fragmentation in the Multi‑Cloud AI Era

In a Data for AI meetup, Datastrato's VP of Engineering Shi Shaofeng explains how Apache Gravitino's metadata federation, metalake architecture, and unified access control address multi‑cloud data fragmentation, compliance, and AI‑driven governance while outlining version 1.1.0 enhancements and the roadmap for 1.2.0.

AI data governance · Apache Gravitino · metadata lake
12 min read
DataFunSummit
Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Data Lake · distributed cache
11 min read
dbaplus Community
Mar 31, 2026 · Industry Insights

Why Most Data Governance Projects Fail and How to Build a Practical, Engineer‑Friendly Solution

Most data governance efforts fail not because of technology but because they start in the wrong direction, focusing on rules, platforms, and processes that add friction instead of improving data usability. The article offers a step‑by‑step, low‑overhead fix with concrete SQL and Python templates.

Data Governance · Engineering Productivity · Python
25 min read
DataFunSummit
Mar 25, 2026 · Big Data

How Apache Gravitino and OpenLineage Transform Data Governance for AI‑Driven Enterprises

In the era of AI and multi‑cloud, this article analyzes the core challenges of data governance—data silos, quality gaps, and compliance risks—and explains how Apache Gravitino’s unified metadata architecture together with OpenLineage’s standardized lineage model provide a scalable, automated solution for intelligent, real‑time data management.

Apache Gravitino · Big Data · Data Governance
15 min read
Big Data Tech Team
Jan 19, 2026 · Big Data

What Is Data Fabric and How It Can Eliminate Data Silos Today

This article explains the concept of Data Fabric, debunks common misconceptions, outlines the three key drivers behind its rise, and provides a practical four‑step roadmap—including metadata, semantic layers, policy engines, and AI—to help teams of any size adopt the technology.

AI · Data Fabric · Data Integration
7 min read
DataFunSummit
Dec 1, 2025 · Big Data

7 Cutting-Edge Data Engineering Practices Shaping AI-Driven Data Lakes

This article collection showcases seven advanced data engineering solutions—from Tencent Cloud's Iceberg batch‑stream integration and Apache Gravitino metadata lineage to Xiaohongshu's Lakehouse evolution and multimodal AI data lake implementations—highlighting architectural innovations, performance optimizations, and real‑world deployment insights for modern big‑data platforms.

Apache Gravitino · Apache Iceberg · Batch-Stream Integration
7 min read
DataFunSummit
Nov 24, 2025 · Big Data

How Tencent Cloud Uses Iceberg, Gravitino and Multimodal Lakes for Unified Data Processing

This article series explores Tencent Cloud's Iceberg‑based batch‑stream integration, Apache Gravitino's unified metadata and lineage solution, Xiaohongshu's data‑architecture evolution for the Big AI Data era, and a practical Data+AI multimodal data‑lake implementation, highlighting challenges, architectural designs, and performance gains.

Big Data · Data Lake · Iceberg
7 min read
DataFunSummit
Oct 29, 2025 · Big Data

How Douyin’s Data Asset Platform Revolutionizes Big Data Lineage

This article introduces Douyin Group’s Data Asset Management Platform, explaining its shift from traditional metadata to a comprehensive data‑asset approach, detailing the platform’s capabilities, and focusing on the evolution and application of full‑link data lineage across four key topics to improve visibility, quality, security, and cost efficiency.

Big Data · Data Assets · Douyin
5 min read
Instant Consumer Technology Team
Oct 28, 2025 · Artificial Intelligence

Can Data Virtualization Deliver Millisecond Real‑Time Features Across Stores?

This article shares a three‑year journey of building a data‑virtualization‑based, multi‑environment feature management framework for real‑time risk decision platforms, detailing challenges like heterogeneous storage, cold‑start, and operational stability, and presenting a unified architecture that decouples physical storage from business logic.

Big Data · Real-time analytics · data virtualization
16 min read
DataFunSummit
Oct 12, 2025 · Big Data

How Douyin’s Data Asset Platform Revolutionizes Big Data Lineage

This article introduces Douyin Group’s Data Asset Management Platform, explaining its shift from traditional metadata to comprehensive data assets, detailing the evolution, architecture, and applications of its full‑link big data lineage, and offering strategic guidance for building effective lineage systems.

Data Asset · Data Governance · Data Lineage
5 min read
Baidu Intelligent Cloud Tech Hub
Sep 22, 2025 · Cloud Computing

How Mantle Breaks the Hierarchical Namespace Bottleneck in Cloud Object Storage

The Mantle system, presented in a SOSP'25 paper by Baidu's storage team and collaborators, delivers a distributed hierarchical namespace for cloud object storage that overcomes traditional scalability and performance limits, enabling massive data lake workloads with dramatically reduced latency and vastly increased throughput.

Distributed Systems · SOSP · cloud storage
8 min read
Data Thinking Notes
Sep 14, 2025 · Artificial Intelligence

How to Build a Robust Tool Integration Module for AI Agents

This article explains the architecture, core components, and step‑by‑step implementation of a tool usage module that enables AI agents to standardize, select, execute, and transform external tools, illustrated with a sales data analysis case and detailed code snippets.

AI Agent · LLM · metadata management
9 min read
DataFunSummit
Sep 2, 2025 · Big Data

How Xiaomi Cuts Costs and Boosts Performance with Cloud‑Native Data Lake Architecture

Xiaomi’s engineers explain how they tackled data‑lake challenges—small files, metadata latency, and multi‑cloud costs—by combining compact storage, Gravitino‑based metadata governance, Iceberg and Paimon formats, and JuiceFS abstraction, achieving lower storage expenses, faster queries, and a roadmap toward intelligent, real‑time, multimodal lakehouses.

Big Data · Data Lake · Storage Optimization
14 min read
DataFunTalk
Aug 28, 2025 · Big Data

How JD Retail Tackles Data Governance Challenges to Boost Efficiency

JD Retail faces growing data volume, redundant models, and resource‑intensive storage, prompting a comprehensive data‑governance strategy that defines standards, streamlines architecture, isolates development, and optimizes compute and storage costs, ultimately enabling more efficient, secure, and agile data operations across the enterprise.

Big Data · Data Architecture · Data Governance
8 min read
Big Data Tech Team
Jun 9, 2025 · Industry Insights

How AI Large Models Transform Data Governance: 2025 Insights & Best Practices

This article examines the essence of data governance, outlines its four core domains, proposes a strategic and technical implementation roadmap, evaluates effectiveness with the DCAM model, and explores how AI large models can enhance metadata, data quality, and compliance while highlighting practical limitations and future trends.

AI Large Models · Data Quality · Future Trends
9 min read
DataFunSummit
Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background of data, key challenges, award-winning capabilities, three-pronged solutions—automation, linking+, and AI—along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · Automation · Big Data
18 min read
Big Data Technology & Architecture
May 16, 2025 · Big Data

Apache Gravitino: An Open‑Source Metadata Lake for Unified Data and AI Asset Management

Apache Gravitino is an open‑source metadata service platform that provides a unified, high‑performance, geographically distributed metadata lake, enabling end‑to‑end data governance, multi‑engine access, and direct management of both structured and unstructured data assets across diverse systems.

Apache Gravitino · Data Governance · Data Lake
9 min read
Ma Wei Says
Mar 30, 2025 · Fundamentals

How Kafka 4.0’s KRaft Replaces ZooKeeper with Raft Consensus

Kafka 4.0 introduces KRaft, a ZooKeeper‑free metadata layer built on the Raft consensus algorithm, detailing role transitions, leader election, log replication, controller and broker responsibilities, and fault‑tolerance mechanisms, enabling a more scalable and self‑managed architecture for large‑scale distributed streaming.

Consensus Algorithm · Distributed Systems · KRaft
13 min read
Big Data Tech Team
Feb 17, 2025 · Industry Insights

How DeepSeek Transforms Data Warehouse Development: 5 Game-Changing Benefits

DeepSeek, the popular Chinese large‑language model, boosts data‑warehouse engineers' productivity by offering free, open‑source AI assistance across code writing, model design, metadata management, data quality monitoring, and governance, ultimately maximizing enterprise data asset value.

Data Quality · Data Warehouse · DeepSeek
5 min read
Architects' Tech Alliance
Jan 5, 2025 · Fundamentals

HadaFS: A New Burst Buffer File System for Scalable High‑Performance Computing

The article presents HadaFS, a novel burst‑buffer‑based distributed file system that combines the scalability of local burst buffers with the data‑sharing advantages of shared buffers, details its LTA architecture, metadata handling, the Hadash management tool, and extensive performance evaluations on the SNS supercomputer.

Burst Buffer · HPC Storage · Performance Evaluation
18 min read
Bilibili Tech
Dec 17, 2024 · Big Data

Apache Gravitino: Metadata Management Practices and Production Experience at Bilibili

Bilibili adopted Apache Gravitino as a unified metadata platform that decouples consumers, consolidates schemas and Fileset‑based unstructured data across heterogeneous sources, cuts metadata and storage costs, resolves inconsistencies, boosts Hive Metastore performance, and enables features such as Iceberg branching and future AI‑centric governance.

Apache Gravitino · Big Data · Fileset
20 min read
Huolala Tech
Dec 5, 2024 · Big Data

Huolala’s Metadata Platform: Scaling Data Lineage, AI Search & Cost Governance

Huolala’s data team details the evolution of its metadata management platform—covering architecture, stages from early Hive‑ETL to real‑time field‑level lineage, AI‑driven smart search, cost‑governance mechanisms, and security classifications—showcasing practical solutions for data discoverability, efficiency, and protection at scale.

AI search · Data Lineage · cost governance
27 min read
ByteDance Data Platform
Nov 27, 2024 · Big Data

Inside Douyin’s Data Asset Platform: Transforming Data Lineage and Governance

Douyin Group’s data asset management platform introduces a systematic "manage, find, use" approach that unifies metadata collection, full‑coverage data lineage, and a suite of applications across development, governance, asset utilization, and security, while outlining its architecture, modeling, quality metrics, and future roadmap.

Data Governance · Data Lineage · metadata management
14 min read
DataFunTalk
Nov 10, 2024 · Big Data

Douyin Group Data Asset Management Platform and Data Lineage Architecture Overview

This article provides a comprehensive overview of Douyin Group's data asset management platform, detailing the evolution, architecture, and applications of its large‑scale data lineage system, and discusses future directions for enhancing data quality, cost efficiency, and security across the organization.

Data Governance · Data Lineage · metadata management
15 min read
Bilibili Tech
Nov 1, 2024 · Big Data

Magnus: Intelligent Data Optimization Service for Iceberg Tables in Bilibili's Lakehouse Platform

Magnus is Bilibili’s self‑developed intelligent service that continuously optimizes Iceberg tables by scheduling snapshot expiration, orphan‑file cleanup, manifest rewriting, and multi‑dimensional data optimizations—including small‑file merging, sorting, distribution, and index creation—while automatically recommending configurations from real‑time query logs, delivering over 99.9% task success and up to 30% scan‑data reduction.

Data Lake · Iceberg · Intelligent Recommendation
15 min read
DataFunSummit
Aug 13, 2024 · Big Data

Data Cost Reduction and Efficiency: Qichacha's Data Architecture and Multi‑Cloud Unified Design

This article presents Qichacha's comprehensive data‑cost‑reduction strategy, detailing its Hadoop‑based three‑pillar architecture, layered data warehouse, Hive upgrades, unified metadata across multi‑cloud clusters, middleware choices such as Alluxio and JuiceFS, version‑compatible hybrid clouds, and Kubernetes‑driven resource orchestration to achieve scalable, low‑cost data processing.

Big Data · Data Warehouse · Hadoop
16 min read

How Hudi MetaServer Transforms Metadata Management and Performance in Data Lakes

This article examines the challenges of Hudi metadata stored on HDFS, introduces the independently developed Hudi MetaServer for centralized metadata, visual management, unified permission control, TTL, expression payloads, and multi‑active scaling, and outlines future enhancements such as LLS, multi‑table fusion, and JDBC support.

Big Data · Data Lake · Hudi
11 min read
Bilibili Tech
Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili’s one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka and other services into a unified system that evolved through four stages—standardization, metadata‑driven construction, containerization, and observability—addressing node consistency, scaling, fault self‑healing, and resource optimization while delivering elastic scaling, automated start/stop, and future cost‑saving and stability enhancements.

Cluster Management · Observability · Resource Optimization
12 min read
vivo Internet Technology
May 29, 2024 · Operations

vivo CICD Artifact Management: Evolution and Implementation Practices

vivo’s CICD artifact management has evolved from manual builds to a comprehensive Platform Management 2.0 that provides unified storage, multi‑type support, version control, promotion, security scanning, lifecycle policies, and fine‑grained access, dramatically reducing errors and operational costs.

Artifact Management · Artifact Promotion · CICD
15 min read
DataFunSummit
May 21, 2024 · Operations

Bilibili Data Governance Operational Framework Practice

This article presents Bilibili's practical data governance operational framework, introducing the DAMA‑Bok methodology, detailing two real‑world cases on storage‑level risk and data‑loss post‑mortem, and outlining the organizational, metadata, and embedded governance mechanisms that drive cost and quality improvements.

DAMA-Bok · Data Quality · cost governance
19 min read
DataFunTalk
May 19, 2024 · Big Data

Tencent's Multi-Engine Unified Metadata and Permission Management for Big Data

This article introduces Tencent's Big Data Processing Suite (TBDS), discusses challenges of data silos, and presents Gravitino's open‑source unified metadata service and permission model, detailing how it integrates Hadoop, MPP, and various catalog plugins to provide consistent access control across heterogeneous data platforms.

Big Data · Gravitino · Hadoop
12 min read
Bitu Technology
Jan 17, 2024 · Artificial Intelligence

Rosetta Stone: Scalable ID Mapping System for Tubi's Content Library Using LLMs and Embeddings

This article describes how Tubi built the Rosetta Stone system—a flexible ID mapping workflow that leverages large language models, embedding similarity ranking, and K‑nearest‑neighbors to unify and enrich metadata across a 200,000‑title library, improve content recommendation, and streamline operations.

Big Data · LLM · content ID mapping
10 min read
Programmer DD
Sep 15, 2023 · Big Data

How Alluxio Manages Massive Metadata: Inode, Block, MountTable, and Worker Insights

This article examines Alluxio's open-source distributed file system, detailing the core types of metadata—inode, block, mount table, and worker—along with the mechanisms for their storage, management, and optimization in both HEAP and ROCKS modes, and provides practical configuration guidance for scaling large-scale data environments.

Alluxio · Big Data · Distributed File System
15 min read
DataFunTalk
Sep 12, 2023 · Big Data

Building an Intelligent Data Governance Platform at NetEase Cloud Music: Architecture, Practices, and Future Plans

This article presents a comprehensive case study of NetEase Cloud Music’s metadata‑driven intelligent governance platform, detailing its scale, construction background, modular architecture, rule‑based automation, practical deployment, and future roadmap for sustainable data ecosystem management.

Automation · Big Data · Data Governance
22 min read
Weimob Technology Center
Aug 1, 2023 · Big Data

How Weimeng Transformed Data Asset Governance: A Practical Blueprint for Enterprises

Facing fragmented metadata, unclear ownership, and costly data duplication, Weimeng implemented a comprehensive data asset governance framework—covering metadata standards, lineage visualization, metric normalization, and cost management—to boost data quality, security, and business value across its new‑retail platform.

Data Governance · Data Lineage · data operations
15 min read
Didi Tech
Jul 31, 2023 · Big Data

Data Serviceization at Didi: Architecture, Phases, and Standard Metric Service

Didi’s data serviceization converts raw business data into consumable services through a four‑stage pipeline—integration, development, production, and back‑flow—while the Data Dream Factory and Shu‑Chain platform automate synchronization, provide a unified access gateway for thousands of APIs, and introduce a standard metric service that abstracts storage complexities and ensures high‑performance, secure data delivery.

Data Integration · Data Platform · data serviceization
16 min read
AntTech
Jul 11, 2023 · Operations

Achieving Full-Stack Observability for Cloud and On-Premise Applications with Ant Group's BOS Platform

This article examines the challenges of maintaining stability across cloud and on‑premise environments, explains how Ant Group's Business‑Intelligent Observability Service (BOS) addresses these issues through unified metadata, seamless application integration, data standardization, and extensive case studies, and demonstrates the resulting improvements in reliability and operational efficiency.

Full-stack Tracing · Observability · cloud computing
16 min read
DataFunTalk
May 22, 2023 · Big Data

Alibaba Cloud Data Lake: Unified Metadata and Storage Management Practices

This article explains Alibaba Cloud's data lake architecture, unified metadata services, storage management optimizations, and format handling techniques, illustrating how lakehouse concepts, multi‑engine support, and lifecycle policies enable efficient, secure, and cost‑effective big data processing in the cloud.

Big Data · Cloud Services · Data Lake
22 min read
Data Thinking Notes
Apr 5, 2023 · Big Data

Mastering Data Governance: From Challenges to End‑to‑End Solutions

This article explores the key problems data governance aims to solve, outlines a comprehensive governance framework, and details practical implementation steps—including tool integration, metadata management, lake‑in and lake‑out processes, and governance policies—to achieve a closed‑loop, value‑driven data ecosystem.

Big Data · Data Governance · Data Lake
13 min read
DataFunSummit
Mar 31, 2023 · Big Data

Data Governance Practices and Implementation at DataCake

The article outlines DataCake's data governance journey, describing the challenges of data silos and cost inefficiencies, the strategic thinking behind a unified metadata platform, the implementation of governance tools, cost analysis modules, and asset inventory, and concludes with results, future plans, and a Q&A session.

Big Data · Operational Efficiency · cost analysis
14 min read
DataFunSummit
Mar 1, 2023 · Big Data

Data Governance: Challenges, Framework, and Implementation Practices

This article explains the problems that data governance addresses, outlines a comprehensive governance framework—including system architecture, processes, and policies—and describes practical implementation steps such as integrated tooling, standardized modeling, metadata management, lake‑in and lake‑out governance, and organizational structures for sustainable data management.

Big Data · Governance Framework · metadata management
12 min read
DataFunTalk
Feb 26, 2023 · Big Data

Design, Optimization, and Use Cases of Data Lineage in ByteDance's DataLeap Platform

This article presents an in‑depth overview of DataLeap's data lineage capabilities, covering the challenges, multi‑layer model design, implementation with Apache Atlas and JanusGraph, performance optimizations, diverse use cases across asset, development, governance and security domains, and future trends for lineage technology.

Apache Atlas · Big Data · Data Governance
19 min read
Youzan Coder
Feb 7, 2023 · Big Data

Automated Offline Data Cost Optimization in Youzan's Data Platform

Youzan built an automated offline data cost‑optimization platform that gathers accurate metadata, mines unused or failing tables and tasks, and safely decommissions them through a backend‑frontend workflow with owner validation, notifications, rollback safeguards, and plans to extend lineage coverage and real‑time asset handling.

Big Data · Cost reduction · Data Governance
11 min read
DataFunSummit
Feb 2, 2023 · Big Data

Data Governance Strategies: Concepts, Practices, and Case Studies

The article explains why data is a critical corporate asset, distinguishes narrow and broad data‑governance approaches, outlines strategic principles such as treating governance as a systematic, prioritized effort, and presents eight real‑world case studies from companies like Tencent, SF Tech, Huolala, and NetEase.

Case Studies · Data Quality · metadata management
7 min read
DataFunTalk
Jan 31, 2023 · Big Data

Tencent's Data Governance Practices and Technical Implementation

This article presents Tencent's comprehensive data governance framework, covering its definition, objectives, challenges, methodology, organizational structure, metadata management, data asset lifecycle, security measures, and technical implementation details such as microservice architecture, data collection, lineage analysis, and storage solutions.

Big Data · Data Governance · Tencent
19 min read
DataFunTalk
Jan 1, 2023 · Big Data

Zhihu's Real-Time Computing Platform: From Skytree 1.0 to Mipha 2.0

Zhihu’s real‑time computing platform, initially built as Skytree 1.0 on Kubernetes and later re‑engineered as Mipha 2.0 with Flink SQL, unified metadata management, dynamic jar loading, UDF support, Protobuf format, CDC integration, and extensive operational optimizations, now processes petabyte‑scale data with high reliability.

Flink · Kubernetes · Real‑Time Computing
21 min read
Data Thinking Notes
Nov 24, 2022 · Fundamentals

How to Build an Enterprise Data Governance System from Scratch

This article explains what data governance is, why enterprises need it, the key components such as data quality, metadata, master data, asset and security management, and provides a step‑by‑step framework, organizational structure, platform features, evaluation methods and common pitfalls.

Data Assets · Data Governance · Data Quality
17 min read
Data Thinking Notes
Nov 10, 2022 · Big Data

Building Kuaishou’s Scalable Metadata Management Platform for Big Data

This article details Kuaishou’s evolution of its metadata management platform—from early Hive‑centric beginnings to a unified 2.0 architecture and a forward‑looking 3.0 vision—highlighting challenges, key technologies, and how metadata drives data production, consumption, governance, and cost optimization across the big‑data middle platform.

Data Governance · Data Platform · metadata lineage
17 min read
DataFunSummit
Nov 4, 2022 · Big Data

Real-Time Data Lake Practice at ByteDance: Architecture, Challenges, and Solutions

ByteDance’s data platform team explains their real‑time data lake implementation, covering its evolving definition, six core capabilities, challenges such as data management, concurrent updates, performance and log ingestion, and detailed case studies of multi‑stage deployment, indexing, metadata services, and future roadmap.

Hudi · Real-time Data Lake · Streaming
32 min read
Python Crawling & Data Mining
Oct 30, 2022 · Big Data

Why Ozone Is the Next‑Generation Distributed Object Store for Big Data

This article explains how Ozone, the Hadoop community’s new distributed object‑storage system, overcomes HDFS’s small‑file limitations with a hierarchical Volume‑Bucket‑Object model, detailing its architecture, components, data flow for creating and reading objects, and the benefits of its scalable, fault‑tolerant design.

Big Data · Hadoop · Ozone
12 min read
Tencent Cloud Developer
Sep 27, 2022 · Big Data

GooseFS: Accelerating Cloud Storage for Big Data and Data Lake Platforms

GooseFS, Tencent Cloud’s Hadoop‑compatible storage accelerator, adds a local NVMe‑SSD cache layer to cloud‑native data lakes, letting users boost query speeds by up to 46 % and cut backend bandwidth by 200 Gbps without code changes, as demonstrated by a music‑industry customer’s 200‑node deployment caching ten million files.

Cost reduction · Data Lake · GooseFS
16 min read
DataFunSummit
Aug 12, 2022 · Big Data

JD's Big Data Cross‑Domain and Hierarchical Storage Practices

JD’s article details its big‑data platform’s cross‑domain and hierarchical storage solutions, describing the challenges of multi‑datacenter data synchronization, the architecture of its storage layer, the implemented asynchronous and synchronous data flows, topology management, metadata tagging, and performance‑enhancing techniques for efficient, disaster‑resilient data handling.

Data Platform · Hierarchical Storage · cross-domain storage
11 min read
DataFunTalk
Jul 14, 2022 · Big Data

Real‑Time Data Lake Practices at ByteDance and Alibaba: Architecture, Challenges, and Solutions

This article presents detailed case studies of ByteDance and Alibaba implementing real‑time data lake solutions with Hudi and Flink, describing the business drivers, architectural challenges, and the specific technical strategies such as unified metadata layers, optimistic locking, scalable hash indexing, and CDC‑based incremental ETL to achieve low‑latency, high‑throughput data processing.

Flink · Hudi · Real-time Data Lake
0 likes · 9 min read
DataFunTalk
Jul 13, 2022 · Databases

Technical Analysis and Case Studies of Knowledge Graphs by Neo4j

This presentation explains where knowledge resides in data architectures, demonstrates knowledge‑graph‑driven skill discovery, metadata management, and semantic search, and concludes with a comparison of GraphQL and Cypher for graph queries, illustrated with real‑world Neo4j case studies.

Cypher · GraphQL · Knowledge Graph
0 likes · 11 min read
ByteDance Data Platform
Jun 8, 2022 · Backend Development

How ByteDance Optimized Data Catalog Performance with Apache Atlas and JanusGraph

This article details ByteDance's 2021 overhaul of its Data Catalog system, the performance regressions encountered after switching to Apache Atlas, and the step‑by‑step backend optimizations—including JanusGraph tuning, Gremlin query refactoring, parallel processing, and write‑path improvements—that reduced latency from minutes to seconds.

Apache Atlas · Data Catalog · JanusGraph
0 likes · 12 min read
Big Data Technology Architecture
Jun 5, 2022 · Big Data

Introduction to Data Lake Concepts, Capabilities, and Applications

This article explains the origin and definition of data lakes, describes their ability to store structured, semi‑structured and unstructured data at any scale on‑premises or in the cloud, outlines essential lake capabilities such as unified storage, raw‑data preservation, scalable compute, metadata and security management, and compares data lakes with data warehouses and lakehouse architectures through real‑world cloud‑native examples.

cloud storage · metadata management
0 likes · 16 min read
vivo Internet Technology
May 25, 2022 · Big Data

Understanding Druid Metadata Management and Architecture

Apache Druid manages metadata through a layered, distributed system where the Overlord coordinates ingestion tasks, MiddleManagers launch Peons to create segments, Coordinators and Historical nodes store and serve segment data, Brokers route queries, while MySQL, ZooKeeper, memory, and local files synchronize metadata for fault‑tolerant, high‑performance OLAP analytics.

Big Data · Druid · Query Processing
0 likes · 19 min read
DataFunTalk
May 23, 2022 · Big Data

Real-Time Data Lake Practices at ByteDance: Architecture, Challenges, and Solutions

ByteDance shares its real‑time data lake implementation, covering the evolving definition of data lakes, six core capabilities, challenges such as data management, weak concurrent updates, performance, and log ingestion, and detailed solutions including Hudi Metastore Server, bucket indexing, multi‑stage use cases, and future roadmap.

Batch Processing · Hudi · Real-time Data Lake
0 likes · 32 min read
Airbnb Technology Team
May 12, 2022 · Information Security

Airbnb Data Privacy and Security Engineering – Data Protection Platform (DPP) Overview and Madoka Metadata System

Airbnb’s Data Protection Platform (DPP) combines automated discovery, classification, encryption and privacy‑orchestration services—Inspekt, Angmar, Cipher, Obliviate, Minister, and the Madoka metadata system—to continuously inventory petabyte‑scale MySQL, Hive and S3 assets, track ownership and security attributes, and enforce GDPR, PIPL and CCPA compliance.

Airbnb · Automation · Data Protection
0 likes · 15 min read
ByteDance Data Platform
Apr 27, 2022 · Big Data

How ByteDance Built a Scalable Data Catalog: Key Technologies and Future Plans

ByteDance’s Data Catalog article details the system’s unified metadata model, standardized ingestion connectors, search optimization techniques, lineage capabilities, and storage layer enhancements, highlighting key technical designs, performance improvements, and future work to advance data governance and asset utilization.

Data Catalog · Data Lineage · Storage Optimization
0 likes · 12 min read
ByteDance Data Platform
Dec 31, 2021 · Big Data

How ByteDance Leverages Hudi for a Real‑Time Data Lake Platform

This article introduces ByteDance’s real‑time data lake platform built on Apache Hudi, covering Hudi fundamentals, table types, indexing, practical use cases, platform optimizations, and future roadmap, illustrating how the system enables low‑latency, scalable analytics across batch and streaming workloads.

Hudi · Lakehouse · metadata management
0 likes · 11 min read
Ctrip Technology
Dec 16, 2021 · Big Data

Data Standard Management Practices in Ctrip Vacation Data Governance

This article outlines Ctrip Vacation's data standard management approach, covering why standards are needed, the three‑element framework of scope, tools, and policies, and detailed practices for data integration, production change handling, metadata governance, portal dashboard standardization, and self‑service query templating.

Big Data · Data Governance · Data Integration
0 likes · 12 min read
DataFunTalk
Jul 27, 2021 · Big Data

Building a Real‑Time Data Warehouse with Apache Doris at Shuhai Supply Chain

This article describes how Shuhai Supply Chain upgraded its data warehouse from a complex, high‑cost 1.0 architecture to a streamlined, real‑time solution built around Apache Doris, detailing the motivations, design choices, zero‑code ingestion, metadata management, Flink connector, and the resulting performance gains.

Apache Doris · Big Data · Flink
0 likes · 13 min read
Big Data Technology & Architecture
Jun 29, 2021 · Big Data

Huawei Data Governance Practices and Metadata Management

This article outlines Huawei's data governance practices, detailing its digital transformation vision, two-stage data management evolution, structured and unstructured data classification frameworks, external data compliance, and comprehensive metadata management architecture, highlighting challenges and solutions for enterprise-wide data assets.

Digital Transformation · Huawei · data classification
0 likes · 20 min read
Big Data Technology Architecture
Jun 10, 2021 · Big Data

Understanding Apache Iceberg: Design, Architecture, and Its Application at NetEase Cloud Music

This article explains Apache Iceberg’s table‑format design, compares it with Hive’s limitations, details its snapshot‑based architecture and metadata handling, and describes how NetEase Cloud Music leveraged Iceberg to dramatically improve large‑scale log processing performance and stability.

Apache Iceberg · Spark · Table Format
0 likes · 12 min read
macrozheng
May 8, 2021 · Big Data

Why Kafka 2.8 Drops Zookeeper: Architecture, Challenges, and KIP‑500

This article explains how Kafka 2.8 removes its dependency on ZooKeeper, describes Kafka's core concepts and its interaction with ZooKeeper, outlines the role of the Controller, discusses operational complexities and upgrade paths with KIP‑500, and highlights the benefits of the new KRaft‑based architecture.

Distributed Systems · KIP-500 · KRaft
0 likes · 10 min read
DataFunTalk
Feb 8, 2021 · Big Data

Ozone: The Next‑Generation Distributed Storage System Aiming to Replace HDFS

This article explains how Apache Ozone, built on the HDDS layer, addresses the scalability, memory, and performance limitations of HDFS by splitting metadata services, using RocksDB, implementing fine‑grained locking, Raft‑based HA, and offering rich APIs, while outlining current challenges and future roadmap.

Big Data · HDDS · HDFS
0 likes · 29 min read
DataFunSummit
Nov 17, 2020 · Big Data

Sohu Intelligent Media Data Warehouse Architecture and Technical Practices

This article presents Sohu Intelligent Media's data warehouse construction practice, covering fundamental concepts, batch and real‑time processing, OLAP theory, multidimensional modeling, workflow management, data quality, metadata lineage, and security, with a focus on Apache Doris and a Lambda‑style architecture.

Apache Doris · Batch Processing · Data Quality
0 likes · 18 min read
Beike Product & Technology
Nov 13, 2020 · Big Data

Beike One‑Stop Big Data Development Platform: Architecture, Evolution, and Future Outlook

The article summarizes Beike's one‑stop big data development platform, describing its data business background, the evolution from a simple Hadoop‑Kafka‑Hive stack to a metadata‑driven, asset‑oriented platform, and outlines current capabilities in data management, integration, scheduling, quality, openness, and future plans.

Big Data · Data Governance · Data Platform
0 likes · 11 min read
Alibaba Cloud Developer
Oct 25, 2020 · Big Data

How Alibaba’s Cloud‑Native Data Lake Solves Big Data Challenges

Alibaba Cloud’s Data Lake Analytics (DLA) tackles the growing complexity of data scenarios by offering cloud‑native, serverless solutions for data lake management, massive metadata construction, and high‑performance Spark and Presto engines, while addressing challenges such as high entry barriers, stability, and multi‑tenant isolation.

Cloud Native · Data Lake · Presto
0 likes · 22 min read
ITPUB
Oct 16, 2020 · Big Data

How NetEase Cloud Music Built a Real‑Time Data Warehouse with Flink & Calcite

This article details NetEase Cloud Music's evolution of a real‑time data warehouse built on Flink 1.9 and Calcite, covering platform scale, architectural design, metadata management, SDK simplifications, monitoring improvements, and concrete use cases such as AB‑testing, live reporting, and feature serving.

Big Data · Calcite · Flink
0 likes · 8 min read
Architecture Digest
Sep 12, 2020 · Backend Development

Zookeeper Usage Scenarios and Interview Analysis

This article explains common ZooKeeper usage scenarios—including distributed coordination, distributed locking, metadata/configuration management, and high‑availability—provides interview‑style analysis, and illustrates each case with diagrams, helping Java developers understand how ZooKeeper supports core distributed system functions.

Lock · ZooKeeper · coordination
0 likes · 5 min read
58 Tech
Jul 13, 2020 · Big Data

Design and Implementation of a Financial Data Warehouse: Architecture, Modeling, Quality Monitoring, and Metadata Management

This article presents a comprehensive design and implementation guide for a financial data warehouse, covering background needs, modeling methodology choices, a layered architecture, data quality monitoring, metadata management, naming and coding standards, and future development directions.

Big Data · Data Quality · Data Warehouse
0 likes · 11 min read
Big Data Technology Architecture
Apr 20, 2020 · Big Data

Introduction to HDFS: Architecture, Features, Replication, Rack Awareness, and Metadata Management

This article provides a comprehensive overview of Hadoop Distributed File System (HDFS), covering its streaming data access model, key characteristics, master‑slave architecture, block storage and replication mechanisms, rack‑aware placement strategy, and how the NameNode manages metadata and checkpoints.

Distributed File System · HDFS · Hadoop
0 likes · 7 min read
Dada Group Technology
Apr 15, 2020 · Big Data

Practice Experience of Dada Group's Real-Time Computation SQLization Using Dada Flink SQL

This article details Dada Group's development of the Dada Flink SQL engine, describing its background, architecture, parser design, dimension‑table join strategies, numerous enhancements such as HA support, Kafka keyword handling, metadata integration, Redis and ClickHouse sinks, BINLOG simplification, and future migration plans toward Flink 1.10.

ClickHouse · Flink · Real‑Time Computing
0 likes · 12 min read
dbaplus Community
Jan 14, 2020 · Big Data

How OPPO Built a Real‑Time Data Warehouse with Flink SQL

This article details OPPO's evolution from an offline data warehouse to a real‑time platform, describing the business scale, data middle platform architecture, migration strategy using Flink SQL, extensions like AthenaX, and practical use cases such as real‑time ETL, CTR calculation, and tag import.

ETL · Flink · SQL
0 likes · 18 min read
Big Data Technology & Architecture
Oct 12, 2019 · Big Data

Origin Data Governance Platform: Architecture, Modules, and Implementation at Meituan

The article describes Meituan's Origin Data Governance Platform, detailing its background, challenges, architectural redesign, core modules such as data storage, metadata, business, security, and application management, as well as its internal workflow, achievements, and future roadmap for unified, secure, and high‑performance data services.

Meituan · metadata management · platform architecture
0 likes · 22 min read
Mafengwo Technology
Sep 26, 2019 · Big Data

Mafengwo’s Data Warehouse & Middle Platform: Architecture, Modeling, Toolchain

This article details Mafengwo’s journey in constructing a data warehouse and data middle platform, covering the core three‑layer architecture, hybrid modeling approaches, the supporting toolchain for data synchronization, scheduling, and metadata management, and the design of an indicator platform for business analytics.

Big Data Architecture · Data Middle Platform · Data Warehouse
0 likes · 18 min read
DataFunTalk
Aug 1, 2019 · Big Data

Streaming Data Platform Practices and Challenges at Beike Real Estate

This article presents an in‑depth overview of Beike's four‑layer streaming data platform, covering the foundational infrastructure, capability aggregation, data content, and output layers, as well as the challenges of metadata management, real‑time processing, and productization through the Ark and Tianyan systems.

Ark platform · Beike · Tianyan
0 likes · 14 min read
Meituan Technology Team
Dec 27, 2018 · Big Data

Meituan Origin Data Governance Platform: Architecture and Practices

Meituan’s Origin Data Governance Platform inserts a unified governance layer between its data‑warehouse and application stacks, consolidating metric and dimension definitions, automating metadata management, enforcing security and workflow controls, and delivering cross‑engine query, monitoring and lineage capabilities that resolve inconsistencies and boost trust across dozens of internal data platforms.

metadata management · platform architecture
0 likes · 21 min read
Youzan Coder
Aug 3, 2018 · Big Data

Youzan Data Warehouse Metadata System: From Manual Tables to Metadata‑Driven Architecture

Youzan’s data‑warehouse metadata system evolved from manually maintained tables to an automated data dictionary and finally to a metadata‑driven architecture that automatically captures technical, business, and process metadata, visualizes lineage, tracks resource usage, manages synchronization rules and permissions, and now aims to improve novice usability with visual models and impact‑analysis tools.

Big Data · Data Warehouse · Hive
0 likes · 11 min read
ITPUB
May 30, 2018 · Backend Development

How JD.com Engineered Its Own Distributed Storage System for Billions of Files

This article chronicles JD.com's journey from recognizing massive storage demands to designing, building, and evolving a self‑developed distributed storage platform—JFS—that handles small and large files, powers a custom image system, object storage, and future container‑native workloads.

Backend Engineering · JFS · distributed storage
0 likes · 16 min read
ITFLY8 Architecture Home
Jul 1, 2017 · Fundamentals

Designing Distributed File Systems: Solving Local FS Limits

Distributed file systems extend traditional local storage by partitioning data across multiple servers, using a master node for metadata and coordination, handling namespace, replication, load balancing, caching, and client interfaces, thereby overcoming file size, quantity, and concurrency constraints of ext3, reiserfs, and similar local filesystems.

Distributed File System · Replication · caching
0 likes · 15 min read
ITFLY8 Architecture Home
May 8, 2017 · Fundamentals

Designing Scalable Distributed File Systems: Architecture, Challenges, and Solutions

This article explains how distributed file systems overcome the limitations of traditional local file systems by using a master‑metadata server, multiple data nodes, and client interfaces, and it details the key architectural components, common problems, and practical engineering solutions such as replication, load balancing, and caching.

Distributed File System · Replication · architecture
0 likes · 15 min read
360 Zhihui Cloud Developer
Feb 23, 2017 · Fundamentals

How Ceph Monitor Uses Paxos to Ensure Consistent Metadata Management

This article explains the role of Ceph Monitor as the metadata management component in Ceph, detailing its centralized yet scalable design, the trade‑offs between centralized and peer‑to‑peer approaches, and how an improved Paxos algorithm with Bootstrap, Recovery, and read/write phases ensures consistent, fault‑tolerant cluster operation.

Ceph · Cluster Consistency · Paxos
0 likes · 9 min read
dbaplus Community
Dec 11, 2016 · Operations

How to Modernize DBA Operations: Building an Automated Database Management Platform

This article shares the design and implementation of an automated DBA platform that modernizes database operations through metadata collection, self‑service data access, scripted deployment, backup, restore, and performance monitoring, addressing common pain points in large‑scale e‑commerce environments.

DBA automation · backup scripts · metadata management
0 likes · 11 min read