Tagged articles
3672 articles
Data Thinking Notes
Nov 19, 2024 · Big Data

Unlocking Data Value: The Four Stages of Enterprise Data Asset Realization

This article explains how enterprises transform raw data into valuable assets through four development stages, a triple‑entry accounting theory, and a detailed end‑to‑end process that covers data collection, resource building, product development, trading, evaluation, and financialization.

Big Data · Data Asset · Data Governance
0 likes · 14 min read
AntData
Nov 18, 2024 · Databases

Modern Data Paradigms: From Relational Databases to Vector Retrieval and AI

This article surveys the evolution of modern data technologies—from the 4V characteristics of big data and the limitations of traditional relational databases, through the rise of NoSQL and polyglot persistence, to embedding‑driven vector search, hybrid retrieval and RAG, illustrating how each paradigm frees applications from data constraints.

Artificial Intelligence · Big Data · Data Architecture
0 likes · 30 min read
DaTaobao Tech
Nov 15, 2024 · Big Data

Engineering Practices for a Billion‑Scale Image Asset Platform

The article recounts how the author built a billion‑scale AI image‑asset library by replacing a week‑long import with a clustered‑table, sharded pipeline, MD5‑based unique keys, a custom DataWorks task scheduler, and multi‑engine query layers, sharing practical engineering practices learned through successive iterations.

Big Data · Hashing · Image Processing
0 likes · 14 min read
AsiaInfo Technology: New Tech Exploration
Nov 15, 2024 · Artificial Intelligence

How PaaS for AI Optimizes Large‑Model Workloads on Kubernetes

This article analyzes the three core technologies behind PaaS for AI—GPU resource management, node data optimization, and task scheduling—detailing their concepts, component architecture, critical workflows, technical advantages, and future challenges, while illustrating practical configurations with Kubernetes and Volcano examples.

Big Data · Cloud Native · Kubernetes
0 likes · 16 min read
Architecture & Thinking
Nov 15, 2024 · Databases

How Baidu’s TDE‑ClickHouse Delivers Sub‑Second Analytics on Billion‑Row Datasets

This article explains how Baidu’s TDE‑ClickHouse, as a core engine of the Turing 3.0 ecosystem, overcomes platform fragmentation, quality issues, and usability challenges through the OneData+ development paradigm, multi‑level aggregation, projection, query‑caching, bulk‑load ingestion, and a cloud‑native architecture to achieve sub‑second query response for massive data volumes.

Big Data · Cloud Native · Distributed Systems
0 likes · 22 min read
Youzan Coder
Nov 13, 2024 · Big Data

How a Unified Metric Service Transforms Data Queries with Headless BI

Facing inconsistent metrics and low reuse in siloed data services, the team built a unified metric service using a headless BI semantic layer and virtual data models, enabling consistent metric definitions, reusable data models, AI-friendly queries, and faster, scalable reporting across the organization.

Big Data · Headless BI · LLM integration
0 likes · 17 min read
Baidu Geek Talk
Nov 13, 2024 · Industry Insights

Why Cloud‑Native Data Lakes Are the New Standard for Storage Acceleration

This article analyzes the evolution of data‑lake storage acceleration, compares traditional parallel file systems, object‑storage‑based solutions and modern cache‑enabled architectures, and explains how cloud‑native data lakes address scalability, cost, and performance challenges for AI and big‑data workloads.

Big Data · Cloud Native · Data Lake
0 likes · 24 min read
Big Data Technology & Architecture
Nov 12, 2024 · Big Data

Adaptive Query Execution (AQE) in Apache Spark 4.0: A Revolution in Query Optimization

This article explains how Adaptive Query Execution (AQE) in Apache Spark 4.0 dynamically optimizes query plans through features such as join reordering, partition pruning, skew handling and coalescing, delivering significant performance gains, resource efficiency and reduced manual tuning across real‑world big‑data workloads.

Adaptive Query Execution · Apache Spark · Big Data
0 likes · 13 min read
DataFunSummit
Nov 11, 2024 · Big Data

Understanding Spark SQL Parsing Layer and Its Optimizations

This talk, the third in a Spark series, introduces the Spark SQL parsing layer, explains its architecture and integration with ANTLR4, details core implementation classes, and presents a real‑world optimization case that reduces code complexity and improves maintainability.

Antlr4 · Big Data · Scala
0 likes · 15 min read
Architect
Nov 8, 2024 · Backend Development

How Ctrip Scaled Its Travel Product Log System to Billions of Records

This article traces the evolution of Ctrip’s travel product log platform—from a single‑table DB approach to a platform‑wide ES + HBase solution—detailing the challenges of massive data volume, the architectural decisions, RowKey design, write and query flows, and the subsequent extensions that enabled billion‑scale log storage and fast retrieval.

Backend Architecture · Big Data · Ctrip
0 likes · 17 min read
DataFunSummit
Nov 8, 2024 · Big Data

Roundtable Discussion on Data Lake Technology Maturity and Governance Practices

Experts from Kuaishou, former Tencent, Ping An Insurance and others discuss data lake maturity, column‑level governance, resource management of unstructured data, and automated optimization techniques such as Iceberg small‑file merging, highlighting how these advances improve data quality and business decision‑making.

Big Data · Column-level Governance · Data Lake
0 likes · 6 min read
Data Thinking Notes
Nov 7, 2024 · Fundamentals

What’s Inside China’s New National Data Standard System Guide?

The Chinese government has issued a comprehensive ‘National Data Standard System Construction Guide’ that outlines a roadmap to build a unified data standards framework by 2026, detailing design principles, system architecture, core standard categories, and providing a downloadable full guide.

Big Data · Data Governance · Information Management
0 likes · 6 min read
Big Data Technology & Architecture
Nov 7, 2024 · Big Data

Douyin Group's Data Management Strategies: Enhancing Metric Stability and Reusability

This article outlines Douyin Group's approach to handling petabyte‑scale data, addressing metric inconsistencies, and improving data product agility through a four‑layer Volcano Engine platform, systematic indicator production‑management‑consumption cycles, organizational design, automation, and future plans for large‑model‑driven metric splitting.

Analytics · Big Data · Data Management
0 likes · 20 min read
Baidu Geek Talk
Nov 6, 2024 · Cloud Computing

Baidu Canghai Storage Unified Technology Base: Architecture and Evolution of Metadata, Namespace, and Data Layers

Baidu’s Canghai Storage unifies metadata, hierarchical namespace, and data layers into a Meta‑Aware, three‑generation architecture that scales to trillions of metadata items and zettabyte‑scale data, using a distributed transactional KV store, single‑machine‑distributed namespace, and online erasure‑coding micro‑services to deliver high performance, low cost, and seamless scalability.

Big Data · Distributed Systems · NewSQL
0 likes · 18 min read
DataFunTalk
Nov 6, 2024 · Big Data

How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, open table formats like Apache Iceberg, and vector‑enabled lake formats are enhancing feature management, supporting generative AI workloads, and accelerating machine‑learning pipelines.

Apache Iceberg · Big Data · Data Lake
0 likes · 6 min read
ByteDance Data Platform
Nov 6, 2024 · Big Data

How Douyin’s Data Platform Overcomes EB‑Scale Metric Challenges

This article explains how Douyin Group tackles massive data volume, quality, and efficiency issues by building a four‑layer intelligent platform, standardizing metric management, automating metric decomposition, and creating reusable metric services that boost agility, stability, and cross‑team collaboration.

Big Data · Data Platform · Data Quality
0 likes · 20 min read
JD Tech Talk
Nov 6, 2024 · Artificial Intelligence

Understanding Data Science and Its Applications at JD.com

This article explains the fundamentals of data science, outlines its key components and processes, and details how JD.com leverages data science across e‑commerce, finance, healthcare, and logistics to improve efficiency, reduce costs, and enhance user experiences, while also discussing future trends such as quantum computing and digital twins.

Big Data · Quantum Computing
0 likes · 21 min read
Data Thinking Notes
Nov 5, 2024 · Big Data

How a Next‑Gen Data Management Platform Boosts Efficiency and Innovation

This article outlines the motivations, objectives, and architectural design of a next‑generation data management platform, detailing its four‑layer “four‑ization” approach, core services such as data integration, modeling, API provisioning, componentization, as well as governance, security, and operational best practices.

Big Data · Data Governance · Data Integration
0 likes · 20 min read
Baidu Tech Salon
Nov 5, 2024 · Big Data

Accelerating Data Lake Storage for Big Data and AI: Baidu's Solutions

Baidu’s Data Lake Storage Acceleration 2.0 replaces traditional HDFS with a scalable object‑storage foundation, introducing an adaptive hierarchical namespace, high‑throughput streaming engine, RapidFS caching, and fully compatible BOS‑HDFS APIs, thereby delivering up to 70 % higher throughput, lower costs, and seamless migration for big‑data and AI workloads.

BOS-HDFS · Big Data · Data Lake
0 likes · 11 min read
JD Tech Talk
Nov 5, 2024 · Big Data

Low-Code Generation of Flink StreamGraph, JobGraph, and ExecutionGraph

This article explains how to generate Flink's StreamGraph, JobGraph, and ExecutionGraph using a low‑code canvas approach, detailing the underlying concepts, the transformation pipeline from DataStream to DAG, and providing Java code examples for building and assembling operators via drag‑and‑drop.

Big Data · ExecutionGraph · Flink
0 likes · 5 min read
Baidu Geek Talk
Nov 4, 2024 · Big Data

Why Object Storage Is Replacing HDFS for Modern Data Lakes: Baidu’s 2.0 Acceleration

Data lakes have evolved from HDFS to object storage, addressing resource inefficiency, scalability limits, and operational burdens; Baidu’s Data Lake Storage Acceleration 2.0 introduces hierarchical Namespace 2.0, a streaming storage engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer to boost performance and support massive AI workloads.

Baidu · Big Data · Data Lake
0 likes · 12 min read

How Apache SeaTunnel Redefines Data Integration for Modern Data Platforms

This article reviews the evolution of data‑integration architectures toward EtLT, explains the core capabilities of Apache SeaTunnel, and details how a Chinese data‑platform vendor applied and extended SeaTunnel to simplify batch and streaming ingestion, unify multi‑engine processing, and reduce development and operational costs.

Apache SeaTunnel · Big Data · Connector Development
0 likes · 17 min read
Test Development Learning Exchange
Nov 2, 2024 · Big Data

Python Data Parsing and Large‑Scale Data Processing Techniques

This article introduces Python's built‑in modules and popular libraries for parsing CSV, JSON, and XML files, demonstrates advanced data manipulation with pandas, and presents multiple strategies—including chunked reading, Dask, PySpark, HDF5, databases, Vaex, and NumPy memory‑mapping—for efficiently handling very large datasets.

Big Data · CSV · Data Parsing
0 likes · 14 min read
DataFunSummit
Nov 1, 2024 · Big Data

DataFun Summit Session Overview and E‑book Access Instructions

The article outlines how to obtain the DataFun Summit e‑book by following the public account instructions and provides concise English summaries of twelve technical sessions covering data lineage, integration, AI language models, multimodal content, game AI agents, lake‑warehouse governance, big‑data architecture, and cluster management.

Big Data · Data Integration · DataOps
0 likes · 5 min read
Open Source Tech Hub
Oct 31, 2024 · Big Data

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Big Data · KV Store · Protobuf
0 likes · 25 min read
Alibaba Cloud Big Data AI Platform
Oct 31, 2024 · Big Data

How EMR Serverless Spark Powers the Next‑Gen Lakehouse Era

This article traces the evolution of data platforms, explains the rise of lakehouse architecture, and details how Alibaba Cloud's EMR Serverless Spark delivers one‑stop development, high performance, and full ecosystem compatibility, illustrated with real‑world case studies from Midea and Eagle Network.

Big Data · Data Platform · EMR Serverless Spark
0 likes · 16 min read
ByteDance Data Platform
Oct 30, 2024 · Big Data

How Volcano Engine’s DataLeap Platform Transforms Data Service Management

Volcano Engine’s DataLeap platform offers a unified API service solution that transforms raw data into reliable, secure data services, featuring full lifecycle management, monitoring, permission control, rate limiting, and visual API orchestration to simplify complex data workflows and improve operational efficiency across big-data scenarios.

API orchestration · Big Data · Data Service
0 likes · 21 min read
JD Retail Technology
Oct 29, 2024 · Big Data

JD Unified Storage Practice: Cross‑Region and Tiered Storage on HDFS

This article details JD's large‑scale HDFS unified storage implementation, covering cross‑region storage challenges, topology design, asynchronous block replication, flow‑control mechanisms, tiered storage strategies, automatic hot‑cold data migration, and the resulting performance and cost improvements for big‑data workloads.

Big Data · Cross-Region Storage · Data Management
0 likes · 20 min read
Big Data Technology & Architecture
Oct 28, 2024 · Big Data

Key Considerations for Using Paimon Primary Key Tables

This article explains the characteristics of Paimon primary key tables, covering bucket selection, cross‑partition update issues, recommended record‑level expiration settings, and two approaches to handle file compaction, including configuration tweaks and dedicated compaction tasks.

Big Data · Bucket · Flink
0 likes · 6 min read
DataFunSummit
Oct 26, 2024 · Big Data

Kuaishou Metric Middle Platform: Design, Architecture, and Practices

This article presents Kuaishou's metric middle platform, detailing its background, design principles, architecture, metric management, data modeling, unified analysis language OAX, federated query engine OCTO, acceleration strategies, and future directions, illustrating how it improves data quality, development efficiency, and analytical capabilities at scale.

Analytics · Big Data · Data Platform
0 likes · 64 min read
Bilibili Tech
Oct 25, 2024 · Big Data

DataFunSummit2024: Next-Generation Data Architecture Technology Summit

DataFunSummit2024, co-hosted by Bilibili, convenes industry experts, scholars, and enterprise leaders across six forums to discuss next‑generation data architecture, showcasing Bilibili’s Iceberg‑based stream‑batch innovations, AI‑BI analytics, NoETL practices, and emerging alternatives to Lambda architecture.

AI+BI · Big Data · Data Architecture
0 likes · 3 min read
Alibaba Cloud Big Data AI Platform
Oct 25, 2024 · Big Data

How Real-Time Flink Powers Automotive Big Data: Architecture & Case Studies

This article, based on Alibaba Cloud expert Li Lubing’s presentation, examines the rapid growth of China’s new energy vehicle market, outlines typical automotive big‑data architectures, compares Lambda and real‑time lakehouse solutions built with Flink and Apache Paimon, and showcases real‑world customer deployments.

Big Data · Flink · Lakehouse
0 likes · 18 min read
Data Thinking Notes
Oct 24, 2024 · Big Data

Why Data Metrics Matter: Building Effective Indicator Systems

Understanding what data metrics are, how to construct comprehensive indicator systems, and why they are essential for data‑driven decision‑making, operational efficiency, and unified statistical standards is crucial for businesses seeking to leverage big‑data technologies and improve strategic outcomes.

Big Data · Business Analytics · Data Governance
0 likes · 12 min read
DataFunSummit
Oct 24, 2024 · Big Data

Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and future outlooks.

Big Data · Flink · Intelligent Assistant
0 likes · 23 min read
iQIYI Technical Product Team
Oct 24, 2024 · Big Data

iQIYI Multi-AZ Unified Scheduling Architecture for Big Data

iQIYI’s Multi‑AZ unified scheduling architecture combines a unified storage layer (QBFS), an abstracted compute scheduler (QBCS), and a federated metadata service (Waggle Dance) to seamlessly route data and jobs across availability zones, cut storage costs up to 65 %, reduce overall big‑data workload expenses by more than 35 %, and lay the groundwork for future hybrid‑cloud expansion.

Big Data · Compute Scheduling · Unified Storage
0 likes · 15 min read
Baidu Geek Talk
Oct 22, 2024 · Big Data

How Baidu’s DATAPILOT Uses NVIDIA RAPIDS to Supercharge SQL Analytics

Baidu’s DATAPILOT platform combines natural‑language interaction with GPU‑accelerated Spark‑RAPIDS to turn complex, multi‑table SQL queries into seconds‑fast results, boosting ad‑revenue analysis efficiency by up to five‑fold while reducing infrastructure costs.

Apache Spark · Baidu · Big Data
0 likes · 10 min read
Code Ape Tech Column
Oct 21, 2024 · Big Data

Design and Optimization of Querying 100k Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

This article presents a business-driven requirement to extract no more than 100,000 records from a pool of tens of millions, evaluates four technical solutions—including multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON—provides implementation details, performance measurements, and practical recommendations for large‑scale data querying.

Big Data · HBase · RediSearch
0 likes · 11 min read
DaTaobao Tech
Oct 18, 2024 · Artificial Intelligence

Taobao AI Virtual Try-On: Offline Data Processing and Performance Optimization

Taobao’s AI virtual‑try‑on system pre‑computes fitting results offline, writes them into the Item Center via scalable ScheduleX tasks, optimizes pagination, locking and flow‑control, and thereby processes millions of apparel items in under thirty minutes with 99.9% success and reliable checkpoint‑resume monitoring.

Big Data · ai · offline processing
0 likes · 16 min read
Java Architecture Stack
Oct 18, 2024 · Big Data

How to Fix Spark OOM Errors: Practical Memory & Performance Tuning

This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.

Big Data · Memory Tuning · OOM
0 likes · 8 min read
StarRocks
Oct 16, 2024 · Big Data

How to Build a High‑Performance Lakehouse with StarRocks and Apache Hive

This guide walks through the core concepts of Apache Hive, its architecture and key features, then shows how to integrate Hive with StarRocks via the Hive Catalog, construct ODS/DWD/DWS/ADS tables, enable DataCache, use materialized views, and handle automatic partition detection for fast lakehouse analytics.

Apache Hive · Big Data · DataCache
0 likes · 17 min read
Big Data Technology & Architecture
Oct 16, 2024 · Databases

Kuaishou's Lakehouse‑Integrated OLAP Architecture with Apache Doris: Design, Migration, and Optimization

The article describes how Kuaishou transformed its high‑traffic OLAP system from a separated lake‑and‑warehouse architecture using Hive/Hudi and ClickHouse into a unified lakehouse solution powered by Apache Doris, detailing the challenges, design choices, caching and automatic materialization mechanisms, and the resulting performance and governance improvements.

Apache Doris · Big Data · Data Caching
0 likes · 18 min read
Baidu Intelligent Cloud Tech Hub
Oct 14, 2024 · Databases

How Baidu’s New Cloud‑Native Databases Power Enterprise AI in 2024

At the 2024 Baidu Cloud Summit, the speaker detailed recent breakthroughs across Baidu’s cloud‑native database suite—including PegaDB KV, GaiaDB relational, VDB vector, and the integrated DBSC, EDAP, and DBStack platforms—highlighting performance, cost, scalability, and AI‑ready features that address enterprise data challenges.

Big Data · Enterprise Data · ai
0 likes · 11 min read
DataFunSummit
Oct 13, 2024 · Big Data

Enterprise Digital Intelligence Capability Maturity Model (EDMM): Definitions, Framework, and Future Roadmap

This article presents research by the China Academy of Information and Communications Technology (CAICT) on the Enterprise Digital Intelligence Capability Maturity Model (EDMM), detailing the concepts of data, intelligent, and knowledge middle platforms, the model’s four‑layer framework, its development stages, value propositions, long‑term mechanisms, and upcoming work plans.

Artificial Intelligence · Big Data · Data Platform
0 likes · 24 min read
DataFunSummit
Oct 11, 2024 · Artificial Intelligence

Feature Production and Component Modeling in the Intelligent Era: From Feature Generation to Modular Modeling

This article introduces a cloud‑based feature production platform that simplifies feature engineering for recommendation, risk control and machine learning, explains its component‑based modeling framework, and answers common questions about deployment, performance, and customization, highlighting cross‑platform compatibility and optimization techniques.

Artificial Intelligence · Big Data · Feature Store
0 likes · 19 min read
JavaEdge
Oct 7, 2024 · Big Data

Master Data Analysis: From Collection to Visualization

This guide explains why data analysis is essential, breaks it into three core stages—data collection, data mining, and data visualization—offers practical tool recommendations, and presents principles for efficient learning and skill development.

Big Data · Data visualization · Python
0 likes · 10 min read
DataFunSummit
Oct 7, 2024 · Big Data

Guangdong Mobile’s Data‑Weaving Practice: Building a Virtual Data Center for Big Data Governance

Since 2017, Guangdong Mobile has advanced its digital transformation by integrating data‑weaving technology with AI large models to overcome data silos, improve governance, and support a rapidly growing mix of structured and unstructured data across its enterprise, culminating in a virtual data‑center architecture that drives efficient data storage, processing, and business innovation.

AI integration · Big Data · Data weaving
0 likes · 13 min read
DataFunSummit
Oct 4, 2024 · Big Data

JD Retail HDFS Unified Storage: Cross‑Region and Tiered Storage Practices

This article presents JD Retail's large‑scale HDFS deployment, detailing its unified storage architecture, cross‑region data replication challenges and solutions, tiered storage strategies for hot, warm and cold data, and the operational modules that together improve performance, reliability and cost efficiency in a big‑data environment.

Big Data · Cross-Region Storage · Distributed File System
0 likes · 21 min read
DataFunTalk
Oct 3, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture, Design Principles, Core Functions, and Open‑Source Solutions

Amid growing data demands, this article explains the data lake technology maturity curve, detailing lake‑warehouse architectural patterns, design principles, core functionalities, and the four leading open‑source solutions (Hudi, Iceberg, Delta Lake, Paimon) to guide enterprises in building flexible, scalable, and governed data platforms.

Big Data · Data Architecture · Data Lake
0 likes · 10 min read
DataFunSummit
Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi · Big Data · Spark
0 likes · 10 min read
JD Tech
Sep 28, 2024 · Big Data

From Early Coding to Big Data Architecture: A Personal Journey Through Data Platforms, Cloud Migration, and System Design

The article chronicles the author’s 30‑year programming career, detailing early experiences, the evolution from JavaScript projects to large‑scale big‑data architectures, cloud migration, business‑agnostic framework design, interactive analytics, and reflections on becoming an independent software architect.

Big Data · Data Architecture · career journey
0 likes · 24 min read
macrozheng
Sep 27, 2024 · Big Data

Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

Big Data · Data Integration · DataX
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Sep 27, 2024 · Big Data

How Alibaba Cloud’s New Vectorized Engines Are Revolutionizing Real‑Time Big Data Processing

At the 2024 Apsara Conference, Alibaba Cloud unveiled a suite of vectorized big‑data solutions—including the Flash engine for Flink, EMR Serverless Spark with a 300% speed boost, upgraded lakehouse architecture, and real‑world case studies—showcasing massive performance gains, cost reductions, and broader serverless adoption.

Big Data · Data Lake · Flink
0 likes · 8 min read
Data Thinking Notes
Sep 26, 2024 · Big Data

How Data Platforms Are Shifting from Cost Efficiency to Value in the AI Era

The talk reviews the evolution of data technologies from early database storage to today’s generative AI-driven era, highlighting how massive data, multimodal processing, and advanced analytics are transforming data systems from cost‑centered infrastructures to value‑focused ecosystems that empower intelligent agents, open data ecosystems, and new application paradigms.

Big Data · Data Platforms · Data Value
0 likes · 19 min read
DataFunSummit
Sep 26, 2024 · Big Data

Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

Apache Hudi · Big Data · Change Data Capture
0 likes · 9 min read
Big Data Technology & Architecture
Sep 26, 2024 · Big Data

Key Features of Apache Paimon 0.9.0 Release

The Apache Paimon 0.9.0 release introduces production‑ready Branch support, native Iceberg compatibility, a caching catalog for faster OLAP queries, improved Bucketed Append tables with reduced small‑file issues, and full DELETE/UPDATE/MERGE‑INTO capabilities for Append tables, making the system more usable and efficient.

Apache Paimon · Big Data · Branch
0 likes · 5 min read
DataFunSummit
Sep 25, 2024 · Big Data

Evolution of Big Data AI Development Paradigm and Alibaba Cloud’s Integrated Architecture

This article examines how large‑scale big‑data platforms can simplify AI application development, outlines the shift from model‑centric to data‑centric paradigms, and shares Alibaba Cloud’s practical experiences in building an integrated big‑data‑AI architecture, including MaxCompute, Hologres, MaxFrame, and vector search capabilities.

AI integration · Big Data · Data Platform
0 likes · 19 min read
Architect
Sep 24, 2024 · Industry Insights

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Big Data · Distributed Systems · Protobuf
0 likes · 24 min read
DataFunTalk
Sep 24, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the rapid growth of data-driven businesses, the challenges of traditional data warehouses, and how modern data lake technologies such as Delta Lake, Hudi, Iceberg, and Paimon form a maturity curve that guides enterprises in architecture choices, design principles, core capabilities, and practical applications.

Big Data · Data Lake · Delta Lake
0 likes · 12 min read
Data Thinking Notes
Sep 23, 2024 · Big Data

Why Data Asset Rights Matter: Unlocking Value in the Digital Economy

This article examines the definition, policy background, and significance of data asset rights in China, outlining how clear ownership structures can incentivize data production, promote circulation, resolve data silos, and support the rapid growth of the digital economy.

Big Data · Data Asset · Data Governance
0 likes · 15 min read
Wukong Talks Architecture
Sep 23, 2024 · Backend Development

Evolution of the Ctrip Travel Product Log System: Architecture, Challenges, and Solutions

This article describes the development trajectory of Ctrip's travel product log system, detailing its three major phases—from a single‑table DB approach to a platform‑based solution and finally an empowered version—while discussing technical challenges, design decisions, and the implementation of HBase, Elasticsearch, and related components to handle billions of log entries efficiently.

Backend · Big Data · Elasticsearch
0 likes · 15 min read
DataFunSummit
Sep 20, 2024 · Databases

Key Topics and Abstracts from DataFun Summit: Graph DB, Vector DB, Real-Time Data Warehouses, and Cloud‑Native Solutions

The article presents a collection of technical abstracts from the DataFun Summit, covering Xiaohongshu's REDgraph distributed graph database, DingoDB's multimodal vector database, Tencent's Tianqiong autonomous data platform, real‑time data warehouse architectures at Douyin and Ant Group, and Alibaba Cloud's serverless ClickHouse offering, all aimed at advancing large‑scale data processing and analytics.

Big Data · real-time data warehouse
0 likes · 5 min read
DataFunTalk
Sep 20, 2024 · Databases

Technical Paper Summaries on Graph Databases, Vector Databases, and Real-Time Data Warehousing

This article compiles concise English summaries of several technical papers covering Xiaohongshu's REDgraph graph database, DingoDB vector database, Tianqiong autonomous data platform, Douyin's real‑time data warehouse, financial‑grade data warehousing, Alibaba Cloud ClickHouse Serverless offering, best practices in financial data governance, and 58.com user‑profile data warehouse construction.

Big Data · Graph Database · data-warehouse
0 likes · 5 min read
DataFunTalk
Sep 19, 2024 · Databases

Technical Topics Overview from DataFun Summit: Graph Database, Vector Database, Real-time Data Warehouse, and Cloud‑Native Solutions

The article presents a collection of technical overviews—including a graph database for distributed queries, a next‑generation vector database, real‑time data warehouse architectures at Douyin and Ant Group, a cloud‑native ClickHouse service, and best practices for financial data warehousing—while also explaining how to obtain the related e‑book.

Big Data · Cloud Native · Graph Database
0 likes · 4 min read
DataFunSummit
Sep 18, 2024 · Big Data

Data Summit Abstracts: Graph Database, Vector Database, Real-time Data Warehouse, and Cloud‑Native Analytics

The article presents a series of technical abstracts covering Xiaohongshu's distributed graph database, DingoDB's multimodal vector store, Tianqiong's autonomous data‑warehouse innovations, Douyin's storage‑based real‑time warehouse, financial‑grade real‑time warehousing, Alibaba Cloud ClickHouse Serverless, best practices in financial data governance, and 58.com’s user‑profile warehouse construction.

Big Data · real-time data warehouse
0 likes · 5 min read
ByteDance Data Platform
Sep 18, 2024 · Big Data

Apache Calcite for Multi‑Engine Metric Management: Practices & Roadmap

This article explains the technical principles and best practices of multi‑engine metric management based on Apache Calcite, covering common metric management methods, implementation details of unified SQL, virtual columns, and SQL defined functions, and outlines ByteDance’s future roadmap for extending these capabilities.

Apache Calcite · Big Data · SQL Defined Function
0 likes · 16 min read
21CTO
Sep 17, 2024 · Big Data

Why AWS Donated OpenSearch to the Linux Foundation and Its Impact on Search

Amazon Web Services transferred its OpenSearch project—a fork of Elasticsearch and Kibana—to the newly formed OpenSearch Software Foundation under the Linux Foundation, gaining vendor‑neutral governance and support from members like AWS, Uber, Canonical, and Aiven, to foster broader community development of search, analytics, and vector database applications.

Analytics · Big Data · Linux Foundation
0 likes · 4 min read
DataFunTalk
Sep 17, 2024 · Databases

Overview of Recent Advances in Graph, Vector, and Real-Time Data Warehouse Technologies

This article presents a collection of technical abstracts covering graph database parallel query optimization, next‑generation vector databases, real‑time data warehouse architectures, and cloud‑native analytics solutions, while also providing instructions for obtaining the full e‑book via a WeChat public account.

Big Data · Cloud Native · Graph Database
0 likes · 5 min read
DataFunSummit
Sep 16, 2024 · Databases

DataFun Summit: Technical Papers on Graph Databases, Vector Databases, Real‑Time Data Warehouses and Industry Data Practices

The DataFun Summit page presents a collection of technical papers covering graph database parallel queries, next‑generation vector databases, real‑time data warehouse architectures, and best practices in finance and e‑commerce, while also providing instructions for obtaining the e‑book via a public account.

Big Data · Real-time analytics · data-warehouse
0 likes · 5 min read
DataFunSummit
Sep 14, 2024 · Big Data

Apache Hudi Concurrency Control: Overview, MVCC, and OCC

This article provides a comprehensive overview of concurrency control in Apache Hudi, explaining ACID properties, the role of MVCC and OCC, and how Hudi coordinates multiple writers and table services to achieve serializable scheduling while maintaining high performance.

Apache Hudi · Big Data · Concurrency Control
0 likes · 8 min read
Kuaishou Tech
Sep 13, 2024 · Big Data

Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL

Blaze is a Rust‑implemented, DataFusion‑based vectorized execution engine created by Kuaishou to accelerate Spark SQL queries, delivering up to 60% faster computation, 30% average compute‑power gains in production, and extensive architectural innovations such as native engine, protobuf protocol, JNI bridge, and Spark extension, while being open‑source and compatible with Spark 3.0‑3.5.

Big Data · DataFusion · Rust
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Sep 13, 2024 · Big Data

How Qimao Scales 20PB Data with StarRocks, Flink, and Real‑Time Analytics

Qimao, a Shanghai‑based cultural entertainment internet firm, details its 20 PB big‑data architecture built on StarRocks, Flink, Hive, and Redis, covering data ingestion, real‑time processing, audience selection, metric anomaly drill‑down, 730‑day aggregation, and future plans for metric acceleration and full‑link data governance.

Big Data · Data Governance · Flink
0 likes · 13 min read
Data Thinking Notes
Sep 12, 2024 · Information Security

How to Overcome the Top 3 Data Flow Challenges and Secure Your Data Assets

This article outlines the framework for data element circulation, identifies three major security and compliance challenges in data flow, and presents five practical measures plus a six‑step method for incorporating data assets into financial statements to enhance transparency and value.

Big Data · Data Asset · Data Flow
0 likes · 10 min read
Sohu Tech Products
Sep 11, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

Tencent’s real‑time lakehouse combines Spark, Flink, StarRocks and Presto compute layers with Iceberg‑based management and HDFS/COS storage, and its Intelligent Optimize Service—comprising Compaction, Expiration, Cleaning, Clustering, Index and Auto‑Engine modules—automatically reduces merge time, improves query performance, enables secondary indexing, and dynamically routes hot partitions, while future plans target cold/hot separation, materialized view acceleration, and AI‑driven optimizations.

Big Data · Lakehouse · PyIceberg
0 likes · 12 min read
AntTech
Sep 10, 2024 · Big Data

From DATA for AI to AI for DATA: Evolution of Ant Group’s Intelligent Data System

The talk reviews the rapid evolution of data technologies—from early database foundations and big‑data breakthroughs to the rise of generative AI—highlighting how Ant Group’s data platform is shifting from a cost‑efficiency focus to a value‑centric, multimodal, AI‑driven ecosystem.

Artificial Intelligence · Big Data · Data Platforms
0 likes · 17 min read
AntData
Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal, unstructured data, vector search, and machine‑oriented services.

@Data · Artificial Intelligence · Big Data
0 likes · 18 min read
Baidu Geek Talk
Sep 9, 2024 · Big Data

TDS Platform Overview: Architecture, Modules, and Features of Baidu MEG's Turing 3.0 Data Ecosystem

The TDS platform, central to Baidu MEG’s Turing 3.0 ecosystem, unifies data development, warehouse management, monitoring, and resource control through Spark‑based TDE, a visual studio, and AI‑enhanced tools like Smart Diagnosis and Text2SQL, enabling standardized workflows and scalable scheduling while handling over 30,000 daily tasks.

Big Data · Data Development · Data Governance
0 likes · 21 min read
360 Zhihui Cloud Developer
Sep 9, 2024 · Big Data

Why DataFusion is Revolutionizing Big Data Queries with Rust and Arrow

This article introduces DataFusion, a high‑performance, Rust‑based query engine that leverages Apache Arrow’s columnar memory format to enable fast, extensible data processing across multiple storage formats and cloud sources, explains its architecture, execution model, and provides practical Rust code examples for custom extensions.

Apache Arrow · Big Data · DataFusion
0 likes · 16 min read
DataFunSummit
Sep 8, 2024 · Big Data

Building and Optimizing a Cross‑Border E‑Commerce Data Platform: Architecture, Challenges, and Protonbase‑Based Solutions

This article presents Xide International's cross‑border e‑commerce data platform, detailing its multi‑layer business architecture, the scalability and data‑access problems encountered, and how a Protonbase‑driven data‑warehouse and micro‑service redesign dramatically improved query speed, operational efficiency, and cost.

Big Data · Data Platform · Microservices
0 likes · 11 min read
Didi Tech
Sep 5, 2024 · Industry Insights

How Didi Built a Multi‑Protocol, Petabyte‑Scale Storage System for AI Training

Facing petabyte‑level data, billions of small files, and the need for POSIX, S3, and HDFS compatibility, Didi designed a new generation of unstructured‑data storage, OrangeFS, by analyzing internal systems, combining multiple storage solutions, reusing GIFT technology, and implementing a high‑performance metadata service, multi‑protocol fusion, and robust scalability features.

AI storage · Big Data · Cloud Native
0 likes · 27 min read
dbaplus Community
Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data · HDFS · Kyuubi
0 likes · 23 min read
DataFunTalk
Sep 4, 2024 · Artificial Intelligence

Data+AI Data Lake Technologies: Challenges, Apache Iceberg Overview, and Vector Table Implementations with PyIceberg

This article explores the evolution of data lakes for AI, discusses the challenges of AI-era data management, introduces Apache Iceberg and its architecture, demonstrates PyIceberg-based AI training and inference pipelines, and presents vector table designs with LSH indexing and performance optimizations.

Apache Iceberg · Big Data · Data Lake
0 likes · 22 min read
DataFunSummit
Aug 31, 2024 · Big Data

Apache Hudi Clustering: Workflow and Layout Optimization Strategies (Part 6)

This article explains Apache Hudi's clustering service, detailing its workflow, three execution modes, and layout optimization strategies—including linear, Z‑order, and Hilbert space‑filling curves—to improve storage locality and query performance in large‑scale data lake environments.

Apache Hudi · Big Data · Space-filling Curves
0 likes · 8 min read
Data Thinking Notes
Aug 29, 2024 · Big Data

How ICBC Evolved Its Data Intelligence Architecture for Real‑Time Insights

At the 2024 Data Intelligence Conference, ICBC's Big Data and AI Lab detailed the evolution of its data intelligence platform, covering architectural redesign, real‑time data warehouse technology, unified intelligent data tools, and future development directions to boost efficiency and innovation.

Big Data · Data Platform · architecture evolution
0 likes · 3 min read