Tagged articles

3672 articles

Nov 19, 2024 · Big Data

Unlocking Data Value: The Four Stages of Enterprise Data Asset Realization

This article explains how enterprises transform raw data into valuable assets through four development stages, a triple‑entry accounting theory, and a detailed end‑to‑end process that covers data collection, resource building, product development, trading, evaluation, and financialization.

Big Data · Data Asset · Data Governance

14 min read

AntData

Nov 18, 2024 · Databases

Modern Data Paradigms: From Relational Databases to Vector Retrieval and AI

This article surveys the evolution of modern data technologies—from the 4V characteristics of big data and the limitations of traditional relational databases, through the rise of NoSQL and polyglot persistence, to embedding‑driven vector search, hybrid retrieval and RAG, illustrating how each paradigm frees applications from data constraints.

Artificial Intelligence · Big Data · Data Architecture

30 min read

DaTaobao Tech

Nov 15, 2024 · Big Data

Engineering Practices for a Billion‑Scale Image Asset Platform

The article recounts how the author built a billion‑scale AI image‑asset library by replacing a week‑long import with a clustered‑table, sharded pipeline, MD5‑based unique keys, a custom DataWorks task scheduler, and multi‑engine query layers, sharing practical engineering practices learned through successive iterations.

Big Data · Hashing · Image Processing

14 min read

AsiaInfo Technology: New Tech Exploration

Nov 15, 2024 · Artificial Intelligence

How PaaS for AI Optimizes Large‑Model Workloads on Kubernetes

This article analyzes the three core technologies behind PaaS for AI—GPU resource management, node data optimization, and task scheduling—detailing their concepts, component architecture, critical workflows, technical advantages, and future challenges, while illustrating practical configurations with Kubernetes and Volcano examples.

Big Data · Cloud Native · Kubernetes

16 min read

Architecture & Thinking

Nov 15, 2024 · Databases

How Baidu’s TDE‑ClickHouse Delivers Sub‑Second Analytics on Billion‑Row Datasets

This article explains how Baidu’s TDE‑ClickHouse, as a core engine of the Turing 3.0 ecosystem, overcomes platform fragmentation, quality issues, and usability challenges through the OneData+ development paradigm, multi‑level aggregation, projection, query‑caching, bulk‑load ingestion, and a cloud‑native architecture to achieve sub‑second query response for massive data volumes.

Big Data · Cloud Native · Distributed Systems

22 min read

Youzan Coder

Nov 13, 2024 · Big Data

How a Unified Metric Service Transforms Data Queries with Headless BI

Facing inconsistent metrics and low reuse in siloed data services, the team built a unified metric service using a headless BI semantic layer and virtual data models, enabling consistent metric definitions, reusable data models, AI-friendly queries, and faster, scalable reporting across the organization.

Big Data · Headless BI · LLM integration

17 min read

Baidu Geek Talk

Nov 13, 2024 · Industry Insights

Why Cloud‑Native Data Lakes Are the New Standard for Storage Acceleration

This article analyzes the evolution of data‑lake storage acceleration, compares traditional parallel file systems, object‑storage‑based solutions and modern cache‑enabled architectures, and explains how cloud‑native data lakes address scalability, cost, and performance challenges for AI and big‑data workloads.

Big Data · Cloud Native · Data Lake

24 min read

Big Data Technology & Architecture

Nov 12, 2024 · Big Data

Adaptive Query Execution (AQE) in Apache Spark 4.0: A Revolution in Query Optimization

This article explains how Adaptive Query Execution (AQE) in Apache Spark 4.0 dynamically optimizes query plans through features such as join reordering, partition pruning, skew handling and coalescing, delivering significant performance gains, resource efficiency and reduced manual tuning across real‑world big‑data workloads.

Adaptive Query Execution · Apache Spark · Big Data

13 min read

Baidu Intelligent Cloud Tech Hub

Nov 12, 2024 · Big Data

Why Data Lake Storage Acceleration Is the New Standard in Cloud‑Native AI

The article examines the evolution of data lake storage acceleration, compares various solutions, and explains how metadata, read/write, and end‑to‑end optimizations enable scalable, cost‑effective AI and big‑data workloads in cloud‑native environments.

AI training · Big Data · Data Lake

24 min read

DataFunSummit

Nov 11, 2024 · Big Data

Understanding Spark SQL Parsing Layer and Its Optimizations

This talk, the third in a Spark series, introduces the Spark SQL parsing layer, explains its architecture and integration with ANTLR4, details core implementation classes, and presents a real‑world optimization case that reduces code complexity and improves maintainability.

Antlr4 · Big Data · Scala

15 min read

Architect

Nov 8, 2024 · Backend Development

How Ctrip Scaled Its Travel Product Log System to Billions of Records

This article traces the evolution of Ctrip’s travel product log platform—from a single‑table DB approach to a platform‑wide ES + HBase solution—detailing the challenges of massive data volume, the architectural decisions, RowKey design, write and query flows, and the subsequent extensions that enabled billion‑scale log storage and fast retrieval.

Backend Architecture · Big Data · Ctrip

17 min read

DataFunSummit

Nov 8, 2024 · Big Data

Roundtable Discussion on Data Lake Technology Maturity and Governance Practices

Experts from Kuaishou, former Tencent, Ping An Insurance and others discuss data lake maturity, column‑level governance, resource management of unstructured data, and automated optimization techniques such as Iceberg small‑file merging, highlighting how these advances improve data quality and business decision‑making.

Big Data · Column-level Governance · Data Lake

6 min read

Data Thinking Notes

Nov 7, 2024 · Fundamentals

What’s Inside China’s New National Data Standard System Guide?

The Chinese government has issued a comprehensive ‘National Data Standard System Construction Guide’ that lays out a roadmap for building a unified data standards framework by 2026. The article covers the guide's design principles, system architecture, and core standard categories, and offers the full guide for download.

Big Data · Data Governance · Information Management

6 min read

Big Data Technology & Architecture

Nov 7, 2024 · Big Data

Douyin Group's Data Management Strategies: Enhancing Metric Stability and Reusability

This article outlines Douyin Group's approach to handling petabyte‑scale data, addressing metric inconsistencies, and improving data product agility through a four‑layer Volcano Engine platform, systematic indicator production‑management‑consumption cycles, organizational design, automation, and future plans for large‑model‑driven metric splitting.

Analytics · Big Data · Data Management

20 min read

Baidu Geek Talk

Nov 6, 2024 · Cloud Computing

Baidu Canghai Storage Unified Technology Base: Architecture and Evolution of Metadata, Namespace, and Data Layers

Baidu’s Canghai Storage unifies metadata, hierarchical namespace, and data layers into a Meta‑Aware, three‑generation architecture that scales to trillions of metadata items and zettabyte‑scale data, using a distributed transactional KV store, single‑machine‑distributed namespace, and online erasure‑coding micro‑services to deliver high performance, low cost, and seamless scalability.

Big Data · Distributed Systems · NewSQL

18 min read

DataFunTalk

Nov 6, 2024 · Big Data

How Data Lakes Empower AI: Insights from Industry Experts

In a panel discussion, experts from Kuaishou, Ping An, and Datastrato explain how data lake architectures, open table formats like Apache Iceberg, and vector‑enabled lake formats are enhancing feature management, supporting generative AI workloads, and accelerating machine‑learning pipelines.

Apache Iceberg · Big Data · Data Lake

6 min read

ByteDance Data Platform

Nov 6, 2024 · Big Data

How Douyin’s Data Platform Overcomes EB‑Scale Metric Challenges

This article explains how Douyin Group tackles massive data volume, quality, and efficiency issues by building a four‑layer intelligent platform, standardizing metric management, automating metric decomposition, and creating reusable metric services that boost agility, stability, and cross‑team collaboration.

Big Data · Data Platform · Data Quality

20 min read

JD Tech Talk

Nov 6, 2024 · Artificial Intelligence

Understanding Data Science and Its Applications at JD.com

This article explains the fundamentals of data science, outlines its key components and processes, and details how JD.com leverages data science across e‑commerce, finance, healthcare, and logistics to improve efficiency, reduce costs, and enhance user experiences, while also discussing future trends such as quantum computing and digital twins.

Big Data · Quantum Computing

21 min read

Data Thinking Notes

Nov 5, 2024 · Big Data

How a Next‑Gen Data Management Platform Boosts Efficiency and Innovation

This article outlines the motivations, objectives, and architectural design of a next‑generation data management platform, detailing its four‑layer “four‑ization” approach; its core services, such as data integration, modeling, API provisioning, and componentization; and its governance, security, and operational best practices.

Big Data · Data Governance · Data Integration

20 min read

Baidu Tech Salon

Nov 5, 2024 · Big Data

Accelerating Data Lake Storage for Big Data and AI: Baidu's Solutions

Baidu’s Data Lake Storage Acceleration 2.0 replaces traditional HDFS with a scalable object‑storage foundation, introducing an adaptive hierarchical namespace, a high‑throughput streaming engine, RapidFS caching, and fully compatible BOS‑HDFS APIs, thereby delivering up to 70% higher throughput, lower costs, and seamless migration for big‑data and AI workloads.

BOS-HDFS · Big Data · Data Lake

11 min read

JD Tech Talk

Nov 5, 2024 · Big Data

Low-Code Generation of Flink StreamGraph, JobGraph, and ExecutionGraph

This article explains how to generate Flink's StreamGraph, JobGraph, and ExecutionGraph using a low‑code canvas approach, detailing the underlying concepts, the transformation pipeline from DataStream to DAG, and providing Java code examples for building and assembling operators via drag‑and‑drop.

Big Data · ExecutionGraph · Flink

5 min read

Baidu Geek Talk

Nov 4, 2024 · Big Data

Why Object Storage Is Replacing HDFS for Modern Data Lakes: Baidu’s 2.0 Acceleration

Data lakes have evolved from HDFS to object storage, addressing resource inefficiency, scalability limits, and operational burdens; Baidu’s Data Lake Storage Acceleration 2.0 introduces hierarchical Namespace 2.0, a streaming storage engine, RapidFS caching, and a fully HDFS‑compatible BOS‑HDFS layer to boost performance and support massive AI workloads.

Baidu · Big Data · Data Lake

12 min read

AsiaInfo Technology: New Tech Exploration

Nov 4, 2024 · Big Data

How Apache SeaTunnel Redefines Data Integration for Modern Data Platforms

This article reviews the evolution of data‑integration architectures toward EtLT, explains the core capabilities of Apache SeaTunnel, and details how a Chinese data‑platform vendor applied and extended SeaTunnel to simplify batch and streaming ingestion, unify multi‑engine processing, and reduce development and operational costs.

Apache SeaTunnel · Big Data · Connector Development

17 min read

Test Development Learning Exchange

Nov 2, 2024 · Big Data

Python Data Parsing and Large‑Scale Data Processing Techniques

This article introduces Python's built‑in modules and popular libraries for parsing CSV, JSON, and XML files, demonstrates advanced data manipulation with pandas, and presents multiple strategies—including chunked reading, Dask, PySpark, HDF5, databases, Vaex, and NumPy memory‑mapping—for efficiently handling very large datasets.

Big Data · CSV · Data Parsing

14 min read

DataFunSummit

Nov 1, 2024 · Big Data

DataFun Summit Session Overview and E‑book Access Instructions

The article outlines how to obtain the DataFun Summit e‑book by following the public account instructions and provides concise English summaries of twelve technical sessions covering data lineage, integration, AI language models, multimodal content, game AI agents, lake‑warehouse governance, big‑data architecture, and cluster management.

Big Data · Data Integration · DataOps

5 min read

Open Source Tech Hub

Oct 31, 2024 · Big Data

How Bilibili Scaled Its Search Index with Distributed KV Storage and Spark

Bilibili transformed its search indexing pipeline by replacing a manual, low‑throughput process with a distributed KV store (Taishan) and Spark‑based construction, achieving unified data ingestion, reduced resource consumption, faster full‑ and incremental builds, and a shift from daily to hourly indexing cycles.

Big Data · KV Store · Protobuf

25 min read

Big Data Technology & Architecture

Oct 31, 2024 · Big Data

Understanding Paimon's Changelog Producer: Four Modes and Their Trade‑offs

The article explains Paimon's changelog‑producer capability, detailing its purpose, storage format, and the four generation modes—None, Input, Lookup, and Full Compaction—while comparing their costs, implementation details, and suitability for different data sources such as CDC.

Big Data · Delta · Lookup

16 min read

Alibaba Cloud Big Data AI Platform

Oct 31, 2024 · Big Data

How EMR Serverless Spark Powers the Next‑Gen Lakehouse Era

This article traces the evolution of data platforms, explains the rise of lakehouse architecture, and details how Alibaba Cloud's EMR Serverless Spark delivers one‑stop development, high performance, and full ecosystem compatibility, illustrated with real‑world case studies from Midea and Eagle Network.

Big Data · Data Platform · EMR Serverless Spark

16 min read

ByteDance Data Platform

Oct 30, 2024 · Big Data

How Volcano Engine’s DataLeap Platform Transforms Data Service Management

Volcano Engine’s DataLeap platform offers a unified API service solution that transforms raw data into reliable, secure data services, featuring full lifecycle management, monitoring, permission control, rate limiting, and visual API orchestration to simplify complex data workflows and improve operational efficiency across big-data scenarios.

API orchestration · Big Data · Data Service

21 min read

JD Retail Technology

Oct 29, 2024 · Big Data

JD Unified Storage Practice: Cross‑Region and Tiered Storage on HDFS

This article details JD's large‑scale HDFS unified storage implementation, covering cross‑region storage challenges, topology design, asynchronous block replication, flow‑control mechanisms, tiered storage strategies, automatic hot‑cold data migration, and the resulting performance and cost improvements for big‑data workloads.

Big Data · Cross-Region Storage · Data Management

20 min read

Big Data Technology & Architecture

Oct 28, 2024 · Big Data

Key Considerations for Using Paimon Primary Key Tables

This article explains the characteristics of Paimon primary key tables, covering bucket selection, cross‑partition update issues, recommended record‑level expiration settings, and two approaches to handle file compaction, including configuration tweaks and dedicated compaction tasks.

Big Data · Bucket · Flink

6 min read

DataFunSummit

Oct 26, 2024 · Big Data

Kuaishou Metric Middle Platform: Design, Architecture, and Practices

This article presents Kuaishou's metric middle platform, detailing its background, design principles, architecture, metric management, data modeling, unified analysis language OAX, federated query engine OCTO, acceleration strategies, and future directions, illustrating how it improves data quality, development efficiency, and analytical capabilities at scale.

Analytics · Big Data · Data Platform

64 min read

Bilibili Tech

Oct 25, 2024 · Big Data

DataFunSummit2024: Next-Generation Data Architecture Technology Summit

DataFunSummit2024, co-hosted by Bilibili, convenes industry experts, scholars, and enterprise leaders across six forums to discuss next‑generation data architecture, showcasing Bilibili’s Iceberg‑based stream‑batch innovations, AI‑BI analytics, NoETL practices, and emerging alternatives to Lambda architecture.

AI+BI · Big Data · Data Architecture

3 min read

Alibaba Cloud Big Data AI Platform

Oct 25, 2024 · Big Data

How Real-Time Flink Powers Automotive Big Data: Architecture & Case Studies

This article, based on Alibaba Cloud expert Li Lubing’s presentation, examines the rapid growth of China’s new energy vehicle market, outlines typical automotive big‑data architectures, compares Lambda and real‑time lakehouse solutions built with Flink and Apache Paimon, and showcases real‑world customer deployments.

Big Data · Flink · Lakehouse

18 min read

Data Thinking Notes

Oct 24, 2024 · Big Data

Why Data Metrics Matter: Building Effective Indicator Systems

This article explains what data metrics are, how to construct a comprehensive indicator system, and why both are essential for data‑driven decision‑making, operational efficiency, and unified statistical standards in businesses seeking to leverage big‑data technologies and improve strategic outcomes.

Big Data · Business Analytics · Data Governance

12 min read

DataFunSummit

Oct 24, 2024 · Big Data

Bilibili’s Large Language Model‑Based Intelligent Assistant for the Big Data Platform: Architecture, Principles, and Deployment

This article details Bilibili’s implementation of a large‑language‑model‑driven intelligent assistant for its massive big‑data platform, covering background, problem analysis, architectural design, knowledge‑base construction, precision and recall challenges, deployment across offline and real‑time Spark/Flink diagnostics, and future outlooks.

Big Data · Flink · Intelligent Assistant

23 min read

iQIYI Technical Product Team

Oct 24, 2024 · Big Data

iQIYI Multi-AZ Unified Scheduling Architecture for Big Data

iQIYI’s Multi‑AZ unified scheduling architecture combines a unified storage layer (QBFS), an abstracted compute scheduler (QBCS), and a federated metadata service (Waggle Dance) to seamlessly route data and jobs across availability zones, cut storage costs by up to 65%, reduce overall big‑data workload expenses by more than 35%, and lay the groundwork for future hybrid‑cloud expansion.

Big Data · Compute Scheduling · Unified Storage

15 min read

Baidu Geek Talk

Oct 22, 2024 · Big Data

How Baidu’s DATAPILOT Uses NVIDIA RAPIDS to Supercharge SQL Analytics

Baidu’s DATAPILOT platform combines natural‑language interaction with GPU‑accelerated Spark‑RAPIDS to answer complex, multi‑table SQL queries in seconds, boosting ad‑revenue analysis efficiency by up to five‑fold while reducing infrastructure costs.

Apache Spark · Baidu · Big Data

10 min read

DataFunSummit

Oct 22, 2024 · Big Data

From Self‑Built BI to Volcano Engine: Challenges, Selection, Operations, and Future Outlook

The article recounts Firefly Thinking's early BI system limitations, the decision‑making process that led to adopting Volcano Engine, subsequent operational strategies to unlock tool potential, and a forward‑looking vision of data analysis in the large‑model era.

Analytics · BI · Big Data

18 min read

Code Ape Tech Column

Oct 21, 2024 · Big Data

Design and Optimization of Querying 100k Records from Tens of Millions Using ClickHouse, Elasticsearch, HBase, and RediSearch

This article starts from a business requirement to extract no more than 100,000 records from a pool of tens of millions, evaluates four technical solutions (multithreaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑HBase hybrid, and RediSearch + RedisJSON), and provides implementation details, performance measurements, and practical recommendations for large‑scale data querying.

Big Data · HBase · RediSearch

11 min read

Efficient Ops

Oct 20, 2024 · Operations

Key Takeaways from the 24th GOPS Global Operations Conference – Shanghai

The article recaps the two‑day 24th GOPS Global Operations Conference in Shanghai, highlighting opening remarks, major speaker sessions on DevOps, BizDevOps, AIOps, large‑model applications, industry case studies, and provides links to presentation materials.

Big Data · DevOps · Operations

10 min read

DataFunSummit

Oct 19, 2024 · Big Data

Data Quality Governance in the Financial Industry: Challenges, Frameworks, and Practical Implementation

This article examines how data quality governance is applied in the financial sector, covering regulatory background, key challenges, management system design, practical methodologies, and evaluation standards to improve data assets and support digital transformation.

Big Data · Financial Industry

18 min read

DaTaobao Tech

Oct 18, 2024 · Artificial Intelligence

Taobao AI Virtual Try-On: Offline Data Processing and Performance Optimization

Taobao’s AI virtual‑try‑on system pre‑computes fitting results offline, writes them into the Item Center via scalable ScheduleX tasks, optimizes pagination, locking and flow‑control, and thereby processes millions of apparel items in under thirty minutes with 99.9% success and reliable checkpoint‑resume monitoring.

Big Data · ai · offline processing

16 min read

Java Architecture Stack

Oct 18, 2024 · Big Data

How to Fix Spark OOM Errors: Practical Memory & Performance Tuning

This guide analyzes common Spark Out‑Of‑Memory scenarios—such as massive data volumes, data skew, and improper resource allocation—and provides step‑by‑step configurations, memory‑management tweaks, partitioning strategies, and shuffle optimizations to prevent OOM failures in production.

Big Data · Memory Tuning · OOM

8 min read

StarRocks

Oct 16, 2024 · Big Data

How to Build a High‑Performance Lakehouse with StarRocks and Apache Hive

This guide walks through the core concepts of Apache Hive, its architecture and key features, then shows how to integrate Hive with StarRocks via the Hive Catalog, construct ODS/DWD/DWS/ADS tables, enable DataCache, use materialized views, and handle automatic partition detection for fast lakehouse analytics.

Apache Hive · Big Data · DataCache

17 min read

Big Data Technology & Architecture

Oct 16, 2024 · Databases

Kuaishou's Lakehouse‑Integrated OLAP Architecture with Apache Doris: Design, Migration, and Optimization

The article describes how Kuaishou transformed its high‑traffic OLAP system from a separated lake‑and‑warehouse architecture using Hive/Hudi and ClickHouse into a unified lakehouse solution powered by Apache Doris, detailing the challenges, design choices, caching and automatic materialization mechanisms, and the resulting performance and governance improvements.

Apache Doris · Big Data · Data Caching

18 min read

Baidu Intelligent Cloud Tech Hub

Oct 14, 2024 · Databases

How Baidu’s New Cloud‑Native Databases Power Enterprise AI in 2024

At the 2024 Baidu Cloud Summit, the speaker detailed recent breakthroughs across Baidu’s cloud‑native database suite—including PegaDB KV, GaiaDB relational, VDB vector, and the integrated DBSC, EDAP, and DBStack platforms—highlighting performance, cost, scalability, and AI‑ready features that address enterprise data challenges.

Big Data · Enterprise Data · ai

11 min read

DataFunSummit

Oct 13, 2024 · Big Data

Enterprise Digital Intelligence Capability Maturity Model (EDMM): Definitions, Framework, and Future Roadmap

This article presents the China Information and Communications Research Institute’s research on the Enterprise Digital Intelligence Capability Maturity Model (EDMM), detailing the concepts of data, intelligent, and knowledge middle platforms, the model’s four‑layer framework, its development stages, value propositions, long‑term mechanisms, and upcoming work plans.

Artificial Intelligence · Big Data · Data Platform

24 min read

Big Data Technology & Architecture

Oct 12, 2024 · Big Data

Introduction to Apache Paimon: Architecture, Unified Storage, and Core Concepts

This article introduces Apache Paimon, an open‑source table format that supports batch and streaming reads and writes, explains its architecture, unified storage model, and core concepts such as file layout, snapshots, manifests, data files, partitions, and consistency guarantees.

Apache Paimon · Big Data · OLAP

6 min read

DataFunSummit

Oct 11, 2024 · Artificial Intelligence

Feature Production and Component Modeling in the Intelligent Era: From Feature Generation to Modular Modeling

This article introduces a cloud‑based feature production platform that simplifies feature engineering for recommendation, risk control and machine learning, explains its component‑based modeling framework, and answers common questions about deployment, performance, and customization, highlighting cross‑platform compatibility and optimization techniques.

Artificial Intelligence · Big Data · Feature Store

19 min read

JD Retail Technology

Oct 11, 2024 · Big Data

JD Retail Data Lake Architecture: Challenges, Optimizations, and Future Plans

This article presents JD Retail's data lake architecture overhaul, detailing the shortcomings of the Lambda model, the migration to Flink‑Hudi‑Spark pipelines, performance gains, storage savings, unified APIs, and upcoming improvements for resilience and automation.

Big Data · Data Lake · Flink

11 min read

DataFunSummit

Oct 8, 2024 · Big Data

Understanding Spark SQL Analyzer: Principles, Optimization Cases, and Rule‑Pruning in Spark 3.2+

This article explains the Spark SQL analysis layer, its core principles, how analysis rules such as ResolveRelations work, and the major pruning optimization introduced in Spark 3.2 that reduces unnecessary rule traversal, illustrated with concrete code examples and Q&A.

Big Data · Spark · Tree Pruning

20 min read

JavaEdge

Oct 7, 2024 · Big Data

Master Data Analysis: From Collection to Visualization

This guide explains why data analysis is essential, breaks it into three core stages—data collection, data mining, and data visualization—offers practical tool recommendations, and presents principles for efficient learning and skill development.

Big Data · Data visualization · Python

10 min read

DataFunSummit

Oct 7, 2024 · Big Data

Guangdong Mobile’s Data‑Weaving Practice: Building a Virtual Data Center for Big Data Governance

Since 2017, Guangdong Mobile has advanced its digital transformation by integrating data‑weaving technology with AI large models to overcome data silos, improve governance, and support a rapidly growing mix of structured and unstructured data across its enterprise, culminating in a virtual data‑center architecture that drives efficient data storage, processing, and business innovation.

AI integration · Big Data · Data weaving

13 min read

Python Programming Learning Circle

Oct 6, 2024 · Big Data

Weibo Hot Search Data Crawling, Analysis, and Visualization Project

This article presents a Python‑based project that continuously crawls Weibo hot‑search data, stores it with timestamps, and visualizes trends through dynamic bar, line, and word‑cloud charts using libraries such as BeautifulSoup, pandas, schedule, pyecharts, and jieba.

Big Data · Pyecharts · Web Scraping

10 min read

DataFunSummit

Oct 4, 2024 · Big Data

JD Retail HDFS Unified Storage: Cross‑Region and Tiered Storage Practices

This article presents JD Retail's large‑scale HDFS deployment, detailing its unified storage architecture, cross‑region data replication challenges and solutions, tiered storage strategies for hot, warm and cold data, and the operational modules that together improve performance, reliability and cost efficiency in a big‑data environment.

Big Data, Cross-Region Storage, Distributed File System

0 likes · 21 min read

DataFunTalk

Oct 3, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture, Design Principles, Core Functions, and Open‑Source Solutions

Amid growing data demands, this article explains the data lake technology maturity curve, detailing lake‑warehouse architectural patterns, design principles, core functionalities, and the four leading open‑source solutions (Hudi, Iceberg, Delta Lake, Paimon) to guide enterprises in building flexible, scalable, and governed data platforms.

Big Data, Data Architecture, Data Lake

0 likes · 10 min read

DataFunSummit

Oct 1, 2024 · Big Data

Apache Hudi from Zero to One: Highlighting Key Features of Version 1.0 (Part 10)

The article explains Apache Hudi’s three‑layer architecture and details four major 1.0 enhancements (LSM‑tree timeline, non‑blocking concurrency control, file‑group reader/writer APIs, and functional indexes), then closes with a brief review and links to the Hudi 1.x RFC.

Apache Hudi, Big Data, Concurrency Control

0 likes · 9 min read

DataFunSummit

Sep 30, 2024 · Big Data

Apache Hudi from Zero to One: The Swiss Army Knife for Data Ingestion – Hudi Streamer (Part 9)

This article introduces Apache Hudi Streamer, a versatile Spark‑based data ingestion tool likened to a Swiss Army knife, detailing its core options—including table configuration, continuous mode, source classes, transformers, table services, catalog synchronization, and advanced features—while guiding users on practical pipeline setup.

Apache Hudi, Big Data, Spark

0 likes · 10 min read

JD Tech

Sep 28, 2024 · Big Data

From Early Coding to Big Data Architecture: A Personal Journey Through Data Platforms, Cloud Migration, and System Design

The article chronicles the author’s 30‑year programming career, detailing early experiences, the evolution from JavaScript projects to large‑scale big‑data architectures, cloud migration, business‑agnostic framework design, interactive analytics, and reflections on becoming an independent software architect.

Big Data, Data Architecture, career journey

0 likes · 24 min read

DataFunSummit

Sep 27, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the data lake technology maturity curve, covering lake‑warehouse architecture patterns, design principles, core capabilities of major open‑source lake engines (Hudi, Iceberg, Delta Lake, Paimon), and practical application scenarios for modern data‑driven enterprises.

Big Data, Data Lake, Delta Lake

0 likes · 10 min read

macrozheng

Sep 27, 2024 · Big Data

Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

Big Data, Data Integration, DataX

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Sep 27, 2024 · Big Data

How Alibaba Cloud’s New Vectorized Engines Are Revolutionizing Real‑Time Big Data Processing

At the 2024 Apsara Conference (Yunqi Conference), Alibaba Cloud unveiled a suite of vectorized big‑data solutions, including the Flash engine for Flink, EMR Serverless Spark with a 300% speed boost, and an upgraded lakehouse architecture, along with real‑world case studies showcasing massive performance gains, cost reductions, and broader serverless adoption.

Big Data, Data Lake, Flink

0 likes · 8 min read

Data Thinking Notes

Sep 26, 2024 · Big Data

How Data Platforms Are Shifting from Cost Efficiency to Value in the AI Era

The talk reviews the evolution of data technologies from early database storage to today’s generative AI-driven era, highlighting how massive data, multimodal processing, and advanced analytics are transforming data systems from cost‑centered infrastructures to value‑focused ecosystems that empower intelligent agents, open data ecosystems, and new application paradigms.

Big Data, Data Platforms, Data Value

0 likes · 19 min read

DataFunSummit

Sep 26, 2024 · Big Data

Apache Hudi Incremental Processing and Change Data Capture (CDC): Overview, Incremental Query, and CDC

This article explains Apache Hudi's incremental processing capabilities, covering an overview of the medallion architecture, detailed configuration for incremental queries, the introduction of Change Data Capture (CDC) with required table properties, and a review of how these features enable richer data insights in modern data lake environments.

Apache Hudi, Big Data, Change Data Capture

0 likes · 9 min read

Big Data Technology & Architecture

Sep 26, 2024 · Big Data

Key Features of Apache Paimon 0.9.0 Release

The Apache Paimon 0.9.0 release introduces production‑ready Branch support, native Iceberg compatibility, a caching catalog for faster OLAP queries, improved Bucketed Append tables with reduced small‑file issues, and full DELETE/UPDATE/MERGE‑INTO capabilities for Append tables, making the system more usable and efficient.

Apache Paimon, Big Data, Branch

0 likes · 5 min read

DataFunSummit

Sep 25, 2024 · Big Data

Evolution of Big Data AI Development Paradigm and Alibaba Cloud’s Integrated Architecture

This article examines how large‑scale big‑data platforms can simplify AI application development, outlines the shift from model‑centric to data‑centric paradigms, and shares Alibaba Cloud’s practical experiences in building an integrated big‑data‑AI architecture, including MaxCompute, Hologres, MaxFrame, and vector search capabilities.

AI integration, Big Data, Data Platform

0 likes · 19 min read

Architect

Sep 24, 2024 · Industry Insights

How Bilibili Re‑engineered Its Search Indexing Pipeline for Hour‑Level Turnaround

This article details Bilibili's transformation of its search offline indexing architecture—from a manual, low‑throughput MySQL‑centric process to a distributed, KV‑based, protobuf‑driven pipeline that leverages Taishan storage and Spark, cutting build cycles from days to hours while solving performance, consistency, and maintenance challenges.

Big Data, Distributed Systems, Protobuf

0 likes · 24 min read

DataFunTalk

Sep 24, 2024 · Big Data

Data Lake Technology Maturity Curve: Architecture Modes, Design Principles, Core Functions, and Applications

This article explains the rapid growth of data-driven businesses, the challenges of traditional data warehouses, and how modern data lake technologies such as Delta Lake, Hudi, Iceberg, and Paimon form a maturity curve that guides enterprises in architecture choices, design principles, core capabilities, and practical applications.

Big Data, Data Lake, Delta Lake

0 likes · 12 min read

Data Thinking Notes

Sep 23, 2024 · Big Data

Why Data Asset Rights Matter: Unlocking Value in the Digital Economy

This article examines the definition, policy background, and significance of data asset rights in China, outlining how clear ownership structures can incentivize data production, promote circulation, resolve data silos, and support the rapid growth of the digital economy.

Big Data, Data Asset, Data Governance

0 likes · 15 min read

Wukong Talks Architecture

Sep 23, 2024 · Backend Development

Evolution of the Ctrip Travel Product Log System: Architecture, Challenges, and Solutions

This article describes the development trajectory of Ctrip's travel product log system, detailing its three major phases—from a single‑table DB approach to a platform‑based solution and finally an empowered version—while discussing technical challenges, design decisions, and the implementation of HBase, Elasticsearch, and related components to handle billions of log entries efficiently.

Backend, Big Data, Elasticsearch

0 likes · 15 min read

DataFunSummit

Sep 20, 2024 · Databases

Key Topics and Abstracts from DataFun Summit: Graph DB, Vector DB, Real-Time Data Warehouses, and Cloud‑Native Solutions

The article presents a collection of technical abstracts from the DataFun Summit, covering XiaoHongShu's REDgraph distributed graph database, DingoDB's multimodal vector database, Tencent's Tianqiong autonomous data platform, real‑time data warehouse architectures at Douyin and Ant Group, and Alibaba Cloud's serverless ClickHouse offering, all aimed at advancing large‑scale data processing and analytics.

Big Data, real-time data warehouse

0 likes · 5 min read

DataFunTalk

Sep 20, 2024 · Databases

Technical Paper Summaries on Graph Databases, Vector Databases, and Real-Time Data Warehousing

This article compiles concise English summaries of several technical papers covering Xiaohongshu's REDgraph graph database, DingoDB vector database, Tianqiong autonomous data platform, Douyin's real‑time data warehouse, financial‑grade data warehousing, Alibaba Cloud ClickHouse Serverless offering, best practices in financial data governance, and 58.com user‑profile data warehouse construction.

Big Data, Graph Database, data-warehouse

0 likes · 5 min read

DataFunTalk

Sep 19, 2024 · Databases

Technical Topics Overview from DataFun Summit: Graph Database, Vector Database, Real-time Data Warehouse, and Cloud‑Native Solutions

The article presents a collection of technical overviews—including a graph database for distributed queries, a next‑generation vector database, real‑time data warehouse architectures at Douyin and Ant Group, a cloud‑native ClickHouse service, and best practices for financial data warehousing—while also explaining how to obtain the related e‑book.

Big Data, Cloud Native, Graph Database

0 likes · 4 min read

StarRocks

Sep 19, 2024 · Big Data

How Ele.me Built a Real‑Time Lakehouse: From 1.0 to 3.0 with Flink, Paimon & StarRocks

This article details Ele.me's journey in evolving its real‑time data warehouse, covering the original 1.0 architecture, the 2.0 lakehouse redesign with Paimon and StarRocks, performance evaluations of lake formats and query engines, and the roadmap toward a 3.0 streaming lakehouse solution.

Big Data, Flink, Lakehouse

0 likes · 16 min read

Data Thinking Notes

Sep 18, 2024 · Big Data

How to Build a Robust Banking Data Indicator System for Better Decision‑Making

The article explains why banks need a scientific data indicator system, outlines steps to define business goals, construct comprehensive metrics, collect and process data, establish standards, build an analysis platform, and continuously refine the system to support data‑driven decisions.

Banking, Big Data, Data Analytics

0 likes · 6 min read

DataFunSummit

Sep 18, 2024 · Big Data

Data Summit Abstracts: Graph Database, Vector Database, Real-time Data Warehouse, and Cloud‑Native Analytics

The article presents a series of technical abstracts covering Xiaohongshu's distributed graph database, DingoDB's multimodal vector store, Tianqiong's autonomous data‑warehouse innovations, Douyin's storage‑based real‑time warehouse, financial‑grade real‑time warehousing, Alibaba Cloud ClickHouse Serverless, best practices in financial data governance, and 58.com’s user‑profile warehouse construction.

Big Data, real-time data warehouse

0 likes · 5 min read

ByteDance Data Platform

Sep 18, 2024 · Big Data

Apache Calcite for Multi‑Engine Metric Management: Practices & Roadmap

This article explains the technical principles and best practices of multi‑engine metric management based on Apache Calcite, covering common metric management methods, implementation details of unified SQL, virtual columns, and SQL defined functions, and outlines ByteDance’s future roadmap for extending these capabilities.

Apache Calcite, Big Data, SQL Defined Function

0 likes · 16 min read

21CTO

Sep 17, 2024 · Big Data

Why AWS Donated OpenSearch to the Linux Foundation and Its Impact on Search

Amazon Web Services transferred its OpenSearch project—a fork of Elasticsearch and Kibana—to the newly formed OpenSearch Software Foundation under the Linux Foundation, gaining vendor‑neutral governance and support from members like AWS, Uber, Canonical, and Aiven, to foster broader community development of search, analytics, and vector database applications.

Analytics, Big Data, Linux Foundation

0 likes · 4 min read

DataFunTalk

Sep 17, 2024 · Databases

Overview of Recent Advances in Graph, Vector, and Real-Time Data Warehouse Technologies

This article presents a collection of technical abstracts covering graph database parallel query optimization, next‑generation vector databases, real‑time data warehouse architectures, and cloud‑native analytics solutions, while also providing instructions for obtaining the full e‑book via a WeChat public account.

Big Data, Cloud Native, Graph Database

0 likes · 5 min read

DataFunSummit

Sep 16, 2024 · Databases

DataFun Summit: Technical Papers on Graph Databases, Vector Databases, Real‑Time Data Warehouses and Industry Data Practices

The DataFun Summit page presents a collection of technical papers covering graph database parallel queries, next‑generation vector databases, real‑time data warehouse architectures, and best practices in finance and e‑commerce, while also providing instructions for obtaining the e‑book via a public account.

Big Data, Real-time analytics, data-warehouse

0 likes · 5 min read

DataFunSummit

Sep 14, 2024 · Big Data

Apache Hudi Concurrency Control: Overview, MVCC, and OCC

This article provides a comprehensive overview of concurrency control in Apache Hudi, explaining ACID properties, the role of MVCC and OCC, and how Hudi coordinates multiple writers and table services to achieve serializable scheduling while maintaining high performance.

Apache Hudi, Big Data, Concurrency Control

0 likes · 8 min read

Kuaishou Tech

Sep 13, 2024 · Big Data

Blaze: Kuaishou’s Rust‑Based Vectorized Execution Engine for Spark SQL

Blaze is a Rust‑implemented, DataFusion‑based vectorized execution engine created by Kuaishou to accelerate Spark SQL queries, delivering up to 60% faster computation, 30% average compute‑power gains in production, and extensive architectural innovations such as native engine, protobuf protocol, JNI bridge, and Spark extension, while being open‑source and compatible with Spark 3.0‑3.5.

Big Data, DataFusion, Rust

0 likes · 11 min read

Alibaba Cloud Big Data AI Platform

Sep 13, 2024 · Big Data

How Qimao Scales 20PB Data with StarRocks, Flink, and Real‑Time Analytics

Qimao, a Shanghai‑based cultural entertainment internet firm, details its 20 PB big‑data architecture built on StarRocks, Flink, Hive, and Redis, covering data ingestion, real‑time processing, audience selection, metric anomaly drill‑down, 730‑day aggregation, and future plans for metric acceleration and full‑link data governance.

Big Data, Data Governance, Flink

0 likes · 13 min read

Data Thinking Notes

Sep 12, 2024 · Information Security

How to Overcome the Top 3 Data Flow Challenges and Secure Your Data Assets

This article outlines the framework for data element circulation, identifies three major security and compliance challenges in data flow, and presents five practical measures plus a six‑step method for incorporating data assets into financial statements to enhance transparency and value.

Big Data, Data Asset, Data Flow

0 likes · 10 min read

Sohu Tech Products

Sep 11, 2024 · Big Data

Tencent Real-time Lakehouse Intelligent Optimization Practice

Tencent’s real‑time lakehouse combines Spark, Flink, StarRocks and Presto compute layers with Iceberg‑based management and HDFS/COS storage, and its Intelligent Optimize Service—comprising Compaction, Expiration, Cleaning, Clustering, Index and Auto‑Engine modules—automatically reduces merge time, improves query performance, enables secondary indexing, and dynamically routes hot partitions, while future plans target cold/hot separation, materialized view acceleration, and AI‑driven optimizations.

Big Data, Lakehouse, PyIceberg

0 likes · 12 min read

AntTech

Sep 10, 2024 · Big Data

From DATA for AI to AI for DATA: Evolution of Ant Group’s Intelligent Data System

The talk reviews the rapid evolution of data technologies—from early database foundations and big‑data breakthroughs to the rise of generative AI—highlighting how Ant Group’s data platform is shifting from a cost‑efficiency focus to a value‑centric, multimodal, AI‑driven ecosystem.

Artificial Intelligence, Big Data, Data Platforms

0 likes · 17 min read

AntData

Sep 9, 2024 · Big Data

From Cost‑Efficiency to Value‑Centric: The Evolution of Data Systems in the Data+AI Era

The article reviews the rapid advances in generative AI and big‑data technologies, traces the historical development of data infrastructure, and argues that modern data systems are shifting from a cost‑efficiency focus to a value‑centric paradigm driven by multimodal, unstructured data, vector search and machine‑oriented services.

@Data, Artificial Intelligence, Big Data

0 likes · 18 min read

Baidu Geek Talk

Sep 9, 2024 · Big Data

TDS Platform Overview: Architecture, Modules, and Features of Baidu MEG's Turing 3.0 Data Ecosystem

The TDS platform, central to Baidu MEG’s Turing 3.0 ecosystem, unifies data development, warehouse management, monitoring, and resource control through Spark‑based TDE, a visual studio, and AI‑enhanced tools like Smart Diagnosis and Text2SQL, enabling standardized workflows, scalable scheduling, and handling over 30 k daily tasks.

Big Data, Data Development, Data Governance

0 likes · 21 min read

360 Zhihui Cloud Developer

Sep 9, 2024 · Big Data

Why DataFusion is Revolutionizing Big Data Queries with Rust and Arrow

This article introduces DataFusion, a high‑performance, Rust‑based query engine that leverages Apache Arrow’s columnar memory format to enable fast, extensible data processing across multiple storage formats and cloud sources, explains its architecture, execution model, and provides practical Rust code examples for custom extensions.

Apache Arrow, Big Data, DataFusion

0 likes · 16 min read

DataFunSummit

Sep 8, 2024 · Big Data

Building and Optimizing a Cross‑Border E‑Commerce Data Platform: Architecture, Challenges, and Protonbase‑Based Solutions

This article presents Xide International's cross‑border e‑commerce data platform, detailing its multi‑layer business architecture, the scalability and data‑access problems encountered, and how a Protonbase‑driven data‑warehouse and micro‑service redesign dramatically improved query speed, operational efficiency, and cost.

Big Data, Data Platform, Microservices

0 likes · 11 min read

Big Data Technology & Architecture

Sep 7, 2024 · Big Data

Answering Interview Questions on Binlog Loss and Recovery Using Flink CDC

The article explains how to prepare for hypothetical "what if" interview questions about Binlog loss by describing Flink CDC's binlog extraction principles, possible recovery mechanisms, and practical strategies such as resetting offsets, extending log retention, and building offline‑online reconciliation pipelines.

Big Data, Binlog, Data Streaming

0 likes · 4 min read

Data Thinking Notes

Sep 5, 2024 · Big Data

How to Turn Data into Valuable Assets: Strategies for Data Asset Management

This article examines the concept, development trajectory, property rights, and monetization pathways of data assets, outlines a comprehensive data asset management framework, and proposes practical implementation plans to help enterprises unlock and capitalize on their data resources.

Big Data, Data Assets, Data Governance

0 likes · 8 min read

Didi Tech

Sep 5, 2024 · Industry Insights

How Didi Built a Multi‑Protocol, Petabyte‑Scale Storage System for AI Training

Facing petabyte‑level data, billions of small files, and the need for POSIX, S3, and HDFS compatibility, Didi designed OrangeFS, a new generation of unstructured‑data storage, by analyzing internal systems, combining multiple storage solutions, reusing GIFT technology, and implementing a high‑performance metadata service, multi‑protocol fusion, and robust scalability features.

AI storage, Big Data, Cloud Native

0 likes · 27 min read

dbaplus Community

Sep 4, 2024 · Big Data

How Ctrip Scaled Its Data Platform to Multi‑IDC Architecture with Spark 3, Kyuubi, and Celeborn

This article details how Ctrip’s data platform evolved from a single‑IDC design to a multi‑IDC, tiered storage and scheduling architecture, covering the challenges of rapid data growth, the migration to Spark 3 via Kyuubi, the introduction of Celeborn shuffle service, and the resulting performance and reliability gains.

Big Data, HDFS, Kyuubi

0 likes · 23 min read

DataFunTalk

Sep 4, 2024 · Artificial Intelligence

Data+AI Data Lake Technologies: Challenges, Apache Iceberg Overview, and Vector Table Implementations with PyIceberg

This article explores the evolution of data lakes for AI, discusses the challenges of AI-era data management, introduces Apache Iceberg and its architecture, demonstrates PyIceberg-based AI training and inference pipelines, and presents vector table designs with LSH indexing and performance optimizations.

Apache Iceberg, Big Data, Data Lake

0 likes · 22 min read

Data Thinking Notes

Sep 3, 2024 · Big Data

Why Your Business Needs a Unified Data Indicator Platform—and How to Build One

This article explains the challenges of fragmented metric definitions, the benefits of a centralized indicator platform for unified management, agile development, high‑performance querying, and outlines the architecture, capabilities, construction process, and operational best practices to maximize data value.

Big Data, Data Governance, Data Platform

0 likes · 18 min read

DataFunSummit

Aug 31, 2024 · Big Data

Apache Hudi Clustering: Workflow and Layout Optimization Strategies (Part 6)

This article explains Apache Hudi's clustering service, detailing its workflow, three execution modes, and layout optimization strategies—including linear, Z‑order, and Hilbert space‑filling curves—to improve storage locality and query performance in large‑scale data lake environments.

Apache Hudi, Big Data, Space-filling Curves

0 likes · 8 min read

DataFunSummit

Aug 30, 2024 · Big Data

Kuaishou's Data Lake Journey with Apache Hudi: Architecture Evolution, Use Cases, and Lessons Learned

The article details Kuaishou's adoption of a data lake powered by Apache Hudi, covering the challenges of growing data warehouses, the migration from Hive to Hudi, concrete business case studies, promotion strategies, and key takeaways for large‑scale data engineering.

Apache Hudi, Big Data, Data Lake

0 likes · 12 min read

Data Thinking Notes

Aug 29, 2024 · Big Data

How ICBC Evolved Its Data Intelligence Architecture for Real‑Time Insights

At the 2024 Data Intelligence Conference, ICBC's Big Data and AI Lab detailed the evolution of its data intelligence platform, covering architectural redesign, real‑time data warehouse technology, unified intelligent data tools, and future development directions to boost efficiency and innovation.

Big Data, Data Platform, architecture evolution

0 likes · 3 min read
