Tagged articles
3672 articles
DataFunSummit
May 20, 2026 · Big Data

How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture

The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake, addressing latency, storage cost, and complexity for AI and BI workloads, detailing the evolution from MySQL‑to‑Hive to MySQL‑to‑Hudi 1.0 and 2.0, the resulting performance gains, cost savings, and future roadmap.

AI · BI · Big Data
0 likes · 20 min read
DataFunTalk
May 19, 2026 · Industry Insights

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms

A live discussion dissected the shift from single‑point Copilot assistants to platform‑level Agentic data platforms, exposing hard architectural, security, knowledge‑base, evaluation, stability‑cost, and governance challenges while debating whether the future will favor a super‑agent or a multi‑agent ecosystem.

Agentic AI · Big Data · Data Platform
0 likes · 18 min read
DataFunSummit
May 17, 2026 · Industry Insights

From Single‑point Copilot to Platform‑level Agentic: Real Challenges and Future Paths for Data Platforms

A 90‑minute live discussion with data experts from vivo and YangQianGuan reveals that moving from a simple Copilot assistant to a platform‑level Agentic data system requires fundamental architectural changes, new infrastructure for memory, planning, tool orchestration, security guardrails, knowledge management, robust evaluation, and a clear ROI strategy.

AI Governance · Agentic · Big Data
0 likes · 19 min read
Data Party THU
May 15, 2026 · Artificial Intelligence

2026 Big Data Challenge Announces Monthly Star Winners and Shares Winning Teams’ Insights

The 2026 China University Computer Competition – Big Data Challenge reveals the Monthly Star award winners, each receiving 800 RMB, and presents detailed experience reports from the top teams covering feature engineering, model selection, training validation, and ensemble strategies for stock prediction.

Big Data · Model Fusion · Stock Prediction
0 likes · 7 min read
dbaplus Community
May 14, 2026 · Big Data

Building a ‘One‑Sentence Bank’: Big Data and AI Fusion for Small Banks

The article outlines the evolution of big data in banking, compares management models for heterogeneous data, describes the shift from data engineering to knowledge engineering, introduces LLMOps for high‑quality knowledge bases, and details how integrating AI and data can enable a “one‑sentence bank” that answers queries and executes tasks.

Banking · Big Data · Data Governance
0 likes · 22 min read
DataFunTalk
May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architecture complexity, resource and development costs by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big Data · ClickHouse · Flink
0 likes · 22 min read
DataFunTalk
May 8, 2026 · Big Data

How MaxCompute Evolves into a Data+AI Platform: Architecture, Core Capabilities, and Real-World Cases

The article explains how Alibaba Cloud's MaxCompute has been transformed into a cloud‑native Data+AI platform, detailing its layered architecture, multimodal storage, model management, hybrid compute scheduling, SQL AI functions, the MaxFrame Python framework, and several enterprise case studies that demonstrate performance gains and flexible resource orchestration.

AI integration · Big Data · Cloud Native
0 likes · 11 min read
DataFunTalk
May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup to a Lambda‑based 2.0 design and finally a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, costs cut to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture
0 likes · 21 min read
DataFunTalk
Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort each to roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350‑million‑user app.

Big Data · ClickHouse · Data Architecture
0 likes · 22 min read
Model Perspective
Apr 28, 2026 · Big Data

How a Taiwan Ban Became Free Advertising for Amap’s Map App

A recent Taiwan government warning against Amap turned into a viral boost, exposing the app’s superior traffic‑light countdown, massive data‑driven network effects, and the underlying reverse‑propagation model that explains why the ban accelerated downloads rather than suppressing them.

Amap · Big Data · mobile navigation
0 likes · 11 min read
DataFunTalk
Apr 28, 2026 · Artificial Intelligence

From “Lobster” to Ontology: DACon Reveals the Next Trend in Self‑Evolving AI Agents

The DACon conference in Shanghai gathered over 8,000 developers and experts, showcasing 50 talks that explored self‑evolving AI agents, the open‑source GenericAgent framework, data‑governance ontology, Agent‑Ready big‑data infrastructure, and AI+AR ecosystems, while highlighting practical case studies and future industry directions.

AI Agents · AI+AR · Big Data
0 likes · 11 min read
DataFunSummit
Apr 27, 2026 · Artificial Intelligence

How Tencent Games Leverages AI to Turn Data Governance into a Service

Tencent Games’ data governance team details an AI‑driven, end‑to‑end semantic framework that shifts traditional rule‑based data management to a service‑oriented model, cutting storage waste by 30%, halving development time, and boosting asset recommendation accuracy to 95% across its global gaming platform.

AI · Big Data · Data Governance
0 likes · 19 min read
DataFunSummit
Apr 25, 2026 · Big Data

AI‑Era Multimodal Data Lake Infrastructure: TBDS Design, Storage, Compute, and Governance

The article analyzes how Tencent Cloud's TBDS platform tackles the AI era's multimodal data lake challenges through a native storage format (Lance), elastic Ray‑based compute, standardized metadata with Gravitino, and automated governance via Lakekeeper, citing architecture details, performance numbers, and real‑world deployments.

AI Infrastructure · Big Data · Gravitino
0 likes · 13 min read
DataFunSummit
Apr 24, 2026 · Artificial Intelligence

AI‑Driven Data Governance as a Service: Tencent Games' Paradigm Shift

This talk details how Tencent Games leverages AI to transform its data governance from rule‑based, passive processes into a semantic, service‑oriented paradigm, addressing resource waste, low collaboration efficiency, and scalability challenges while delivering measurable improvements in cost, speed, and asset quality.

AI · Automation · Big Data
0 likes · 19 min read
DataFunTalk
Apr 22, 2026 · Industry Insights

How Xiaohongshu Cut Data Platform Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's journey from a ClickHouse‑based batch analytics stack to a unified lakehouse architecture powered by generic incremental computing, showing how the company reduced architecture complexity, resource consumption and development effort each to roughly one‑third while supporting trillions of daily events with sub‑10‑second query latency.

Big Data · Data Architecture · Lakehouse
0 likes · 24 min read
Big Data Tech Team
Apr 22, 2026 · Big Data

Inside Big Tech: Full Breakdown of AI Agents for Data Warehouse Governance

The article analyzes how leading internet companies embed AI agents across the entire data‑warehouse lifecycle to automate governance, presenting real‑world case studies from Alibaba, ByteDance, JD.com and Tencent, and quantifies benefits such as over 65% reduction in manual effort, 50% drop in metric duplication, and a 40% boost in resource utilization.

AI Agents · Automation · Big Data
0 likes · 10 min read
DataFunSummit
Apr 21, 2026 · Industry Insights

How SelectDB Cuts 60% Costs and Boosts Real‑Time Performance for New Energy Batteries

The whitepaper analyzes the data‑driven transformation of the new‑energy battery sector, outlines four core challenges—massive data streams, fast‑changing R&D demands, long manufacturing cycles, and multi‑dimensional quality standards—and demonstrates how SelectDB’s unified lake‑warehouse architecture delivers million‑level throughput, second‑level latency, up to 30× query speedup, and 60% cost reduction across real‑world case studies.

Big Data · Case Study · Data Warehouse
0 likes · 18 min read
DataFunSummit
Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Data Lake · distributed cache
0 likes · 11 min read
Big Data Tech Team
Apr 17, 2026 · Industry Insights

Can AI Replace Data Warehouse Engineers? Exploring the Future of Data Modeling

The article examines how large‑language‑model AI can automate data‑warehouse modeling tasks—generating SQL, designing schemas, handling ETL, and tracing lineage—while highlighting current pain points, practical limitations, and four emerging trends that will reshape the role of data engineers over the next few years.

AI · Big Data · Data Warehouse
0 likes · 11 min read
Ctrip Technology
Apr 16, 2026 · Big Data

How Ray + DuckDB Cut 9B-Row Attribution Queries from 40s to 15s

When attribution analysis on over 900 million rows slowed to more than 40 seconds and threatened cluster stability, Ctrip's smart attribution team rebuilt the architecture with Ray and DuckDB, achieving sub‑15‑second query times, a 160% performance gain, and complete resource isolation.

Attribution Analysis · Big Data · DuckDB
0 likes · 22 min read
DataFunTalk
Apr 16, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's data platform evolution from a simple ClickHouse‑based ad‑hoc system to a Lambda‑style architecture and finally a lakehouse solution, highlighting how the adoption of a new incremental computing model reduced architectural complexity, resource consumption and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse
0 likes · 21 min read
DataFunSummit
Apr 15, 2026 · Industry Insights

Why Traditional Data Platforms Fail and How Ontology Drives Triple‑Digit ROI

The article analyzes costly data‑platform failures—such as a $40 million payroll system in San Francisco schools and a collapsed Healthcare.gov launch—identifies the root cause as ineffective data middle platforms, and demonstrates how Palantir’s ontology‑based three‑layer architecture (semantic, dynamics, decision) can turn data into actionable insights, delivering triple‑digit ROI for enterprises like BP, Novartis, and General Mills.

Big Data · Data Platform · Ontology
0 likes · 5 min read
DataFunTalk
Apr 11, 2026 · Industry Insights

Why Most Intelligent Data Analytics Fail and How Aloudata’s Agent Architecture Solves It

This article examines three common misconceptions in enterprise intelligent data analysis, explains how a semantic metric layer can break data silos, and details Aloudata Agent’s dual‑path engine, multi‑agent collaboration, and product design that together deliver trustworthy, deep, and democratized analytics for modern businesses.

AI · Agent Architecture · Attribution Analysis
0 likes · 18 min read
DataFunTalk
Apr 10, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article analyzes Xiaohongshu's data platform evolution—from a simple ClickHouse‑based analytics layer to a Lambda architecture and finally a lakehouse design—highlighting how adopting a new incremental computing model reduced architecture complexity, resource consumption, and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse
0 likes · 22 min read
Big Data Tech Team
Apr 9, 2026 · Industry Insights

Why Data Engineers Are the New AI Powerhouses: 4 Core Reasons & Actionable Tips

The article analyzes why data development engineers are becoming more valuable in the AI era, outlining four core reasons—including data‑driven AI limits, the rise of RAG architectures, heightened data compliance, and a talent shortage—while offering concrete advice on mastering real‑time pipelines, unstructured data, and AI infrastructure.

AI Infrastructure · Big Data · RAG
0 likes · 8 min read
Alibaba Cloud Observability
Apr 6, 2026 · Cloud Native

How Alibaba Cloud Built Real‑Time OpenAPI Monitoring with Flink + SLS

This article details the design and implementation of a cloud‑native, real‑time monitoring system for Alibaba Cloud OpenAPI, covering background challenges, a Flink‑SLS architecture, multi‑region data processing, checkpoint and state‑backend tuning, source‑side predicate pushdown, visualization with Grafana, and production results.

Big Data · Cloud Native · Flink
0 likes · 21 min read
Big Data Tech Team
Apr 1, 2026 · Big Data

Why Your 2026 Big Data Resume Is Being Ignored and How to Fix It

In the 2026 spring hiring season, many big‑data job seekers see their resumes disappear because they still focus on offline batch processing, while employers now demand real‑time streaming, AI‑driven data pipelines, and cloud‑native deployment skills such as Flink, vector databases, and Kubernetes.

AI integration · Big Data · Cloud Native
0 likes · 7 min read
Big Data Tech Team
Mar 30, 2026 · Big Data

2026 Data Warehouse Interview Guide: Essential Questions for All Three Rounds

This article compiles a comprehensive set of data‑warehouse interview questions—including self‑introduction prompts, SQL and window‑function challenges, data‑skew solutions, architecture design, file‑format trade‑offs, governance, and team‑leadership topics—to help candidates prepare for first, second, and third‑round interviews at leading tech firms.

Big Data · Career Development · Data Governance
0 likes · 7 min read
vivo Internet Technology
Mar 25, 2026 · Industry Insights

How Vivo Scaled Marketing Automation with Presto, Bitmap, and StarRocks

This case study details how Vivo’s marketing automation platform evolved its data‑driven architecture—from a Presto‑based wide‑table design, through a Bitmap optimization, to a StarRocks migration—addressing performance bottlenecks, reducing resource costs, and enhancing data security.

Big Data · Bitmap · Data Architecture
0 likes · 11 min read
DeWu Technology
Mar 25, 2026 · Big Data

How Code LLM Transforms E‑commerce Data Warehouses: From Data Rights to AI‑Driven Automation

This article analyzes how large‑language models for code, exemplified by Claude Code, are integrated into an e‑commerce data‑warehouse ecosystem, defining data‑rights boundaries, introducing agentic workflows, decoupling cognitive and execution runtimes, and establishing standardized I/O contracts to achieve safe, scalable AI‑assisted development and governance.

Big Data · Code LLM · Data Warehouse
0 likes · 24 min read
DataFunSummit
Mar 25, 2026 · Big Data

How Apache Gravitino and OpenLineage Transform Data Governance for AI‑Driven Enterprises

In the era of AI and multi‑cloud, this article analyzes the core challenges of data governance—data silos, quality gaps, and compliance risks—and explains how Apache Gravitino’s unified metadata architecture together with OpenLineage’s standardized lineage model provide a scalable, automated solution for intelligent, real‑time data management.

Apache Gravitino · Big Data · Data Governance
0 likes · 15 min read
DataFunSummit
Mar 24, 2026 · Industry Insights

How DataWorks Is Transforming Big Data Development with AI Agents

The article outlines DataWorks' evolution from a decade‑long big‑data governance platform to an AI‑driven Copilot and autonomous Agent system, detailing its technical foundations, tool‑adaptation layer, context engineering, security safeguards, and future vision of a professional, open, and intelligent big‑data development ecosystem.

AI Copilot · Agent · Big Data
0 likes · 13 min read
DataFunSummit
Mar 16, 2026 · Big Data

How MaxCompute Evolves into an AI‑Native Data Warehouse: Architecture, Capabilities, and Real‑World Cases

This article outlines MaxCompute's 15‑year transformation from a traditional structured‑compute engine to an AI‑native data warehouse, detailing its data, heterogeneous compute, and model capabilities, showcasing three core ability pillars, real‑world case studies, and future development directions.

AI-native · Big Data · Case Study
0 likes · 7 min read
DataFunTalk
Mar 3, 2026 · Big Data

Exploring Tencent Cloud’s Iceberg Batch‑Stream Integration and AI‑Driven Data Governance

This article presents a series of seven technical case studies—including Tencent Cloud’s Iceberg‑based batch‑stream integration, AI‑driven data governance with Apache Gravitino, Xiaohongshu’s lakehouse evolution, and a multimodal data‑lake solution—detailing challenges, architectural designs, implementation steps, performance results, and future directions.

AI · Big Data · Data Lake
0 likes · 8 min read
DeWu Technology
Mar 2, 2026 · Big Data

Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases

This article provides a comprehensive guide to Spark UI, explaining each primary and secondary tab, the key metrics they expose, and how to interpret them for performance bottleneck detection, followed by two detailed case studies and practical tuning recommendations for Spark workloads.

Big Data · Case Study · Spark
0 likes · 19 min read
DataFunSummit
Mar 1, 2026 · Big Data

How Ant Group’s Flex Engine Supercharges Flink with Vectorization

This article details Ant Group’s Flex vectorized engine built on Velox, covering the current state of vectorization, Flex’s architecture (Flink + Velox), core feature development, correctness guarantees, large‑scale deployment results, and future directions for full‑link vectorization and broader hardware support.

Big Data · Flex · Flink
0 likes · 18 min read
DataFunSummit
Feb 8, 2026 · Big Data

Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges

The article explains how Kuaishou modernized its data lake by partnering with Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains and future plans.

AI · BI · Big Data
0 likes · 20 min read
DataFunSummit
Feb 7, 2026 · Big Data

How Flink Enables Real‑Time AI Inference and Agent Construction

This article explains Apache Flink’s stream processing fundamentals, introduces the open‑source Flink Agents framework for building event‑driven AI agents, details Alibaba Cloud’s Flink AI Function for real‑time LLM inference, and showcases demos, architecture, integration patterns, and practical use cases such as VOC analysis, live‑stream analytics, and intelligent operations.

Apache Flink · Big Data · Real-time inference
0 likes · 24 min read
Alibaba Cloud Big Data AI Platform
Feb 4, 2026 · Big Data

How Paimon + StarRocks Power Real‑Time OLAP for Double‑11 Mega‑Sales

During Double‑11 mega‑sales, Taobao Group faced exploding OLAP query traffic, costly data sync pipelines, and slow near‑real‑time analytics, so they unified real‑time and batch data in Paimon, leveraged StarRocks for high‑performance lake queries, tuned cluster settings, and saved nearly ten‑million yuan annually while cutting refresh latency by 80%.

Big Data · Data Lake · OLAP
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big Data · Lakehouse · Paimon
0 likes · 18 min read
Alibaba Cloud Big Data AI Platform
Feb 2, 2026 · Big Data

How We Built a Scalable Lakehouse Architecture with StarRocks, Paimon, and Flink

This article details the evolution of a data warehouse at RenliJia from a MaxCompute‑centric setup to a modern lakehouse using StarRocks, Paimon, Flink, and Fluss, describing design goals, technical evaluations, implementation steps for offline, OLAP, and real‑time workloads, and the challenges and future plans that emerged.

Big Data · Data Warehouse · Flink
0 likes · 25 min read
Raymond Ops
Jan 30, 2026 · Big Data

Build an Enterprise‑Grade HDFS HA and YARN Scheduler from Scratch

This guide walks you through designing and deploying a highly available HDFS architecture with dual NameNodes, ZooKeeper‑based failover, and a tuned YARN resource scheduler, covering detailed configuration files, failover testing, performance tuning, monitoring, automated health checks, capacity planning, and best‑practice checklists for production‑grade big‑data platforms.

Automation · Big Data · HA
0 likes · 28 min read
Radish, Keep Going!
Jan 30, 2026 · Big Data

How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud lakes by redesigning Hadoop Distcp, moving intensive tasks to the Application Master, parallelising copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.

Big Data · Distcp · Hadoop
0 likes · 17 min read
Data Party THU
Jan 29, 2026 · Big Data

How a Tsinghua Big Data Program Turned a Chemistry PhD into an AI‑Powered Process Engineer

This article recounts a Tsinghua University PhD student's journey through a multidisciplinary big‑data training program, detailing the acquisition of AI and data‑science skills, the creation of novel algorithms like MicroFlowSAM and ImageRAG, and their successful application to chemical engineering research and industry projects.

Big Data · Chemical Engineering · Industrial Application
0 likes · 8 min read
Big Data Tech Team
Jan 22, 2026 · Industry Insights

Top 10 Open‑Source Data Visualization Platforms You Should Know

This article presents a concise overview of ten popular open‑source data visualization tools—including Echarts, D3.js, Grafana, Plotly, Redash, Metabase, Superset, Kibana, AntV, and Pyecharts—highlighting their main features, typical use cases, and visual examples to help readers choose the right solution for their needs.

Big Data · D3.js · Data visualization
0 likes · 6 min read
Ray's Galactic Tech
Jan 22, 2026 · Big Data

Export 1 Billion Elasticsearch Docs in 3 Hours Using PIT + Slice

This guide explains how to reliably export over a billion Elasticsearch documents within a few hours by using Point‑In‑Time (PIT) snapshots combined with parallel Slice processing, covering diagnostics, performance modeling, consistency levels, failure recovery, and resource isolation.

Big Data · Data Export · Elasticsearch
0 likes · 7 min read
StarRocks
Jan 22, 2026 · Big Data

How Paimon + StarRocks Accelerates Double‑11 OLAP Queries by 80% Refresh Speed

This article explains how Taotian Group unified real‑time and offline data using Paimon as lake storage and StarRocks for high‑performance OLAP, eliminating costly sync pipelines, cutting refresh time by about 80%, saving nearly ten million yuan annually, and detailing the architecture, cluster safeguards, configuration tweaks, monitoring, and future roadmap for large‑scale promotional events.

Big Data · Data Architecture · OLAP
0 likes · 24 min read
DataFunSummit
Jan 18, 2026 · Big Data

How Ray Reinvents AI Data Pipelines for Massive Multimodal Inference

This article examines the shortcomings of traditional big‑data engines for AI workloads, presents a Ray‑based heterogeneous fusion architecture that unifies CPU/GPU scheduling, Python ecosystems, and streaming‑batch processing, and details fault‑tolerance, checkpointing, compute‑storage separation, resource‑utilization, scalability, and observability improvements that enable thousands of nodes and dramatically higher GPU efficiency.

Big Data · Cloud Native · Ray
0 likes · 31 min read
ByteDance Data Platform
Jan 15, 2026 · Artificial Intelligence

Why Model Evaluation Can Be Cool: Innovative Automated Testing for Data‑Driven LLM Agents

In the era of rapidly advancing large‑model technology, the article outlines the challenges of evaluating data‑centric LLM agents, proposes a three‑layer evaluation framework covering basic capabilities, component‑level checks, and end‑to‑end business impact, and shares practical innovations such as semantic‑equivalence SQL matching, agent‑as‑judge pipelines, and a unified assessment platform.

Agent as judge · Automated Testing · Big Data
0 likes · 22 min read
StarRocks
Jan 15, 2026 · Artificial Intelligence

How AI‑First Lakehouse Redefines Data Platforms for Multimodal Analytics

The article outlines the evolution from traditional OLAP to an AI‑first Lakehouse, detailing unified multimodal storage, CPU/GPU heterogeneous scheduling, native vector search, in‑database AI inference, agent‑centric execution, and self‑evolving platform capabilities that together reshape modern data analytics.

AI · Agent Architecture · Big Data
0 likes · 11 min read
AsiaInfo Technology: New Tech Exploration
Jan 6, 2026 · Industry Insights

Apache Paimon: Boosting Real-Time Data Lakes for Fraud Detection & Manufacturing

This article examines Apache Paimon’s innovative lakehouse architecture, detailing its LSM‑Tree storage, flexible merge engine, and multi‑engine integration, and showcases two real‑world deployments—an operator’s real‑time fraud‑prevention system and a manufacturing firm’s unified data platform—highlighting performance gains and cost reductions.

Apache Paimon · Big Data · Case Study
0 likes · 15 min read
Big Data Tech Team
Dec 29, 2025 · Big Data

Master Big Data Development: A Complete Roadmap from Beginner to Expert

This guide presents a comprehensive big‑data development roadmap, detailing industry opportunities, a six‑module technology stack, four progressive learning stages, hands‑on project ideas, interview question strategies, common pitfalls, and curated resources, helping aspiring engineers become proficient and interview‑ready while avoiding common mistakes.

Big Data · Interview Preparation · Learning Path
0 likes · 11 min read
Big Data Tech Team
Dec 26, 2025 · Interview Experience

How to Nail a 2‑Minute Data Engineer Self‑Introduction

This guide outlines a concise, 1.5‑2‑minute self‑introduction for data engineering interviews, highlighting essential personal details, technical stack, project achievements, business impact, and common pitfalls to avoid, with a concrete example and actionable tips.

Big Data · career advice · data engineering
0 likes · 5 min read
Alibaba Cloud Big Data AI Platform
Dec 24, 2025 · Big Data

How Paimon’s Column‑Separation Architecture Powers Real‑Time Multi‑Modal Lakehouse for AI

This article explains the challenges of frequent column changes in AI feature engineering, introduces Paimon’s column‑separation storage with a global continuous Row ID, details its Blob data type for efficient multi‑modal handling, and outlines production results and future roadmap for building an AI‑native data lakehouse.

Apache Paimon · Big Data · Blob
0 likes · 11 min read
DataFunTalk
Dec 17, 2025 · Artificial Intelligence

How Large Language Models Unlock Field‑Level Data Lineage at Scale

This talk explains how a data platform tackled massive, heterogeneous enterprise data by using large language models and prompt engineering to automatically extract field‑level lineage from SQL scripts, achieve over 80% coverage, and raise accuracy above 95%, dramatically cutting impact‑analysis time.

AI for data engineering · Big Data · Data Lineage
0 likes · 6 min read
JD Tech Talk
Dec 12, 2025 · Big Data

Understanding Hudi Core Concepts: Timeline, Indexes, and Table Types Explained

This article explains Apache Hudi’s core concepts, including its timeline architecture, file layout, indexing mechanisms, and the two primary table types—Copy on Write and Merge on Read—along with their trade‑offs and the various query modes such as snapshot, time‑travel, and incremental queries.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
JD Cloud Developers
Dec 12, 2025 · Big Data

Apache Hudi Core Concepts: Timeline, Indexes, Table Types & Queries

This article explains Apache Hudi’s core architecture, detailing the timeline mechanism, file layout, indexing strategies, the two main table types (Copy‑On‑Write and Merge‑On‑Read), and various query modes such as snapshot, time‑travel, read‑optimized and incremental queries.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
vivo Internet Technology
Dec 10, 2025 · Big Data

Vivo’s 800‑Day Journey Optimizing Celeborn Remote Shuffle Service at PB Scale

This technical report details how Vivo’s big‑data platform adopted Celeborn as its remote shuffle service, evaluated alternatives, tuned hardware and software configurations, implemented performance and stability enhancements, and outlines future operational and community‑driven improvements for handling petabyte‑scale shuffle workloads.

Big Data · Kubernetes · Remote Shuffle Service
0 likes · 20 min read
Big Data Technology & Architecture
Dec 10, 2025 · Big Data

What’s New in Apache Spark 4.0? Deep Dive into 2025 Core Updates

The 2025 release of Apache Spark 4.0 brings a comprehensive overhaul—including default ANSI SQL mode, full SQL scripting support, a new Real‑Time streaming mode, adaptive query execution, dynamic memory management, and GPU‑accelerated MLlib—significantly boosting performance, reliability, and developer productivity across big‑data workloads.

Apache Spark · Big Data · GPU Acceleration
0 likes · 9 min read
Raymond Ops
Dec 7, 2025 · Operations

Ceph Uncovered: Architecture, Deployment, and Ops Best Practices

Ceph is an open‑source distributed storage platform offering object, block, and file services with high availability, scalability, and self‑management; the guide explains its core components, CRUSH algorithm, storage interfaces, deployment steps using ceph‑deploy, operational monitoring, performance tuning, and common use cases in cloud and big‑data environments.

Big Data · Ceph · Deployment
0 likes · 11 min read
dbaplus Community
Dec 6, 2025 · Big Data

Why Precise Data Warehouse Naming Boosts Efficiency and Cuts Costs

In the era of digital transformation, chaotic data warehouse naming wastes resources, while a well‑defined naming convention improves maintainability, collaboration, and business value, as demonstrated by real‑world cases showing three‑fold query speed gains and up to 60% reduction in cross‑team effort.

Big Data · Data Warehouse · best practices
0 likes · 6 min read
Data STUDIO
Dec 5, 2025 · Big Data

Why Parquet Is the Default Choice for Big Data Storage

The article explains how Apache Parquet’s columnar layout, multi‑level row‑group structure, projection and predicate push‑down, and advanced compression and encoding make it the high‑performance, space‑efficient storage format that powers modern big‑data ecosystems and tools like Spark, Python pandas, and ClickHouse.

Big Data · ClickHouse · Columnar Storage
0 likes · 11 min read
Code Ape Tech Column
Dec 5, 2025 · Big Data

Optimizing 100K Record Retrieval from 10M‑Row Pools: ClickHouse, ES Scroll, ES+HBase, RediSearch

This article examines several engineering solutions for extracting up to 100,000 records from a ten‑million‑row pool, comparing multi‑threaded ClickHouse pagination, Elasticsearch scroll‑scan, an ES‑plus‑HBase hybrid, and RediSearch + RedisJSON, and presents performance measurements and practical trade‑offs.

Big Data · ClickHouse · Elasticsearch
0 likes · 12 min read
Big Data Technology & Architecture
Nov 28, 2025 · Big Data

What’s New in Apache Paimon 2025? Core Performance, AI Integration & Real‑Time Lakehouse Updates

The 2025 Apache Paimon release brings major performance boosts, AI‑centric multimodal storage, deeper streaming‑batch integration, and broader engine compatibility, detailing query and write optimizations, memory management tweaks, and a unified lake format for structured and unstructured data.

AI integration · Apache Paimon · Big Data
0 likes · 6 min read
DataFunSummit
Nov 27, 2025 · Big Data

How BMW Turned Data Into Growth: A Sensors Data Case Study

This article details BMW's digital transformation journey using Sensors Data, covering the background of rapid app growth, the cross‑regional data collection challenges, the systematic solution architecture—including mapping, preprocessing, and historical data migration—and the resulting business impact and future AI‑driven roadmap.

Analytics · Big Data · Digital Transformation
0 likes · 13 min read
Ctrip Technology
Nov 27, 2025 · Big Data

How Ctrip Cut Query Latency by 85% with StarRocks’ Compute‑Storage Separation

Ctrip migrated its massive User Behavior Tracking system from ClickHouse to a compute‑storage separated StarRocks cluster on Kubernetes, achieving millisecond‑level query latency, halving storage usage, reducing node count, and sustaining millions‑of‑rows‑per‑second write throughput while simplifying scaling and operations.

Big Data · ClickHouse · Compute-Storage Separation
0 likes · 15 min read
DataFunSummit
Nov 24, 2025 · Big Data

How Tencent Cloud Uses Iceberg, Gravitino and Multimodal Lakes for Unified Data Processing

This article series explores Tencent Cloud's Iceberg‑based batch‑stream integration, Apache Gravitino's unified metadata and lineage solution, Xiaohongshu's data‑architecture evolution for the Big AI Data era, and a practical Data+AI multimodal data‑lake implementation, highlighting challenges, architectural designs, and performance gains.

Big Data · Data Lake · Iceberg
0 likes · 7 min read
DataFunSummit
Nov 23, 2025 · Artificial Intelligence

How Large Language Models Are Revolutionizing Banking Data Integration

This article examines the challenges of traditional banking data, explains how large language models can fuse structured and unstructured information, outlines a new data‑centric infrastructure and governance approach, and describes the Dify platform’s AI‑agent and DataOps capabilities for agile, non‑intrusive integration with core banking systems.

AI Agents · Big Data · Data Governance
0 likes · 16 min read
Alibaba Cloud Developer
Nov 20, 2025 · Big Data

Mastering Large‑Scale Data Migration: Challenges, Strategies and Real‑World Solutions

This article explains why data migration is the essential first step for cloud modernization, outlines the technical challenges of moving terabytes to petabytes, compares physical and logical migration methods, and presents practical solutions and real‑world case studies across Hive, cloud warehouses, lake‑house formats and analytic databases.

Big Data · Data Migration · ETL
0 likes · 56 min read
Alibaba Cloud Big Data AI Platform
Nov 15, 2025 · Big Data

From a Decade-Long Big Data Journey to a Cloud‑Native Lakehouse

This article chronicles a ten‑year evolution of a self‑built big data platform—detailing early Hadoop clusters, successive migrations to Spark, Hive, Hudi, and StarRocks, the operational challenges encountered, and the comprehensive shift to Alibaba Cloud EMR Serverless that delivered significant cost, performance, and stability gains while outlining future intelligent‑ecosystem plans.

Big Data · Data Lake · Spark
0 likes · 17 min read
Instant Consumer Technology Team
Nov 10, 2025 · Big Data

Fixing Multi‑Version, Multi‑Cluster and HA with Apache Kyuubi for Spark/Flink

Apache Kyuubi, an enterprise‑grade multi‑tenant data gateway, replaces Livy and Flink SQL Gateway to support multiple engine versions, cross‑cluster elastic scheduling, high‑availability batch jobs, and traffic control, dramatically reducing deployment complexity, improving resource utilization, and accelerating release cycles for large‑scale Spark and Flink workloads.

Apache Kyuubi · Big Data · Data Gateway
0 likes · 18 min read
DataFunSummit
Nov 10, 2025 · Big Data

How Xiaohongshu Cut Data Architecture Costs by One‑Third with Incremental Computing

This article explains how Xiaohongshu, a lifestyle community with over 350 million monthly users, transformed its data platform from a traditional Lambda architecture to a next‑generation incremental computing model, reducing architectural complexity, resource consumption and development effort each by roughly two‑thirds while supporting massive real‑time and offline data demands.

AI · Big Data · Data Architecture
0 likes · 6 min read
Alibaba Cloud Developer
Nov 7, 2025 · Big Data

Unlock Enterprise‑Grade Data Pipelines with DMS Airflow: Features, Integration & Code Samples

This article introduces DMS Airflow, an enterprise‑level data workflow orchestration platform built on Apache Airflow, covering its advanced DAG capabilities, deep DMS integration, scheduling, task dependency management, dynamic task generation, resource scaling, security features, and practical code examples for SQL, Spark, DTS, and Notebook tasks.

Airflow · Big Data · DMS
0 likes · 20 min read
Ops Community
Nov 6, 2025 · Big Data

Zero Data Loss Kafka Cluster Scaling: From 3 to 10 Nodes – A Complete Guide

This comprehensive guide walks you through expanding or shrinking a production‑grade Kafka cluster—covering prerequisites, anti‑pattern warnings, environment matrices, step‑by‑step expansion and contraction procedures, partition rebalancing principles, monitoring, best practices, and troubleshooting—to ensure zero data loss during scaling.

Big Data · Kafka · Partition Rebalancing
0 likes · 27 min read
DataFunTalk
Nov 1, 2025 · Big Data

How Kuaishou E‑Commerce Built a Data Metric System to Power Decision‑Making

The article examines Kuaishou’s e‑commerce data metric system, detailing why a metric framework is essential, how it was built, the product practice, management methods, and the challenges faced by data product managers, engineers, and operators across production, querying, and usage stages.

Big Data · Data Product · Kuaishou
0 likes · 6 min read
DataFunTalk
Oct 31, 2025 · Big Data

How Kuaishou E‑Commerce Built a Data Metric System to Power Decision‑Making

This article explores Kuaishou e‑commerce's journey in constructing a comprehensive data metric system, detailing its business context, the necessity of metrics, challenges faced by data product managers and engineers, practical implementation steps, management practices, and a concluding Q&A.

Big Data · Kuaishou · data metrics
0 likes · 6 min read
Instant Consumer Technology Team
Oct 29, 2025 · Big Data

Revolutionizing Feature Engineering with Distributed Tech & Configurable Services

Facing PB‑scale user behavior data and millions of feature dimensions, the platform transformed its search, advertising, and recommendation pipelines by adopting a distributed, configurable‑service architecture that delivers high‑throughput streaming, elastic storage, rapid feature iteration, and robust fault‑tolerance for AI‑driven personalization.

Big Data · Data Architecture · Distributed Systems
0 likes · 17 min read
DataFunSummit
Oct 29, 2025 · Big Data

How Huolala Scaled to 40PB: Inside Their Evolving Big Data Storage Architecture

Huolala, founded in 2013, runs a massive cross‑cloud hybrid big‑data storage platform of over 40 PB across 3,000+ machines, evolving through four online‑storage phases, robust HA design, performance‑cost optimizations, AI vector storage, and a cost‑governance system that saved more than half of its storage expenses.

AI vector storage · Big Data · Cost Optimization
0 likes · 18 min read
ByteDance Data Platform
Oct 29, 2025 · Big Data

How Volcano Engine’s Multimodal Data Lake Tackles AI Agent Challenges

The article explores how Volcano Engine’s multimodal data lake architecture addresses the storage, compute, and management challenges of AI agents by introducing new formats like Lance, upgrading engines such as Spark and Daft, and providing unified tools for processing, versioning, and querying massive multimodal datasets.

Big Data · Daft engine · Lance format
0 likes · 13 min read
NiuNiu MaTe
Oct 29, 2025 · Backend Development

How to Build a Billion‑User Real‑Time Leaderboard: Architecture, Tools, and Pitfalls

This article walks through the end‑to‑end design of a leaderboard that must serve over 100 million users with 100 k queries per second, covering requirement clarification, real‑time and accuracy challenges, technology selection such as Redis ZSet, multi‑layer architecture, sharding, caching, monitoring, and practical implementation tips to achieve low latency, high consistency, and cost‑effective scalability.

Big Data · Distributed Systems · Real-Time
0 likes · 19 min read
DataFunSummit
Oct 29, 2025 · Big Data

How Douyin’s Data Asset Platform Revolutionizes Big Data Lineage

This article introduces Douyin Group’s Data Asset Management Platform, explaining its shift from traditional metadata to a comprehensive data‑asset approach, detailing the platform’s capabilities, and focusing on the evolution and application of full‑link data lineage across four key topics to improve visibility, quality, security, and cost efficiency.

Big Data · Data Assets · Douyin
0 likes · 5 min read
Radish, Keep Going!
Oct 28, 2025 · Big Data

How Netflix Achieved Petabyte-Scale, Sub-Second Log Queries with ClickHouse

Netflix processes over 5 PB of logs daily, handling millions of events per second. By layering hot and cold storage, using a custom lexer for fingerprinting, serializing with the native protocol, and sharding tag maps, it reduced query latency from seconds to sub‑second levels with ClickHouse.

Big Data · ClickHouse · Distributed Systems
0 likes · 8 min read