Tagged articles

Big Data

3720 articles · Page 1 of 38

Jul 4, 2026 · Industry Insights

How a Modern Data Platform Is Redefining the Future of Insurance

The article details how Ping An Property & Casualty transformed its legacy siloed data architecture into a systematic Kunpeng Intelligent Platform, built three core pillars—Agent platform, OSI semantic layer, and AI tools—boosted ChatBI accuracy, evaluated OpenClaw’s limits, and delivered end‑to‑end AI across marketing, underwriting, claims, agriculture, and forecasting.

AI · Big Data · Data Platform

0 likes · 12 min read

DataFunTalk

Jun 30, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the trillions, migrated 500 PB of data to Alibaba Cloud and iterated its data platform through four architecture generations (ClickHouse‑based ad‑hoc, Lambda, Lakehouse, and a unified incremental compute model), cutting resource, development, and storage costs to one‑third while delivering sub‑10‑second query latency at petabyte scale.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

DataFunTalk

Jun 25, 2026 · Big Data

From Writing SQL to Speaking Requirements: Practical Guide to DataWorks Data Agent

This article walks through using DataWorks Data Agent to automate end‑to‑end data‑warehouse development—from preparing source tables and a structured requirement document, uploading it, crafting task commands, selecting execution modes and models, to the agent generating SQL, building workflows, publishing them, and producing a final report—all without writing SQL manually.

AI Automation · Big Data · Data Agent

0 likes · 16 min read

DataFunTalk

Jun 24, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu, with over 350 million monthly users and daily logs in the billions, migrated its data platform from AWS to Alibaba Cloud and iterated four times—from a ClickHouse‑based ad‑hoc layer to a Lambda architecture and finally a Lakehouse with incremental compute—cutting architecture complexity, resource cost and development effort each to about one‑third while delivering second‑level analytics on trillion‑scale data.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

dbaplus Community

Jun 23, 2026 · Big Data

From Hand‑Written SQL to One‑Click Validation: Alibaba’s Verify‑Data Agent Skill Design Review

The article details how Alibaba’s production‑grade Verify‑Data Agent Skill replaces manual, multi‑SQL data validation with a single natural‑language command, automating table discovery, SQL generation, execution, and review‑level reporting, with turnaround of 30 minutes or less, comprehensive coverage, and robust risk controls for big‑data pipelines.

Big Data · Data Quality · Data Validation

0 likes · 28 min read

DataFunTalk

Jun 21, 2026 · Big Data

How Zhihu Optimized Spark Jobs with Gluten: A Practical Deep‑Dive

This article details Zhihu's end‑to‑end experience of migrating Spark SQL workloads to the open‑source Gluten framework, covering background performance benchmarks, the architecture of Gluten and Velox, consistency and performance challenges encountered during migration, the concrete fixes applied, and the resulting resource savings and future plans.

Big Data · Gluten · Optimization

0 likes · 22 min read

DataFunSummit

Jun 20, 2026 · Big Data

Building an Agentic Analytics Platform for the Gaming Industry with SelectDB

The article analyzes four challenges of game‑industry data analysis (high timeliness, massive concurrency, heterogeneous sources, and petabyte‑scale volumes) and explains how SelectDB’s evolution to an AI‑Ready, Agentic platform with MCP and a semantic layer addresses these issues through real‑time OLAP, multimodal processing, and autonomous decision loops.

AI-Ready · Agentic AI · Big Data

0 likes · 16 min read

DataFunTalk

Jun 20, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's step‑by‑step migration from a simple ClickHouse‑based analytics stack to a Lambda‑style 2.0 architecture and finally to a Lakehouse‑based 3.0 design, highlighting concrete performance numbers, cost reductions, and the definition of a generic incremental‑compute model (SPOT) that underpins the evolution.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

DataFunSummit

Jun 19, 2026 · Big Data

Near‑Real‑Time Data Warehousing with Yunqi Lakehouse: Cases from Xiaohongshu, Kuaishou, Meituan

The article examines how Xiaohongshu, Kuaishou and Meituan adopted Yunqi Lakehouse’s General Incremental Computing and Single‑Engine architecture to achieve near‑real‑time data warehouses, cutting resource usage to as low as 1/20 of full‑batch jobs, reducing data latency from days to minutes, and improving query performance.

Big Data · Case Study · General Incremental Computing

0 likes · 12 min read

DataFunTalk

Jun 16, 2026 · Big Data

How MaxCompute Evolves Data Platforms for AI: Architecture, Features, and Real‑World Cases

The article explains how Alibaba Cloud's MaxCompute transforms a traditional data warehouse into a cloud‑native, multimodal Data+AI platform by introducing a four‑layer architecture, SQL‑based AI functions, the Python‑native MaxFrame framework, and a series of industry case studies that demonstrate performance gains and flexible resource scheduling.

Big Data · Cloud Native · Data+AI

0 likes · 11 min read

IT Learning Made Simple

Jun 14, 2026 · Industry Insights

Why Data Architects Are the Hottest Talent in the DT Era

The article explains why data architects have become essential in the DT era, detailing their responsibilities, core skills, big‑data technology stack, governance practices, career paths, and the tools they use to turn data into a strategic asset for enterprises.

Big Data · Career Path · Data Architecture

0 likes · 9 min read

dbaplus Community

Jun 14, 2026 · Big Data

Why Big Data Is Falling Silent: When Scale Can’t Fake Value Anymore

Although national data production reached 52.26 ZB in 2025 and keeps growing, the term “big data” is disappearing because it no longer serves as an organizational credit that hides the need for real value, responsibility, and measurable business impact, especially in the AI era.

AI impact · Big Data · Data Governance

0 likes · 13 min read

DataFunTalk

Jun 11, 2026 · Artificial Intelligence

How Qichacha Leverages Large Language Models for Field‑Level Data Lineage

This article details Qichacha's use of large language models to extract field‑level data lineage from heterogeneous, non‑standard code and ETL assets, describing the motivation, architectural blueprint, practical challenges such as cost, accuracy and hallucination, and the resulting improvements in impact analysis, metric tracing, and sensitive‑data governance.

Big Data · Data Governance · Flink

0 likes · 11 min read

iQIYI Technical Product Team

Jun 11, 2026 · Big Data

How iQIYI’s QBFS Enables Seamless Hybrid‑Cloud Storage and Cuts Big‑Data Costs by Over 30%

iQIYI’s big‑data team built a self‑developed QBFS virtual file system that unifies private and multiple public clouds, providing transparent routing, automatic migration, intelligent caching and fine‑grained governance, which together reduce storage and compute costs by more than 30% while supporting scalable analytics.

Big Data · Caching · Data Migration

0 likes · 21 min read

IT Learning Made Simple

Jun 8, 2026 · R&D Management

The Essential Gear to Become a Software Architect

This guide maps the complete skill tree for aspiring software architects, detailing foundational knowledge, core competencies such as system design and performance tuning, extended expertise in cloud‑native and big‑data technologies, and a staged learning roadmap to help newcomers acquire the necessary gear.

Big Data · Cloud Native · Performance Optimization

0 likes · 9 min read

DataFunSummit

Jun 7, 2026 · Artificial Intelligence

How Qichacha Uses Large Language Models for Field‑Level Data Lineage

This article details Qichacha's technical journey of applying large language models to resolve field‑level data lineage challenges in a complex, multi‑source data environment, describing the motivation, architecture, practical implementation, engineering trade‑offs, and measurable outcomes.

AI · Big Data · Data Governance

0 likes · 11 min read

Digital Planet

Jun 6, 2026 · Big Data

Why Has the Term “Big Data” Suddenly Disappeared?

Although data production continues to surge—reaching 52.26 ZB in 2025—the “big data” label is fading because its original narrative of scale as value has run out, exposing a credit‑and‑responsibility gap that forces organizations to demand concrete business impact rather than mere infrastructure.

AI impact · Big Data · Data Governance

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jun 4, 2026 · Big Data

Scalar‑Vector Hybrid Search in a Data Lake with One SQL on EMR Serverless Spark

EMR Serverless Spark now supports scalar‑vector hybrid search via DLF Global Index, allowing a single Spark SQL statement to perform vector similarity and scalar filtering together, eliminating data movement, reducing latency, and boosting performance for scenarios such as autonomous driving, e‑commerce, and knowledge‑base retrieval.

Big Data · DLF Global Index · EMR Serverless Spark

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

May 28, 2026 · Artificial Intelligence

From Assisted to Autonomous: How DataWorks Data Agent Revolutionizes Data Intelligence

DataWorks Data Agent advances from an assisted, code‑completion tool to a fully autonomous data‑intelligent agent, using a dual‑engine CLI/Claw architecture, unified runtime, open Skill ecosystem, and CPU‑GPU co‑optimization to automatically understand requirements, explore data, generate code, execute tasks, and deliver end‑to‑end results for developers and operators.

AI · Automation · Big Data

0 likes · 10 min read

DataFunSummit

May 28, 2026 · Artificial Intelligence

How DataWorks Data Agent Advances from Augmented Assistance to Full Autonomy

The article analyzes DataWorks Data Agent’s evolution from a helper‑style tool to an autonomous data‑centric AI agent, detailing its five‑stage roadmap, dual‑engine CLI/Claw architecture, unified runtime kernel, open skill ecosystem, and CPU‑GPU joint optimization for enterprise‑grade data automation.

AI · Automation · Big Data

0 likes · 12 min read

DataFunTalk

May 28, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse with generic incremental compute, cutting architecture complexity, resource and development costs by one‑third while delivering second‑level queries over trillions of rows.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

Big Data Technology & Architecture

May 26, 2026 · Big Data

Advanced Paimon Production Issues: 10 Rare Compaction‑Related Problems and Fixes

This article enumerates ten uncommon, compaction‑related problems encountered in large‑scale Paimon deployments, explains their root causes—such as RPC timeouts, snapshot expiration, file corruption, and write conflicts—and provides concrete configuration tweaks and operational steps to resolve each issue.

Big Data · Compaction · Flink

0 likes · 9 min read

DataFunTalk

May 25, 2026 · Big Data

MaxCompute’s AI‑Ready Evolution: Architecture, Features, and Real‑World Use Cases

This article examines how Alibaba Cloud’s MaxCompute platform has been transformed for AI workloads, detailing its multi‑layer architecture, multimodal data storage, SQL AI functions, the Python‑based MaxFrame framework, and real‑world deployments in large‑model preprocessing, autonomous driving, and multimodal image labeling.

AI · Big Data · Distributed Computing

0 likes · 12 min read

AI Large-Model Wave and Transformation Guide

May 25, 2026 · Artificial Intelligence

AI‑Powered Underwater Simulation: Autonomous Perception, Decision & Execution

The article presents a comprehensive AI‑driven framework for unmanned underwater vehicles, detailing a three‑layer decision architecture, human‑machine collaboration models, conflict‑resolution mechanisms, data acquisition and simulation pipelines, ontology‑based knowledge graphs, and self‑evolution processes to enable reliable autonomous perception, planning, and actuation in complex marine environments.

Big Data · Operations · R&D Management

0 likes · 30 min read

AI Large-Model Wave and Transformation Guide

May 24, 2026 · Industry Insights

From CIA‑Labeled ‘Garbage’ to Military Disappointment: Palantir’s Series of Failures

The article chronicles Palantir’s two‑decade saga of high‑profile setbacks—from a $5 billion, six‑year military AI project and a failed financial platform to stalled consumer data alliances—showing how advanced algorithms falter when detached from real‑world business needs.

AI · Big Data · Industry Analysis

0 likes · 8 min read

Big Data Tech Team

May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big Data · ETL · Hive

0 likes · 3 min read

DataFunTalk

May 22, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era

The article details Xiaohongshu's evolution from a simple ClickHouse‑based analytics layer to a Lambda‑enabled 2.0 stack and finally a Lakehouse‑based 3.0 architecture, showing how each iteration reduced infrastructure complexity, resource consumption and development effort by roughly one‑third while supporting trillions of daily events and AI‑driven use cases.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunSummit

May 21, 2026 · Big Data

Alibaba Cloud’s Agent-Ready Big Data AI Infrastructure: Boosting Data Development from Hours to Minutes

Facing a projected 85% of enterprises deploying internal agents within two years, Alibaba Cloud proposes an Agent-Ready big‑data AI infrastructure—comprising a unified data lake, real‑time processing, high‑dimensional vector retrieval, elastic model serving, and comprehensive security governance—that has already cut data‑development cycles from hours to 5‑10 minutes in internal model‑training and Taobao flash‑sale scenarios.

AI · Agent-Ready · Big Data

0 likes · 15 min read

DataFunSummit

May 20, 2026 · Big Data

How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture

The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake, addressing latency, storage cost, and complexity for AI and BI workloads, detailing the evolution from MySQL‑to‑Hive to MySQL‑to‑Hudi 1.0 and 2.0, the resulting performance gains, cost savings, and future roadmap.

AI · BI · Big Data

0 likes · 20 min read

Linyb Geek Road

May 20, 2026 · Big Data

Why 90% of Companies Get Data Governance Wrong and How to Reduce Friction

Most data‑governance initiatives fail not because of lacking technology but because they add friction; the article explains how companies mistakenly focus on rules, platforms, and processes, and offers a step‑by‑step approach—identifying high‑value tables, minimal metadata, targeted quality rules, and fast issue diagnosis—to make governance truly useful.

Big Data · Data Governance · Data Quality

0 likes · 29 min read

DataFunTalk

May 19, 2026 · Industry Insights

From Single‑Point Copilot to Platform‑Level Agentic: Real Challenges and Future Forks for Data Platforms

A live discussion dissected the shift from single‑point Copilot assistants to platform‑level Agentic data platforms, exposing hard architectural, security, knowledge‑base, evaluation, stability‑cost, and governance challenges while debating whether the future will favor a super‑agent or a multi‑agent ecosystem.

Agentic AI · Big Data · Data Platform

0 likes · 18 min read

DataFunSummit

May 17, 2026 · Industry Insights

From Single‑point Copilot to Platform‑level Agentic: Real Challenges and Future Paths for Data Platforms

A 90‑minute live discussion with data experts from vivo and YangQianGuan reveals that moving from a simple Copilot assistant to a platform‑level Agentic data system requires fundamental architectural changes, new infrastructure for memory, planning, tool orchestration, security guardrails, knowledge management, robust evaluation, and a clear ROI strategy.

AI Governance · Big Data · Data Platform

0 likes · 19 min read

Data Party THU

May 15, 2026 · Artificial Intelligence

2026 Big Data Challenge Announces Monthly Star Winners and Shares Winning Teams’ Insights

The 2026 China University Computer Competition – Big Data Challenge reveals the Monthly Star award winners, each receiving 800 RMB, and presents detailed experience reports from the top teams covering feature engineering, model selection, training validation, and ensemble strategies for stock prediction.

Big Data · Model Fusion · Time Series Validation

0 likes · 7 min read

dbaplus Community

May 14, 2026 · Big Data

Building a ‘One‑Sentence Bank’: Big Data and AI Fusion for Small Banks

The article outlines the evolution of big data in banking, compares management models for heterogeneous data, describes the shift from data engineering to knowledge engineering, introduces LLMOps for high‑quality knowledge bases, and details how integrating AI and data can enable a “one‑sentence bank” that answers queries and executes tasks.

Big Data · Data Governance · Knowledge Engineering

0 likes · 22 min read

vivo Internet Technology

May 13, 2026 · Big Data

How Vivo Upgraded a Million‑Node YARN Cluster: Architecture, Scheduler Switch, and Performance Optimizations

This article details Vivo's end‑to‑end upgrade of a YARN 2.6.0 cluster to a modern version for a million‑node, hundred‑thousand‑tasks‑per‑day platform, covering architectural evolution, scheduler migration, compatibility fixes, performance tuning, and service‑continuity strategies.

Big Data · Capacity Scheduler · Hadoop

0 likes · 28 min read

DeWu Technology

May 13, 2026 · Big Data

How BP Claw Solves AI Coding Input Challenges in FlinkSpec’s Real‑Time Data Warehouse

The article explains how BP Claw tackles unstable AI coding results by automatically converting low‑quality PRD documents into structured, high‑quality requirements, applying token‑saving strategies, strict hallucination guards, and multi‑skill orchestration, which together boost FlinkSpec’s real‑time data‑warehouse delivery efficiency by up to 30%.

AI coding · BP Claw · Big Data

0 likes · 17 min read

DataFunTalk

May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architecture complexity, resource and development costs by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big Data · ClickHouse · Flink

0 likes · 22 min read

Zhihu Tech Column

May 9, 2026 · Big Data

How Zhihu Built a Unified OneID System to Consolidate Fragmented User Identities

Zhihu created a unified OneID framework that merges scattered account, device, and behavior data into a global unique identifier, using strong and weak IDs, graph‑based connectivity, device governance, and a device half‑life model to improve recommendation, push, and advertising effectiveness.

Big Data · Device Governance · Graph Computation

0 likes · 11 min read

DataFunTalk

May 8, 2026 · Big Data

How MaxCompute Evolves into a Data+AI Platform: Architecture, Core Capabilities, and Real-World Cases

The article explains how Alibaba Cloud's MaxCompute has been transformed into a cloud‑native Data+AI platform, detailing its layered architecture, multimodal storage, model management, hybrid compute scheduling, SQL AI functions, the MaxFrame Python framework, and several enterprise case studies that demonstrate performance gains and flexible resource orchestration.

AI integration · Big Data · Cloud Native

0 likes · 11 min read

DataFunTalk

May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution—from a simple ClickHouse ad‑hoc setup to a Lambda‑based 2.0 design and finally a lakehouse‑driven 3.0 architecture—highlighting the adoption of general incremental compute, cost‑reduction to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunTalk

Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort by roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

Model Perspective

Apr 28, 2026 · Big Data

How a Taiwan Ban Became Free Advertising for Amap’s Map App

A recent Taiwan government warning against Amap turned into a viral boost, exposing the app’s superior traffic‑light countdown, massive data‑driven network effects, and the underlying reverse‑propagation model that explains why the ban accelerated downloads rather than suppressing them.

Amap · Big Data · Network Effects

0 likes · 11 min read

DataFunTalk

Apr 28, 2026 · Artificial Intelligence

From “Lobster” to Ontology: DACon Reveals the Next Trend in Self‑Evolving AI Agents

The DACon conference in Shanghai gathered over 8,000 developers and experts, showcasing 50 talks that explored self‑evolving AI agents, the open‑source GenericAgent framework, data‑governance ontology, Agent‑Ready big‑data infrastructure, and AI+AR ecosystems, while highlighting practical case studies and future industry directions.

AI Agents · AI+AR · Big Data

0 likes · 11 min read

DataFunSummit

Apr 27, 2026 · Artificial Intelligence

How Tencent Games Leverages AI to Turn Data Governance into a Service

Tencent Games’ data governance team details an AI‑driven, end‑to‑end semantic framework that shifts traditional rule‑based data management to a service‑oriented model, cutting storage waste by 30%, halving development time, and boosting asset recommendation accuracy to 95% across its global gaming platform.

AI · Big Data · Data Governance

0 likes · 19 min read

DataFunSummit

Apr 25, 2026 · Big Data

AI‑Era Multimodal Data Lake Infrastructure: TBDS Design, Storage, Compute, and Governance

The article analyzes how Tencent Cloud's TBDS platform tackles the AI era's multimodal data lake challenges through a native storage format (Lance), elastic Ray‑based compute, standardized metadata with Gravitino, and automated governance via Lakekeeper, citing architecture details, performance numbers, and real‑world deployments.

AI Infrastructure · Big Data · Gravitino

0 likes · 13 min read

DataFunSummit

Apr 24, 2026 · Artificial Intelligence

AI‑Driven Data Governance as a Service: Tencent Games' Paradigm Shift

This talk details how Tencent Games leverages AI to transform its data governance from rule‑based, passive processes into a semantic, service‑oriented paradigm, addressing resource waste, low collaboration efficiency, and scalability challenges while delivering measurable improvements in cost, speed, and asset quality.

AI · Automation · Big Data

0 likes · 19 min read

Java Tech Enthusiast

Apr 23, 2026 · Industry Insights

12306 Crackdown Triggers Widespread Failures on Third‑Party Ticket Platforms

Ahead of the May Day holiday, many users reported errors and failed bookings on third‑party ticket services such as Ctrip, Didi and Tongcheng, after 12306’s new big‑data‑driven risk‑control system introduced a “slow‑queue” mechanism that blocked millions of suspicious transactions.

12306 · Big Data · Railway

0 likes · 6 min read

DataFunTalk

Apr 22, 2026 · Industry Insights

How Xiaohongshu Cut Data Platform Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's journey from a ClickHouse‑based batch analytics stack to a unified lakehouse architecture powered by generic incremental computing, showing how the company reduced architecture complexity, resource consumption and development effort each to roughly one‑third while supporting trillions of daily events with sub‑10‑second query latency.

Big Data · Data Architecture · Lakehouse

0 likes · 24 min read

Big Data Tech Team

Apr 22, 2026 · Big Data

Inside Big Tech: Full Breakdown of AI Agents for Data Warehouse Governance

The article analyzes how leading internet companies embed AI agents across the entire data‑warehouse lifecycle to automate governance, presenting real‑world case studies from Alibaba, ByteDance, JD.com and Tencent, and quantifies benefits such as over 65% reduction in manual effort, 50% drop in metric duplication, and a 40% boost in resource utilization.

AI Agents · Automation · Big Data

0 likes · 10 min read

DataFunSummit

Apr 21, 2026 · Industry Insights

How SelectDB Cuts 60% Costs and Boosts Real‑Time Performance for New Energy Batteries

The whitepaper analyzes the data‑driven transformation of the new‑energy battery sector, outlines four core challenges—massive data streams, fast‑changing R&D demands, long manufacturing cycles, and multi‑dimensional quality standards—and demonstrates how SelectDB’s unified lake‑warehouse architecture delivers million‑level throughput, second‑level latency, up to 30× query speedup, and 60% cost reduction across real‑world case studies.

Big Data · Case Study · Data Warehouse

0 likes · 18 min read

DataFunSummit

Apr 19, 2026 · Big Data

How OPPO Built a Multi‑Modal Data Lake with Gravitino and Curvine

OPPO’s data‑lake team, led by David, detailed their transition from Hive‑Spark to a unified multi‑modal lake, leveraging Gravitino for cross‑engine metadata management and the open‑source Curvine cache to eliminate data silos, boost I/O performance, and support massive image, recommendation, and AI‑Agent workloads.

Big Data · Data Lake · Multimodal

0 likes · 11 min read

Big Data Tech Team

Apr 17, 2026 · Industry Insights

Can AI Replace Data Warehouse Engineers? Exploring the Future of Data Modeling

The article examines how large‑language‑model AI can automate data‑warehouse modeling tasks—generating SQL, designing schemas, handling ETL, and tracing lineage—while highlighting current pain points, practical limitations, and four emerging trends that will reshape the role of data engineers over the next few years.

AI · Big Data · Data Warehouse

0 likes · 11 min read

Ctrip Technology

Apr 16, 2026 · Big Data

How Ray + DuckDB Cut 900M-Row Attribution Queries from 40s to 15s

When attribution analysis on over 900 million rows slowed to more than 40 seconds and threatened cluster stability, Ctrip's smart attribution team rebuilt the architecture with Ray and DuckDB, achieving sub‑15‑second query times, 160% performance gain, and complete resource isolation.

Attribution Analysis · Big Data · Distributed Computing

0 likes · 22 min read

DataFunTalk

Apr 16, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article details Xiaohongshu's data platform evolution from a simple ClickHouse‑based ad‑hoc system to a Lambda‑style architecture and finally a lakehouse solution, highlighting how the adoption of a new incremental computing model reduced architectural complexity, resource consumption and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse

0 likes · 21 min read

DataFunSummit

Apr 15, 2026 · Industry Insights

Why Traditional Data Platforms Fail and How Ontology Drives Triple‑Digit ROI

The article analyzes costly data‑platform failures—such as a $40 million payroll system in San Francisco schools and a collapsed Healthcare.gov launch—identifies the root cause as ineffective data middle platforms, and demonstrates how Palantir’s ontology‑based three‑layer architecture (semantic, dynamics, decision) can turn data into actionable insights, delivering triple‑digit ROI for enterprises like BP, Novartis, and General Mills.

Big Data · Data Platform · Ontology

0 likes · 5 min read

DataFunTalk

Apr 11, 2026 · Industry Insights

Why Most Intelligent Data Analytics Fail and How Aloudata’s Agent Architecture Solves It

This article examines three common misconceptions in enterprise intelligent data analysis, explains how a semantic metric layer can break data silos, and details Aloudata Agent’s dual‑path engine, multi‑agent collaboration, and product design that together deliver trustworthy, deep, and democratized analytics for modern businesses.

AI · Attribution Analysis · Big Data

0 likes · 18 min read

DataFunTalk

Apr 10, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Costs by Two‑Thirds with Incremental Computing

This article analyzes Xiaohongshu's data platform evolution—from a simple ClickHouse‑based analytics layer to a Lambda architecture and finally a lakehouse design—highlighting how adopting a new incremental computing model reduced architecture complexity, resource consumption, and development effort each to roughly one‑third while delivering sub‑second query performance on petabyte‑scale data.

Big Data · Data Architecture · Lakehouse

0 likes · 22 min read

Big Data Tech Team

Apr 9, 2026 · Industry Insights

Why Data Engineers Are the New AI Powerhouses: 4 Core Reasons & Actionable Tips

The article analyzes why data engineers are becoming more valuable in the AI era, outlining four core reasons—including data‑driven AI limits, the rise of RAG architectures, heightened data compliance, and a talent shortage—while offering concrete advice on mastering real‑time pipelines, unstructured data, and AI infrastructure.

AI Infrastructure · Big Data · Data Engineering

0 likes · 8 min read

Alibaba Cloud Observability

Apr 6, 2026 · Cloud Native

How Alibaba Cloud Built Real‑Time OpenAPI Monitoring with Flink + SLS

This article details the design and implementation of a cloud‑native, real‑time monitoring system for Alibaba Cloud OpenAPI, covering background challenges, a Flink‑SLS architecture, multi‑region data processing, checkpoint and state‑backend tuning, source‑side predicate pushdown, visualization with Grafana, and production results.

Big Data · Cloud Native · Flink

0 likes · 21 min read

Big Data Tech Team

Apr 1, 2026 · Big Data

Why Your 2026 Big Data Resume Is Being Ignored and How to Fix It

In the 2026 spring hiring season, many big‑data job seekers see their resumes disappear because they still focus on offline batch processing, while employers now demand real‑time streaming, AI‑driven data pipelines, and cloud‑native deployment skills such as Flink, vector databases, and Kubernetes.

AI integration · Big Data · Cloud Native

0 likes · 7 min read

Big Data Tech Team

Mar 30, 2026 · Big Data

2026 Data Warehouse Interview Guide: Essential Questions for All Three Rounds

This article compiles a comprehensive set of data‑warehouse interview questions—including self‑introduction prompts, SQL and window‑function challenges, data‑skew solutions, architecture design, file‑format trade‑offs, governance, and team‑leadership topics—to help candidates prepare for first, second, and third‑round interviews at leading tech firms.

Big Data · Data Governance · SQL

0 likes · 7 min read

vivo Internet Technology

Mar 25, 2026 · Industry Insights

How Vivo Scaled Marketing Automation with Presto, Bitmap, and StarRocks

This case study details how Vivo’s marketing automation platform evolved its data‑driven architecture—from a Presto‑based wide‑table design, through a Bitmap optimization, to a StarRocks migration—addressing performance bottlenecks, reducing resource costs, and enhancing data security.

Big Data · Data Architecture · OLAP

0 likes · 11 min read

DeWu Technology

Mar 25, 2026 · Big Data

How Code LLM Transforms E‑commerce Data Warehouses: From Data Rights to AI‑Driven Automation

This article analyzes how large‑language models for code, exemplified by Claude Code, are integrated into an e‑commerce data‑warehouse ecosystem, defining data‑rights boundaries, introducing agentic workflows, decoupling cognitive and execution runtimes, and establishing standardized I/O contracts to achieve safe, scalable AI‑assisted development and governance.

Big Data · Code LLM · Data Warehouse

0 likes · 24 min read

DataFunSummit

Mar 25, 2026 · Big Data

How Apache Gravitino and OpenLineage Transform Data Governance for AI‑Driven Enterprises

In the era of AI and multi‑cloud, this article analyzes the core challenges of data governance—data silos, quality gaps, and compliance risks—and explains how Apache Gravitino’s unified metadata architecture together with OpenLineage’s standardized lineage model provide a scalable, automated solution for intelligent, real‑time data management.

Apache Gravitino · Big Data · Data Governance

0 likes · 15 min read

DataFunSummit

Mar 24, 2026 · Industry Insights

How DataWorks Is Transforming Big Data Development with AI Agents

The article outlines DataWorks' evolution from a decade‑long big‑data governance platform to an AI‑driven Copilot and autonomous Agent system, detailing its technical foundations, tool‑adaptation layer, context engineering, security safeguards, and future vision of a professional, open, and intelligent big‑data development ecosystem.

AI Copilot · Agent · Big Data

0 likes · 13 min read

DataFunSummit

Mar 16, 2026 · Big Data

How MaxCompute Evolves into an AI‑Native Data Warehouse: Architecture, Capabilities, and Real‑World Cases

This article outlines MaxCompute's 15‑year transformation from a traditional structured‑compute engine to an AI‑native data warehouse, detailing its three core capability pillars (data, heterogeneous compute, and model capabilities), real‑world case studies, and future development directions.

AI-native · Big Data · Case Study

0 likes · 7 min read

Big Data Technology Tribe

Mar 8, 2026 · Big Data

How Spark Structured Streaming’s Real-Time Mode Achieves Millisecond Latency

This article explains Spark Structured Streaming’s new Real-Time Mode introduced in Spark 4.1, detailing how it reduces latency to the millisecond level by redesigning micro‑batch processing, concurrent stage scheduling, streaming shuffle, and checkpointing, and compares it with Flink’s native streaming.

Big Data · Real-Time Mode · Streaming

0 likes · 11 min read

Big Data Technology & Architecture

Mar 6, 2026 · Big Data

What’s New in Big Data Frameworks? ClickHouse, Fluss, Delta Lake, StarRocks & More (Mar 2026)

This roundup compiles the latest releases across major data platforms—including ClickHouse, Apache Fluss, Delta Lake, StarRocks, Apache Pulsar and DolphinScheduler—highlighting version numbers, key feature additions, security fixes, and emerging trends shaping the big‑data ecosystem.

Apache Fluss · Big Data · ClickHouse

0 likes · 19 min read

DataFunTalk

Mar 3, 2026 · Big Data

Exploring Tencent Cloud’s Iceberg Batch‑Stream Integration and AI‑Driven Data Governance

This article presents a series of seven technical case studies—including Tencent Cloud’s Iceberg‑based batch‑stream integration, AI‑driven data governance with Apache Gravitino, Xiaohongshu’s lakehouse evolution, and a multimodal data‑lake solution—detailing challenges, architectural designs, implementation steps, performance results, and future directions.

AI · Big Data · Data Lake

0 likes · 8 min read

DeWu Technology

Mar 2, 2026 · Big Data

Mastering Spark UI: Deep Dive into Metrics, Tuning, and Real‑World Cases

This article provides a comprehensive guide to Spark UI, explaining each primary and secondary tab, the key metrics they expose, and how to interpret them for performance bottleneck detection, followed by two detailed case studies and practical tuning recommendations for Spark workloads.

Big Data · Case Study · Metrics

0 likes · 19 min read

Big Data Technology Tribe

Mar 2, 2026 · Big Data

How Ray Data’s LogicalOptimizer Transforms Plans for Faster Execution

This article explains Ray Data’s execution pipeline, detailing the LogicalOptimizer’s architecture, core abstractions, rule‑based optimization process, and both logical and physical rule sets, with concrete code examples and practical illustrations of each optimization technique.

Big Data · Distributed Computing · Logical Optimizer

0 likes · 14 min read

DataFunSummit

Mar 1, 2026 · Big Data

How Ant Group’s Flex Engine Supercharges Flink with Vectorization

This article details Ant Group’s Flex vectorized engine built on Velox, covering the current state of vectorization, Flex’s architecture (Flink + Velox), core feature development, correctness guarantees, large‑scale deployment results, and future directions for full‑link vectorization and broader hardware support.

Big Data · Flex · Flink

0 likes · 18 min read

Architecture Digest

Feb 12, 2026 · Operations

How to Build a Scalable Kube‑Prometheus Monitoring Stack for Big Data on Kubernetes

This article explains how to design and implement a robust monitoring solution for big‑data components running on Kubernetes using Prometheus, covering metric exposure methods, scrape configurations, alerting architecture, custom exporters, and practical deployment tips.

Alertmanager · Big Data · Exporter

0 likes · 18 min read

DataFunSummit

Feb 8, 2026 · Big Data

Kuaishou’s Data Lake Upgrade with Hudi: Solving AI & BI Challenges

The article explains how Kuaishou modernized its data lake by partnering with Apache Hudi to address latency, storage cost, and consistency issues in both AI and BI pipelines, detailing architectural changes, new ingestion tools, partitioning strategies, compaction mechanisms, performance gains and future plans.

AI · BI · Big Data

0 likes · 20 min read

DataFunSummit

Feb 7, 2026 · Big Data

How Flink Enables Real‑Time AI Inference and Agent Construction

This article explains Apache Flink’s stream processing fundamentals, introduces the open‑source Flink Agents framework for building event‑driven AI agents, details Alibaba Cloud’s Flink AI Function for real‑time LLM inference, and showcases demos, architecture, integration patterns, and practical use cases such as VOC analysis, live‑stream analytics, and intelligent operations.

Apache Flink · Big Data · Cloud Computing

0 likes · 24 min read

Alibaba Cloud Big Data AI Platform

Feb 4, 2026 · Big Data

How Paimon + StarRocks Power Real‑Time OLAP for Double‑11 Mega‑Sales

During Double‑11 mega‑sales, Taotian Group faced exploding OLAP query traffic, costly data sync pipelines, and slow near‑real‑time analytics, so it unified real‑time and batch data in Paimon, leveraged StarRocks for high‑performance lake queries, tuned cluster settings, and saved nearly ten million yuan annually while cutting refresh latency by 80%.

Big Data · Data Lake · OLAP

0 likes · 22 min read

Alibaba Cloud Big Data AI Platform

Feb 2, 2026 · Big Data

Real‑Time Analytics with Alibaba Cloud Serverless Spark & Paimon for Taobao Flash Sale

This article details how Alibaba Cloud EMR Serverless Spark combined with the Paimon lakehouse framework enables Taobao Flash Sale’s retail data team to achieve low‑latency, high‑throughput real‑time analytics, batch processing, and feature generation, outlining architecture evolution, performance gains, and practical Spark tuning techniques.

Big Data · Lakehouse · Paimon

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Feb 2, 2026 · Big Data

How We Built a Scalable Lakehouse Architecture with StarRocks, Paimon, and Flink

This article details the evolution of a data warehouse at RenliJia from a MaxCompute‑centric setup to a modern lakehouse using StarRocks, Paimon, Flink, and Fluss, describing design goals, technical evaluations, implementation steps for offline, OLAP, and real‑time workloads, and the challenges and future plans that emerged.

Big Data · Data Warehouse · Flink

0 likes · 25 min read

Big Data Tech Team

Feb 2, 2026 · Big Data

Choosing the Right Data Sync Tool: Sqoop vs DataX vs Flink CDC vs Airbyte

This article analyzes the architecture, sync modes, latency, scalability, usability, and deployment aspects of four popular data synchronization solutions—Sqoop, DataX, Flink CDC, and Airbyte—and provides a practical decision tree to avoid common misuse pitfalls in enterprise data pipelines.

Airbyte · Big Data · Data synchronization

0 likes · 9 min read

Raymond Ops

Jan 30, 2026 · Big Data

Build an Enterprise‑Grade HDFS HA and YARN Scheduler from Scratch

This guide walks you through designing and deploying a highly available HDFS architecture with dual NameNodes, ZooKeeper‑based failover, and a tuned YARN resource scheduler, covering detailed configuration files, failover testing, performance tuning, monitoring, automated health checks, capacity planning, and best‑practice checklists for production‑grade big‑data platforms.

Automation · Big Data · HA

0 likes · 28 min read

Radish, Keep Going!

Jan 30, 2026 · Big Data

How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud lakes by redesigning Hadoop Distcp, moving intensive tasks to the Application Master, parallelising copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.

Big Data · Data Replication · Distcp

0 likes · 17 min read

Data Party THU

Jan 29, 2026 · Big Data

How a Tsinghua Big Data Program Turned a Chemistry PhD into an AI‑Powered Process Engineer

This article recounts a Tsinghua University PhD student's journey through a multidisciplinary big‑data training program, detailing the acquisition of AI and data‑science skills, the creation of novel algorithms like MicroFlowSAM and ImageRAG, and their successful application to chemical engineering research and industry projects.

Big Data · Chemical Engineering · Industrial Application

0 likes · 8 min read

Big Data Tech Team

Jan 22, 2026 · Industry Insights

Top 10 Open‑Source Data Visualization Platforms You Should Know

This article presents a concise overview of ten popular open‑source data visualization tools—including ECharts, D3.js, Grafana, Plotly, Redash, Metabase, Superset, Kibana, AntV, and pyecharts—highlighting their main features, typical use cases, and visual examples to help readers choose the right solution for their needs.

Big Data · D3.js · Data Visualization

0 likes · 6 min read

Ray's Galactic Tech

Jan 22, 2026 · Big Data

Export 1 Billion Elasticsearch Docs in 3 Hours Using PIT + Slice

This guide explains how to reliably export over a billion Elasticsearch documents within a few hours by using Point‑In‑Time (PIT) snapshots combined with parallel Slice processing, covering diagnostics, performance modeling, consistency levels, failure recovery, and resource isolation.

Big Data · Elasticsearch · PIT

0 likes · 7 min read

StarRocks

Jan 22, 2026 · Big Data

How Paimon + StarRocks Cut Double‑11 OLAP Refresh Time by 80%

This article explains how Taotian Group unified real‑time and offline data using Paimon as lake storage and StarRocks for high‑performance OLAP, eliminating costly sync pipelines, cutting refresh time by about 80%, and saving nearly ten million yuan annually. It also details the architecture, cluster safeguards, configuration tweaks, monitoring, and future roadmap for large‑scale promotional events.

Big Data · Data Architecture · OLAP

0 likes · 24 min read

DataFunSummit

Jan 18, 2026 · Big Data

How Ray Reinvents AI Data Pipelines for Massive Multimodal Inference

This article examines the shortcomings of traditional big‑data engines for AI workloads, presents a Ray‑based heterogeneous fusion architecture that unifies CPU/GPU scheduling, Python ecosystems, and streaming‑batch processing, and details fault‑tolerance, checkpointing, compute‑storage separation, resource‑utilization, scalability, and observability improvements that enable thousands of nodes and dramatically higher GPU efficiency.

Big Data · Cloud Native · Distributed Computing

0 likes · 31 min read

Mike Chen's Internet Architecture

Jan 18, 2026 · Big Data

Mastering Kafka High Availability: Replication, Leader‑Follower, ISR, and Ack Strategies

This article explains Kafka's high‑availability architecture, covering multi‑replica replication, leader‑follower election and failover, the role of In‑Sync Replicas, and producer acknowledgment settings with min.insync.replicas for reliable, zero‑data‑loss streaming.

Ack Strategy · Big Data · High Availability

0 likes · 4 min read

ByteDance Data Platform

Jan 15, 2026 · Artificial Intelligence

Why Model Evaluation Can Be Cool: Innovative Automated Testing for Data‑Driven LLM Agents

In the era of rapidly advancing large‑model technology, the article outlines the challenges of evaluating data‑centric LLM agents, proposes a three‑layer evaluation framework covering basic capabilities, component‑level checks, and end‑to‑end business impact, and shares practical innovations such as semantic‑equivalence SQL matching, agent‑as‑judge pipelines, and a unified assessment platform.

Agent as judge · Big Data · Data Agent

0 likes · 22 min read

StarRocks

Jan 15, 2026 · Artificial Intelligence

How AI‑First Lakehouse Redefines Data Platforms for Multimodal Analytics

The article outlines the evolution from traditional OLAP to an AI‑first Lakehouse, detailing unified multimodal storage, CPU/GPU heterogeneous scheduling, native vector search, in‑database AI inference, agent‑centric execution, and self‑evolving platform capabilities that together reshape modern data analytics.

AI · Big Data · In‑Database Inference

0 likes · 11 min read

AsiaInfo Technology: New Tech Exploration

Jan 6, 2026 · Industry Insights

Apache Paimon: Boosting Real-Time Data Lakes for Fraud Detection & Manufacturing

This article examines Apache Paimon’s innovative lakehouse architecture, detailing its LSM‑Tree storage, flexible merge engine, and multi‑engine integration, and showcases two real‑world deployments—an operator’s real‑time fraud‑prevention system and a manufacturing firm’s unified data platform—highlighting performance gains and cost reductions.

Apache Paimon · Big Data · Case Study

0 likes · 15 min read

Alibaba Cloud Big Data AI Platform

Jan 5, 2026 · Big Data

How Xunlei Boosted Data Processing with Alibaba Cloud EMR Serverless Spark

This article details Xunlei's migration from a fixed Hadoop cluster to Alibaba Cloud EMR Serverless Spark, outlining the platform's background, pain points, technical upgrade goals, serverless capabilities, archive data access methods, Kyuubi integration, and the resulting business and technical benefits.

Big Data · Cloud Computing · EMR

0 likes · 11 min read

JD Retail Technology

Jan 5, 2026 · Big Data

How JD’s Data Lake Uses Hudi LSM‑Tree to Power Near‑Real‑Time Data Assets

The article details JD’s data lake architecture, its 500 PB scale, self‑developed Hudi extensions—including LSM‑Tree‑based MoR tables, custom indexing, IO optimizations, Flink stream scheduling, and NativeIO SDK—along with benchmarks, community contributions, and future roadmap for real‑time big‑data processing.

Big Data · Data Lake · Hudi

0 likes · 19 min read

Alibaba Cloud Big Data AI Platform

Dec 31, 2025 · Big Data

Build a Scalable AI Data Pipeline Using DataWorks, MaxCompute & MaxFrame

This guide walks you through setting up a secure, elastic, and high‑performance AI data processing platform on Alibaba Cloud by combining DataWorks, MaxCompute, and MaxFrame, covering the four essential steps, code examples, best‑practice tips, and common troubleshooting advice.

AI · Big Data · Cloud Computing

0 likes · 10 min read

Past Memory Big Data

Dec 29, 2025 · Industry Insights

How Chinese Open‑Source Projects Dominated Half of 2025 Apache Top‑Level Projects

In 2025, five Apache Top‑Level Projects with Chinese origins—Uniffle, StreamPark, Gravitino, DevLake and HertzBeat—emerged, illustrating a shift toward central, platform‑oriented solutions driven by growing system scale, engineering complexity, and collaborative costs rather than a deliberate national agenda.

Big Data · Cloud Native · Top-Level Projects

0 likes · 7 min read

Big Data Tech Team

Dec 29, 2025 · Big Data

Master Big Data Development: A Complete Roadmap from Beginner to Expert

This guide presents a comprehensive big‑data development roadmap, detailing industry opportunities, a six‑module technology stack, four progressive learning stages, hands‑on project ideas, interview question strategies, common pitfalls, and curated resources, helping aspiring engineers become proficient and interview‑ready while avoiding common mistakes.

Big Data · Data Engineering · Roadmap

0 likes · 11 min read

Big Data Tech Team

Dec 26, 2025 · Interview Experience

How to Nail a 2‑Minute Data Engineer Self‑Introduction

This guide outlines a concise, 1.5‑2‑minute self‑introduction for data engineering interviews, highlighting essential personal details, technical stack, project achievements, business impact, and common pitfalls to avoid, with a concrete example and actionable tips.

Big Data · Career Advice · Data Engineering

0 likes · 5 min read

Big Data Tech Team

Dec 25, 2025 · Big Data

How to Build an End‑to‑End E‑Commerce Data Warehouse for Interview Success

This guide walks you through designing and implementing a complete e‑commerce data‑warehouse project—from raw data ingestion and ODS/DWD/DWS/ADS layers to optional real‑time analytics—while highlighting interview‑ready resume tips, common pitfalls, and performance‑tuning tricks.

Big Data · ETL · Flink

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Dec 24, 2025 · Big Data

How Paimon’s Column‑Separation Architecture Powers Real‑Time Multi‑Modal Lakehouse for AI

This article explains the challenges of frequent column changes in AI feature engineering, introduces Paimon’s column‑separation storage with a global continuous Row ID, details its Blob data type for efficient multi‑modal handling, and outlines production results and future roadmap for building an AI‑native data lakehouse.

Apache Paimon · BLOB · Big Data

0 likes · 11 min read

dbaplus Community

Dec 20, 2025 · Big Data

From Data Lakes to DataOps: Unveiling the Hidden Challenges of Data Governance

The article walks through the evolution of data management—from idealistic visions and messy “shit mountains” to the realities of data lakes, metadata layers, governance challenges, trust breakdowns, and finally the promise of DataOps as a hopeful path forward.

Big Data · Data Governance · Data Lake

0 likes · 3 min read

DataFunTalk

Dec 17, 2025 · Artificial Intelligence

How Large Language Models Unlock Field‑Level Data Lineage at Scale

This talk explains how a data platform tackled massive, heterogeneous enterprise data by using large language models and prompt engineering to automatically extract field‑level lineage from SQL scripts, achieve over 80% coverage, and raise accuracy above 95%, dramatically cutting impact‑analysis time.

AI for data engineering · Big Data · Large Language Model

0 likes · 6 min read
