Big Data


Jun 6, 2026 · Big Data

Why Has the Term “Big Data” Suddenly Disappeared?

Although data production continues to surge—reaching 52.26 ZB in 2025—the “big data” label is fading because its original narrative of scale as value has run out, exposing a credit‑and‑responsibility gap that forces organizations to demand concrete business impact rather than mere infrastructure.

AI impact · Big Data · Data Governance

0 likes · 15 min read


Alibaba Cloud Big Data AI Platform

Jun 4, 2026 · Big Data

Scalar‑Vector Hybrid Search in a Data Lake with One SQL on EMR Serverless Spark

EMR Serverless Spark now supports scalar‑vector hybrid search via DLF Global Index, allowing a single Spark SQL statement to perform vector similarity and scalar filtering together, eliminating data movement, reducing latency, and boosting performance for scenarios such as autonomous driving, e‑commerce, and knowledge‑base retrieval.

Big Data · DLF Global Index · EMR Serverless Spark

0 likes · 17 min read


dbaplus Community

Jun 3, 2026 · Big Data

Boosting SQL Compliance to 95%: Harness Solves AI’s “Memory Loss” in Data Warehouse

The article analyzes the challenges of AI‑generated SQL in a data‑warehouse environment—context loss, unstable rule enforcement, and token overflow—and presents a five‑layer Harness architecture that persists constraints, injects hooks, uses subagents, and refactors SKILL files, raising SQL compliance from 70‑80% to over 95% while reducing context compacting.

AI · Automation · Data Warehouse

0 likes · 26 min read


Spring Full-Stack Practical Cases

Jun 2, 2026 · Big Data

Millisecond‑Level Real‑Time Sync from MySQL to Elasticsearch with Flink CDC

This guide walks through setting up a Spring Boot 3.5 environment, configuring Flink 1.20 and Flink CDC 3.5, preparing MySQL tables, and using both the Flink CDC CLI and SQL client to achieve near‑millisecond synchronization of data from MySQL to Elasticsearch, including custom sink programming and real‑time monitoring via the Flink Web UI.

Apache Flink · Flink CDC · MySQL

0 likes · 14 min read


AI Large-Model Wave and Transformation Guide

May 29, 2026 · Big Data

How to Solve Data Governance + AI Agent Pitfalls: Agent Roles, NL2SQL Datasets, and Rule Templates Explained

The article analyzes why data‑governance projects still fail when combined with AI, presents a four‑layer NL2SQL architecture, details agent responsibilities, metadata‑governance methods, anomaly‑diagnosis and permission‑control flows, outlines dataset‑building stages, evaluation metrics, and provides a step‑by‑step rollout roadmap.

AI Agent · Anomaly Detection · Data Governance

0 likes · 21 min read


DataFunTalk

May 28, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based ad‑hoc analysis setup to a Lambda‑style architecture and finally to a lakehouse with generic incremental compute, cutting architecture complexity, resource costs, and development costs by one‑third while delivering queries that return in seconds over trillions of rows.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read


DataFunTalk

May 26, 2026 · Big Data

How MaxCompute Evolves into an AI‑Ready Data Platform: Architecture, Core Capabilities, and Real‑World Cases

The article details MaxCompute's transformation into a cloud‑native, AI‑centric data warehouse, covering multi‑modal storage, model management, heterogeneous CPU/GPU scheduling, SQL AI functions, the MaxFrame Python framework, and several production case studies that demonstrate performance gains of up to 50% and elastic resource scaling to 160,000 cores.

Data+AI · Distributed Computing · Large‑model preprocessing

0 likes · 13 min read


Big Data Technology & Architecture

May 26, 2026 · Big Data

Advanced Paimon Production Issues: 10 Rare Compaction‑Related Problems and Fixes

This article enumerates ten uncommon, compaction‑related problems encountered in large‑scale Paimon deployments, explains their root causes—such as RPC timeouts, snapshot expiration, file corruption, and write conflicts—and provides concrete configuration tweaks and operational steps to resolve each issue.

Big Data · Compaction · Flink

0 likes · 9 min read


Big Data Tech Team

May 25, 2026 · Big Data

AI Large Models Meet Data Warehouses: 3 Core Use Cases, 5 Common Pitfalls, and Best Practices

The article analyzes how AI large models can transform data‑warehouse development through three practical scenarios—automated modeling, intelligent data cleaning, and ops optimization—while exposing five frequent implementation traps and offering concrete best‑practice recommendations to achieve cost reduction, efficiency gains, and quality improvement.

AI large models · Automated modeling · Data Warehouse

0 likes · 10 min read


DataFunSummit

May 25, 2026 · Big Data

How Hisense Built an AI‑Ready Multimodal Data Platform: Storage, Governance, and Development

This article details Hisense's journey to create an AI‑ready multimodal data platform, covering the challenges of integrating diverse business systems, the shift from a Hadoop‑based architecture to a cloud‑native data lake, the JuData governance and development platform, and six practical scenarios that demonstrate unified ingestion, metadata management, rule‑based quality control, intelligent asset retrieval, and future AI‑driven DataOps capabilities.

AI platform · Data Governance · Data Lake

0 likes · 23 min read


DataFunTalk

May 25, 2026 · Big Data

MaxCompute’s AI‑Ready Evolution: Architecture, Features, and Real‑World Use Cases

This article examines how Alibaba Cloud’s MaxCompute platform has been transformed for AI workloads, detailing its multi‑layer architecture, multimodal data storage, SQL AI functions, the Python‑based MaxFrame framework, and real‑world deployments in large‑model preprocessing, autonomous driving, and multimodal image labeling.

AI · Big Data · Distributed Computing

0 likes · 12 min read


Data STUDIO

May 25, 2026 · Big Data

Polars vs Pandas: Is Switching Worth It for Ten‑Million‑Row Datasets?

The article shows that Polars, a query‑compiling DataFrame library, can accelerate ten‑million‑row GroupBy workloads by 6‑10× compared with Pandas, explains the underlying optimizer, Arrow columnar engine and Rust parallelism, provides a 20‑item syntax map, three real migration scenarios, streaming for out‑of‑memory data, and AI‑pipeline use cases, and offers a step‑by‑step migration guide.

Apache Arrow · DataFrames · Pandas

0 likes · 21 min read


Big Data Tech Team

May 24, 2026 · Big Data

Data Warehouse Interview Pitfall Guide 2.0: Avoid Common SQL, Modeling, and ETL Mistakes

This guide compiles the most frequent interview pitfalls for data warehouse roles, covering SQL join and aggregation errors, window function misuse, subquery versus CTE performance myths, dimensional modeling mistakes, SCD implementation traps, layered design issues, data quality handling, ETL traps, Hive and Spark performance questions, real‑time warehousing considerations, and effective interview strategies.

Big Data · ETL · Hive

0 likes · 3 min read


DataFunSummit

May 22, 2026 · Big Data

How OPPO Accelerates Multimodal Data & AI Fusion with Gravitino and Curvine

OPPO tackles explosive multimodal data growth by unifying metadata with Gravitino and boosting I/O performance using the open‑source Curvine cache, delivering a four‑layer data‑lake architecture that resolves data islands, metadata chaos, and bandwidth bottlenecks while achieving near‑commercial query speeds.

Curvine · Gravitino · LanceDB

0 likes · 11 min read


DataFunTalk

May 22, 2026 · Big Data

How Xiaohongshu Cut Data Architecture Complexity and Cost by One‑Third in the Big AI Data Era

The article details Xiaohongshu's evolution from a simple ClickHouse‑based analytics layer to a Lambda‑enabled 2.0 stack and finally a Lakehouse‑based 3.0 architecture, showing how each iteration reduced infrastructure complexity, resource consumption and development effort by roughly one‑third while supporting trillions of daily events and AI‑driven use cases.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read


DataFunTalk

May 21, 2026 · Big Data

How Bitmap‑Based High‑Table Architecture Powers Million‑Scale User Profiling and Real‑Time Crowd Selection

The article explains how a bitmap‑driven high‑table design (SelectDB) overcomes wide‑table storage bloat and latency to enable millisecond‑level crowd selection for tens of millions of users with hundreds of tag dimensions, while supporting dynamic tag expansion.

Bitmap · SelectDB · crowd selection

0 likes · 2 min read


DataFunSummit

May 21, 2026 · Big Data

Alibaba Cloud’s Agent-Ready Big Data AI Infrastructure: Boosting Data Development from Hours to Minutes

Facing a projected 85% of enterprises deploying internal agents within two years, Alibaba Cloud proposes an Agent-Ready big‑data AI infrastructure—comprising a unified data lake, real‑time processing, high‑dimensional vector retrieval, elastic model serving, and comprehensive security governance—that has already cut data‑development cycles from hours to 5‑10 minutes in internal model‑training and Taobao flash‑sale scenarios.

AI · Agent-Ready · Big Data

0 likes · 15 min read


DataFunSummit

May 20, 2026 · Big Data

How Kuaishou’s Real‑Time Data Lake Boosts AI and BI Architecture

The article explains how Kuaishou partnered with Apache Hudi to overhaul its ODS‑based data lake, addressing latency, storage cost, and complexity for AI and BI workloads. It details the evolution from MySQL‑to‑Hive to MySQL‑to‑Hudi 1.0 and 2.0, the resulting performance gains and cost savings, and the future roadmap.

AI · BI · Big Data

0 likes · 20 min read


StarRocks

May 20, 2026 · Big Data

How StarRocks, Paimon, and Fluss Enable Multimodal Fusion Search in a Lakehouse

The Streaming Lakehouse Meetup (May 27) explores breaking data silos by unifying structured tables, images, video, audio, and high‑dimensional vectors through StarRocks‑Paimon‑Fluss integration, covering multimodal fusion retrieval, vector search internals, native reader/writer performance gains, and real‑world ANN indexing practices.

Fluss · Lakehouse · Paimon

0 likes · 5 min read


AntTech

May 20, 2026 · Big Data

SIGMOD 2026: Shared Computation for Query Subgraph Matching & Fast MPC Shortest Paths

This article reviews two SIGMOD 2026 papers—MASC, which redefines multi‑query subgraph matching by maximizing shared computation to achieve up to two orders of magnitude speedup, and PrivHop, which combines 2‑hop labeling with secure multi‑party computation to enable privacy‑preserving shortest‑path queries on million‑node graphs with roughly a million‑fold reduction in runtime and communication.

MPC · graph algorithms · privacy-preserving

0 likes · 5 min read
