Big Data

May 20, 2026 · Big Data

Why 90% of Companies Get Data Governance Wrong and How to Reduce Friction

Most data‑governance initiatives fail not for lack of technology but because they add friction. The article explains how companies mistakenly focus on rules, platforms, and processes, and offers a step‑by‑step approach (identifying high‑value tables, keeping metadata minimal, writing targeted quality rules, and diagnosing issues quickly) to make governance truly useful.

Big Data · Data Governance · Data Quality

0 likes · 29 min read

Big Data Tech Team

May 19, 2026 · Big Data

Enterprise Data Warehouse Development Playbook: Standard Engineering Edition

This playbook provides enterprise‑level data warehouse engineers, ETL developers, data modelers, and data‑team managers with a complete, logical, and actionable set of standards, processes, and best‑practice guidelines covering architecture, development principles, role responsibilities, end‑to‑end workflow, metadata, security, performance metrics, and team collaboration.

Data Modeling · Data Quality · ETL

0 likes · 18 min read

dbaplus Community

May 14, 2026 · Big Data

Building a ‘One‑Sentence Bank’: Big Data and AI Fusion for Small Banks

The article outlines the evolution of big data in banking, compares management models for heterogeneous data, describes the shift from data engineering to knowledge engineering, introduces LLMOps for high‑quality knowledge bases, and details how integrating AI and data can enable a “one‑sentence bank” that answers queries and executes tasks.

Artificial Intelligence · Big Data · Data Governance

0 likes · 22 min read

Cloud Architecture

May 14, 2026 · Big Data

Real‑Time Member Level Calculation at Trillion‑Event Scale: A Production‑Ready Apache Flink Architecture

This article walks through the challenges of computing membership tiers in real time for trillion‑event traffic, explains why traditional batch pipelines fall short, and presents a complete production‑grade Apache Flink design—including event modeling, state layout, bucket aggregation, rule hot‑updates, exactly‑once guarantees, and operational monitoring.

0 likes · 26 min read

DataFunSummit

May 14, 2026 · Big Data

How Gravitino, Daft, and Lance Enable Secure, AI‑Driven Multimodal Lakehouse

The article examines the challenges of multimodal data in modern lakehouses and presents a three‑tool stack—Gravitino, Daft, and Lance—that provides unified metadata, distributed multimodal compute, and high‑performance storage, while detailing security governance, integration paths, and future directions.

Daft · Gravitino · Lakehouse

0 likes · 11 min read

vivo Internet Technology

May 13, 2026 · Big Data

How Vivo Upgraded a Million‑Node YARN Cluster: Architecture, Scheduler Switch, and Performance Optimizations

This article details Vivo's end‑to‑end upgrade of a YARN 2.6.0 cluster to a modern version for a million‑node, hundred‑thousand‑tasks‑per‑day platform, covering architectural evolution, scheduler migration, compatibility fixes, performance tuning, and service‑continuity strategies.

Big Data · Capacity Scheduler · Hadoop

0 likes · 28 min read

DeWu Technology

May 13, 2026 · Big Data

How BP Claw Solves AI Coding Input Challenges in FlinkSpec’s Real‑Time Data Warehouse

The article explains how BP Claw tackles unstable AI coding results by automatically converting low‑quality PRD documents into structured, high‑quality requirements, applying token‑saving strategies, strict hallucination guards, and multi‑skill orchestration, which together boost FlinkSpec’s real‑time data‑warehouse delivery efficiency by up to 30%.

AI coding · BP Claw · Big Data

0 likes · 17 min read

Architect's Guide

May 13, 2026 · Big Data

Next‑Gen Visual Drag‑Drop Data Flow Platform: Features, Architecture, and Performance

The article introduces a visual drag‑and‑drop data flow platform that unifies stream and batch processing, offers version control, automatic fault tolerance, configurable data permissions, comprehensive monitoring, data alignment, and query templates, and presents single‑instance performance benchmarks of over 30k and 60k ops/s.

Data Alignment · Data Flow · Drag-and-Drop

0 likes · 7 min read

DataFunTalk

May 11, 2026 · Big Data

How Xiaohongshu Re‑engineered Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from simple ClickHouse‑based ad‑hoc analysis to a Lambda‑style architecture and finally to a lakehouse built on Iceberg, StarRocks, Flink and Spark, cutting architecture complexity, resource cost, and development cost by two‑thirds while supporting trillions of daily events with sub‑second query latency.

Big Data · ClickHouse · Flink

0 likes · 22 min read

DataFunSummit

May 10, 2026 · Big Data

How Lance File Format v2.2 Accelerates, Cuts Costs, and Governs Multimodal Data

Lance File Format v2.2 tackles the AI data explosion by delivering a hundred‑fold improvement in random‑read performance, advanced two‑layer compression, zero‑cost schema evolution, Git‑style versioning, external blob handling, and a roadmap toward native media support and intelligent encoding, positioning it as core infrastructure for large‑scale multimodal workloads.

Compression · Data Governance · File Format

0 likes · 14 min read

Zhihu Tech Column

May 9, 2026 · Big Data

How Zhihu Built a Unified OneID System to Consolidate Fragmented User Identities

Zhihu created a unified OneID framework that merges scattered account, device, and behavior data into a globally unique identifier, using strong and weak IDs, graph‑based connectivity, device governance, and a device half‑life model to improve recommendation, push, and advertising effectiveness.

Big Data · Device Governance · Graph Computation

0 likes · 11 min read

StarRocks

May 8, 2026 · Big Data

Scaling Real‑Time Analytics at KaptureCX: Best Practices with RisingWave and StarRocks

KaptureCX migrated its core analytics from ClickHouse to StarRocks, introduced RisingWave and Kafka for CDC, and achieved millisecond‑level query latency, a reporting cycle cut from weeks to one day, and a solid data foundation for AI‑driven services.

Kafka · MVP · RisingWave

0 likes · 11 min read

DataFunTalk

May 8, 2026 · Big Data

How MaxCompute Evolves into a Data+AI Platform: Architecture, Core Capabilities, and Real-World Cases

The article explains how Alibaba Cloud's MaxCompute has been transformed into a cloud‑native Data+AI platform, detailing its layered architecture, multimodal storage, model management, hybrid compute scheduling, SQL AI functions, the MaxFrame Python framework, and several enterprise case studies that demonstrate performance gains and flexible resource orchestration.

Big Data · Data+AI · MaxCompute

0 likes · 11 min read

DataFunTalk

May 6, 2026 · Big Data

How Xiaohongshu Evolved Its Data Architecture for the Big AI Data Era

The article details Xiaohongshu's four‑stage data‑platform evolution, from a simple ClickHouse ad‑hoc setup to a Lambda‑based 2.0 design and finally a lakehouse‑driven 3.0 architecture, highlighting the adoption of general incremental compute, a cost reduction to one‑third, performance gains of up to ten‑fold, and the SPOT standards that guide the new system.

Big Data · ClickHouse · Data Architecture

0 likes · 21 min read

DataFunSummit

May 5, 2026 · Big Data

A New Data Lake Paradigm: Volcano Engine’s Multi‑Modal Data Lake Built on Lance

The article presents Volcano Engine’s AI‑focused data lake built on the Lance format, detailing why traditional lakes fall short for multimodal data, the engineering enhancements such as Binary Copy Compaction, Lance Insight, distributed vector indexing, JSON‑based tagging, Row‑ID shuffle optimization, and real‑world case studies that demonstrate significant performance and cost gains.

AI · Binary Copy Compaction · Data Lake

0 likes · 18 min read

Cloud Architecture

May 3, 2026 · Big Data

Cutting Log Storage Costs 70% for 100 Billion Daily Logs: A Full Guide to Hot‑Cold Separation Architecture

This article explains why massive log systems must adopt hot‑cold separation, walks through the problem analysis, SLO definition, component design, Kafka partition planning, Elasticsearch and ClickHouse tuning, Parquet archiving, a unified query gateway with async cold queries, governance practices, cost modeling, common pitfalls, and a roadmap for evolving the platform.

Kafka · elasticsearch · hot cold separation

0 likes · 40 min read

DataFunTalk

May 2, 2026 · Big Data

Building a One-Person Data Team: Core Skills of a Full‑Stack Data Engineer

The article examines why a single data engineer can run an end‑to‑end data team, outlines the essential abilities—semantic ownership, building an agentic data stack, and leveraging historical context—while discussing ChatBI’s limits, validation loops, and the open‑source Datus 0.3 harness for practical implementation.

Agentic AI · ChatBI · Data Engineering

0 likes · 14 min read

DataFunTalk

Apr 29, 2026 · Big Data

How Xiaohongshu Revamped Its Data Architecture for the Big AI Data Era

Xiaohongshu transformed its data platform from a simple ClickHouse‑based analytics stack to a unified lakehouse with generic incremental compute, cutting architecture complexity, resource cost, and development effort by roughly one‑third while supporting petabyte‑scale, sub‑second queries across its 350 million‑user app.

Big Data · ClickHouse · Data Architecture

0 likes · 22 min read

Lao Guo's Learning Space

Apr 29, 2026 · Big Data

Designing a Full-Stack Credit Data System: From Ingestion to Real-Time Decision

The article dissects a credit data system architecture, detailing six logical layers—from multi-source data collection and feature engineering (including graph features and feature stores) to model training, real‑time stream processing, decision engine integration, and privacy‑preserving computation—while explaining the trade‑offs, tools, and performance targets needed for accurate, low‑latency risk assessment.

Credit Scoring · Feature Store · Flink

0 likes · 16 min read

Model Perspective

Apr 28, 2026 · Big Data

How a Taiwan Ban Became Free Advertising for Amap’s Map App

A recent Taiwan government warning against Amap turned into a viral boost, exposing the app’s superior traffic‑light countdown, massive data‑driven network effects, and the underlying reverse‑propagation model that explains why the ban accelerated downloads rather than suppressing them.

Amap · Big Data · Network Effects

0 likes · 11 min read
