Tagged articles
3672 articles
Big Data Technology & Architecture
Jul 16, 2025 · Big Data

Master Flink Optimizations: TTL, Mini‑Batch, Two‑Phase Aggregation, Lookup Join & More

This article reviews the most effective Flink optimization techniques since 2022, including operator‑level TTL, mini‑batch processing, two‑phase aggregation, multi‑dimensional DISTINCT with FILTER, lookup join caching strategies, and TopN implementations, each rated with recommendation stars for production use.

Big Data · Flink · Lookup Join
0 likes · 8 min read
Alibaba Cloud Big Data AI Platform
Jul 15, 2025 · Big Data

How MaxCompute’s Append DeltaTable Transforms BigQuery Migration

This article details the complex migration of a leading Southeast Asian tech group's data warehouse from Google BigQuery to Alibaba Cloud MaxCompute, outlining challenges such as storage format differences, SQL compatibility, and performance tuning, and explains how the new Append DeltaTable format with dynamic bucketing and incremental reclustering resolves these issues.

Big Data · Data Migration · Data Warehouse
0 likes · 19 min read
IT Architects Alliance
Jul 10, 2025 · Cloud Native

Inside Alibaba’s Tech Stack: Cloud‑Native Architecture Behind Billions of Transactions

This article examines Alibaba's extensive cloud‑native technology stack—including distributed computing, storage, middleware, real‑time data processing, AI platforms, performance engineering, and security—revealing how its architects design systems that handle massive transaction volumes during events like Double 11.

Big Data · Distributed Systems · Microservices
0 likes · 12 min read
IT Architects Alliance
Jul 8, 2025 · Cloud Native

Why Do Big‑Tech Architects Earn Six Figures? The Skills That Set Them Apart

The article explores why architects at leading tech firms command six‑figure salaries while those in traditional companies earn far less, highlighting gaps in technical depth, massive data handling, performance optimization, business insight, continuous learning, and the scarcity of true senior architects.

Big Data · Career Development · Distributed Systems
0 likes · 9 min read
Model Perspective
Jul 8, 2025 · Big Data

Why Historical Data Can Mislead Your Forecasts—and What to Do Instead

The article explains how relying solely on historical data for prediction often leads to large errors because future structural changes and missing variables are ignored, and it proposes causal modeling, scenario simulation, and real‑time signals as more reliable forecasting approaches.

Big Data · causal modeling · forecasting
0 likes · 9 min read
Big Data Technology & Architecture
Jul 8, 2025 · Big Data

Flink’s AI Agents and Disaggregated State: Transforming Big Data

The article reviews key topics from the FFA2025 Singapore conference, highlighting Flink’s new AI‑focused Agents framework, the breakthrough Flink 2.0 disaggregated state architecture, emerging lake storage solutions like Paimon, and the Fluss streaming table store, illustrating how big‑data platforms are evolving for AI workloads.

AI agents · Big Data · Data Lake
0 likes · 6 min read
DataFunTalk
Jul 7, 2025 · Big Data

Unlock Real-Time Analytics with Cloud Lakehouse: A Complete Guide

This article presents a curated list of sessions covering cloud Lakehouse technology for real-time, multidimensional data analysis, including case studies from SalesEasy, Changan Auto, Tencent, and JD, as well as discussions on data lake adoption, streaming lake Paimon, and the relevance of metadata‑driven data governance in the digital economy.

Big Data · Case Study · Data Governance
0 likes · 2 min read
DataFunTalk
Jul 6, 2025 · Big Data

How Cloud Lakehouse Is Redefining Real-Time Multi-Dimensional Data Analytics

This article presents a curated list of case studies and insights on cloud Lakehouse technology, covering real-time intelligent analytics, data architecture simplification, IoT big‑data platforms, integrated data platforms, and the evolving role of metadata‑driven data governance in the digital economy.

Big Data · Case Studies · Data Governance
0 likes · 2 min read
FunTester
Jul 5, 2025 · Big Data

Master Kafka: Core Concepts and Performance Testing Strategies

This article explains Kafka’s high‑performance distributed streaming architecture, key components such as topics, partitions, producers, consumers, brokers, offsets, and ZooKeeper, and provides step‑by‑step workflows for producers and consumers along with performance‑testing tips and Maven setup.

Big Data · Java · Kafka
0 likes · 9 min read
360 Tech Engineering
Jul 4, 2025 · Artificial Intelligence

How AI is Revolutionizing Security Operations: Insights from the 2025 Global Digital Economy Conference

The 2025 Global Digital Economy Conference highlighted the fusion of big data and AI in security, revealing both the transformative potential of large‑model technologies for operational efficiency and the critical challenges they pose, while showcasing 360's AI‑native platform and measurable performance gains.

AI security · Big Data · Digital Transformation
0 likes · 5 min read
Baidu Geek Talk
Jul 2, 2025 · Big Data

Baidu’s Secret to Faster Search Data: Wide‑Table Modeling & Fusion Engine

This article outlines Baidu’s innovative approach to building its search data platform, detailing the design of wide‑table models, the upgrade to a Spark‑based fusion computation engine, and the new Turing 3.0 service delivery framework, which together deliver higher efficiency, lower cost, and faster, more reliable analytics.

Big Data · Data Warehouse · Fusion Engine
0 likes · 21 min read
Mike Chen's Internet Architecture
Jul 1, 2025 · Big Data

Master ElasticSearch: Core Concepts, Architecture, and Search Workflow Explained

This article provides a comprehensive overview of ElasticSearch, covering its definition, core components such as indexes, shards and replicas, the analysis pipeline, inverted index mechanics, and the two‑stage search process that enables scalable, fault‑tolerant full‑text search in big‑data environments.

Analyzers · Big Data · Distributed Search
0 likes · 7 min read
Big Data Technology & Architecture
Jul 1, 2025 · Big Data

What’s New in Apache Hive 4.0? Key Features and Industry Outlook

After a weekend dive into Apache Hive’s official Wiki and GitHub, this article highlights Hive’s declining visibility compared to Spark and Flink, examines its 4.0 release’s major features—including Iceberg integration, enhanced ACID, cost‑based optimizer upgrades, and Ozone support—while reflecting on its role in modern data ecosystems.

Apache Hive · Big Data · Data Warehouse
0 likes · 4 min read
DataFunSummit
Jun 22, 2025 · Databases

Unlocking Apache Doris: How Lakehouse Integration Supercharges Data Analytics

This article walks through Apache Doris’s lakehouse‑in‑one architecture, explains its core value and paradigm, details the system’s components and use cases, examines technical challenges such as file‑format diversity and I/O stability, and presents a suite of optimizations—from predicate push‑down and partition pruning to metadata caching and dynamic scheduling—that dramatically improve query performance and resource utilization, while also outlining future roadmap plans.

Apache Doris · Big Data · Data Warehouse
0 likes · 22 min read
ITFLY8 Architecture Home
Jun 13, 2025 · Artificial Intelligence

Designing AI-Ready Data Architecture: Key Features and Future Trends

AI-era data architecture must handle massive, multimodal datasets with real-time processing, prioritize data quality over quantity, support scalability, provenance, and native ML/AI integration, while addressing governance, security, and ethical challenges through emerging technologies like data fabric, mesh, and federated learning.

AI · Big Data · Data Architecture
0 likes · 6 min read
DataFunSummit
Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · Data Lake
0 likes · 22 min read
Alibaba Cloud Big Data AI Platform
Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake
0 likes · 12 min read
Lobster Programming
Jun 9, 2025 · Databases

How to Add a Column to Billion‑Row Tables Without Downtime

This article explains a metadata‑driven approach for extending massive tables—using a separate extension table, sharding, and Elasticsearch sync—to add new fields to billion‑row databases without locking the primary table or disrupting online services.

Big Data · Elasticsearch · database schema
0 likes · 6 min read
DataFunSummit
Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background of data, key challenges, award-winning capabilities, three-pronged solutions—automation, linking+, and AI—along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · Automation · Big Data
0 likes · 18 min read
Instant Consumer Technology Team
Jun 5, 2025 · Big Data

Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss

This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.

Big Data · Kafka · Message Queue
0 likes · 15 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
0 likes · 12 min read
Zhuanzhuan Tech
May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a monolithic microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big Data · Data Warehouse · ETL
0 likes · 21 min read
Java Backend Technology
May 21, 2025 · Big Data

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Big Data Technology & Architecture
May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience with core Flink interview questions: typical resource allocation for large online tasks; common data, performance, stability, and resource problems; and monitoring practices for clusters and tasks. It also contains a brief self‑promotion.

Big Data · Flink · Performance Issues
0 likes · 7 min read
Xiaohongshu Tech REDtech
May 19, 2025 · Industry Insights

How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Big Data · Data Lake · Flink
0 likes · 24 min read
Huolala Tech
May 14, 2025 · Big Data

How Lalamove Scaled Real‑Time Data Warehousing with Flink and Paimon

Lalamove’s international logistics platform transformed its real‑time data warehouse by leveraging Apache Flink and the Paimon lakehouse, addressing challenges of multi‑region data centers, time‑zone diversity, frequent upstream changes, and high costs, while improving scalability, latency, and operational efficiency across global markets.

Big Data · Flink · Paimon
0 likes · 13 min read
JD Tech
May 13, 2025 · Databases

Unlock ClickHouse’s Lightning‑Fast Queries: Architecture, Storage, and Index Secrets

This article examines ClickHouse’s high‑performance OLAP design, covering its MPP architecture, columnar storage, vectorized execution, pre‑sorting, table engines, extensive data‑type system, sharding and replication strategies, as well as its sparse and skip‑index mechanisms that together enable ultra‑fast analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage
0 likes · 16 min read
macrozheng
May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Big Data · DataX · ETL
0 likes · 15 min read
Top Architect
May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Big Data · DataX · ETL
0 likes · 18 min read
DataFunSummit
May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, AB testing support, catalog enhancements, and future roadmap for full lifecycle data governance.

Big Data · Data Lake · Huawei Cloud
0 likes · 13 min read
JD Tech
Apr 30, 2025 · Artificial Intelligence

TimeHF: A Billion‑Scale Time Series Forecasting Model Guided by Human Feedback

The JD Supply Chain algorithm team introduces TimeHF, a billion‑parameter time‑series large model that leverages RLHF to boost demand‑forecast accuracy by over 10%, detailing dataset construction, the PCTLM architecture, a custom RLHF framework (TPO), and extensive SOTA experimental results.

Big Data · Deep Learning · RLHF
0 likes · 10 min read
Big Data Tech Team
Apr 28, 2025 · Big Data

Mastering Metadata, Master Data, and Data Governance: A Complete Guide

This article explains the core concepts of metadata, master data, data resources, data governance, and data management, outlines their roles, compares governance with management, and provides practical steps and best‑practice recommendations for building a robust enterprise data framework.

Big Data · Data Governance · Master Data
0 likes · 15 min read
Alibaba Cloud Big Data AI Platform
Apr 27, 2025 · Big Data

Scaling Property Services: StarRocks‑Powered Storage‑Compute Separation for 8000+ Communities

Facing a flood of data from over 8,000 communities, the Bifeng service team migrated from a monolithic storage‑compute architecture to a StarRocks‑based storage‑compute separation solution, achieving lower costs, higher resource utilization, faster queries, and improved SLA across their property management platform.

Big Data · Data Warehouse · Infrastructure Migration
0 likes · 11 min read
Alibaba Cloud Big Data AI Platform
Apr 24, 2025 · Big Data

Boosting Product Recommendations with Serverless Spark and Milvus: A Real‑World Case Study

Chanmama (蝉妈妈) migrated its recommendation platform to Alibaba Cloud Serverless Spark and Milvus, replacing traditional vector search and Spark clusters, achieving 40% faster offline tasks, 80% lower failure rates, significant cost savings, and scalable, low‑latency similar‑product retrieval for personalized marketing.

Big Data · Milvus · recommendation system
0 likes · 8 min read
Big Data Tech Team
Apr 20, 2025 · Industry Insights

Essential Skills & Tech Stacks for Every Data Team Role

This guide breaks down the main positions in a data team— from data development and analysis engineers to product managers and operations specialists—detailing each role’s key responsibilities, essential skill sets, and the typical technology stack they rely on.

Big Data · Data Analytics · data engineering
0 likes · 7 min read
dbaplus Community
Apr 20, 2025 · Databases

Why Wide Tables Fail and How to Design Them Efficiently

This article explains what wide tables are, why they are controversial, outlines three common design pitfalls with practical avoidance tips, and introduces three key technologies—ClickHouse, Cassandra, and Hudi/Iceberg—to help engineers build performant, maintainable wide‑table solutions in data warehouses.

Big Data · ClickHouse · Database design
0 likes · 7 min read
macrozheng
Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big Data · Canal · Data Transfer Service
0 likes · 20 min read
AntTech
Apr 17, 2025 · Artificial Intelligence

Data+AI Forum at the 18th China Electronics Information Conference (2025) – Speaker Bios and Session Summaries

The 18th China Electronics Information Conference will be held in Chengdu from April 17‑21, 2025, featuring the DATA+AI forum that gathers leading academicians and industry experts to discuss data‑AI integration, with detailed speaker biographies, presentation titles, and abstracts covering topics such as large‑model inference, cloud‑edge ultrasound diagnostics, and the future of databases in the AI era.

@Data · AI · Big Data
0 likes · 12 min read
Big Data Technology & Architecture
Apr 17, 2025 · Big Data

MaxCompute: Intelligent Data Warehouse Platform for the Data+AI Era

This article, based on a meetup presentation, details Alibaba Cloud's MaxCompute platform—its evolution, serverless architecture, AI integration, distributed Python framework, Object Table, near‑real‑time processing, and intelligent warehouse features—addressing the challenges of data warehouses in the Data+AI era.

Big Data · Data Warehouse · MaxCompute
0 likes · 11 min read
vivo Internet Technology
Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big Data · Kubernetes · Resource Management
0 likes · 36 min read
dbaplus Community
Apr 15, 2025 · Big Data

How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views

Xiaohongshu introduced logical datasets and materialized views to overcome low reuse of APP tables, limited scalability of single‑table BI datasets, and poor dashboard query performance, achieving higher data processing efficiency and faster query responses through optimized data flow, query pruning, and accelerated ETL scheduling.

Big Data · logical dataset · query optimization
0 likes · 24 min read
DataFunSummit
Apr 13, 2025 · Big Data

Data Governance at Didi: Interview with Liu Chao on Big Data Asset Management

In this interview, Didi data governance lead Liu Chao discusses his career journey, the unique technical architecture of Didi’s big‑data governance system, cost‑driven pricing models, metadata management, lineage extraction, automation practices, and offers practical advice for enterprises seeking effective data governance.

Automation · Big Data · Cost-based Pricing
0 likes · 12 min read
JD Cloud Developers
Apr 11, 2025 · Artificial Intelligence

How a Billion-Parameter Time Series Model Beats GPT4TS: The PCTLM Breakthrough

This article introduces PCTLM, a pioneering billion‑parameter pure time‑series large model that outperforms existing solutions like GPT4TS across multiple benchmarks, detailing its massive high‑quality dataset, novel patch‑based architecture, and a tailored RLHF framework (TPO) that enhances zero‑shot forecasting accuracy.

Big Data · PCTLM · RLHF
0 likes · 11 min read
DataFunTalk
Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Batch Processing
0 likes · 14 min read
JD Retail Technology
Apr 8, 2025 · Databases

ClickHouse Architecture and Core Technologies Overview

ClickHouse is an open‑source, massively parallel, column‑oriented OLAP database that integrates its own columnar storage, vectorized batch processing, pre‑sorted data, diverse table engines, extensive data types, sharding with replication, sparse primary‑key and skip indexes, and a multithreaded query engine, delivering high‑throughput real‑time analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage
0 likes · 15 min read
DataFunSummit
Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake
0 likes · 13 min read
Kuaishou Tech
Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees and featured technical discussions on data lake optimization and case studies from companies such as Kuaishou and Meituan.

Apache Hudi · Big Data · Data Lake
0 likes · 12 min read
Big Data Technology & Architecture
Apr 2, 2025 · Databases

Replacing Elasticsearch with Apache Doris for Real‑Time Big Data Analytics: Architecture, Performance, and Enterprise Cases

This article analyzes why Elasticsearch struggles with large‑scale, complex real‑time analytics and demonstrates how Apache Doris’s MPP, columnar storage, and native SQL support provide a cost‑effective, high‑performance alternative, illustrated with detailed enterprise case studies.

Apache Doris · Big Data · Elasticsearch
0 likes · 11 min read
Mingyi World Elasticsearch
Apr 1, 2025 · Big Data

Elasticsearch Unveiled: Learn Search Engine Basics Through Comics

This visual guide walks readers through Elasticsearch fundamentals, from architecture and indexing to clustering, query DSL, aggregations, and performance tuning, as well as security considerations, multilingual support, and real‑time search capabilities, using comic-style illustrations that simplify each concept.

Big Data · Distributed Systems · Elasticsearch
0 likes · 2 min read
DataFunSummit
Apr 1, 2025 · Big Data

Understanding Flink CDC 3.3: Features, Improvements, and Future Plans

This article provides a comprehensive overview of Flink CDC 3.3, detailing its CDC fundamentals, new connectors, Transform module enhancements, asynchronous snapshot splitting, community adoption, and upcoming roadmap for broader ecosystem support and batch‑mode execution.

Big Data · CDC · Change Data Capture
0 likes · 15 min read
IT Architects Alliance
Mar 30, 2025 · Backend Development

Douyin’s Architectural Evolution: From Simple Beginnings to Scalable Cloud‑Native System

The article chronicles Douyin’s journey from a modest early‑stage architecture to a sophisticated, distributed, micro‑service and cloud‑native infrastructure that leverages load balancing, caching, big‑data frameworks, CDN, edge computing, and automated operations to support billions of users and massive traffic spikes.

Big Data · Douyin · cloud-native
0 likes · 12 min read
vivo Internet Technology
Mar 26, 2025 · Big Data

Reading Encrypted ORC Files in StarRocks: Architecture and Implementation Details

The article details how StarRocks extends the Apache ORC C++ library to decrypt column‑level encrypted ORC files, describing the file hierarchy, AES‑128‑CTR key handling, the query‑time master‑key retrieval, a decorator‑based decryption/decompression pipeline, and the block‑skip‑read mechanism that enables efficient predicate push‑down.

Big Data · File Format · ORC
0 likes · 19 min read
Big Data Technology Architecture
Mar 25, 2025 · Big Data

Kafka 4.0 Release: KRaft Architecture, Consumer Group Optimizations, and New Queue Features

Kafka 4.0 marks a milestone release that replaces ZooKeeper with the KRaft consensus engine, improves scalability and performance, introduces a server‑side consumer‑group protocol, adds shared‑group queue capabilities, and updates Java requirements and documentation, delivering a more robust and flexible streaming platform.

Big Data · Distributed Streaming · Java11
0 likes · 6 min read
Baidu Geek Talk
Mar 24, 2025 · Big Data

How Turing Data Finder Transforms Growth Analysis with a Unified Data Platform

The article provides a detailed technical overview of the Turing Data Finder (TDF) platform, describing its background, core components, data schema, ingestion workflow, and a suite of growth‑analysis features such as event, retention, funnel, path, component, distribution, and attribution analysis, while also outlining performance‑optimisation techniques and future development directions.

Big Data · Data Platform · SQL Optimization
0 likes · 17 min read
Didi Tech
Mar 20, 2025 · Big Data

Key Questions and Value Assessment in Data Warehouse Modeling and Development

The article explores nine fundamental questions about data‑warehouse modeling—why and when to model, how to evaluate and compare models, the warehouse’s unique role versus business systems, modern architectural shifts, a quantitative value‑proof scoring framework, industry‑standard versus custom approaches, demonstrating business impact, and career insights—concluding that true value lies in enabling informed decisions rather than technology hype.

AI · Big Data · Data Value
0 likes · 12 min read
Model Perspective
Mar 20, 2025 · Big Data

How to Sample Effectively in the Big Data Era: Methods and Best Practices

This article explores essential sampling strategies for big‑data environments—including simple random, reservoir, stratified, oversampling, undersampling, and weighted sampling—detailing their principles, algorithmic steps, advantages, drawbacks, and suitable application scenarios to help analysts choose the right method.

Big Data · Sampling · oversampling
0 likes · 8 min read
AntData
Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Big Data · Data Lake · Flink
0 likes · 34 min read
Alibaba Cloud Big Data AI Platform
Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big Data · Data Integration · Spark
0 likes · 14 min read
Data Thinking Notes
Mar 19, 2025 · Big Data

How to Maximize Data Asset Value: From DataOps to Monetization

This report outlines a comprehensive framework for turning raw data into valuable assets, introducing DataOps and panoramic data architecture, and detailing practical methods for data value assessment, asset circulation, and operational mechanisms to help enterprises build a solid value baseline and expand data asset applications.

Big Data · Data Asset Management · Data Governance
0 likes · 4 min read
Alibaba Cloud Big Data AI Platform
Mar 17, 2025 · Big Data

How MaxFrame Enables Scalable Python AI Workloads on MaxCompute

This article introduces MaxFrame, a cloud‑native distributed Python compute service built on MaxCompute, detailing its architecture, seamless integration with the Python ecosystem, and real‑world use cases ranging from large‑scale data analysis and machine learning to offline LLM inference and custom image deployments.

Big Data · Data Warehouse · MaxFrame
0 likes · 18 min read
JD Tech
Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Dashboard · Supply Chain
0 likes · 10 min read
DataFunSummit
Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big Data · Spark SQL · optimizer
0 likes · 12 min read
JD Tech Talk
Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Dashboard · Supply Chain
0 likes · 11 min read
Ma Wei Says
Mar 11, 2025 · Big Data

Mastering DWS Layer Design: Principles, Steps, and Best Practices

This article explains the role of the DWS layer in data warehouses, outlines design principles, step‑by‑step modeling, naming conventions, field design, provides concrete DDL/ETL examples, common pitfalls, and how to build reusable, performant summary tables for analytics.

Big Data · DWS Layer · Data Warehouse
0 likes · 15 min read
Ma Wei Says
Mar 9, 2025 · Big Data

Mastering DWD Layer Design: Principles, Fact Tables, and Performance Tips

This article provides a comprehensive guide to designing the Data Warehouse Detail (DWD) layer, covering Kimball‑based design principles, step‑by‑step modeling, table and field naming conventions, concrete Hive DDL/DML examples, and optimization techniques such as partitioning, bucketing, and compression.

Big Data · DWD · Data Warehouse
0 likes · 21 min read
Alibaba Cloud Infrastructure
Mar 6, 2025 · Big Data

Leveraging Apache Iceberg and AutoMQ for Real-Time Data Lake Ingestion: Architecture, Best Practices, and Cost Optimization

This article examines how Apache Iceberg’s snapshot‑based ACID transactions, logical‑physical partition evolution, and COW/MOR update modes enable efficient real‑time data lake ingestion, and demonstrates AutoMQ’s Kafka‑to‑Iceberg Table Topic solution that simplifies schema management, reduces latency, and cuts operational costs.

Apache Iceberg · AutoMQ · Big Data
0 likes · 14 min read
Big Data Technology & Architecture
Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AI · Big Data · Flink
0 likes · 7 min read
IT Architects Alliance
Feb 28, 2025 · Industry Insights

What 10 Core Technologies Every IT Architect Must Master in 2024?

Amid rapid advances in cloud, AI, big data, and DevOps, this 2024 guide outlines the ten essential technologies—ranging from multi-language programming and database mastery to distributed systems, microservices, and security—that IT architects need to master to stay competitive and drive digital transformation.

Big Data · DevOps · IT Architecture
0 likes · 26 min read
Alibaba Cloud Big Data AI Platform
Feb 28, 2025 · Databases

How MaxCompute’s Intelligent Data Warehouse Optimizes Queries with AutoMV

This article explains MaxCompute’s intelligent data warehouse architecture, its self‑learning optimization pipeline, the role of intelligent materialized views, the automated recommendation system for materialized views, and the AutoMV feature that automatically creates, updates, and cleans up materialized views to reduce compute costs and improve query performance.

AutoMV · Big Data · Data Warehouse
0 likes · 17 min read
DataFunSummit
Feb 27, 2025 · Big Data

Case Study: Migrating Spark Thinking Education's Big Data Architecture from EMR to Serverless

This article details Spark Thinking Education's comprehensive migration from EMR to a serverless big‑data architecture, outlining the challenges of elasticity, cost accounting, and resource contention, the step‑by‑step implementation of serverless compute, storage, and integration services, and the resulting performance, cost, and stability gains.

Big Data · Cost Optimization · Serverless
0 likes · 41 min read
DataFunSummit
Feb 23, 2025 · Big Data

Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.

Amoro · Apache Hudi · Big Data
0 likes · 11 min read
Deepin Linux
Feb 23, 2025 · Cloud Computing

Understanding Ceph Distributed Storage Architecture and Its Core Components

Ceph is a unified, open‑source distributed storage system whose layered architecture—comprising RADOS, LIBRADOS, and upper‑level services like RADOSGW, RBD, and CephFS—provides high performance, reliability, scalability, and flexible data access for cloud, big‑data, and AI workloads.

Big Data · Ceph · architecture
0 likes · 25 min read
DataFunSummit
Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big Data · Native Execution · Performance Optimization
0 likes · 12 min read