Tagged articles

3672 articles

Jul 16, 2025 · Big Data

Master Flink Optimizations: TTL, Mini‑Batch, Two‑Phase Aggregation, Lookup Join & More

This article reviews the most effective Flink optimization techniques since 2022, including operator‑level TTL, mini‑batch processing, two‑phase aggregation, multi‑dimensional DISTINCT with FILTER, lookup join caching strategies, and TopN implementations, each rated with recommendation stars for production use.

Big Data · Flink · Lookup Join

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Jul 15, 2025 · Big Data

How MaxCompute’s Append DeltaTable Transforms BigQuery Migration

This article details the complex migration of a leading Southeast Asian tech group's data warehouse from Google BigQuery to Alibaba Cloud MaxCompute, outlining challenges such as storage format differences, SQL compatibility, and performance tuning, and explains how the new Append DeltaTable format with dynamic bucketing and incremental reclustering resolves these issues.

Big Data · Data Migration · Data Warehouse

0 likes · 19 min read

Java Tech Enthusiast

Jul 12, 2025 · Databases

Is SQL Losing Its Edge? Exploring the Rise of NoSQL and Programming Language Trends

The article examines the June 2025 TIOBE ranking where SQL fell to its lowest position, recounts its historical highs and removal from the list, highlights everyday reliance on SQL, and analyzes the growing prominence of NoSQL and other programming languages in the era of AI and big data.

Big Data · NoSQL · SQL

0 likes · 6 min read

IT Architects Alliance

Jul 10, 2025 · Cloud Native

Inside Alibaba’s Tech Stack: Cloud‑Native Architecture Behind Billions of Transactions

This article examines Alibaba's extensive cloud‑native technology stack—including distributed computing, storage, middleware, real‑time data processing, AI platforms, performance engineering, and security—revealing how its architects design systems that handle massive transaction volumes during events like Double 11.

Big Data · Distributed Systems · Microservices

0 likes · 12 min read

IT Architects Alliance

Jul 8, 2025 · Cloud Native

Why Do Big‑Tech Architects Earn Six Figures? The Skills That Set Them Apart

The article explores why architects at leading tech firms command six‑figure salaries while those in traditional companies earn far less, highlighting gaps in technical depth, massive data handling, performance optimization, business insight, continuous learning, and the scarcity of true senior architects.

Big Data · Career Development · Distributed Systems

0 likes · 9 min read

Model Perspective

Jul 8, 2025 · Big Data

Why Historical Data Can Mislead Your Forecasts—and What to Do Instead

The article explains how relying solely on historical data for prediction often leads to large errors because future structural changes and missing variables are ignored, and it proposes causal modeling, scenario simulation, and real‑time signals as more reliable forecasting approaches.

Big Data · causal modeling · forecasting

0 likes · 9 min read

Big Data Technology & Architecture

Jul 8, 2025 · Big Data

Flink’s AI Agents and Disaggregated State: Transforming Big Data

The article reviews key topics from the FFA2025 Singapore conference, highlighting Flink’s new AI‑focused Agents framework, the breakthrough Flink 2.0 disaggregated state architecture, emerging lake storage solutions like Paimon, and the Fluss streaming table store, illustrating how big‑data platforms are evolving for AI workloads.

AI agents · Big Data · Data Lake

0 likes · 6 min read

DataFunTalk

Jul 7, 2025 · Big Data

Unlock Real-Time Analytics with Cloud Lakehouse: A Complete Guide

This article presents a curated list of sessions covering cloud Lakehouse technology for real-time, multidimensional data analysis, including case studies from SalesEasy, Changan Auto, Tencent, and JD, as well as discussions on data lake adoption, streaming lake Paimon, and the relevance of metadata‑driven data governance in the digital economy.

Big Data · Case Study · Data Governance

0 likes · 2 min read

DataFunTalk

Jul 6, 2025 · Big Data

How Cloud Lakehouse Is Redefining Real-Time Multi-Dimensional Data Analytics

This article presents a curated list of case studies and insights on cloud Lakehouse technology, covering real-time intelligent analytics, data architecture simplification, IoT big‑data platforms, integrated data platforms, and the evolving role of metadata‑driven data governance in the digital economy.

Big Data · Case Studies · Data Governance

0 likes · 2 min read

FunTester

Jul 5, 2025 · Big Data

Master Kafka: Core Concepts and Performance Testing Strategies

This article explains Kafka’s high‑performance distributed streaming architecture, key components such as topics, partitions, producers, consumers, brokers, offsets, and ZooKeeper, and provides step‑by‑step workflows for producers and consumers along with performance‑testing tips and Maven setup.

Big Data · Java · Kafka

0 likes · 9 min read

360 Tech Engineering

Jul 4, 2025 · Artificial Intelligence

How AI is Revolutionizing Security Operations: Insights from the 2025 Global Digital Economy Conference

The 2025 Global Digital Economy Conference highlighted the fusion of big data and AI in security, revealing both the transformative potential of large‑model technologies for operational efficiency and the critical challenges they pose, while showcasing 360's AI‑native platform and measurable performance gains.

AI security · Big Data · Digital Transformation

0 likes · 5 min read

Big Data Technology & Architecture

Jul 4, 2025 · Big Data

Spark 4.0: New Features, Performance Gains, and Why It Still Leads Big Data

Despite the hype around Flink and AI models, Spark 4.0’s release brings a lightweight Python client, Spark Connect GA, enhanced SQL optimization, vectorized execution, and AI integration, reaffirming its leading position in the big‑data ecosystem while hinting at future challenges and innovations.

Big Data · Performance Optimization · Python

0 likes · 6 min read

Baidu Geek Talk

Jul 2, 2025 · Big Data

Baidu’s Secret to Faster Search Data: Wide‑Table Modeling & Fusion Engine

This article outlines Baidu’s innovative approach to building its search data platform, detailing the design of wide‑table models, the upgrade to a Spark‑based fusion computation engine, and the new Turing 3.0 service delivery framework, which together deliver higher efficiency, lower cost, and faster, more reliable analytics.

Big Data · Data Warehouse · Fusion Engine

0 likes · 21 min read

Mike Chen's Internet Architecture

Jul 1, 2025 · Big Data

Master ElasticSearch: Core Concepts, Architecture, and Search Workflow Explained

This article provides a comprehensive overview of ElasticSearch, covering its definition, core components such as indexes, shards and replicas, the analysis pipeline, inverted index mechanics, and the two‑stage search process that enables scalable, fault‑tolerant full‑text search in big‑data environments.

Analyzers · Big Data · Distributed Search

0 likes · 7 min read

Big Data Technology & Architecture

Jul 1, 2025 · Big Data

What’s New in Apache Hive 4.0? Key Features and Industry Outlook

After a weekend dive into Apache Hive’s official Wiki and GitHub, this article highlights Hive’s declining visibility compared to Spark and Flink, examines its 4.0 release’s major features—including Iceberg integration, enhanced ACID, cost‑based optimizer upgrades, and Ozone support—while reflecting on its role in modern data ecosystems.

Apache Hive · Big Data · Data Warehouse

0 likes · 4 min read

DataFunSummit

Jun 22, 2025 · Databases

Unlocking Apache Doris: How Lakehouse Integration Supercharges Data Analytics

This article walks through Apache Doris’s lakehouse‑in‑one architecture, explains its core value and paradigm, details the system’s components and use cases, examines technical challenges such as file‑format diversity and I/O stability, and presents a suite of optimizations—from predicate push‑down and partition pruning to metadata caching and dynamic scheduling—that dramatically improve query performance and resource utilization, while also outlining future roadmap plans.

Apache Doris · Big Data · Data Warehouse

0 likes · 22 min read

ITFLY8 Architecture Home

Jun 13, 2025 · Artificial Intelligence

Designing AI-Ready Data Architecture: Key Features and Future Trends

AI-era data architecture must handle massive, multimodal datasets with real-time processing, prioritize data quality over quantity, support scalability, provenance, and native ML/AI integration, while addressing governance, security, and ethical challenges through emerging technologies like data fabric, mesh, and federated learning.

AI · Big Data · Data Architecture

0 likes · 6 min read

vivo Internet Technology

Jun 11, 2025 · Big Data

How Vivo Built a Scalable Pulsar Monitoring System for Trillion‑Message Workloads

This article details Vivo's end‑to‑end Pulsar observability solution, covering the challenges of Prometheus‑based monitoring, the architecture of the alerting pipeline, adaptor development, metric optimizations for subscription backlog and bundle load, and fixes for KoP (Kafka‑on‑Pulsar) lag reporting issues.

Big Data · Metrics · Observability

0 likes · 12 min read

Alibaba Cloud Big Data AI Platform

Jun 11, 2025 · Big Data

Sync MaxCompute Tables to Milvus with DataWorks: Step‑by‑Step Guide

This guide explains how to use Alibaba Cloud DataWorks to create the necessary resources, configure Milvus and MaxCompute data sources, set up an offline single‑table synchronization task, and verify the imported vectors, enabling efficient AI‑driven vector search on large structured datasets.

Big Data · Data Integration · DataWorks

0 likes · 8 min read

Big Data Technology & Architecture

Jun 11, 2025 · Big Data

How to Solve Common Paimon Performance Issues in Flink: Small Files, OOM, and More

This article compiles frequent problems encountered when using Paimon with Flink—such as small‑file generation, write‑performance bottlenecks, OOM/GC issues, file‑deletion conflicts, dimension‑table join slowness, and snapshot expiration—and provides practical configuration and optimization solutions.

Big Data · Flink · Paimon

0 likes · 9 min read

DataFunSummit

Jun 10, 2025 · Big Data

How OpenLake Redefines Data Lake Infrastructure for the AI Era

This article explores OpenLake's evolution as a data lake platform for AI, covering the transition from Hive to modern lake formats like Iceberg and Paimon, performance benchmarks, metadata management advances, intelligent storage optimization, and the integration of multimodal support with the Lance file format.

AI · Big Data · Data Lake

0 likes · 22 min read

Alibaba Cloud Big Data AI Platform

Jun 10, 2025 · Big Data

Boosting Automotive Data Processing with Alibaba Cloud EMR Serverless Spark

This article details how a leading automotive parts supply‑chain platform migrated from a traditional Hadoop stack to Alibaba Cloud EMR Serverless Spark and DataWorks, achieving faster, more elastic, and cost‑effective data processing, enhanced AI integration, and significant operational improvements across multiple business scenarios.

Big Data · Cloud Native · Data Lake

0 likes · 12 min read

Lobster Programming

Jun 9, 2025 · Databases

How to Add a Column to Billion‑Row Tables Without Downtime

This article explains a metadata‑driven approach for extending massive tables—using a separate extension table, sharding, and Elasticsearch sync—to add new fields to billion‑row databases without locking the primary table or disrupting online services.

Big Data · Elasticsearch · database schema

0 likes · 6 min read

DataFunSummit

Jun 6, 2025 · Big Data

How Unicom Digital’s Integrated Data Platform Revolutionizes Metadata Management

This article details Unicom Digital’s metadata management practice on its integrated data platform, covering the strategic background of data, key challenges, award-winning capabilities, three-pronged solutions—automation, linking+, and AI—along with practical implementations, full‑chain lineage, data responsibility, lifecycle management, and future AI‑driven enhancements.

AI · Automation · Big Data

0 likes · 18 min read

Alibaba Cloud Developer

Jun 6, 2025 · Big Data

Why Observability 2.0 and SLS Data Pipelines Are Revolutionizing Log Analytics

This article explains how Observability 2.0 reshapes log, metric and trace management by unifying health views, introduces the evolution of Alibaba Cloud's SLS data pipeline, compares its three service modes, and demonstrates performance, cost and integration benefits for large‑scale, real‑time log processing.

Big Data · Observability · SLS

0 likes · 11 min read

Big Data Technology & Architecture

Jun 6, 2025 · Big Data

How to Ace Big Data Interviews: A Complete 3‑Month Prep Guide

This guide outlines a step‑by‑step three‑month preparation plan—including resume building, project showcase, interview mindset, mock sessions, and offer negotiation—to help candidates secure high‑paying big‑data positions at top companies.

Big Data · Interview Preparation · offer negotiation

0 likes · 7 min read

Instant Consumer Technology Team

Jun 5, 2025 · Big Data

Mastering Kafka in Production: Boost Throughput, Ensure Reliability, and Avoid Data Loss

This article shares practical Kafka production insights, covering architecture overview, producer throughput tuning, message loss prevention, broker and consumer configurations, duplicate consumption avoidance, backlog mitigation, ordering guarantees, and the mechanics of consumer group rebalancing, helping engineers build stable, high‑performance streaming pipelines.

Big Data · Kafka · Message Queue

0 likes · 15 min read

Big Data Technology & Architecture

Jun 5, 2025 · Big Data

Flink Web UI Monitoring and End‑to‑End Latency Implementation Guide

This article explains the key monitoring items of the Flink Web UI, details task topology, operator and system metrics, checkpoint and log inspection, and provides two practical solutions—custom metrics and distributed tracing—to measure and visualize full‑chain latency in Flink jobs.

Big Data · Distributed Tracing · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Jun 3, 2025 · Big Data

Query Optimization Techniques for Paimon Real-Time Data Lake

This article explains how to improve Paimon's query performance by optimizing table schemas, storage settings, query parameters, and index designs, covering table mode choices, partitioning, file formats, parallelism, batch reads, and various index types such as Bloom filters and clustering indexes.

Big Data · LSM‑Tree · Paimon

0 likes · 8 min read

DataFunSummit

Jun 1, 2025 · Big Data

Scaling WeChat’s Big Data and AI Workloads on Kubernetes: Challenges and Optimizations

This article details WeChat's migration of large‑scale big data and AI workloads to a cloud‑native Kubernetes platform, discussing performance bottlenecks, API server and ETCD overload protection, scheduler enhancements, observability solutions, resource utilization gains, and future serverless directions.

AI · Big Data · Cloud Native

0 likes · 11 min read

Alibaba Cloud Infrastructure

May 26, 2025 · Big Data

Comparative Guide to Apache Airflow and Argo Workflows for Distributed Task Scheduling

This article provides a comprehensive comparison of Apache Airflow and Argo Workflows, covering their core features, architectures, use cases, code examples, and recommendations for selecting the appropriate distributed workflow engine in data engineering, big‑data, and AI pipelines.

Apache Airflow · Argo Workflows · Big Data

0 likes · 23 min read

dbaplus Community

May 22, 2025 · Databases

When Is Data Modeling Really Necessary? Lessons from 9 Common Data‑Warehouse Questions

This article examines nine recurring data‑warehouse dilemmas, exploring when modeling is essential, how to evaluate model quality, the boundaries between data warehouses and business systems, the evolution of modern warehouses, career growth for data engineers, and the future role of data R&D in the AI era.

AI · Analytics · Big Data

0 likes · 12 min read

DataFunSummit

May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps

0 likes · 12 min read

Python Programming Learning Circle

May 22, 2025 · Big Data

Introduction to PySpark: Features, Core Components, Sample Code, and Use Cases

This article introduces PySpark as the Python API for Apache Spark, explains Spark's core concepts and advantages, details PySpark's main components and a simple code example, compares it with Pandas, and outlines typical big‑data scenarios and further learning directions.

Apache Spark · Big Data · DataFrames

0 likes · 5 min read

Zhuanzhuan Tech

May 21, 2025 · Big Data

How We Turned a Microservice Finance System into a Scalable Big‑Data Warehouse

This article details the evolution of a fast‑growing e‑commerce finance platform from a monolithic microservice architecture plagued by data inconsistency, low processing efficiency, and scalability limits to a robust, distributed big‑data warehouse using SparkSQL, layered data models, and optimized scheduling, achieving ten‑fold performance gains and near‑zero failure rates.

Big Data · Data Warehouse · ETL

0 likes · 21 min read

Java Backend Technology

May 21, 2025 · Big Data

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Big Data · DataX · ETL

0 likes · 15 min read

Big Data Technology & Architecture

May 21, 2025 · Big Data

Interview Experience: Flink Task Resource Allocation, Issues, and Monitoring

This article shares an interviewee's experience discussing core Flink interview questions, including typical resource allocation for large online tasks, common problems such as data, performance, stability, and resource issues, and the monitoring practices for clusters and tasks, while also containing a brief self‑promotion.

Big Data · Flink · Performance Issues

0 likes · 7 min read

Xiaohongshu Tech REDtech

May 19, 2025 · Industry Insights

How Xiaohongshu Built a Minute‑Level Near‑Real‑Time Data Warehouse with Incremental Computing

Facing billions of daily logs and the need for minute‑level experiment metrics, Xiaohongshu partnered with Yunqi Tech to design a generic incremental‑compute solution that delivers near‑real‑time data warehousing with lower cost, higher accuracy, simplified pipelines, and improved query performance.

Big Data · Data Lake · Flink

0 likes · 24 min read

Big Data Technology & Architecture

May 15, 2025 · Big Data

Interview Review: Spark Stage Logic, Data Warehouse Evaluation, and Flink Late‑Data Handling

This article reviews common interview questions for data development roles, covering Spark stage partitioning and optimization, criteria for evaluating data warehouses, Flink's handling of late data, and provides practical answers and resources to help candidates deliver standout responses.

Big Data · Data Quality · Data Warehouse

0 likes · 11 min read

Huolala Tech

May 14, 2025 · Big Data

How Lalamove Scaled Real‑Time Data Warehousing with Flink and Paimon

Lalamove’s international logistics platform transformed its real‑time data warehouse by leveraging Apache Flink and the Paimon lakehouse, addressing challenges of multi‑region data centers, time‑zone diversity, frequent upstream changes, and high costs, while improving scalability, latency, and operational efficiency across global markets.

Big Data · Flink · Paimon

0 likes · 13 min read

JD Tech

May 13, 2025 · Databases

Unlock ClickHouse’s Lightning‑Fast Queries: Architecture, Storage, and Index Secrets

This article examines ClickHouse’s high‑performance OLAP design, covering its MPP architecture, columnar storage, vectorized execution, pre‑sorting, table engines, extensive data‑type system, sharding and replication strategies, as well as its sparse and skip‑index mechanisms that together enable ultra‑fast analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage

0 likes · 16 min read

macrozheng

May 12, 2025 · Big Data

Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

Big Data · DataX · ETL

0 likes · 15 min read

Top Architect

May 7, 2025 · Big Data

Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide on using Alibaba's open‑source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full‑ and incremental data transfer, and practical code examples.

Big Data · DataX · ETL

0 likes · 18 min read

Architecture Digest

May 6, 2025 · Big Data

Using DataX for Efficient Data Synchronization Between MySQL Databases

This article explains how to employ Alibaba's open‑source DataX tool to perform fast, reliable full‑ and incremental data synchronization between MySQL instances, covering installation, framework design, job execution, and practical shell commands for Linux environments.

Big Data · DataX · ETL

0 likes · 16 min read

DataFunSummit

May 4, 2025 · Big Data

Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, AB testing support, catalog enhancements, and future roadmap for full lifecycle data governance.

Big Data · Data Lake · Huawei Cloud

0 likes · 13 min read

JD Tech

Apr 30, 2025 · Artificial Intelligence

TimeHF: A Billion‑Scale Time Series Forecasting Model Guided by Human Feedback

The JD Supply Chain algorithm team introduces TimeHF, a billion‑parameter time‑series large model that leverages RLHF to boost demand‑forecast accuracy by over 10%, detailing dataset construction, the PCTLM architecture, a custom RLHF framework (TPO), and extensive SOTA experimental results.

Big Data · Deep Learning · RLHF

0 likes · 10 min read

Big Data Tech Team

Apr 28, 2025 · Big Data

Mastering Metadata, Master Data, and Data Governance: A Complete Guide

This article explains the core concepts of metadata, master data, data resources, data governance, and data management, outlines their roles, compares governance with management, and provides practical steps and best‑practice recommendations for building a robust enterprise data framework.

Big Data · Data Governance · Master Data

0 likes · 15 min read

Big Data Technology & Architecture

Apr 28, 2025 · Big Data

Interview Insights on Spark Optimization, Flink Exactly-Once Semantics, and Paimon Asynchronous Merging

This article shares three high‑quality interview questions from a JD big‑data interview, covering practical Spark tuning, Flink's exactly‑once guarantees in production, and Paimon's asynchronous merge mechanism, and explains how to answer them with real‑world scenarios.

Big Data · Flink · Paimon

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Apr 27, 2025 · Big Data

Scaling Property Services: StarRocks‑Powered Storage‑Compute Separation for 8000+ Communities

Facing a flood of data from over 8,000 communities, the Bifeng service team migrated from a monolithic storage‑compute architecture to a StarRocks‑based storage‑compute separation solution, achieving lower costs, higher resource utilization, faster queries, and improved SLA across their property management platform.

Big Data · Data Warehouse · Infrastructure Migration

0 likes · 11 min read

Big Data Tech Team

Apr 26, 2025 · Big Data

Mastering the Data Development Roadmap: From Infrastructure to AI Integration

This guide outlines a comprehensive data development roadmap, covering infrastructure setup, governance frameworks, automated pipelines, BI and analytics tools, AI/ML integration, cultural adoption, and continuous performance monitoring to enable intelligent business transformation.

AI integration · Analytics · Big Data

0 likes · 5 min read

Alibaba Cloud Big Data AI Platform

Apr 24, 2025 · Big Data

Boosting Product Recommendations with Serverless Spark and Milvus: A Real‑World Case Study

Chanmama (蝉妈妈) migrated its recommendation platform to Alibaba Cloud Serverless Spark and Milvus, replacing traditional vector search and Spark clusters, achieving 40% faster offline tasks, 80% lower failure rates, significant cost savings, and scalable, low‑latency similar‑product retrieval for personalized marketing.

Big Data · Milvus · recommendation system

0 likes · 8 min read

Big Data Tech Team

Apr 20, 2025 · Industry Insights

Essential Skills & Tech Stacks for Every Data Team Role

This guide breaks down the main positions in a data team— from data development and analysis engineers to product managers and operations specialists—detailing each role’s key responsibilities, essential skill sets, and the typical technology stack they rely on.

Big Data · Data Analytics · data engineering

0 likes · 7 min read

dbaplus Community

Apr 20, 2025 · Databases

Why Wide Tables Fail and How to Design Them Efficiently

This article explains what wide tables are, why they are controversial, outlines three common design pitfalls with practical avoidance tips, and introduces three key technologies—ClickHouse, Cassandra, and Hudi/Iceberg—to help engineers build performant, maintainable wide‑table solutions in data warehouses.

Big Data · ClickHouse · Database design

0 likes · 7 min read

macrozheng

Apr 18, 2025 · Big Data

How to Build Near Real-Time Elasticsearch Indexes for PB-Scale Data

This article explains why traditional databases like MySQL struggle with massive data, introduces Elasticsearch’s advantages, and details a practical architecture using Hive, Canal, and Otter to achieve near real‑time indexing of petabyte‑scale datasets with minimal latency.

Big Data · Canal · Data Transfer Service

0 likes · 20 min read

AntTech

Apr 17, 2025 · Artificial Intelligence

Data+AI Forum at the 18th China Electronics Information Conference (2025) – Speaker Bios and Session Summaries

The 18th China Electronics Information Conference will be held in Chengdu from April 17‑21, 2025, featuring the DATA+AI forum that gathers leading academicians and industry experts to discuss data‑AI integration, with detailed speaker biographies, presentation titles, and abstracts covering topics such as large‑model inference, cloud‑edge ultrasound diagnostics, and the future of databases in the AI era.

@DataAIBig Data

0 likes · 12 min read

Big Data Technology & Architecture

Apr 17, 2025 · Big Data

MaxCompute: Intelligent Data Warehouse Platform for the Data+AI Era

This article, based on a meetup presentation, details Alibaba Cloud's MaxCompute platform—its evolution, serverless architecture, AI integration, distributed Python framework, Object Table, near‑real‑time processing, and intelligent warehouse features—addressing the challenges of data warehouses in the Data+AI era.

Big Data · Data Warehouse · MaxCompute

0 likes · 11 min read

vivo Internet Technology

Apr 16, 2025 · Big Data

Offline Mixed Deployment of Spark Tasks on Kubernetes: Containerization, Scheduling, and Elastic Resource Management

The article explains how the vivo Internet Big Data team containerized offline Spark jobs and deployed them with the Spark Operator on a mixed online‑offline Kubernetes cluster, using elastic scheduling and resource‑over‑subscription to boost CPU utilization by 30‑40% and handle over 100,000 daily tasks.

Big Data · Kubernetes · Resource Management

0 likes · 36 min read

Top Architect

Apr 16, 2025 · Big Data

Redis Performance Optimization for Spark Streaming: Connection Pools, Pipelines, and Cluster Strategies

The article explains how to reduce latency in SparkStreaming jobs that heavily interact with Redis by using connection pools, batch sizing, pipeline techniques, and custom JedisCluster pipelines, while also covering Redis deployment modes, Codis proxy, and practical Java/Scala code examples.

Big Data · Java · Jedis

0 likes · 18 min read

dbaplus Community

Apr 15, 2025 · Big Data

How Xiaohongshu Boosted Data Warehouse Performance with Logical Datasets and Materialized Views

Xiaohongshu introduced logical datasets and materialized views to overcome low reuse of APP tables, limited scalability of single‑table BI datasets, and poor dashboard query performance, achieving higher data processing efficiency and faster query responses through optimized data flow, query pruning, and accelerated ETL scheduling.

Big Data · logical dataset · query optimization

0 likes · 24 min read

Big Data Technology & Architecture

Apr 15, 2025 · Big Data

Designing a Lakehouse with Doris and Paimon: Query Acceleration and Unified Modeling

This article summarizes how the Doris‑Paimon lakehouse architecture leverages Doris' high‑performance OLAP engine to accelerate lake queries, provides a unified data analysis gateway, supports unified data integration, and enables open, layered data modeling for modern big‑data workloads.

Big Data · Data Integration · Paimon

0 likes · 9 min read

DataFunSummit

Apr 13, 2025 · Big Data

Data Governance at Didi: Interview with Liu Chao on Big Data Asset Management

In this interview, Didi data governance lead Liu Chao discusses his career journey, the unique technical architecture of Didi’s big‑data governance system, cost‑driven pricing models, metadata management, lineage extraction, automation practices, and offers practical advice for enterprises seeking effective data governance.

Automation · Big Data · Cost-based Pricing

0 likes · 12 min read

JD Cloud Developers

Apr 11, 2025 · Artificial Intelligence

How a Billion-Parameter Time Series Model Beats GPT4TS: The PCTLM Breakthrough

This article introduces PCTLM, a pioneering billion‑parameter pure time‑series large model that outperforms existing solutions like GPT4TS across multiple benchmarks, detailing its massive high‑quality dataset, novel patch‑based architecture, and a tailored RLHF framework (TPO) that enhances zero‑shot forecasting accuracy.

Big Data · PCTLM · RLHF

0 likes · 11 min read

DataFunTalk

Apr 9, 2025 · Big Data

Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.

AI · Apache Hudi · Batch Processing

0 likes · 14 min read

Alibaba Cloud Big Data AI Platform

Apr 9, 2025 · Big Data

How We Built an Intelligent Data Warehouse on Alibaba Cloud MaxCompute

This article details the business background, technical challenges, and the step‑by‑step implementation of an intelligent data warehouse on Alibaba Cloud MaxCompute, covering offline data pipelines, metric calculation, data analysis, and future plans for data lake and AI‑driven analytics.

Analytics · Big Data · Data Lake

0 likes · 10 min read

JD Retail Technology

Apr 8, 2025 · Databases

ClickHouse Architecture and Core Technologies Overview

ClickHouse is an open‑source, massively parallel, column‑oriented OLAP database that integrates its own columnar storage, vectorized batch processing, pre‑sorted data, diverse table engines, extensive data types, sharding with replication, sparse primary‑key and skip indexes, and a multithreaded query engine, delivering high‑throughput real‑time analytics on massive datasets.

Big Data · ClickHouse · Columnar Storage

0 likes · 15 min read

DataFunSummit

Apr 3, 2025 · Big Data

Apache Hudi Asia Technical Salon Highlights: Practices and Innovations from Kuaishou, Meituan, Douyin, Huawei, and JD

The Apache Hudi Asia technical salon held in Beijing on March 29 gathered over 230 on‑site participants and 16,000 online viewers, featuring expert talks from leading Chinese tech companies that showcased real‑world Hudi implementations, performance optimizations, and future roadmap for data‑lake technologies.

Apache Hudi · Big Data · Data Lake

0 likes · 13 min read

Kuaishou Tech

Apr 2, 2025 · Big Data

Apache Hudi Asia Summit Successfully Held

The first Apache Hudi Asia Summit in Beijing attracted over 230 attendees, featuring technical discussions on data lake optimization and case studies from companies like Kuaishou and Meituan.

Apache Hudi · Big Data · Data Lake

0 likes · 12 min read

Big Data Technology & Architecture

Apr 2, 2025 · Databases

Replacing Elasticsearch with Apache Doris for Real‑Time Big Data Analytics: Architecture, Performance, and Enterprise Cases

This article analyzes why Elasticsearch struggles with large‑scale, complex real‑time analytics and demonstrates how Apache Doris’s MPP, columnar storage, and native SQL support provide a cost‑effective, high‑performance alternative, illustrated with detailed enterprise case studies.

Apache Doris · Big Data · Elasticsearch

0 likes · 11 min read

Mingyi World Elasticsearch

Apr 1, 2025 · Big Data

Elasticsearch Unveiled: Learn Search Engine Basics Through Comics

This visual guide walks readers through Elasticsearch fundamentals, from architecture and indexing to clustering, query DSL, aggregations, and performance tuning, using comic-style illustrations that simplify each concept. It also touches on security considerations, multilingual support, and real‑time search capabilities.

Big Data · Distributed Systems · Elasticsearch

0 likes · 2 min read

DataFunSummit

Apr 1, 2025 · Big Data

Understanding Flink CDC 3.3: Features, Improvements, and Future Plans

This article provides a comprehensive overview of Flink CDC 3.3, detailing its CDC fundamentals, new connectors, Transform module enhancements, asynchronous snapshot splitting, community adoption, and upcoming roadmap for broader ecosystem support and batch‑mode execution.

Big Data · CDC · Change Data Capture

0 likes · 15 min read

IT Architects Alliance

Mar 30, 2025 · Backend Development

Douyin’s Architectural Evolution: From Simple Beginnings to Scalable Cloud‑Native System

The article chronicles Douyin’s journey from a modest early‑stage architecture to a sophisticated, distributed, micro‑service and cloud‑native infrastructure that leverages load balancing, caching, big‑data frameworks, CDN, edge computing, and automated operations to support billions of users and massive traffic spikes.

Big Data · Douyin · cloud-native

0 likes · 12 min read

Ma Wei Says

Mar 27, 2025 · Big Data

What’s New in Apache Kafka 4.0? A Deep Dive into KRaft, Java 17, and Next‑Gen Consumer Rebalance

Apache Kafka 4.0 removes the ZooKeeper dependency in favor of the KRaft consensus protocol, raises the broker Java requirement to 17, introduces a next‑generation consumer rebalance protocol, adds Share Group queue semantics, and bundles numerous performance, security, and API improvements for modern streaming workloads.

Big Data · Consumer Rebalance · KRaft

0 likes · 7 min read

vivo Internet Technology

Mar 26, 2025 · Big Data

Reading Encrypted ORC Files in StarRocks: Architecture and Implementation Details

The article details how StarRocks extends the Apache ORC C++ library to decrypt column‑level encrypted ORC files, describing the file hierarchy, AES‑128‑CTR key handling, the query‑time master‑key retrieval, a decorator‑based decryption/decompression pipeline, and the block‑skip‑read mechanism that enables efficient predicate push‑down.

Big Data · File Format · ORC

0 likes · 19 min read

Big Data Technology Architecture

Mar 25, 2025 · Big Data

Kafka 4.0 Release: KRaft Architecture, Consumer Group Optimizations, and New Queue Features

Kafka 4.0 is a milestone release that replaces ZooKeeper with the KRaft consensus engine, improves scalability and performance, introduces a server‑side consumer‑group protocol, adds share‑group queue capabilities, and updates Java requirements and documentation, delivering a more robust and flexible streaming platform.

Big Data · Distributed Streaming · Java11

0 likes · 6 min read

Alibaba Cloud Big Data AI Platform

Mar 25, 2025 · Big Data

How to Connect EMR Serverless Spark with Apache Doris for Seamless Data Processing

This guide explains how to integrate EMR Serverless Spark with the high‑performance Apache Doris analytical database, covering prerequisites, connector download, OSS upload, network configuration, table creation, and both SQL‑session and Notebook examples for reading and writing Doris tables.

Apache Doris · Big Data · Data Integration

0 likes · 11 min read

Baidu Geek Talk

Mar 24, 2025 · Big Data

How Turing Data Finder Transforms Growth Analysis with a Unified Data Platform

The article provides a detailed technical overview of the Turing Data Finder (TDF) platform, describing its background, core components, data schema, ingestion workflow, and a suite of growth‑analysis features such as event, retention, funnel, path, component, distribution, and attribution analysis, while also outlining performance‑optimisation techniques and future development directions.

Big Data · Data Platform · SQL Optimization

0 likes · 17 min read

dbaplus Community

Mar 22, 2025 · Big Data

Why Data Lakes Are Crucial for Observability—and When They’re Not the Answer

The article explains how data lakes serve as a foundational component for observability by aggregating raw, diverse data for advanced analysis, while also outlining the technical, cost, and scalability challenges that make them unsuitable for every organization.

Analytics · Big Data · Data Lake

0 likes · 10 min read

Didi Tech

Mar 20, 2025 · Big Data

Key Questions and Value Assessment in Data Warehouse Modeling and Development

The article explores nine fundamental questions about data‑warehouse modeling—why and when to model, how to evaluate and compare models, the warehouse’s unique role versus business systems, modern architectural shifts, a quantitative value‑proof scoring framework, industry‑standard versus custom approaches, demonstrating business impact, and career insights—concluding that true value lies in enabling informed decisions rather than technology hype.

AI · Big Data · Data Value

0 likes · 12 min read

Model Perspective

Mar 20, 2025 · Big Data

How to Sample Effectively in the Big Data Era: Methods and Best Practices

This article explores essential sampling strategies for big‑data environments—including simple random, reservoir, stratified, oversampling, undersampling, and weighted sampling—detailing their principles, algorithmic steps, advantages, drawbacks, and suitable application scenarios to help analysts choose the right method.

Big Data · Sampling · oversampling

0 likes · 8 min read

AntData

Mar 20, 2025 · Big Data

Design and Optimization of Real‑time Data Lake Tables with Paimon and Flink for Advertising Diagnostics

This article presents a comprehensive exploration of using Apache Paimon and Flink to design lake tables that support minute‑level latency, low cost, and unified batch‑stream processing for advertising data, covering schema design, partitioning strategies, performance trade‑offs, cost analysis, and operational best practices.

Big Data · Data Lake · Flink

0 likes · 34 min read

Alibaba Cloud Big Data AI Platform

Mar 20, 2025 · Big Data

How to Read and Write StarRocks Data with EMR Serverless Spark

This step‑by‑step guide explains how to use EMR Serverless Spark together with the StarRocks Spark Connector to create a workspace, upload the connector JAR, configure network connections, create databases and tables in StarRocks, and perform read/write operations via SQL sessions, Notebook sessions, or batch Spark jobs, complete with code examples and UI screenshots.

Big Data · Data Integration · Spark

0 likes · 14 min read

Data Thinking Notes

Mar 19, 2025 · Big Data

How to Maximize Data Asset Value: From DataOps to Monetization

This report outlines a comprehensive framework for turning raw data into valuable assets, introducing DataOps and panoramic data architecture, and detailing practical methods for data value assessment, asset circulation, and operational mechanisms to help enterprises build a solid value baseline and expand data asset applications.

Big Data · Data Asset Management · Data Governance

0 likes · 4 min read

Alibaba Cloud Big Data AI Platform

Mar 17, 2025 · Big Data

How MaxFrame Enables Scalable Python AI Workloads on MaxCompute

This article introduces MaxFrame, a cloud‑native distributed Python compute service built on MaxCompute, detailing its architecture, seamless integration with the Python ecosystem, and real‑world use cases ranging from large‑scale data analysis and machine learning to offline LLM inference and custom image deployments.

Big Data · Data Warehouse · MaxFrame

0 likes · 18 min read

JD Tech

Mar 13, 2025 · Operations

Ensuring Stability of the Double 11 Supply‑Chain Dashboard: Full‑Link Process, Risk Points, and Technical Safeguards

This article details how JD Logistics guarantees the stability of its Double 11 supply‑chain dashboard by mapping the entire data‑flow, identifying risk points across ingestion, processing, storage, service, and monitoring layers, and applying targeted technical and organizational safeguards.

Big Data · Dashboard · Supply Chain

0 likes · 10 min read

DataFunSummit

Mar 12, 2025 · Big Data

Principles and Common Optimization Techniques of the Spark SQL Optimizer

This article explains the underlying principles of the Spark SQL optimizer and presents three classic optimization paradigms—push‑down optimization, operator elimination/merging, and expression elimination/replacement—illustrating each with concrete rule implementations and code examples.

Big Data · Spark SQL · optimizer

0 likes · 12 min read

JD Tech Talk

Mar 12, 2025 · Big Data

Ensuring Stability of the Double‑11 Supply Chain Dashboard: Full‑Chain Process, Risk Points, and Technical Safeguard Strategies

This article details how the supply‑chain big‑screen dashboard for Double‑11 maintains high stability by mapping the full data‑flow, identifying risk points across ingestion, processing, storage and service layers, and applying comprehensive technical safeguards such as high‑availability design, fault‑tolerance, monitoring, and coordinated operational procedures.

Big Data · Dashboard · Supply Chain

0 likes · 11 min read

JD Cloud Developers

Mar 12, 2025 · Operations

How to Ensure Double‑11 Supply‑Chain Dashboard Stability: End‑to‑End Strategies

This article details the end‑to‑end technical and operational measures—including full‑chain flow mapping, risk point analysis, layered mitigation tactics, monitoring, and team coordination—used to guarantee the stability and accuracy of the supply‑chain dashboard during the Double‑11 promotion.

Big Data · Dashboard · Operations

0 likes · 15 min read

Top Architecture Tech Stack

Mar 12, 2025 · Big Data

DeepSeek: Comprehensive Installation, Configuration, and Usage Guide

This article provides a detailed, step‑by‑step guide to installing, configuring, and using DeepSeek—a powerful command‑line data processing tool—covering basic operations, advanced features, scripting tips, and troubleshooting to help users efficiently import, clean, analyze, and visualize data.

Big Data · CLI · Configuration

0 likes · 8 min read

Ma Wei Says

Mar 11, 2025 · Big Data

Mastering DWS Layer Design: Principles, Steps, and Best Practices

This article explains the role of the DWS layer in data warehouses, outlines design principles, step‑by‑step modeling, naming conventions, field design, provides concrete DDL/ETL examples, common pitfalls, and how to build reusable, performant summary tables for analytics.

Big Data · DWS Layer · Data Warehouse

0 likes · 15 min read

Ma Wei Says

Mar 9, 2025 · Big Data

Mastering DWD Layer Design: Principles, Fact Tables, and Performance Tips

This article provides a comprehensive guide to designing the Data Warehouse Detail (DWD) layer, covering Kimball‑based design principles, step‑by‑step modeling, table and field naming conventions, concrete Hive DDL/DML examples, and optimization techniques such as partitioning, bucketing, and compression.

Big Data · DWD · Data Warehouse

0 likes · 21 min read

Alibaba Cloud Infrastructure

Mar 6, 2025 · Big Data

Leveraging Apache Iceberg and AutoMQ for Real-Time Data Lake Ingestion: Architecture, Best Practices, and Cost Optimization

This article examines how Apache Iceberg’s snapshot‑based ACID transactions, logical‑physical partition evolution, and COW/MOR update modes enable efficient real‑time data lake ingestion, and demonstrates AutoMQ’s Kafka‑to‑Iceberg Table Topic solution that simplifies schema management, reduces latency, and cuts operational costs.

Apache Iceberg · AutoMQ · Big Data

0 likes · 14 min read

DeWu Technology

Mar 5, 2025 · Big Data

Using ANTLR4 for SQL Parsing, Completion, and Validation in SparkSQL-based Data IDE

The article explains how a large‑scale data‑development IDE leverages ANTLR4 to build a custom SparkSQL parser that provides real‑time syntax checking, auto‑completion, and validation by generating ASTs, using listeners for context, optimizing performance, and exploring future integration with large language models.

ANTLR · Big Data · SQL

0 likes · 24 min read

Big Data Technology & Architecture

Mar 3, 2025 · Big Data

The Turning Point for Data Development: From Traditional Data Engineering to AI Data Engineering

The article analyzes how the rapid rise of open‑source large‑model AI in 2025 is reshaping the data development profession, urging developers to transition from specialized data‑engineer roles to full‑stack AI data engineering skills such as distributed computing, lake‑house architectures, and model tuning.

AI · Big Data · Flink

0 likes · 7 min read

IT Architects Alliance

Feb 28, 2025 · Industry Insights

What 10 Core Technologies Every IT Architect Must Master in 2024?

Amid rapid advances in cloud, AI, big data, and DevOps, this 2024 guide outlines the ten essential technologies—ranging from multi-language programming and database mastery to distributed systems, microservices, and security—that IT architects need to master to stay competitive and drive digital transformation.

Big Data · DevOps · IT Architecture

0 likes · 26 min read

DataFunSummit

Feb 28, 2025 · Big Data

Apache Gravitino: Open‑Source Data Asset Management for AI and Multi‑Cloud Environments

This article introduces Apache Gravitino, an open‑source metadata and data‑asset management platform designed to address AI‑driven data demands and multi‑cloud challenges, detailing its architecture, core components, typical use cases, real‑world success stories, and a Q&A session on its capabilities.

AI · Apache Gravitino · Big Data

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Feb 28, 2025 · Databases

How MaxCompute’s Intelligent Data Warehouse Optimizes Queries with AutoMV

This article explains MaxCompute’s intelligent data warehouse architecture, its self‑learning optimization pipeline, the role of intelligent materialized views, the automated recommendation system for materialized views, and the AutoMV feature that automatically creates, updates, and cleans up materialized views to reduce compute costs and improve query performance.

AutoMV · Big Data · Data Warehouse

0 likes · 17 min read

DataFunSummit

Feb 27, 2025 · Big Data

Case Study: Migrating Spark Thinking Education's Big Data Architecture from EMR to Serverless

This article details Spark Thinking Education's comprehensive migration from EMR to a serverless big‑data architecture, outlining the challenges of elasticity, cost accounting, and resource contention, the step‑by‑step implementation of serverless compute, storage, and integration services, and the resulting performance, cost, and stability gains.

Big Data · Cost Optimization · Serverless

0 likes · 41 min read

DataFunSummit

Feb 23, 2025 · Big Data

Douyin Group’s ByteLake Data Lake Table Optimization and Management Practices

This article presents Douyin Group’s ByteLake, a heavily customized Apache Hudi‑based data lake table framework, detailing its core concepts, metadata services, write and read optimizations, operational challenges, a fully managed table management service, and its integration with the Amoro open‑source platform.

Amoro · Apache Hudi · Big Data

0 likes · 11 min read

Deepin Linux

Feb 23, 2025 · Cloud Computing

Understanding Ceph Distributed Storage Architecture and Its Core Components

Ceph is a unified, open‑source distributed storage system whose layered architecture—comprising RADOS, LIBRADOS, and upper‑level services like RADOSGW, RBD, and CephFS—provides high performance, reliability, scalability, and flexible data access for cloud, big‑data, and AI workloads.

Big Data · Ceph · architecture

0 likes · 25 min read

DataFunSummit

Feb 22, 2025 · Big Data

Blaze Engine: A Rust‑Based Native Vectorized Execution Engine for Spark SQL

The article introduces Blaze, Kuaishou's Rust‑powered native execution engine that vectorizes Spark SQL workloads, explains its architecture and operation, presents benchmark results showing up to 50% latency reduction, and details internal deployments, industry case studies, community collaborations, and the 2025 roadmap.

Big Data · Native Execution · Performance Optimization

0 likes · 12 min read
