Tagged articles
3675 articles
DataFunSummit
Aug 31, 2024 · Big Data

Apache Hudi Clustering: Workflow and Layout Optimization Strategies (Part 6)

This article explains Apache Hudi's clustering service, detailing its workflow, three execution modes, and layout optimization strategies—including linear, Z‑order, and Hilbert space‑filling curves—to improve storage locality and query performance in large‑scale data lake environments.

Apache Hudi · Big Data · Space-filling Curves
0 likes · 8 min read
Data Thinking Notes
Aug 29, 2024 · Big Data

How ICBC Evolved Its Data Intelligence Architecture for Real‑Time Insights

At the 2024 Data Intelligence Conference, ICBC's Big Data and AI Lab detailed the evolution of its data intelligence platform, covering architectural redesign, real‑time data warehouse technology, unified intelligent data tools, and future development directions to boost efficiency and innovation.

Big Data · Data Platform · architecture evolution
0 likes · 3 min read
Zhuanzhuan Tech
Aug 28, 2024 · Big Data

Quality Inspection Data Collection: Design, Architecture, and Applications

This article outlines the design, architecture, and practical applications of a quality inspection data collection system, covering data point structures, reporting mechanisms, compliance analysis, intelligent strategy iteration, and BI dashboards, illustrating how big‑data techniques enable digital transformation of inspection processes.

BI · Big Data · compliance
0 likes · 10 min read
Architecture Digest
Aug 27, 2024 · Big Data

Curated List of Free API Interfaces for Various Services

This article provides a comprehensive collection of free, unlimited-use API endpoints covering diverse services such as phone number lookup, historical events, stock data, weather forecasts, identity verification, jokes, maps, and many others, offering developers ready-to-use resources for building data-driven applications.

Backend · Big Data · data services
0 likes · 5 min read
DataFunTalk
Aug 27, 2024 · Big Data

Kuaishou's Year-Long White‑Box Cost Governance in Big Data: Engine, Data‑Warehouse, and Tool Optimizations

This article presents Kuaishou's comprehensive white‑box cost governance practice over the past year, detailing the data‑governance framework, engine and data‑warehouse white‑boxing techniques, compression algorithm replacement, HBO automatic tuning, operator analysis, and the resulting performance and cost benefits, as well as future plans.

Big Data · Cost Optimization · data-warehouse
0 likes · 29 min read
DataFunSummit
Aug 26, 2024 · Big Data

Building a Doris‑Based Lakehouse Integrated Analytics System at Kuaishou

This article presents Kuaishou's experience of designing and implementing a Doris‑driven lakehouse integrated analytics system, covering the current OLAP landscape, challenges of data duplication and governance, the new architecture with caching and auto‑materialization, implementation details, performance impact, and future work.

Auto Materialization · Big Data · Lakehouse
0 likes · 24 min read
Bilibili Tech
Aug 23, 2024 · Big Data

Accelerating Multi‑Dimensional OLAP Queries in ClickHouse with Grouping Sets, RBM, and Dense Dictionary Encoding

To achieve sub‑second, multi‑dimensional analytics on Bilibili’s hundred‑million‑row datasets, the team built a ClickHouse‑based acceleration layer that combines grouping‑set pre‑aggregation, bitmap (RBM) distinct handling, and a dense dictionary encoding service, dramatically cutting CPU, memory and query latency versus traditional OLAP pipelines.

Big Data · Bitmap · ClickHouse
0 likes · 28 min read
ByteDance Data Platform
Aug 20, 2024 · Big Data

How FlinkSQL Optimizations Cut CPU Usage by Up to 60% in Streaming Jobs

This article details the FlinkSQL performance enhancements implemented by the streaming team, covering view reuse, redundant shuffle removal, multiple‑input operator redesign, long sliding‑window optimizations, and native JSON format improvements, which together deliver up to 60% CPU savings and massive core‑hour reductions.

Big Data · CPU Reduction · FlinkSQL
0 likes · 13 min read
Big Data Technology & Architecture
Aug 20, 2024 · Big Data

Practical Insights on Using Apache Paimon for Real-World Data Lake Scenarios

This article shares a personal, experience‑driven overview of Apache Paimon, highlighting its design simplicity, key capabilities such as schema evolution, stream‑batch unified processing, primary‑key support, and closed‑loop data handling, while discussing when its features are appropriate for production environments.

Apache Paimon · Batch Processing · Big Data
0 likes · 5 min read
Su San Talks Tech
Aug 18, 2024 · Big Data

How to Crush the One Billion Row Java Challenge: From 14 Minutes to Sub‑2‑Second Runtime

This article walks through the One Billion Row Challenge, explaining the problem, baseline solution, and a series of performance optimizations—from JVM selection and parallel I/O to custom hash tables, unsafe memory access, and SIMD techniques—that shrink execution time from minutes to under two seconds.

Big Data · One Billion Row Challenge · Performance
0 likes · 20 min read
DataFunSummit
Aug 17, 2024 · Big Data

AnalyticDB Spark Architecture and Vectorized Engine Performance Overview

This article introduces the AnalyticDB Spark architecture, explains the need for Spark vectorization, surveys industry vectorized solutions, details ADB Spark's own vectorized implementation with Gluten and Velox, and presents performance test results showing a 6.98‑fold speedup over open‑source Spark.

AnalyticDB · Big Data · Gluten
0 likes · 9 min read
Mike Chen's Internet Architecture
Aug 16, 2024 · Big Data

Understanding the Lambda Architecture for Big Data Processing

This article explains the Lambda architecture—a three‑layer model combining batch and real‑time processing for large‑scale data, outlines its components, advantages, disadvantages, common tools, and compares it with the Kappa alternative while providing practical insights for data engineers.

Batch Processing · Big Data · Lambda architecture
0 likes · 5 min read
High Availability Architecture
Aug 16, 2024 · Big Data

Introduction to Elasticsearch: Core Concepts, Query Types, Pagination, and Data Synchronization

This article provides a comprehensive overview of Elasticsearch, covering its distributed storage architecture, core data model concepts, analysis and query capabilities, practical next‑token pagination techniques, join strategies, and various data synchronization methods for integrating Elasticsearch with other systems.

Big Data · Elasticsearch · Query DSL
0 likes · 13 min read
DataFunSummit
Aug 15, 2024 · Artificial Intelligence

Building an LLM‑Driven Metric Platform for Data Democratization

This article explains how large language models (LLMs) can drive data democratization through a metric platform that combines LLM agents, semantic layers, NL2SQL/NL2API pipelines, and warehouse‑internal and external semantics. It also showcases SwiftAgent/SwiftMetrics innovations, real‑world case studies, and future directions.

Big Data · Data Democratization · LLM
0 likes · 13 min read
Mike Chen's Internet Architecture
Aug 14, 2024 · Big Data

Understanding Data Middle Platform: Value, Architecture, and Real‑World Cases

This article explains the concept, value, three‑layer architecture, and practical implementations of a data middle platform, illustrating how it standardizes data, forms a middle‑office organization, and drives cost‑effective business empowerment through examples from Alibaba, NetEase, and other enterprises.

Architecture · Big Data · Data Governance
0 likes · 9 min read
DataFunSummit
Aug 14, 2024 · Big Data

Solving Typical Issues in Migrating to Spark 3.1: Multiple Catalog, Hive‑SQL to Spark‑SQL Migration, and Performance & Stability Optimizations at Xiaomi

This article shares Xiaomi's experience building a next‑generation one‑stop data development platform on Spark 3.1, covering typical challenges such as Multiple Catalog implementation, Hive‑SQL to Spark‑SQL migration, offline Spark performance and stability optimizations, and future roadmap plans.

Apache Spark · Big Data · Data Platform
0 likes · 18 min read
DataFunSummit
Aug 13, 2024 · Big Data

Data Cost Reduction and Efficiency: Qichacha's Data Architecture and Multi‑Cloud Unified Design

This article presents Qichacha's comprehensive data‑cost‑reduction strategy, detailing its Hadoop‑based three‑pillar architecture, layered data warehouse, Hive upgrades, unified metadata across multi‑cloud clusters, middleware choices such as Alluxio and JuiceFS, version‑compatible hybrid clouds, and Kubernetes‑driven resource orchestration to achieve scalable, low‑cost data processing.

Big Data · Hadoop · data-warehouse
0 likes · 16 min read
Bilibili Tech
Aug 13, 2024 · Big Data

How Bilibili Re‑engineered Its Search Indexing with Distributed Storage and Spark

This article details Bilibili's transformation of its search offline indexing pipeline, moving from manual MySQL‑based processes to a high‑capacity, distributed KV store and Spark‑driven builds, addressing performance, maintenance, and scalability challenges while improving resource efficiency and iteration speed.

Big Data · Bilibili · KV Store
0 likes · 24 min read

How Hudi MetaServer Transforms Metadata Management and Performance in Data Lakes

This article examines the challenges of Hudi metadata stored on HDFS, introduces the independently developed Hudi MetaServer for centralized metadata, visual management, unified permission control, TTL, expression payloads, and multi‑active scaling, and outlines future enhancements such as LLS, multi‑table fusion, and JDBC support.

Big Data · Data Lake · Hudi
0 likes · 11 min read
Top Architect
Aug 10, 2024 · Big Data

Design and Implementation of a Scalable Real-Time Log Monitoring Platform at Baidu

This article introduces Baidu's log platform that handles billions of daily events, explains UBC logging concepts and monitoring requirements, and details a low‑cost, high‑accuracy architecture using real‑time streaming, dimension mapping, watermarking, and time‑window aggregation to achieve reliable, scalable event monitoring.

Big Data · Log Monitoring · Real-time Streaming
0 likes · 14 min read
DataFunSummit
Aug 9, 2024 · Big Data

Design and Practice of Ant Group's Metric System

This article presents a comprehensive overview of Ant Group's metric system, covering its definition, three-layer architecture, common challenges, concept consensus methods, semantic layer options, mechanism design, productization capabilities, platform improvements, business outcomes, future directions, and a detailed Q&A session.

Big Data · Data Platform · data modeling
0 likes · 28 min read
360 Zhihui Cloud Developer
Aug 8, 2024 · Big Data

How to Migrate HBase and HDFS Clusters Safely Without Downtime

This guide details a step‑by‑step migration plan for HBase and HDFS clusters, covering background, high‑availability architecture, role assignments, scaling ZooKeeper and JournalNode out and in, NameNode and DataNode migration, rolling restarts, and common upgrade pitfalls.

Big Data · Cluster Migration · HBase
0 likes · 12 min read
DataFunSummit
Aug 6, 2024 · Big Data

Implementing a Multi‑Tenant Lakehouse Data Platform for Real‑Time Analytics at a SaaS CRM Company

This article details how a SaaS CRM provider built a cloud‑native Lakehouse platform to support multi‑tenant real‑time analytics, describing data challenges, metadata‑driven architecture, virtual database design, query optimization, BI integration, AI readiness, migration steps, and the resulting performance and scalability gains.

Big Data · Data Platform · Lakehouse
0 likes · 19 min read
DataFunSummit
Aug 5, 2024 · Big Data

Velox Memory Management and Execution Engine Overview

This article presents a comprehensive overview of Meta's open‑source Velox query execution engine, detailing its architecture, vectorized execution model, memory‑pool hierarchy, arbitrator and allocator designs, spilling techniques, and future development plans for large‑scale data processing.

Big Data · Memory Management · Query Execution
0 likes · 24 min read
NewBeeNLP
Aug 5, 2024 · Industry Insights

How Alibaba Cloud Scales Search Recommendations with Big Data, AI, and LLMs

This article details Alibaba Cloud's end‑to‑end architecture for search and advertising recommendation, covering the data platform, AI services, feature‑store design, training and inference optimizations, and the integration of large language models for new recommendation scenarios.

AI Platform · Alibaba Cloud · Big Data
0 likes · 17 min read
Big Data Technology & Architecture
Aug 5, 2024 · Big Data

Key Features of Apache Flink 1.20: Materialized Tables, DISTRIBUTED BY, and State/Checkpoint Optimizations

The article reviews Apache Flink 1.20, highlighting the new Materialized Table concept, the DISTRIBUTED BY support for load‑balanced storage and join performance, and state/checkpoint file merging improvements, while providing code examples and practical insights for users.

Apache Flink · Big Data · Checkpoint Optimization
0 likes · 7 min read
DataFunSummit
Aug 4, 2024 · Big Data

Apache Hudi from Zero to One: Comprehensive Guide to Write Indexing (Part 4)

This article explains Apache Hudi’s write‑side indexing, detailing the indexing API, various index types—including simple, Bloom, bucket, HBase, and record‑level indexes—and their mechanisms, helping readers understand how Hudi validates record existence and optimizes updates and deletions.

Apache Hudi · Big Data · Data Lake
0 likes · 9 min read
DataFunTalk
Aug 2, 2024 · Artificial Intelligence

From Big Data to Large Models: Alibaba Cloud AI Platform Architecture and Practices for Search Recommendation

This presentation details Alibaba Cloud's AI platform, covering the end‑to‑end pipeline from big‑data processing and feature engineering to large‑model training, inference optimization, recommendation system architecture, and RAG applications, highlighting practical engineering solutions and performance gains.

AI Platform · Big Data · Feature Store
0 likes · 18 min read
DataFunSummit
Aug 1, 2024 · Big Data

Deep Dive into Apache Spark SQL: Concepts, Core Components, and API

This article provides a comprehensive overview of Apache Spark SQL, covering its fundamental concepts such as TreeNode, AST, and QueryPlan, the distinction between logical and physical plans, the rule‑execution framework, core components like SparkSqlParser and Analyzer, as well as the Spark Session, Dataset/DataFrame, and various writer APIs, supplemented by a detailed Q&A session.

Apache Spark · Big Data · SQL optimization
0 likes · 19 min read
StarRocks
Aug 1, 2024 · Big Data

How Kingsoft Office Boosted Query Speed 2.3× with StarRocks 3.0

Kingsoft Office migrated its reporting platform from a multi‑engine stack to StarRocks 3.0, achieving a 48.84% performance gain, halving query latency, reducing operational costs, and improving resource utilization while supporting storage‑compute separation and seamless Trino SQL compatibility.

Big Data · StarRocks · Storage-Compute Separation
0 likes · 14 min read
Data Thinking Notes
Jul 29, 2024 · Big Data

What Is a Data Middle Platform and How Does It Transform Enterprise Data Management?

This article explains the concept, design principles, and core components of a data middle platform, detailing its overall, functional, layered, logical, and data architectures, as well as the specific platforms for data collection, processing, organization, governance, quality, sharing, and visualization, illustrated with diagrams.

Big Data · Data Architecture · Data Governance
0 likes · 27 min read
58 Tech
Jul 29, 2024 · Databases

HBase Cloud Migration: Architecture, Challenges, and Solutions

This technical report details the background, architecture, construction, core issues, migration plans, and future roadmap of moving 58's HBase clusters to a cloud‑native environment, highlighting cost reduction, operational automation, and performance optimizations.

Big Data · Cloud Native · HBase
0 likes · 22 min read
DataFunTalk
Jul 27, 2024 · Big Data

Design and Implementation of Kuaishou's Metric Middle Platform

This article presents Kuaishou's metric middle platform, detailing its background, design principles, metric management and service architecture, including headless BI concepts, unified analysis language OAX, query engine OCTO, data modeling layers, acceleration strategies, and future directions toward intelligence and high performance.

Big Data · Headless BI · Kuaishou
0 likes · 19 min read
DataFunSummit
Jul 26, 2024 · Big Data

Understanding Power Law Distributions in Content Ecosystems: Data Science Insights and Applications

This article explores how power‑law and other heavy‑tailed distributions appear in content ecosystems, explains their statistical foundations, discusses why they are common, and presents data‑driven strategies—including integer programming, graph‑based creator analysis, and causal inference—to optimize content production, recommendation, and settlement policies.

Big Data · Data Science · Power Law
0 likes · 18 min read
Big Data Technology & Architecture
Jul 26, 2024 · Databases

Apache Doris Architecture and Common Q&A: Read/Write Flow, Replication Consistency, Storage, and High Availability

This article provides a comprehensive overview of Apache Doris, explaining its frontend and backend nodes, storage structures such as tablets, rowsets, and segments, replication mechanisms, partitioning versus bucketing, indexing types, compaction processes, and high‑availability strategies through a detailed Q&A format.

Apache Doris · Big Data · Database Architecture
0 likes · 22 min read
Big Data Technology & Architecture
Jul 25, 2024 · Big Data

Fundamental Concepts and File Layout of Paimon: Snapshots, Partitions, Buckets, Consistency, and Compaction

This article explains Paimon's core concepts—including snapshots, partitions, buckets, consistency guarantees, file layout, LSM‑tree organization, and compaction strategies—while also covering table management tasks such as snapshot expiration, rollback, partition expiration, and small‑file mitigation techniques.

Big Data · Buckets · LSM‑Tree
0 likes · 12 min read
StarRocks
Jul 24, 2024 · Big Data

Why Lakehouse Architecture Is Redefining Big Data Infrastructure in the AI Era

The article examines the rapid rise of lakehouse architecture, its market momentum, core components—including storage, metadata, table formats, and compute layers—compares Iceberg, Hudi, and Delta Lake, discusses the shift from HDFS to object storage, and outlines the strategic importance of lakehouses for AI-driven data management and future data infrastructure trends.

AI · Apache Iceberg · Big Data
0 likes · 28 min read
DataFunSummit
Jul 23, 2024 · Big Data

Multi-Cloud Unified Data Acceleration Layer at Xiaohongshu: Challenges, Alluxio Solution, and Performance Gains

This article presents Xiaohongshu's multi‑cloud unified data acceleration layer built with Alluxio, detailing the challenges of multi‑cloud architectures, the design goals, Alluxio's architecture and features, real‑world case studies in AI training and recommendation indexing, performance improvements, and future plans.

AI training · Alluxio · Big Data
0 likes · 22 min read
DataFunTalk
Jul 23, 2024 · Big Data

Practical Experience with Apache Kyuubi and Apache Celeborn in Big Data Platforms

This article shares detailed practical experiences from DingXiangYuan's big‑data platform on using Apache Kyuubi and Apache Celeborn, covering architecture, flexible configuration, AuthZ fine‑grained permissions, small‑file and Z‑Order optimizations, Arrow‑based large result transmission, and operational tips such as connection‑level issues and Netty cache handling.

Apache Celeborn · Apache Kyuubi · Arrow
0 likes · 17 min read
JD Tech
Jul 23, 2024 · Big Data

Design and Architecture of JD's Buffalo Distributed Workflow Scheduling System

This article examines JD's self‑developed Buffalo distributed workflow scheduling system for big‑data ETL, detailing its two‑layer entity model, instance‑based scheduling, high‑availability three‑layer architecture, performance optimizations, cold‑hot data separation, and open APIs to support massive, complex data pipelines.

Big Data · Scheduling · high availability
0 likes · 11 min read
Rare Earth Juejin Tech Community
Jul 22, 2024 · Big Data

Comprehensive Guide to Kafka: Architecture, Core Concepts, and Configuration

This article provides an in‑depth overview of Apache Kafka, covering its use cases, comparison with other message queues, versioning, performance mechanisms, core concepts such as topics, partitions, offsets, consumer groups, rebalancing, replication, leader election, idempotence, transactions, compression, interceptors, request handling, and practical configuration tips for reliable streaming applications.

Big Data · Consumer · Kafka
0 likes · 25 min read
21CTO
Jul 15, 2024 · Big Data

Twitter’s Kappa Architecture: Scaling Real-Time Processing of Billions of Events

Twitter migrated from a Lambda-based dual‑pipeline system to a Kappa architecture that relies on a single real‑time stream using Kafka, Google Pub/Sub, Dataflow, and BigTable, dramatically reducing latency, increasing throughput, and improving data accuracy for processing billions of daily events.

Big Data · Cloud Computing · Dataflow
0 likes · 9 min read
DataFunTalk
Jul 15, 2024 · Big Data

Douyin Group E‑commerce Data Tracking Evolution, Solutions, and Attribution Practices

This article examines Douyin Group's e‑commerce data‑tracking journey, detailing the progression from early log collection to Log 3.0, the challenges posed by rapidly evolving user flows, and the comprehensive solution framework—including BTM/BCM management, SDK capabilities, and an attribution platform—that improves data quality, development efficiency, and attribution accuracy.

Big Data · Data Tracking · SDK
0 likes · 20 min read
Alibaba Cloud Big Data AI Platform
Jul 12, 2024 · Big Data

How Flink + Hologres Power Real‑Time Streaming Warehouses

This article explains how combining Flink with Hologres creates a unified, real‑time streaming warehouse, detailing traditional layering approaches, the advantages of the Hologres‑based solution, core capabilities like Binlog and resource isolation, and a practical e‑commerce case study demonstrating performance gains.

Big Data · Flink · Hologres
0 likes · 21 min read
Data Thinking Notes
Jul 11, 2024 · Big Data

How to Build a Robust Data Lineage Foundation for Scalable Business Insights

This article explains how to construct a full‑chain data lineage system, covering its overall architecture, quality measurement framework, and application layer, and demonstrates practical use cases such as handling data growth, monitoring warehouse changes, accelerating development, ensuring consistency, and automating metric decomposition in real‑world business scenarios.

Big Data · Data Governance · Data Lineage
0 likes · 14 min read
Baidu Tech Salon
Jul 11, 2024 · Industry Insights

How Baidu Feed Evolved Its Data Warehouse with Multi‑Version Wide Tables

This article outlines the step‑by‑step evolution of Baidu's Feed data warehouse, from traditional layered modeling to hour‑level core tables, then real‑time wide tables, and finally a stream‑batch integrated multi‑version wide‑table architecture, highlighting the motivations, design choices, challenges, and resulting benefits.

Big Data · Real-time analytics · Versioning
0 likes · 10 min read
DataFunSummit
Jul 11, 2024 · Big Data

Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.

Apache Spark · Big Data · RDD
0 likes · 16 min read
Python Programming Learning Circle
Jul 10, 2024 · Big Data

Using the TransBigData Python Library for Mobile Signaling Data Processing, Analysis, and Visualization

This article introduces the open‑source Python package TransBigData, explains how to install it, and demonstrates step‑by‑step methods for reading mobile signaling data, preprocessing, identifying stays and moves, extracting home and work locations, and visualizing individual activity patterns using Jupyter notebooks.

Big Data · Geospatial · Python
0 likes · 8 min read
DataFunTalk
Jul 10, 2024 · Big Data

Apache SeaTunnel: A Next‑Generation Data Integration Platform for ETL/ELT and OLAP

This article introduces Apache SeaTunnel, a modern data integration platform designed for the EtLT era, detailing its architecture, core connector APIs, checkpoint mechanism, model inference, multi‑table synchronization, the high‑performance SeaTunnel Zeta engine, OLAP use cases, community roadmap, and the commercial WhaleTunnel product.

Apache SeaTunnel · Big Data · ELT
0 likes · 22 min read
Data Thinking Notes
Jul 9, 2024 · Big Data

How to Build a Robust Enterprise Data Asset Catalog for Better Governance

This article explains why a comprehensive data asset catalog is essential for modern enterprises, outlines its core components such as inventory, metadata, data lineage, standards and access control, details step‑by‑step construction methods, and highlights key applications in governance, quality, compliance, architecture and valuation.

Big Data · Data Catalog · Data Governance
0 likes · 13 min read
DataFunSummit
Jul 9, 2024 · Big Data

Materialized Views in MaxCompute: Design, Implementation, and Best Practices

This article explains the concept, advantages, and drawbacks of materialized views, describes how MaxCompute implements them—including creation syntax, maintenance properties, automatic query rewrite, smart recommendation, and auto‑materialization—and shares performance results and future improvement plans.

Automatic Refresh · Big Data · MaxCompute
0 likes · 13 min read
360 Smart Cloud
Jul 9, 2024 · Big Data

Understanding Shuffle in Spark: From Native Shuffle to External and Remote Shuffle Services (Uniffle)

This article examines the critical role of shuffle in big‑data processing, compares Spark's native shuffle with the External Shuffle Service (ESS) and Remote Shuffle Service (RSS) solutions, introduces Uniffle's architecture and configuration, and shares practical deployment experiences and performance results within a 360 internal environment.

Big Data · External Shuffle Service · Remote Shuffle Service
0 likes · 15 min read
DataFunTalk
Jul 6, 2024 · Big Data

StarRocks and Paimon Data Lake Capabilities, Migration Solutions, and Future Roadmap

This article presents a practical overview of StarRocks and Apache Paimon data‑lake capabilities, explains their performance advantages, details migration strategies from Trino/Presto and other engines, describes cluster‑to‑cluster migration, and outlines future roadmap for integration and optimization.

Big Data · Cloud Computing · Data Lake
0 likes · 13 min read
DataFunSummit
Jul 6, 2024 · Artificial Intelligence

Highlights of DataFunCon 2024 Beijing: Big Data, AI, and Large‑Model Trends

The two‑day DataFunCon 2024 Beijing conference gathered hundreds of big‑data and AI experts to discuss the evolution from data lakes to lake‑warehouses, large‑model development, practical applications, and future strategies for enterprises, while showcasing partner exhibitions and a vibrant community spirit.

Artificial Intelligence · Big Data · China
0 likes · 9 min read
DataFunSummit
Jul 5, 2024 · Big Data

Highlights of DataFunCon 2024 Beijing: Big Data, Large Models, and AI Integration

The DataFunCon 2024 Beijing conference opened with keynote speeches on the evolution of Alibaba Cloud's big data platform, explored distributed data warehousing, large model research, and practical AI applications, and concluded with a round‑table discussing future trends and enterprise strategies for big data and AI integration.

Artificial Intelligence · Big Data · conference
0 likes · 8 min read
iQIYI Technical Product Team
Jul 5, 2024 · Big Data

RiskFactor: An Integrated Real‑Time and Offline Feature Platform for Risk Control

RiskFactor unifies iQIYI’s legacy real‑time and offline feature platforms onto Opal’s DAG‑plus‑SQL engine, accelerating feature production fifteen‑fold, cutting latency from hours to minutes, streamlining development, lowering costs, and delivering more reliable, versioned risk‑control capabilities against sophisticated online threats.

Big Data · DAG · Real-time Streaming
0 likes · 14 min read
Data Thinking Notes
Jul 4, 2024 · Big Data

How Active Metadata Revolutionizes Data Governance and Cuts Costs

This article examines the growing challenges of data management—such as asset discoverability, architectural rigidity, development quality, and rising resource costs—and presents a comprehensive data‑governance framework that leverages standards, agile architecture, development isolation, and active‑metadata‑driven lifecycle evaluation to improve efficiency, reduce expenses, and enable intelligent, automated data back‑filling.

Big Data · Data Governance · Storage Optimization
0 likes · 17 min read
JD Cloud Developers
Jul 3, 2024 · Big Data

How to Build a High‑Availability Real‑Time Logistics Dashboard with Flink and ClickHouse

This article details the design and implementation of a high‑availability, real‑time logistics supply‑chain dashboard, covering Flink‑based data pipelines, ClickHouse OLAP storage, metric consistency, stability measures, extensible configuration, and comprehensive monitoring to ensure accurate, scalable performance during major promotions.

Big Data · ClickHouse · Dashboard
0 likes · 9 min read
StarRocks
Jul 2, 2024 · Big Data

What’s New in StarRocks 3.3? Deep Dive into Lakehouse‑Optimized Performance and Features

StarRocks 3.3 introduces a comprehensive set of enhancements—including maturity levels, ARM‑optimized performance, advanced caching, materialized‑view rewrites, storage optimizations, and expanded lakehouse ecosystem support—that together boost stability, query speed, and usability for large‑scale analytics workloads.

Big Data · Cache Optimization · Lakehouse
0 likes · 15 min read
DataFunSummit
Jul 2, 2024 · Cloud Computing

Global Perspective on Multi-Cloud Data Architecture

The forum presents a series of technical talks on multi‑cloud data architecture, covering Xiaomi’s lake‑warehouse practice, cross‑border e‑commerce data platforms, Alluxio‑based machine‑learning acceleration, Qichacha’s cost‑effective data solutions, and Kuaishou’s Flink on Kubernetes migration, highlighting strategies, implementations, and audience benefits.

Big Data · Cloud Computing · Data Architecture
0 likes · 8 min read
JD Tech
Jul 2, 2024 · Big Data

Real‑Time Monitoring Dashboard for Logistics Supply Chain: Architecture, Data Modeling, and Stability Design

This article presents the design and implementation of a high‑availability, real‑time logistics supply‑chain monitoring dashboard, covering its data processing pipeline with Flink, storage choices between Elasticsearch and ClickHouse, multi‑layer architecture, metric consistency, stability mechanisms, extensibility configurations, and monitoring practices.

Big Data · ClickHouse · Dashboard
0 likes · 11 min read
DataFunTalk
Jun 28, 2024 · Big Data

Accelerating Spark with ClickHouse: Native Optimization Techniques and Performance Evaluation

This article presents a comprehensive technical overview of using ClickHouse as a native backend to accelerate Spark SQL execution, covering Spark performance bottlenecks, ClickHouse's CPU‑level optimizations, the design and implementation of the Spark‑Native integration, and detailed TPC‑DS benchmark results demonstrating up to 3.5× speedup.

Big Data · ClickHouse · Native Execution
0 likes · 33 min read
Tencent Cloud Developer
Jun 28, 2024 · Big Data

Capacity-Constrained Influence Maximization: Algorithms and Applications

The paper introduces Capacity‑Constrained Influence Maximization (CIM), a framework that selects up to k neighbors per active user to maximize spread under node capacity limits, proposes MG‑Greedy and RR‑Greedy algorithms with ≥½ approximation, and demonstrates the near‑linear RR‑OPIM+ method’s superior accuracy and speed on large social networks and a Tencent game recommendation system.

Big Data · Capacity Constraint · KDD 2023
0 likes · 8 min read
DevOps
Jun 27, 2024 · Big Data

Agile Data Engineering: Code‑as‑Infrastructure, Reuse Strategies, and ETL‑Level Continuous Integration

This article explores agile data engineering, advocating code‑as‑infrastructure practices such as code‑everything, data and code reuse, and ETL‑level continuous integration, while discussing the trade‑offs between data‑centric and code‑centric reuse, cost considerations, and practical implementation tips for modern data projects.

Agile Development · Big Data · Code as Infrastructure
0 likes · 22 min read
DataFunTalk
Jun 27, 2024 · Big Data

Data Warehouse Construction and Data Governance Practices at Wing Payment

This presentation by senior data warehouse engineer Huang Luo details Wing Payment’s end‑to‑end data warehouse build, covering background challenges, governance framework, platform architecture, layered modeling, naming standards, asset management, monitoring, and future plans, illustrating how systematic data governance drives cost reduction, efficiency, and security.

Analytics · Big Data · Data Governance
0 likes · 14 min read
DataFunTalk
Jun 26, 2024 · Big Data

Evolution of the Big Data + AI Development Paradigm and Alibaba Cloud’s Integrated Architecture

This article examines how the big‑data AI development paradigm has shifted from model‑centric to data‑centric workflows, outlines the challenges of integrating data and AI teams, and details Alibaba Cloud’s end‑to‑end, serverless big‑data platform—including MaxCompute, Hologres, MaxFrame, Object Table, and vector search—designed to accelerate large‑scale AI applications.

AI integration · Big Data · Data Platform
0 likes · 20 min read
Baidu Geek Talk
Jun 24, 2024 · Big Data

Accelerating Spark with ClickHouse Native Techniques: Design, Implementation, and Performance Evaluation

The paper presents a Spark acceleration framework that replaces Java‑based task operators with a ClickHouse native library, converting plans via Protobuf and JNI, leveraging columnar storage, SIMD and JIT to achieve up to 3× speed‑up on TPC‑DS workloads while providing fallback mechanisms to ensure no performance loss.

Big Data · ClickHouse · Native Acceleration
0 likes · 31 min read
DataFunTalk
Jun 22, 2024 · Big Data

Migrating Spark Shuffle Service from ESS to RSS (Celeborn) at Zhihu: Design, Implementation, and Benefits

This article details Zhihu's migration of massive Spark and MapReduce shuffle workloads from the External Shuffle Service (ESS) to a push‑based Remote Shuffle Service (RSS) powered by Celeborn, covering background problems, evaluation of open‑source implementations, deployment architecture, encountered issues, solutions, performance gains, and future plans.

Big Data · Performance · RSS
0 likes · 19 min read
DataFunSummit
Jun 21, 2024 · Big Data

Building a Complete Data System with Apache Arrow: Architecture, Dynamic Schema Modeling, and Practical Tips

This article explains why new data systems are needed, introduces Apache Arrow and its columnar in‑memory format, describes dynamic read‑time modeling, outlines the system’s execution flow, storage and indexing strategies, and shares practical tips and extensions for building scalable big‑data solutions.

Acero · Apache Arrow · Big Data
0 likes · 20 min read
Meituan Technology Team
Jun 20, 2024 · Big Data

Vectorized Execution in Apache Spark: Meituan’s Practice with Gluten and Velox

Meituan enhances Apache Spark by integrating the Gluten‑Velox vectorized execution engine, converting row‑wise operations to columnar SIMD processing, which yields over 40% memory savings and up to 13% faster runtimes across thousands of ETL jobs, while addressing stability, ORC support, shuffle redesign, and off‑heap memory optimization.

Apache Spark · Big Data · Gluten
0 likes · 30 min read
DataFunSummit
Jun 20, 2024 · Big Data

Data+AI Data Lake Technologies: Apache Iceberg, PyIceberg, and Vector Table Solutions

This article presents a comprehensive overview of modern Data+AI data lake challenges and solutions, covering the evolution of data lakes, an introduction to Apache Iceberg, practical use of PyIceberg for AI training and inference pipelines, and advanced vector table and indexing techniques for efficient similarity search.

AI training · Apache Iceberg · Big Data
0 likes · 22 min read
AI Architecture Hub
Jun 20, 2024 · Big Data

How GeoHash Powers Efficient Large-Scale Location Queries Without Pagination

This article explains the GeoHash algorithm, shows how it converts latitude‑longitude pairs into compact binary strings, demonstrates the encoding process with a concrete example, and discusses how the resulting prefixes can be used to quickly locate nearby users in massive datasets while highlighting remaining edge‑case challenges.
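
As a rough illustration of the encoding step this summary describes, the sketch below interleaves longitude and latitude bisection bits and maps every five bits to a base‑32 character. It is a minimal sketch of the standard GeoHash scheme, not code from the linked article; the function name `geohash_encode` and its `precision` parameter are assumptions made for this example.

```python
# Standard GeoHash base-32 alphabet (digits plus lowercase letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat: float, lon: float, precision: int = 8) -> str:
    """Encode a latitude/longitude pair as a GeoHash string (illustrative helper)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lon = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# A longer hash of the same point always starts with the shorter one,
# which is what makes prefix lookups for nearby points possible.
short_hash = geohash_encode(42.605, -5.603, precision=5)   # "ezs42"
long_hash = geohash_encode(42.605, -5.603, precision=8)
```

Two points that sit on opposite sides of a cell boundary can share no prefix even though they are close together, which is one of the edge cases a prefix‑only lookup has to handle, typically by also checking neighbouring cells.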

Big Data · GeoHash · Location Query
0 likes · 7 min read
vivo Internet Technology
Jun 19, 2024 · Big Data

Understanding BitMap and Roaring BitMap: Principles, Containers, and Java API Usage

The article explains BitMap fundamentals and introduces Roaring BitMap’s compressed container architecture—Array, BitMap, and Run containers—detailing their conversion logic, Java implementation snippets, performance advantages over traditional BitSets, and practical API usage for high‑performance, memory‑efficient big‑data applications.
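
To make the container idea concrete, here is a simplified sketch of how a value can be split into a 16‑bit container key and a 16‑bit low part, and how a sorted array container can be converted to a bitmap container once it grows past 4096 entries. It is an illustration only, written in Python rather than Java: the class `TinyRoaring` and its methods are hypothetical, Run containers are omitted, and none of this is the RoaringBitmap library's actual API.

```python
import bisect

ARRAY_MAX = 4096          # an array container holds at most this many 16-bit values
BITMAP_BYTES = 65536 // 8  # a bitmap container covers all 65536 low values (8 KB)


class TinyRoaring:
    """Toy two-level bitmap: high 16 bits pick a container, low 16 bits live inside it."""

    def __init__(self):
        # key (high 16 bits) -> sorted list (array container) or bytearray (bitmap container)
        self.containers = {}

    def add(self, value: int) -> None:
        if not 0 <= value < 2**32:
            raise ValueError("value must fit in an unsigned 32-bit integer")
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            self.containers[high] = [low]
        elif isinstance(container, list):
            pos = bisect.bisect_left(container, low)
            if pos == len(container) or container[pos] != low:
                container.insert(pos, low)
                if len(container) > ARRAY_MAX:
                    # Too dense for a sorted array: convert to a fixed-size bitmap.
                    bitmap = bytearray(BITMAP_BYTES)
                    for v in container:
                        bitmap[v >> 3] |= 1 << (v & 7)
                    self.containers[high] = bitmap
        else:
            container[low >> 3] |= 1 << (low & 7)

    def __contains__(self, value: int) -> bool:
        high, low = value >> 16, value & 0xFFFF
        container = self.containers.get(high)
        if container is None:
            return False
        if isinstance(container, list):
            pos = bisect.bisect_left(container, low)
            return pos < len(container) and container[pos] == low
        return bool(container[low >> 3] & (1 << (low & 7)))

    def kind(self, high: int) -> str:
        """Report which container type currently backs the given key."""
        container = self.containers[high]
        return "array" if isinstance(container, list) else "bitmap"
```

The 4096 threshold follows from the sizes involved: 4096 two‑byte entries occupy 8 KB, the same as a full 65536‑bit bitmap, so beyond that point the bitmap is never larger than the array.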

Big Data · Containers · Roaring Bitmap
0 likes · 18 min read