Tagged articles
607 articles
DataFunTalk
Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Real-time Processing
0 likes · 12 min read
Meituan Technology Team
Jan 25, 2024 · Artificial Intelligence

Design and Implementation of a Distributed Causal Forest Framework on Meituan's Fulfillment Platform

Meituan’s Fulfillment Platform team built a high‑performance distributed causal‑forest framework—named Causal On Spark—that trains hundreds of trees on hundreds of millions of samples within minutes using MapReduce‑based histogram splitting, extensive memory optimizations, Parquet model serving, and novel distributed evaluation metrics, enabling scalable causal inference for pricing, subsidies, and marketing.

Model Serving · Spark · causal forest
0 likes · 23 min read
DataFunTalk
Jan 12, 2024 · Big Data

Building a Unified Data Empowerment Layer with Apache Kyuubi at GF Securities

The article describes how GF Securities designed and implemented a unified big‑data empowerment layer based on Apache Kyuubi to address data‑centric challenges, improve efficiency, ensure controllable governance, and support agile data scenarios across ingestion, processing, storage, and security.

Apache Kyuubi · Big Data · Data Empowerment
0 likes · 33 min read
DataFunSummit
Dec 17, 2023 · Big Data

Apache Kyuubi 1.8: New Features and Enhancements Overview

Apache Kyuubi 1.8 introduces a range of enhancements including multi‑tenant serverless SQL support on Spark and Flink, expanded batch and streaming capabilities, improved resource scheduling with database‑backed queues, stronger Kerberos/LDAP security, Flink YARN integration, and a new web UI for management.

Apache Kyuubi · Big Data · Flink
0 likes · 13 min read
Zhongtong Tech
Dec 14, 2023 · Big Data

How Celeborn Transformed Spark Shuffle Performance at ZTO Express

Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.

Big Data · Remote Shuffle Service · Shuffle
0 likes · 15 min read
DataFunTalk
Dec 2, 2023 · Big Data

Apache Celeborn: Overview, Architecture, Community, and Future Roadmap

This article introduces Apache Celeborn, explains the challenges of intermediate data in large‑scale compute engines, details its core architecture and design—including master, worker, lifecycle manager and shuffle client—covers its community history, version releases, performance comparisons with Spark ESS, real‑world deployment scenarios, and outlines future development plans.

Apache Celeborn · Big Data · Flink
0 likes · 14 min read
DataFunTalk
Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 12 min read
DataFunSummit
Nov 25, 2023 · Big Data

Practical Experience with Apache Kyuubi and Celeborn on the DXY Big Data Platform

This article presents a comprehensive technical overview of how DXY's big data platform leverages Apache Kyuubi and Celeborn to unify Spark entry points, configure flexible task isolation, implement fine‑grained AuthZ, optimize small files and Z‑Order sorting, and accelerate large result set transmission with Arrow, while also discussing operational challenges and upcoming features.

Apache Kyuubi · Arrow · Big Data
0 likes · 17 min read
Zhuanzhuan Tech
Nov 22, 2023 · Backend Development

Improving Stability and High Availability of an Advertising Billing System: Architecture Upgrade and Optimizations

This article describes the background, problems, and a series of architectural upgrades—including MQ replacement, thread‑pool isolation, Redis/TiKV redundancy, and Spark‑based compensation—to enhance the stability, scalability, and high‑availability of an advertising billing system.

Advertising · Backend · Message Queue
0 likes · 12 min read
DataFunTalk
Nov 18, 2023 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's extensive migration of Spark Shuffle to a cloud‑native architecture, describing the massive data volumes, the underlying ESS and CSS services, the challenges of resource isolation, monitoring, throttling, spill‑splitting, and the performance gains achieved across stable and mixed‑resource clusters.

Big Data · ByteDance · Cloud Native
0 likes · 20 min read
Alibaba Cloud Native
Nov 10, 2023 · Big Data

Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

MiHoYo’s data platform team details their migration of Spark workloads to Alibaba Cloud’s ACK Kubernetes service, describing how the Spark‑on‑K8s + OSS‑HDFS architecture delivers elastic compute, up to 50% cost reduction, and true compute‑storage separation, while addressing operational challenges through custom operators, Celeborn, and robust monitoring.

Big Data · Cost Optimization · Kubernetes
0 likes · 24 min read
Alibaba Cloud Big Data AI Platform
Nov 10, 2023 · Big Data

How We Transformed Big Data Workloads with Spark on Kubernetes and OSS‑HDFS

Facing rapid growth in offline data and compute demands, we migrated our big‑data platform to a cloud‑native architecture using Spark 3.2.3 on Kubernetes with OSS‑HDFS storage, achieving elastic scaling, cost reduction, and compute‑storage separation while detailing implementation, challenges, and operational insights.

Spark · cloud-native · elastic computing
0 likes · 25 min read
dbaplus Community
Oct 18, 2023 · Databases

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.

Big Data · ClickHouse · SQL
0 likes · 15 min read
DataFunTalk
Oct 13, 2023 · Big Data

Design Principles, Architecture, and Applications of the Open‑Source LakeSoul Lakehouse Framework

This article provides a comprehensive technical overview of LakeSoul, an open‑source, cloud‑native lakehouse framework, covering its design philosophy, core features, architecture, performance benchmarks, real‑time ingestion, incremental computation, multi‑stream joining, security, community progress, and future roadmap.

Big Data · Data Lakehouse · Flink
0 likes · 16 min read
DataFunSummit
Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big Data · Data Lake · Flink
0 likes · 22 min read
Baidu Geek Talk
Sep 27, 2023 · Big Data

Design and Implementation of a Content Revenue Settlement System

The article details the design and implementation of a content revenue settlement platform that aggregates traffic and ad data, uses a Spark‑plus‑PALO architecture for processing tens of millions of daily records, and employs a master‑worker model with idempotent tasks, temporary tables, and verification steps to ensure reliable monthly profit‑share calculations for authors, media, mini‑program owners, and users.

Distributed Processing · Palo · Spark
0 likes · 14 min read
dbaplus Community
Sep 3, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration

This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.

Batch-Stream Integration · Big Data · Flink
0 likes · 14 min read
Bilibili Tech
Sep 1, 2023 · Big Data

Design and Implementation of Session‑Based User Engagement Tracking for Cloud TV Application

The Cloud Vision TV app implements a session‑id and placement‑id driven tracking pipeline that generates, collects, and processes lifecycle data across server and client layers, enabling fine‑grained engagement strategies, scene reconstruction via AC automata, and actionable BI dashboards to improve user retention and personalization.

BI visualization · OLAP · Spark
0 likes · 14 min read
Tencent Cloud Developer
Aug 23, 2023 · Big Data

WeChat Experiment Platform: Architecture Design and Iceberg Lakehouse Optimization

The WeChat Experiment Platform migrated its data pipeline (60,000 metrics, 200,000 cores, more than 30 PB) to an Iceberg‑based lakehouse, leveraging three‑layer metadata, fine‑grained partitioning, MERGE INTO writes, time‑travel snapshots, and skew‑handling UDFs, which cut core time by 69%, saved ~100 PB of storage, and reduced latency by up to 70%.

Big Data · Data Warehouse · Iceberg
0 likes · 18 min read
ITPUB
Aug 23, 2023 · Cloud Native

Build a Cloud‑Native Lakehouse on AWS with Apache Iceberg and Amoro

This guide explains the cloud‑native lakehouse concept, outlines its advantages and challenges, compares lake‑table projects such as Iceberg, and provides a step‑by‑step AWS deployment of Apache Iceberg and Amoro—including environment setup, AMS installation, catalog configuration, optimizer launch, data ingestion with Flink, and query verification with Spark.

AWS · Amoro · Apache Iceberg
0 likes · 33 min read
Zhengcaiyun Tech
Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture · Cluster Deployment · Distributed Systems
0 likes · 19 min read
DataFunTalk
Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDB · Big Data · Data Lake
0 likes · 19 min read
Youzan Coder
Aug 8, 2023 · Big Data

Kylin4 Deployment and Performance Optimizations at Youzan

Since 2018 Youzan has migrated all online services to Kylin4, addressing long cube rebuilds, single‑point cache, CPU spikes, and throttling gaps by adding batch segment builds, low‑priority concurrency controls, Redis‑based query caching, Parquet skew mitigation, range‑query acceleration, and class‑loader optimizations, which together doubled queries‑per‑second capacity to 150, cut latency by up to 50%, and reduced CPU usage.

Big Data · Cube · Kylin
0 likes · 17 min read
Zhengcaiyun Tech
Jul 6, 2023 · Big Data

Optimizing Large‑Scale Table Joins in Spark Using Bloom Filters

To address the resource‑intensive challenges of joining billion‑row tables in data warehouses, this article examines common optimization approaches, analyzes Spark’s SortMergeJoin algorithm, and proposes a Bloom‑filter‑based solution that filters unchanged data early, dramatically improving performance and reducing cluster resource consumption.

JOIN optimization · Java · SQL
0 likes · 17 min read
DataFunTalk
Jun 29, 2023 · Big Data

Practical Deployment of Delta Lake in BI and AI Products

This article summarizes a technical presentation on how Delta Lake is integrated into a BI+AI platform, covering the product background, data‑lake architecture, Delta Lake features such as ACID transactions, schema management, multi‑engine support, performance optimizations, and future development directions.

AI · BI · Big Data
0 likes · 12 min read
DataFunTalk
Jun 26, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities—transactional writes, schema evolution, implicit partitioning, and row‑level updates—while showcasing Xiaomi's real‑world applications such as log ingestion redesign, near‑real‑time warehousing, offline optimizations, column‑level encryption, Hive migration strategies, and outlining upcoming enhancements like materialized views and cloud migration.

Big Data · Column Encryption · Data Lake
0 likes · 20 min read
Code Ape Tech Column
Jun 21, 2023 · Big Data

From Java Streams to Spark: Basic Big Data Operations Explained

This article demonstrates how developers familiar with Java Stream APIs can quickly grasp fundamental Spark operations—including map, flatMap, groupBy, and reduce—by translating stream examples into Spark code, providing complete code snippets, explanations of transformations versus actions, and practical tips for handling exceptions in distributed processing.

Big Data · Java Stream · MAP
0 likes · 24 min read
JD Tech
Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Hive · SQL
0 likes · 17 min read
Big Data Technology & Architecture
Jun 13, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.

Big Data · Data Lake · Flink
0 likes · 21 min read
DataFunSummit
Jun 11, 2023 · Artificial Intelligence

Applying Uplift Modeling, PSM Matching, and Spark CausalML for Growth at Tencent Video

This article explains how Tencent Video leverages causal inference techniques—including uplift gain models, propensity‑score‑matching (PSM), and a distributed Spark‑based CausalML library—to identify incremental user effects, evaluate marketing interventions, and improve growth across advertising, internal flow, push notifications, and coupon strategies.

Propensity Score Matching · Spark · growth analytics
0 likes · 12 min read
DataFunTalk
Jun 9, 2023 · Big Data

Cloud Music Data Governance Practice

This article presents a comprehensive case study of NetEase Cloud Music's data governance practice, covering data background, governance philosophy, detailed solutions across metadata, storage, compute, and model design, practical implementations, measurable cost savings, and future planning for sustainable data management.

Cost Optimization · Hadoop · Spark
0 likes · 15 min read
DevOps
Jun 7, 2023 · Big Data

Deploying Apache Spark on YARN vs Kubernetes: Architecture, Benefits, and Comparison

This article explains how Apache Spark can be deployed using the traditional Hadoop YARN resource manager and the newer Kubernetes approach, detailing configuration steps, submission methods, and a comprehensive comparison of isolation, scalability, learning curve, logging, performance, and cost considerations.

Big Data · Kubernetes · Spark
0 likes · 10 min read
360 Tech Engineering
Jun 2, 2023 · Big Data

Overcoming Challenges in User Profiling: A Big Data‑Driven Framework for Precise Marketing

The article outlines how a unified, big‑data‑based user profiling platform addresses traditional data silos, high costs, and limited functionality by standardizing tags, integrating Spark and RHadoop processing, and enabling a closed‑loop marketing workflow that improves accuracy and operational efficiency.

Big Data · Data Integration · Marketing Automation
0 likes · 7 min read
DataFunSummit
May 28, 2023 · Big Data

Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook

This article introduces Apache Hudi as a next‑generation streaming data‑lake platform, explains its core concepts, architecture, and table types, and showcases real‑world use cases at Tencent such as CDC ingestion, minute‑level real‑time warehousing, streaming analytics, multi‑stream joins, ad attribution, and stream‑to‑batch processing, while also outlining future directions.

Apache Hudi · CDC · Data Lake
0 likes · 16 min read
DataFunSummit
May 21, 2023 · Big Data

Blaze: Design and Practice of SparkSQL Native Operator Optimization at Kuaishou

This article presents Blaze, a Kuaishou‑built native execution middleware for SparkSQL that leverages Apache DataFusion to achieve vectorized operator execution, detailing its architecture, implementation, performance gains, current coverage, benchmark results, production rollout, and future development plans.

DataFusion · Native Execution · Performance Optimization
0 likes · 17 min read
Data Thinking Notes
May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Hive · Small Files · Spark
0 likes · 10 min read
Big Data Technology & Architecture
May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files
0 likes · 9 min read
DataFunTalk
May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting
0 likes · 11 min read
Rare Earth Juejin Tech Community
Apr 28, 2023 · Artificial Intelligence

Exploring Alibaba’s Tongyi Qianwen AI Model, SWOT, Recipe Demo, and Code Samples for Spark Same‑Period Analysis and Java Bubble Sort

The article reviews Alibaba’s Tongyi Qianwen large‑language model, shares a cooking recipe generated by the AI, presents a SWOT analysis, and provides code examples—including a Spark Scala script for same‑period month‑over‑month calculations and a Java bubble‑sort implementation.

AI · Java · SWOT
0 likes · 12 min read
58 Tech
Apr 20, 2023 · Big Data

Design and Implementation of a Data Application Platform for Business Opportunity Selection, Tagging, and Scheduling

The article describes a data application platform that enables business users to configure custom data selection rules for opportunities, create scheduled tasks, perform large‑scale data comparison, handle task dispatch with Redis queues, and implement rate‑limiting using sliding windows to ensure reliable processing.

Spark · rate limiting · redis
0 likes · 9 min read
Zhengcaiyun Tech
Apr 18, 2023 · Big Data

Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance by collecting storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimating monthly expenses using replication factors and resource‑usage rates, while providing practical SQL and shell examples.

Data Cost Governance · HDFS · Hive
0 likes · 18 min read
JD Retail Technology
Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew
0 likes · 15 min read
ITPUB
Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, improving latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing
0 likes · 19 min read
DataFunTalk
Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Spark
0 likes · 13 min read
DataFunSummit
Mar 30, 2023 · Artificial Intelligence

MindAlpha: A High‑Performance Distributed Machine Learning Platform for Advertising

The article introduces MindAlpha, a high‑performance distributed machine‑learning platform built for large‑scale, sparse ad‑tech workloads, detailing its architecture, MLOps pipeline, Spark integration, sync/async training strategies, CPU/GPU choices, model‑splitting techniques, and future directions such as model pruning and AutoML.

AI · Ad Tech · MLOps
0 likes · 10 min read
DataFunSummit
Mar 29, 2023 · Big Data

Gluten Vectorized Engine: Boosting Spark Performance with Native Execution

The article introduces the Gluten vectorized engine, explains why Spark’s CPU bottleneck motivates integrating native vectorized back‑ends via Substrait, details its architecture, component design, current performance gains of up to three‑fold, and outlines ongoing development and future work.

Gluten · Native Engine · Spark
0 likes · 18 min read
ITPUB
Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink
0 likes · 13 min read
DataFunTalk
Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data
0 likes · 12 min read
DataFunTalk
Mar 8, 2023 · Artificial Intelligence

Applying AI Algorithms to Big Data Governance: Use Cases and Future Directions

This article presents Datacake's experience of integrating AI algorithms into big data governance, covering the bidirectional relationship between AI and big data, health‑score assessment of data tasks, intelligent Spark parameter tuning, SQL engine selection, and future application scenarios across the data lifecycle.

AI · Big Data · Data Governance
0 likes · 18 min read
Alibaba Cloud Big Data AI Platform
Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · Cloud Native
0 likes · 13 min read
Tencent Cloud Developer
Mar 1, 2023 · Big Data

We Analysis User Profiling System: Architecture and Technical Implementation

We Analysis, the official data‑analysis platform for WeChat mini‑program providers, delivers a zero‑learning‑curve user‑profiling system that combines basic tag analysis and flexible, rule‑based segmentation, using an ETL pipeline to store pre‑computed data in TDSQL and online bitmap‑optimized queries in ClickHouse with RoaringBitmap, ensuring low‑latency, stable, and comprehensive analytics.

ClickHouse · DataPipeline · Spark
0 likes · 20 min read
DataFunSummit
Feb 28, 2023 · Big Data

Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.

Data Lake · Flink · Iceberg
0 likes · 20 min read
Programmer DD
Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big DataHadoopSpark
0 likes · 16 min read
StarRing Big Data Open Lab
Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big DataMapReduceSpark
0 likes · 12 min read
ITPUB
Feb 7, 2023 · Big Data

How Kuaigou Built a Scalable Real‑Time Data Warehouse with Spark, Flink, and Cloud

Facing massive, multi‑source traffic and the need for instant analytics, Kuaigou’s real‑time data warehouse evolved from on‑premises Spark to a cloud‑native stack using Alibaba Blink, Flink, and layered OLAP models, streamlining development, cutting costs, and enabling diverse real‑time applications.

FlinkOLAPSpark
0 likes · 11 min read
Alibaba Cloud Native
Jan 9, 2023 · Big Data

How Kubernetes Powers Cloud‑Native Big Data with EMR on ACK

This article explains the shift of big data and machine‑learning workloads toward storage‑compute separation and cloud‑native architectures, outlines the technical challenges of running Spark on Kubernetes, and details the EMR on ACK solution with its architecture, performance gains, and real‑world adoption.

ACKEMRSpark
0 likes · 6 min read
21CTO
Jan 7, 2023 · Big Data

How WeChat’s WeAnalysis Powers Scalable User Segmentation with Big Data Architecture

This article explains the design and implementation of WeChat's WeAnalysis user‑profiling system, covering its basic tag and user‑group modules, multi‑source data ingestion, ETL processing, storage choices such as TDSQL and ClickHouse, bitmap handling, query performance, and service APIs for flexible, high‑performance user segmentation.

ClickHouseData AnalyticsSpark
0 likes · 20 min read
DataFunTalk
Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big DataCost reductionHadoop
0 likes · 17 min read
Tencent Advertising Technology
Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big DataData LakeFlink
0 likes · 20 min read
Data Thinking Notes
Dec 21, 2022 · Big Data

Why Your Spark Batch Job Fails: Memory Limits, Data Skew, and Practical Fixes

This article examines a recurring Spark batch task failure caused by OutOfMemory errors and data skew, details the investigation steps—including increasing executor memory, raising parallelism, and analyzing shuffle metrics—and proposes solutions such as data validation, filtering oversized keys, and memory adjustments.

Batch ProcessingData SkewOutOfMemory
0 likes · 4 min read
Data Thinking Notes
Dec 14, 2022 · Big Data

Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior

This article investigates why Spark tasks that appear killed in the Web UI continue running on the driver, analyzes the role of daemon versus non‑daemon threads and SparkContext shutdown mechanisms, reproduces the issue with sample code, and provides practical solutions such as using daemon threads or checking SparkContext status.

DaemonThreadSparkbigdata
0 likes · 8 min read
AntTech
Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big DataConfidential ComputingOcclum
0 likes · 11 min read
政采云技术
Dec 6, 2022 · Fundamentals

How to Use Antlr4 for Custom SQL Parsing in Spark Projects

This guide explains common business scenarios that require custom SQL parsing, walks through setting up Antlr4 in IntelliJ IDEA, configuring Maven dependencies, generating parser code, and provides Java examples for extracting table names from Spark SQL statements, including handling of prediction modes and execution results.

Antlr4BackendJava
0 likes · 11 min read
Open Source Linux
Dec 1, 2022 · Fundamentals

How NVIDIA Boosted Software Safety by Switching from C to SPARK

NVIDIA’s security team adopted the formally verified SPARK language, replacing C in safety‑critical components, and after a successful proof‑of‑concept demonstrated improved security, verification efficiency, and unchanged performance, leading to widespread internal adoption across many products.

AdaCoreC to SPARK migrationNvidia
0 likes · 4 min read
Volcano Engine Developer Services
Nov 28, 2022 · Cloud Native

How ByteDance Built a Massive Cloud‑Native Big Data Platform to Power TikTok

ByteDance’s cloud‑native computing team, led by Li Yakun, details how they transformed a Hadoop‑centric big‑data stack into a Kubernetes‑driven platform—customizing storage, middleware, and scheduling—to support petabyte‑scale workloads, achieve over 40% resource utilization, and sustain rapid product growth.

Big DataCloud NativeSpark
0 likes · 17 min read
ITPUB
Nov 18, 2022 · Big Data

How Xiaomi Uses Iceberg for Real‑Time Streaming and Batch Data Lakes

This article introduces Iceberg’s table‑format fundamentals, details Xiaomi’s large‑scale deployment of Iceberg for CDC and log ingestion, explores their streaming‑batch integration experiments, outlines future roadmap items, and provides a comprehensive Q&A covering practical challenges and solutions.

Batch ProcessingBig DataData Lake
0 likes · 23 min read
Meituan Technology Team
Nov 10, 2022 · Big Data

Optimizing Spark mapPartitions: Memory Management and Best Practices

The article details how Meituan’s Turing machine‑learning platform cut offline resource use by 80% and task time by 63% through memory‑level techniques such as column pruning, adaptive caching, and a deep dive into Spark’s mapPartitions operator, including source‑code analysis, GC behavior, and a low‑memory batch‑iterator best practice.

Big DataMemory OptimizationSpark
0 likes · 19 min read
Optimizing Spark mapPartitions: Memory Management and Best Practices
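
The low‑memory batch‑iterator idea mentioned in the summary above can be sketched in plain Python. This is a minimal illustration, not the code from the Meituan article: `process_in_batches`, `handle_batch`, and `double_all` are hypothetical names, and the sketch assumes the per‑batch function returns an iterable of results. Only one batch of rows is held in memory at a time, and results are yielded lazily.

```python
from itertools import islice


def process_in_batches(rows, batch_size, handle_batch):
    """Lazily yield results while holding at most one batch of rows in memory."""
    it = iter(rows)
    while True:
        # Pull the next batch_size rows from the partition iterator.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Emit this batch's results before reading any further rows.
        yield from handle_batch(batch)


def double_all(batch):
    # Hypothetical per-batch work, standing in for e.g. batched model scoring.
    return [x * 2 for x in batch]


result = list(process_in_batches(range(7), 3, double_all))
```

Because the function takes an iterator and returns an iterator, it has the shape PySpark's `RDD.mapPartitions` expects, for example `rdd.mapPartitions(lambda rows: process_in_batches(rows, 1000, double_all))`; the batch size of 1000 is an arbitrary placeholder.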
Data Thinking Notes
Nov 8, 2022 · Big Data

Effective Spark GC Tuning: Experiments, Results, and Best Practices

This article walks through a Spark job’s garbage‑collection tuning workflow, presents step‑by‑step experiments with different JVM options and collectors, compares performance under tight and normal memory conditions, and offers practical recommendations for choosing the optimal GC strategy in big‑data workloads.

MemorySparkTuning
0 likes · 12 min read
dbaplus Community
Oct 30, 2022 · Big Data

Why Layered Data Warehouse Modeling Boosts Performance and Cuts Costs

This article explains the importance of layering in data warehouse modeling, outlines the four ETL steps, describes common pitfalls, presents a typical technical stack, and details each warehouse layer (ODS, DWD, DWS, ADS) along with best‑practice naming conventions and implementation tips for big‑data environments.

ETLHiveModeling
0 likes · 38 min read
DataFunSummit
Oct 30, 2022 · Big Data

Integrating Apache Spark with Cloud‑Native Technologies: Principles, Kubernetes Deployments, EMR on ACK, and Serverless Spark on DLF

This article examines the challenges of traditional Spark clusters and explains how integrating Spark with cloud‑native platforms—through Kubernetes deployment modes, EMR on ACK practices, Remote Shuffle Service, and serverless Spark on DLF—provides elastic scaling, lower operational costs, and advanced features such as executor rolling and custom scheduler support.

Big DataDLFKubernetes
0 likes · 18 min read
Bilibili Tech
Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big DataKubernetesKyuubi
0 likes · 20 min read
Hulu Beijing
Oct 21, 2022 · Big Data

How Hulu Scales Spark on Kubernetes: Cloud‑Native Big Data at Disney‑Scale

Hulu’s data platform team describes how they migrated large‑scale Spark workloads from YARN to native Spark on Kubernetes, leveraging AWS services such as EKS, S3, and custom operators to achieve dynamic scaling, unified monitoring, cost‑effective resource management, and improved stability for search, recommendation, and advertising pipelines.

AWSBig DataCloud Native
0 likes · 18 min read
DataFunSummit
Oct 18, 2022 · Big Data

Feature Overview of Apache Kyuubi (Incubating) v1.5.0

The article presents a detailed technical walkthrough of Apache Kyuubi 1.5.0, covering its service‑oriented architecture, high‑availability design, multi‑engine extensions for Spark, Flink, Trino and Hive, enhanced engine‑sharing policies, POOL mode configuration, and the project’s future roadmap.

Apache KyuubiBig DataEngine Architecture
0 likes · 13 min read

Understanding Data Skew and Its Mitigation Strategies in Distributed Computing

The article explains what data skew is in distributed computing, analyzes its logical and data‑level causes, and presents preventive and remedial techniques such as data partitioning, logical replacement, two‑stage aggregation, increasing parallelism, and data cleaning to improve processing efficiency.

Data SkewPerformance OptimizationSpark
0 likes · 8 min read
Understanding Data Skew and Its Mitigation Strategies in Distributed Computing
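
Two‑stage aggregation, one of the remedies named in the summary above, can be illustrated with a small pure‑Python sketch. It is an assumption‑laden toy rather than the article's implementation: `two_stage_count`, `num_salts`, and the sample keys are hypothetical. A random salt spreads a hot key across several groups in the first stage, and the second stage merges the partial counts per original key.

```python
import random
from collections import Counter


def two_stage_count(keys, num_salts=4, seed=0):
    """Count keys in two stages so a single hot key is split across salted groups."""
    rng = random.Random(seed)
    # Stage 1: attach a random salt to each key and aggregate per (key, salt).
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: drop the salt and merge the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


# One heavily skewed key plus a few rare ones.
keys = ["hot"] * 1000 + ["a", "b", "c"]
partial, final = two_stage_count(keys)
```

In a distributed engine each `(key, salt)` group could be handled by a different task, so no single task has to process all records of the hot key; the final totals match those of a direct count.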

How a Leading E‑commerce Platform Built a Scalable Data Warehouse with Lambda & Hudi

This article explains how an e‑commerce company designed and implemented a modern data warehouse—combining batch Spark jobs, real‑time Flink streams, and Hudi data‑lake storage—to handle terabytes of daily logs, ensure data quality, and provide fast, reliable analytics for business decision‑making.

Data LakeData WarehouseETL
0 likes · 16 min read