Tagged articles

Spark

623 articles · Page 2 of 7

May 29, 2024 · Artificial Intelligence

Distributed Network Embedding Algorithm for Billion‑Scale Graph Data in Tencent Games

Tencent’s Game Social Algorithm Team presents a Spark‑based distributed network embedding framework that recursively partitions hundred‑billion‑edge game graphs into manageable subgraphs, runs node2vec locally, and fuses results, enabling efficient link prediction and node classification across multiple games within hours.

Distributed Computing · Game Analytics · Spark

0 likes · 7 min read

DataFunSummit

May 27, 2024 · Big Data

Design and Optimization of Zhihu's Bridge Platform for DMP/CDP: Architecture, Challenges, and Solutions

This article presents a comprehensive case study of Zhihu's Bridge platform, detailing its background, five core modules, unified architecture built on Spark and Flink, bitmap‑based tagging, and performance optimizations that address query speed, write latency, and high‑QPS online checks while outlining future directions with Doris 2.0 and large language models.

CDP · DMP · Data Platform

0 likes · 27 min read

Big Data Technology & Architecture

May 27, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform – Architecture, Features, and Impact

The Athena Data Factory, built by Spark Thinking, is a comprehensive one‑stop data development and governance platform that integrates data integration, development, analysis, and services, offering offline, real‑time, and AI pipelines, modular architecture, extensive monitoring, and cost‑optimisation to empower thousands of users across the company.

Airflow · Big Data · Cloud Computing

0 likes · 26 min read

DataFunTalk

May 26, 2024 · Big Data

Athena Data Factory: A One‑Stop Data Development and Governance Platform for Sparkle Thinking

The article details how Sparkle Thinking built the Athena Data Factory—a comprehensive, self‑service data development and governance platform that integrates data integration, ETL, real‑time processing, monitoring, and analytics, describing its architecture, key technologies, implementation timeline, operational practices, performance gains, and future directions.

Airflow · ETL · Flink

0 likes · 26 min read

DataFunSummit

May 15, 2024 · Big Data

Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Evolution

This article details Xiaomi's sales data warehouse development, covering its history, architecture, dimensional modeling, layer design, streaming‑batch integration, governance, security, and future directions, while also addressing practical Q&A on implementation challenges and best practices.

Big Data · Data Warehouse · Flink

0 likes · 15 min read

Big Data Technology & Architecture

May 13, 2024 · Big Data

Apache Paimon 0.8 Release: Deletion Vectors, File Index, Performance Boosts, and Flink/Spark Integration Enhancements

The article introduces Apache Paimon 0.8, highlighting new Deletion Vectors, a universal file index, memory and I/O optimizations, record‑level TTL, and integration improvements with Flink and Spark, while also discussing broader lake‑house performance trends and future directions.

Apache Paimon · Big Data · Deletion Vectors

0 likes · 8 min read

DataFunSummit

Apr 25, 2024 · Big Data

Paimon Project Overview: Recent Developments, Core Capabilities, and Future Roadmap

This article presents a comprehensive overview of the Apache‑incubated Paimon project, covering its evolution from Flink Table Store, the current features of primary‑key and log tables, management tools such as snapshots, tags and branches, performance optimizations for Flink and Spark, and a detailed roadmap of upcoming functionalities.

Big Data · Data Management · Flink

0 likes · 23 min read

DataFunSummit

Apr 22, 2024 · Big Data

Intelligent Optimization of Bilibili’s Iceberg‑Based Lakehouse for Query Acceleration

This article describes Bilibili’s intelligent optimization project that automatically analyzes historical query workloads to configure multi‑dimensional sorting, various indexes, and pre‑aggregation on Iceberg tables, thereby reducing scan volume by 28% across dozens of tables and improving OLAP query latency.

Big Data · Data Warehouse · Iceberg

0 likes · 15 min read

StarRocks

Mar 26, 2024 · Big Data

How Replacing Spark with StarRocks Cut Data Refresh Time by 90% and Saved 99% Cost

The article details how the Xiaohongshu data warehouse team integrated StarRocks into their offline processing pipeline, replacing Spark for heavy Cube calculations, which reduced job execution from hours to minutes, cut resource consumption by over 90%, advanced daily data output by 1.5 hours, and lowered refresh cost by more than 99%.

Big Data · OLAP · Performance Optimization

0 likes · 18 min read

DataFunSummit

Mar 24, 2024 · Big Data

Design and Implementation of a User Data Warehouse and Profiling System at 58.com

This article details the design and implementation of a user data warehouse at 58.com, covering data warehouse fundamentals, user profiling concepts, multi‑layer architecture, modeling methods, ETL migration from Hive to Spark, data quality assurance, and the resulting achievements.

Big Data · Data Warehouse · ETL

0 likes · 20 min read

DataFunSummit

Mar 20, 2024 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's large‑scale evolution of Spark Shuffle to a cloud‑native architecture, describing background, stability and mixed‑resource scenarios, challenges such as CPU and I/O limits, custom ESS enhancements, shuffle throttling, spill‑split mechanisms, and the Cloud Shuffle Service with its push‑based design and performance gains.

Big Data · Performance Optimization · Shuffle

0 likes · 21 min read

Xiaohongshu Tech REDtech

Mar 18, 2024 · Big Data

Optimizing Offline Data Warehouse with StarRocks: Replacing Spark for Faster, Cost‑Effective Data Processing

By replacing part of its Spark‑based offline pipeline with StarRocks, Xiaohongshu’s data‑warehouse team cut job execution from hours to minutes, reduced resource usage by over 90 %, lowered back‑fill cost by 99 %, and accelerated daily data production by 1.5 hours.

Data Warehouse · OLAP · Performance Optimization

0 likes · 16 min read

DataFunSummit

Feb 29, 2024 · Big Data

Trino at Xiaomi: Architecture, Practices, and Future Plans

This article details Xiaomi’s practical deployment of Trino, covering its architectural role, core and extended capabilities, performance comparisons, integration with Iceberg and Spark, operational enhancements, multi‑cluster and ad‑hoc query scenarios, future cloud‑storage plans, and a Q&A session.

Big Data · Iceberg · OLAP

0 likes · 20 min read

Baidu Tech Salon

Feb 28, 2024 · Big Data

Design, Optimization, and Practice of Baidu's Fusion Compute Engine for Data Warehouse

Baidu’s Fusion Compute Engine, built on Spark with a one‑layer wide‑table model, combines data‑skipping, push‑down, code‑generation, vectorization and extensive tuning to cut ad‑hoc query latency to seconds, shrink storage by ~30 %, and accelerate ETL workloads while maintaining stability for massive data‑warehouse workloads.

Baidu · Big Data · Fusion Compute Engine

0 likes · 10 min read

Baidu Geek Talk

Feb 28, 2024 · Big Data

How Baidu’s Fusion Compute Engine Cuts Query Time to Seconds on Petabyte‑Scale Data

This article analyzes Baidu's fusion compute engine for its data warehouse, detailing its architecture, optimization techniques such as data skipping, Parquet column indexing, ProjectLimit and CodeGen, and demonstrates how these innovations reduce query latency to seconds while cutting storage costs by about 30% on multi‑petabyte workloads.

Baidu · Big Data · Data Warehouse

0 likes · 12 min read

DataFunTalk

Feb 25, 2024 · Big Data

Implementation Practice of Bilibili's Tag System: Evolution, Architecture, and Future Plans

This article details Bilibili's tag system from its 2021 inception through successive redesigns, describing the three‑layer architecture, data flow pipelines using Hive, Iceberg, Spark and ClickHouse, crowd selection DSL, online services with Redis, performance optimizations, and upcoming governance and quality initiatives.

Big Data · ClickHouse · Data Engineering

0 likes · 12 min read

Meituan Technology Team

Jan 25, 2024 · Artificial Intelligence

Design and Implementation of a Distributed Causal Forest Framework on Meituan's Fulfillment Platform

Meituan’s Fulfillment Platform team built a high‑performance distributed causal‑forest framework—named Causal On Spark—that trains hundreds of trees on hundreds of millions of samples within minutes using MapReduce‑based histogram splitting, extensive memory optimizations, Parquet model serving, and novel distributed evaluation metrics, enabling scalable causal inference for pricing, subsidies, and marketing.

Spark · causal forest · causal inference

0 likes · 23 min read

DataFunSummit

Jan 21, 2024 · Big Data

Xiaomi Sales Data Warehouse: Architecture, Construction Theory, and Capability Layers

This article presents Xiaomi's sales data warehouse practice, detailing its evolution, positioning, dimensional modeling, layered architecture, Lambda design, Iceberg integration, capability building, security governance, and future directions toward data value and real‑time metrics.

Big Data · Data Warehouse · Flink

0 likes · 15 min read

DataFunTalk

Jan 12, 2024 · Big Data

Building a Unified Data Empowerment Layer with Apache Kyuubi at GF Securities

The article describes how GF Securities designed and implemented a unified big‑data empowerment layer based on Apache Kyuubi to address data‑centric challenges, improve efficiency, ensure controllable governance, and support agile data scenarios across ingestion, processing, storage, and security.

Apache Kyuubi · Big Data · Data Empowerment

0 likes · 33 min read

dbaplus Community

Dec 25, 2023 · Big Data

Why Spark and Flink Can't Stream MySQL via JDBC (And What Works Instead)

This article explains the limitations of using JDBC for true streaming reads in Spark and Flink, demonstrates failed attempts with MySQL, shows workarounds that revert to batch processing, and recommends Flink CDC as the practical solution for incremental MySQL ingestion.

Big Data · CDC · Flink

0 likes · 8 min read

DataFunSummit

Dec 17, 2023 · Big Data

Apache Kyuubi 1.8: New Features and Enhancements Overview

Apache Kyuubi 1.8 introduces a range of enhancements including multi‑tenant serverless SQL support on Spark and Flink, expanded batch and streaming capabilities, improved resource scheduling with database‑backed queues, stronger Kerberos/LDAP security, Flink YARN integration, and a new web UI for management.

Apache Kyuubi · Big Data · Flink

0 likes · 13 min read

Zhongtong Tech

Dec 14, 2023 · Big Data

How Celeborn Transformed Spark Shuffle Performance at ZTO Express

Facing massive daily Spark shuffle volumes and unstable ETL performance, ZTO Express migrated from the community External Shuffle Service to Celeborn's Remote Shuffle Service, achieving higher disk I/O efficiency, better reliability, reduced network connections, and significant reductions in task failures and job latency.

Big Data · Remote Shuffle Service · Shuffle

0 likes · 15 min read

DataFunTalk

Dec 2, 2023 · Big Data

Apache Celeborn: Overview, Architecture, Community, and Future Roadmap

This article introduces Apache Celeborn, explains the challenges of intermediate data in large‑scale compute engines, details its core architecture and design—including master, worker, lifecycle manager and shuffle client—covers its community history, version releases, performance comparisons with Spark ESS, real‑world deployment scenarios, and outlines future development plans.

Apache Celeborn · Big Data · Flink

0 likes · 14 min read

DataFunTalk

Nov 30, 2023 · Big Data

Big Data Cloud‑Native Trends and Challenges Highlighted at the 2023 Yunqi Conference

The 2023 Yunqi Conference in Hangzhou showcased the latest advances in cloud computing and big‑data technologies, examined the evolution from big‑data 1.0 to 3.0, discussed the key difficulties of making big data cloud‑native, and presented a practical case study of MiHoYo’s cloud‑native transformation.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 12 min read

DataFunSummit

Nov 25, 2023 · Big Data

Practical Experience with Apache Kyuubi and Celeborn on the DXY Big Data Platform

This article presents a comprehensive technical overview of how DXY's big data platform leverages Apache Kyuubi and Celeborn to unify Spark entry points, configure flexible task isolation, implement fine‑grained AuthZ, optimize small files and Z‑Order sorting, and accelerate large result set transmission with Arrow, while also discussing operational challenges and upcoming features.

Apache Kyuubi · Arrow · Big Data

0 likes · 17 min read

Zhuanzhuan Tech

Nov 22, 2023 · Backend Development

Improving Stability and High Availability of an Advertising Billing System: Architecture Upgrade and Optimizations

This article describes the background, problems, and a series of architectural upgrades—including MQ replacement, thread‑pool isolation, Redis/TiKV redundancy, and Spark‑based compensation—to enhance the stability, scalability, and high‑availability of an advertising billing system.

Advertising · High Availability · Message Queue

0 likes · 12 min read

DataFunTalk

Nov 18, 2023 · Big Data

Large‑Scale Evolution of Spark Shuffle Cloud‑Native Architecture at ByteDance

This article details ByteDance's extensive migration of Spark Shuffle to a cloud‑native architecture, describing the massive data volumes, the underlying ESS and CSS services, the challenges of resource isolation, monitoring, throttling, spill‑splitting, and the performance gains achieved across stable and mixed‑resource clusters.

Big Data · ByteDance · Cloud Native

0 likes · 20 min read

Alibaba Cloud Native

Nov 10, 2023 · Big Data

Scaling Spark on Kubernetes: Elastic Compute, Cost Savings, and Storage Decoupling

MiHoYo’s data platform team details their migration of Spark workloads to Alibaba Cloud’s ACK Kubernetes service, describing how the Spark‑on‑K8s + OSS‑HDFS architecture delivers elastic compute, up to 50% cost reduction, and true compute‑storage separation, while addressing operational challenges through custom operators, Celeborn, and robust monitoring.

Big Data · Spark · Storage Decoupling

0 likes · 24 min read

Big Data Technology & Architecture

Nov 10, 2023 · Big Data

MVP Learning Roadmap for Securing a Big Data Internship

This article offers a concise MVP learning plan for recent graduates aiming to secure a big‑data internship, covering essential computer fundamentals, core big‑data frameworks, project ideas, and algorithm/SQL practice, along with practical study tips and resource recommendations.

Flink · Hadoop · SQL

0 likes · 8 min read

Alibaba Cloud Big Data AI Platform

Nov 10, 2023 · Big Data

How We Transformed Big Data Workloads with Spark on Kubernetes and OSS‑HDFS

Facing rapid growth in offline data and compute demands, we migrated our big‑data platform to a cloud‑native architecture using Spark 3.2.3 on Kubernetes with OSS‑HDFS storage, achieving elastic scaling, cost reduction, and compute‑storage separation while detailing implementation, challenges, and operational insights.

Spark · cloud-native · elastic computing

0 likes · 25 min read

dbaplus Community

Oct 18, 2023 · Databases

Doris vs ClickHouse: Which Database Delivers Faster Writes and Queries?

This article presents a systematic performance comparison between Doris and ClickHouse, covering data ingestion speed, SQL syntax differences, hardware impact, and detailed query benchmarks across multiple scenarios, ultimately revealing that each system excels in different use cases.

Big Data · ClickHouse · Database Comparison

0 likes · 15 min read

DataFunTalk

Oct 13, 2023 · Big Data

Design Principles, Architecture, and Applications of the Open‑Source LakeSoul Lakehouse Framework

This article provides a comprehensive technical overview of LakeSoul, an open‑source, cloud‑native lakehouse framework, covering its design philosophy, core features, architecture, performance benchmarks, real‑time ingestion, incremental computation, multi‑stream joining, security, community progress, and future roadmap.

Big Data · Data Lakehouse · Flink

0 likes · 16 min read

DataFunSummit

Oct 1, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation introduces Iceberg's core capabilities, details Xiaomi's practical applications—including log ingestion, near‑real‑time warehousing, offline challenges, column‑level encryption, and Hive migration—and outlines future development directions such as materialized views and cloud migration, providing a comprehensive view of modern data‑lake engineering.

Big Data · Data Lake · Flink

0 likes · 22 min read

Baidu Geek Talk

Sep 27, 2023 · Big Data

Design and Implementation of a Content Revenue Settlement System

The article details the design and implementation of a content revenue settlement platform that aggregates traffic and ad data, uses a Spark‑plus‑PALO architecture for processing tens of millions of daily records, and employs a master‑worker model with idempotent tasks, temporary tables, and verification steps to ensure reliable monthly profit‑share calculations for authors, media, mini‑program owners, and users.

Distributed Processing · Palo · Spark

0 likes · 14 min read

dbaplus Community

Sep 3, 2023 · Big Data

How NetEase Yanxuan Migrated from Lambda to Iceberg for Seamless Batch‑Stream Integration

This article explains how NetEase Yanxuan upgraded its legacy Lambda architecture to an Iceberg‑based batch‑stream unified platform, detailing the original data pipeline, the challenges faced, the evaluation of Iceberg versus Hudi and DeltaLake, and the concrete engineering optimizations and governance measures implemented to achieve lower latency and higher query performance.

Batch-Stream Integration · Big Data · Flink

0 likes · 14 min read

Bilibili Tech

Sep 1, 2023 · Big Data

Design and Implementation of Session‑Based User Engagement Tracking for Cloud TV Application

The Cloud Vision TV app implements a session‑id and placement‑id driven tracking pipeline that generates, collects, and processes lifecycle data across server and client layers, enabling fine‑grained engagement strategies, scene reconstruction via AC automata, and actionable BI dashboards to improve user retention and personalization.

BI visualization · Mobile App · OLAP

0 likes · 14 min read

Tencent Cloud Developer

Aug 23, 2023 · Big Data

WeChat Experiment Platform: Architecture Design and Iceberg Lakehouse Optimization

The WeChat Experiment Platform migrated its data pipeline (60,000 metrics, 200,000 cores, and more than 30 PB of data) to an Iceberg‑based lakehouse, leveraging three‑layer metadata, fine‑grained partitioning, MERGE INTO writes, time‑travel snapshots, and skew‑handling UDFs, which cut core time by 69%, saved ~100 PB of storage, and reduced latency by up to 70%.

Big Data · Data Warehouse · Iceberg

0 likes · 18 min read

ITPUB

Aug 23, 2023 · Cloud Native

Build a Cloud‑Native Lakehouse on AWS with Apache Iceberg and Amoro

This guide explains the cloud‑native lakehouse concept, outlines its advantages and challenges, compares lake‑table projects such as Iceberg, and provides a step‑by‑step AWS deployment of Apache Iceberg and Amoro—including environment setup, AMS installation, catalog configuration, optimizer launch, data ingestion with Flink, and query verification with Spark.

AWS · Amoro · Apache Iceberg

0 likes · 33 min read

Zhengcaiyun Tech

Aug 23, 2023 · Big Data

Step-by-Step Guide to Building a Hadoop Big Data Cluster on ARM Architecture

This comprehensive tutorial details the process of deploying a complete Hadoop-based big data ecosystem on ARM architecture, covering the installation and configuration of essential components including Java, Zookeeper, Hadoop, MySQL, Hive, and Spark with practical code examples.

ARM architecture · Cluster Deployment · Hadoop

0 likes · 19 min read

DataFunTalk

Aug 20, 2023 · Databases

Best Practices for Building Low‑Cost Data Lake Analytics with AnalyticDB MySQL and Serverless Spark

This article presents a comprehensive technical overview of Alibaba Cloud AnalyticDB MySQL and its Serverless Spark integration, detailing architecture, core optimizations, security enhancements, and real‑world case studies that demonstrate how to achieve cost‑effective, high‑performance data lake analytics.

AnalyticDB · Big Data · Data Lake

0 likes · 19 min read

Youzan Coder

Aug 8, 2023 · Big Data

Kylin4 Deployment and Performance Optimizations at Youzan

Since 2018 Youzan has migrated all online services to Kylin4, addressing long cube rebuilds, single‑point cache, CPU spikes, and throttling gaps by adding batch segment builds, low‑priority concurrency controls, Redis‑based query caching, Parquet skew mitigation, range‑query acceleration, and class‑loader optimizations, which together doubled queries‑per‑second capacity to 150, cut latency by up to 50 %, and reduced CPU usage.

Big Data · Cube · Kylin

0 likes · 17 min read

GuanYuan Data Tech Team

Jul 27, 2023 · Big Data

How Delta Lake Powers Scalable BI & AI: Real-World Practices and Optimizations

Guandata’s R&D leader outlines how their analytics platform leverages Delta Lake and Spark to deliver fast, ACID‑compliant BI and AI workloads, detailing architecture, key features like schema evolution and time travel, and practical performance tricks such as compaction, vacuuming, and multi‑engine integration.

AI · BI · Big Data

0 likes · 14 min read

Zhengcaiyun Tech

Jul 6, 2023 · Big Data

Optimizing Large‑Scale Table Joins in Spark Using Bloom Filters

To address the resource‑intensive challenges of joining billion‑row tables in data warehouses, this article examines common optimization approaches, analyzes Spark’s SortMergeJoin algorithm, and proposes a Bloom‑filter‑based solution that filters unchanged data early, dramatically improving performance and reducing cluster resource consumption.

Java · SQL · Spark

0 likes · 17 min read

DataFunTalk

Jun 29, 2023 · Big Data

Practical Deployment of Delta Lake in BI and AI Products

This article summarizes a technical presentation on how Delta Lake is integrated into a BI+AI platform, covering the product background, data‑lake architecture, Delta Lake features such as ACID transactions, schema management, multi‑engine support, performance optimizations, and future development directions.

AI · BI · Big Data

0 likes · 12 min read

Big Data Technology & Architecture

Jun 27, 2023 · Big Data

Comprehensive Big Data Interview Experience and Questions Overview

The article presents a detailed three‑month interview journey that led to a position at a top new‑energy automotive firm, outlining the questions and topics covered in five interview rounds—including Hive, Spark, Flink, Kafka, data modeling, and data governance—to help candidates prepare for big‑data roles.

Big Data · Flink · Hive

0 likes · 7 min read

DataFunTalk

Jun 26, 2023 · Big Data

Iceberg Data Lake: Core Features, Xiaomi Use Cases, and Future Plans

This presentation details Iceberg's core capabilities—transactional writes, schema evolution, implicit partitioning, and row‑level updates—while showcasing Xiaomi's real‑world applications such as log ingestion redesign, near‑real‑time warehousing, offline optimizations, column‑level encryption, Hive migration strategies, and outlining upcoming enhancements like materialized views and cloud migration.

Big Data · Column Encryption · Data Lake

0 likes · 20 min read

Code Ape Tech Column

Jun 21, 2023 · Big Data

From Java Streams to Spark: Basic Big Data Operations Explained

This article demonstrates how developers familiar with Java Stream APIs can quickly grasp fundamental Spark operations—including map, flatMap, groupBy, and reduce—by translating stream examples into Spark code, providing complete code snippets, explanations of transformations versus actions, and practical tips for handling exceptions in distributed processing.

Big Data · Java Stream · Spark

0 likes · 24 min read

JD Tech

Jun 14, 2023 · Big Data

Understanding and Solving Data Skew in Offline Big Data Development (Hive & Spark)

This article explains the concept of data skew in offline big‑data jobs, describes its symptoms and root causes, and provides practical optimization techniques for Hive and Spark—including partitioning strategies, map‑join usage, adaptive query settings, and monitoring approaches—to prevent performance degradation and runtime failures.

Data Skew · Hive · Optimization

0 likes · 17 min read

Big Data Technology & Architecture

Jun 13, 2023 · Big Data

Iceberg Data Lake Implementation and Optimization at iQIYI

This article details iQIYI's adoption of Iceberg for its data lake, covering the OLAP architecture, reasons for a data lake, Iceberg's table format advantages over Hive, platform construction, streaming ingestion, query and performance optimizations, real‑world business deployments, and future plans.

Big Data · Data Lake · Flink

0 likes · 21 min read

DataFunSummit

Jun 11, 2023 · Artificial Intelligence

Applying Uplift Modeling, PSM Matching, and Spark CausalML for Growth at Tencent Video

This article explains how Tencent Video leverages causal inference techniques—including uplift gain models, propensity‑score‑matching (PSM), and a distributed Spark‑based CausalML library—to identify incremental user effects, evaluate marketing interventions, and improve growth across advertising, internal flow, push notifications, and coupon strategies.

Propensity Score Matching · Spark · growth analytics

0 likes · 12 min read

Big Data Technology & Architecture

Jun 11, 2023 · Big Data

Typical Interview Questions for Offline Data Warehouse Positions (Spark, Hadoop, etc.)

The article shares a fresh graduate's experience interviewing for offline data‑warehouse roles at companies like Ctrip, Meituan and Alibaba, outlines the common interview pattern, and lists detailed Spark, Hadoop, and data‑warehouse questions used by these firms.

Alibaba · Big Data · Ctrip

0 likes · 5 min read

DataFunTalk

Jun 9, 2023 · Big Data

Cloud Music Data Governance Practice

This article presents a comprehensive case study of NetEase Cloud Music's data governance practice, covering data background, governance philosophy, detailed solutions across metadata, storage, compute, and model design, practical implementations, measurable cost savings, and future planning for sustainable data management.

Hadoop · Metadata · Spark

0 likes · 15 min read

DevOps

Jun 7, 2023 · Big Data

Deploying Apache Spark on YARN vs Kubernetes: Architecture, Benefits, and Comparison

This article explains how Apache Spark can be deployed using the traditional Hadoop YARN resource manager and the newer Kubernetes approach, detailing configuration steps, submission methods, and a comprehensive comparison of isolation, scalability, learning curve, logging, performance, and cost considerations.

Big Data · Spark · YARN

0 likes · 10 min read

360 Tech Engineering

Jun 2, 2023 · Big Data

Overcoming Challenges in User Profiling: A Big Data‑Driven Framework for Precise Marketing

The article outlines how a unified, big‑data‑based user profiling platform addresses traditional data silos, high costs, and limited functionality by standardizing tags, integrating Spark and RHadoop processing, and enabling a closed‑loop marketing workflow that improves accuracy and operational efficiency.

Big Data · Data Integration · Marketing Automation

0 likes · 7 min read

DataFunSummit

May 28, 2023 · Big Data

Apache Hudi: Capabilities, Architecture, Use Cases, and Future Outlook

This article introduces Apache Hudi as a next‑generation streaming data‑lake platform, explains its core concepts, architecture, and table types, and showcases real‑world use cases at Tencent such as CDC ingestion, minute‑level real‑time warehousing, streaming analytics, multi‑stream joins, ad attribution, and stream‑to‑batch processing, while also outlining future directions.

Apache Hudi · CDC · Data Lake

0 likes · 16 min read

DataFunSummit

May 21, 2023 · Big Data

Blaze: Design and Practice of SparkSQL Native Operator Optimization at Kuaishou

This article presents Blaze, a Kuaishou‑built native execution middleware for SparkSQL that leverages Apache DataFusion to achieve vectorized operator execution, detailing its architecture, implementation, performance gains, current coverage, benchmark results, production rollout, and future development plans.

DataFusion · Performance Optimization · Spark

0 likes · 17 min read

Big Data Technology & Architecture

May 19, 2023 · Big Data

Comprehensive Big Data Interview Q&A and Personal Project Summary

This article shares a recent graduate's successful job offer story, emphasizes preparing a detailed personal project summary, and provides extensive big‑data interview questions covering Hadoop, Spark, Flink, Kafka, Hive, ClickHouse, and related technologies to help candidates excel in interviews.

Big Data · Flink · Hadoop

0 likes · 15 min read

Data Thinking Notes

May 10, 2023 · Big Data

Mastering Hive Small File Management: Strategies to Boost Performance

This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.

Hive · Optimization · Small Files

0 likes · 10 min read

Rare Earth Juejin Tech Community

May 8, 2023 · Artificial Intelligence

Review of Alibaba's Tongyi Qianwen AI Model with Sample Code, Recipe, and SWOT Analysis

This article reviews Alibaba's Tongyi Qianwen large language model, shares personal impressions, provides a fish‑flavored pork recipe, conducts a SWOT analysis, and includes Scala Spark and Java code examples illustrating its capabilities and usage scenarios.

Java · Large Language Model · SWOT analysis

0 likes · 12 min read

Big Data Technology & Architecture

May 5, 2023 · Big Data

Strategies for Handling Small Files in Hive and Spark

This article examines the causes and impacts of small file proliferation in Hive and Spark environments, and presents multiple mitigation techniques—including Spark 3 adaptive query execution, reducing reduce tasks, using DISTRIBUTE BY RAND(), post‑processing clean‑up, Hive and Spark configuration tweaks, and automated tooling—to improve performance and storage efficiency.

Big Data · Hive · Small Files

0 likes · 9 min read

DataFunTalk

May 3, 2023 · Big Data

Shuttle2.0: Enhancing Spark and Flink Shuffle with Distributed Sorting and Adaptive Broadcast

Shuttle2.0 extends OPPO's open‑source high‑availability Spark Remote Shuffle Service to support Flink, introduces a unified stream‑batch data model, pipelines shuffle with distributed sorting, and provides an Adaptive BroadcastJoin solution that dramatically improves performance and stability for large‑scale big‑data workloads.

Adaptive Broadcast · Big Data · Distributed Sorting

0 likes · 11 min read

Rare Earth Juejin Tech Community

Apr 28, 2023 · Artificial Intelligence

Exploring Alibaba’s Tongyi Qianwen AI Model, SWOT, Recipe Demo, and Code Samples for Spark Same‑Period Analysis and Java Bubble Sort

The article reviews Alibaba’s Tongyi Qianwen large‑language model, shares a cooking recipe generated by the AI, presents a SWOT analysis, and provides code examples—including a Spark Scala script for same‑period month‑over‑month calculations and a Java bubble‑sort implementation.

AI · Java · Large Language Model

0 likes · 12 min read

DataFunSummit

Apr 25, 2023 · Big Data

Building a Real-Time Data Lake with Hudi: Architecture, Challenges, and Practices

This article presents Huawei's end‑to‑end solution for constructing a real‑time data lake on Hudi, covering requirement analysis, technology selection, architectural design, ingestion and processing challenges, practical optimizations, and future improvement directions.

Data Lake · ETL/ELT · Flink

0 likes · 14 min read

Big Data Technology & Architecture

Apr 23, 2023 · Big Data

Spark and Flink Optimization Guide: Parallelism, GC Tuning, Memory Settings, and Production Configurations

This article provides a comprehensive guide on optimizing Spark and Flink workloads, covering parallelism settings, garbage‑collection tuning, out‑of‑memory mitigation, and full production‑grade configuration examples for both frameworks.

Big Data · Flink · GC optimization

0 likes · 7 min read

58 Tech

Apr 20, 2023 · Big Data

Design and Implementation of a Data Application Platform for Business Opportunity Selection, Tagging, and Scheduling

The article describes a data application platform that enables business users to configure custom data selection rules for opportunities, create scheduled tasks, perform large‑scale data comparison, handle task dispatch with Redis queues, and implement rate‑limiting using sliding windows to ensure reliable processing.

Redis · Spark · Task scheduling

0 likes · 9 min read

Zhengcaiyun Technology

Apr 18, 2023 · Big Data

Implementing Data Cost Governance: Quantifying Storage and Compute Expenses with Hive, Spark, and HDFS FsImage

This article explains how to perform task‑level data cost governance by collecting storage and compute metrics from Hive tables, Spark jobs, and HDFS FsImage files, then estimating monthly expenses using replication factors and resource‑usage rates, while providing practical SQL and shell examples.

Data Cost Governance · HDFS · Hive

0 likes · 18 min read

Architects Research Society

Apr 14, 2023 · Databases

Querying, Analyzing, and Presenting Time Series Data in MongoDB

This article explains how to query, analyze, and visualize time‑series data stored in MongoDB using the aggregation framework, MongoDB Compass, read‑only views, the BI connector with SQL tools, as well as integrations with Spark and R for advanced analytics.

Aggregation · BI Connector · MongoDB

0 likes · 14 min read

JD Retail Technology

Apr 14, 2023 · Big Data

Understanding Data Skew and Its Mitigation in Hive and Spark

This article explains the concept of data skew, its symptoms such as slow tasks and OOM errors, and provides comprehensive mitigation techniques and configuration examples for Hive and Spark, including custom partitioning, map joins, adaptive execution, and key detection methods.

Adaptive Execution · Big Data · Data Skew

0 likes · 15 min read

ITPUB

Apr 8, 2023 · Big Data

How Bilibili Cut Data Pipeline Costs by 20% with Flink Real‑Time Incremental Computing

Facing daily terabyte‑scale data ingestion and costly duplicate reads in its ODS‑to‑DWD pipeline, Bilibili introduced a Flink‑based real‑time incremental computation and multi‑level partition shuffling, dramatically reducing read amplification, cutting resource usage by ~20%, improving latency to minutes, and enhancing scalability.

Big Data · Flink · Real-time Processing

0 likes · 19 min read

DataFunTalk

Apr 4, 2023 · Big Data

Compass: An Open‑Source Big Data Task Diagnosis Platform for DolphinScheduler, Airflow and Spark

Compass is an open‑source big‑data diagnostic platform developed by OPPO that provides non‑intrusive, real‑time monitoring and root‑cause analysis for offline and streaming tasks on schedulers such as DolphinScheduler and Airflow, covering workflow‑level failures, Spark engine anomalies, resource usage, and offering one‑click reports and extensible rule‑based diagnostics.

Big Data · DolphinScheduler · Monitoring

0 likes · 13 min read

DataFunSummit

Mar 30, 2023 · Artificial Intelligence

MindAlpha: A High‑Performance Distributed Machine Learning Platform for Advertising

The article introduces MindAlpha, a high‑performance distributed machine‑learning platform built for large‑scale, sparse ad‑tech workloads, detailing its architecture, MLOps pipeline, Spark integration, sync/async training strategies, CPU/GPU choices, model‑splitting techniques, and future directions such as model pruning and AutoML.

AI · Ad Tech · MLOps

0 likes · 10 min read

DataFunSummit

Mar 29, 2023 · Big Data

Gluten Vectorized Engine: Boosting Spark Performance with Native Execution

The article introduces the Gluten vectorized engine, explains why Spark’s CPU bottleneck motivates integrating native vectorized back‑ends via Substrait, details its architecture, component design, current performance gains of up to three‑fold, and outlines ongoing development and future work.

Gluten · Native Engine · Spark

0 likes · 18 min read

Data Thinking Notes

Mar 22, 2023 · Big Data

How to Optimize Compute Resource Governance in Data Warehouses with Spark & Hive

This article walks through practical steps for governing compute resources in a data warehouse, covering problem identification, strategic thinking, Spark and Hive tuning, small‑file handling, DQC improvement, high‑consumption task optimization, scheduling adjustments, and measurable performance gains.

Compute Governance · Hive · Spark

0 likes · 13 min read

ITPUB

Mar 13, 2023 · Big Data

What’s New in Apache Kyuubi 1.6.0? Server, Client, and Engine Enhancements

Apache Kyuubi 1.6.0 introduces major server‑side upgrades such as batch JAR task submission with RESTful APIs and a metadata store for HA, client‑side improvements including a unified JDBC driver and enhanced Beeline, plus mature Spark, Flink, Trino, and Hive engine plugins, while outlining the community’s roadmap.

Big Data · Engine Plugins · Flink

0 likes · 13 min read

DataFunTalk

Mar 12, 2023 · Big Data

Apache Kyuubi 1.6.0 Feature Overview and Enhancements

The article provides a comprehensive walkthrough of Apache Kyuubi 1.6.0, detailing server‑side enhancements such as batch (JAR) task submission, metadata store and unified API/authentication, client‑side improvements to the built‑in JDBC driver and Beeline, as well as engine plugins for Spark, Flink, Trino and Hive, and concludes with the community’s roadmap and statistics.

Apache Kyuubi · Batch Processing · Big Data

0 likes · 12 min read

DataFunTalk

Mar 8, 2023 · Artificial Intelligence

Applying AI Algorithms to Big Data Governance: Use Cases and Future Directions

This article presents Datacake's experience of integrating AI algorithms into big data governance, covering the bidirectional relationship between AI and big data, health‑score assessment of data tasks, intelligent Spark parameter tuning, SQL engine selection, and future application scenarios across the data lifecycle.

AI · Big Data · Data Governance

0 likes · 18 min read

Alibaba Cloud Big Data AI Platform

Mar 3, 2023 · Big Data

How Alibaba Cloud EMR Evolved from Open‑Source Compatibility to Enterprise‑Grade Performance

This article outlines Alibaba Cloud EMR's three‑stage evolution—compatibility, contribution, and beyond open source—detailing its early Hadoop adoption, Flink and Spark innovations, cloud‑native optimizations, and enterprise‑grade features such as Remote Shuffle Service, performance benchmarks, and integrated diagnostics.

Alibaba Cloud · Big Data · Cloud Native

0 likes · 13 min read

Tencent Cloud Developer

Mar 1, 2023 · Big Data

We Analysis User Profiling System: Architecture and Technical Implementation

We Analysis, the official data‑analysis platform for WeChat mini‑program providers, delivers a zero‑learning‑curve user‑profiling system that combines basic tag analysis with flexible, rule‑based segmentation. An ETL pipeline stores pre‑computed data in TDSQL, while online queries run against bitmap‑optimized data in ClickHouse with RoaringBitmap, keeping analytics low‑latency, stable, and comprehensive.

ClickHouse · DataPipeline · Spark

0 likes · 20 min read

DataFunSummit

Feb 28, 2023 · Big Data

Iceberg Technology Overview and Its Application at Xiaomi: Practices, Stream‑Batch Integration, and Future Plans

This article introduces the Iceberg table format, explains its core architecture and advantages such as transactionality, implicit partitioning and row‑level updates, details Xiaomi's practical deployments—including CDC pipelines, partition strategies, compaction services, and stream‑batch integration—and outlines future development directions.

Compaction · Data Lake · Flink

0 likes · 20 min read

Programmer DD

Feb 27, 2023 · Big Data

Why Hadoop/Spark Feel Heavy and How SPL Offers a Lightweight Big Data Solution

With data volumes soaring, traditional Hadoop and Spark clusters become costly and cumbersome for small to medium workloads, prompting many to seek lighter alternatives; this article examines the technical, operational, and financial burdens of Hadoop/Spark and introduces the open‑source SPL engine as a fast, low‑cost, easy‑to‑use big‑data solution.

Big Data · Hadoop · Spark

0 likes · 16 min read

JD Cloud Developers

Feb 23, 2023 · Big Data

How to Build a Local Hadoop & Spark Cluster from Scratch (Step‑by‑Step Guide)

This comprehensive tutorial walks you through setting up a three‑node Hadoop 3.3.4 and Spark 3.3.1 environment on CentOS 7 virtual machines, covering system preparation, JDK and Scala installation, Zookeeper configuration, Hadoop and Spark deployment, and verification with practical command‑line examples.

Big Data · Cluster Setup · Hadoop

0 likes · 10 min read

Architects Research Society

Feb 21, 2023 · Big Data

Comparing Apache Spark and Apache Flink: Origins, Architecture, and Processing Models

This article examines the evolution, architectural differences, data and processing models, stateful handling, and programming APIs of Apache Spark and Apache Flink, highlighting their strengths, limitations, and the challenges of big‑data development and operations in the modern data‑driven era.

Batch Processing · Big Data · Data Engine

0 likes · 18 min read

Big Data Technology & Architecture

Feb 10, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook

This article presents a curated collection of big‑data learning resources, including interview guides, in‑depth articles on Flink, Spark, Hive, ClickHouse, data governance, and personal growth, offering readers a one‑stop reference to boost their big‑data expertise and interview readiness.

Big Data · Data Governance · Flink

0 likes · 5 min read

Big Data Technology & Architecture

Feb 9, 2023 · Big Data

The Most Comprehensive Big Data Interview Preparation Handbook and Resource Collection

This article presents a curated collection of the most comprehensive big‑data interview preparation resources, including expert guides, tutorials, and deep‑dive articles on Flink, Spark, Hive, ClickHouse, data governance, and related topics, accompanied by a call to engage with the content.

Big Data · ClickHouse · Data Governance

0 likes · 4 min read

StarRing Big Data Open Lab

Feb 8, 2023 · Big Data

Why MapReduce and Spark Still Matter: A Deep Dive into Distributed Computing

Distributed computing splits massive tasks across multiple servers, and this article explains the classic MapReduce batch engine and the modern Spark framework, covering their architectures, strengths, limitations, and evolution, while highlighting key features like fault tolerance, in‑memory processing, and real‑time streaming capabilities.

Big Data · Distributed Computing · MapReduce

0 likes · 12 min read

ITPUB

Feb 7, 2023 · Big Data

How Kuaigou Built a Scalable Real‑Time Data Warehouse with Spark, Flink, and Cloud

Facing massive, multi‑source traffic and the need for instant analytics, Kuaigou’s real‑time data warehouse evolved from Spark on‑premise to a cloud‑native stack using Alibaba Blink, Flink, and layered OLAP models, streamlining development, cutting costs, and enabling diverse real‑time applications.

Cloud Migration · Flink · OLAP

0 likes · 11 min read

dbaplus Community

Jan 31, 2023 · Big Data

Building ByteDance’s Real‑Time Data Warehouse with Hudi: Architecture & Solutions

This article explains how ByteDance designed and deployed a real‑time data warehouse on a data lake using Hudi, detailing three business scenarios, the challenges of latency, consistency and resource usage, and the engineering solutions—including upserts, compaction services, indexing, and future unified storage plans.

Data Lake · Flink · Hudi

0 likes · 14 min read

Alibaba Cloud Native

Jan 9, 2023 · Big Data

How Kubernetes Powers Cloud‑Native Big Data with EMR on ACK

This article explains the shift of big data and machine‑learning workloads toward storage‑compute separation and cloud‑native architectures, outlines the technical challenges of running Spark on Kubernetes, and details the EMR on ACK solution with its architecture, performance gains, and real‑world adoption.

ACK · EMR · Spark

0 likes · 6 min read

21CTO

Jan 7, 2023 · Big Data

How WeChat’s WeAnalysis Powers Scalable User Segmentation with Big Data Architecture

This article explains the design and implementation of WeChat's WeAnalysis user‑profiling system, covering its basic tag and user‑group modules, multi‑source data ingestion, ETL processing, storage choices such as TDSQL and ClickHouse, bitmap handling, query performance, and service APIs for flexible, high‑performance user segmentation.

ClickHouse · Spark · User Segmentation

0 likes · 20 min read

DataFunSummit

Dec 31, 2022 · Big Data

The Evolution of Data Platforms: From Early Computing to the Modern Big Data Stack

This article reviews the history of data platforms—from the first general‑purpose computers and early relational databases through traditional BI, agile BI, and big‑data technologies like Hadoop, Spark, and Flink, up to today’s cloud‑native modern data stack and its future outlook.

Big Data · Data Platform · Flink

0 likes · 26 min read

DataFunTalk

Dec 29, 2022 · Big Data

Design and Implementation of OPPO's Big Data Diagnostic Platform (Compass)

This article presents the background, requirements, architecture, key modules, and practical impact of OPPO's non‑intrusive big‑data diagnostic platform—named Compass—designed to quickly locate issues, provide optimization suggestions, and achieve cost‑saving and efficiency gains for large‑scale Spark and Hadoop workloads.

Big Data · Hadoop · Performance Optimization

0 likes · 17 min read

Tencent Advertising Technology

Dec 27, 2022 · Big Data

Design and Optimization of Tencent Advertising Log Data Lake Using Iceberg, Spark, and Flink

The article details how Tencent Advertising re‑architected its massive log pipeline by consolidating heterogeneous real‑time and offline logs into an Iceberg‑based data lake, introducing multi‑level partitioning, Spark and Flink ingestion, and numerous performance and cost optimizations for scalable big‑data analytics.

Big Data · Data Lake · Flink

0 likes · 20 min read

Big Data Technology & Architecture

Dec 23, 2022 · Big Data

Understanding Spark SQL CacheManager: Caching Mechanism, Triggering, Uncaching, and Canonicalization

This article explains Spark SQL's CacheManager, how it stores cached query results using InMemoryRelation, the ways to trigger and release caches, the internal data structures like IndexedSeq and CachedData, and the role of canonicalization in determining cache reuse.

Big Data · CacheManager · Caching

0 likes · 8 min read

Data Thinking Notes

Dec 21, 2022 · Big Data

Why Your Spark Batch Job Fails: Memory Limits, Data Skew, and Practical Fixes

This article examines a recurring Spark batch task failure caused by OutOfMemory errors and data skew, details the investigation steps—including increasing executor memory, raising parallelism, and analyzing shuffle metrics—and proposes solutions such as data validation, filtering oversized keys, and memory adjustments.

Batch Processing · Data Skew · OutOfMemory

0 likes · 4 min read

Big Data Technology & Architecture

Dec 19, 2022 · Big Data

Near Real-Time Data Lake Practices in TikTok E-commerce: Architecture, Techniques, and Case Studies

This article presents a comprehensive overview of TikTok e-commerce's near‑real‑time data lake implementation, detailing data lake characteristics, architecture choices, practical use cases across analysis and operations, and future challenges and plans.

Apache Hudi · Big Data · Data Lake

0 likes · 16 min read

Data Thinking Notes

Dec 14, 2022 · Big Data

Why Spark Jobs Keep Running After You Kill Them: Daemon Threads and Driver Behavior

This article investigates why Spark tasks that appear killed in the Web UI continue running on the driver, analyzes the role of daemon versus non‑daemon threads and SparkContext shutdown mechanisms, reproduces the issue with sample code, and provides practical solutions such as using daemon threads or checking SparkContext status.

DaemonThread · Spark · bigdata

0 likes · 8 min read

AntTech

Dec 11, 2022 · Information Security

Occlum v1.0: Open‑Source Trusted Execution Environment OS with Major Performance Gains and Spark Big Data Integration

Occlum v1.0, the open‑source trusted execution environment operating system released by Ant Group, delivers up to five‑fold performance improvements, supports over 150 Linux syscalls, introduces async I/O, dynamic memory management, and a Spark‑BigDL big‑data analysis solution, while outlining future GPU and TDX extensions.

Big Data · Confidential Computing · Occlum

0 likes · 11 min read

Data Thinking Notes

Dec 6, 2022 · Big Data

Why Did Multiple HDFS DataNodes Crash? Memory, GC, and Block Overload Explained

This article analyzes a midnight HDFS DataNode failure caused by excessive GC and OOM due to Spark batch jobs, examines how an unexpected surge in block count overloaded default memory settings, and presents concrete remediation steps and optimization recommendations to stabilize the cluster.

Block Overload · DataNode · Garbage Collection

0 likes · 6 min read

Zhengcaiyun Technology

Dec 6, 2022 · Fundamentals

How to Use Antlr4 for Custom SQL Parsing in Spark Projects

This guide explains common business scenarios that require custom SQL parsing, walks through setting up Antlr4 in IntelliJ IDEA, configuring Maven dependencies, generating parser code, and provides Java examples for extracting table names from Spark SQL statements, including handling of prediction modes and execution results.

Antlr4 · Java · Parser

0 likes · 11 min read

Open Source Linux

Dec 1, 2022 · Fundamentals

How NVIDIA Boosted Software Safety by Switching from C to SPARK

NVIDIA’s security team adopted the formally verified SPARK language, replacing C in safety‑critical components, and after a successful proof‑of‑concept demonstrated improved security, verification efficiency, and unchanged performance, leading to widespread internal adoption across many products.

AdaCore · C to SPARK migration · NVIDIA

0 likes · 4 min read
