Tagged articles
607 articles
TAL Education Technology
Jul 22, 2021 · Big Data

Real-Time Monitoring Dashboard Solution in Future Cloud – Architecture, Technical Challenges, and Product Insights

This article presents the Future Cloud Business Monitoring real-time dashboard solution, detailing its technical architecture, key challenges in massive log processing, storage choices, product considerations, experience sharing, future plans, and concrete case studies such as live classroom monitoring.

ClickHouse, Dashboard, Spark
0 likes · 15 min read
Big Data Technology Architecture
Jul 15, 2021 · Big Data

Resolving Spark Task Not Serializable Errors: Causes, Code Examples, and Best Practices

This article analyzes why Spark tasks fail with a "Task not serializable" exception when closures reference class members, demonstrates the issue with Scala code examples, and provides practical solutions such as using @transient annotations, moving functions to objects, and ensuring proper class serialization.

Scala, Spark, Task Not Serializable
0 likes · 12 min read
Big Data Technology & Architecture
Jul 10, 2021 · Big Data

Comprehensive Big Data Learning Path and Interview Knowledge Map

This extensive guide outlines a modern big‑data learning roadmap, covering essential programming languages, Linux, databases, distributed system theory, networking, offline and real‑time computation, message queues, data warehouses, algorithms, backend skills, interview preparation, and practical advice for building a personal knowledge system.

Flink, Hadoop, Learning Path
0 likes · 24 min read
TAL Education Technology
Jul 1, 2021 · Big Data

Optimization of A/B Test Metric Computation Using Spark and ClickHouse

This article details the design and multi‑stage optimization of an A/B testing metric system, describing its product architecture, Spark‑based computation engine, ClickHouse OLAP layer, cumulative calculation improvements, and batch processing techniques that reduced processing time from hours to a few minutes for hundreds of experiments and metrics.

A/B testing, Big Data, ClickHouse
0 likes · 8 min read
Big Data Technology & Architecture
Jun 21, 2021 · Big Data

Comprehensive Guide to Apache Kylin: Background, Architecture, Installation, Optimization, and Real‑World Use Cases

This article provides an in‑depth overview of Apache Kylin, covering its history, mission, core MOLAP principles, technical architecture, step‑by‑step installation (Docker and Hadoop), performance tuning, advanced cube settings, and detailed case studies from major companies such as Baidu, Lianjia, and Didi.

Apache Kylin, Cube, Docker
0 likes · 53 min read
Big Data Technology & Architecture
Jun 16, 2021 · Big Data

Practical Experience and Optimizations of Apache Iceberg in Tencent’s Big Data Ecosystem

This article reviews the advantages of Apache Iceberg for data lake storage, details Tencent’s custom optimizations and integration with Flink and Spark, and shares multiple real‑world implementations that demonstrate how Iceberg improves data consistency, reduces small‑file overhead, and enables near‑real‑time analytics in large‑scale big‑data environments.

Apache Iceberg, Data Lake, Flink
0 likes · 18 min read
Big Data Technology Architecture
Jun 10, 2021 · Big Data

Understanding Apache Iceberg: Design, Architecture, and Its Application at NetEase Cloud Music

This article explains Apache Iceberg’s table‑format design, compares it with Hive’s limitations, details its snapshot‑based architecture and metadata handling, and describes how NetEase Cloud Music leveraged Iceberg to dramatically improve large‑scale log processing performance and stability.

Apache Iceberg, Spark, Table Format
0 likes · 12 min read
Big Data Technology & Architecture
Jun 4, 2021 · Big Data

Comprehensive Spark Interview Questions and Answers

This article provides a detailed collection of Spark interview questions covering deployment modes, performance advantages over MapReduce, shuffle mechanisms, RDD characteristics, optimization techniques, resource management, and various practical aspects of Spark on YARN, Mesos, and Kubernetes.

RDD, Shuffle, Spark
0 likes · 21 min read
dbaplus Community
Jun 1, 2021 · Big Data

How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark

Didi migrated over 10,000 Hive SQL tasks to Spark SQL, achieving 85% Spark task share, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively, through a systematic migration process that addressed syntax, UDF, performance, and functional differences between the two engines.

Big Data, Hive, Performance Optimization
0 likes · 20 min read
Qunar Tech Salon
Jun 1, 2021 · Big Data

Integrating TensorFlow for Java with Spark‑Scala for Distributed Machine Learning Prediction

This article shares practical experience of building a high‑performance distributed prediction service by combining TensorFlow for Java with Spark‑Scala, covering framework selection, performance comparison, model training, loading, inference, deployment, and optimization techniques for large‑scale data processing.

Big Data, Java, Performance Optimization
0 likes · 16 min read
NetEase Game Operations Platform
May 22, 2021 · Big Data

Comprehensive Overview and Source Code Analysis of NetEase Spark Kyuubi

This article systematically introduces NetEase Kyuubi, an open‑source high‑performance JDBC and SQL execution engine built on Apache Spark, covering its background, core architecture, service discovery, session and operation management, startup processes, and key source‑code implementations with detailed code examples.

Apache Thrift, Big Data, Kyuubi
0 likes · 47 min read
JD Retail Technology
May 13, 2021 · Big Data

Evolution and Architecture of JD.com Self‑Operated Rebate Platform

The article details the development, challenges, and redesign of JD.com’s self‑operated rebate system, describing its early monolithic architecture, data‑intensive processing pipeline, migration to a modular, high‑availability platform built on Spark, Hive, and Elasticsearch, and the resulting performance and operational improvements.

Big Data, ETL, Spark
0 likes · 16 min read
DataFunTalk
Apr 27, 2021 · Big Data

Implementing CDC‑to‑Hudi for Real‑Time Mutable Data in a Big Data System

This article describes how Linkflow migrated mutable customer data from MySQL to an Apache Hudi data lake using Debezium‑in‑Flink CDC, addressing challenges such as snapshot resumability, partial updates, row‑key merging, schema evolution, indexing, and concurrent writes to achieve minute‑level data freshness and improved offline processing performance.

Apache Hudi, Big Data, CDC
0 likes · 21 min read
dbaplus Community
Apr 14, 2021 · Big Data

Master Spark Performance: Key Tuning, Shuffle & Join Optimization

This guide compiles practical Spark tuning techniques, covering essential configuration parameters, programming best‑practices, detailed shuffle mechanics, and join optimization strategies, while also addressing common errors and mitigation steps, enabling developers to improve performance and resource utilization in large‑scale data processing jobs.

Big Data, Error Handling, JOIN optimization
0 likes · 25 min read
Big Data Technology & Architecture
Apr 10, 2021 · Big Data

Understanding Spark Cache and Checkpoint Mechanisms

This article explains Spark's cache and checkpoint mechanisms, detailing when to use each, how they are implemented internally, how cached and checkpointed RDDs are stored and retrieved, and the differences between caching, persisting, and checkpointing for reliable big‑data processing.

Cache, Checkpoint, RDD
0 likes · 13 min read
iQIYI Technical Product Team
Apr 9, 2021 · Big Data

Real-Time Data Warehouse at iQIYI Video Production Using Spark and ClickHouse

To meet iQIYI video production's requirements (thousands of QPS, petabyte-scale and frequently updated data, and large-table joins), the team built a real-time warehouse on Spark and ClickHouse that streams changes from Kafka, joins them with HBase dimension tables, and writes the results to ClickHouse. This reduced report development time from days to hours while supporting both offline and real-time analytics.

ClickHouse, HBase, Kafka
0 likes · 12 min read
Architect
Apr 3, 2021 · Big Data

Advanced Spark Performance Optimization: Data Skew and Shuffle Tuning

This article explains advanced Spark performance tuning techniques, focusing on diagnosing and resolving data skew and shuffle bottlenecks through stage analysis, key distribution inspection, and a variety of practical solutions such as Hive pre‑processing, key filtering, parallelism increase, two‑stage aggregation, map‑join, and combined strategies, while also covering ShuffleManager internals and related configuration parameters.

Big Data, Data Skew, Scala
0 likes · 47 min read
Architect
Apr 2, 2021 · Big Data

Spark Performance Optimization Guide: Development and Resource Tuning

This article provides a comprehensive guide to Spark performance optimization, covering development‑level tuning principles, resource configuration parameters, practical code examples, and best‑practice recommendations to achieve high‑throughput big‑data processing.

Big Data, RDD, Resource Tuning
0 likes · 33 min read
Big Data Technology Architecture
Apr 1, 2021 · Big Data

Spark Adaptive Execution: Dynamic Shuffle Partition, Broadcast Join, and Skew Handling

The article explains the limitations of static shuffle partitions, execution‑plan estimation, and data skew in Spark SQL, and describes how Spark Adaptive Execution can automatically adjust shuffle partition numbers, switch join strategies, and mitigate skew through configurable parameters and code examples.

Adaptive Execution, Broadcast Join, Data Skew
0 likes · 11 min read
Big Data Technology & Architecture
Mar 30, 2021 · Big Data

Implementing Real-Time Data Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how Soul's data engineering team replaced nightly batch ETL with real-time Delta Lake ingestion on EMR, detailing the motivations, a comparative analysis of Delta, Hudi, and Iceberg, the implementation architecture, issues encountered such as data skew and schema evolution, and the solutions adopted to improve performance and reliability.

Data Lake, Data Skew, Delta Lake
0 likes · 13 min read
Big Data Technology Architecture
Mar 10, 2021 · Big Data

Comprehensive Spark Performance Optimization: Development Tuning, Resource Configuration, Data Skew Solutions, and Shuffle Tuning

This guide presents a complete Spark performance optimization handbook covering development‑time best practices, resource‑parameter tuning, detailed data‑skew detection and mitigation techniques, advanced shuffle‑engine configurations, and practical code examples to help engineers build faster, more reliable Spark jobs.

Data Skew, Resource Tuning, Shuffle
0 likes · 69 min read
Big Data Technology Architecture
Mar 2, 2021 · Big Data

Implementing Real-Time Log Ingestion with Delta Lake on EMR: Architecture, Challenges, and Solutions

This article describes how a data engineering team replaced nightly batch ETL with a Delta Lake‑based real‑time log ingestion pipeline on EMR, detailing the motivations, architecture, implementation steps, encountered issues such as data skew and schema evolution, and the practical solutions they applied to achieve low‑latency, reliable data delivery.

Delta Lake, Hive, Spark
0 likes · 14 min read
DataFunTalk
Feb 28, 2021 · Big Data

Migrating Youzan Offline Spark Platform to Kubernetes: Architecture, Optimizations, and Lessons Learned

This article details how Youzan's offline Spark computing platform was transformed for the cloud‑native era by migrating from YARN to Kubernetes, introducing containerization, storage‑compute separation, dynamic allocation, deployment optimizations, and a collection of practical lessons to reduce cost and improve resource utilization.

Big Data, Kubernetes, Performance Optimization
0 likes · 27 min read
Youzan Coder
Feb 26, 2021 · Big Data

Migrating Spark Offline Computing to Kubernetes: Architecture, Optimizations, and Lessons Learned

Youzan migrated its large‑scale offline Spark workloads from YARN to a cloud‑native Kubernetes architecture, separating storage and compute with Ceph FS, adding dynamic executor allocation and remote shuffle services, and applying numerous Spark and deployment tweaks that yielded elastic scaling, higher resource utilization, reduced costs, and valuable operational lessons.

Cloud Native, DevOps, Kubernetes
0 likes · 24 min read
JD Tech
Feb 8, 2021 · Big Data

JD Remote Shuffle Service: Design, Implementation, and Performance Evaluation

This article presents JD's self‑developed Remote Shuffle Service for Spark, detailing its architecture, goals, implementation details, performance benchmarks, and real‑world production case studies that demonstrate its impact on shuffle efficiency and system stability in large‑scale data processing.

Distributed Systems, Remote Shuffle Service, Shuffle Optimization
0 likes · 17 min read
JD Retail Technology
Jan 19, 2021 · Big Data

Design, Implementation, and Performance Evaluation of JD's Remote Shuffle Service for Spark

This article describes JD's research and production deployment of a self‑developed Remote Shuffle Service for Spark, covering its motivations, architectural design, cloud‑native features, monitoring, performance benchmarks against external shuffle solutions, and a real‑world promotion‑period case study that demonstrates improved stability and resource efficiency.

Cloud Native, Remote Shuffle Service, Shuffle Optimization
0 likes · 17 min read
Big Data Technology & Architecture
Jan 15, 2021 · Big Data

Evolution and Architecture of Major Chinese Big Data Platforms: Taobao, Didi, Meituan, 360, Kuaishou, and JD

This article reviews the evolution, architecture, and key components of major Chinese big‑data platforms—including those of Taobao, Didi, Meituan, 360, Kuaishou, and JD—highlighting data ingestion, storage, processing engines, scheduling systems, and service‑oriented designs that underpin their large‑scale data operations.

Big Data, Data Platform, Hadoop
0 likes · 14 min read
DataFunTalk
Jan 15, 2021 · Big Data

Optimizing Apache Kylin for Meituan's Sales OLAP: From MapReduce to Spark and Resource Tuning

This article presents a detailed case study of how Meituan's in‑store dining sales team identified severe efficiency issues in their Apache Kylin‑based OLAP system, dissected the construction process, and applied a step‑by‑step optimization roadmap—including engine migration, dimension pruning, resource configuration, and Spark‑based layered building—to boost query performance and achieve near‑perfect SLA.

Apache Kylin, Big Data, Meituan
0 likes · 16 min read
Big Data Technology & Architecture
Jan 5, 2021 · Big Data

Improving Spark Job Parallelism on YARN: Diagnosis, Configuration, and Performance Gains

This article details a real‑world investigation of Spark SQL job latency on a YARN cluster, explains how switching the scheduler to FAIR mode, creating resource pools, and consolidating small Parquet files dramatically reduced scheduler delay and cut execution time from over 100 seconds to under 20 seconds.

Parquet, Performance Optimization, Scheduler
0 likes · 13 min read
Big Data Technology & Architecture
Dec 27, 2020 · Big Data

Understanding and Solving the Small File Problem in Big Data Systems

This article examines the pervasive small‑file issue in big‑data environments, explains its impact on storage and processing performance, and presents a comprehensive set of solutions—including file merging, Hadoop archives, SequenceFiles, HBase, CombineFileInputFormat, and Spark/Flink strategies—to mitigate metadata overhead and improve I/O efficiency.

Flink, Hadoop, NameNode
0 likes · 41 min read
JD Retail Technology
Dec 24, 2020 · Databases

Applying ClickHouse for Offline and Real‑Time Data Analysis in JD's Golden Eye Business

This article details JD's Golden Eye business's adoption of ClickHouse for offline and real‑time traffic data analysis, covering system architecture, data ingestion pipelines, high‑availability design, monitoring, performance optimizations, and practical trade‑offs, offering insights for large‑scale analytical database deployments.

ClickHouse, Data Warehouse, OLAP
0 likes · 17 min read
Big Data Technology & Architecture
Dec 20, 2020 · Big Data

Getting Started with Apache Zeppelin: Installation, Core Features, and Integration with JDBC, Spark, and Flink

This tutorial introduces Apache Zeppelin, explains REPL and Jupyter concepts, outlines its core features and project structure, and provides step‑by‑step instructions for installing Zeppelin, creating notebooks, and connecting to databases, Spark, and Flink with practical code examples.

Apache Zeppelin, Flink, Installation
0 likes · 11 min read
Architect
Dec 13, 2020 · Big Data

Understanding and Solving Data Skew in Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, illustrates typical scenarios that cause it, and provides practical strategies and platform‑specific optimizations to detect, mitigate, and prevent skew in big‑data processing pipelines.

Hadoop, Spark, distributed computing
0 likes · 13 min read
Big Data Technology Architecture
Nov 24, 2020 · Big Data

Using DeltaLake for Industrial Data Platforms: Distributed Stream Processing, Batch‑Stream Fusion, and Transactional Support

This article shares practical experiences of building an industrial data middle‑platform with DeltaLake, covering heterogeneous distributed stream handling, batch‑stream unified analytics, and transactional/algorithm support to improve data timeliness, reliability, and operational efficiency in manufacturing environments.

Batch-Stream Fusion, Big Data, DeltaLake
0 likes · 11 min read
Meituan Technology Team
Nov 19, 2020 · Big Data

Optimizing Apache Kylin for High‑Performance OLAP in Meituan's Sales System

Meituan’s sales system “Qingtian” boosted OLAP performance by migrating Apache Kylin’s build engine from MapReduce to Spark, consolidating Hive files, refining dictionary creation, applying a By‑layer algorithm, and bulk‑loading cuboid files to HBase, cutting resource consumption and halving build time, ultimately reaching a 100 % SLA.

Apache Kylin, Big Data, Meituan
0 likes · 15 min read
Big Data Technology & Architecture
Nov 16, 2020 · Big Data

Understanding Data Skew in Big Data: Causes, Symptoms, and Solutions for Hadoop and Spark

This article explains what data skew is, why it occurs in large‑scale Hadoop and Spark jobs, how to recognize its symptoms such as stuck reducers or OOM executors, and presents practical strategies—including business‑level adjustments, code refactoring, and platform‑specific tuning—to mitigate the problem.

Big Data, Hadoop, Spark
0 likes · 13 min read
Big Data Technology & Architecture
Nov 16, 2020 · Big Data

Understanding Spark Streaming Backpressure Mechanism and Source Code Analysis

This article explains why Spark Streaming introduced backpressure, how the dynamic rate‑control mechanism works, and provides a detailed walkthrough of the relevant source code, including the RateController class, its registration, and the execution flow that adjusts ingestion rates to match processing capacity.

RateController, RateLimiter, Spark
0 likes · 14 min read
Big Data Technology & Architecture
Oct 29, 2020 · Fundamentals

Zero-Copy Data Transfer Mechanism: Principles, Implementations, and Applications in Java, Kafka, and Spark

This article explains the zero‑copy data transfer technique, compares it with traditional read/write approaches, shows Java NIO code examples, and discusses its use in high‑performance systems such as Kafka and Spark, highlighting the reductions in context switches and memory copies.

Data Transfer, Java NIO, Kafka
0 likes · 16 min read
Big Data Technology & Architecture
Oct 23, 2020 · Big Data

Overview of Real-Time Big Data Processing: Spark Structured Streaming, CarbonData, Flink, and Cloud Stream

This article provides a comprehensive overview of modern real‑time big‑data solutions, detailing Spark Structured Streaming capabilities, CarbonData’s storage architecture, Meituan’s Flink deployments, and Huawei Cloud Stream’s unified streaming service, highlighting their features, challenges, and future directions.

CarbonData, Flink, Real-time analytics
0 likes · 17 min read
Tencent Cloud Developer
Oct 19, 2020 · Big Data

Improving Spark Write Performance for Massive Files on Object Storage with Tencent Cloud EMR

By parallelizing Spark’s driver‑side commit, trash, and move phases—previously single‑threaded operations that caused costly copy‑on‑rename when writing massive files to object storage—the Tencent Cloud EMR case achieved over a tenfold (1,100 %) speedup, making object storage a viable alternative to HDFS.

Big Data, EMR, Performance Optimization
0 likes · 8 min read
Alibaba Cloud Developer
Sep 27, 2020 · Big Data

Why Spark on Kubernetes Needs a Remote Shuffle Service—and How It Boosts Performance

This article examines the challenges of running Spark on Kubernetes, introduces the Remote Shuffle Service architecture to overcome shuffle bottlenecks, details EMR on ACK integration, showcases performance gains with Terasort benchmarks, and outlines future cloud‑native big‑data strategies such as mixed‑cluster and serverless deployments.

EMR, Remote Shuffle Service, Spark
0 likes · 13 min read
DataFunTalk
Sep 25, 2020 · Big Data

Meituan Waimai Data Warehouse: Architecture Evolution, Governance, and Future Roadmap

The article details Meituan Waimai's offline data warehouse evolution from its initial V1.0 design through V2.0 improvements to the V3.0 modeling‑tool driven architecture, covering the four‑layer framework, Spark‑based ETL, data governance processes, resource optimization, security measures, and future development plans.

Big Data, Data Governance, ETL
0 likes · 22 min read
Big Data Technology & Architecture
Sep 2, 2020 · Big Data

An Overview of Apache Hudi: Architecture, Features, and Query Types

Apache Hudi is an open‑source data‑lake framework that leverages Spark to ingest, manage, and incrementally query large analytical datasets on HDFS‑compatible storage, offering features such as timeline management, copy‑on‑write and merge‑on‑read tables, and support for snapshot, incremental, and read‑optimized queries across engines like Hive, Spark SQL and Presto.

Apache Hudi, Big Data, Data Lake
0 likes · 12 min read
Big Data Technology & Architecture
Aug 23, 2020 · Big Data

Apache Hudi Overview, Core Concepts, and Quick‑Start Guide

This article introduces Apache Hudi, explaining its storage types, query views, timeline feature, typical use cases such as near‑real‑time ingestion and incremental pipelines, and provides a step‑by‑step Scala/Spark quick‑start guide with code examples for compiling, inserting, updating, querying, and syncing data to Hive.

Apache Hudi, Big Data, Data Lake
0 likes · 18 min read
Beike Product & Technology
Aug 17, 2020 · Big Data

Bitmap-Based User Segmentation in a DMP Platform Using ClickHouse

This article describes how a data management platform (DMP) at Beike leverages ClickHouse bitmap structures and Spark pipelines to generate global numeric user IDs, design tag-specific bitmap rules for enum, continuous, and date attributes, handle boundary cases, and produce high‑performance bitmap SQL for real‑time user group estimation and complex segment logic.

Big Data, Bitmap, ClickHouse
0 likes · 17 min read
Big Data Technology & Architecture
Aug 4, 2020 · Big Data

Manual Kafka Offset Management in Spark Streaming using createDirectStream (Java & Scala)

This article explains how to use Spark Streaming's Direct Approach with Kafka, manually manage offsets, and provides complete Java and Scala implementations—including a JavaKafkaManager class, a demo application, and a Scala KafkaManager—illustrating the creation of DirectKafkaInputDStream, offset handling, and integration with Spark.

Java, Kafka, Offset Management
0 likes · 14 min read
21CTO
Aug 1, 2020 · Big Data

Mastering User Profiling: A Comprehensive Big Data Blueprint

This article explains how enterprises can leverage massive raw and business data to build detailed user profiles, covering tag types, data architecture, development modules, project phases, key deliverables, and a real-world e‑commerce case study.

Big Data, Data Warehouse, ETL
0 likes · 22 min read
Didi Tech
Jul 24, 2020 · Artificial Intelligence

DLFlow: An End-to-End Deep Learning Solution for Big Data Offline Tasks

DLFlow, an end‑to‑end framework from Didi’s user‑profile team, merges Spark and TensorFlow to automate feature preprocessing, large‑scale distributed training, and massive prediction for big‑data offline tasks, offering configuration‑driven pipelines, task scheduling, and easy deployment that dramatically speeds model development.

Deep Learning, Model Development, Spark
0 likes · 9 min read
Tencent Cloud Developer
Jul 13, 2020 · Big Data

Building MVP: A Lightweight Big Data Analysis System for Product Growth

The article describes how a lightweight big‑data analysis platform called MVP was built from scratch—using a User‑Event‑Config model, HDFS + ClickHouse + Spark, and four modules for metric monitoring, root‑cause alerts, deep growth analysis, and A/B testing—enabling real‑time insights in seconds instead of days and dramatically accelerating product‑growth operations.

AARRR Model, ClickHouse, HDFS
0 likes · 9 min read
Programmer DD
Jul 7, 2020 · Big Data

How to Choose a Worthwhile Technology: Depth, Ecosystem, and Evolution

The article outlines a three‑dimensional framework—technical depth, ecosystem breadth, and evolution capability—to help engineers decide which big‑data or stream‑processing technology (such as Hadoop, Spark, or Flink) is worth investing time in, and provides practical tips like using Google Trends and GitHub awesome lists.

Big Data, Flink, Hadoop
0 likes · 12 min read
Big Data Technology & Architecture
Jul 5, 2020 · Big Data

Understanding Spark Memory Management: On‑heap, Off‑heap, and Unified Memory

This article provides a comprehensive overview of Spark's memory management, covering executor memory architecture, the differences between on‑heap and off‑heap memory, static versus unified memory managers, storage and execution memory handling, and practical guidelines for optimizing Spark applications.

Big Data, Executor, Memory Management
0 likes · 21 min read
DataFunTalk
Jul 1, 2020 · Artificial Intelligence

Architecture and Implementation of Autohome's Machine Learning Platform

The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.

AutoML, Deep Learning, Distributed Training
0 likes · 19 min read
DataFunTalk
Jun 14, 2020 · Big Data

Designing an Offline Big Data Processing Architecture Based on Object Storage

This article presents a comprehensive offline big‑data processing framework that leverages scalable object storage for PB‑level data, details storage and compute engine requirements, compares cost options, describes data pipeline design, and showcases an e‑commerce case study with Spark‑driven analytics.

Big Data, Cost Optimization, Spark
0 likes · 19 min read
iQIYI Technical Product Team
Jun 12, 2020 · Artificial Intelligence

Deepthought: An End‑to‑End Machine Learning Platform at iQIYI

Deepthought is iQIYI’s end‑to‑end machine‑learning platform that unifies distributed frameworks, decouples pipeline stages, integrates with Tongtian Tower, and offers visual drag‑and‑drop configuration, evolving from a fraud‑detection prototype to a generic system with real‑time inference, automated hyper‑parameter optimization, and support for large‑scale data across anti‑fraud, recommendation, and analytics workloads.

AI Platform, AutoML, Parameter Server
0 likes · 13 min read
Big Data Technology & Architecture
Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big Data, Dstream, Scala
0 likes · 17 min read
Big Data Technology Architecture
May 31, 2020 · Big Data

Applying Apache Hudi in Medical Big Data: Architecture, Synchronization, Storage Choices, and Future Directions

This article examines the use of Apache Hudi for building a hospital‑wide medical big‑data platform, covering construction background, reasons for selecting Hudi, data synchronization methods, storage mode choices, query optimizations, and future development considerations.

Apache Hudi, Copy-on-Write, Medical Big Data
0 likes · 7 min read