Tagged articles
25 articles
Page 1 of 1
DeepHub IMBA
DeepHub IMBA
Apr 29, 2026 · Artificial Intelligence

From Stateless to Stateful: 5 Architecture Patterns for Long‑Running Agents

The article outlines five concrete design patterns—Checkpoint‑and‑Resume, Delegated Approval, Memory‑Layered Context, Ambient Processing, and Fleet Orchestration—that enable production‑grade, multi‑day AI agents to persist state, handle failures, and scale safely.

AI agentsHuman-in-the-LoopMemory Management
0 likes · 12 min read
From Stateless to Stateful: 5 Architecture Patterns for Long‑Running Agents
Java Companion
Java Companion
Nov 27, 2025 · Backend Development

Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Miss It

When a scheduled‑task server crashes, simply restarting it is insufficient; a robust solution must combine clustering, distributed locks, idempotent designs, checkpointing, and monitoring to ensure tasks resume correctly across non‑runtime and runtime failures, as detailed with SpringTask‑Redis and XXL‑JOB implementations.

BackendIdempotencyScheduled Tasks
0 likes · 28 min read
Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Miss It
DataFunTalk
DataFunTalk
Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI trainingAlluxioGPU utilization
0 likes · 11 min read
How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization
DataFunSummit
DataFunSummit
Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructures for thousand‑card and ten‑thousand‑card models, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.

Distributed SystemsLarge-Scale Trainingcheckpointing
0 likes · 22 min read
Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training
Baidu Intelligent Cloud Tech Hub
Baidu Intelligent Cloud Tech Hub
Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI trainingLarge-Scale Clusterscheckpointing
0 likes · 25 min read
How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training
AntData
AntData
Mar 4, 2025 · Big Data

Design and Analysis of 3FS: An AI‑Optimized Distributed File System

The article provides a comprehensive English overview of 3FS, an AI‑focused distributed file system that leverages FoundationDB for metadata, CRAQ for chunk replication, and a hybrid Fuse/native client architecture, detailing its design, components, fault handling, and performance considerations for large‑scale training workloads.

AI storageCRAQ replicationDistributed File System
0 likes · 25 min read
Design and Analysis of 3FS: An AI‑Optimized Distributed File System
Kuaishou Large Model
Kuaishou Large Model
Jul 11, 2024 · Artificial Intelligence

Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training

Researchers from Kwai’s large-model team present a novel training system that combines pipeline-parallel-aware activation offloading with a compute-memory balanced checkpointing strategy, enabling lossless acceleration of large language models, achieving up to 42.7% MFU on 256 NVIDIA H800 GPUs while reducing memory usage.

GPU trainingKwaiPerformance Modeling
0 likes · 13 min read
Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training
Rare Earth Juejin Tech Community
Rare Earth Juejin Tech Community
May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed TrainingGPU MemoryPipeline Parallel
0 likes · 9 min read
GPU Memory Analysis and Distributed Training Strategies
dbaplus Community
dbaplus Community
Apr 20, 2021 · Big Data

10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)

This article shares ten practical pitfalls encountered when moving hourly Spark session jobs to Flink, covering parallelism load imbalance, state TTL, checkpointing strategies, logging, JMX debugging, state migration risks, reduce vs process choices, input data validation, event‑time handling, and external storage considerations, along with concrete configuration snippets and performance tips.

FlinkSpark migrationState Management
0 likes · 20 min read
10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)
DataFunTalk
DataFunTalk
Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big DataFlinkState Management
0 likes · 19 min read
Ten Gotchas When Migrating Spark Jobs to Flink
Big Data Technology Architecture
Big Data Technology Architecture
Jul 8, 2020 · Big Data

Apache Flink 1.11.0 Release: New Features and Optimizations

Apache Flink 1.11.0 introduces a suite of major enhancements—including unaligned checkpoints, a unified source interface, CDC support in Table API/SQL, performance‑boosted PyFlink, a new application deployment mode, and numerous UI, Docker, and catalog improvements—aimed at increasing usability, scalability, and integration across streaming and batch workloads.

FlinkSQLSource Interface
0 likes · 18 min read
Apache Flink 1.11.0 Release: New Features and Optimizations
Architect
Architect
Jun 11, 2020 · Big Data

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

This article explains Apache Flink's distributed system architecture—including JobManager, ResourceManager, TaskManager, and Dispatcher—covers session and job deployment modes, data transfer mechanisms, event‑time handling with watermarks, various state types and backends, scaling strategies, and the checkpoint/savepoint recovery process.

Apache FlinkBig DataEvent Time
0 likes · 15 min read
Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing
Big Data Technology & Architecture
Big Data Technology & Architecture
Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big DataDstreamScala
0 likes · 17 min read
Comprehensive Overview and Best Practices for Apache Spark Streaming
Big Data Technology & Architecture
Big Data Technology & Architecture
Feb 22, 2020 · Big Data

Understanding Flink's Asynchronous Barrier Snapshot (ABS) Algorithm for Checkpointing

This article explains how Apache Flink implements fault‑tolerant checkpointing using the Asynchronous Barrier Snapshot (ABS) algorithm, a localized version of the Chandy‑Lamport distributed snapshot, covering barriers, snapshot alignment, exactly‑once versus at‑least‑once semantics, and handling of cyclic dataflow graphs.

Asynchronous Barrier SnapshotDistributed SystemsFlink
0 likes · 9 min read
Understanding Flink's Asynchronous Barrier Snapshot (ABS) Algorithm for Checkpointing
dbaplus Community
dbaplus Community
Sep 10, 2019 · Big Data

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

This article explores the two toughest problems in distributed stream processing—exactly‑once message handling and ordering—by dissecting the underlying impossibility of perfect failure detectors, the liveness‑vs‑safety trade‑off, zombie processes, and the practical solutions employed by systems such as Flink, Kafka Streams, MillWheel, and Spark.

ConsensusDistributed SystemsExactly-Once
0 likes · 81 min read
Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)
Big Data Technology & Architecture
Big Data Technology & Architecture
May 12, 2019 · Big Data

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

This article explains Spark Streaming’s architecture, core concepts such as DStream, windowing, and the two Kafka integration methods—Receiver-based and Direct approaches—detailing their configurations, memory implications, checkpointing, and best‑practice recommendations for reliable, high‑throughput real‑time data processing.

Big DataDirect ApproachReceiver Approach
0 likes · 18 min read
Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches
Big Data Technology & Architecture
Big Data Technology & Architecture
Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache FlinkBig DataExactly-Once
0 likes · 15 min read
Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink