Tagged articles

25 articles

Page 1 of 1

Apr 29, 2026 · Artificial Intelligence

From Stateless to Stateful: 5 Architecture Patterns for Long‑Running Agents

The article outlines five concrete design patterns—Checkpoint‑and‑Resume, Delegated Approval, Memory‑Layered Context, Ambient Processing, and Fleet Orchestration—that enable production‑grade, multi‑day AI agents to persist state, handle failures, and scale safely.

AI agentsHuman-in-the-LoopMemory Management

0 likes · 12 min read

From Stateless to Stateful: 5 Architecture Patterns for Long‑Running Agents

Java Companion

Nov 27, 2025 · Backend Development

Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Miss It

When a scheduled‑task server crashes, simply restarting it is insufficient; a robust solution must combine clustering, distributed locks, idempotent designs, checkpointing, and monitoring to ensure tasks resume correctly across non‑runtime and runtime failures, as detailed with SpringTask‑Redis and XXL‑JOB implementations.

BackendIdempotencyScheduled Tasks

0 likes · 28 min read

Interview Question: How to Handle a Crashed Scheduled‑Task Server? Most Miss It

DataFunTalk

Sep 3, 2025 · Artificial Intelligence

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

Alluxio’s distributed caching dramatically accelerates AI training and checkpointing workloads, achieving up to 99.57% GPU utilization and linear scaling across clusters in the MLPerf Storage v2.0 benchmark, while using cost‑effective commodity hardware to eliminate I/O bottlenecks.

AI trainingAlluxioGPU utilization

0 likes · 11 min read

How Alluxio’s Distributed Cache Boosts AI Training to 99.57% GPU Utilization

DataFunSummit

Mar 20, 2025 · Artificial Intelligence

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

The article traces the evolution of AI training stability from early manual operations on small GPU clusters to sophisticated, fault‑tolerant infrastructures for thousand‑card and ten‑thousand‑card models, detailing Baidu Baige’s metrics, monitoring, eBPF‑based diagnostics, and checkpoint strategies that reduce invalid training time and accelerate fault recovery.

Distributed SystemsLarge-Scale Trainingcheckpointing

0 likes · 22 min read

Evolution of AI Training Stability and Baidu Baige’s Full-Stack Solutions for Large-Scale Model Training

Baidu Intelligent Cloud Tech Hub

Mar 10, 2025 · Artificial Intelligence

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

The article examines how Baidu Baige evolved AI training stability from manual operations to precise engineering, detailing metrics, fault‑perception techniques, eBPF‑based diagnostics, multi‑level restart strategies, and trigger‑based checkpointing that together achieve sub‑minute recovery and 99.5% effective training time on massive GPU clusters.

AI trainingLarge-Scale Clusterscheckpointing

0 likes · 25 min read

How Baidu Baige Achieves Near‑Zero Downtime in Massive AI Model Training

AntData

Mar 4, 2025 · Big Data

Design and Analysis of 3FS: An AI‑Optimized Distributed File System

The article provides a comprehensive English overview of 3FS, an AI‑focused distributed file system that leverages FoundationDB for metadata, CRAQ for chunk replication, and a hybrid Fuse/native client architecture, detailing its design, components, fault handling, and performance considerations for large‑scale training workloads.

AI storageCRAQ replicationDistributed File System

0 likes · 25 min read

Design and Analysis of 3FS: An AI‑Optimized Distributed File System

Big Data Technology & Architecture

Dec 9, 2024 · Big Data

Understanding Flink’s Exactly-Once Semantics and Its Relation to Deduplication

This article explains what Flink’s Exactly‑Once semantics actually guarantee, why it does not mean each event is processed only once, how checkpointing and two‑phase commit sinks enable end‑to‑end exactly‑once, and the three safeguards needed for true exactly‑once computation.

Big DataExactly-OnceFlink

0 likes · 5 min read

Understanding Flink’s Exactly-Once Semantics and Its Relation to Deduplication

Kuaishou Large Model

Jul 11, 2024 · Artificial Intelligence

Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training

Researchers from Kwai’s large-model team present a novel training system that combines pipeline-parallel-aware activation offloading with a compute-memory balanced checkpointing strategy, enabling lossless acceleration of large language models, achieving up to 42.7% MFU on 256 NVIDIA H800 GPUs while reducing memory usage.

GPU trainingKwaiPerformance Modeling

0 likes · 13 min read

Pipeline-Aware Offloading & Balanced Checkpointing Accelerate LLM Training

Rare Earth Juejin Tech Community

May 10, 2024 · Artificial Intelligence

GPU Memory Analysis and Distributed Training Strategies

This article explains how GPU memory is allocated during model fine‑tuning, describes collective communication primitives, and compares data parallel, model parallel, ZeRO, pipeline parallel, mixed‑precision, and checkpointing techniques for reducing memory consumption in large‑scale AI training.

Distributed TrainingGPU MemoryPipeline Parallel

0 likes · 9 min read

GPU Memory Analysis and Distributed Training Strategies

Big Data Technology & Architecture

Mar 20, 2024 · Big Data

Flink 1.19 New Features: SQL Optimizations, Runtime Enhancements, and Checkpointing Improvements

The article reviews Flink 1.19’s new features, highlighting SQL capability enhancements such as custom source parallelism, TTL hints, and MiniBatch support for regular joins, as well as runtime dynamic parallelism for batch jobs and flexible checkpointing intervals for different data sources.

Big DataFlinkParallelism

0 likes · 6 min read

Flink 1.19 New Features: SQL Optimizations, Runtime Enhancements, and Checkpointing Improvements

dbaplus Community

Apr 20, 2021 · Big Data

10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)

This article shares ten practical pitfalls encountered when moving hourly Spark session jobs to Flink, covering parallelism load imbalance, state TTL, checkpointing strategies, logging, JMX debugging, state migration risks, reduce vs process choices, input data validation, event‑time handling, and external storage considerations, along with concrete configuration snippets and performance tips.

FlinkSpark migrationState Management

0 likes · 20 min read

10 Common Pitfalls When Migrating Spark Jobs to Flink (And How to Avoid Them)

DataFunTalk

Mar 15, 2021 · Big Data

Ten Gotchas When Migrating Spark Jobs to Flink

This article shares ten practical pitfalls encountered while moving hour‑level Spark session processing jobs to Apache Flink, covering parallelism skew, state TTL, checkpoint handling, logging, debugging, state migration, Reduce vs Process, input validation, event‑time handling, and the trade‑offs of storing data inside Flink.

Big DataFlinkState Management

0 likes · 19 min read

Ten Gotchas When Migrating Spark Jobs to Flink

Big Data Technology Architecture

Jul 8, 2020 · Big Data

Apache Flink 1.11.0 Release: New Features and Optimizations

Apache Flink 1.11.0 introduces a suite of major enhancements—including unaligned checkpoints, a unified source interface, CDC support in Table API/SQL, performance‑boosted PyFlink, a new application deployment mode, and numerous UI, Docker, and catalog improvements—aimed at increasing usability, scalability, and integration across streaming and batch workloads.

FlinkSQLSource Interface

0 likes · 18 min read

Apache Flink 1.11.0 Release: New Features and Optimizations

Architect

Jun 11, 2020 · Big Data

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

This article explains Apache Flink's distributed system architecture—including JobManager, ResourceManager, TaskManager, and Dispatcher—covers session and job deployment modes, data transfer mechanisms, event‑time handling with watermarks, various state types and backends, scaling strategies, and the checkpoint/savepoint recovery process.

Apache FlinkBig DataEvent Time

0 likes · 15 min read

Understanding Apache Flink Architecture, Data Transfer, Event‑Time Processing, State Management, and Checkpointing

Big Data Technology & Architecture

Jun 9, 2020 · Big Data

Comprehensive Overview and Best Practices for Apache Spark Streaming

This article provides a detailed introduction to Spark Streaming, covering its architecture, DStream concepts, initialization, data sources, transformations, windowed aggregations, output operations, checkpointing, fault‑tolerance semantics, deployment, performance tuning, and monitoring for building reliable high‑throughput streaming applications.

Big DataDstreamScala

0 likes · 17 min read

Comprehensive Overview and Best Practices for Apache Spark Streaming

Big Data Technology & Architecture

Jun 4, 2020 · Big Data

Understanding Flink StreamingFileSink: File States, Rolling Policies, and Example Code

This article explains Flink's StreamingFileSink in version 1.10.0, covering how files transition through In‑progress, Pending, and Finished states, the bucket assignment and rolling policies, and provides a complete Java example for writing string data to files.

Big DataFile RollingFlink

0 likes · 6 min read

Understanding Flink StreamingFileSink: File States, Rolling Policies, and Example Code

Architecture Digest

Mar 11, 2020 · Big Data

Apache Flink: Unified Stream and Batch Processing Architecture and Core Concepts

This article provides a comprehensive overview of Apache Flink, explaining how it unifies stream and batch processing on a single runtime, detailing its key features, APIs, libraries, architectural components, fault‑tolerance mechanisms, scheduling, iterative processing, and back‑pressure monitoring.

Apache FlinkBatch Processingbackpressure

0 likes · 20 min read

Apache Flink: Unified Stream and Batch Processing Architecture and Core Concepts

Big Data Technology & Architecture

Feb 22, 2020 · Big Data

Understanding Flink's Asynchronous Barrier Snapshot (ABS) Algorithm for Checkpointing

This article explains how Apache Flink implements fault‑tolerant checkpointing using the Asynchronous Barrier Snapshot (ABS) algorithm, a localized version of the Chandy‑Lamport distributed snapshot, covering barriers, snapshot alignment, exactly‑once versus at‑least‑once semantics, and handling of cyclic dataflow graphs.

Asynchronous Barrier SnapshotDistributed SystemsFlink

0 likes · 9 min read

Understanding Flink's Asynchronous Barrier Snapshot (ABS) Algorithm for Checkpointing

Big Data Technology & Architecture

Nov 11, 2019 · Big Data

Connecting Apache Kafka with Flink 1.9 – Overview, Compatibility, and Code Samples

This article explains how to use Flink 1.9's built‑in Kafka connector, covering supported versions, Maven dependencies, consumer and producer configuration in Java and Scala, checkpointing, offset handling, partition discovery, timestamps, watermarks, and provides a complete runnable example.

ConnectorFlinkJava

0 likes · 12 min read

Connecting Apache Kafka with Flink 1.9 – Overview, Compatibility, and Code Samples

Big Data Technology & Architecture

Sep 19, 2019 · Big Data

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

This article demonstrates how to develop a real‑time ETL job using Apache Flink, covering project setup, Kafka as a source, custom bucket assigners for HDFS, checkpointing, savepoints, and deployment on YARN to achieve exactly‑once processing guarantees.

Apache FlinkBig DataExactly-Once

0 likes · 11 min read

Building a Real‑Time ETL Pipeline with Apache Flink and Ensuring Exactly‑once Semantics

dbaplus Community

Sep 10, 2019 · Big Data

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

This article explores the two toughest problems in distributed stream processing—exactly‑once message handling and ordering—by dissecting the underlying impossibility of perfect failure detectors, the liveness‑vs‑safety trade‑off, zombie processes, and the practical solutions employed by systems such as Flink, Kafka Streams, MillWheel, and Spark.

ConsensusDistributed SystemsExactly-Once

0 likes · 81 min read

Why Exactly‑Once Processing Is So Hard in Distributed Systems (And How to Tackle It)

Big Data Technology & Architecture

May 12, 2019 · Big Data

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

This article explains Spark Streaming’s architecture, core concepts such as DStream, windowing, and the two Kafka integration methods—Receiver-based and Direct approaches—detailing their configurations, memory implications, checkpointing, and best‑practice recommendations for reliable, high‑throughput real‑time data processing.

Big DataDirect ApproachReceiver Approach

0 likes · 18 min read

Understanding Spark Streaming Integration with Kafka: Receiver-based and Direct Approaches

Big Data Technology & Architecture

Mar 13, 2019 · Big Data

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

This article explains Apache Flink's fault‑tolerance mechanisms, including checkpointing, barrier alignment, the differences between At‑Least‑Once and Exactly‑Once semantics, configuration options, incremental checkpointing, and the requirements for external sources and sinks to achieve end‑to‑end exactly‑once processing.

Apache FlinkBig DataExactly-Once

0 likes · 15 min read

Understanding Fault Tolerance and Exactly-Once Semantics in Apache Flink

Big Data Technology & Architecture

Jan 3, 2019 · Big Data

Reading Kafka Topics with Flink: A Step‑by‑Step Guide

This tutorial demonstrates how to use Apache Flink's Kafka connector to read data from Kafka topics with exactly‑once semantics, covering Maven dependencies, consumer configuration, checkpointing for fault tolerance, and a complete Scala example that writes the streamed data to HDFS.

Big DataFlinkKafkaConnector

0 likes · 5 min read

Reading Kafka Topics with Flink: A Step‑by‑Step Guide

StarRing Big Data Open Lab

Sep 23, 2016 · Fundamentals

How We Built a Production‑Grade Paxos Library: Principles and Engineering Insights

This article explains the core concepts of Paxos, its role in asynchronous distributed environments, and the practical engineering techniques used to create a production‑ready Paxos library, covering roles, instance management, optimization, checkpointing, and correctness guarantees.

Paxosasynchronous networkcheckpointing

0 likes · 28 min read

How We Built a Production‑Grade Paxos Library: Principles and Engineering Insights