Tagged articles

Checkpoint

83 articles · Page 1 of 1

Jun 11, 2026 · Artificial Intelligence

How to Prevent AI Workflow Stalls with a Three‑Step Checkpoint and Rollback Protocol

The article explains why AI pipelines often freeze due to external rate limiting or data corruption, and presents a three‑step checkpoint and rollback protocol plus partial‑retry routing that cuts full rerun time from hours to minutes, reduces compute waste by 85% and dramatically improves reliability.

AI · Automation · Checkpoint

0 likes · 7 min read

DataFunSummit

Jun 7, 2026 · Artificial Intelligence

Harness Engineering: Safety, Human‑Agent Collaboration, and Multi‑Agent Design

In a 90‑minute technical livestream, three experts dissect ten core challenges of bringing AI agents from demo to production, covering execution control, sandbox versus permission boundaries, checkpoint design, rollback strategies, tool‑call safety, human‑in‑the‑loop interaction, multi‑agent coordination, observability, and memory management.

Agent Engineering · Checkpoint · Observability

0 likes · 17 min read

DataFunSummit

Jun 5, 2026 · Artificial Intelligence

Harness Engineering: Making Multi‑Agent Systems Safe and Trustworthy from Demo to Production

In a 90‑minute live technical session, three experts dissect ten core challenges of Agent engineering—sandbox vs permission boundaries, checkpoints, rollback, tool‑call safety, human‑in‑the‑loop, multi‑agent coordination, observability, and memory—showing that moving agents from "usable" to "trustworthy" requires fine‑grained execution controls rather than broader permissions.

Agent Engineering · Checkpoint · Observability

0 likes · 18 min read

DataFunTalk

Jun 4, 2026 · Artificial Intelligence

Harness Engineering: Execution Control, Safety Boundaries, Multi‑Agent Design

The live discussion explores how to move agents from demo to production by establishing execution controls, safety boundaries, checkpoints, rollback mechanisms, tool‑call auditing, human‑in‑the‑loop handling, multi‑agent coordination, observability, and memory management, forming a comprehensive harness engineering framework.

Agent Engineering · Checkpoint · Permission Boundary

0 likes · 15 min read

Linyb Geek Road

May 6, 2026 · Artificial Intelligence

Ensuring High Availability and Robustness for LLM Agents: Key Strategies and Pitfalls

The article breaks down the unique hard and soft failure modes of LLM‑driven agents and proposes a four‑layer defense—LLM call handling, tool execution isolation, execution‑chain checkpointing, and semantic‑level safeguards—plus observability practices to keep production agents stable and reliable.

Agent · Checkpoint · LLM

0 likes · 15 min read

James' Growth Diary

May 1, 2026 · Artificial Intelligence

10 Real-World LangGraph Production Pitfalls That Can Crash Your App

The article details ten production‑grade pitfalls encountered when using LangGraph, ranging from misusing thread IDs and unbounded state growth to uncaught tool errors, infinite loops, concurrency conflicts, subgraph field mismatches, HITL timeouts, and misconfigured LangSmith tracing, each illustrated with concrete code, root‑cause analysis, and remediation steps.

AI agents · Checkpoint · LLM

0 likes · 14 min read

James' Growth Diary

Apr 27, 2026 · Artificial Intelligence

LangGraph Persistence Deep Dive: Checkpoints for Conversation Memory and Resumable Runs

This article explains LangGraph's checkpoint persistence, detailing its data structure, the role of thread_id for multi‑session isolation, the three available checkpointer backends, and how to use checkpoints for conversation memory, resumable workflows, and manual state updates, while highlighting common pitfalls.

Checkpoint · LangGraph · MemorySaver

0 likes · 9 min read

Alibaba Cloud Observability

Mar 9, 2026 · Cloud Native

How LoongCollector’s One‑Time File Collection Simplifies Bulk Log Migration

LoongCollector introduces a One‑Time file collection mode that scans matching files once, records a snapshot, and exits, enabling efficient historic log migration, data back‑fill, and temporary batch processing while providing fine‑grained checkpoints, execution windows, and throttling controls to avoid quota issues and ensure reliable completion.

Checkpoint · Data Migration · log collection

0 likes · 12 min read

Alibaba Cloud Native

Mar 8, 2026 · Cloud Native

How LoongCollector’s OneTime File Collection Transforms Static Log Migration

LoongCollector’s OneTime file collection feature enables fast, reliable migration of historical logs, data back‑filling, and batch processing by scanning files once, using checkpoints for fault tolerance, configurable execution windows, and rate‑limiting to avoid impacting live data streams.

Checkpoint · LoongCollector · OneTime

0 likes · 12 min read

DeWu Technology

Feb 9, 2026 · Big Data

How to Build a Production‑Ready Flink ClickHouse Sink with Dynamic Sharding, Batch‑by‑Size, and Robust Retry

This article presents a production‑grade Flink ClickHouse sink that solves common pain points such as lack of size‑based batching, static table schemas, and distributed‑table latency by introducing data‑size batching, dynamic table routing, local‑table writes, load‑balanced node discovery, back‑pressure queues, dual‑trigger flush, and recursive retry with node exclusion, all integrated with Flink checkpoint semantics for at‑least‑once guarantees.

Batching · Checkpoint · ClickHouse

0 likes · 25 min read

Java One

Jan 24, 2026 · Artificial Intelligence

Master Claude Code: Unlock AI‑Powered Terminal Coding

This guide explains Claude Code’s agent loop, model choices, built‑in tool categories, project access scope, session handling, checkpoint and permission controls, and practical tips for efficiently using the AI‑driven terminal assistant to write, test, and refactor code.

AI coding assistant · Agent Loop · Checkpoint

0 likes · 15 min read

Fun with Large Models

Dec 21, 2025 · Artificial Intelligence

LangGraph 1.0 Quick Guide Part 2: Conditional Edges, Memory, and Human‑in‑the‑Loop

This article walks through three advanced LangGraph 1.0 features—using the Command object for conditional routing, checkpoint‑based memory for state persistence across invocations, and interrupt‑driven human‑in‑the‑loop control—providing concrete code examples, execution traces, and a comparison of design trade‑offs.

AI agents · Checkpoint · LangGraph

0 likes · 15 min read

Big Data Technology & Architecture

Sep 24, 2025 · Big Data

Avoid These 6 Common Paimon Data Loss Pitfalls in Flink and Spark

Learn the six typical scenarios that cause data loss when writing to Paimon—ranging from checkpoint failures and misconfigured partial‑update mode to incorrect sequence fields, snapshot retention issues, concurrent bucket writes, and outdated Spark versions—and how to prevent each problem.

Big Data · Checkpoint · Data loss

0 likes · 5 min read

Volcano Engine Developer Services

Sep 4, 2025 · Backend Development

How to Build a Multi‑Agent LLM Flow in Go with Eino – Deer‑Go Deep Dive

This article explains how to re‑implement ByteDance's DeerFlow deep‑research framework in Go (Deer‑Go), covering the multi‑agent architecture, control‑hand‑off, interrupt & checkpoint mechanisms, integration with the Hertz SSE server, and step‑by‑step deployment instructions.

Checkpoint · DeerFlow · Eino

0 likes · 16 min read

Infra Learning Club

Feb 15, 2025 · Cloud Native

Advanced Guide: Real‑Time GPU Process Migration in Kubernetes with CRIU

This article explains how os‑criu provides transparent, OS‑level GPU checkpoint/restore, compares its performance with NVIDIA's cuda‑checkpoint, walks through building and installing the PhOS framework, demonstrates migration of a Llama2‑13b‑chat workload in Docker, and discusses current limitations and future Kubernetes integration plans.

CRIU · Checkpoint · Docker

0 likes · 9 min read

dbaplus Community

Jun 30, 2024 · Databases

How MySQL’s Write‑Ahead Log Safeguards Data During Power Failures

An in‑depth guide explains MySQL’s write‑ahead log mechanism, covering buffer pool, redo and undo logs, checkpoint types, and how the system recovers from power failures, with step‑by‑step examples and practical configuration tips for reliable data consistency.

Checkpoint · Database Recovery · Redo Log

0 likes · 12 min read

ITPUB

May 6, 2024 · Databases

How MySQL’s Write‑Ahead Log Protects Data During Power Outages

This article explains MySQL InnoDB’s write‑ahead logging, detailing the roles of Buffer Pool, Redo and Undo logs, checkpoint mechanisms, and how they ensure data consistency and atomicity when a sudden power loss occurs.

Checkpoint · Database Recovery · InnoDB

0 likes · 12 min read

DataFunTalk

Dec 27, 2023 · Big Data

Apache Flink 2023: Core Technical Achievements and Future Directions

The article reviews Apache Flink's rapid development over the past decade, highlighting its 2023 community growth, SIGMOD award, major releases, streaming SQL enhancements, incremental checkpointing, batch maturity, cloud‑native scaling, and integration with the emerging Lakehouse architecture.

Apache Flink · Big Data · Checkpoint

0 likes · 11 min read

Architecture Digest

Dec 19, 2023 · Backend Development

Using CRaC with SpringBoot 3.2: A Practical Guide and Performance Evaluation

This article explains how to enable and use the OpenJDK CRaC project with SpringBoot 3.2, covering prerequisites, dependency setup, automatic and manual checkpoint creation, and demonstrates significant startup‑time reductions through detailed examples and performance results.

CRaC · Checkpoint · Java

0 likes · 9 min read

Java High-Performance Architecture

Dec 13, 2023 · Backend Development

How to Supercharge Spring Boot 3.2 Startup with CRaC Checkpointing

This guide explains how Spring Boot 3.2 leverages the OpenJDK CRaC project to create and restore JVM checkpoints, dramatically reducing application startup time through automatic and manual checkpoint techniques, complete with required dependencies, installation steps, and performance results.

CRaC · Checkpoint · Java

0 likes · 8 min read

Selected Java Interview Questions

Dec 8, 2023 · Backend Development

Using CRaC with Spring Boot 3.2 and Spring 6.1: A Practical Guide

This article demonstrates how to enable and use Coordinated Restore at Checkpoint (CRaC) in Spring Boot 3.2 with the Spring Petclinic example, covering required dependencies, JVM setup, automatic and manual checkpoint creation, performance measurements, and best‑practice considerations for backend developers.

CRaC · Checkpoint · Java

0 likes · 8 min read

Java Architecture Diary

Dec 1, 2023 · Backend Development

Accelerate Spring Boot 3.2 Startup with CRaC: Automatic & Manual Checkpoints

This article demonstrates how Spring Boot 3.2 leverages the OpenJDK CRaC project to dramatically cut application startup time, covering required JVM, dependencies, Zulu JDK setup, automatic and manual checkpoint creation, performance benchmarks, and restoration steps, all using the Petclinic example.

CRaC · Checkpoint · Spring Boot

0 likes · 8 min read

Baidu Geek Talk

Apr 19, 2023 · Artificial Intelligence

Why Does Recompute Crash Distributed Training? A Deep Dive into Checkpoint Issues and Fixes

When training large‑batch deep learning models, developers often use recompute to trade computation for memory, but in dynamic graph frameworks this can trigger synchronization errors in distributed data parallel training; the article explains the underlying DDP mechanics, illustrates the error, and offers a practical no_sync workaround with code examples.

Checkpoint · PyTorch · distributed training

0 likes · 14 min read

Big Data Technology & Architecture

Mar 27, 2023 · Big Data

Key Updates in Apache Flink 1.17: Batch and Streaming Enhancements

The article reviews Apache Flink 1.17's major batch and streaming improvements, including new Delete/Update APIs, performance boosts, SQL client gateway, checkpoint and watermark enhancements, StateBackend upgrades, and practical use‑case scenarios for data engineers.

Apache Flink · Batch Processing · Big Data

0 likes · 7 min read

ITPUB

Mar 24, 2023 · Big Data

What’s New in Apache Flink 1.17? Key Features, Performance Gains, and Streaming Warehouse Advances

Apache Flink 1.17 introduces a suite of batch and streaming enhancements—including a new Streaming Warehouse API, significant TPC‑DS performance boosts, adaptive batch scheduling, improved checkpointing, expanded SQL capabilities, Hive connector upgrades, and broader filesystem support—while also delivering upgrades to FRocksDB, Calcite, and the token framework to strengthen its position as a leading unified data‑processing engine.

Apache Flink · Batch Processing · Checkpoint

0 likes · 23 min read

Big Data Technology & Architecture

Feb 24, 2023 · Big Data

Common Flink Task Submission Issues and Solutions on YARN

This article compiles frequent Flink job submission problems on YARN, including WordCount jar errors, HBase dependency conflicts, MySQL timeouts, checkpoint restoration failures, parallelism limits, and unexpected container termination, and provides root‑cause analysis and step‑by‑step remediation instructions.

Big Data · Checkpoint · Flink

0 likes · 21 min read

Sohu Tech Products

Jan 18, 2023 · Big Data

Root Cause Analysis of Flink TaskManager Failover Causing Data Reprocessing and Business Impact

An incident report details how a scheduled machine reboot on Alibaba Cloud triggered a Flink TaskManager failover, leading to excessive data replay, increased ES pressure, and significant business latency, and explains the root cause involving disabled checkpoints and timestamp‑based offset consumption.

Checkpoint · Flink · RootCause

0 likes · 10 min read

Big Data Technology & Architecture

Dec 28, 2022 · Big Data

Flink 1.16 Highlights: Adaptive Batch Scheduling, Speculative Execution, Hybrid Shuffle, Dynamic Partition Pruning, Hive SQL Migration, Checkpoint Enhancements, CDC Integration, and Table Store

Flink 1.16 introduces adaptive batch scheduling, speculative execution, hybrid shuffle, dynamic partition pruning, improved Hive SQL compatibility, advanced checkpoint mechanisms including changelog backend, and integrates CDC with Kafka and Table Store, offering faster, more stable, and easier-to-use stream‑batch processing capabilities.

Big Data · CDC · Checkpoint

0 likes · 8 min read

ITPUB

Dec 21, 2022 · Big Data

How Bilibili Optimized Flink Runtime for Massive Real‑Time Jobs

This article details Bilibili's extensive enhancements to the Flink runtime—including checkpoint recoverability, max‑parallelism calculations, State Processor API extensions, Full and Regional Checkpoints, hybrid HA, task‑level recovery, load‑balanced partitioners, and large‑scale cluster maintenance—to improve reliability and performance of its billion‑scale streaming workloads.

Big Data · Checkpoint · Flink

0 likes · 33 min read

Zhengcaiyun Technology (政采云技术)

Dec 8, 2022 · Big Data

Understanding Flink's Asynchronous Barrier Snapshotting (ABS) Checkpoint Algorithm

This article explains the Asynchronous Barrier Snapshotting algorithm used by Apache Flink for checkpointing, detailing its origins from the Chandy‑Lamport algorithm, its operation in both acyclic and cyclic dataflow graphs, barrier alignment, and the fault‑recovery process.

Asynchronous Barrier Snapshotting · Checkpoint · Flink

0 likes · 10 min read

Bilibili Tech

Nov 29, 2022 · Big Data

How Bilibili Supercharged Flink: Checkpoint, HA, and Runtime Optimizations

This article details Bilibili's extensive enhancements to Flink's runtime—including checkpoint recoverability, operator ID stability, state processor extensions, hybrid high‑availability, regional checkpointing, and load‑based channel selection—to improve scalability, reliability, and operational efficiency of large‑scale streaming jobs.

Big Data · Checkpoint · Flink

0 likes · 32 min read

JD Tech

Sep 6, 2022 · Big Data

Flink Streaming Job Tuning Guide: Memory Model, Network Stack, RocksDB, and More

This article presents a detailed guide for optimizing large‑scale Apache Flink streaming jobs on the JD Real‑Time Computing platform, covering TaskManager memory model tuning, network stack configuration, RocksDB state management, checkpoint strategies, and additional performance tips with practical examples and calculations.

Apache Flink · Checkpoint · Network Stack

0 likes · 22 min read

Hulu Beijing

Aug 4, 2022 · Big Data

Unlock Seamless Object Serialization & Checkpoint Recovery in Spark with Neutrino

This article explains how Neutrino’s SerializableProvider API enables passing final classes, managing mutable object state, and supporting Spark checkpoint recovery through dependency injection, while also showing practical code patterns and injection of core Spark components.

Big Data · Checkpoint · Dependency Injection

0 likes · 8 min read

NetEase LeiHuo UX Big Data Technology

Aug 3, 2022 · Big Data

Understanding Spark Streaming Checkpoint Mechanism for Real‑Time Feature Computation

The article explains how Spark Streaming's checkpoint mechanism works, detailing the four-step process—from setting the checkpoint directory to writing RDD data and finalizing the checkpoint—highlighting its role in ensuring fault‑tolerant, fast recovery for real‑time recommendation feature pipelines.

Big Data · Checkpoint · Real-time Processing

0 likes · 7 min read

DataFunTalk

Jun 6, 2022 · Big Data

Understanding Flink's Exactly-Once Guarantees: Checkpoint, Two‑Phase Commit, and Kafka Integration

This article explains how Apache Flink achieves end‑to‑end exactly‑once semantics by using source replay support, checkpoint‑based snapshots, asynchronous incremental checkpoints, and two‑phase commit sinks, and describes the interaction with external systems such as Kafka to ensure transactional writes.

Big Data · Checkpoint · Exactly-once

0 likes · 7 min read

DataFunTalk

Apr 25, 2022 · Big Data

Comprehensive Guide to Flink Deployment, State Programming, Checkpointing, and Performance Tuning

This article provides an extensive overview of Apache Flink, covering deployment modes, cluster sizing, job submission workflows, state programming concepts, checkpoint mechanisms, backpressure handling, comparison with Spark, and practical code snippets for configuration and optimization.

Big Data · Checkpoint · Flink

0 likes · 48 min read

Big Data Technology & Architecture

Apr 19, 2022 · Big Data

Understanding Flink Checkpoint and Unaligned Checkpoint Mechanisms

This article explains Flink's fundamental checkpoint mechanism, its coupling with backpressure, and how the introduction of Unaligned Checkpoint in Flink 1.11 decouples checkpointing from backpressure to improve latency and resource utilization in high‑backpressure streaming jobs.

Big Data · Checkpoint · Flink

0 likes · 14 min read

Big Data Technology & Architecture

Jan 12, 2022 · Big Data

Common Production Issues and Troubleshooting Guide for Apache Flink

This article compiles a comprehensive list of common production problems encountered with Apache Flink, covering cluster sizing, checkpoint failures, backpressure analysis, resource allocation, deployment errors, UDF definitions, data skew, Kafka configurations, and provides detailed troubleshooting steps and best‑practice recommendations.

Apache Flink · Checkpoint · Production troubleshooting

0 likes · 39 min read

dbaplus Community

Jan 5, 2022 · Big Data

How ByteDance Optimized Flink SQL for Real‑World Streaming at Scale

This article details ByteDance's practical experience with Apache Flink, covering SQL extensions, a visual SQL platform, performance tweaks such as window mini‑batching and custom windows, join and checkpoint recovery improvements, stream‑batch integration experiments, and future roadmap plans.

Batch Integration · Checkpoint · Flink

0 likes · 16 min read

Zhengcaiyun Technology (政采云技术)

Nov 30, 2021 · Databases

Overview of MySQL and InnoDB Storage Engine Architecture

This article provides a comprehensive overview of MySQL, detailing its configuration file search order, component architecture, various storage engines such as MyISAM, NDB, Memory, and an in‑depth examination of InnoDB’s internal structures, memory management, background threads, LRU handling, redo log buffering, and checkpoint mechanisms.

Checkpoint · Database Architecture · InnoDB

0 likes · 26 min read

Big Data Technology & Architecture

Nov 20, 2021 · Big Data

Comprehensive Overview of Apache Flink Concepts, Mechanisms, and Interview Questions

This article provides an extensive technical guide to Apache Flink, covering its exactly‑once consumption guarantees, checkpoint and two‑phase commit mechanisms, differences from Spark, state backends, watermark handling, time semantics, window joins, CEP, backpressure, architecture layers, deployment, resource management, and common operational issues.

Big Data · CEP · Checkpoint

0 likes · 77 min read

Big Data Technology & Architecture

Nov 16, 2021 · Big Data

Flink Checkpoint, Backpressure, and Memory Tuning Guide

This article provides a comprehensive guide on optimizing Flink checkpoints, diagnosing and alleviating backpressure, and fine‑tuning memory configurations—including process, heap, off‑heap, managed, and network memory—to improve job stability and performance in large‑scale streaming applications.

Checkpoint · Flink · StateBackend

0 likes · 25 min read

Tencent Cloud Developer

Nov 9, 2021 · Big Data

Comprehensive Overview of Apache Flink Streaming Computation and Architecture

The article systematically introduces Apache Flink’s streaming computation model, contrasting batch and real‑time processing, detailing its unified architecture, managed and raw state with key groups, checkpointing and savepoints for fault tolerance, data exchange mechanisms, time semantics, windowing, side‑outputs, and a complete Java Kafka‑based example.

Apache Flink · Checkpoint · Flink Architecture

0 likes · 46 min read

Big Data Technology & Architecture

Oct 9, 2021 · Big Data

Apache Flink 1.7–1.14 Release Highlights and Feature Evolution

This article provides a comprehensive overview of Apache Flink's major releases from version 1.7 to 1.14, detailing new APIs, state management improvements, Kubernetes integration, SQL and Table API enhancements, checkpointing advances, and performance optimizations that together illustrate the platform's evolution for both streaming and batch processing workloads.

Apache Flink · Batch Processing · Checkpoint

0 likes · 78 min read

DataFunTalk

Oct 6, 2021 · Big Data

Optimizing Flink Real‑Time Computing at Bilibili: Connector Stability, SQL, Runtime, and Future Outlook

This article details Bilibili's comprehensive optimization of Flink real‑time computing, covering connector stability improvements, SQL interval‑join enhancements, runtime state and checkpoint refinements, a diagnostic tool, and future directions for high‑throughput streaming workloads.

Big Data · Checkpoint · Flink

0 likes · 18 min read

Aikesheng Open Source Community

Sep 17, 2021 · Databases

Understanding InnoDB "Pages flushed up to" and Its Relation to the Last Checkpoint

This article explains the meaning of the 'Pages flushed up to' value in MySQL's InnoDB engine, how it differs from the 'Last checkpoint' LSN, the underlying logic for its calculation, and provides test results demonstrating its behavior during writes and checkpoints.

Checkpoint · InnoDB · LSN

0 likes · 6 min read

Java Architecture Diary

Aug 6, 2021 · Databases

How Relational Databases Ensure Durability: Inside Pages, Undo & Redo Logs

This article explains the internal mechanisms of relational databases, covering data pages, buffer pools, undo and redo logs, checkpointing, and how these components work together to provide atomicity, durability, and crash recovery while minimizing disk I/O.

Checkpoint · Redo Log · Transaction Logs

0 likes · 7 min read

DataFunTalk

Jul 28, 2021 · Big Data

Pravega Flink Connector: Past, Present, and Future – Architecture, Checkpoint Integration, and Upcoming Features

This article reviews the Pravega project and its Flink connector, covering Pravega's design for large‑scale streaming, the connector's evolution and exactly‑once semantics, Flink 1.11 integration challenges, checkpoint mechanisms, and future plans such as schema‑registry and new Flink features.

Big Data · Checkpoint · Connector

0 likes · 10 min read

Big Data Technology & Architecture

Jul 20, 2021 · Big Data

Common Issues and Solutions for Flink CDC with MySQL

This article summarizes frequent problems encountered when using Flink CDC with MySQL—including Kafka version conflicts, checkpoint timeouts, permission errors, global lock issues, and DDL parsing failures—and provides practical configuration tweaks and code examples to resolve them.

CDC · Checkpoint · Debezium

0 likes · 11 min read

Big Data Technology & Architecture

Jul 12, 2021 · Big Data

Common Production Issues and Troubleshooting Guide for Apache Flink

This article compiles classic production problems encountered with Apache Flink, covering cluster sizing, checkpoint failures, backpressure diagnosis, client submission errors, resource allocation on YARN, and PyFlink UDF definitions, providing step‑by‑step troubleshooting methods and practical recommendations.

Checkpoint · Flink · Production

0 likes · 18 min read

Big Data Technology & Architecture

Apr 10, 2021 · Big Data

Understanding Spark Cache and Checkpoint Mechanisms

This article explains Spark's cache and checkpoint mechanisms, detailing when to use each, how they are implemented internally, how cached and checkpointed RDDs are stored and retrieved, and the differences between caching, persisting, and checkpointing for reliable big‑data processing.

Cache · Checkpoint · RDD

0 likes · 13 min read

Big Data Technology & Architecture

Apr 4, 2021 · Big Data

Flink Performance Tuning Guide: Memory Configuration, Parallelism, Checkpoint Optimization, and Common Issues

This guide details comprehensive Flink performance tuning techniques, covering memory configuration, GC settings, parallelism adjustments, process parameters, partitioning strategies, Netty network tuning, checkpoint optimization, and common issues such as data skew and resource bottlenecks.

Checkpoint · Flink · Memory Management

0 likes · 18 min read

DataFunTalk

Mar 21, 2021 · Big Data

Single‑Point Recovery and Regional Checkpoint in Flink: Design, Implementation, and Optimizations

This article presents ByteDance's recent Flink enhancements, detailing a single‑point recovery mechanism for the network layer and a regional checkpoint strategy that together improve failover latency, reduce output loss, and enable scalable, high‑throughput stream processing for large‑scale real‑time recommendation workloads.

Big Data · Checkpoint · Flink

0 likes · 12 min read

Big Data Technology & Architecture

Mar 18, 2021 · Big Data

Flink Job Troubleshooting and Performance Optimization: Data Skew, Kafka Configuration, Resource Management, and Checkpoint Issues

This article details common Flink streaming problems such as data skew causing task back‑pressure, oversized Kafka messages, high‑throughput ack settings, slot removal errors, checkpoint timeouts, and resource constraints, and provides concrete configuration changes and architectural adjustments to resolve them.

Checkpoint · Data Skew · Flink

0 likes · 18 min read

Big Data Technology & Architecture

Jan 9, 2021 · Big Data

Comprehensive 2021 Flink Interview Questions and Answers

This article presents a detailed collection of 2021 Flink interview questions covering checkpoint mechanisms, watermarks, state backends, join types, fault tolerance, resource configuration, and recent Flink 1.10 features, providing concise explanations and code examples for each topic.

Checkpoint · Flink · State Backend

0 likes · 23 min read

dbaplus Community

Dec 15, 2020 · Big Data

Building Real‑Time OLAP Reports with Flink SQL CDC and Elasticsearch

This article details a production‑grade pipeline that uses Apache Flink 1.11's SQL CDC to stream MySQL changes into Elasticsearch, enabling low‑latency OLAP reporting, and shares the architecture, DDL/DML scripts, operational settings, and dozens of pitfalls encountered along the way.

Checkpoint · YAML · big-data

0 likes · 19 min read

StarRing Big Data Open Lab

Oct 16, 2020 · Databases

How Memory Databases Handle Concurrency, Persistence, and Query Processing

This article explores the concurrency control strategies, persistence mechanisms, and compiled query processing techniques used by modern memory databases, comparing systems like Hekaton, Hyper, HANA, and others while highlighting performance trade‑offs and architectural considerations.

Checkpoint · Concurrency Control · MVCC

0 likes · 20 min read

Architects' Tech Alliance

Aug 27, 2020 · Fundamentals

Understanding Burst Buffer Technology and Its Role in High‑Performance Computing (HPC)

Burst Buffer is a storage acceleration technology that raises I/O bandwidth and operations per second (OPS) for high‑performance computing by providing fast checkpoint/restart and temporary storage and by balancing SSD and parallel file system resources; the article details implementations from DDN, Cray, EMC, and IBM for HPC designers.

Burst Buffer · Checkpoint · HPC

0 likes · 5 min read

Alibaba Cloud Developer

Jul 13, 2020 · Big Data

What’s New in Apache Flink 1.11? A Deep Dive into Features and Performance

Apache Flink 1.11.0, released after four months of development, brings major ecosystem, usability, and stability improvements—including CDC support, a new JDBC catalog, real‑time Hive integration, a redesigned source API, PyFlink enhancements, application mode for Kubernetes, and checkpoint optimizations—while highlighting the growing contribution of Chinese developers.

Apache Flink · Checkpoint · Feature Release

0 likes · 20 min read

Big Data Technology & Architecture

Jul 13, 2020 · Big Data

Understanding and Optimizing Flink Checkpoint Mechanism for Large-Scale State

This article explains Flink's checkpoint mechanism, outlines key performance metrics, discusses interval configuration, external state storage choices, resource allocation, and task-local recovery strategies to improve checkpoint speed and reliability in large‑scale state scenarios.

Big Data · Checkpoint · Flink

0 likes · 5 min read

Architect

Jun 3, 2020 · Backend Development

Elasticsearch Distributed Consistency Analysis: Data Flow, PacificA Algorithm, Sequence Numbers and Checkpoints

This article provides a detailed examination of Elasticsearch's distributed consistency mechanisms, covering the shard write path, the PacificA replication algorithm, the role of SequenceNumber and Checkpoint, and a comparison of ES's implementation with the original algorithm, based on version 6.2.

Checkpoint · Elasticsearch · PacificA

0 likes · 23 min read

Big Data Technology & Architecture

Apr 15, 2020 · Big Data

Understanding HDFS SecondaryNameNode and the Checkpoint Process

This article explains the role of HDFS SecondaryNameNode, the structure of fsimage and edits files, how checkpointing works—including configuration parameters and steps—and how the process changes when NameNode high availability is enabled.

Big Data · Checkpoint · Filesystem

0 likes · 6 min read

Big Data Technology & Architecture

Apr 8, 2020 · Big Data

Common Apache Flink Exceptions and How to Resolve Them

This article enumerates typical Apache Flink deployment, job, and checkpoint errors—such as JDK version issues, resource shortages, task manager timeouts, and state migration problems—and provides practical troubleshooting steps and configuration tips to help engineers quickly diagnose and fix these failures.

Big Data · Checkpoint · Exception

0 likes · 8 min read

Youzan Coder

Feb 28, 2020 · Big Data

Flink Checkpoint Principle Analysis and Failure Cause Investigation

The article thoroughly explains Apache Flink’s checkpoint mechanism—including state types, coordinator workflow, exactly‑once versus at‑least‑once semantics, common failure sources such as code exceptions, storage or network issues, and practical configuration tips like interval settings, local recovery and externalized checkpoints.

Apache Flink · Checkpoint · Exactly-once

0 likes · 15 min read

dbaplus Community

Feb 25, 2020 · Backend Development

How to Merge Small Files in Flink Checkpoints to Reduce HDFS Load

This article explains a small‑file‑merging technique for Apache Flink checkpoints that reuses FSDataOutputStreams to combine multiple state files into a single HDFS file, detailing design considerations such as concurrent checkpoint support, reference‑counted deletion, space amplification reduction, fault handling, compatibility, and observed production performance gains.

Apache Flink · Checkpoint · HDFS

0 likes · 13 min read

Aikesheng Open Source Community

Feb 20, 2020 · Databases

Understanding MySQL Multi‑Threaded Slave (MTS) Checkpoint Mechanism and Event Execution

This article explains how MySQL's Multi‑Threaded Slave (MTS) processes events, manages checkpoints, and persists state using GAQ queues, bitmaps, and system tables, providing detailed code references and configuration parameters for reliable parallel replication.

Checkpoint · GTID · MTS

0 likes · 14 min read

Qunar Tech Salon

Feb 11, 2020 · Databases

Understanding InnoDB REDO Log Management and MTR Physical Transactions

This article explains the purpose and management of InnoDB REDO log files, the role of Log Sequence Numbers, the structure of log pages, the MTR physical transaction mechanism, and how checkpoints ensure data integrity in MySQL databases.

Checkpoint · Database Internals · InnoDB

0 likes · 22 min read

Big Data Technology & Architecture

Nov 4, 2019 · Big Data

Understanding Spark Checkpoint: Purpose, Mechanism, and Best Practices

This article explains why Spark checkpoints are needed for large or complex RDD pipelines, how they work by persisting data to reliable storage such as HDFS, and outlines practical steps and best‑practice recommendations for using checkpoints effectively in production environments.

Big Data · Checkpoint · HDFS

0 likes · 6 min read

dbaplus Community

Oct 22, 2019 · Big Data

How Weibo Built a Billion‑Log Real‑Time Data Platform with Flink

This article details how Weibo’s advertising team designed and implemented a real‑time data platform capable of processing over a hundred billion daily logs, covering technology selection, Flink advantages, architecture evolution, data processing pipelines, component libraries, fault‑tolerance strategies, and the construction of a multi‑layer real‑time data warehouse.

Big Data · Checkpoint · Data Architecture

0 likes · 25 min read

Big Data Technology & Architecture

Oct 14, 2019 · Big Data

Optimizing Spark PageRank: Cache, Checkpoint, Data Skew, and Resource Utilization

This article presents a comprehensive analysis of Spark PageRank performance, detailing the algorithm's basics, the original example code, and four key optimizations—caching with checkpointing, memory‑efficient data structures, handling data skew, and maximizing executor and driver resource usage—backed by experimental results and practical recommendations.

Big Data · Cache · Checkpoint

0 likes · 18 min read

Big Data Technology & Architecture

Oct 9, 2019 · Big Data

Choosing and Using Flink State Backends: MemoryStateBackend, FsStateBackend, and RocksDBStateBackend

This article explains how Flink checkpoints persist state, compares the three built‑in state backends (MemoryStateBackend, FsStateBackend, RocksDBStateBackend), discusses their configurations, advantages, limitations, and provides guidance on selecting the appropriate backend for different big‑data streaming scenarios.

Big Data · Checkpoint · Flink

0 likes · 10 min read

Big Data Technology & Architecture

Sep 18, 2019 · Big Data

Understanding Flink Checkpoint Mechanism and Configuration

This article explains Flink's checkpoint mechanism, its execution flow, common configuration options, and the benefits and considerations of incremental checkpoints using the RocksDB state backend, providing practical code examples and YAML settings for reliable stream processing.

Big Data · Checkpoint · Flink

0 likes · 12 min read

Big Data Technology & Architecture

Aug 9, 2019 · Big Data

Understanding Exactly-Once Semantics in Apache Flink: Challenges and Implementation

This article analyzes the difficulties of achieving exactly-once delivery in Apache Flink, explains the distinction between state and end‑to‑end exactly‑once, and details how Flink implements exactly‑once sinks using idempotent and transactional approaches, including a Bucketing File Sink example.

Checkpoint · Flink · State Management

0 likes · 12 min read

Node Underground

Jul 20, 2019 · Cloud Native

How to Use Docker Checkpoint & CRIU for Live Container Migration

This guide walks you through enabling Docker's experimental mode, installing CRIU, building a simple Node container, creating checkpoints, and restoring containers both on the same host and on a different host, highlighting the prerequisites and limitations of live migration.

CRIU · Checkpoint · Container Migration

0 likes · 5 min read

Big Data Technology & Architecture

Jul 9, 2019 · Big Data

Understanding Flink State Management and Checkpointing for Exactly-Once Kafka Integration

This article explains how Apache Flink manages state, uses checkpointing for fault-tolerant recovery, and achieves exactly-once semantics when consuming Kafka streams by persisting offsets, describing the checkpoint mechanism, recovery process, and practical considerations for production deployments.

Big Data · Checkpoint · Flink

0 likes · 8 min read

Node Underground

Jun 3, 2019 · Operations

How to Use CRIU for Linux Process Checkpoint and Restore with Node.js

Learn how to install CRIU on CentOS, create a simple Node.js program that prints its PID, capture its state with a checkpoint using CRIU, and then restore the process, demonstrating practical use cases like migration and fast startup.

Checkpoint · Linux · Node.js

0 likes · 3 min read

Qunar Tech Salon

Feb 20, 2019 · Big Data

Building Real-Time User Behavior Engineering with Apache Flink: Architecture, Features, and Implementation

This article introduces the design and implementation of a real‑time user behavior engineering platform at Qunar using Apache Flink, covering Flink's core characteristics, distributed runtime, DataStream programming model, fault‑tolerance, back‑pressure handling, event‑time processing, windowing, watermarks, and practical code examples for filtering, splitting, joining, and state management.

Checkpoint · DataStream · EventTime

0 likes · 18 min read

Qunar Tech Salon

Oct 25, 2018 · Big Data

Why Alibaba Chose Apache Flink: Architecture, Scale, and Future Directions

This article explains how Alibaba adopted Apache Flink as a unified, low‑latency, high‑throughput big‑data engine, detailing its stream‑first design, state management, checkpointing, massive production deployment, community contributions, and upcoming plans for a unified API, SQL layer, broader language support, and AI integration.

Alibaba · Apache Flink · Big Data

0 likes · 13 min read

Efficient Ops

Jun 7, 2018 · Fundamentals

SSD Power‑Loss Recovery: Normal vs. Abnormal Scenarios and Mapping Table Rebuild

This article explains the two types of SSD power loss—normal and abnormal—detailing how SSDs preserve data, the role of capacitors and non‑volatile memory, and the checkpoint‑based strategies used to quickly reconstruct mapping tables after unexpected power interruptions.

Checkpoint · Data Recovery · Mapping Table

0 likes · 10 min read

dbaplus Community

Apr 12, 2017 · Databases

Why InnoDB Double Write Matters: MySQL vs Oracle Recovery Mechanisms

This article explains InnoDB’s double‑write buffer in MySQL, compares its design and recovery handling with Oracle’s redo and control‑file mechanisms, discusses partial‑write issues, checkpoint strategies, performance impacts on SSDs, and provides practical commands and configuration tips for DBAs.

Checkpoint · InnoDB · Oracle

0 likes · 21 min read

Architects' Tech Alliance

Dec 3, 2016 · Fundamentals

Effective Data Cleaning Practices and Tips

This article provides practical guidance on data cleaning, covering the importance of data wrangling, using assertions, handling incomplete records, checkpointing, testing on subsets, logging, optional raw data storage, and validating the cleaned dataset to ensure reliable downstream analysis.

Checkpoint · Data preprocessing · Logging

0 likes · 7 min read

ITPUB

Aug 11, 2016 · Databases

How Oracle’s Incremental Checkpoints Reduce I/O Overhead and Speed Recovery

This article explains Oracle’s checkpoint mechanism, contrasting full and incremental checkpoints, describing the checkpoint queue structure, the roles of DBWn and CKPT processes, recovery using redo logs, and how to tune checkpoint frequency with the fast_start_mttr_target parameter while monitoring relevant performance views.

Checkpoint · Oracle · Recovery

0 likes · 14 min read

Architect

Feb 25, 2016 · Databases

Understanding MySQL 5.6 Parallel Replication (MTS) Architecture and Implementation

This article explains the design, configuration parameters, core data structures, initialization, coordinator distribution, worker execution, checkpointing, and shutdown procedures of MySQL 5.6's Multi‑Threaded Slave (MTS) parallel replication, providing a code‑level walkthrough for developers and DBAs.

Binlog · Checkpoint · MTS

0 likes · 17 min read
