Accelerating Java Application Startup with CRaC and Flexible Compute on Alibaba Cloud Container Service
This article explains how Alibaba Cloud Container Service leverages flexible compute and the CRaC (Coordinated Restore at Checkpoint) mechanism to dramatically reduce Java application startup latency, details integration steps, presents experimental performance results, and discusses future applicability in cloud‑native environments.
After a Java application such as a Spring Boot service starts, it often needs additional warm-up (class loading and JIT compilation) before it can handle traffic; insufficient warm-up can cause repeated CrashLoopBackOff failures under high load. Alibaba Cloud Container Service (ACS) uses its flexible compute capability together with CRaC technology to dramatically speed up Java startup.
What is Flexible Compute? ACS can dynamically allocate compute resources to match actual demand, allowing applications to receive extra CPU during startup (via a configurable acceleration factor) and then scale back to baseline, reducing cost. It also provides recommended JVM parameters to fully exploit the extra resources.
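As a rough illustration of the acceleration factor, the sketch below computes the CPU budget during the startup window and prints how many processors the JVM actually detects. It assumes the factor is a plain multiplier on the baseline allocation; the class name, the helper, and the 1-CPU baseline are hypothetical and not taken from ACS documentation.

```java
final class StartupCpuBudget {
    // Hypothetical helper: CPUs available during the accelerated startup window,
    // assuming the acceleration factor simply multiplies the baseline allocation.
    static double acceleratedCpus(double baselineCpus, double accelerationFactor) {
        if (baselineCpus <= 0 || accelerationFactor < 1) {
            throw new IllegalArgumentException("baseline must be > 0 and factor >= 1");
        }
        return baselineCpus * accelerationFactor;
    }

    public static void main(String[] args) {
        double baseline = 1.0; // illustrative baseline, not a value from the article
        double factor = 8.0;   // an 8x factor appears in the experimental results
        System.out.println("CPUs during startup (assumed): " + acceleratedCpus(baseline, factor));
        // HotSpot derives defaults such as GC and JIT compiler thread counts from this value,
        // so it is worth checking that the JVM really sees the extra CPUs.
        System.out.println("CPUs seen by the JVM: " + Runtime.getRuntime().availableProcessors());
    }
}
```

Logging the detected processor count at startup is a cheap way to confirm that the extra CPU is visible inside the container before tuning any JVM parameters.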
What is CRaC? CRaC (Coordinated Restore at Checkpoint) enables an application to checkpoint its JVM state and later restore from that snapshot, bypassing the full cold‑start process. It builds on CRIU (Checkpoint/Restore in Userspace) and has been enhanced for Java by the Dragonwell project.
Alibaba Dragonwell, Alibaba's customized OpenJDK distribution, now includes CRaC support. By combining CRaC with ACS's flexible compute, users can allocate extra CPU for the initial startup and then use checkpoint‑based restoration for rapid recovery after crashes.
Integration Steps:
Replace the JDK with Dragonwell that supports CRaC.
Insert checkpoint commands into the application or startup scripts (an intrusive modification requiring developer knowledge).
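Ensure that the state captured in the checkpoint is still valid after restore; where it is not (for example, open network connections), release it before the checkpoint and re-create it after restore via callbacks.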
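The callback step can be sketched as follows. The CRaC API (`org.crac.Resource`) exposes `beforeCheckpoint` and `afterRestore` hooks; so that the example compiles on any JDK, it uses a locally defined, simplified stand-in interface instead of the real one. `CheckpointAware`, `ConnectionPool`, and the simulated checkpoint are all hypothetical, and no real snapshot is taken.

```java
import java.util.ArrayList;
import java.util.List;

final class CheckpointCallbackSketch {
    /** Hypothetical, simplified stand-in for the callback contract of org.crac.Resource. */
    interface CheckpointAware {
        void beforeCheckpoint();
        void afterRestore();
    }

    /** Example resource whose live handles should not be frozen into a snapshot. */
    static final class ConnectionPool implements CheckpointAware {
        private boolean open = true;
        final List<String> events = new ArrayList<>();

        boolean isOpen() { return open; }

        @Override public void beforeCheckpoint() {
            open = false;           // release sockets/files before the image is written
            events.add("closed");
        }

        @Override public void afterRestore() {
            open = true;            // re-create connections in the restored process
            events.add("reopened");
        }
    }

    private final List<CheckpointAware> resources = new ArrayList<>();

    void register(CheckpointAware resource) { resources.add(resource); }

    /** Simulates the two callback phases; the notification order here is arbitrary. */
    void simulateCheckpointAndRestore() {
        for (CheckpointAware r : resources) r.beforeCheckpoint();
        // A CRaC-enabled JVM would write the process image at this point.
        for (CheckpointAware r : resources) r.afterRestore();
    }

    public static void main(String[] args) {
        CheckpointCallbackSketch runtime = new CheckpointCallbackSketch();
        ConnectionPool pool = new ConnectionPool();
        runtime.register(pool);
        runtime.simulateCheckpointAndRestore();
        System.out.println(pool.events + " open=" + pool.isOpen());
    }
}
```

In a real application the resource would implement `org.crac.Resource` and be registered with `Core.getGlobalContext().register(...)`, assuming Dragonwell's CRaC build exposes the same API, which the article does not state explicitly.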
The article provides a Dockerfile for reproducing the experiments:
FROM alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/alinux3:latest
RUN dnf install -y glibc && dnf update --security -y && dnf upgrade --security -y && dnf clean all && rm -rf /var/cache/dnf/ && rm -rf /core.*
RUN yum install glibc net-tools tar iputils tcpdump wget iproute bind-utils openssh-clients xz procps-ng util-linux -y && yum update -y
ADD app/Alibaba_Dragonwell_Extended_11.0.25.22.9_x64_linux.tar.gz /home/app
ADD app/spring-petclinic-2.4.0.BUILD-SNAPSHOT.jar /home/app
ADD app/run.sh /home/app
ADD app/takepid.sh /home/app
ENV JAVA_HOME=/home/app/jdk
RUN setcap cap_checkpoint_restore+eip /home/app/dragonwell-11.0.25.22+9-GA/lib/criu
RUN chmod +x /home/app/run.sh && chmod +x /home/app/takepid.sh
WORKDIR /home/app
USER root
ENTRYPOINT ["/home/app/run.sh"]
Experimental Results:
Two sets of experiments compare startup times with and without CRaC, and with and without ACS acceleration. The tables show that CRaC alone reduces crash‑restart time from ~29 s to ~0.176 s, while adding ACS acceleration reduces first‑start time from 29 s to as low as 7.33 s (with an 8× CPU factor) and crash‑restart time to ~0.136 s.
The results demonstrate that CRaC effectively eliminates most of the downtime after a crash, and ACS acceleration significantly cuts the initial startup latency.
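To put the reported timings in proportion, the short calculation below turns them into speedup factors. The durations are the ones quoted above; the class and method names are illustrative, and the ~29 s cold start is used as the common baseline for all three comparisons.

```java
import java.util.Locale;

final class StartupSpeedup {
    // Speedup factor: how many times faster the improved path is than the baseline.
    static double speedup(double beforeSeconds, double afterSeconds) {
        if (beforeSeconds <= 0 || afterSeconds <= 0) {
            throw new IllegalArgumentException("durations must be positive");
        }
        return beforeSeconds / afterSeconds;
    }

    private static void print(String label, double factor) {
        System.out.println(String.format(Locale.ROOT, "%-26s %.1fx", label, factor));
    }

    public static void main(String[] args) {
        double coldStart = 29.0; // seconds, as reported in the article
        print("CRaC crash-restart", speedup(coldStart, 0.176));
        print("ACS 8x first start", speedup(coldStart, 7.33));
        print("CRaC + ACS crash-restart", speedup(coldStart, 0.136));
    }
}
```

The ratios come out at roughly 165×, 4×, and 213×. If the 29 s baseline was measured at the unaccelerated CPU allocation (an assumption; the article does not say so explicitly), an 8× CPU factor yielding about a 4× faster first start suggests that startup does not scale linearly with CPU.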
Outlook:
CRaC’s checkpoint‑restore approach is promising for serverless, distributed, and cloud‑native systems, edge computing, and IoT, where rapid recovery and resource efficiency are critical. Combining flexible compute with CRaC can improve overall service availability and reduce cost by scaling resources dynamically.
Additional resources and links to Dragonwell releases, CRaC documentation, and ACS flexible compute guides are provided for further exploration.
Alibaba Cloud Infrastructure
For uninterrupted computing services