Automate Java OOM Heapdump Collection with a Kubernetes DaemonSet
This guide explains how to automatically capture Java OOM heapdump files using a DaemonSet that watches for heapdump.prof creation, compresses and uploads them to Alibaba Cloud OSS, and notifies developers via a WeChat bot, providing a scalable, non‑intrusive solution for memory‑leak diagnostics in Kubernetes environments.
1. Introduction
The heapdump file is a diagnostic report generated when a Java application encounters an Out‑Of‑Memory (OOM) error, capturing a snapshot of the JVM heap at a specific moment. Analyzing this file reveals which objects occupy memory and their reference relationships, which is essential for locating memory leaks.
Why use a DaemonSet
Previous approaches added dump scripts directly to application containers using the -XX:OnOutOfMemoryError JVM flag, which is invasive and can fail when the dump file is large. Deploying a DaemonSet that runs a heapdump-watcher allows non‑intrusive monitoring of heapdump.prof files across nodes.
Tip: This method applies when heapdump.prof is persisted to the K8s node. Treat the implementation as a reference and adapt it to your environment.
Prerequisites
heapdump.prof must be persisted on a K8s node.
Log directories should follow a consistent pattern, e.g., /mnt/logs/<APP_NAME>/logs/heapdump.prof (or include <POD_NAME> to avoid conflicts).
Alibaba Cloud OSS access permissions are required.
A functional Enterprise WeChat bot is needed.
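The consistent directory pattern matters because the watcher recovers the application name from the dump path. A minimal sketch, assuming the exact layout /mnt/logs/<APP_NAME>/logs/heapdump.prof; parseDumpPath is a hypothetical helper and not part of the original program, and a layout that adds a <POD_NAME> segment would need a different check:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// parseDumpPath reports whether p has the shape <root>/<APP_NAME>/logs/heapdump.prof
// and, if so, returns the <APP_NAME> segment. (Hypothetical helper.)
func parseDumpPath(root, p string) (appName string, ok bool) {
	rel, err := filepath.Rel(root, p)
	if err != nil {
		return "", false
	}
	parts := strings.Split(filepath.ToSlash(rel), "/")
	if len(parts) != 3 || parts[0] == ".." || parts[1] != "logs" || parts[2] != "heapdump.prof" {
		return "", false
	}
	return parts[0], true
}

func main() {
	name, ok := parseDumpPath("/mnt/logs", "/mnt/logs/order-service/logs/heapdump.prof")
	fmt.Println(name, ok)

	// A dump written directly under the root does not match the convention.
	name, ok = parseDumpPath("/mnt/logs", "/mnt/logs/heapdump.prof")
	fmt.Println(name, ok)
}
```

Rejecting paths that do not match the convention keeps unrelated files with the same name from being uploaded under a wrong application name.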
2. Overall Design
The JVM is started with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/heapdump.prof, so when an OOM event occurs it writes heapdump.prof under /mnt/logs. The file name must match the heapdump.prof suffix that the watcher looks for. The DaemonSet uses fsnotify to watch for this file, compresses it, uploads it to OSS, deletes the local copy, and sends a WeChat alert with a download link.
3. Detailed Implementation
(1) Initialization
<code>func init() {
	// Get environment
	env = getEnv("ENV", "prod")
	var err error
	watcher, err = fsnotify.NewWatcher()
	if err != nil {
		log.Fatalf("Failed to create fsnotify watcher: %v", err)
	}
	// Load configuration
	config, err = loadConfig(configPath)
	if err != nil {
		log.Fatalf("Failed to load config: %v", err)
	}
	// Initialize OSS client
	ossClient, err := oss.New(config.OSS.Endpoint, config.OSS.AccessID, config.OSS.AccessKey)
	if err != nil {
		log.Fatalf("Failed to create OSS client: %v", err)
	}
	// Fail fast on a bad bucket name instead of discarding the error
	client, err = ossClient.Bucket(config.OSS.Bucket)
	if err != nil {
		log.Fatalf("Failed to get OSS bucket: %v", err)
	}
	if config.WatchPods {
		// Initialize Kubernetes client
		kubeClient, err = createKubeClient()
		if err != nil {
			log.Fatalf("Failed to create Kubernetes client: %v", err)
		}
		// Get node IP
		nodeIP, err = getNodeIP()
		if err != nil {
			log.Fatalf("Failed to get node IP: %v", err)
		}
	}
	// Initialize signal channels
	signalChan = make(chan os.Signal, 1)
	stopChan = make(chan struct{})
	signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
}
</code>This code obtains the environment variable ENV, creates an fsnotify.Watcher, loads OSS and WeChat configuration, optionally creates a Kubernetes client and retrieves the node IP, and sets up signal handling for graceful shutdown.
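The initialization calls getEnv("ENV", "prod") without showing the helper. A minimal sketch of what such a helper might look like; only the call shape comes from the code above, and the body (including treating an empty value as unset) is an assumption:

```go
package main

import (
	"fmt"
	"os"
)

// getEnv returns the value of key, or fallback when the variable is unset or empty.
// Treating an empty value as unset is an assumption of this sketch.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

func main() {
	fmt.Println("environment:", getEnv("ENV", "prod"))
	fmt.Println("node:", getEnv("NODE_NAME", "unknown"))
}
```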
(2) File Watching
<code>func watchFiles() {
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&fsnotify.Create == fsnotify.Create {
				if strings.HasSuffix(event.Name, "heapdump.prof") {
					log.Printf("New heapdump file detected: %s", event.Name)
					if err := waitForFileCompletion(event.Name); err != nil {
						log.Printf("Failed to wait for file completion: %v", err)
						continue
					}
					appName := filepath.Base(filepath.Dir(filepath.Dir(event.Name)))
					err := uploadFileToOSS(event.Name, appName)
					if err != nil {
						log.Printf("Failed to upload file to OSS: %v", err)
					} else {
						log.Printf("File uploaded to OSS successfully: %s", event.Name)
						if err = sendWechatAlert(appName); err != nil {
							log.Printf("Failed to send WeChat alert: %v", err)
						}
					}
				}
			}
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Printf("Error: %v", err)
		}
	}
}
</code>The function continuously monitors filesystem events, detects creation of heapdump.prof, waits for the file to be fully written, uploads it to OSS, and sends a WeChat notification.
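The body of waitForFileCompletion is not shown. The Create event fires as soon as the JVM opens the file, long before a large dump is complete, so some form of waiting is needed. One common approach, sketched here as an assumption rather than the author's implementation, is to poll the file size until it stops changing. The names waitForStableSize, fileSize, and errNotStable, and the interval and timeout values, are all hypothetical:

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// errNotStable is returned when the size keeps changing until the timeout.
var errNotStable = errors.New("file size did not stabilize before timeout")

// waitForStableSize polls size() and returns nil once a non-zero size has
// stayed unchanged across stableChecks consecutive polls.
func waitForStableSize(size func() (int64, error), interval time.Duration, stableChecks int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	var last int64 = -1
	stable := 0
	for {
		cur, err := size()
		if err != nil {
			return err
		}
		if cur > 0 && cur == last {
			stable++
			if stable >= stableChecks {
				return nil
			}
		} else {
			stable = 0
		}
		last = cur
		if time.Now().After(deadline) {
			return errNotStable
		}
		time.Sleep(interval)
	}
}

// fileSize adapts a path to the size callback used above.
func fileSize(path string) func() (int64, error) {
	return func() (int64, error) {
		info, err := os.Stat(path)
		if err != nil {
			return 0, err
		}
		return info.Size(), nil
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: waitstable <path>")
		return
	}
	err := waitForStableSize(fileSize(os.Args[1]), 2*time.Second, 3, 10*time.Minute)
	fmt.Println("result:", err)
}
```

Size polling is a heuristic: a writer that pauses longer than the polling window would be mistaken for a finished dump, so the interval and check count should be generous for multi-gigabyte heaps.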
(3) Pod Watching
<code>func watchPods() {
	// Note: spec.nodeName matches the node's name, so nodeIP must hold the
	// value under which the node is registered in the cluster.
	for _, appName := range config.Whitelist {
		pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
			LabelSelector: fmt.Sprintf("app=%s", appName),
			FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeIP),
		})
		if err != nil {
			log.Printf("Failed to list pods for app %s: %v", appName, err)
			continue
		}
		for _, pod := range pods.Items {
			addPodWatch(appName, pod.Name)
		}
	}
	_, controller := cache.NewInformer(
		&cache.ListWatch{
			ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
				options.FieldSelector = fmt.Sprintf("spec.nodeName=%s", nodeIP)
				return kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), options)
			},
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				options.FieldSelector = fmt.Sprintf("spec.nodeName=%s", nodeIP)
				return kubeClient.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), options)
			},
		},
		&corev1.Pod{},
		0,
		cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) {
				pod := obj.(*corev1.Pod)
				appName := pod.Labels["app"]
				if isWhitelisted(appName) {
					log.Printf("Pod added: %s/%s", pod.Namespace, pod.Name)
					addPodWatch(appName, pod.Name)
				}
			},
			DeleteFunc: func(obj interface{}) {
				pod, ok := obj.(*corev1.Pod)
				if !ok {
					// A missed delete event arrives as a tombstone, not a *Pod
					tombstone, isTombstone := obj.(cache.DeletedFinalStateUnknown)
					if !isTombstone {
						return
					}
					if pod, ok = tombstone.Obj.(*corev1.Pod); !ok {
						return
					}
				}
				appName := pod.Labels["app"]
				if isWhitelisted(appName) {
					log.Printf("Pod deleted: %s/%s", pod.Namespace, pod.Name)
					removePodWatch(appName, pod.Name)
				}
			},
		},
	)
	controller.Run(stopChan)
}
</code>This code lists pods for whitelisted applications on the current node, adds file watches for each pod, and uses a Kubernetes informer to handle pod add/delete events, updating the watches accordingly.
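The isWhitelisted helper used by the event handlers is not shown. A minimal sketch of one way to implement the lookup, assuming the whitelist is the list of application names from the configuration; the whitelist type and its methods are hypothetical names, not the original code:

```go
package main

import "fmt"

// whitelist is a set of application names built once from the config list,
// so each pod event costs a single map lookup. (Hypothetical type.)
type whitelist map[string]struct{}

// newWhitelist builds the set, skipping empty names.
func newWhitelist(apps []string) whitelist {
	w := make(whitelist, len(apps))
	for _, app := range apps {
		if app != "" {
			w[app] = struct{}{}
		}
	}
	return w
}

// contains reports whether app is whitelisted. A pod without an "app" label
// yields the empty string, which is never in the set.
func (w whitelist) contains(app string) bool {
	_, ok := w[app]
	return ok
}

func main() {
	w := newWhitelist([]string{"app1", "app2", "app3"})
	fmt.Println(w.contains("app1"))
	fmt.Println(w.contains("other"))
	fmt.Println(w.contains(""))
}
```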
(4) File Upload
<code>func uploadFileToOSS(filePath string, appName string) error {
	file, err := os.Open(filePath)
	if err != nil { return err }
	defer file.Close()
	// Compress into a temporary zip that is removed when the function returns
	tempFile, err := os.CreateTemp("", "heapdump-*.zip")
	if err != nil { return err }
	defer tempFile.Close()
	defer os.Remove(tempFile.Name())
	zipWriter := zip.NewWriter(tempFile)
	zipFileWriter, err := zipWriter.Create(filepath.Base(filePath))
	if err != nil { return err }
	if _, err = io.Copy(zipFileWriter, file); err != nil { return err }
	// Close once, before uploading, so the zip central directory is written
	if err = zipWriter.Close(); err != nil { return err }
	// Rewind so PutObject reads the archive from the beginning
	if _, err = tempFile.Seek(0, io.SeekStart); err != nil { return err }
	timestamp := time.Now().Format("20060102150405")
	objectName := fmt.Sprintf("heapdump/%s/heapdump_%s.zip", appName, timestamp)
	expires := time.Now().Add(24 * time.Hour)
	// oss.Expires sets the object's Expires header; it does not delete the object
	options := []oss.Option{oss.Expires(expires)}
	if err = client.PutObject(objectName, tempFile, options...); err != nil { return err }
	// Presigned download URL, valid for 24 hours (value is in seconds)
	ossURL, err = client.SignURL(objectName, oss.HTTPGet, int64((24 * time.Hour).Seconds()))
	if err != nil {
		// Return the error instead of log.Fatalf so one failure does not kill the watcher
		return fmt.Errorf("failed to generate presigned URL: %w", err)
	}
	log.Printf("Deleting local file: %s", filePath)
	if err := os.Remove(filePath); err != nil {
		log.Printf("Failed to delete local file: %v", err)
	}
	return nil
}
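</code>The function compresses the heapdump.prof file into a zip archive, uploads it to OSS, generates a presigned download URL that is valid for 24 hours, and removes the local file. The oss.Expires option only sets the object's Expires header, so removing the object itself after one day depends on a bucket lifecycle rule configured separately.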
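The compression step can be exercised on its own, without OSS. The following sketch uses only the standard library and an in-memory buffer to show why the zip writer must be closed before the archive is read or uploaded; compressToZip, zipBytes, and firstEntry are hypothetical helper names, not functions from the original program:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// compressToZip streams src into dst as a single-entry zip archive.
func compressToZip(dst io.Writer, src io.Reader, entryName string) error {
	zw := zip.NewWriter(dst)
	w, err := zw.Create(entryName)
	if err != nil {
		return err
	}
	if _, err := io.Copy(w, src); err != nil {
		return err
	}
	// Close writes the central directory; without it the archive is unreadable.
	return zw.Close()
}

// zipBytes compresses data in memory (demo helper).
func zipBytes(entryName string, data []byte) ([]byte, error) {
	var buf bytes.Buffer
	if err := compressToZip(&buf, bytes.NewReader(data), entryName); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// firstEntry reads back the first entry of an archive (demo helper).
func firstEntry(archive []byte) (string, []byte, error) {
	zr, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", nil, err
	}
	if len(zr.File) == 0 {
		return "", nil, fmt.Errorf("archive has no entries")
	}
	rc, err := zr.File[0].Open()
	if err != nil {
		return "", nil, err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	return zr.File[0].Name, data, err
}

func main() {
	// Highly repetitive input stands in for a heap dump here.
	archive, err := zipBytes("heapdump.prof", bytes.Repeat([]byte("A"), 1<<20))
	if err != nil {
		fmt.Println("compress failed:", err)
		return
	}
	name, data, err := firstEntry(archive)
	if err != nil {
		fmt.Println("read back failed:", err)
		return
	}
	fmt.Printf("entry=%s original=%d bytes archive=%d bytes\n", name, len(data), len(archive))
}
```

Because compressToZip accepts any io.Writer, the same function could write to a temporary file, as the upload code does, which avoids holding a multi-gigabyte dump in memory.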
(5) Notification
<code>func sendWechatAlert(appName string) error {
	// Use an interpreted (double-quoted) string: in a raw backtick string,
	// \n would be sent as a literal backslash followed by n
	markdownContent := fmt.Sprintf("# JAVA OOM DUMP FILE GENERATED\n> Application: %s\n> Environment: %s\n> File: [Download Link](%s)\n> *Tips*: File is retained for 1 day, please download promptly", appName, env, ossURL)
	payload := map[string]interface{}{
		"msgtype":  "markdown",
		"markdown": map[string]string{"content": markdownContent},
	}
	_, body, errs := gorequest.New().Post(config.Wechat.WebhookURL).Send(payload).End()
	if len(errs) > 0 {
		return fmt.Errorf("failed to send WeChat alert: %v", errs)
	}
	log.Printf("WeChat alert response: %s", body)
	return nil
}
</code>This step posts a markdown message to the configured Enterprise WeChat webhook, informing developers of the new heapdump file and providing a download link.
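For readers who prefer to avoid a third-party HTTP client, the same message can be sent with the standard library. This is a sketch under assumptions, not the original implementation: alertMarkdown, alertPayload, and postAlert are hypothetical names, the example URLs are placeholders, and the payload shape simply mirrors the msgtype/markdown structure used above. It only checks the HTTP status; a webhook may also report failures in its response body, which this sketch does not parse.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// alertMarkdown builds the message body with real newlines between lines.
func alertMarkdown(appName, env, url string) string {
	lines := []string{
		"# JAVA OOM DUMP FILE GENERATED",
		"> Application: " + appName,
		"> Environment: " + env,
		"> File: [Download Link](" + url + ")",
	}
	return strings.Join(lines, "\n")
}

// alertPayload wraps the markdown content in the JSON structure shown above.
func alertPayload(content string) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"msgtype":  "markdown",
		"markdown": map[string]string{"content": content},
	})
}

// postAlert sends the payload and treats any non-200 status as an error.
func postAlert(webhookURL, content string) error {
	body, err := alertPayload(content)
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Post(webhookURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("webhook returned HTTP %d", resp.StatusCode)
	}
	return nil
}

func main() {
	content := alertMarkdown("app1", "prod", "https://example.com/heapdump.zip")
	payload, err := alertPayload(content)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	// Print instead of posting so the sketch runs without a real webhook.
	fmt.Println(string(payload))
}
```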
4. Deployment Verification
(1) Build Docker Image
<code>FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /heapdump-watcher

FROM alpine:3.18
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=builder /heapdump-watcher ./heapdump-watcher
# The binary is copied to /app/heapdump-watcher, so run it from there
CMD ["/app/heapdump-watcher"]
</code>
(2) Deploy to Kubernetes
<code>apiVersion: v1
kind: ServiceAccount
metadata:
  name: heapdump-watcher
  namespace: default
---
# ClusterRole rather than Role: the watcher lists pods across all namespaces
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: heapdump-watcher-role
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
# Without a binding the rules above have no effect (binding name is illustrative)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: heapdump-watcher-binding
subjects:
  - kind: ServiceAccount
    name: heapdump-watcher
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: heapdump-watcher-role
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: heapdump-config
  namespace: default
data:
  config.yaml: |
    oss:
      endpoint: your-oss-endpoint
      bucket: your-oss-bucket
      accessID: your-oss-access-id
      accessKey: your-oss-access-key
    wechat:
      webhookURL: your-wechat-webhook-url
    whitelist:
      - app1
      - app2
      - app3
    watchPods: false
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heapdump-watcher
  namespace: default
spec:
  selector:
    matchLabels:
      app: heapdump-watcher
  template:
    metadata:
      labels:
        app: heapdump-watcher
    spec:
      serviceAccountName: heapdump-watcher
      containers:
        - name: heapdump-watcher
          image: your-docker-image:latest
          volumeMounts:
            - name: logs
              mountPath: /mnt/logs
              readOnly: false
            - name: config
              mountPath: /app/config.yaml
              subPath: config.yaml
              readOnly: true
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: ENV
              value: prod
      volumes:
        - name: logs
          hostPath:
            path: /mnt/logs
            type: Directory
        - name: config
          configMap:
            name: heapdump-config
            items:
              - key: config.yaml
                path: config.yaml
</code>
(3) Validation
When an OOM dump is generated, a notification appears in the configured Enterprise WeChat group, confirming successful upload.
5. Conclusion
The current implementation provides a functional baseline for automated heapdump collection, but it can be extended to support additional cloud storage providers (e.g., Tencent COS, AWS S3), richer notification content (memory usage peaks, OOM timestamps), and other communication tools such as DingTalk or Feishu. Adding automated analysis of uploaded heapdumps to extract suspected leaking objects would further accelerate debugging of Java memory issues.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.