Cloud Native 19 min read

Automate Java OOM Heapdump Collection with a Kubernetes DaemonSet

This guide explains how to automatically capture Java OOM heapdump files using a DaemonSet that watches for heapdump.prof creation, compresses and uploads them to Alibaba Cloud OSS, and notifies developers via a WeChat bot, providing a scalable, non‑intrusive solution for memory‑leak diagnostics in Kubernetes environments.

Ops Development Stories
Ops Development Stories
Ops Development Stories
Automate Java OOM Heapdump Collection with a Kubernetes DaemonSet

1. Introduction

The

heapdump

file is a diagnostic report generated when a Java application encounters an Out‑Of‑Memory (OOM) error, capturing a snapshot of the JVM heap at a specific moment. Analyzing this file reveals which objects occupy memory and their reference relationships, which is essential for locating memory leaks.

Why use a DaemonSet

Previous approaches added dump scripts directly to application containers using the

-XX:OnOutOfMemoryError

JVM flag, which is invasive and can fail when the dump file is large. Deploying a DaemonSet running a

heapdump-watcher

allows non‑intrusive monitoring of

heapdump.prof

files across nodes.

Tip: This method is suitable for persisting heapdump.prof to K8s nodes, but it serves mainly as a reference.

Prerequisites

heapdump.prof

must be persisted on a K8s node.

Log directories should follow a consistent pattern, e.g.,

/mnt/logs/<APP_NAME>/logs/heapdump.prof

(or include

<POD_NAME>

to avoid conflicts).

Alibaba Cloud OSS access permissions are required.

A functional Enterprise WeChat bot is needed.

2. Overall Design

When an OOM event occurs, the JVM is started with

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/heapdump.hprof

, generating

heapdump.prof

under

/mnt/logs

. The DaemonSet uses

fsnotify

to watch for this file, compresses it, uploads it to OSS, deletes the local copy, and sends a WeChat alert with a download link.

3. Detailed Implementation

(1) Initialization

<code>func init() {
    // Get environment
    env = getEnv("ENV", "prod")
    var err error
    watcher, err = fsnotify.NewWatcher()
    if err != nil {
        log.Fatalf("Failed to create fsnotify watcher: %v", err)
    }
    // Load configuration
    config, err = loadConfig(configPath)
    if err != nil {
        log.Fatalf("Failed to load config: %v", err)
    }
    // Initialize OSS client
    ossClient, err := oss.New(config.OSS.Endpoint, config.OSS.AccessID, config.OSS.AccessKey)
    if err != nil {
        log.Fatalf("Failed to create OSS client: %v", err)
    }
    client, _ = ossClient.Bucket(config.OSS.Bucket)
    if config.WatchPods {
        // Initialize Kubernetes client
        kubeClient, err = createKubeClient()
        if err != nil {
            log.Fatalf("Failed to create Kubernetes client: %v", err)
        }
        // Get node IP
        nodeIP, err = getNodeIP()
        if err != nil {
            log.Fatalf("Failed to get node IP: %v", err)
        }
    }
    // Initialize signal channels
    signalChan = make(chan os.Signal, 1)
    stopChan = make(chan struct{})
    signal.Notify(signalChan, syscall.SIGINT, syscall.SIGTERM)
}
</code>

This code obtains the environment variable

ENV

, creates an

fsnotify.Watcher

, loads OSS and WeChat configuration, optionally creates a Kubernetes client, retrieves the node IP, and sets up signal handling for graceful shutdown.

(2) File Watching

<code>func watchFiles() {
    for {
        select {
        case event, ok := <-watcher.Events:
            if !ok { return }
            if event.Op&fsnotify.Create == fsnotify.Create {
                if strings.HasSuffix(event.Name, "heapdump.prof") {
                    log.Printf("New heapdump file detected: %s", event.Name)
                    if err := waitForFileCompletion(event.Name); err != nil {
                        log.Printf("Failed to wait for file completion: %v", err)
                        continue
                    }
                    appName := filepath.Base(filepath.Dir(filepath.Dir(event.Name)))
                    err := uploadFileToOSS(event.Name, appName)
                    if err != nil {
                        log.Printf("Failed to upload file to OSS: %v", err)
                    } else {
                        log.Printf("File uploaded to OSS successfully: %s", event.Name)
                        if err = sendWechatAlert(appName); err != nil {
                            log.Printf("Failed to send WeChat alert: %v", err)
                        }
                    }
                }
            }
        case err, ok := <-watcher.Errors:
            if !ok { return }
            log.Printf("Error: %v", err)
        }
    }
}
</code>

The function continuously monitors filesystem events, detects creation of

heapdump.prof

, waits for the file to be fully written, uploads it to OSS, and sends a WeChat notification.

(3) Pod Watching

<code>func watchPods() {
    for _, appName := range config.Whitelist {
        pods, err := kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
            LabelSelector: fmt.Sprintf("app=%s", appName),
            FieldSelector: fmt.Sprintf("spec.nodeName=%s", nodeIP),
        })
        if err != nil {
            log.Printf("Failed to list pods for app %s: %v", appName, err)
            continue
        }
        for _, pod := range pods.Items {
            addPodWatch(appName, pod.Name)
        }
    }
    _, controller := cache.NewInformer(
        &cache.ListWatch{
            ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                options.FieldSelector = fmt.Sprintf("spec.nodeName=%s", nodeIP)
                return kubeClient.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), options)
            },
            WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                options.FieldSelector = fmt.Sprintf("spec.nodeName=%s", nodeIP)
                return kubeClient.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), options)
            },
        },
        &corev1.Pod{},
        0,
        cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                appName := pod.Labels["app"]
                if isWhitelisted(appName) {
                    log.Printf("Pod added: %s/%s", pod.Namespace, pod.Name)
                    addPodWatch(appName, pod.Name)
                }
            },
            DeleteFunc: func(obj interface{}) {
                pod := obj.(*corev1.Pod)
                appName := pod.Labels["app"]
                if isWhitelisted(appName) {
                    log.Printf("Pod deleted: %s/%s", pod.Namespace, pod.Name)
                    removePodWatch(appName, pod.Name)
                }
            },
        },
    )
    controller.Run(stopChan)
}
</code>

This code lists pods for whitelisted applications on the current node, adds file watches for each pod, and uses a Kubernetes informer to handle pod add/delete events, updating the watches accordingly.

(4) File Upload

<code>func uploadFileToOSS(filePath string, appName string) error {
    file, err := os.Open(filePath)
    if err != nil { return err }
    defer file.Close()
    tempFile, err := os.CreateTemp("", "heapdump-*.zip")
    if err != nil { return err }
    defer tempFile.Close()
    defer os.Remove(tempFile.Name())
    zipWriter := zip.NewWriter(tempFile)
    defer zipWriter.Close()
    zipFileWriter, err := zipWriter.Create(filepath.Base(filePath))
    if err != nil { return err }
    if _, err = io.Copy(zipFileWriter, file); err != nil { return err }
    if err = zipWriter.Close(); err != nil { return err }
    tempFile.Seek(0, 0)
    tempFileReader := io.Reader(tempFile)
    timestamp := time.Now().Format("20060102150405")
    objectName := fmt.Sprintf("heapdump/%s/heapdump_%s.zip", appName, timestamp)
    expires := time.Now().Add(24 * time.Hour)
    options := []oss.Option{oss.Expires(expires)}
    if err = client.PutObject(objectName, tempFileReader, options...); err != nil { return err }
    ossURL, err = client.SignURL(objectName, oss.HTTPGet, expires.Unix()-time.Now().Unix())
    if err != nil { log.Fatalf("Failed to generate presigned URL: %v", err) }
    log.Printf("Deleting local file: %s", filePath)
    if err := os.Remove(filePath); err != nil {
        log.Printf("Failed to delete local file: %v", err)
    }
    return nil
}
</code>

The function compresses the

heapdump.prof

file into a zip archive, uploads it to OSS with a 24‑hour expiration, generates a presigned download URL, and removes the local file.

(5) Notification

<code>func sendWechatAlert(appName string) error {
    markdownContent := fmt.Sprintf(`# JAVA OOM DUMP FILE GENERATED\n> Application: %s\n> Environment: %s\n> File: [Download Link](%s)\n> *Tips*: File is retained for 1 day, please download promptly`, appName, env, ossURL)
    payload := map[string]interface{}{
        "msgtype": "markdown",
        "markdown": map[string]string{"content": markdownContent},
    }
    _, body, errs := gorequest.New().Post(config.Wechat.WebhookURL).Send(payload).End()
    if errs != nil { return fmt.Errorf("failed to send WeChat alert: %v", errs) }
    log.Printf("WeChat alert response: %s", body)
    return nil
}
</code>

This step posts a markdown message to the configured Enterprise WeChat webhook, informing developers of the new heapdump file and providing a download link.

4. Deployment Verification

(1) Build Docker Image

<code>FROM golang:1.21-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 GOOS=linux go build -o /heapdump-watcher

FROM alpine:3.18
RUN apk add --no-cache ca-certificates
WORKDIR /app
COPY --from=builder /heapdump-watcher ./heapdump-watcher
CMD ["/heapdump-watcher"]
</code>

(2) Deploy to Kubernetes

<code>apiVersion: v1
kind: ServiceAccount
metadata:
  name: heapdump-watcher
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: heapdump-watcher-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get","list","watch"]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: heapdump-config
  namespace: default
data:
  config.yaml: |
    oss:
      endpoint: your-oss-endpoint
      bucket: your-oss-bucket
      accessID: your-oss-access-id
      accessKey: your-oss-access-key
    wechat:
      webhookURL: your-wechat-webhook-url
    whitelist:
      - app1
      - app2
      - app3
    watchPods: false
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: heapdump-watcher
  namespace: default
spec:
  selector:
    matchLabels:
      app: heapdump-watcher
  template:
    metadata:
      labels:
        app: heapdump-watcher
    spec:
      serviceAccountName: heapdump-watcher
      containers:
      - name: heapdump-watcher
        image: your-docker-image:latest
        volumeMounts:
        - name: logs
          mountPath: /mnt/logs
          readOnly: false
        - name: config
          mountPath: /app/config.yaml
          subPath: config.yaml
          readOnly: true
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: ENV
          value: prod
      volumes:
      - name: logs
        hostPath:
          path: /mnt/logs
          type: Directory
      - name: config
        configMap:
          name: heapdump-config
          items:
          - key: config.yaml
            path: config.yaml
</code>

(3) Validation

When an OOM dump is generated, a notification appears in the configured Enterprise WeChat group, confirming successful upload.

5. Conclusion

The current implementation provides a functional baseline for automated heapdump collection, but it can be extended to support additional cloud storage providers (e.g., Tencent COS, AWS S3), richer notification content (memory usage peaks, OOM timestamps), and other communication tools such as DingTalk or Feishu. Adding automated analysis of uploaded heapdumps to extract suspected leaking objects would further accelerate debugging of Java memory issues.

Cloud NativekubernetesGooomDaemonSetheapdump
Ops Development Stories
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.