How to Build a Kubernetes Fault‑Diagnosis CLI with AI‑Powered Insights
This article extends the K8s Chat command-line tool with an analyze event command that gathers warning-level events and the corresponding pod logs, stores them in a map, and sends the information to a large language model via OpenAI's API to receive actionable troubleshooting recommendations. It closes with further enhancement ideas such as self-healing and visualization.
Hi everyone, I'm Joker, a cloud-native enthusiast and operations engineer.
Author: Joker | Public account: Operations Development Stories | Blog: https://jokerbai.com
✍ Safety first: improper operations cause tears for operators.
In the previous article we built the K8s Chat CLI tool for basic interactions with Kubernetes. In this article we extend that tool with a Kubernetes fault-diagnosis feature whose main functions are:
Fetch cluster events, focusing on warning‑level events.
Retrieve logs from the corresponding pods.
Use a large model to analyze events and logs and suggest remediation.
These ideas are just a starting point; you can expand them as needed.
Development Process
(1) First, use cobra-cli to add an analyze command:

cobra-cli add analyze

(2) Then add an event sub-command under analyze for event analysis:

cobra-cli add event -p 'analyzeCmd'

(3) Design a getPodEventsAndLogs function to obtain Kubernetes events and logs.
// int64Ptr returns a pointer to an int64 value.
func int64Ptr(i int64) *int64 { return &i }

func getPodEventsAndLogs() (map[string][]string, error) {
	// Create the Kubernetes client.
	clientGo, err := utils.NewClientGo(kubeconfig)
	if err != nil {
		return nil, fmt.Errorf("failed to create Kubernetes client: %v", err)
	}
	result := make(map[string][]string)
	processedPods := make(map[string]bool)
	// Fetch warning-level events across all namespaces.
	events, err := clientGo.Clientset.CoreV1().Events("").List(context.TODO(), metav1.ListOptions{FieldSelector: "type=Warning"})
	if err != nil {
		return nil, fmt.Errorf("failed to get events: %v", err)
	}
	for _, event := range events.Items {
		if event.InvolvedObject.Kind != "Pod" {
			continue
		}
		podName := event.InvolvedObject.Name
		namespace := event.InvolvedObject.Namespace
		message := event.Message
		podKey := fmt.Sprintf("%s/%s", namespace, podName)
		// Logs are fetched only once per pod; later events just append.
		if processedPods[podKey] {
			result[podKey] = append(result[podKey], fmt.Sprintf("Additional Event: %s", message))
			continue
		}
		processedPods[podKey] = true
		// Fetch pod logs, limited to the last 100 lines and 1 MiB.
		logOptions := &corev1.PodLogOptions{TailLines: int64Ptr(100), LimitBytes: int64Ptr(1024 * 1024)}
		req := clientGo.Clientset.CoreV1().Pods(namespace).GetLogs(podName, logOptions)
		podLogs, err := req.Stream(context.TODO())
		if err != nil {
			result[podKey] = append(result[podKey], fmt.Sprintf("Event Message: %s", message))
			result[podKey] = append(result[podKey], fmt.Sprintf("Namespace: %s", namespace))
			result[podKey] = append(result[podKey], fmt.Sprintf("Log fetch failed: %v", err))
			continue
		}
		func() {
			defer podLogs.Close()
			buf := new(bytes.Buffer)
			if _, err := buf.ReadFrom(podLogs); err != nil {
				result[podKey] = append(result[podKey], fmt.Sprintf("Event Message: %s", message))
				result[podKey] = append(result[podKey], fmt.Sprintf("Namespace: %s", namespace))
				result[podKey] = append(result[podKey], fmt.Sprintf("Log read failed: %v", err))
				return
			}
			result[podKey] = append(result[podKey], fmt.Sprintf("Event Message: %s", message))
			result[podKey] = append(result[podKey], fmt.Sprintf("Namespace: %s", namespace))
			result[podKey] = append(result[podKey], fmt.Sprintf("Logs:\n%s", buf.String()))
		}()
	}
	return result, nil
}

(4) Implement sendToChatGPT to pass pod events and logs to an AI model for analysis.
// sendToChatGPT sends podInfo to OpenAI ChatGPT and returns diagnostic suggestions
func sendToChatGPT(podInfo map[string][]string) (string, error) {
	if len(podInfo) == 0 {
		return "No Pod warning events found", nil
	}
	client, err := utils.NewOpenAIClient()
	if err != nil {
		return "", fmt.Errorf("failed to create OpenAI client: %v", err)
	}
	combinedInfo := buildPodInfoString(podInfo)
	fmt.Printf("Analyzing %d Pods...\n", len(podInfo))
	fmt.Println("Details:")
	fmt.Println(combinedInfo)
	messages := []openai.ChatCompletionMessage{
		{
			Role: openai.ChatMessageRoleSystem,
			Content: `You are a senior Kubernetes expert tasked with:
1. Analyzing Pod warning events and logs
2. Identifying root causes
3. Providing concrete, actionable solutions
4. Prioritizing CLI commands, with YAML examples if needed
5. Ranking suggestions by severity`,
		},
		{
			Role: openai.ChatMessageRoleUser,
			Content: fmt.Sprintf(`Please analyze the following Kubernetes Pod issues:
%s
Provide:
1. Root-cause analysis
2. Step-by-step remediation (prefer kubectl commands)
3. Preventive measures
4. YAML configuration if required`, combinedInfo),
		},
	}
	resp, err := client.Client.CreateChatCompletion(context.TODO(), openai.ChatCompletionRequest{
		Model:       openai.GPT4oMini,
		Messages:    messages,
		MaxTokens:   2000,
		Temperature: 0.1,
	})
	if err != nil {
		return "", fmt.Errorf("OpenAI API call failed: %v", err)
	}
	if len(resp.Choices) == 0 {
		return "", fmt.Errorf("OpenAI returned empty response")
	}
	responseText := resp.Choices[0].Message.Content
	if responseText == "" {
		return "AI analysis completed but returned no suggestions", nil
	}
	return responseText, nil
}
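One practical caveat: logs from many pods can exceed the model's context window long before MaxTokens matters. A small helper along these lines (hypothetical, not part of the tool above) can cap the combined prompt before it is sent; cutting on runes avoids splitting multi-byte characters:

```go
package main

import "fmt"

// truncateForPrompt caps s at maxRunes runes, appending a marker when
// content was dropped, so the prompt stays within a rough size budget.
func truncateForPrompt(s string, maxRunes int) string {
	runes := []rune(s)
	if len(runes) <= maxRunes {
		return s
	}
	return string(runes[:maxRunes]) + "\n...[truncated]"
}

func main() {
	fmt.Println(truncateForPrompt("short log", 100)) // returned unchanged
	fmt.Println(truncateForPrompt("aaaaaaaaaa", 4))  // capped with a marker
}
```

In sendToChatGPT this would wrap combinedInfo before the user message is built; a rune budget of a few thousand is a crude but workable stand-in for real token counting.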
func buildPodInfoString(podInfo map[string][]string) string {
	var builder strings.Builder
	builder.WriteString("Found the following Pod warning events and logs:\n")
	podCount := 0
	for podKey, info := range podInfo {
		podCount++
		builder.WriteString(fmt.Sprintf("=== Pod %d: %s ===\n", podCount, podKey))
		for _, line := range info {
			switch {
			case strings.HasPrefix(line, "Event Message:"):
				builder.WriteString(fmt.Sprintf("🚨 %s\n", line))
			case strings.HasPrefix(line, "Namespace:"):
				builder.WriteString(fmt.Sprintf("📍 %s\n", line))
			case strings.HasPrefix(line, "Logs:"):
				builder.WriteString(fmt.Sprintf("📋 %s\n", line))
			case strings.HasPrefix(line, "Additional Event:"):
				builder.WriteString(fmt.Sprintf("🔄 %s\n", line))
			default:
				builder.WriteString(fmt.Sprintf("%s\n", line))
			}
		}
		builder.WriteString("\n")
	}
	return builder.String()
}

(5) Run the analysis with the compiled binary:

k8scopilot.exe analyze event

Sample AI analysis output:
Requesting AI analysis...
Based on the Kubernetes Pod logs and event information you provided, here is a root-cause analysis, remediation steps, preventive recommendations, and the necessary YAML configuration examples.
### 1. Root-Cause Analysis
- Insufficient resources: multiple Pods report insufficient `ephemeral-storage`, preventing them from running or restarting normally; nodes show a `DiskPressure` condition.
- Network connectivity issues: readiness and liveness probes fail with connection timeouts or refusals.
- Dependency-injection failure: the Dubbo service of a Spring Boot application is misconfigured.
- Image pull failure: `ImagePullBackOff` occurs, possibly due to a missing image or a permissions problem.
### 2. Remediation Steps
#### 2.1 Insufficient resources
- Clean up unnecessary system Pods: `kubectl delete pod --all -n kube-system`
- Expand node storage or add new nodes.
- Adjust the Pod's `ephemeral-storage` requests and limits, e.g. run `kubectl edit deployment <name> -n <ns>` and set `resources.requests.ephemeral-storage: "100Mi"` and `resources.limits.ephemeral-storage: "200Mi"`.
#### 2.2 Network connectivity
- Check that services start on the expected ports: `kubectl get pods -n <ns>`, `kubectl logs <pod> -n <ns>`.
- Increase probe timeouts, e.g. set `initialDelaySeconds: 30` and `timeoutSeconds: 5` in the `readinessProbe`.
#### 2.3 Dependency injection
- Verify the Dubbo registry configuration, e.g. `dubbo.registry.address: "zookeeper://localhost:2181"`.
- Restart the affected Deployment: `kubectl rollout restart deployment <name> -n <ns>`.
#### 2.4 Image pulls
- Confirm the image exists in the registry and is accessible.
- If the tag is wrong, update it with `kubectl set image deployment/<name> <container>=<new-image>:<tag> -n <ns>`.
### 3. Preventive Measures
- Monitor resource usage with Prometheus + Grafana.
- Set reasonable resource requests and limits for Pods to avoid contention.
- Periodically clean up unused Pods and images to free storage.

Further possible extensions include:
Self-healing: integrate function calling to automatically remediate simple issues.
Enhanced analysis: ingest additional data sources such as metrics and node status, and build a historical knowledge base.
Visualization: generate HTML diagnostic reports and display severity-graded results.
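For the self-healing idea, the model would be offered a small set of callable tools, and the CLI would translate the model's tool call into a kubectl action. The OpenAI plumbing is omitted here; the sketch below (tool names and argument schema are illustrative, not from the original tool) shows only the local dispatch step that maps a tool call to a concrete command, which you would gate behind an operator confirmation before executing:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCall mirrors the name/arguments pair returned by a function-calling
// model; arguments arrive as a JSON-encoded string.
type toolCall struct {
	Name      string
	Arguments string
}

// dispatch turns a model tool call into the kubectl command the operator
// would review and run. Unknown tools are rejected rather than guessed at.
func dispatch(tc toolCall) (string, error) {
	var args struct {
		Namespace  string `json:"namespace"`
		Deployment string `json:"deployment"`
	}
	if err := json.Unmarshal([]byte(tc.Arguments), &args); err != nil {
		return "", fmt.Errorf("bad tool arguments: %w", err)
	}
	switch tc.Name {
	case "restart_deployment":
		return fmt.Sprintf("kubectl rollout restart deployment %s -n %s", args.Deployment, args.Namespace), nil
	case "describe_deployment":
		return fmt.Sprintf("kubectl describe deployment %s -n %s", args.Deployment, args.Namespace), nil
	default:
		return "", fmt.Errorf("unknown tool %q", tc.Name)
	}
}

func main() {
	cmd, err := dispatch(toolCall{
		Name:      "restart_deployment",
		Arguments: `{"namespace":"default","deployment":"web"}`,
	})
	fmt.Println(cmd, err)
}
```

Whitelisting the actions in code, rather than letting the model emit arbitrary shell, keeps the "safety first" principle intact even when remediation is automated.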
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.