How Alibaba Cloud’s Flink Advisor Transforms Real‑Time Log Diagnosis
Alibaba Cloud's Flink Intelligent Diagnosis (Advisor) combines a real‑time data warehouse with log‑clustering and decision‑tree algorithms to analyze error logs automatically, diagnose job anomalies, and suggest optimizations, reducing manual support tickets and improving the user experience across Flink managed services.
01 Introduction
Alibaba Cloud Real‑Time Compute Flink is a high‑performance real‑time big‑data processing system that supports scenarios such as real‑time data warehousing, risk control, and real‑time machine learning. As usage grows, users run into difficulties with complex error‑log analysis, task failure handling, and performance tuning.
Because support for error‑log analysis and end‑to‑end anomaly diagnosis has been limited, self‑service bots cannot intercept many of these problems. Users are then forced to submit tickets, which increases the workload of operations teams.
To address these issues, Alibaba Cloud designed the Flink Intelligent Diagnosis (Advisor) tool. It provides precise error diagnosis and optimization suggestions throughout the lifecycle of Flink managed services, improving user experience and reducing reliance on manual support.
02 Problem Decomposition
Based on analysis of many Flink user cases, common problems are divided into three categories: error‑log analysis, anomaly analysis (affects current job execution), and risk analysis (does not affect current execution). Each category has a defined analysis scope.
Error‑Log Analysis
Analyzes the stack trace of the current job and includes two phases:
Development phase: analysis of exception stacks during development, e.g., syntax errors, schema configuration errors.
Running phase: analysis of exception stacks during execution, e.g., upstream binlog expiration, null values in time fields.
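To illustrate how exception messages like these can be grouped before analysis, the following is a minimal sketch of greedy, template‑based log clustering. It is a simplified stand‑in for the idea, not Advisor's actual algorithm: the similarity rule, the 0.7 threshold, the sample messages, and the knowledge‑base entry are all hypothetical.

```python
from dataclasses import dataclass, field

WILDCARD = "<*>"


@dataclass
class Cluster:
    template: list            # tokens, with WILDCARD at variable positions
    members: list = field(default_factory=list)


def similarity(template, tokens):
    """Fraction of positions where the template agrees with the tokens."""
    same = sum(1 for t, u in zip(template, tokens) if t == u or t == WILDCARD)
    return same / len(tokens)


def cluster_logs(lines, threshold=0.7):
    """Greedily assign each line to the most similar cluster of equal length."""
    clusters = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        best, best_score = None, threshold
        for c in clusters:
            if len(c.template) != len(tokens):
                continue
            score = similarity(c.template, tokens)
            if score >= best_score:
                best, best_score = c, score
        if best is None:
            clusters.append(Cluster(template=tokens, members=[line]))
        else:
            # Positions that differ become wildcards in the shared template.
            best.template = [t if t == u else WILDCARD
                             for t, u in zip(best.template, tokens)]
            best.members.append(line)
    return clusters


# Hypothetical messages, written for this example (not real Flink output).
LOGS = [
    "Binlog file mysql-bin.000123 at position 4567 has expired",
    "Binlog file mysql-bin.000456 at position 89 has expired",
    "Null value in time field order_ts of row 42",
    "Null value in time field pay_ts of row 7",
]

# Hypothetical knowledge base: template text mapped to a suggestion.
KNOWLEDGE_BASE = {
    "Binlog file <*> at position <*> has expired":
        "Restart the job from a binlog position that is still retained.",
}

for c in cluster_logs(LOGS):
    key = " ".join(c.template)
    print(len(c.members), key, "->", KNOWLEDGE_BASE.get(key, "no entry yet"))
```

Each new message is compared only against templates with the same token count, so the work grows with the number of clusters rather than the number of raw lines. That is the deduplication effect that makes a large log volume manageable.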
Anomaly Analysis
Focuses on issues that affect the current job, covering three stages:
Startup stage: startup file analysis, dependent cloud resources, data source permissions, network, session cluster, etc.
Running stage: checkpoint checks, permission checks, state checks, etc.
Shutdown stage: shutdown speed analysis.
Risk Analysis
Addresses issues that do not affect job execution, covering two stages:
Configuration stage: JobGraph checks, version checks, HA checks.
Running stage: checkpoint checks, runtime environment checks.
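The decomposition above can be written down as a small lookup table, which makes the scope of each category easy to query. This is only an illustrative data structure: the category, stage, and check names come from the text, while the Python names (Category, SCOPE, stages, categories_covering) are hypothetical.

```python
from enum import Enum


class Category(Enum):
    ERROR_LOG = "error-log analysis"
    ANOMALY = "anomaly analysis"  # affects current job execution
    RISK = "risk analysis"        # does not affect current execution


# Keys are (category, stage); values are the example checks named in the text.
SCOPE = {
    (Category.ERROR_LOG, "development"): [
        "syntax errors", "schema configuration errors"],
    (Category.ERROR_LOG, "running"): [
        "upstream binlog expiration", "null values in time fields"],
    (Category.ANOMALY, "startup"): [
        "startup file analysis", "dependent cloud resources",
        "data source permissions", "network", "session cluster"],
    (Category.ANOMALY, "running"): [
        "checkpoint checks", "permission checks", "state checks"],
    (Category.ANOMALY, "shutdown"): ["shutdown speed analysis"],
    (Category.RISK, "configuration"): [
        "JobGraph checks", "version checks", "HA checks"],
    (Category.RISK, "running"): [
        "checkpoint checks", "runtime environment checks"],
}


def stages(category):
    """Return the stages a category covers, in the order listed above."""
    return [stage for (cat, stage) in SCOPE if cat is category]


def categories_covering(check):
    """Return the names of all categories that include a given check."""
    return sorted({cat.value for (cat, _), checks in SCOPE.items()
                   if check in checks})


print(stages(Category.ANOMALY))
print(categories_covering("checkpoint checks"))
```

The second query shows that checkpoint checks appear under both anomaly analysis and risk analysis: the same signal can indicate either a problem affecting the current run or a latent risk.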
03 Core Technology
Data Layer
Provides real‑time data‑warehouse capabilities to the service layer. It collects data from the underlying cluster (Kubernetes) and the product engines (VVP and Flink), processes it through a big‑data and AI engine (ETL, clustering, analysis), and stores full‑lifecycle observability data for users' Flink jobs in a real‑time warehouse.
Service Layer
Offers two capabilities:
Error‑log analysis service: uses log‑clustering and recommendation algorithms to build a knowledge base of Flink error logs, enabling automatic matching of user‑submitted error messages to solutions.
Job diagnosis service: reads the full‑lifecycle data from the data layer, periodically runs a decision‑tree model that encodes expert experience (error causes, performance, configuration, environment risks), and returns diagnostic items to the business layer.
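As a rough illustration of how a decision tree can encode expert experience, the sketch below walks a hand‑built tree over a dictionary of job metrics and returns a diagnostic item with a suggestion. The metric names, thresholds, and wording are hypothetical and are not taken from Advisor's real model.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """A diagnostic item returned to the caller."""
    diagnosis: str
    suggestion: str


@dataclass
class Node:
    """An internal decision: a yes/no question about job metrics."""
    question: str
    test: Callable[[dict], bool]
    if_true: Union["Node", Leaf]
    if_false: Union["Node", Leaf]


def diagnose(tree, metrics):
    """Walk the tree and return the reached leaf plus the path taken."""
    path = []
    node = tree
    while isinstance(node, Node):
        answer = node.test(metrics)
        path.append((node.question, answer))
        node = node.if_true if answer else node.if_false
    return node, path


# Hypothetical tree; metric names and thresholds are illustrative only.
TREE = Node(
    question="Did the job restart in the last hour?",
    test=lambda m: m.get("restarts_last_hour", 0) > 0,
    if_true=Node(
        question="Did at least 3 checkpoints fail?",
        test=lambda m: m.get("failed_checkpoints", 0) >= 3,
        if_true=Leaf("Repeated checkpoint failures",
                     "Check checkpoint storage access and timeout settings."),
        if_false=Leaf("Restart without checkpoint failures",
                      "Review the exception information for the failing task."),
    ),
    if_false=Node(
        question="Is high availability disabled?",
        test=lambda m: not m.get("ha_enabled", False),
        if_true=Leaf("HA risk",
                     "Enable high availability before the next restart."),
        if_false=Leaf("No issue found", "No action needed."),
    ),
)

leaf, path = diagnose(
    TREE, {"restarts_last_hour": 2, "failed_checkpoints": 5, "ha_enabled": True})
print(leaf.diagnosis, "-", leaf.suggestion)
for question, answer in path:
    print(f"  {question} {'yes' if answer else 'no'}")
```

Returning the path along with the leaf keeps the result explainable: the caller can show which questions led to the diagnosis, which is useful when the rules come from expert experience rather than from training data.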
Business Layer
Exposes diagnostic data through multiple entry points (VVP console, DingTalk robot, ABM diagnosis). Users obtain exception information and suggested solutions, helping resolve job anomalies and ensuring stable Flink operation.
04 Feature Practice
Development‑stage Error‑Log Analysis
In the Flink Managed console, select Application > Job Development, write SQL, and click Validate to view the error‑log analysis.
Health Score
On the Job Operations page, view the health score of the job.
Running‑stage Log Analysis
Switch among the running log, startup log, and exception information to view the corresponding analysis.
Job Diagnosis
Click the Diagnose button on the job detail page to see risk reasons and optimization suggestions.
05 Summary
The core capabilities of Flink Intelligent Diagnosis are:
Product experience: real‑time error diagnosis from development to operations, reducing ticket volume.
Technical innovation: log clustering and recommendation algorithms replace regex matching, which handles deduplication of massive log volumes and lowers the barrier to integrating expert knowledge.
Root‑cause suggestions: exception causes are matched to solutions with a reported 100% accuracy, with support for rapid hot updates.
R&D collaboration: jointly built by SRE, development, service, and product teams, forming a sustainable operation mechanism.
Impact: an average of 3.5 diagnoses per user per day; operations tickets for job errors decreased by 28%.
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
