How Alibaba Cloud’s Flink Advisor Transforms Real‑Time Log Diagnosis

Alibaba Cloud's Flink Intelligent Diagnosis (Advisor) combines real‑time data‑warehouse, log‑clustering, and decision‑tree algorithms to automatically analyze error logs, diagnose job anomalies, and provide optimization suggestions, dramatically reducing manual support tickets and improving user experience across Flink managed services.

Alibaba Cloud Big Data AI Platform

Jun 7, 2023

01 Introduction

Alibaba Cloud Real‑Time Compute Flink is a professional high‑performance real‑time big‑data processing system that supports scenarios such as real‑time data warehouse, risk control, and real‑time machine learning. As usage grows, users encounter difficulties such as complex error‑log analysis, task failure handling, and performance tuning.

Because existing support for error‑log analysis and end‑to‑end anomaly diagnosis is limited, many problems cannot be resolved by self‑service bots. Users are forced to submit tickets, which increases the workload of operations teams.

To address these issues, Alibaba Cloud designed the Flink Intelligent Diagnosis (Advisor) tool. It provides precise error diagnosis and optimization suggestions throughout the lifecycle of Flink managed services, improving user experience and reducing reliance on manual support.

02 Problem Decomposition

Based on an analysis of many Flink user cases, common problems fall into three categories: error‑log analysis, anomaly analysis (issues that affect the current job's execution), and risk analysis (issues that do not affect the current execution). Each category has a defined analysis scope.

Error‑Log Analysis

Analyzes the exception stack traces of the current job in two phases:

Development phase: analysis of exception stacks during development, e.g., syntax errors, schema configuration errors.

Running phase: analysis of exception stacks during execution, e.g., upstream binlog expiration, null values in time fields.

Anomaly Analysis

Focuses on issues that affect the current job, covering three stages:

Startup stage: startup file analysis, dependent cloud resources, data source permissions, network, session cluster, etc.

Running stage: checkpoint checks, permission checks, state checks, etc.

Shutdown stage: shutdown speed analysis.

Risk Analysis

Addresses issues that do not affect job execution, covering two stages:

Configuration stage: JobGraph checks, version checks, HA checks.

Running stage: checkpoint checks, runtime environment checks.
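The decomposition above can be pictured as a small lookup structure keyed by category and stage. The following is only a sketch: the check names are copied from the lists above, while the `Category`, `Check`, and `checks_for` names are hypothetical and do not come from the Advisor implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    ERROR_LOG = "error-log analysis"
    ANOMALY = "anomaly analysis"  # affects the current job
    RISK = "risk analysis"        # does not affect the current job


@dataclass(frozen=True)
class Check:
    category: Category
    stage: str
    name: str


# Check names mirror the article's lists; the registry layout is an assumption.
CHECKS = [
    Check(Category.ERROR_LOG, "development", "syntax errors"),
    Check(Category.ERROR_LOG, "development", "schema configuration errors"),
    Check(Category.ERROR_LOG, "running", "upstream binlog expiration"),
    Check(Category.ERROR_LOG, "running", "null values in time fields"),
    Check(Category.ANOMALY, "startup", "data source permissions"),
    Check(Category.ANOMALY, "running", "checkpoint checks"),
    Check(Category.ANOMALY, "shutdown", "shutdown speed analysis"),
    Check(Category.RISK, "configuration", "JobGraph checks"),
    Check(Category.RISK, "configuration", "version checks"),
    Check(Category.RISK, "configuration", "HA checks"),
    Check(Category.RISK, "running", "checkpoint checks"),
    Check(Category.RISK, "running", "runtime environment checks"),
]


def checks_for(category: Category, stage: str) -> list[str]:
    """Return the names of all checks registered for a category and stage."""
    return [c.name for c in CHECKS if c.category is category and c.stage == stage]
```

For example, `checks_for(Category.RISK, "configuration")` returns the three configuration-stage risk checks listed above.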

03 Core Technology

Data Layer

Provides real‑time data‑warehouse capabilities for the service layer. It collects data from the underlying cluster (Kubernetes) and the product engines (VVP and Flink), processes that data through a big‑data and AI engine (ETL, clustering, analysis), and stores full‑lifecycle observability data for user Flink jobs in a real‑time warehouse.

Service Layer

Offers two capabilities:

Error‑log analysis service: uses log‑clustering and recommendation algorithms to build a knowledge base of Flink error logs, enabling automatic matching of user‑submitted error messages to solutions.

Job diagnosis service: reads the full‑lifecycle data from the data layer, periodically runs a decision‑tree model that encodes expert experience (error causes, performance, configuration, environment risks) and returns diagnostic items to the interface layer.
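To make the log‑clustering idea concrete, here is a minimal sketch of one common approach: mask the variable parts of each log line to obtain a template, group lines by template, and match new errors against a knowledge base by token overlap. This is not the algorithm Advisor actually uses, which the article does not describe in detail. The masking rules, the `KNOWLEDGE_BASE` entries, the sample messages, and the similarity threshold are all made up for illustration and are not real Flink output.

```python
import re
from collections import defaultdict

# Ordered masking rules: variable fragments are replaced by placeholders.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"'[^']*'"), "<STR>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line: str) -> str:
    """Replace variable fragments so similar lines share one template."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return " ".join(line.split())  # normalize whitespace


def cluster(lines):
    """Group raw log lines by their masked template."""
    clusters = defaultdict(list)
    for line in lines:
        clusters[to_template(line)].append(line)
    return dict(clusters)


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two templates."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


# Hypothetical knowledge base: template -> suggested fix (illustrative only).
KNOWLEDGE_BASE = {
    "Table <STR> not found in catalog":
        "Check the table name and the catalog configuration.",
    "Checkpoint <NUM> expired before completing":
        "Increase the checkpoint timeout or reduce state size.",
}


def suggest(line: str, threshold: float = 0.6):
    """Return the fix for the most similar known template, or None."""
    template = to_template(line)
    best = max(KNOWLEDGE_BASE, key=lambda k: jaccard(template, k))
    return KNOWLEDGE_BASE[best] if jaccard(template, best) >= threshold else None
```

Because matching works on templates rather than hand‑written patterns per message, adding a new known error only requires a new knowledge‑base entry, which is one plausible reading of why such an approach lowers the cost of integrating expert knowledge.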

Business Layer

Exposes diagnostic data through multiple entry points (VVP console, DingTalk robot, ABM diagnosis). Users obtain exception information and suggested solutions, helping resolve job anomalies and ensuring stable Flink operation.

04 Feature Practice

Development‑stage Error‑Log Analysis

In the Flink Managed console, go to Application > Job Development, write the SQL, and click Validate to view the error‑log analysis.

Health Score

On the Job Operations page, view the health score of the job.

Running‑stage Log Analysis

Switch among the running log, startup log, and exception information to view the corresponding analysis.

Job Diagnosis

Click the Diagnose button on the job detail page to see risk reasons and optimization suggestions.
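The diagnostic items shown here come from the decision‑tree model described under Service Layer. As a rough illustration of how expert rules might be encoded as a tree, the sketch below walks a hand‑built binary tree over a dictionary of job metrics. The metric names (`checkpoint_failures`, `state_size_gb`, `ha_enabled`), the thresholds, and the diagnosis texts are hypothetical; the article does not publish the real rules.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    # Internal nodes carry a test and two branches; leaves carry a diagnosis.
    test: Optional[Callable[[dict], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    diagnosis: Optional[str] = None


def diagnose(node: Node, metrics: dict) -> str:
    """Walk the tree from the root until a leaf diagnosis is reached."""
    while node.diagnosis is None:
        node = node.if_true if node.test(metrics) else node.if_false
    return node.diagnosis


# Hypothetical expert rules; names and thresholds are assumptions.
TREE = Node(
    test=lambda m: m["checkpoint_failures"] > 0,
    if_true=Node(
        test=lambda m: m["state_size_gb"] > 50,
        if_true=Node(diagnosis="Anomaly: large state is likely slowing checkpoints."),
        if_false=Node(diagnosis="Anomaly: checkpoints fail despite small state; "
                                "check storage permissions and network."),
    ),
    if_false=Node(
        test=lambda m: not m["ha_enabled"],
        if_true=Node(diagnosis="Risk: high availability is not enabled."),
        if_false=Node(diagnosis="No issue found."),
    ),
)
```

Calling `diagnose(TREE, {"checkpoint_failures": 0, "state_size_gb": 1, "ha_enabled": False})` follows the false branch at the root and then reports the HA risk. Keeping the rules as data rather than branching code is one way such a tree could be updated without redeploying the service.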

05 Summary

The core capabilities of Flink Intelligent Diagnosis are:

Product experience: real‑time error diagnosis from development to operations, reducing ticket volume.

Technical innovation: log clustering and recommendation algorithms replace regular‑expression matching, which addresses deduplication of massive log volumes and lowers the barrier to integrating expert knowledge.

Root‑cause suggestions: exception causes are matched to solutions with a reported 100% accuracy, and the knowledge base supports rapid hot updates.

R&D collaboration: jointly built by SRE, development, service, and product teams, forming a sustainable operation mechanism.

Impact: an average of 3.5 diagnoses per user per day; operations tickets for job errors decreased by 28%.
