
How Alibaba’s Hawkeye and Torch Transform AIOps for Search Platforms

Alibaba’s AIOps case study describes how the Hawkeye intelligent diagnosis system and the Torch capacity governance platform together improve the efficiency and stability of its search platform while lowering cost. They do so through algorithmic analysis, automated cloning, stress testing, and optimization of resources, performance, and smart Q&A.

Alibaba Cloud Developer

Background

As the search business has grown rapidly, its platforms have become more centralized, and operations have evolved from manual work to DevOps and now to AIOps. Traditional operations approaches can no longer meet the demands of big data and AI.

AIOps Practice and Implementation

Hawkeye – Intelligent Diagnosis and Optimization

Hawkeye is an intelligent diagnosis and optimization system composed of three layers: analysis, web, and service.

Analysis Layer

The analysis layer includes two components: hawkeye-blink, which uses Blink for low‑level data processing such as access‑log and full‑data analysis; and hawkeye-experience, which provides user‑oriented analyses such as field‑type validation, monotonicity monitoring, invalid alarms, smoke‑case entry, engine downgrade configuration, memory settings, and recommendation row‑column configuration.

hawkeye-experience serves as a rule‑centered platform that codifies operational expertise, allowing each new application to benefit from expert‑level diagnostics without repeated trial‑and‑error.

Key Features

Resource optimization: engine lock memory, real‑time memory.

Performance optimization: Top‑N slow query, buildservice resource tuning.

Intelligent diagnosis: routine inspection, smart Q&A.

Engine Lock Memory Optimization

Locking memory for index, attribute, and summary improves access speed, but unused fields waste memory. Hawkeye analyzes field usage and trims indexes for head‑tier applications, saving millions of yuan.
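The field-usage analysis can be pictured with a small sketch. It assumes two hypothetical inputs that the article does not specify: a map of locked fields to the memory each one holds, and the set of fields referenced by each logged query. The function name and data shapes are illustrative only, not Hawkeye's actual interface.

```python
from collections import Counter

def find_unused_locked_fields(locked_bytes, queries):
    """Return locked fields that no query references, largest first,
    plus the total memory that trimming them could release.

    locked_bytes: dict of field name -> locked memory in bytes (assumed input)
    queries: iterable of field-name collections, one per logged query (assumed input)
    """
    usage = Counter()
    for fields in queries:
        # Count each field at most once per query.
        usage.update(set(fields))

    unused = {name: size for name, size in locked_bytes.items() if usage[name] == 0}
    ranked = sorted(unused.items(), key=lambda kv: kv[1], reverse=True)
    return ranked, sum(unused.values())
```

Ranking by size puts the fields with the largest potential saving at the top of a trimming report.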

Slow Query Analysis

Slow queries are extracted from access logs. Using Blink’s big‑data capabilities, a divide‑and‑hash plus min‑heap algorithm identifies Top‑N slow queries, then provides personalized optimization suggestions to improve engine query performance and capacity.
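A minimal single-process sketch of the divide‑and‑hash plus min‑heap idea follows. It is not the Blink job itself: the record shape (query string, latency in milliseconds), the use of the worst observed latency per query as the ranking key, and all function names are assumptions made for illustration.

```python
import heapq
import zlib

def partition_by_hash(records, num_partitions):
    """Divide step: route each (query, latency_ms) record to a bucket by
    query hash, so every occurrence of a query lands in the same bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for query, latency_ms in records:
        idx = zlib.crc32(query.encode("utf-8")) % num_partitions
        buckets[idx].append((query, latency_ms))
    return buckets

def top_n_in_partition(bucket, n):
    """Keep the n slowest distinct queries of one bucket in a size-n min-heap."""
    worst = {}
    for query, latency_ms in bucket:
        if latency_ms > worst.get(query, float("-inf")):
            worst[query] = latency_ms

    heap = []  # min-heap of (latency_ms, query); heap[0] is the fastest kept entry
    for query, latency_ms in worst.items():
        if len(heap) < n:
            heapq.heappush(heap, (latency_ms, query))
        elif latency_ms > heap[0][0]:
            heapq.heapreplace(heap, (latency_ms, query))
    return heap

def top_n_slow_queries(records, n, num_partitions=4):
    """Merge step: combine per-partition candidates into the global Top-N."""
    candidates = []
    for bucket in partition_by_hash(records, num_partitions):
        candidates.extend(top_n_in_partition(bucket, n))
    return [(query, ms) for ms, query in heapq.nlargest(n, candidates)]
```

Because each query hashes to exactly one partition, every member of the global Top‑N is also in the Top‑N of its own partition, so merging the per‑partition heaps loses nothing. The min‑heap keeps memory per partition proportional to N rather than to the number of distinct queries retained.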

One‑Click Diagnosis

Health scores indicate engine status; diagnosis reports show configuration issues, benefits, and logic. Users can view detailed results and take immediate action.
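One way to picture a rule‑driven health score is sketched below. The article does not describe how Hawkeye computes its score, so everything here is hypothetical: the rule functions, the configuration keys, the thresholds, and the scheme of subtracting a penalty per finding from 100.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Finding:
    rule: str      # which rule fired
    issue: str     # the configuration issue found
    benefit: str   # expected benefit of fixing it
    penalty: int   # points deducted from the health score (assumed scheme)

Rule = Callable[[dict], Optional[Finding]]

def check_unused_locked_fields(app: dict) -> Optional[Finding]:
    unused = app.get("unused_locked_fields", [])
    if unused:
        return Finding("lock_memory", f"{len(unused)} locked field(s) are never queried",
                       "release locked memory", 20)
    return None

def check_slow_query_ratio(app: dict) -> Optional[Finding]:
    # The 1% threshold is an illustrative assumption.
    if app.get("slow_query_ratio", 0.0) > 0.01:
        return Finding("slow_query", "more than 1% of queries are slow",
                       "lower latency and higher capacity", 30)
    return None

RULES: List[Rule] = [check_unused_locked_fields, check_slow_query_ratio]

def diagnose(app: dict, rules: List[Rule] = RULES) -> Tuple[int, List[Finding]]:
    """Run every rule and return (health score, findings for the report)."""
    findings = [f for f in (rule(app) for rule in rules) if f is not None]
    score = max(0, 100 - sum(f.penalty for f in findings))
    return score, findings
```

Keeping each check as a separate function mirrors the rule‑centered idea: a new piece of operational expertise becomes one more entry in the rule list.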

Intelligent Q&A

Recurring questions, such as why incremental updates have stopped or what a common resource alert means, are answered automatically by a ChatOps bot. The bot injects diagnostic information into alert messages, and users can get answers simply by @‑mentioning it.

Torch – Capacity Governance

Torch focuses on capacity governance to reduce cost. It addresses issues like arbitrary container requests and unknown real‑world capacity, providing guidance on optimal CPU, memory, and disk allocation.

Solution Overview

Capacity assessment combines KMON data with a dedicated stress‑testing service that clones a single instance of the online service, runs automated pressure tests, and feeds results to an algorithm service for cost‑aware resource planning.

System Architecture

From bottom to top: entry layer (application information), task management (capacity‑evaluation tasks), data factory (processes KMON and stress‑test data), decision center (algorithmic evaluation, validation, cleanup), and application layer (capacity dashboards, APIs).

Clone Simulation

Cloning creates a shallow or deep copy of an online service. Shallow cloning uses shadow tables for HA3, while deep cloning performs an offline build. Benefits include service isolation, validated optimization, and automatic resource release.

Stress‑Testing Service

A distributed stress‑testing service automatically scales workers to apply pressure, overcoming the limitations of existing platforms.
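The article does not say how the service decides when to stop applying pressure or how many workers to start. The sketch below shows one common approach under stated assumptions: load is raised in fixed steps until a measured latency exceeds a limit, and the worker count is derived from an assumed per‑worker throughput. The callback `measure_latency_ms` and all parameter names are hypothetical.

```python
import math

def find_max_qps(measure_latency_ms, latency_limit_ms,
                 start_qps=100, step_qps=100, max_qps=10000):
    """Raise load step by step and return the highest QPS whose measured
    latency stays within the limit (0 if even the first step fails)."""
    last_ok = 0
    qps = start_qps
    while qps <= max_qps:
        if measure_latency_ms(qps) > latency_limit_ms:
            break
        last_ok = qps
        qps += step_qps
    return last_ok

def workers_needed(target_qps, qps_per_worker):
    """Number of load-generating workers required to reach target_qps."""
    if qps_per_worker <= 0:
        raise ValueError("qps_per_worker must be positive")
    return max(1, math.ceil(target_qps / qps_per_worker))
```

In a real run, `measure_latency_ms` would drive traffic against the cloned instance and report an observed latency statistic; here it is only a placeholder for that measurement.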

Algorithm Service

Cost minimization is formulated as a constrained optimization problem whose objective is a price formula over CPU, memory, and disk. The algorithm finds the lowest‑cost resource configuration that satisfies the QPS, memory, and disk requirements.
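The article does not give the price formula or the solver, so the following is a sketch under stated assumptions: the cost is linear in CPU cores, memory, and disk per instance; the QPS a single instance sustains at each CPU size comes from a stress test; and the search is a plain enumeration over a small set of candidate sizes. All names and prices are illustrative.

```python
import math
from itertools import product

def cheapest_plan(target_qps, min_mem_gb, min_disk_gb,
                  qps_by_cpu, mem_options_gb, disk_options_gb, price):
    """Enumerate candidate instance shapes and return the cheapest plan
    that meets the QPS, memory, and disk constraints, or None.

    qps_by_cpu: dict of CPU cores -> single-instance QPS (assumed stress-test result)
    price: dict with unit prices for "cpu", "mem", and "disk" (assumed linear formula)
    """
    best = None
    for (cpu, qps), mem, disk in product(qps_by_cpu.items(),
                                         mem_options_gb, disk_options_gb):
        # Each instance must hold the data on its own memory and disk.
        if qps <= 0 or mem < min_mem_gb or disk < min_disk_gb:
            continue
        replicas = math.ceil(target_qps / qps)  # instances needed to serve the load
        unit_cost = price["cpu"] * cpu + price["mem"] * mem + price["disk"] * disk
        total = replicas * unit_cost
        if best is None or total < best["cost"]:
            best = {"cpu": cpu, "mem_gb": mem, "disk_gb": disk,
                    "replicas": replicas, "cost": total}
    return best
```

Exhaustive enumeration is practical only because the candidate set is small; a larger or continuous search space would call for a proper optimization solver.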

AIOps Outlook

Deploying Hawkeye and Torch on the Tisplus search platform has delivered significant cost reductions along with improvements in efficiency and stability, paving the way for a unified AIOps platform for other online services. Future work will focus on building four foundational libraries: operations metrics, a knowledge base, a component library (cloning, stress testing, algorithm models), and a strategy library (visual canvas, UDP scripts).
