How Alibaba’s Hawkeye and Torch Transform AIOps for Search Platforms
Alibaba’s AIOps case study details how the Hawkeye intelligent diagnosis system and the Torch capacity governance platform jointly improve search platform efficiency and stability while lowering cost, using algorithmic analysis, automated cloning, stress testing, and optimization across resource, performance, and smart Q&A dimensions.
Background
With the rapid growth of the search business, platforms have become more centralized, and operations have evolved from manual work to DevOps and now to AIOps. Traditional operations approaches can no longer meet the demands of big data and AI.
AIOps Practice and Implementation
Hawkeye – Intelligent Diagnosis and Optimization
Hawkeye is an intelligent diagnosis and optimization system composed of three layers: analysis, web, and service.
Analysis Layer
The analysis layer includes two components: hawkeye-blink, which performs low‑level data processing such as access‑log and full‑data analysis using Blink; and hawkeye-experience, which provides user‑oriented analyses such as field‑type validation, monotonicity monitoring, invalid alarms, smoke‑case entry, engine downgrade configuration, memory settings, recommendation row‑column configuration, and more.
hawkeye-experience serves as a rule‑centered platform that codifies operational expertise, so each new application benefits from expert‑level diagnostics without repeated trial and error.
Key Features
Resource optimization: engine lock memory, real‑time memory.
Performance optimization: Top‑N slow query, buildservice resource tuning.
Intelligent diagnosis: routine inspection, smart Q&A.
Engine Lock Memory Optimization
Locking index, attribute, and summary data in memory improves access speed, but locked fields that are never used waste memory. Hawkeye analyzes field usage and trims the indexes of head‑tier applications, saving millions of yuan.
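The field-usage analysis described above can be illustrated with a minimal sketch. It is not Hawkeye's implementation: the function name, the input shapes (a map of locked field sizes and a per-query list of referenced fields), and the `min_hits` threshold are all assumptions made for illustration.

```python
from collections import Counter


def find_unused_locked_fields(locked_field_bytes, query_field_log, min_hits=1):
    """Return (unused_fields, reclaimable_bytes).

    locked_field_bytes: dict of field name -> locked memory in bytes (hypothetical input).
    query_field_log: iterable of field-name lists, one list per logged query (hypothetical input).
    min_hits: a field seen in fewer queries than this counts as unused (assumed threshold).
    """
    hits = Counter()
    for fields in query_field_log:
        hits.update(set(fields))  # count each field at most once per query

    # Fields that are locked in memory but (almost) never referenced by queries.
    unused = sorted(f for f in locked_field_bytes if hits[f] < min_hits)
    reclaimable = sum(locked_field_bytes[f] for f in unused)
    return unused, reclaimable


if __name__ == "__main__":
    locked = {"title": 100, "price": 50, "legacy_tag": 30}
    log = [["title", "price"], ["title"]]
    print(find_unused_locked_fields(locked, log))
```

In practice the decision to unlock a field would also need a long enough observation window, since a field that is rarely queried is not necessarily safe to drop.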
Slow Query Analysis
Slow queries are extracted from access logs. Using Blink’s big‑data capabilities, a divide‑and‑hash plus min‑heap algorithm identifies Top‑N slow queries, then provides personalized optimization suggestions to improve engine query performance and capacity.
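A single-process sketch of the divide‑and‑hash plus min‑heap idea follows. It runs in plain Python rather than Blink, and it makes several assumptions not stated in the source: records are `(query, latency_ms)` pairs, queries are ranked by mean latency, and the function name and `num_buckets` parameter are hypothetical.

```python
import heapq
import zlib
from collections import defaultdict


def top_n_slow_queries(records, n, num_buckets=4):
    """Return the n queries with the highest mean latency as (query, mean_ms) pairs."""
    if n <= 0:
        return []

    # Divide: a stable hash routes every record of a given query to the same
    # bucket, so each bucket can be aggregated independently (in a distributed
    # job, each bucket would be a separate partition).
    buckets = [defaultdict(lambda: [0.0, 0]) for _ in range(num_buckets)]
    for query, latency_ms in records:
        idx = zlib.crc32(query.encode("utf-8")) % num_buckets
        stats = buckets[idx][query]
        stats[0] += latency_ms
        stats[1] += 1

    # Min-heap of size n: the root is the fastest of the current candidates,
    # so a slower query replaces it in O(log n).
    heap = []
    for bucket in buckets:
        for query, (total, count) in bucket.items():
            item = (total / count, query)
            if len(heap) < n:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)

    return [(query, mean) for mean, query in sorted(heap, reverse=True)]


if __name__ == "__main__":
    logs = [("a", 100), ("a", 300), ("b", 50), ("c", 500), ("d", 10)]
    print(top_n_slow_queries(logs, 2))
```

The heap keeps memory bounded by N regardless of how many distinct queries appear, which is why the combination suits large access logs.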
One‑Click Diagnosis
Health scores indicate engine status; diagnosis reports show configuration issues, benefits, and logic. Users can view detailed results and take immediate action.
Intelligent Q&A
Recurring questions, such as why incremental updates stopped or what a common resource alert means, are answered automatically by a ChatOps bot. The bot injects diagnostic information into alert messages, so users can get answers simply by @‑mentioning it.
Torch – Capacity Governance
Torch focuses on capacity governance to reduce cost. It addresses issues like arbitrary container requests and unknown real‑world capacity, providing guidance on optimal CPU, memory, and disk allocation.
Solution Overview
Capacity assessment combines KMON data with a dedicated stress‑testing service that clones a single instance of the online service, runs automated pressure tests, and feeds results to an algorithm service for cost‑aware resource planning.
System Architecture
From bottom to top: entry layer (application information), task management (capacity‑evaluation tasks), data factory (processes KMON and stress‑test data), decision center (algorithmic evaluation, validation, cleanup), and application layer (capacity dashboards, APIs).
Clone Simulation
Cloning creates a shallow or deep copy of an online service. Shallow cloning uses shadow tables for HA3, while deep cloning performs an offline build. Benefits include service isolation, validated optimization, and automatic resource release.
Stress‑Testing Service
A distributed stress‑testing service automatically scales workers to apply pressure, overcoming the limitations of existing platforms.
Algorithm Service
Cost‑minimization is formulated as a constrained optimization problem using a price formula (CPU, memory, disk). The algorithm finds the lowest‑cost resource configuration that satisfies QPS, memory, and disk requirements.
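To make the constrained-optimization framing concrete, here is a small brute-force sketch. It is not Torch's algorithm: the linear price formula, the assumption that per-replica QPS scales linearly with CPU cores (as a stress test might measure), the requirement that every replica holds the full memory and disk footprint, and all names and parameters are assumptions for illustration.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Plan:
    cpu: int       # cores per replica
    mem_gb: int    # memory per replica
    disk_gb: int   # disk per replica
    replicas: int
    cost: float


def cheapest_plan(target_qps, qps_per_core, mem_gb_needed, disk_gb_needed,
                  prices, cpu_options, mem_options, disk_options):
    """Enumerate container shapes and return the lowest-cost feasible Plan, or None.

    prices: dict with unit prices for "cpu", "mem_gb", and "disk_gb" (assumed linear pricing).
    """
    best = None
    for cpu, mem, disk in product(cpu_options, mem_options, disk_options):
        # Memory and disk constraints: each replica must fit the full data set.
        if mem < mem_gb_needed or disk < disk_gb_needed:
            continue
        # QPS constraint: add replicas until the target throughput is covered.
        replicas = math.ceil(target_qps / (cpu * qps_per_core))
        unit_cost = (cpu * prices["cpu"]
                     + mem * prices["mem_gb"]
                     + disk * prices["disk_gb"])
        cost = replicas * unit_cost
        if best is None or cost < best.cost:
            best = Plan(cpu, mem, disk, replicas, cost)
    return best


if __name__ == "__main__":
    plan = cheapest_plan(
        target_qps=1000, qps_per_core=100, mem_gb_needed=10, disk_gb_needed=50,
        prices={"cpu": 10, "mem_gb": 2, "disk_gb": 1},
        cpu_options=[2, 4, 8], mem_options=[8, 16, 32], disk_options=[40, 100],
    )
    print(plan)
```

Exhaustive enumeration is only practical for a small set of container shapes; a larger search space would call for an integer-programming or similar solver.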
AIOps Outlook
The successful deployment of Hawkeye and Torch on the Tisplus search platform demonstrates significant cost reduction along with improvements in efficiency and stability, paving the way for a unified AIOps platform for other online services. Future work will focus on building four foundational libraries: operations metrics, knowledge base, component library (cloning, stress testing, algorithm models), and strategy library (visual canvas, UDP scripts).
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
