LinkedIn's Network‑Level MySQL Query Analyzer: Architecture, Implementation, and Performance Evaluation

This article describes LinkedIn's network‑level MySQL query analyzer, explaining why it was needed, its three‑component architecture (agent, central server, UI), fingerprinting and hash‑map techniques, performance benchmarks, metric collection, security considerations, and the benefits it brings to database operations.

High Availability Architecture

Sep 22, 2017

Introduction

LinkedIn relies heavily on MySQL, with over 500 services depending on it. To improve resource utilization, the databases run in a multi-tenant architecture, which means queries from one application can affect others. Because the database team has limited control over application schemas and queries, LinkedIn captures complete query information so that problematic queries can be analyzed and optimized.

Why a Query Analyzer Is Needed

Understanding the runtime behavior of hundreds of applications requires deep analysis of their SQL queries. The slow-query log is insufficient on its own: it only records queries above a threshold, and lowering that threshold far enough to catch everything generates massive I/O and reduces throughput. Performance Schema adds overhead (8-25%) and cannot be switched on or off without a server restart, while the sys schema still requires the data to be exported for analysis. LinkedIn therefore built a network-level query analyzer that measures every query accurately with minimal overhead.

How the Query Analyzer Works

The analyzer consists of three components:

Agent running on each MySQL server.

Central server that stores query information.

UI on the central server for displaying analysis results.

Agent

The agent captures raw TCP packets on the MySQL node, decodes them using the MySQL protocol, and builds query objects. It records the time the query reaches the MySQL port and the time the first response packet is sent, and takes the difference as the response latency. Queries are fingerprinted using the Percona Go library, and the hash of the fingerprint becomes the query key.
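The timing rule described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the `LatencyTracker` class, the `(client IP, client port)` connection key, and the callback names are all hypothetical, and packet capture and MySQL protocol decoding are assumed to happen elsewhere.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical connection key: (client IP, client port).
ConnKey = Tuple[str, int]


@dataclass
class PendingQuery:
    sql: str
    arrived_at: float  # capture timestamp of the query packet, in seconds


class LatencyTracker:
    """Pairs each query with the first response packet on the same connection."""

    def __init__(self) -> None:
        self._pending: Dict[ConnKey, PendingQuery] = {}

    def on_query(self, conn: ConnKey, sql: str, ts: float) -> None:
        # A new query on a connection replaces any unanswered one.
        self._pending[conn] = PendingQuery(sql, ts)

    def on_response(self, conn: ConnKey, ts: float) -> Optional[Tuple[str, float]]:
        # Only the first response packet counts; later packets of the same
        # result set find nothing pending and are ignored.
        pending = self._pending.pop(conn, None)
        if pending is None:
            return None
        return pending.sql, ts - pending.arrived_at
```

Measuring up to the first response packet, as the article describes, means the latency reflects server-side processing rather than the time needed to stream a large result set to the client.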

For each unique hash the agent stores total response time, count, user, and database name in a hash map. When a duplicate hash appears, only the count and cumulative time are updated. Metadata (min/max times, fingerprint, etc.) is kept in a separate hash map and sent to the central server only when it changes. The agent uses only a few megabytes of memory and negligible network bandwidth.
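A rough sketch of this two-map bookkeeping is shown below. The class and field names are assumptions made for illustration; the article only states which values are kept, that repeated hashes update the count and cumulative time, and that metadata is sent only when it changes.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple


@dataclass
class QueryStats:
    user: str
    db: str
    count: int = 0
    total_time: float = 0.0


@dataclass
class QueryMeta:
    fingerprint: str
    min_time: float
    max_time: float


class Aggregator:
    """Hypothetical in-memory aggregation, keyed by the fingerprint hash."""

    def __init__(self) -> None:
        self.stats: Dict[str, QueryStats] = {}
        self.meta: Dict[str, QueryMeta] = {}
        self.dirty: Set[str] = set()  # keys whose metadata changed since last flush

    def record(self, key: str, fingerprint: str, user: str, db: str,
               latency: float) -> None:
        s = self.stats.get(key)
        if s is None:
            s = self.stats[key] = QueryStats(user, db)
        # For a hash that was already seen, only these two fields change.
        s.count += 1
        s.total_time += latency

        m = self.meta.get(key)
        if m is None:
            self.meta[key] = QueryMeta(fingerprint, latency, latency)
            self.dirty.add(key)
        elif latency < m.min_time or latency > m.max_time:
            m.min_time = min(m.min_time, latency)
            m.max_time = max(m.max_time, latency)
            self.dirty.add(key)

    def flush(self) -> Tuple[Dict[str, QueryStats], Dict[str, QueryMeta]]:
        """Return the interval's stats plus only the metadata that changed."""
        stats, self.stats = self.stats, {}
        changed = {k: self.meta[k] for k in self.dirty}
        self.dirty = set()
        return stats, changed
```

Because memory grows with the number of distinct fingerprints rather than the number of queries executed, a structure like this is consistent with the small footprint the article reports.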

Example fingerprint table:

| Query | SQL | Fingerprint |
|---|---|---|
| Query A | SELECT * FROM table WHERE value1='abc' | SELECT * FROM table WHERE value1='?' |
| Query B | SELECT * FROM table WHERE value1='abc' AND value2=430 | SELECT * FROM table WHERE value1='?' AND value2='?' |
| Query C | SELECT * FROM table WHERE value1='xyz' AND value2=123 | SELECT * FROM table WHERE value1='?' AND value2='?' |
| Query D | SELECT * FROM table WHERE VALUES IN(1,2,3) | SELECT * FROM table WHERE VALUES IN(?+) |
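To make the idea concrete, here is a deliberately simplified fingerprinting sketch. It is not the Percona library's algorithm: it only replaces string and numeric literals with `?` and collapses `IN` lists, and it emits a bare `?` rather than the quoted `'?'` shown in the table. Using MD5 of the fingerprint as the query key is also an assumption; the article does not name the hash function.

```python
import hashlib
import re

# Simplified patterns; a real fingerprinter handles comments, NULLs,
# multi-row VALUES lists, and many more cases.
_STRING = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")
_IN_LIST = re.compile(r"\bIN\s*\(\s*\?(?:\s*,\s*\?)*\s*\)", re.IGNORECASE)
_SPACE = re.compile(r"\s+")


def fingerprint(sql: str) -> str:
    """Replace literals with placeholders so similar queries look identical."""
    out = _STRING.sub("?", sql)          # 'abc' -> ?
    out = _NUMBER.sub("?", out)          # 430 -> ? (value1 is left untouched)
    out = _IN_LIST.sub("IN(?+)", out)    # IN(?,?,?) -> IN(?+)
    return _SPACE.sub(" ", out).strip()


def query_key(sql: str) -> str:
    """Hash of the fingerprint, used as the key (MD5 is an assumption)."""
    return hashlib.md5(fingerprint(sql).encode("utf-8")).hexdigest()
```

With this sketch, Query B and Query C from the table produce the same key, so their counts and response times would be aggregated together, while Query A produces a different one.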

Both hash maps are flushed periodically to the central server: the statistics map (query hash, total time, count, user, DB) and the metadata map.

UI

The UI runs on the central server and lets users select hostnames and time ranges to view per-query statistics. It shows each query's load percentage, calculated as the query's total time divided by the sum of all query times in the interval. This makes it clear how a query with a small execution time but high frequency can dominate the load.
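The load calculation is simple enough to show directly. The function name and the numbers below are illustrative only and do not come from the article.

```python
from typing import Dict


def load_percentages(total_times: Dict[str, float]) -> Dict[str, float]:
    """Each query's share of the summed query time in the interval, in percent."""
    grand_total = sum(total_times.values())
    if grand_total == 0:
        return {key: 0.0 for key in total_times}
    return {key: 100.0 * t / grand_total for key, t in total_times.items()}


# Illustrative interval: a 2 ms query run 30,000 times versus a
# 5,000 ms query run 4 times (total times in milliseconds).
interval = {
    "fast_but_frequent": 2 * 30_000,   # 60,000 ms in total
    "slow_but_rare": 5_000 * 4,        # 20,000 ms in total
}
loads = load_percentages(interval)
```

Here the fast query accounts for 75% of the load and the slow one for 25%, even though a single execution of the slow query takes 2,500 times longer.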

Clicking a query displays its trend chart.

Performance

Benchmarks were run on a 12-core Intel Xeon E5-2620 using MySQL 5.6.29-76.2-log (Percona), with the number of sysbench threads increased gradually. The analyzer had no measurable effect on throughput up to 128 concurrent threads. At 256 threads throughput dropped by about 5%, compared with a drop of about 10% with Performance Schema. The agent's CPU usage stayed below 1% up to 128 threads and peaked at about 5% beyond that.

Metric Collection

Two MySQL tables store the raw data: query_history (hostname, checksum, timestamp, count, query_time, user, db) and query_info (hostname, checksum, fingerprint, sample, first_seen, mintime, maxtime, etc.). The former is partitioned by timestamp and sub-partitioned by hostname. Indexes exist on checksum and other columns. Trend queries over long time ranges are occasionally slow; the team plans to move the data to an internal monitoring tool (inGraphs).
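The two tables and a typical per-host report can be approximated as below. This sketch uses an in-memory SQLite database purely as a stand-in so it runs anywhere: the column types are guesses, only the columns named above are included, and the partitioning by timestamp and hostname used on the real MySQL tables is not reproduced. The helper names `create_schema` and `top_queries` are hypothetical.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    # Column names follow the article; types are assumptions.
    conn.executescript(
        """
        CREATE TABLE query_history (
            hostname    TEXT NOT NULL,
            checksum    TEXT NOT NULL,
            "timestamp" INTEGER NOT NULL,
            "count"     INTEGER NOT NULL,
            query_time  REAL NOT NULL,
            "user"      TEXT,
            db          TEXT
        );
        CREATE INDEX idx_history_checksum ON query_history (checksum);

        CREATE TABLE query_info (
            hostname    TEXT NOT NULL,
            checksum    TEXT NOT NULL,
            fingerprint TEXT,
            sample      TEXT,
            first_seen  INTEGER,
            mintime     REAL,
            maxtime     REAL
        );
        CREATE INDEX idx_info_checksum ON query_info (checksum);
        """
    )


def top_queries(conn: sqlite3.Connection, hostname: str, start: int, end: int):
    """Per-checksum totals for one host in [start, end), heaviest first."""
    return conn.execute(
        """
        SELECT checksum, SUM("count"), SUM(query_time)
        FROM query_history
        WHERE hostname = ? AND "timestamp" >= ? AND "timestamp" < ?
        GROUP BY checksum
        ORDER BY SUM(query_time) DESC
        """,
        (hostname, start, end),
    ).fetchall()
```

A query of this shape always filters on hostname and a time range, which is presumably why partitioning on those two columns suits the real tables.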

Security

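Capturing raw packets requires elevated privileges, so by default the agent runs under sudo. To reduce privilege, the agent binary can instead be granted the cap_net_raw capability. It can then be run by a specific low-privilege user, with the binary's mode set to 100 or 500 so that only that user can execute it, avoiding full sudo access. See the Linux capabilities documentation for details.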

Summary

The query analyzer enables database engineers to quickly identify problematic queries, compare weekly query loads, and efficiently troubleshoot performance regressions. Developers and analysts can visualize query trends, inspect load per table/database, and receive alerts for new or sensitive queries. It also helps balance query distribution across servers, aiding capacity planning and hardware optimization. An open‑source release is planned.

Acknowledgments

Thanks to the LinkedIn MySQL team: Basavaiah Thambara and Alex Lurthu (design review), Kishore Govindaluri (UI development), and Naresh Kumar Vudutha (code review).