Master TiDB 6.0 Troubleshooting with PingCAP Clinic and Diag
This article explains how to streamline TiDB 6.0 fault diagnosis by replacing manual screenshot and log collection with PingCAP's Clinic service and the TiUP Diag tool, covering data collection, upload procedures, security considerations, and additional diagnostic capabilities.
About a month ago, PingCAP launched the TiDB 6.0 Book Rush activity. The author, who previously contributed to "TiDB in Action," shares hands-on experience with the 14 new features and describes how the Book Rush helps uncover bugs.
The article focuses on TiDB 6.0's new Clinic feature.
Traditional Troubleshooting
When a bug occurs, official support usually asks for TiDB cluster information, Grafana monitoring data, component logs (TiDB/PD/TiKV/TiCDC/DM), and system configuration. The required checklist includes cluster node details, configuration, monitoring dashboards, and logs from the problematic component.
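<code>[TiDB Environment]
[Overview] Scenario + problem summary
[Background] Operations performed
[Symptoms] Application and database symptoms
[Problem] The issue currently encountered
[Business Impact]
[TiDB Version]
[Application Software and Version]
[Attachments] Relevant logs and configuration
* TiUP Cluster Display output
* TiUP Cluster Edit config output
Monitoring (https://metricstool.pingcap.com/)
* TiDB-Overview Grafana monitoring
* TiDB Grafana monitoring
* TiKV Grafana monitoring
* PD Grafana monitoring
* Logs of the relevant module (covering 1 hour before and after the problem)</code>
1. Provide Monitoring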
The author describes three stages of manual screenshot collection, highlighting the inefficiency and time consumption of providing multiple Grafana screenshots.
Chaotic and cumbersome: support staff request many screenshots, leading to long back‑and‑forth.
Unclear time span: comparing data across days requires multiple screenshots.
Large dashboards (e.g., tikv detail) contain hundreds of metrics, making navigation slow.
Low efficiency and extra work such as masking sensitive data.
PingCAP later provided a script that exports Grafana dashboards as JSON, allowing support to load the data locally without manual screenshots.
Script:
<code>/* Copyright 2020 PingCAP, Inc. Licensed under MIT. */
"use strict"
var t={t:"Script failed to run. Please ensure it is running on the Grafana v6.x.x dashboard page.",
i:"Going to export snapshot of the current dashboard: ",s:"Note: Only metrics from visible panels can be exported",
o:"Number of panels",l:"Loading panels",h:"Expand all rows",p:"Export",u:"Export immediately",cancel:"Cancel"}
try{
var e=angular.element,i=document.styleSheets[0],a=[".__dexport_hint{z-index:2222;position:fixed;background:#fff;color:#000;padding:1em;font-size:18px;border-left:3px solid #faad14;right:0;top:5em;opacity:.8;min-width:30em}",".__dexport_hint:hover{opacity:1}",".__dexport_hint button{margin:0 1em}",".__dexport_hint button[disabled]{opacity:.5}",".__dexport_hint progress{margin-right:1em;vertical-align:middle;width:18em}"]
new(function(){function n(){var n=this,s=e(document).injector()
this.m=s.get("timeSrv")
var o=this.m.dashboard
a.forEach(t=>i.insertRule(t,0)),this.v=e("<span>") ,this.g=e('<progress value="0">'),
this._=e('<button style="font-weight:bold">').text(t.p).one("click",()=>n.k()),
this.P=e("<button>").text(t.h).on("click",()=>{o.expandRows(),n.S(()=>{n.T()})})
var r=/^(Mac|iP)/.test(navigator.platform)?"flex-direction:row-reverse":"justify-content:center"
this.j=e('<div class="__dexport_hint">').append(e("<p>").text(t.i).append(e("<strong>").text(o.title)),e("<p>").text(t.s).append(this.P),e('<p style="font-size:.8em">').append(this.g,this.v),e('<p style="display:flex;'+r+'">').append(this._,e("<button>").text(t.cancel).on("click",()=>n.I()))),
this.L=0,e(document.body).append(this.j),this.N=setInterval(()=>n.T(),500)}return n.prototype.T=function(){
var i=this,a=e(".panel-container"),n=a.length
if(0===this.L)this.v.text(t.o+": "+n),this._.prop("disabled",0===n)
else{var s=n-a.find(".panel-loading:visible").length,o=(100*s/n||0).toFixed(1)
this.v.text(t.l+": "+o+"% ("+s+"/"+n+")"),this.g.prop({value:s,max:n}),n> s?2!==this.L&& (this.L=2,
this._.prop("disabled",!1).text(t.u).one("click",()=>i.O())):this.O()}},
n.prototype.k=function(){
var t=this,i=this.m.dashboard
this.B=i.refresh,this.m.setAutoRefresh(),e(".panel-loading").removeClass("ng-hide"),this.S(()=>{t.L=1,
t._.prop("disabled",!0),i.snapshot={timestamp:new Date},i.startRefresh()})},
n.prototype.S=function(t){var i=e(".layout")
i.hide(),this.m.dashboard.removePanel(null),setTimeout(()=>{t(),i.show()},100)},
n.prototype.O=function(){
var t=this.m.dashboard,i=t.snapshot.timestamp.toISOString(),a=t.getSaveModelClone(),n=a.title
a.time=this.m.timeRange(),a.id=null,a.uid=null,a.title=n+" (exported at "+i+")"
var s={meta:{isSnapshot:!0,type:"snapshot",expires:"9999-12-31T23:59:59Z",created:i,
grafana:grafanaBootData.settings.buildInfo},dashboard:a,overwrite:!0}
this.I()
var o=new Blob([JSON.stringify(s)],{type:"application/json"}),r=URL.createObjectURL(o),l=e("<a>")
l.prop({href:r,download:n+"_"+i+".json",target:"_blank"}),e(document.body).append(l),l[0].click(),setTimeout(()=>{URL.revokeObjectURL(r),l.remove()},0)},
n.prototype.I=function(){var t=this.m.dashboard
delete t.snapshot,t.forEachPanel(t=>delete t.snapshotData),t.annotations.list.forEach(t=>delete t.snapshotData),
this.B&&this.m.setAutoRefresh(this.B),clearInterval(this.N),this.j.remove(),a.forEach(()=>i.deleteRule(0))},
n}())
}catch(e){console.error(e),alert(t.t)}</code>
Note: The script can run into compatibility issues with TiKV detail data that covers a long time span, with Grafana version changes, or with Chrome updates, and troubleshooting it often requires front‑end engineering skills.
2. Provide Logs
When monitoring alone cannot pinpoint the issue, logs from each component are needed. Example: a TiKV OOM caused by a slow query requires the following logs:
TiDB server logs – search for the "Welcome" keyword (indicates restart) and related [expensive_query] entries.
TiDB slow‑query log – examine the Query_time, Total_keys, and Process_keys fields to identify the problematic SQL.
TiKV logs – check for read/write hotspots.
PD leader logs – needed if stale statistics cause slow queries.
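The first two checks in the list above can be sketched with standard shell tools. This is a minimal sketch rather than an official procedure: the function names and file paths are hypothetical, it assumes the TiDB log prints a line containing "Welcome" at startup and tags expensive queries with `[expensive_query]` as described above, and it assumes each slow‑log entry carries a `# Query_time: <seconds>` line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a first pass over TiDB logs.

# Count restarts: per the text, each TiDB start logs a line containing "Welcome".
count_restarts() {
  grep -c 'Welcome' "$1"
}

# Print expensive-query entries so they can be lined up with restart times.
list_expensive_queries() {
  grep -F '[expensive_query]' "$1"
}

# Print slow-log Query_time values (in seconds) above a threshold.
# Assumes slow-log lines of the form "# Query_time: 12.34".
slow_queries_over() {
  awk -v limit="$2" \
    '$1 == "#" && $2 == "Query_time:" && $3 + 0 > limit + 0 { print $3 }' "$1"
}
```

For example, `count_restarts tidb.log` followed by `slow_queries_over tidb_slow_query.log 10` gives a quick idea of whether restarts coincide with long-running statements. The file names here are placeholders for wherever the cluster writes its logs.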
Problems when providing logs:
Large TiKV clusters may lack a centralized log collection platform, forcing manual remote login and file hunting across many nodes.
Long‑duration logs become huge, making transfer to support cumbersome.
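One way to keep log transfers manageable is to cut each log down to the window around the incident, for example the hour before and after that the support template asks for. The sketch below is an illustration under assumptions: `log_window` is a hypothetical helper, and it assumes every log line starts with a timestamp of the form `[YYYY/MM/DD HH:MM:SS.mmm +ZZ:ZZ]` in a single time zone. Continuation lines without a timestamp, such as stack traces, are compared as plain strings and may be dropped.

```shell
#!/usr/bin/env bash
# Keep only log lines whose timestamp falls inside [from, to].
# Usage: log_window FILE 'YYYY/MM/DD HH:MM:SS' 'YYYY/MM/DD HH:MM:SS'
log_window() {
  awk -v from="$2" -v to="$3" '
    { ts = substr($0, 2, 19) }   # characters 2..20 hold "YYYY/MM/DD HH:MM:SS"
    ts >= from && ts <= to       # string comparison works for this fixed-width format
  ' "$1"
}
```

Piping the result through `gzip`, as in `log_window tikv.log '2022/05/01 09:00:00' '2022/05/01 11:00:00' | gzip > tikv_window.log.gz`, shrinks the file further before it is sent to support.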
3. Security Concerns
Posting Grafana screenshots or logs on public platforms like asktug can expose sensitive cluster information, even after masking, leading to privacy worries.
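To illustrate what manual masking typically looks like, and why it remains incomplete, here is a minimal sketch that replaces IPv4-looking addresses in a log before it is shared. `mask_ips` is a hypothetical helper. It only catches dotted-quad patterns, so host names, user names, and SQL text still pass through unmasked.

```shell
#!/usr/bin/env bash
# Replace every IPv4-looking address with a placeholder.
# Reads the given files, or standard input when no file is given.
mask_ips() {
  sed -E 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g' "$@"
}
```

For example, `mask_ips tidb.log > tidb_masked.log` writes a copy with addresses hidden while keeping port numbers, which are often still needed to tell components apart.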
PingCAP Clinic
PingCAP now offers the Clinic diagnostic service, which automatically collects cluster metrics and uploads them to a secure server. Users only need to share the generated download URL with support; the data is visible only to official staff and is deleted 90 days after case closure, mitigating security risks.
Clinic consists of two components:
Diag – the data‑collection client deployed on the cluster.
Clinic Server – receives uploaded diagnostic data.
1. What Diag Collects
(1) Cluster information – basic cluster details, hardware specs, kernel parameters, obtained via tiup cluster audit/display/edit-config and tiup exec --command for system insights.
(2) TiDB component configurations and logs – TiDB, PD, TiKV, TiFlash, TiCDC, DM logs and configs are fetched via SCP from each node.
(3) Monitoring data – alerts and metrics from Prometheus HTTP API, plus HTTP endpoints exposed by TiDB components for real‑time sampling.
2. Using Diag to Collect Cluster Data
Install the Diag client on the TiUP control machine:
<code>tiup install diag
download https://tiup-mirrors.pingcap.com/diag-v0.7.1-linux-amd64.tar.gz 17.57 MiB / 17.57 MiB 100.00% 10.74 MiB/s</code>
Collect diagnostic data (by default, roughly the last two hours):
<code>tiup diag collect ${cluster-name}
# The command shows what will be collected, size, and output path.
# After collection you will see directories such as:
cluster_audit
cluster.json
tiup_diag_audit.log
meta.yaml
monitor/ # Grafana dashboard JSON files
<hostname>/ # Component logs and system configs
datax/ # TiDB component configs and logs
dmesg.log # Kernel logs (OOM, hardware faults)
insight.json # System and hardware basics
limits.conf # System limits
ss.txt # Network status
sysctl.conf # Kernel parameters
</code>
Use --from and --to to specify a custom time range.
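A small dry-run wrapper can make the time-range options easier to review before anything is collected. This is a sketch under assumptions: `diag_collect_cmd` is a hypothetical helper that only prints the command, and the relative values `-4h` and `-2h` are assumed to be accepted by `--from` and `--to`. Confirm the supported time formats with `tiup diag collect --help` for your Diag version.

```shell
#!/usr/bin/env bash
# Print (do not run) a collect command for an explicit time window.
# $1 = cluster name, $2 = start of window, $3 = end of window.
diag_collect_cmd() {
  printf 'tiup diag collect %s --from="%s" --to="%s"\n' "$1" "$2" "$3"
}

# Example: the window from four hours ago to two hours ago
# (relative-time syntax is an assumption, see the note above).
diag_collect_cmd my-cluster -4h -2h
```

Printing the command first lets you check the window against the time of the incident, then copy the line into the terminal on the TiUP control machine.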
Upload the collected data to the Clinic Server:
<code>tiup diag upload ${filepath}</code>
After the upload completes, Diag prints a download URL that you can share with support staff.
3. Additional Clinic Capabilities
Preserve cluster state for later offline analysis when immediate troubleshooting is not possible.
Routine health checks: in its Technical Preview, Clinic can inspect configuration items, flag unreasonable settings, and suggest fixes.
Collect diagnostic data for DM clusters, including dm‑master/worker logs, configs, and hardware info (use tiup diag collectdm ${cluster-name}).
For more details, see the official documentation: https://docs.pingcap.com/zh/tidb/v6.0/clinic-data-instruction-for-tiup
Xiaolei Talks DB
Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.