Deployment, Optimization, and Management of TiDB Service in 360 Zhihui Cloud
This article details the product models, usage scenarios, and a series of performance and operational optimizations—including query plan health checks, space reclamation, resource isolation, cloud‑native deployment, cross‑region high availability, and unified monitoring—implemented for the TiDB service operated by 360 Zhihui Cloud since its launch in April 2023.
Since its official launch in April 2023, the 360 Zhihui Cloud middleware team has been continuously exploring and optimizing the TiDB service, taking it from an open‑source project to a stable production service. This article introduces the product forms, application scenarios, and optimization measures of the Zhihui Cloud TiDB service.
1. Product Forms
The TiDB service provided by the Zhihui Cloud middleware team aims to offer a high‑availability, massive‑storage, strongly consistent, easy‑maintenance, and analytically powerful MySQL‑compatible database for the entire group. It solves sharding difficulties caused by large data volumes, avoids massive architectural changes, and meets the urgent need for analytical data warehouses, while leveraging TiDB’s built‑in HA to compensate for the lack of HA in local disks used by K8s cloud‑native deployments.
Three product forms are offered:
Dedicated: for large‑scale exclusive workloads, supporting automatic scaling.
Shared: for small workloads, ensuring resource isolation.
Cloud‑Native: built on K8s and TiDB‑Operator for rapid delivery.
2. Dedicated Type
The dedicated TiDB cluster serves large businesses with exclusive resources and auto‑scaling capabilities.
Key components include:
LoadBalancer for even traffic distribution and node failure tolerance.
DM for MySQL‑to‑TiDB migration.
TiCDC for cluster‑level HA and TiDB‑to‑MySQL replication, enabling failover.
BR for full and incremental backups via S3.
V‑Metrics and Grafana for unified monitoring and alert convergence.
Deployment rules that avoid placing replicas of the same Region on the same physical host.
(Figure: physical machines hosting the TiDB clusters.)
2.1 TiDB Query Optimization
To ensure stable query execution, a health‑check mechanism is built in: tables with statistics health below 95% are re‑analyzed, preventing stale execution plans from causing large latency spikes, OOM, or killed tasks, while the number of concurrent analyze threads is capped to protect service stability.
#### Collect table health for online tables and store it in a metadata table ####
SHOW STATS_HEALTHY WHERE db_name='$dbname' AND table_name='$table_name';
#### Analyze tables with health < 95% ####
ANALYZE TABLE $table_name;

2.2 TiDB Space Reclamation Optimization
Frequent large‑scale deletes left space unreclaimed: RocksDB only writes tombstones for deleted records and frees the space during compaction, so many near‑empty Regions lingered. By raising the PD scheduling parameters max-merge-region-keys from 200000 to 500000 and max-merge-region-size from 20 MiB to 100 MiB, Regions merge earlier and space is freed sooner.
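One way to apply these thresholds at runtime is through pd-ctl; a sketch, assuming a TiUP‑managed cluster and a placeholder PD endpoint:

```shell
# Raise the Region-merge thresholds via pd-ctl (PD address is a placeholder).
tiup ctl:v7.1.1 pd -u http://127.0.0.1:2379 config set max-merge-region-size 100
tiup ctl:v7.1.1 pd -u http://127.0.0.1:2379 config set max-merge-region-keys 500000
```

The size value is given in MiB; both settings take effect without restarting PD.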
After the adjustment, the primary and standby clusters have comparable data sizes, and query/delete latency dropped dramatically.
3. Shared Type
The shared model uses TiDB resource control to let multiple small‑scale users share one TiDB cluster while preserving isolation; clusters in this form run versions no lower than 7.5.
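As an illustration, a tenant in this model might be provisioned with resource control as follows (the group name, quota, and user are placeholders, not the platform's actual configuration):

```sql
-- Create a resource group capped at 10000 RU per second.
CREATE RESOURCE GROUP IF NOT EXISTS rg1 RU_PER_SEC = 10000;
-- Bind a tenant user to the group so its queries consume that quota.
ALTER USER 'tenant_a'@'%' RESOURCE GROUP rg1;
```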
Workload testing in mixed read/write mode establishes a stable RU‑to‑QPS conversion of roughly 4 RU per query (about 4:1), enabling reliable RU budgeting.
| Test Group | Allocated RU | RU Threshold Met | Average QPS |
| --- | --- | --- | --- |
| rg1 | 10000 | Yes | 2484.51 |
| rg2 | 7000 | Yes | 1745.59 |
| rg3 | 4000 | Yes | 994.96 |
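With the measured ratio, RU budgeting reduces to a simple calculation; a minimal sketch in Python (the 4 RU/query constant and the `ru_budget` helper are illustrative assumptions, and the ratio should be re‑measured for each workload):

```python
RU_PER_QUERY = 4.0  # measured RU-to-QPS ratio from the mixed read/write test


def ru_budget(target_qps: float, headroom: float = 1.2) -> int:
    """Estimate the RU quota for a resource group from a target QPS.

    `headroom` adds a safety margin on top of the measured ratio.
    """
    return round(target_qps * RU_PER_QUERY * headroom)


if __name__ == "__main__":
    # A tenant expecting ~2500 QPS gets a 12000 RU/s quota with 20% headroom.
    print(ru_budget(2500))
```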
4. Cloud‑Native
The cloud‑native shape builds TiDB on Kubernetes using TiDB‑Operator, providing flexible, on‑demand database services and abstracting underlying infrastructure management.
4.1 TiDB Persistent Volumes
Four PV options were evaluated; OpenEBS was chosen for its dynamic local‑PV capability, simplicity, and cost‑effectiveness.
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-hostpath
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      #hostpath type will create a PV by creating a sub-directory under the BASEPATH
      - name: StorageType
        value: "hostpath"
      - name: BasePath
        value: "/data1/"
provisioner: openebs.io/local
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Delete

Example TidbCluster manifest (basic configuration):
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: basic
spec:
  version: v7.1.1
  timezone: UTC
  pvReclaimPolicy: Retain
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  pd:
    baseImage: uhub.service.ucloud.cn/pingcap/pd
    maxFailoverCount: 0
    replicas: 2
    storageClassName: openebs-hostpath
    requests:
      storage: "5Gi"
  tikv:
    baseImage: uhub.service.ucloud.cn/pingcap/tikv
    maxFailoverCount: 0
    replicas: 3
    storageClassName: openebs-hostpath
    requests:
      storage: "5Gi"
    config:
      storage:
        reserve-space: "0MB"
      rocksdb:
        max-open-files: 256
      raftdb:
        max-open-files: 256

Running PVC list (example):
# kubectl get pvc | grep openebs
pd-basic-pd-0       Bound   pvc-c11078de-...   5Gi   RWO   openebs-hostpath   91d
pd-basic-pd-1       Bound   pvc-24aa77a6-...   5Gi   RWO   openebs-hostpath   91d
tikv-basic-tikv-0   Bound   pvc-2e150942-...   5Gi   RWO   openebs-hostpath   91d
...

5. Cross‑Region High Availability
Multiple TiDB clusters are deployed in northern and southern regions and synchronized via TiCDC.
5.1 Latency Optimization
To reduce inter‑region latency, the team applied measures such as controlling large upstream transactions, scaling out TiCDC, increasing per-table-memory-quota, enabling cross‑node table synchronization, colocating TiCDC with the downstream clusters, and configuring early‑warning alerts.
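For context, a north‑to‑south changefeed of the kind described above could be created with the TiCDC CLI; a sketch in which the PD address, sink URI, and changefeed ID are all placeholders:

```shell
# Create a changefeed replicating the northern cluster to the southern one.
cdc cli changefeed create \
  --pd=http://north-pd:2379 \
  --sink-uri="mysql://user:password@south-tidb:4000/" \
  --changefeed-id="north-to-south"
```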
6. Multi‑Active Across Data Centers
Three‑center same‑city deployment ensures data‑center‑level HA. Location labels zone and host are set, and placement policies direct leaders to the preferred zone.
## PD service configuration ##
pd:
  enable-tcp4-only: true
  replication.location-labels:
    - zone
    - host

## TiKV service configuration ##
tikv_servers:
  - host: xxxxxxx
    ssh_port: 22
    port: xxxxx
    status_port: xxxx
    deploy_dir: xxxxxx
    data_dir: xxxxxx
    log_dir: xxxxx
    config:
      server.labels:
        host: xxxx.xx.xxx.xxxx
        zone: pdc

Placement policy example:
CREATE PLACEMENT POLICY pdc_leader_policy LEADER_CONSTRAINTS="[+zone=pdc]";

These rules reduce average query latency from roughly 20 ms to roughly 1 ms.
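A placement policy takes effect only once it is attached to a database, table, or partition; a minimal sketch (the table name t1 is a placeholder):

```sql
-- Route this table's Region leaders to the pdc zone via the policy above.
ALTER TABLE t1 PLACEMENT POLICY = pdc_leader_policy;
```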
7. Monitoring and Alerting
Instead of using TiDB’s built‑in monitoring, all metrics are integrated into the Zhihui Cloud DBA monitoring platform, providing a unified dashboard and flexible alert configuration, which reduces resource waste and improves operational efficiency.
8. Summary
The 360 Zhihui Cloud middleware team has successfully deployed and optimized the TiDB service across dedicated, shared, and cloud‑native forms, meeting diverse business needs. Custom optimizations such as query‑plan health checks, space‑reclamation tuning, RU‑to‑QPS management, cross‑region HA, multi‑active data‑center deployment, and unified monitoring have markedly improved resource efficiency, query speed, and operational stability. The service now runs 20 clusters, manages over 135 TiB of data, and continues to evolve with TiDB's ongoing enhancements.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.