tRPC Scaffolding Tooling and Observability Best Practices for Tencent Docs Backend
By introducing the unified tRPC scaffolding tool trpcx and embedding OpenTelemetry‑generated observability configurations, the Tencent Docs backend team streamlined service creation, standardized directory structures, migrated metrics and logs to ClickHouse for cost‑effective performance, and established best‑practice workflows that dramatically improve development speed and fault‑diagnosis efficiency.
This article describes how the Tencent Docs backend team improved development efficiency by adopting a standardized tRPC scaffolding tool (trpcx) and enhancing observability using OpenTelemetry and the internal Tianji Ge platform.
1. Background – The backend services suffered from heterogeneous frameworks, duplicated middleware, unclear service boundaries, inconsistent directory structures, and under‑utilized observability, leading to high maintenance cost and low fault‑diagnosis efficiency.
2. tRPC Scaffolding Tool Construction
2.1 trpcx.NewServer() – A unified server starter that reads the local trpc.yaml configuration; running `./create_app.sh <module> <service>` generates all required code, directories, and startup scripts automatically.
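Conceptually, the unified starter boils down to: read the local trpc.yaml, discover the declared services, and register them. Below is a toy, stdlib-only sketch of that idea; the service names are hypothetical, and a real implementation would use a YAML library and the tRPC framework rather than hand parsing.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical trpc.yaml fragment; server.service[].name entries declare
// the services the starter should wire up.
const trpcYAML = `
server:
  service:
    - name: trpc.doc.comment.Comment
    - name: trpc.doc.comment.Admin
`

// serviceNames extracts the declared service names with a deliberately
// minimal line scan (a stand-in for real YAML parsing).
func serviceNames(yaml string) []string {
	var names []string
	for _, line := range strings.Split(yaml, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "- name:") {
			names = append(names, strings.TrimSpace(strings.TrimPrefix(line, "- name:")))
		}
	}
	return names
}

func main() {
	// Where trpcx.NewServer() would register handlers, this sketch
	// just reports what it found.
	for _, n := range serviceNames(trpcYAML) {
		fmt.Println("registering service:", n)
	}
}
```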
2.2 Plugin Integration – Common plugins such as dyeing (environment routing), trpctelemetry (trace/metric/log reporting), telemetryx (attribute extraction), apiheader (login state parsing), and checkLoginStatus are now unified in the scaffolding, reducing the need for manual imports.
2.3 OpenTelemetry YAML Generation – The scaffolding creates an opentelemetry.yml file following the “Git as Code” and “Observability as Code” principles, allowing custom PromQL alert rules.
```yaml
version: v1
owners:
  - name: rtx_name_1
  - name: rtx_name_2
resource:
  tenant: tenantName
  app: appName
  server: serverName
  cloud:
    provider: provider
    platform: "platform"
alert:
  items:
    - alert: Caller exception rate > 5%
      metric: client_request_exception_rate_percent
      type: max
      threshold: 5
    - alert: Callee exception rate > 5%
      metric: server_handled_exception_rate_percent
      type: max
      threshold: 5
    - alert: Caller request volume 5-minute fluctuation > 70%
      metric: client_request_count
      type: wave
      threshold: 70
    - alert: Callee request volume 5-minute fluctuation > 70%
      metric: server_handled_exception_rate_percent
      type: wave
      threshold: 70
    - alert: Unpaired small cards filtered > 5
      metric: "小卡不成对被过滤"
      type: max
      threshold: 5
metric:
  codes:
    - code: 0
      type: success
      description: "Success"
    - code: 200
      type: success
      description: "Success"
    - code: 1001
      type: exception
      description: "Authentication exception"
    - code: 9999
      type: timeout
      description: "Request timeout"
      service: # optional
      method: # optional
```
2.4 Directory Structure – Enforces a high‑cohesion, low‑coupling layout with separate Service (entry), Logic (business), and Repo (network/DB) layers.
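The `wave`-type alerts in the configuration above fire on the relative fluctuation of a metric over a 5-minute window. A minimal sketch of that semantics follows; the exact formula the platform uses is an assumption (here, absolute percentage change of the current window versus the previous one).

```go
package main

import "fmt"

// waveFires reports whether a "wave" alert should fire: the absolute
// percentage change of the current 5-minute window relative to the
// previous one exceeds the threshold. Semantics inferred from the
// config, not taken from platform documentation.
func waveFires(prev, cur, thresholdPct float64) bool {
	if prev == 0 {
		// No baseline: treat any traffic appearing from zero as a spike.
		return cur > 0
	}
	change := (cur - prev) / prev * 100
	if change < 0 {
		change = -change
	}
	return change > thresholdPct
}

func main() {
	fmt.Println(waveFires(1000, 250, 70)) // 75% drop exceeds the 70% threshold
	fmt.Println(waveFires(1000, 900, 70)) // 10% dip does not
}
```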
3. ClickHouse Migration for Document Tenants
The team migrated observability data from Elasticsearch to ClickHouse, achieving higher write throughput, better compression (Traces 7:1, Logs 11:1), and roughly 50% cost reduction. The migration proceeded in three phases: canary (gray-release) verification with 10% of traffic, a full dual-write copy, and the final cut-over.
Performance metrics and cost calculations are presented, showing that a 50‑node ClickHouse cluster can handle the projected load while halving the original ES node count.
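The compression ratios above make the storage savings easy to estimate. The sketch below is a back-of-envelope calculation only; the raw daily volumes are hypothetical placeholders, not figures from the article.

```go
package main

import "fmt"

// storedGB estimates on-disk size after compression at the given
// ratio (e.g. 7 for the article's 7:1 trace compression).
func storedGB(rawGB, ratio float64) float64 {
	return rawGB / ratio
}

func main() {
	// Hypothetical daily raw volumes, chosen only to illustrate the math.
	fmt.Printf("traces: %.0f GB raw -> %.0f GB stored\n", 7000.0, storedGB(7000, 7))
	fmt.Printf("logs:   %.0f GB raw -> %.0f GB stored\n", 11000.0, storedGB(11000, 11))
}
```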
4. Observability Best Practices
4.1 Definition – Observability is the ability to infer internal system state from external outputs (traces, metrics, logs).
4.2 Data Models – Traces (distributed request flow), Metrics (aggregated numeric data), Logs (structured event records).
4.3 OpenTelemetry – A unified collection framework for all three signals.
4.4 Integration with Tianji Ge – The platform implements CNCF‑compatible OpenTelemetry, providing alerting, dashboards, and root‑cause analysis.
4.5 Current Issues – Over‑reliance on Traces, redundant log embedding, under‑use of Metrics and Logs, and lack of clear distinction among the three signals.
4.6 Service Integration – By adding the generated opentelemetry.yml to the repository, services automatically gain full observability without extra configuration.
4.7 Root‑Cause Analysis – Demonstrates how to navigate from a high‑level metric alert to a specific trace, then to the associated log, pinpointing the failure (e.g., missing Redis data).
4.8 Troubleshooting Workflow – Provides step‑by‑step methods to locate issues via trace‑index → trace‑detail, trace‑detail → log‑detail, or UID‑based log queries.
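The trace-detail → log-detail hop in the workflow above hinges on logs carrying the same trace_id as the failing span. The sketch below illustrates that join with an in-memory slice standing in for a real log store query (e.g. a ClickHouse lookup); all names are hypothetical.

```go
package main

import "fmt"

// logLine is a minimal stand-in for a structured log record that
// carries the trace identifier of the request that emitted it.
type logLine struct {
	TraceID string
	Msg     string
}

// logsByTraceID returns the messages of all log lines emitted under
// the given trace, preserving their original order.
func logsByTraceID(store []logLine, traceID string) []string {
	var out []string
	for _, l := range store {
		if l.TraceID == traceID {
			out = append(out, l.Msg)
		}
	}
	return out
}

func main() {
	store := []logLine{
		{"abc123", "redis get key=doc:42 miss"},
		{"def456", "ok"},
		{"abc123", "fallback to db failed"},
	}
	// Having found trace abc123 in the trace index, pull its logs.
	for _, m := range logsByTraceID(store, "abc123") {
		fmt.Println(m)
	}
}
```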
The article concludes with practical guidance on using the Tianji Ge dashboards, alert rules, and the unified scaffolding to maintain a healthy, observable backend system.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.