tRPC Scaffolding Tooling and Observability Best Practices for Tencent Docs Backend
By introducing the unified tRPC scaffolding tool trpcx and embedding OpenTelemetry‑generated observability configurations, the Tencent Docs backend team streamlined service creation, standardized directory structures, migrated metrics and logs to ClickHouse for cost‑effective performance, and established best‑practice workflows that dramatically improve development speed and fault‑diagnosis efficiency.
This article describes how the Tencent Docs backend team improved development efficiency by adopting a standardized tRPC scaffolding tool (trpcx) and enhancing observability using OpenTelemetry and the internal Tianji Ge platform.
1. Background – The backend services suffered from heterogeneous frameworks, duplicated middleware, unclear service boundaries, inconsistent directory structures, and under‑utilized observability, leading to high maintenance cost and low fault‑diagnosis efficiency.
2. tRPC Scaffolding Tool Construction
2.1 trpcx.NewServer() – A unified server starter that reads the local trpc.yaml configuration; running `./create_app.sh <module> <service>` generates all required code, directories, and startup scripts automatically.
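Conceptually, the unified starter boils down to: read the local trpc.yaml, discover the declared services, and register them. Below is a toy, stdlib-only sketch of that idea; the service names are hypothetical, and a real implementation would use a YAML library and the tRPC framework rather than hand parsing.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical trpc.yaml fragment; server.service[].name entries declare
// the services the starter should wire up.
const trpcYAML = `
server:
  service:
    - name: trpc.doc.comment.Comment
    - name: trpc.doc.comment.Admin
`

// serviceNames extracts the declared service names with a deliberately
// minimal line scan (a stand-in for real YAML parsing).
func serviceNames(yaml string) []string {
	var names []string
	for _, line := range strings.Split(yaml, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "- name:") {
			names = append(names, strings.TrimSpace(strings.TrimPrefix(line, "- name:")))
		}
	}
	return names
}

func main() {
	// Where trpcx.NewServer() would register handlers, this sketch
	// just reports what it found.
	for _, n := range serviceNames(trpcYAML) {
		fmt.Println("registering service:", n)
	}
}
```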
2.2 Plugin Integration – Common plugins such as dyeing (environment routing), trpctelemetry (trace/metric/log reporting), telemetryx (attribute extraction), apiheader (login state parsing), and checkLoginStatus are now unified in the scaffolding, reducing the need for manual imports.
2.3 OpenTelemetry YAML Generation – The scaffolding creates an opentelemetry.yml file following the “Git as Code” and “Observability as Code” principles, allowing custom PromQL alert rules.
```yaml
version: v1
owners:
  - name: rtx_name_1
  - name: rtx_name_2
resource:
  tenant: tenantName
  app: appName
  server: serverName
  cloud:
    provider: provider
    platform: "platform"
alert:
  items:
    - alert: Caller exception rate > 5%
      metric: client_request_exception_rate_percent
      type: max
      threshold: 5
    - alert: Callee exception rate > 5%
      metric: server_handled_exception_rate_percent
      type: max
      threshold: 5
    - alert: Caller request volume 5-minute fluctuation > 70%
      metric: client_request_count
      type: wave
      threshold: 70
    - alert: Callee request volume 5-minute fluctuation > 70%
      metric: server_handled_exception_rate_percent
      type: wave
      threshold: 70
    - alert: Unpaired small cards filtered > 5
      metric: "小卡不成对被过滤"
      type: max
      threshold: 5
metric:
  codes:
    - code: 0
      type: success
      description: "Success"
    - code: 200
      type: success
      description: "Success"
    - code: 1001
      type: exception
      description: "Authentication exception"
    - code: 9999
      type: timeout
      description: "Request timeout"
      service: # optional
      method: # optional
```
2.4 Directory Structure – Enforces a high‑cohesion, low‑coupling layout with separate Service (entry), Logic (business), and Repo (network/DB) layers.
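The `wave`-type alerts in the configuration above fire on the relative fluctuation of a metric over a 5-minute window. A minimal sketch of that semantics follows; the exact formula the platform uses is an assumption (here, absolute percentage change of the current window versus the previous one).

```go
package main

import "fmt"

// waveFires reports whether a "wave" alert should fire: the absolute
// percentage change of the current 5-minute window relative to the
// previous one exceeds the threshold. Semantics inferred from the
// config, not taken from platform documentation.
func waveFires(prev, cur, thresholdPct float64) bool {
	if prev == 0 {
		// No baseline: treat any traffic appearing from zero as a spike.
		return cur > 0
	}
	change := (cur - prev) / prev * 100
	if change < 0 {
		change = -change
	}
	return change > thresholdPct
}

func main() {
	fmt.Println(waveFires(1000, 250, 70)) // 75% drop exceeds the 70% threshold
	fmt.Println(waveFires(1000, 900, 70)) // 10% dip does not
}
```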
3. ClickHouse Migration for Document Tenants
The team migrated observability data from Elasticsearch to ClickHouse, achieving higher write throughput, better compression (Traces 7:1, Logs 11:1), and roughly 50% cost reduction. The migration proceeded in three phases: canary (gray-release) verification with 10% of traffic, a full dual-write copy, and the final cut-over.
Performance metrics and cost calculations are presented, showing that a 50‑node ClickHouse cluster can handle the projected load while halving the original ES node count.
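The compression ratios above make the storage savings easy to estimate. The sketch below is a back-of-envelope calculation only; the raw daily volumes are hypothetical placeholders, not figures from the article.

```go
package main

import "fmt"

// storedGB estimates on-disk size after compression at the given
// ratio (e.g. 7 for the article's 7:1 trace compression).
func storedGB(rawGB, ratio float64) float64 {
	return rawGB / ratio
}

func main() {
	// Hypothetical daily raw volumes, chosen only to illustrate the math.
	fmt.Printf("traces: %.0f GB raw -> %.0f GB stored\n", 7000.0, storedGB(7000, 7))
	fmt.Printf("logs:   %.0f GB raw -> %.0f GB stored\n", 11000.0, storedGB(11000, 11))
}
```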
4. Observability Best Practices
4.1 Definition – Observability is the ability to infer internal system state from external outputs (traces, metrics, logs).
4.2 Data Models – Traces (distributed request flow), Metrics (aggregated numeric data), Logs (structured event records).
4.3 OpenTelemetry – A unified collection framework for all three signals.
4.4 Integration with Tianji Ge – The platform implements CNCF‑compatible OpenTelemetry, providing alerting, dashboards, and root‑cause analysis.
4.5 Current Issues – Over‑reliance on Traces, redundant log embedding, under‑use of Metrics and Logs, and lack of clear distinction among the three signals.
4.6 Service Integration – By adding the generated opentelemetry.yml to the repository, services automatically gain full observability without extra configuration.
4.7 Root‑Cause Analysis – Demonstrates how to navigate from a high‑level metric alert to a specific trace, then to the associated log, pinpointing the failure (e.g., missing Redis data).
4.8 Troubleshooting Workflow – Provides step‑by‑step methods to locate issues via trace‑index → trace‑detail, trace‑detail → log‑detail, or UID‑based log queries.
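The trace-detail → log-detail hop in the workflow above hinges on logs carrying the same trace_id as the failing span. The sketch below illustrates that join with an in-memory slice standing in for a real log store query (e.g. a ClickHouse lookup); all names are hypothetical.

```go
package main

import "fmt"

// logLine is a minimal stand-in for a structured log record that
// carries the trace identifier of the request that emitted it.
type logLine struct {
	TraceID string
	Msg     string
}

// logsByTraceID returns the messages of all log lines emitted under
// the given trace, preserving their original order.
func logsByTraceID(store []logLine, traceID string) []string {
	var out []string
	for _, l := range store {
		if l.TraceID == traceID {
			out = append(out, l.Msg)
		}
	}
	return out
}

func main() {
	store := []logLine{
		{"abc123", "redis get key=doc:42 miss"},
		{"def456", "ok"},
		{"abc123", "fallback to db failed"},
	}
	// Having found trace abc123 in the trace index, pull its logs.
	for _, m := range logsByTraceID(store, "abc123") {
		fmt.Println(m)
	}
}
```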
The article concludes with practical guidance on using the Tianji Ge dashboards, alert rules, and the unified scaffolding to maintain a healthy, observable backend system.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.