
Why Middleware Client APIs Crash: Real‑World Tcaplus and Pulsar Debugging Stories

This article explains the complexity of middleware client APIs, shares two real external‑network failures (a Tcaplus callback coredump and a Pulsar Go deadlock), analyzes their root causes, and outlines practical design guidelines for building clear, asynchronous, fault‑tolerant, and maintainable backend APIs.

Tencent Cloud Developer

Middleware client API overview

Client APIs for middleware services such as message queues and ZooKeeper are feature‑rich: they must handle network connection management, master‑slave selection, asynchronous processing, and local caching.

Establish and maintain network links, with reconnection on failures.

Implement master‑slave selection when the service does not provide reverse proxy.

Design asynchronous mechanisms (threads, coroutines, callbacks) to keep application throughput high.

Cache data locally (e.g., queue messages) and delete after remote ACK.
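The last item above, caching locally and deleting only after the remote ACK, can be sketched in Go as follows. This is a minimal illustration rather than any real client's code: `PendingBuffer` and its methods are hypothetical names, and the network send itself is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// PendingBuffer keeps locally cached messages until the remote side ACKs them.
// The type and its method names are hypothetical.
type PendingBuffer struct {
	nextSeq uint64
	msgs    map[uint64]string
}

func NewPendingBuffer() *PendingBuffer {
	return &PendingBuffer{msgs: make(map[uint64]string)}
}

// Add caches a message and returns the sequence number to send with it.
func (b *PendingBuffer) Add(payload string) uint64 {
	b.nextSeq++
	b.msgs[b.nextSeq] = payload
	return b.nextSeq
}

// Ack deletes a message once the remote side confirms it.
// It reports false for unknown or already acknowledged sequence numbers.
func (b *PendingBuffer) Ack(seq uint64) bool {
	if _, ok := b.msgs[seq]; !ok {
		return false
	}
	delete(b.msgs, seq)
	return true
}

// Unacked lists the sequence numbers still waiting for an ACK, in send order,
// for example so they can be resent after a reconnect.
func (b *PendingBuffer) Unacked() []uint64 {
	seqs := make([]uint64, 0, len(b.msgs))
	for seq := range b.msgs {
		seqs = append(seqs, seq)
	}
	sort.Slice(seqs, func(i, j int) bool { return seqs[i] < seqs[j] })
	return seqs
}

func main() {
	buf := NewPendingBuffer()
	first := buf.Add("hello")
	buf.Add("world")
	buf.Ack(first)
	fmt.Println("still unacknowledged:", buf.Unacked())
}
```

A message stays in the buffer across reconnects, so nothing is lost if the link drops between the send and the ACK; the price is that the remote side may see duplicates and has to tolerate them.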

Against this background, here are two real problems the author ran into on the external network.

Two external network issues

Issue 1: Tcaplus API callback coredump

Tcaplus (Tencent Cloud’s game‑oriented NoSQL store) invokes a user‑registered callback after receiving a response; the callback caused a coredump.

Stack trace:

[Figure: Tcaplus stack trace]

The illegal map access comes from a wild (dangling) pointer: the callback object kept a pointer to a stack variable, and that variable was destroyed when the coroutine ended.

Fix: clean up callback objects or nullify pointers when the coroutine finishes, or add a validity flag before using the pointer.
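The "nullify the pointer and check validity" fix can be illustrated with a small Go sketch. Two assumptions apply: the original bug is in a C++ coroutine setting, and Go is memory‑safe, so this only models the pattern (a late response must not touch caller state that no longer exists); and `Callback`, `Detach`, `OnResponse`, and `query` are hypothetical names, not the Tcaplus API.

```go
package main

import (
	"fmt"
	"sync"
)

// Callback models a registered callback object that can outlive the request
// that created it. All names here are hypothetical, not the Tcaplus API.
type Callback struct {
	mu     sync.Mutex
	target *string // caller-owned state; nil once the caller has finished
}

// Detach is what the caller runs on every exit path (return, timeout, error).
func (c *Callback) Detach() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.target = nil
}

// OnResponse is invoked by the client library when a response arrives.
// It reports whether the response was delivered to a live caller.
func (c *Callback) OnResponse(data string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.target == nil {
		return false // late response: the caller is gone, so drop it
	}
	*c.target = data
	return true
}

// query simulates a request that gives up before the response arrives.
func query(cb *Callback) {
	var result string
	cb.mu.Lock()
	cb.target = &result
	cb.mu.Unlock()
	defer cb.Detach() // the fix: invalidate the pointer before this frame ends
	// ... wait for the response here, and return early on timeout ...
}

func main() {
	cb := &Callback{}
	query(cb)
	fmt.Println("late response delivered:", cb.OnResponse("row"))
}
```

The important design choice is that invalidation happens on every exit path of the requester, not only on success, so a timeout cannot leave a live callback pointing at dead state.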

Issue 2: Pulsar Go API deadlock

The sidecar writes messages to Pulsar via the Go client API. When a large number of writes timed out, the API deadlocked.

Eventsloop goroutine stack:

[Figure: Pulsar eventsloop goroutine stack]

Timeout‑cleaner goroutine stack:

[Figure: Pulsar timeout‑cleaner goroutine stack]

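Timeline of the Issue 1 coredump (Tcaplus), simplified:

<code>---- A business coroutine
- Runs business logic
- Initializes a tcaplus callback object that holds a pointer to a stack variable
- Sends a query request to tcaplus
- Yields the coroutine stack and waits for the response
- Is woken up again because of a timeout or a similar reason
- The business coroutine ends and its coroutine stack is released; the pointer inside the callback object is now a wild pointer
---- tcaplus response
- The tcaplus response arrives late and finds the callback object, which has not been destroyed yet
- Execution reaches the callback code shown in the stack-trace screenshot above, accesses the wild pointer, and coredumps
</code>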

The deadlock chain is:

The API detects many message‑queue write timeouts.

The timeout‑cleaner goroutine acquires the lock and calls the user callback; the callback calls the API's close, which sends a close command and waits on doneCh.

The eventsloop goroutine receives the close command but needs the same lock to clean up pending messages, so it blocks. doneCh is therefore never signaled, and the timeout‑cleaner goroutine, still waiting on doneCh, never releases the lock.

Fix: avoid calling API close inside the callback, or redesign lock usage to prevent circular waiting.
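One way to redesign the lock usage is to collect expired messages while holding the lock and run user callbacks only after releasing it. The Go sketch below illustrates that idea under stated assumptions: `Producer`, `Send`, `FailTimedOut`, and `Close` are hypothetical stand‑ins rather than the Pulsar Go client's real types, and deadlines are plain millisecond counters to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errTimeout = errors.New("send timed out")

type pendingMsg struct {
	id         int
	deadlineMs int64 // absolute deadline, in milliseconds on a caller-supplied clock
	cb         func(id int, err error)
}

// Producer is a hypothetical stand-in, not the Pulsar client's real type.
type Producer struct {
	mu      sync.Mutex
	pending []pendingMsg
	closed  bool
}

// Send queues a message together with its deadline and completion callback.
func (p *Producer) Send(id int, deadlineMs int64, cb func(id int, err error)) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.pending = append(p.pending, pendingMsg{id: id, deadlineMs: deadlineMs, cb: cb})
}

// Close needs the same lock that the timeout cleaner uses.
func (p *Producer) Close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.closed = true
	p.pending = nil
}

// FailTimedOut removes expired messages under the lock, then runs their
// callbacks after the lock is released, so a callback may safely call Close.
func (p *Producer) FailTimedOut(nowMs int64) int {
	p.mu.Lock()
	var expired, kept []pendingMsg
	for _, m := range p.pending {
		if nowMs < m.deadlineMs {
			kept = append(kept, m)
		} else {
			expired = append(expired, m)
		}
	}
	p.pending = kept
	p.mu.Unlock()

	for _, m := range expired {
		m.cb(m.id, errTimeout) // user code runs with no internal lock held
	}
	return len(expired)
}

func main() {
	p := &Producer{}
	p.Send(1, 100, func(id int, err error) {
		fmt.Println("message", id, "failed:", err)
		p.Close() // would block forever if callbacks ran while p.mu was held
	})
	fmt.Println("expired:", p.FailTimedOut(200))
}
```

Because the callback runs outside the critical section, the circular wait from the deadlock chain cannot form, whatever the user code decides to call.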

Design points for middleware client APIs

Key considerations include:

Clear, minimal interfaces (initialization, read/write, cleanup, optional logging/debug callbacks).

Driver mechanisms: business‑driven (explicit Tick) vs internal runtime (event loop).

Asynchronous callbacks: provide both polling and push‑based models, but keep callback code lightweight to avoid re‑entrancy issues.

Network concurrency: batch requests, sequence IDs for request‑response correlation, and ordered‑message handling.

Re‑entrancy: keep lock scope small, avoid blocking inside locks.

Fault tolerance: unified error codes, automatic reconnection, optional caching and retry policies.

Restart‑safe and crash‑safe behavior (graceful shutdown, WAL‑based persistence when needed).

Configuration refresh without data loss, logging hooks, and internal metrics exposure.
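Two of the points above, the business‑driven Tick and sequence IDs for request‑response correlation, combine naturally. The Go sketch below is one possible shape under stated assumptions: `Client`, `Request`, `Deliver`, and `Tick` are hypothetical names, and `Deliver` stands in for a real network layer.

```go
package main

import "fmt"

type response struct {
	seq  uint64
	body string
}

// Client is a hypothetical business-driven client: it starts no goroutines,
// and callbacks only run inside Tick, on the caller's own goroutine.
type Client struct {
	nextSeq uint64
	waiting map[uint64]func(body string)
	inbox   []response // filled by the network layer; simulated by Deliver
}

func NewClient() *Client {
	return &Client{waiting: make(map[uint64]func(body string))}
}

// Request registers a callback and returns the sequence ID sent with the request.
func (c *Client) Request(cb func(body string)) uint64 {
	c.nextSeq++
	c.waiting[c.nextSeq] = cb
	return c.nextSeq
}

// Deliver stands in for the network layer receiving a response packet.
func (c *Client) Deliver(seq uint64, body string) {
	c.inbox = append(c.inbox, response{seq: seq, body: body})
}

// Tick dispatches queued responses by sequence ID and returns how many
// callbacks ran. Unknown or duplicate sequence IDs are dropped.
func (c *Client) Tick() int {
	batch := c.inbox
	c.inbox = nil // detach first so a callback may safely queue more work
	handled := 0
	for _, r := range batch {
		cb, ok := c.waiting[r.seq]
		if !ok {
			continue
		}
		delete(c.waiting, r.seq)
		cb(r.body)
		handled++
	}
	return handled
}

func main() {
	c := NewClient()
	first := c.Request(func(body string) { fmt.Println("first request got", body) })
	second := c.Request(func(body string) { fmt.Println("second request got", body) })
	c.Deliver(second, "B") // responses may arrive out of order
	c.Deliver(first, "A")
	fmt.Println("handled:", c.Tick())
}
```

With an explicit Tick the application decides when callbacks run, which sidesteps most locking and re‑entrancy questions; an internal event loop is more convenient for callers but moves those questions into the library.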

Conclusion

Beyond middleware client APIs, many libraries (e.g., tcmalloc, libcurl) face similar challenges. In practice, aim for three principles: easy‑to‑use interfaces, high‑concurrency performance with robust fault tolerance, and maintainability through configurability, clear logging, and observable metrics.
