How Toutiao Scaled to Millions of QPS with Go‑Powered Microservices

This article chronicles Toutiao’s evolution from a monolithic Python/C++/PHP stack to a large‑scale Go‑based microservice architecture, detailing the reasons for adopting Go, the design of the kite framework, concurrency models, timeout control, performance tuning, monitoring, and a reusable DAO component for efficient RPC aggregation.

21CTO

Toutiao grew from an app with modest daily traffic into a platform handling billions of requests per day. Before mid-2015 the service was built mainly with Python, C++ and PHP, forming a large monolith that became difficult to scale.

Microservice Evolution

To address coupling and complexity, Toutiao migrated to a Service‑Oriented Architecture (SOA) and eventually to a full microservice architecture. The microservice model brings process decoupling, easier management, self‑containment, deployment independence and automation.

The content publishing system originally used Django and PHP, which introduced bottlenecks in process management. Extracting functionality into independent services was necessary.

Figure: Toutiao microservice architecture overview

Why Go?

Toutiao switched many services to Go because its syntax is simple, compilation is fast, performance is high, it offers native concurrency with goroutine and channel primitives, and deployment packages are small with minimal dependencies.

In June 2015 the team began rewriting the Feed service in Go, completing most of the migration by June 2016. The Go‑based platform now runs hundreds of microservices, peaks at 7 million QPS and processes over 300 billion requests daily.

Microservice Framework – kite

The internal framework kite is fully compatible with the Thrift protocol. It provides service registration and discovery, distributed load balancing, timeout and circuit-breaker management, method-level metrics, and distributed tracing, so that services can be scaled horizontally by adding instances.

Concurrency

Go's concurrency model is based on CSP (Communicating Sequential Processes). Goroutines are lightweight tasks that the Go runtime schedules onto a small number of OS threads, and channels pass data between them, so a single process can run tens of thousands of concurrent tasks. This model simplifies reasoning about parallel logic and improves maintainability.

The classic CSP prime-sieve example (illustrated in the original article) shows how each stage of the pipeline runs in its own goroutine and communicates with its neighbours via channels.
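
The sieve is not reproduced in this summary, so the following is a minimal sketch of the well-known concurrent prime sieve rather than the article's own listing. The function names (generate, filter, firstPrimes) and the done channel used for cleanup are choices made for this illustration.

```go
package main

import "fmt"

// generate sends 2, 3, 4, ... on ch until done is closed.
func generate(ch chan<- int, done <-chan struct{}) {
	for i := 2; ; i++ {
		select {
		case ch <- i:
		case <-done:
			return
		}
	}
}

// filter copies values from in to out, dropping multiples of prime.
func filter(in <-chan int, out chan<- int, prime int, done <-chan struct{}) {
	for {
		select {
		case v := <-in:
			if v%prime == 0 {
				continue
			}
			select {
			case out <- v:
			case <-done:
				return
			}
		case <-done:
			return
		}
	}
}

// firstPrimes returns the first n primes by chaining one filter
// goroutine per prime found so far.
func firstPrimes(n int) []int {
	done := make(chan struct{})
	defer close(done) // stops every stage goroutine when we return

	ch := make(chan int)
	go generate(ch, done)

	primes := make([]int, 0, n)
	for len(primes) < n {
		p := <-ch // the first value to survive all filters is prime
		primes = append(primes, p)
		next := make(chan int)
		go filter(ch, next, p, done)
		ch = next
	}
	return primes
}

func main() {
	fmt.Println(firstPrimes(10))
}
```

Each filter stage knows only its input channel, its output channel and one prime, which is the property the CSP style is meant to show: stages are composed by wiring channels together, not by sharing state.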

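In service code the same idea appears as tasks: RPCs that do not depend on each other run in parallel, and a dependent call follows once its input is available. In Go-style pseudocode:

// Content and comments both need only the content ID (123), so they run in parallel.
contentTask := NewContentInfoTask(123)
commentTask := NewCommentsListTask(123)
ParallelExec(contentTask, commentTask) // parallel RPC calls

// The user lookup needs the content response, so it runs afterwards.
userID := contentTask.Response.User_id
userResp := NewUserTask(userID).Load()
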
Concurrency Control

Two patterns are used: Wait – the main goroutine waits for all parallel RPC calls to finish; Cancel – if a global timeout expires, remaining RPCs are cancelled to avoid resource leakage. Go provides sync.WaitGroup and context.Context to implement these patterns.
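
A minimal sketch of how the two patterns can be combined with the standard library. The rpc type, fakeRPC, callAll and demo are hypothetical names used only for this illustration; they are not part of kite.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// rpc is a hypothetical stand-in for a downstream call.
type rpc func(ctx context.Context) (string, error)

// fakeRPC returns an rpc that answers after delay unless ctx ends first.
func fakeRPC(name string, delay time.Duration) rpc {
	return func(ctx context.Context) (string, error) {
		select {
		case <-time.After(delay):
			return name, nil
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

// callAll runs every call in its own goroutine and waits for all of
// them (Wait). All calls share one deadline; when it expires the
// context is cancelled and unfinished calls return early (Cancel).
func callAll(parent context.Context, timeout time.Duration, calls ...rpc) ([]string, []error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	results := make([]string, len(calls))
	errs := make([]error, len(calls))
	var wg sync.WaitGroup
	for i, c := range calls {
		wg.Add(1)
		go func(i int, c rpc) {
			defer wg.Done()
			results[i], errs[i] = c(ctx) // each goroutine writes only its own slot
		}(i, c)
	}
	wg.Wait()
	return results, errs
}

// demo runs a fast and a slow call under a 50 ms budget.
func demo() (fast string, slowTimedOut bool) {
	res, errs := callAll(context.Background(), 50*time.Millisecond,
		fakeRPC("content", 5*time.Millisecond),
		fakeRPC("comments", 2*time.Second))
	return res[0], errors.Is(errs[1], context.DeadlineExceeded)
}

func main() {
	fmt.Println(demo())
}
```

Cancellation is cooperative: a call stops early only if it watches ctx.Done(), as fakeRPC does here.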

Timeout Control

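Proper timeout settings prevent cascading failures in large call graphs. The article illustrates a request flow where a gateway aggregates results from five downstream services, each with its own timeout. It distinguishes connection, write and read timeouts: in Go's standard library a connection timeout is typically set when dialing (for example with net.DialTimeout), while write and read timeouts are enforced with the SetWriteDeadline and SetReadDeadline methods of net.Conn.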
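
A minimal sketch of per-phase deadlines on a net.Conn. The roundTrip and demo helpers are hypothetical, and net.Pipe stands in for a real TCP connection so that the example runs without a network; connect timeouts are therefore not exercised here.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// roundTrip writes req and reads one reply, giving the write phase and
// the read phase separate deadlines.
func roundTrip(conn net.Conn, req []byte, writeTimeout, readTimeout time.Duration) ([]byte, error) {
	if err := conn.SetWriteDeadline(time.Now().Add(writeTimeout)); err != nil {
		return nil, err
	}
	if _, err := conn.Write(req); err != nil {
		return nil, fmt.Errorf("write: %w", err)
	}

	if err := conn.SetReadDeadline(time.Now().Add(readTimeout)); err != nil {
		return nil, err
	}
	buf := make([]byte, 1024)
	n, err := conn.Read(buf)
	if err != nil {
		return nil, fmt.Errorf("read: %w", err)
	}
	return buf[:n], nil
}

// demo talks to an in-memory peer that consumes the request but never
// replies, so the read deadline is the one that fires.
func demo() (timedOut bool) {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	go func() {
		buf := make([]byte, 1024)
		_, _ = server.Read(buf) // consume the request, then stay silent
	}()

	_, err := roundTrip(client, []byte("ping"), 100*time.Millisecond, 50*time.Millisecond)
	var netErr net.Error
	return errors.As(err, &netErr) && netErr.Timeout()
}

func main() {
	fmt.Println("read timed out:", demo())
}
```

Deadlines are absolute points in time, not durations, so they need to be reset before each new read or write on a reused connection.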

Figure: Timeout control diagram

In the kite client library a “Concurrent Ctrl” module limits the number of simultaneous RPCs and enforces precise timeout boundaries.
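
The internals of kite's module are not described here, so the following is only a sketch of one common way to cap in-flight calls in Go: a buffered channel used as a semaphore, with the wait for a free slot counted against the caller's deadline. Limiter, NewLimiter, ErrBusy and demo are hypothetical names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrBusy is returned when no slot frees up before the context ends.
var ErrBusy = errors.New("limiter: too many in-flight calls")

// Limiter caps concurrent calls with a buffered channel as a semaphore.
type Limiter struct{ slots chan struct{} }

// NewLimiter allows at most max calls to run at the same time.
func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// Do runs call once a slot is free. Time spent waiting for the slot
// counts against ctx, so a caller never waits past its own deadline.
func (l *Limiter) Do(ctx context.Context, call func(context.Context) error) error {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }() // release the slot when call returns
	case <-ctx.Done():
		return ErrBusy
	}
	return call(ctx)
}

// demo fills the only slot with a slow call, then issues a second call
// with a 20 ms budget that cannot get a slot in time.
func demo() (first, second error) {
	l := NewLimiter(1)
	started := make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- l.Do(context.Background(), func(context.Context) error {
			close(started)
			time.Sleep(100 * time.Millisecond)
			return nil
		})
	}()
	<-started // the only slot is now taken

	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	second = l.Do(ctx, func(context.Context) error { return nil })
	return <-done, second
}

func main() {
	fmt.Println(demo())
}
```

Rejecting the second call quickly, instead of queueing it without bound, is what keeps an overloaded downstream service from stalling its callers.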

Performance

Go outperforms many traditional web back-ends, but developers must still profile and tune services. Built-in tools such as pprof (CPU and memory profiles, goroutine stack dumps) and the execution tracer help identify bottlenecks.
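
As a small illustration, the standard runtime/pprof package can capture a CPU profile and a goroutine stack dump from inside a process. The helper names cpuProfile and goroutineDump are hypothetical; a long-running service would more commonly import net/http/pprof, which registers handlers under /debug/pprof/ on the default HTTP mux.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
)

// cpuProfile records a CPU profile while work runs and returns the raw
// profile bytes, which can be saved and opened with "go tool pprof".
func cpuProfile(work func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	work()
	pprof.StopCPUProfile() // flushes the profile into buf
	return buf.Bytes(), nil
}

// goroutineDump returns a human-readable stack trace of every goroutine.
func goroutineDump() (string, error) {
	var buf bytes.Buffer
	// debug=2 prints stacks in the same format as an unrecovered panic.
	if err := pprof.Lookup("goroutine").WriteTo(&buf, 2); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	prof, err := cpuProfile(func() {
		sum := 0
		for i := 0; i < 50000000; i++ {
			sum += i % 7
		}
		_ = sum
	})
	fmt.Println("cpu profile bytes:", len(prof), "err:", err)

	dump, err := goroutineDump()
	fmt.Println("goroutine dump bytes:", len(dump), "err:", err)
}
```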

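Key optimization tips include: keep critical sections small by locking only the data that needs protection, prefer atomic compare-and-swap (CAS) operations over locks where they suffice, focus on hot paths, consider GC impact, reuse objects (e.g., via sync.Pool), avoid reflection, tune GOGC, and keep Go versions up-to-date.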
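
Two of these tips can be sketched with the standard library: buffer reuse through sync.Pool and a lock-free update through compare-and-swap. The names bufPool, render, maxSeen and demo are hypothetical, and whether either technique pays off in a given service should be confirmed by profiling.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"sync/atomic"
)

// bufPool hands out reusable buffers so a hot path allocates less and
// leaves less work for the garbage collector.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// render formats a value using a pooled buffer.
func render(id int64, title string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold old data
	defer bufPool.Put(buf)
	fmt.Fprintf(buf, "%d:%s", id, title)
	return buf.String() // copies the bytes before the buffer is reused
}

// maxSeen tracks a running maximum with CAS instead of a mutex.
type maxSeen struct{ v int64 }

// Observe raises the stored maximum to x if x is larger.
func (m *maxSeen) Observe(x int64) {
	for {
		cur := atomic.LoadInt64(&m.v)
		if x <= cur || atomic.CompareAndSwapInt64(&m.v, cur, x) {
			return
		}
		// Another goroutine changed the value first; retry.
	}
}

// Load returns the current maximum.
func (m *maxSeen) Load() int64 { return atomic.LoadInt64(&m.v) }

// demo renders one value and records 1..100 from 100 goroutines.
func demo() (string, int64) {
	var m maxSeen
	var wg sync.WaitGroup
	for i := int64(1); i <= 100; i++ {
		wg.Add(1)
		go func(x int64) {
			defer wg.Done()
			m.Observe(x)
		}(i)
	}
	wg.Wait()
	return render(123, "hello"), m.Load()
}

func main() {
	fmt.Println(demo())
}
```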

A real‑world case study shows a storage service where reducing memory allocations, switching from Thrift to Msgpack, and reusing buffers cut 99th‑percentile latency from 100 ms to 15 ms.

Figure: Performance profiling tools

Service Monitoring

The runtime package exposes metrics such as goroutine count, GC pause time, and heap usage. kite collects these in real time and sets alert thresholds for critical metrics.
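
A minimal sketch of reading such metrics with the runtime package. The Snapshot type, Collect, OverLimit and the threshold value are hypothetical; how kite ships and alerts on the values is not described here.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// Snapshot holds a few process-level gauges taken from the runtime.
type Snapshot struct {
	Goroutines   int
	HeapAlloc    uint64        // bytes of allocated heap objects
	NumGC        uint32        // completed GC cycles
	GCPauseTotal time.Duration // cumulative stop-the-world pause time
}

// Collect reads the current values. runtime.ReadMemStats briefly stops
// the world, so it is better called on a timer than per request.
func Collect() Snapshot {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return Snapshot{
		Goroutines:   runtime.NumGoroutine(),
		HeapAlloc:    ms.HeapAlloc,
		NumGC:        ms.NumGC,
		GCPauseTotal: time.Duration(ms.PauseTotalNs),
	}
}

// OverLimit reports whether the goroutine count exceeds a threshold,
// the kind of check an alert rule might apply.
func (s Snapshot) OverLimit(maxGoroutines int) bool {
	return s.Goroutines > maxGoroutines
}

func main() {
	runtime.GC() // force one cycle so the GC fields are non-zero
	s := Collect()
	fmt.Printf("goroutines=%d heap=%dB gc=%d pause=%s alert=%v\n",
		s.Goroutines, s.HeapAlloc, s.NumGC, s.GCPauseTotal, s.OverLimit(10000))
}
```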

Go Programming Thinking and Engineering

Go forces a different mindset: each service runs in a single process, an unrecovered panic in any goroutine crashes that whole process, and there is no thread-local storage, so request context must be passed explicitly. Concurrency is the norm, requiring careful handling of shared resources.

The language's simplicity and its standard AST tooling (such as the go/ast package) make large codebases easier to manage than in languages that allow many competing idioms.

Figure: Go engineering diagram

Reusable DAO Component for Toutiao’s “NeiHan Duanzi” Service

The article proposes a DAO layer that aggregates RPC, DB and cache calls, builds a dependency tree, and executes basic and sub‑loads concurrently. Basic properties depend only on the primary key, while sub‑properties depend on basic ones. The component uses maps such as BASIC_LOADER_MAP and SUB_LOADER_MAP to associate loaders with RPC functions.

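// Signature as given in the article; parameter types are omitted there.
func DaoLoad(needParamsTree, daoList, paramLoaderMap, subLoaderMap) error {
    // 1. Build the basic and sub task lists from needParamsTree.
    // 2. Execute the basic tasks concurrently.
    // 3. Execute the sub tasks concurrently once the basic results they depend on are loaded.
}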

Clients specify the required fields as a simple string slice, for example:

[]string{"Content_Info", "Content.User_Info", "Content.Comment_Info"}

The loader then resolves the dependencies automatically, which reduces redundant code and speeds up data retrieval.
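
The component itself is not shown in this summary, so the following is only a sketch of the two-phase idea under stated assumptions: basic loaders take the primary key, each sub loader names the basic field it depends on, and each phase runs its loaders concurrently. The basicLoaders and subLoaders tables are hypothetical stand-ins for BASIC_LOADER_MAP and SUB_LOADER_MAP, and the loaders return strings where real ones would call RPC, DB or cache.

```go
package main

import (
	"fmt"
	"sync"
)

// subLoader loads a property that needs a basic property's value first.
type subLoader struct {
	parent string
	load   func(parentValue string) (string, error)
}

// Hypothetical loader tables for this illustration.
var basicLoaders = map[string]func(id int64) (string, error){
	"Content_Info": func(id int64) (string, error) { return fmt.Sprintf("content#%d", id), nil },
}

var subLoaders = map[string]subLoader{
	"Content.User_Info":    {parent: "Content_Info", load: func(p string) (string, error) { return "author of " + p, nil }},
	"Content.Comment_Info": {parent: "Content_Info", load: func(p string) (string, error) { return "comments on " + p, nil }},
}

// runAll calls fn for every name concurrently, waits for all of them,
// and returns the first error in slice order.
func runAll(names []string, fn func(name string) error) error {
	errs := make([]error, len(names))
	var wg sync.WaitGroup
	for i, n := range names {
		wg.Add(1)
		go func(i int, n string) {
			defer wg.Done()
			errs[i] = fn(n)
		}(i, n)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

// Load resolves the requested fields in two concurrent phases.
func Load(id int64, want []string) (map[string]string, error) {
	var basics, subs []string
	seen := map[string]bool{}
	addBasic := func(name string) {
		if !seen[name] {
			seen[name] = true
			basics = append(basics, name)
		}
	}
	for _, f := range want {
		if _, ok := basicLoaders[f]; ok {
			addBasic(f)
		} else if s, ok := subLoaders[f]; ok {
			addBasic(s.parent) // load the dependency even if not requested
			subs = append(subs, f)
		} else {
			return nil, fmt.Errorf("unknown field %q", f)
		}
	}

	var mu sync.Mutex // guards out while loaders run concurrently
	out := make(map[string]string)
	store := func(k, v string) {
		mu.Lock()
		out[k] = v
		mu.Unlock()
	}

	// Phase 1: basic properties depend only on the primary key.
	if err := runAll(basics, func(name string) error {
		v, err := basicLoaders[name](id)
		if err == nil {
			store(name, v)
		}
		return err
	}); err != nil {
		return nil, err
	}

	// Phase 2: sub properties read the basic values loaded in phase 1.
	if err := runAll(subs, func(name string) error {
		s := subLoaders[name]
		mu.Lock()
		parent := out[s.parent]
		mu.Unlock()
		v, err := s.load(parent)
		if err == nil {
			store(name, v)
		}
		return err
	}); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	fmt.Println(Load(123, []string{"Content_Info", "Content.User_Info", "Content.Comment_Info"}))
}
```

In this sketch a caller that asks only for "Content.User_Info" still gets "Content_Info" loaded, because the dependency is added automatically before the first phase runs.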

Conclusion

Toutiao’s migration to Go enabled a high‑performance, highly concurrent microservice platform that scales to millions of QPS and fits naturally into a cloud‑native environment. The reusable DAO component further simplifies cross‑service data aggregation while leveraging Go’s concurrency strengths.

Author: Xiang Chao, Senior R&D Engineer at Toutiao. He joined in 2015, promoted Go adoption, and developed the internal kite microservice framework.
