Migrating Meitu Push Service Storage to Titan: Architecture, Challenges, and Solutions
Meitu migrated its high‑traffic push‑service storage from fragmented Redis nodes to the TiKV‑based Titan system. A dual‑write rollout, batch‑operation tuning, and TiKV configuration tweaks let the new stack meet the required QPS, cut costs by about 60%, and deliver a stable, maintainable service within six months.
Introduction
Meitu operates numerous products with billions of daily push messages, requiring a highly reliable push service. In 2017 the company built an in‑house push system using Redis for message storage. As traffic and data volume grew, maintainability became difficult, prompting the development of a new storage solution called Titan, which leverages PingCAP’s TiKV as the underlying engine while exposing a Redis‑compatible protocol.
Push Service Status
Since early 2017, Meitu’s self‑developed push service (Thor) has supported targeted, batch, offline, and token‑managed pushes across all Meitu apps, achieving over 99% online delivery and handling up to 1 TB of stored messages during peak periods.
Architecture Model
The push system is split into three components: a long‑connection service (Bifrost), the push service (Thor), and a routing server (route_server). Service discovery is handled by etcd. The long‑connection service maintains client‑server links, while the push service manages client tokens and message storage.
Storage Model
Message storage is critical. From the client perspective, precise delivery and receipt reporting are required; from the system perspective, robust message management must survive crashes and network partitions. Each client (cid) owns a unique message queue, each message a unique identifier (mid). Messages are written to storage before delivery, and offline messages are re‑delivered upon client reconnection.
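The queue model above can be sketched as a minimal in‑memory store. This is a hypothetical illustration of the semantics the article describes (write before delivery, delete on receipt, re‑deliver pending messages on reconnect), not Thor’s actual implementation:

```python
from collections import OrderedDict

class MessageStore:
    """Sketch of the per-client message queue model: each client id (cid)
    owns an ordered queue of messages keyed by message id (mid)."""

    def __init__(self):
        self._queues = {}  # cid -> OrderedDict of mid -> payload

    def write(self, cid, mid, payload):
        # Messages are persisted *before* any delivery attempt, so a crash
        # between write and delivery never loses a message.
        self._queues.setdefault(cid, OrderedDict())[mid] = payload

    def ack(self, cid, mid):
        # A receipt report from the client deletes the delivered message.
        self._queues.get(cid, {}).pop(mid, None)

    def pending(self, cid):
        # On reconnect, all unacked (offline) messages are re-delivered
        # in write order.
        return list(self._queues.get(cid, OrderedDict()).items())
```

In the real system the queue lives in Redis (and later Titan) rather than process memory, but the write/ack/pending contract is the same.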
Current Challenges
High memory fragmentation due to frequent message expiration and deletion in Redis.
Growing single‑node data size leading to longer persistence times and service jitter.
Storage scaling incurs service degradation.
Rising operational costs.
These issues can only be resolved by replacing the storage layer with a system that supports massive data, horizontal elasticity, and smooth migration. Titan was selected as the solution.
Titan Overview
Titan is an open‑source NoSQL system maintained by the DistributedIO organization. It is built on TiKV for durable key‑value storage and provides a Redis‑protocol front‑end that translates Redis commands into KV operations. TiKV, written in Rust, uses the Raft consensus algorithm to guarantee strong consistency and supports ACID transactions.
Titan is stateless, supports Redis 5.0 data structures (lists, strings, hashes, sets, sorted sets), integrates Prometheus monitoring, and has been adopted by several companies (e.g., ZhaiZhai) that migrated 800 GB of data.
Smooth Migration Strategy
Migration is performed gradually. During the transition, both Redis and Titan run in parallel (dual‑write). Reads initially stay on Redis; after a week, reads switch to Titan, and Redis is decommissioned once Titan proves stable.
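The dual‑write phase can be sketched as a thin wrapper that mirrors every write to both backends and reads from whichever one is currently authoritative. The client objects here are hypothetical in‑memory stand‑ins; in practice both would be Redis‑protocol clients, since Titan speaks the Redis protocol:

```python
class FakeClient:
    """In-memory stand-in for a Redis-protocol client (illustration only)."""
    def __init__(self):
        self.data = {}
    def hset(self, key, field, value):
        self.data.setdefault(key, {})[field] = value
    def hgetall(self, key):
        return dict(self.data.get(key, {}))

class DualWriteStore:
    """Sketch of the migration wrapper: writes go to both Redis and Titan;
    reads follow a switch that is flipped once Titan proves stable."""

    def __init__(self, redis_client, titan_client):
        self.redis = redis_client
        self.titan = titan_client
        self.read_from_titan = False  # flipped after the dual-write soak period

    def hset(self, key, field, value):
        # Dual write: keep both stores in sync during the transition.
        self.redis.hset(key, field, value)
        self.titan.hset(key, field, value)

    def hgetall(self, key):
        backend = self.titan if self.read_from_titan else self.redis
        return backend.hgetall(key)
```

Because both stores hold the same data, flipping `read_from_titan` (and later decommissioning Redis) requires no data backfill and no downtime.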
Business Evaluation & Optimization
Performance tests were conducted on a single‑node SAS‑disk machine (40 cores, 96 GB RAM) and a three‑node SSD cluster (40 cores, 96 GB RAM each, running 1 Titan instance and 12 TiKV instances). The Redis commands the push service depends on (hset, hgetall, hdel) were benchmarked against the expected online QPS. Initial results showed Titan could not meet the online requirements for hset and hgetall.
Optimizations included batching hset operations (about 100 commands per transaction) and tuning the batch size, which brought throughput into the acceptable range (≈20 k ops/s). hgetall was re‑tested under realistic offline‑message loads and met the required 25 k QPS.
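The batching optimization can be sketched as follows. The chunking helper is self‑contained; `batched_hset` assumes a redis‑py‑style `pipeline()` API (a hypothetical stand‑in here, though Titan's Redis‑compatible front end would accept the same commands):

```python
def chunked(items, size=100):
    """Group items into batches of `size`; the article batches roughly
    100 hset commands per transaction to amortize per-transaction cost."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def batched_hset(client, key, fields, batch_size=100):
    """Sketch only: issue hset writes in pipelined batches instead of one
    round trip (and one transaction) per field."""
    for batch in chunked(fields.items(), batch_size):
        pipe = client.pipeline()  # assumes a redis-py-like client
        for field, value in batch:
            pipe.hset(key, field, value)
        pipe.execute()  # one Titan transaction per batch
```

Each `execute()` turns ~100 key writes into a single transaction commit in Titan, which is where the throughput gain comes from.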
After these tweaks, Titan satisfied the push service’s performance needs, and the migration proceeded in stages.
Issues Encountered and Solutions
Transaction Conflicts : High‑frequency conflicts caused memory spikes and OOM in TiKV. Solution: Remove meta‑level count fields to reduce key‑level contention.
TiKV OOM : Peak traffic overloaded TiKV memory. Solution: Decrease TiKV block‑cache size.
Raft Store CPU Saturation : Single‑threaded Raft store caused >90% CPU usage. Solution: Expand the cluster by adding an extra TiKV node.
TiKV Channel Full : Massive write bursts triggered Raft hotspots and region‑leader migrations, leading to Redis command timeouts. Solution: Increase TiKV’s scheduler notify capacity (the scheduler channel size) so bursts of write notifications no longer fill the channel.
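The memory and channel tweaks above map onto TiKV configuration roughly as follows. Option names and sections vary across TiKV releases, and the values here are placeholders, so treat this as an illustrative sketch rather than copy‑paste configuration:

```toml
# Illustrative only -- option names and defaults differ between TiKV releases.

[storage]
# Larger scheduler channel so write bursts no longer trigger "channel full".
scheduler-notify-capacity = 102400

[rocksdb.defaultcf]
# Smaller block cache to keep TiKV's memory footprint below the OOM threshold.
block-cache-size = "8GB"
```

Both changes trade a little headroom (channel memory, cache hit rate) for predictable behavior under peak push traffic.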
Conclusion
After six months of migration, the push service’s storage shifted from 16 Redis nodes to 4 SSD‑based TiKV servers and 2 Titan servers, cutting costs by ~60% and greatly improving maintainability. The service has run stably for half a year without incidents, and further business integration with Titan is planned.
Thanks are given to contributors and to PingCAP for TiKV support. Links to the Redis benchmark tool, Titan repository, and TiKV repository are provided.
Author Bio
Wang Hongjia, System R&D Engineer at Meitu, focuses on long‑connection services and push system infrastructure, and is a core member of the DistributedIO team.
Meitu Technology
Curating Meitu's technical expertise, valuable case studies, and innovation insights. We deliver quality technical content to foster knowledge sharing between Meitu's tech team and outstanding developers worldwide.