Building a High‑Performance OAuth Token Service with Tarantool, Raft, and Sharding
This article explains how we designed and implemented a scalable OAuth token storage and refresh system using Tarantool’s in‑memory database, Raft leader election, sharding across multiple data centers, and a custom lightweight queue to handle high‑throughput token updates while maintaining consistency and fault tolerance.
Background
Tarantool DBMS is known for its high performance, on‑disk engine Vinyl, and JSON support, but many articles overlook its ability to run Lua code inside the storage engine, enabling efficient data processing.
Problem Statement
We needed a service to store and refresh OAuth tokens for Mail.Ru‑related projects. Tokens consist of an access_token (short‑lived, allows actions) and a refresh_token (long‑lived, can obtain new access tokens). The one‑hour lifespan of access tokens creates a high‑load scenario when millions of tokens must be refreshed.
Initial Architecture
The simple framework consists of front‑ends that read/write tokens, an updater that obtains new access tokens from the OAuth provider, and two Tarantool nodes (master and replica) placed in separate data centers.
Challenges
Token validity is only one hour, leading to up to 3000 rps for 10 million tokens.
Database or server outages can invalidate large fractions of tokens.
CPU saturation after two years of logical extensions.
We upgraded the CPU temporarily, then moved to Tarantool 1.6, which supports master‑master replication.
Leader Election with Raft
To avoid three‑fold increase in OAuth provider requests, we needed a single leader per data‑center. We replaced a complex Paxos‑like algorithm with Raft, which selects the node that can communicate with the others as leader.
Using Tarantool’s net.box we connect every node in a mesh and run Raft on top of those connections. The resulting state is either leader, follower, or neither.
Handling “Abandoned” Nodes
If a data‑center loses connectivity, its nodes become “abandoned”. Raft still elects a new leader among the remaining nodes, preserving cluster operation. Abandoned nodes can optionally update tokens via a dedicated updater, but this may cause redundant work.
Sharding Strategy
To overcome CPU limits we introduced sharding. Two shards each have a replica. A deterministic function (e.g., CRC32) maps a key to a shard. We considered client‑side sharding (simple but requires all clients to know the function) and database‑side sharding (more complex DB code but hides sharding from clients).
Proxy Layer
To reduce the quadratic connection growth (each node connecting to every other), we introduce a lightweight proxy that knows the shard map (stored in a Lua config). Clients talk to the proxy; the proxy forwards requests to the appropriate shard leader.
Token Update Queue
Standard queues do not meet our timing requirements because each token must be refreshed before its one‑hour expiry. We implement a custom queue inside Tarantool tuples, adding two fields: status and time. The queue supports two operations:
put : insert a new task with status and expiry time.
take : create an index‑based iterator that selects ready tasks, waits if none are ready, and uses Tarantool’s fiber channels to notify waiting workers.
When a token is published, its tuple is tracked; if a client disconnects, the associated tasks are released automatically.
Conclusion
By combining master‑master replication, Raft leader election, sharding with a proxy layer, and a purpose‑built token queue, we achieved a fault‑tolerant, scalable OAuth token service that handles millions of refreshes per hour while keeping connection overhead linear and CPU usage within limits.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
