Facebook Ordered Queue Service (FOQS): Design and Implementation of a Distributed Priority Queue
The article explains Facebook’s Ordered Queue Service (FOQS), a multi‑tenant distributed priority queue that supports asynchronous job processing for services like Async, video encoding and translation, detailing its architecture, Thrift API, enqueue/dequeue mechanisms, ack/nack handling, scaling practices, checkpointing and disaster‑recovery strategies.
Introduction
Facebook’s ecosystem relies on thousands of distributed services and micro‑services, many of which use asynchronous jobs to improve resource utilization, reliability and peak‑shaving. The Facebook Ordered Queue Service (FOQS) provides a durable, multi‑tenant queue that stores items for asynchronous processing across services such as Async, video encoding and language translation.
Distributed Priority Queue Design
FOQS stores items in topics within namespaces and exposes a Thrift API with operations Enqueue, Dequeue, Ack, Nack and GetActiveTopics. Items contain fields such as namespace, topic, priority, payload, metadata, delivery timestamp, lease duration, unique ID and TTL.
Enqueue
Enqueue buffers incoming requests, returns a promise, and workers persist each item as a row in MySQL shards. Successful enqueues return a unique ID composed of shard ID and a 64‑bit primary key. FOQS uses a circuit‑breaker pattern to isolate unhealthy shards.
Dequeue
Dequeue receives a set of (topic, count) pairs and returns up to count items per topic ordered by priority and deliver‑after time. A prefetch buffer merges the highest‑priority items from all shards, enabling efficient pull‑based consumption.
Ack / Nack
Ack marks an item as successfully processed and removes it; Nack re‑queues the item with optional metadata updates and a new delivery timestamp. Both operations are routed to the appropriate shard using the shard ID encoded in the item ID.
Push vs Pull Model
FOQS adopts a pull model because workloads exhibit diverse end‑to‑end latency requirements, varying consumption rates, priority levels and geographic placement.
Large‑Scale Practices
FOQS processes close to one trillion items per day and maintains billions of pending items. To cope with this scale it employs checkpointing to bound update queries and disaster‑recovery replication across data‑center regions.
Checkpointing
Background threads periodically update item states. Instead of scanning the whole table, FOQS adds a checkpoint lower bound to the WHERE clause:
WHERE <checkpoint> <= timestamp_column AND timestamp_column <= UNIX_TIMESTAMP()Disaster Recovery
Each MySQL shard is synchronously replicated to a local disaster‑recovery cluster and asynchronously to a remote cluster. Failover promotes a replica to primary, and routing logic directs enqueues to hosts with capacity and dequeues to hosts holding high‑priority items.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
