Backend Development 12 min read

Facebook Ordered Queue Service (FOQS): Design and Implementation of a Distributed Priority Queue

The article explains Facebook’s Ordered Queue Service (FOQS), a multi‑tenant distributed priority queue that supports asynchronous job processing for services like Async, video encoding and translation, detailing its architecture, Thrift API, enqueue/dequeue mechanisms, ack/nack handling, scaling practices, checkpointing and disaster‑recovery strategies.

Full-Stack Internet Architecture

Jul 22, 2021

Facebook Ordered Queue Service (FOQS): Design and Implementation of a Distributed Priority Queue

Introduction

Facebook’s ecosystem relies on thousands of distributed services and micro‑services, many of which use asynchronous jobs to improve resource utilization, reliability and peak‑shaving. The Facebook Ordered Queue Service (FOQS) provides a durable, multi‑tenant queue that stores items for asynchronous processing across services such as Async, video encoding and language translation.

Distributed Priority Queue Design

FOQS stores items in topics within namespaces and exposes a Thrift API with operations Enqueue, Dequeue, Ack, Nack and GetActiveTopics. Items contain fields such as namespace, topic, priority, payload, metadata, delivery timestamp, lease duration, unique ID and TTL.

Enqueue

Enqueue buffers incoming requests, returns a promise, and workers persist each item as a row in MySQL shards. Successful enqueues return a unique ID composed of shard ID and a 64‑bit primary key. FOQS uses a circuit‑breaker pattern to isolate unhealthy shards.

Dequeue

Dequeue receives a set of (topic, count) pairs and returns up to count items per topic ordered by priority and deliver‑after time. A prefetch buffer merges the highest‑priority items from all shards, enabling efficient pull‑based consumption.

Ack / Nack

Ack marks an item as successfully processed and removes it; Nack re‑queues the item with optional metadata updates and a new delivery timestamp. Both operations are routed to the appropriate shard using the shard ID encoded in the item ID.

Push vs Pull Model

FOQS adopts a pull model because workloads exhibit diverse end‑to‑end latency requirements, varying consumption rates, priority levels and geographic placement.

Large‑Scale Practices

FOQS processes close to one trillion items per day and maintains billions of pending items. To cope with this scale it employs checkpointing to bound update queries and disaster‑recovery replication across data‑center regions.

Checkpointing

Background threads periodically update item states. Instead of scanning the whole table, FOQS adds a checkpoint lower bound to the WHERE clause:

WHERE <checkpoint> <= timestamp_column AND timestamp_column <= UNIX_TIMESTAMP()

Disaster Recovery

Each MySQL shard is synchronously replicated to a local disaster‑recovery cluster and asynchronously to a remote cluster. Failover promotes a replica to primary, and routing logic directs enqueues to hosts with capacity and dequeues to hosts holding high‑priority items.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

asynchronous processing Queue

Written by

Full-Stack Internet Architecture

Introducing full-stack Internet architecture technologies centered on Java

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.