Backend Development 51 min read

Mastering High‑Performance Backend: Lock‑Free, Zero‑Copy, Serialization, and More

This comprehensive guide explores essential backend techniques—including lock‑free programming, zero‑copy I/O, efficient serialization, pooling, concurrency, async processing, caching strategies, sharding, storage optimizations, and queue mechanisms—to build high‑performance, scalable services while highlighting practical code examples and real‑world trade‑offs.

ITPUB

Feb 18, 2023

Mastering High‑Performance Backend: Lock‑Free, Zero‑Copy, Serialization, and More

1. Lock‑Free Design

Multithreaded processing can improve concurrency, but contention on shared resources may degrade performance. Two lock‑free approaches are common:

1.1 Serial lock‑free (single‑threaded model)

Frameworks such as Redis and Nginx use a single thread to handle I/O. In a typical half‑sync/half‑async reactor, the main thread accepts connections and pushes data into a queue protected by a lock, while worker threads pull from the queue. This can be transformed into a lock‑free reactor where each accepted connection is assigned to a specific SubReactor and all I/O for that connection is processed by the same thread, eliminating the need for queue locking.

1.2 Data‑structure lock‑free

Hardware‑supported atomic operations (CAS) enable lock‑free containers. The example below compares a lock‑protected singly linked list with a lock‑free version using std::atomic and compare_exchange_weak in a loop to handle spurious failures.

template<typename T>
struct Node {
    Node(const T &value) : data(value) {}
    T data;
    Node *next = nullptr;
};

Locked list:

template<typename T>
class WithLockList {
    std::mutex mtx;
    Node<T> *head;
public:
    void pushFront(const T &value) {
        auto *node = new Node<T>(value);
        std::lock_guard<std::mutex> lock(mtx); //①
        node->next = head;
        head = node;
    }
};

Lock‑free list:

template<typename T>
class LockFreeList {
    std::atomic<Node<T> *> head;
public:
    void pushFront(const T &value) {
        auto *node = new Node<T>(value);
        node->next = head.load();
        while (!head.compare_exchange_weak(node->next, node)) {} //②
    }
};

Benchmark (1,000,000 pushes) shows the lock‑free list consistently faster (≈ 480 ms) than the locked list (≈ 550 ms).

2. Zero‑Copy I/O

Traditional file‑to‑socket transfer copies data between kernel buffers and user buffers multiple times, incurring CPU overhead. Two techniques reduce copies:

2.1 Memory‑mapped I/O

Mapping a file directly into the process address space with mmap allows the kernel to serve data without an extra user‑space copy.

int filefd = open(...);
int sockfd = socket(...);
void *buffer = mmap(filefd, ...); // map file
write(sockfd, buffer, size); // kernel copies directly to socket buffer

2.2 Zero‑Copy System Calls

Linux provides sendfile, splice, and tee to transfer data directly from a file descriptor to a socket descriptor using DMA, eliminating the user‑space copy.

int filefd = open(...);
int sockfd = socket(...);
sendfile(sockfd, filefd, NULL, file_size);

Zero‑copy reduces the number of copies from four (two DMA, two CPU) to two DMA copies, yielding ~3× throughput improvement in typical benchmarks (e.g., Kafka, RocketMQ).

3. Serialization

Data exchanged between services must be serialized. Three broad categories exist:

Built‑in language serialization (e.g., Java Serializable) – easy to use but often slow and not cross‑language.

Text formats such as JSON or XML – human readable, cross‑platform, but verbose.

Binary formats (Protocol Buffers, Thrift, MessagePack, FlatBuffers) – compact and fast.

Performance metrics include serialized size, (de)serialization speed, and CPU/memory consumption. Benchmarks show Protocol Buffers typically outperform JSON in both size and speed, while FlatBuffers can be even faster for read‑only scenarios because they avoid parsing.

3.3 Selection Guidelines

Performance : Choose binary formats for high‑throughput RPC or storage.

Ease of use : Prefer libraries offering code generation and rich data‑type support.

Cross‑language compatibility : Binary formats with official specs (ProtoBuf, Thrift) are safest.

Compatibility : Ensure forward/backward compatibility for evolving schemas.

Extensibility : Ability to add custom types without breaking existing services.

4. Pooling

Pooling reuses expensive resources to avoid repeated allocation overhead.

4.1 Memory Pools

Standard allocators (ptmalloc, tcmalloc, jemalloc) replace frequent malloc/free calls with internal pools, reducing fragmentation and system‑call overhead. Applications may implement custom object pools for small, frequently used objects.

4.2 Thread Pools

Creating a thread per task is costly. A thread pool maintains a fixed number of worker threads, reusing them for multiple tasks. Pools can be split into core (always alive) and non‑core (idle‑time‑reclaimed) threads.

4.3 Connection Pools

Database, Redis, or TCP connection pools cache open connections. Key design points include initialization strategy (eager vs lazy), pool size, handling exhaustion (wait vs create temporary), and health‑checking (active or periodic).

4.4 Object Pools

Objects with stable lifecycles (e.g., small integers in Redis, game entities) can be pooled to avoid repeated construction/destruction.

5. Concurrency

When a request can be decomposed into independent sub‑tasks, execute them in parallel to reduce latency. For three independent queries taking T1, T2, T3, serial time is T1+T2+T3, while concurrent execution finishes in max(T1,T2,T3). Batch RPC calls can also reduce round‑trip overhead.

5.2 Redundant Requests

Sending the same request to multiple back‑ends and using the fastest response (race) reduces tail latency, useful for initial connection establishment on flaky networks.

6. Asynchrony

Long‑running operations should be off‑loaded to asynchronous mechanisms to keep the main thread responsive.

6.1 Asynchronous Invocation Patterns

Callback : Register a function to be invoked upon completion.

Future/Promise : Submit a task, obtain a Future, and later retrieve the result.

CPS (Continuation‑Passing Style) : Chain asynchronous steps as continuations, e.g., Java CompletableFuture or C++ PPL.

6.2 Process‑Level Asynchrony

Non‑critical downstream actions (e.g., notifying recommendation services after a live‑stream start) can be queued and processed asynchronously, allowing the front‑end response to return immediately.

7. Caching

Caching trades space for time to serve read‑heavy workloads.

7.1 Usage Scenarios

Immutable or rarely‑changed data (e.g., game list).

Read‑intensive hot data (e.g., homepage feeds).

Expensive‑to‑compute results (e.g., daily leaderboards).

Data shared by many users (e.g., popular video lists).

7.2 Cache Levels

Process‑local cache : Fastest, limited by process memory.

Centralized cache : Single machine (Redis, Memcached) – larger capacity, survives process restart.

Distributed cache : Clustered (Redis Cluster, Codis) – horizontal scalability.

Multi‑level cache : Combine local + distributed layers; typical hit‑rate for first level can exceed 90%.

7.3 Cache Patterns

Cache‑Aside (read‑through/write‑through): Application reads from cache, falls back to source on miss, and writes to source then updates cache.

// Read
Data d = Cache.get(key);
if (d == NULL) {
    d = Store.load(key);
    Cache.set(key, d);
}
// Write
if (Store.save(key, d)) {
    Cache.set(key, d);
}

Potential inconsistency arises when writes to source and cache are not atomic.

Cache‑As‑SoR treats the cache as the primary data store, delegating reads/writes to the underlying store. Implementations include:

Read‑Through (cache miss triggers load from store).

Write‑Through (cache writes propagate synchronously to store).

Write‑Behind (cache writes are queued and flushed asynchronously, improving write throughput).

7.4 Eviction Strategies

Time‑based: TTL (absolute expiration) or TTI (idle timeout).

Space‑based: Max memory or entry count.

Reference‑based: LRU, LFU, FIFO.

7.5 Common Cache Problems

Cache penetration : Queries for non‑existent keys repeatedly hit the backend. Mitigate with placeholder values or a Bloom filter.

Cache avalanche : Simultaneous expiration of many keys overwhelms the backend. Stagger TTLs and use redundant instances.

Cache hot‑spot : A single key receives massive traffic. Replicate the hot key across multiple nodes or shard by adding a random suffix.

8. Sharding (Partitioning)

Sharding splits large datasets or workloads across multiple nodes to improve scalability and balance load.

8.1 Sharding Strategies

Range sharding : Assign contiguous key ranges to shards; good for ordered scans but can cause hotspot skew.

Hash sharding : Distribute keys via hash modulo; uniform distribution, but range queries require fan‑out.

Composite sharding : Combine hash (for load balance) and range (for ordered access) using multi‑field keys.

8.2 Indexing

Secondary indexes accelerate lookups on non‑primary attributes. Local indexes store the index within the same shard, while global indexes maintain a separate shard for the index, enabling efficient cross‑shard queries.

8.3 Routing

Client‑side routing : The client computes the shard and connects directly (e.g., Memcached client).

Proxy routing : A proxy layer receives requests and forwards them to the appropriate shard (e.g., Codis, CMEM).

Cluster routing : Any node can accept a request; if it does not own the shard, it forwards or redirects (e.g., Redis Cluster, CKV+).

8.4 Dynamic Rebalancing

When load or data volume changes, shards must be redistributed. Two common models:

Fixed partition count : Pre‑create many virtual shards (e.g., consistent hashing with 2^32‑1 vnodes) and move a subset when nodes join/leave.

Dynamic partitioning : Split a hot shard when its size exceeds a threshold, or merge small shards; similar to B‑tree node split/merge. Systems like HBase, Kafka, and TDSQL implement this.

8.5 Database Sharding

Large tables are split either vertically (different tables to different databases) or horizontally (rows distributed by hash/modulo). Vertical sharding isolates functional domains; horizontal sharding balances row‑level load but introduces cross‑shard joins and requires routing logic.

9. Storage Strategies

9.1 Read‑Write Separation

Primary nodes handle writes; replicas serve reads. Asynchronous replication reduces write latency but introduces replication lag. Mitigations:

Read‑your‑own‑writes: Direct the client’s immediate read to the primary.

Fallback reads: Try replica first, then primary on miss or stale data.

Critical paths use primary; non‑critical paths use replicas.

9.2 Dynamic Tiering

Separate hot data (SSD, high‑performance storage) from cold data (HDD, cheaper disks). Systems like Elasticsearch support hot‑warm‑cold nodes to optimize cost/performance.

9.3 Rewrite‑Light‑Read

Shift computation from read time to write time. Example: Build a user's feed list when a post is created (write‑heavy) so that reads become simple cache lookups.

10. Queues

Queues decouple producers and consumers, smooth traffic spikes, and enable asynchronous processing.

10.1 Use Cases

Asynchronous processing : Offload non‑critical work (e.g., notification sending).

Traffic shaping : Buffer bursts before writing to downstream stores.

System decoupling : Publish events without knowing which services consume them.

Data synchronization : Mirror writes from a primary DB to a cache via a message queue.

Flexible transactions : Use MQ‑based two‑phase commit where the first local transaction sends a pending message, and the second transaction runs after the message is confirmed.

10.2 Queue Types

Buffer queues : In‑memory or disk buffers (e.g., TCP send buffers, Kafka for log ingestion).

Request queues : Network frameworks often have inbound request queues for back‑pressure control.

Task queues : Thread‑pool work queues.

Message queues : Point‑to‑point or pub/sub (RabbitMQ, RocketMQ, Kafka).

Summary

This article consolidates a set of high‑performance techniques for backend service architecture, ranging from lock‑free programming and zero‑copy I/O to serialization choices, pooling, concurrency patterns, asynchronous designs, caching strategies, sharding, storage optimizations, and queue‑based decoupling. Applying the appropriate combination of these methods—tailored to specific workload characteristics—can significantly improve throughput, latency, and scalability of large‑scale services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

backend Sharding serialization Caching Zero‑copy high performance lock‑free Queue

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.