Why Blocking BIO Is a Performance Bottleneck: Deep Dive into Linux Socket Internals

This article dissects Linux's synchronous blocking network I/O implementation, explaining how socket creation, recv handling, and soft-interrupt processing fit together, and why the context switches caused by blocking and waking make BIO unsuitable for high-concurrency server workloads.

ITPUB

In network development, synchronous blocking I/O (BIO) is simple but performs poorly under high concurrency. This article dissects the Linux kernel implementation of socket creation, recv, and the associated soft-interrupt (softirq) handling to reveal why a recv call that has to wait for data incurs context switches.

1. Creating a socket

When a user process calls socket(), the kernel allocates a series of socket-related objects. The socket() syscall in net/socket.c invokes sock_create(), which calls __sock_create(). Inside, sock_alloc() allocates a struct socket, the protocol family table net_families is looked up, and the family's create method (inet_create() for AF_INET) is called. inet_create() allocates the struct sock (via sk_alloc()) and, for a SOCK_STREAM socket, selects inet_stream_ops and tcp_prot and links them to the socket.

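SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
    struct socket *sock;
    int retval;

    /* Allocate the socket objects; the rest of the syscall is omitted here. */
    retval = sock_create(family, type, protocol, &sock);
}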

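int __sock_create(struct net *net, int family, int type, int protocol,
                  struct socket **res, int kern)
{
    struct socket *sock;
    const struct net_proto_family *pf;
    int err;

    /* Allocate the struct socket. */
    sock = sock_alloc();
    /* Look up the operations registered for this protocol family. */
    pf = rcu_dereference(net_families[family]);
    /* Call the family's create method, inet_create() for AF_INET. */
    err = pf->create(net, sock, protocol, kern);
}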

The sock_init_data() function then sets sk_data_ready to sock_def_readable(), which will later be used to wake a blocked process.

void sock_init_data(struct socket *sock, struct sock *sk)
{
    sk->sk_data_ready   = sock_def_readable;
    sk->sk_write_space  = sock_def_write_space;
    sk->sk_error_report = sock_def_error_report;
}
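The callback slot pattern used here can be modeled in a few lines of user-space C. This is only an illustrative sketch: the names mini_sock, mini_def_readable, mini_init_data, and mini_deliver are hypothetical and stand in loosely for struct sock, sock_def_readable(), sock_init_data(), and the receive path. The real default hook wakes tasks on a wait queue; the model just sets a flag.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sock: a callback slot and a flag. */
struct mini_sock {
    void (*data_ready)(struct mini_sock *sk); /* plays the role of sk_data_ready */
    int woken;                                /* set when the default hook runs */
};

/* Default hook, analogous in spirit to sock_def_readable(). */
static void mini_def_readable(struct mini_sock *sk)
{
    sk->woken = 1; /* the real function wakes a sleeping task instead */
}

/* Analogous in spirit to sock_init_data(): install the default hook. */
static void mini_init_data(struct mini_sock *sk)
{
    sk->data_ready = mini_def_readable;
    sk->woken = 0;
}

/* Receive path: notify through the pointer without knowing what is behind it. */
static void mini_deliver(struct mini_sock *sk)
{
    if (sk->data_ready != NULL)
        sk->data_ready(sk);
}
```

The point of the indirection is that the code delivering data only calls through the pointer, so the wake-up behavior is decided once, at initialization time.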

2. Waiting for data (recv)

The recv() library call ends up in the recvfrom syscall. After locating the socket object from the file descriptor, the kernel calls sock_recvmsg(), which forwards to the protocol-specific recvmsg implementation (inet_recvmsg() for AF_INET sockets). For TCP, this eventually invokes tcp_recvmsg().

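SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
                unsigned int, flags, struct sockaddr __user *, addr,
                int __user *, addr_len)
{
    struct socket *sock;
    struct msghdr msg;
    int err, fput_needed;

    /* Map the file descriptor to its struct socket. */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    /* Hand off to the protocol-specific receive path. */
    err = sock_recvmsg(sock, &msg, size, flags);
}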

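int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len, int nonblock, int flags, int *addr_len)
{
    int copied = 0;
    /* Simplified excerpt: setup of skb, target and timeo is omitted. */
    do {
        /* Walk the receive queue and copy available data to user space. */
        skb_queue_walk(&sk->sk_receive_queue, skb) { /* … */ }

        if (copied >= target)
            break;
        /* Not enough data yet: sleep until more arrives. */
        sk_wait_data(sk, &timeo);
    } while (len > 0);
}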

If the receive queue does not contain enough data, sk_wait_data() adds the current process to the socket's wait queue and puts it to sleep. The CPU is then handed to another process, which costs a context switch.

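int sk_wait_data(struct sock *sk, long *timeo)
{
    int rc;
    /* Define a wait-queue entry for the current process. */
    DEFINE_WAIT(wait);

    /* Add the entry to the socket's wait queue and mark the task interruptible. */
    prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
    set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
    /* Sleep until the receive queue is non-empty or the timeout expires. */
    rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
    finish_wait(sk_sleep(sk), &wait);
    return rc;
}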
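The blocking behavior is easy to observe from user space. The sketch below is an illustration under stated assumptions, not code from the article: the names blocking_recv_demo and recv_result are hypothetical, and it uses an AF_UNIX socketpair instead of TCP so that it needs no network setup. The protocol-specific receive path in the kernel is therefore different from tcp_recvmsg(), but the calling process sleeps in recv() in the same way. A child process stays silent for 50 ms and then sends 5 bytes, while the parent measures how long recv() takes and how many voluntary context switches getrusage() reports for that interval.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

struct recv_result {
    long bytes;        /* bytes returned by recv(), or -1 on error */
    long elapsed_ms;   /* time from just before fork() until recv() returned */
    long voluntary_cs; /* voluntary context switches reported around recv() */
};

/* Hypothetical demo: block in recv() until a child process sends data. */
static struct recv_result blocking_recv_demo(void)
{
    struct recv_result res = { -1, 0, 0 };
    struct rusage before, after;
    struct timespec t0, t1;
    long long ns;
    char buf[16];
    int sv[2];
    pid_t pid;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return res;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return res;
    }
    if (pid == 0) {
        /* Child: act as the peer. Stay silent for 50 ms, then send 5 bytes. */
        struct timespec delay = { 0, 50 * 1000 * 1000 };
        close(sv[0]);
        nanosleep(&delay, NULL);
        if (write(sv[1], "hello", 5) != 5)
            _exit(1);
        _exit(0);
    }

    /* Parent: no data is queued yet, so recv() has to sleep. */
    close(sv[1]);
    getrusage(RUSAGE_SELF, &before);
    res.bytes = (long)recv(sv[0], buf, sizeof(buf), 0);
    getrusage(RUSAGE_SELF, &after);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL + (t1.tv_nsec - t0.tv_nsec);
    res.elapsed_ms = (long)(ns / 1000000);
    res.voluntary_cs = after.ru_nvcsw - before.ru_nvcsw;

    close(sv[0]);
    waitpid(pid, NULL, 0);
    return res;
}
```

Calling blocking_recv_demo() from a main() function and printing the fields should show recv() taking at least the 50 ms the peer stayed silent. On a regular Linux kernel the voluntary context switch count is expected to increase while the process sleeps, though the exact number depends on the system.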

3. Soft‑interrupt processing

When a packet arrives, the kernel's network receive soft-interrupt (softirq), which may run in the ksoftirqd kernel thread, hands TCP segments to tcp_v4_rcv(). It looks up the socket using the packet's source and destination IP addresses and ports, then calls tcp_v4_do_rcv(). For established connections, tcp_rcv_established() queues the data onto sk_receive_queue and finally calls sk_data_ready(), which still points to sock_def_readable().

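int tcp_v4_rcv(struct sk_buff *skb)
{
    const struct tcphdr *th = tcp_hdr(skb);
    /* Find the socket that matches the packet's addresses and ports. */
    struct sock *sk = __inet_lookup_skb(&tcp_hashinfo, skb,
                                        th->source, th->dest);

    /* Process the segment now unless a user-context task holds the socket lock. */
    if (!sock_owned_by_user(sk))
        tcp_v4_do_rcv(sk, skb);
}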
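The socket lookup step can be pictured as matching the packet's 4-tuple against the known connections. The following sketch is a deliberately simplified model with hypothetical names (conn_key, mini_conn, mini_lookup): it scans a small array linearly, whereas the kernel uses hash tables (tcp_hashinfo), and it keeps addresses in host byte order for readability.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connection key: the TCP 4-tuple. */
struct conn_key {
    uint32_t saddr, daddr; /* source and destination IPv4 address */
    uint16_t sport, dport; /* source and destination port */
};

/* Hypothetical stand-in for a socket: a key plus an identifier. */
struct mini_conn {
    struct conn_key key;
    int id;
};

static int key_equal(const struct conn_key *a, const struct conn_key *b)
{
    return a->saddr == b->saddr && a->daddr == b->daddr &&
           a->sport == b->sport && a->dport == b->dport;
}

/* Linear scan for clarity; returns NULL if no connection matches. */
static struct mini_conn *mini_lookup(struct mini_conn *table, size_t n,
                                     const struct conn_key *pkt)
{
    for (size_t i = 0; i < n; i++) {
        if (key_equal(&table[i].key, pkt))
            return &table[i];
    }
    return NULL;
}
```

Two connections from the same client to the same server port differ only in the source port, which is why all four fields take part in the comparison.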

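int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
                        const struct tcphdr *th, unsigned int len)
{
    /* Append the segment to sk->sk_receive_queue. */
    eaten = tcp_queue_rcv(sk, skb, tcp_header_len, &fragstolen);
    /* Notify the socket that data is ready; this wakes a blocked reader. */
    sk->sk_data_ready(sk, 0);
}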

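The sock_def_readable() function checks whether any process is sleeping on the socket's wait queue and, if so, wakes it with wake_up_interruptible_sync_poll(). The wake-up path ultimately calls __wake_up_sync_key() and __wake_up_common(), which invoke the callback stored in each wait-queue entry. For an entry created with DEFINE_WAIT() in sk_wait_data(), that callback is autoremove_wake_function, which places the process back on the run queue and removes the entry from the wait queue.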
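The interplay between the wait-queue entry and its wake callback can be modeled in plain C. This is a single-threaded sketch of the data structures only, with hypothetical names throughout (mini_task, mini_wait_entry, mini_wait_queue, and the mini_ functions). It is simpler than the kernel in several ways: there is no locking, no scheduler, and every woken entry counts toward the limit, whereas the kernel only counts exclusive waiters.

```c
#include <stddef.h>

enum mini_state { MINI_SLEEPING, MINI_RUNNABLE };

struct mini_task {
    enum mini_state state;
};

struct mini_wait_queue;

/* One entry per sleeping task, carrying its own wake callback. */
struct mini_wait_entry {
    struct mini_task *task;
    int (*func)(struct mini_wait_queue *q, struct mini_wait_entry *self);
    struct mini_wait_entry *next;
};

struct mini_wait_queue {
    struct mini_wait_entry *head;
};

/* Modeled on autoremove_wake_function: make the task runnable and
 * unlink the entry from the queue. Returns 1 for one wake-up. */
static int mini_autoremove_wake(struct mini_wait_queue *q,
                                struct mini_wait_entry *self)
{
    struct mini_wait_entry **pp = &q->head;

    self->task->state = MINI_RUNNABLE;
    while (*pp != NULL && *pp != self)
        pp = &(*pp)->next;
    if (*pp == self)
        *pp = self->next;
    self->next = NULL;
    return 1;
}

/* Modeled on prepare_to_wait(): mark the task sleeping and enqueue it. */
static void mini_prepare_to_wait(struct mini_wait_queue *q,
                                 struct mini_wait_entry *e,
                                 struct mini_task *t)
{
    t->state = MINI_SLEEPING;
    e->task = t;
    e->func = mini_autoremove_wake;
    e->next = q->head;
    q->head = e;
}

/* Modeled on __wake_up_common(): call each entry's callback and
 * stop after nr wake-ups. Returns the number of tasks woken. */
static int mini_wake_up(struct mini_wait_queue *q, int nr)
{
    struct mini_wait_entry *e = q->head;
    int woken = 0;

    while (e != NULL && woken < nr) {
        struct mini_wait_entry *next = e->next; /* callback may unlink e */
        woken += e->func(q, e);
        e = next;
    }
    return woken;
}
```

In this model the waker never touches the task directly. It only walks the queue and calls whatever function each entry carries, which mirrors how the same wake-up code can serve different kinds of waiters.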

Summary

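The overall flow consists of two parts: (1) the user process that creates the socket and blocks in recv(), and (2) the kernel's soft-interrupt context that receives packets, enqueues them, and wakes the blocked process. Each round of blocking and waking costs two context switches: one when the process goes to sleep and one when it is scheduled again after the wake-up, each taking roughly 3-5 µs of CPU time. Because a blocked process can wait on only one socket at a time, a BIO server needs one process or thread per connection, so under high-concurrency workloads this overhead becomes prohibitive. This is why I/O multiplexing models such as select, poll, and epoll are preferred.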
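A back-of-the-envelope calculation shows how quickly this adds up. The helper below is a hypothetical sketch, and all of its inputs are assumptions to be replaced with measured values: the rate of recv() calls that actually sleep, the number of context switches per such call (two if one switch away from the process and one back are both counted), and the cost per switch in microseconds. It only accounts for the direct switch cost, not for secondary effects such as cache misses.

```c
/* Fraction of one CPU core spent purely on context switches.
 * All three parameters are assumptions supplied by the caller. */
static double switch_cpu_fraction(double recvs_per_sec,
                                  double switches_per_recv,
                                  double us_per_switch)
{
    /* Total microseconds of switching per second, divided by 1e6 us. */
    return recvs_per_sec * switches_per_recv * us_per_switch / 1e6;
}
```

For example, 100,000 blocking recv() calls per second, two switches each, and 4 µs per switch give 100,000 × 2 × 4 µs = 0.8 s of CPU time per second, so about 80% of one core would go to context switching alone.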
