Deep Dive into Synchronous Blocking Network I/O in Linux: Socket Creation, recv, and Wake‑up Mechanism
This article explains why synchronous blocking network I/O (BIO) is a performance bottleneck in high‑concurrency Linux servers, detailing the kernel‑level steps of socket creation, the recv path, soft‑interrupt handling, and how processes are blocked and later awakened.
In Linux network programming, the traditional synchronous blocking I/O model (often called BIO in Java) is simple to use but performs poorly under high concurrency because each recv call can block the calling process, causing frequent context switches and requiring one process per connection.
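To see this model from user space before diving into the kernel, here is a minimal sketch. The helper name demo_blocking_recv is illustrative, and an AF_UNIX socketpair stands in for a real TCP connection; the blocking semantics of recv are the same.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Demonstrates the blocking recv() path: the reading side would sleep in
 * the kernel until data sits on the socket's receive queue. Here data is
 * written first, so recv returns immediately with the queued bytes. */
ssize_t demo_blocking_recv(void)
{
    int fds[2];
    char buf[16];

    /* A connected AF_UNIX pair stands in for a TCP connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    write(fds[1], "ping", 4);   /* data arrives on the receive queue  */
    ssize_t n = recv(fds[0], buf, sizeof(buf), 0); /* queue non-empty */

    close(fds[0]);
    close(fds[1]);
    return n;                   /* number of bytes received */
}
```

Had nothing been written first, recv would have parked the process exactly as described in the sections below.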
1. Creating a socket
The user‑space socket() call triggers the kernel to allocate a series of socket‑related objects. The core creation flow is:
//file: net/socket.c
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)
{
    ...
    retval = sock_create(family, type, protocol, &sock);
}
sock_create calls __sock_create, which allocates a struct socket, looks up the registered protocol-family operations, and finally invokes the family-specific create function (e.g., inet_create for AF_INET):
//file: net/socket.c
int __sock_create(struct net *net, int family, int type, int protocol,
                  struct socket **res, int kern)
{
    struct socket *sock;
    const struct net_proto_family *pf;
    ...
    sock = sock_alloc();
    pf = rcu_dereference(net_families[family]);
    err = pf->create(net, sock, protocol, kern);
}
During inet_create, the kernel assigns inet_stream_ops to sock->ops and tcp_prot to sock->sk_prot. The sock_init_data function then sets default callbacks such as sk_data_ready = sock_def_readable:
//file: net/core/sock.c
void sock_init_data(struct socket *sock, struct sock *sk)
{
    sk->sk_data_ready = sock_def_readable;
    sk->sk_write_space = sock_def_write_space;
    sk->sk_error_report = sock_def_error_report;
}
2. Waiting for data (recv)
The user‑space recv() eventually invokes the recvfrom system call, which looks up the socket object and calls sock_recvmsg → __sock_recvmsg → sock->ops->recvmsg . For TCP sockets this resolves to inet_recvmsg , which forwards to the protocol‑specific tcp_recvmsg implementation.
//file: net/socket.c
SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,
                unsigned int, flags, struct sockaddr __user *, addr,
                int __user *, addr_len)
{
    struct socket *sock;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    ...
    err = sock_recvmsg(sock, &msg, size, flags);
    ...
}
Inside tcp_recvmsg, if the receive queue does not contain enough data, the kernel calls sk_wait_data to block the current process:
//file: net/core/sock.c
int sk_wait_data(struct sock *sk, long *timeo)
{
    int rc;
    DEFINE_WAIT(wait);

    prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
    set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
    rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
    ...
}
DEFINE_WAIT creates a wait-queue entry bound to the current task. prepare_to_wait inserts this entry into the socket's wait queue (sk_sleep(sk)) and sets the task state to TASK_INTERRUPTIBLE; sk_wait_event then calls into the scheduler, which takes the process off the CPU until data arrives or the timeout expires.
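The indefinite sleep in sk_wait_data can be bounded from user space with SO_RCVTIMEO, in which case the timer, rather than sk_data_ready, wakes the process and recv fails with EAGAIN. A minimal sketch (demo_recv_timeout is a hypothetical name; an AF_UNIX pair again stands in for TCP):

```c
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <unistd.h>

/* With no data queued, the process sleeps in sk_wait_data until the
 * receive timeout fires, and recv returns -1 with errno == EAGAIN. */
int demo_recv_timeout(void)
{
    int fds[2];
    char buf[8];
    struct timeval tv = { .tv_sec = 0, .tv_usec = 100 * 1000 }; /* 100 ms */

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    setsockopt(fds[0], SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    ssize_t n = recv(fds[0], buf, sizeof(buf), 0); /* blocks ~100 ms */
    int saved = errno;

    close(fds[0]);
    close(fds[1]);
    return (n == -1 && (saved == EAGAIN || saved == EWOULDBLOCK)) ? 0 : -1;
}
```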
3. Soft‑interrupt processing
When a network packet arrives, the soft‑interrupt context (ksoftirqd) runs tcp_v4_rcv , locates the corresponding socket, and eventually calls tcp_queue_rcv to enqueue the data onto sk->sk_receive_queue . After queuing, it invokes sk->sk_data_ready (which points to sock_def_readable ) to wake the waiting process.
//file: net/ipv4/tcp_ipv4.c
int tcp_v4_rcv(struct sk_buff *skb)
{
    ...
    sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest);
    ...
    tcp_queue_rcv(sk, skb, ...);
    sk->sk_data_ready(sk, 0);
}
The default readable handler checks whether any task is sleeping on the socket's wait queue and wakes it with wake_up_interruptible_sync_poll, which ultimately calls default_wake_function → try_to_wake_up to move the blocked task back onto a run queue.
//file: net/core/sock.c
static void sock_def_readable(struct sock *sk, int len)
{
    struct socket_wq *wq;

    rcu_read_lock();
    wq = rcu_dereference(sk->sk_wq);
    if (wq_has_sleeper(wq))
        wake_up_interruptible_sync_poll(&wq->wait, POLLIN | POLLPRI |
                                        POLLRDNORM | POLLRDBAND);
    sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
    rcu_read_unlock();
}
Only one waiting process is awakened (the "thundering herd" mitigation) because the wake-up call passes nr_exclusive = 1.
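From the application's point of view, the sleep-and-wake sequence looks like this: one process sleeps in recv, and data queued by the other side brings it back to the run queue. A sketch using fork as a stand-in for the soft-interrupt side (demo_wakeup and the AF_UNIX pair are illustrative assumptions):

```c
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent blocks in recv; child plays the role of the data source that
 * triggers sk_data_ready and wakes the sleeping parent. */
int demo_wakeup(void)
{
    int fds[2];
    char buf[8];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid == 0) {                /* child: the "packet arrival" side */
        usleep(50 * 1000);         /* parent is asleep in recv by now  */
        write(fds[1], "wake", 4);  /* enqueue data -> wake-up          */
        _exit(0);
    }

    ssize_t n = recv(fds[0], buf, sizeof(buf), 0); /* sleeps, then woken */
    waitpid(pid, NULL, 0);
    close(fds[0]);
    close(fds[1]);
    return (int)n;                 /* bytes received after wake-up */
}
```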
Summary
The synchronous blocking I/O path involves two major phases: (1) the user process entering the kernel to create a socket and later to call recv , which may block the process and cause a context switch; (2) the network stack’s soft‑interrupt handling that receives packets, enqueues them, and wakes the blocked process. Each request therefore incurs at least two context switches, which become a serious overhead when handling thousands of concurrent connections, motivating the use of multiplexed I/O models such as select , poll , or epoll .
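As a contrast to the one-process-per-connection model, a brief epoll sketch shows a single process watching two connections at once and learning exactly which one became readable, without dedicating a blocked process to each. The helper name demo_epoll and the socketpair stand-ins are assumptions for illustration:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* One epoll instance monitors two "connections"; only the one with
 * queued data is reported readable. */
int demo_epoll(void)
{
    int a[2], b[2];
    struct epoll_event ev, out[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, a) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, b) < 0)
        return -1;

    int ep = epoll_create1(0);
    ev.events = EPOLLIN; ev.data.fd = a[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, a[0], &ev);
    ev.events = EPOLLIN; ev.data.fd = b[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, b[0], &ev);

    write(b[1], "x", 1);           /* only connection b becomes readable */

    int n = epoll_wait(ep, out, 2, 1000);
    int hit = (n == 1 && out[0].data.fd == b[0]);

    close(ep);
    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return hit ? 0 : -1;
}
```

A single blocked epoll_wait replaces many blocked recv calls, which is exactly the context-switch saving the summary above argues for.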
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.