Unveiling the TCP Connection Process: Inside the Linux Socket System Calls
This article dissects the Linux kernel's TCP connection workflow. It explains how the three-way handshake guards against stale SYN packets, then walks through the socket(), bind(), listen() and connect() system calls with code analysis of the underlying kernel functions and data structures.
Why the three‑way handshake matters
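When network latency or failures cause old SYN packets to linger, a server that treated every incoming SYN as a fresh request and committed to the connection immediately would create a spurious connection and waste resources. TCP avoids this with a three-way handshake: the server answers with SYN-ACK and only treats the connection as established after the client acknowledges it, so the freshness of the request is validated before full connection state is set up.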
Socket creation (socket system call)
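The user calls socket(AF_INET, SOCK_STREAM, 0), which invokes the C library's socket() wrapper. On x86-64 the wrapper issues the syscall instruction and enters the kernel at SYSCALL_DEFINE3(socket, ...). On architectures that also provide the multiplexed interface, such as 32-bit x86, the entry point can instead be SYSCALL_DEFINE2(socketcall, ...), whose SYS_SOCKET case handles the request. Both paths call __sys_socket: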
int __sys_socket(int family, int type, int protocol)
{
    struct socket *sock;
    int flags;
    /* Allocate and initialise the struct socket for this family and type. */
    sock = __sys_socket_create(family, type, protocol);
    if (IS_ERR(sock))
        return PTR_ERR(sock);
    /* Keep only the auxiliary flags (SOCK_NONBLOCK, SOCK_CLOEXEC). */
    flags = type & ~SOCK_TYPE_MASK;
    if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
        flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
    /* Attach the socket to a file and return its descriptor. */
    return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}
__sys_socket_create validates the type flags, strips the auxiliary flags, and calls sock_create, which forwards to __sock_create. The latter checks the protocol family, loads the appropriate module if needed, and finally invokes the family's create function (for example, inet_create for IPv4).
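The flag translation in __sys_socket can be observed from user space. The following is a minimal sketch, not part of the original article: probe_socket_flags is a hypothetical helper name, and it assumes a Linux system where SOCK_NONBLOCK and SOCK_CLOEXEC are available. It creates a TCP socket with extra type flags and reads back the resulting descriptor state with fcntl().

```c
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: creates a TCP socket with the given extra type flags
 * and reports which flags ended up on the descriptor.
 * Returns 0 on success, -1 on error. */
int probe_socket_flags(int extra_type_flags, int *nonblocking, int *cloexec)
{
    int fd = socket(AF_INET, SOCK_STREAM | extra_type_flags, 0);
    if (fd < 0)
        return -1;

    int status_flags = fcntl(fd, F_GETFL); /* file status flags, e.g. O_NONBLOCK */
    int fd_flags = fcntl(fd, F_GETFD);     /* descriptor flags, e.g. FD_CLOEXEC */
    close(fd);
    if (status_flags < 0 || fd_flags < 0)
        return -1;

    *nonblocking = (status_flags & O_NONBLOCK) != 0;
    *cloexec = (fd_flags & FD_CLOEXEC) != 0;
    return 0;
}
```

Calling the helper with SOCK_NONBLOCK | SOCK_CLOEXEC should report both flags as set, while passing 0 should report neither, which matches the masking done before sock_map_fd.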
static const struct net_proto_family inet_family_ops = {
    .family = PF_INET,
    .create = inet_create,
    .owner  = THIS_MODULE,
};
inet_create sets the state of the struct socket to SS_UNCONNECTED, looks up the matching inet_protosw entry in inetsw, allocates a struct sock with sk_alloc, and links it to the high-level struct socket via sock_init_data, which calls sk_set_socket.
Binding a socket (bind system call)
The user‑level bind(sockfd, &addr, sizeof(addr)) reaches the kernel as __sys_bind. After copying the address from user space, a security check is performed, then the protocol‑specific bind operation is invoked (for TCP this is inet_bind).
int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen)
{
    struct socket *sock;
    struct sockaddr_storage address;
    int err, fput_needed;
    /* Resolve the descriptor to its struct socket. */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        /* Copy the user-space address into kernel memory. */
        err = move_addr_to_kernel(umyaddr, addrlen, &address);
        if (!err) {
            /* Security hook first, then the protocol-specific bind. */
            err = security_socket_bind(sock, (struct sockaddr *)&address, addrlen);
            if (!err)
                err = sock->ops->bind(sock, (struct sockaddr *)&address, addrlen);
        }
        fput_light(sock->file, fput_needed);
    }
    return err;
}
inet_bind checks the address length, runs the cgroup BPF bind hook, and delegates the core work to __inet_bind. That function validates the address family, classifies the address type against the routing tables to verify that the address is usable, checks the privileges required for low ports, takes the socket lock, and finally calls the protocol's get_port (for TCP this is inet_csk_get_port) to allocate a local port.
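The port allocation done by get_port can be seen by binding to port 0 and reading the result back with getsockname(). This is a minimal sketch: bind_loopback is a hypothetical helper, and the loopback address is an assumption chosen so the example does not depend on the host's network configuration.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: binds fd to 127.0.0.1:port (port in host byte order).
 * Returns the port actually bound, or -errno on failure. */
int bind_loopback(int fd, unsigned short port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port); /* 0 asks the kernel to pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -errno;

    /* Read back the address so the kernel-chosen port becomes visible. */
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0)
        return -errno;
    return ntohs(addr.sin_port);
}
```

Binding a second TCP socket to the port returned for the first one, without SO_REUSEADDR or SO_REUSEPORT, is expected to fail with EADDRINUSE, which is the conflict check performed on the get_port path.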
Listening for connections (listen system call)
After a successful bind, the application calls listen(sockfd, backlog). The kernel entry point is __sys_listen, which caps backlog at somaxconn, performs a security check, and then calls sock->ops->listen (for TCP this is inet_listen).
int __sys_listen(int fd, int backlog)
{
    struct socket *sock;
    int err, fput_needed;
    int somaxconn;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        /* Clamp the requested backlog to net.core.somaxconn. */
        somaxconn = READ_ONCE(sock_net(sock->sk)->core.sysctl_somaxconn);
        if ((unsigned int)backlog > somaxconn)
            backlog = somaxconn;
        /* Security hook first, then the protocol-specific listen. */
        err = security_socket_listen(sock, backlog);
        if (!err)
            err = sock->ops->listen(sock, backlog);
        fput_light(sock->file, fput_needed);
    }
    return err;
}
inet_listen verifies that the socket is in SS_UNCONNECTED and of type SOCK_STREAM, sets sk_max_ack_backlog, optionally enables TCP Fast Open, and, if the socket is not already listening, calls inet_csk_listen_start. That function allocates the accept queue, moves the socket state to TCP_LISTEN, confirms the local port through get_port (allocating one if the socket was not bound), and hashes the socket into the listening table.
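The transition to the listening state can be checked from user space with the SO_ACCEPTCONN socket option. The sketch below is illustrative only: is_listening and make_listener are hypothetical helper names, and the listener is bound to an ephemeral loopback port as an assumption to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd is a listening socket, 0 if it is
 * not, and -1 if the option cannot be read. */
int is_listening(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);
    if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
        return -1;
    return val != 0;
}

/* Hypothetical helper: creates a TCP socket bound to an ephemeral loopback
 * port and puts it into the listening state. Returns the fd or -1. */
int make_listener(int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A freshly created socket should report 0 from is_listening, and the descriptor returned by make_listener should report 1.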
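The accept queue is represented by struct request_sock_queue, embedded in inet_connection_sock. In modern kernels, half-open (SYN_RECV) requests are stored as request_sock entries in the established hash table rather than in a separate per-listener list, while request_sock_queue holds the counters and the queue of fully established connections waiting for accept(). reqsk_queue_alloc initialises the queue, inet_csk_reqsk_queue_hash_add registers a new half-open request, and inet_csk_complete_hashdance handles the migration from half-open to full-connection state.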
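One consequence of this design is that the kernel completes the handshake and parks the connection in the accept queue without the application calling accept(). The following sketch demonstrates that on the loopback interface. handshake_without_accept is a hypothetical helper, and the backlog of 4 is an arbitrary assumption.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if a client connect() completes while the
 * server has not called accept() yet, and accept() then returns that
 * connection. Negative return values identify the step that failed. */
int handshake_without_accept(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int rc = -1;
    int conn = -1;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int cli = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (srv < 0 || cli < 0)
        goto out;
    rc = -2;
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 4) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &len) < 0)
        goto out;
    rc = -3;
    /* Blocking connect: returns once the kernel has finished the handshake. */
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    rc = -4;
    /* Only now does the application take the connection off the queue. */
    conn = accept(srv, NULL, NULL);
    if (conn < 0)
        goto out;
    rc = 0;
out:
    if (conn >= 0)
        close(conn);
    if (cli >= 0)
        close(cli);
    if (srv >= 0)
        close(srv);
    return rc;
}
```

A return value of 0 indicates that connect() succeeded before accept() ran, so the established connection was waiting in the accept queue in the meantime.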
Establishing a connection (connect system call)
The client calls connect(fd, &addr, addrlen). The kernel path is __sys_connect, which copies the address and forwards to __sys_connect_file. After a security check, the protocol‑specific connect operation is invoked (for TCP this is inet_stream_connect).
int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen)
{
    struct fd f = fdget(fd);
    int ret = -EBADF;
    if (f.file) {
        struct sockaddr_storage address;
        /* Copy the destination address into kernel memory. */
        ret = move_addr_to_kernel(uservaddr, addrlen, &address);
        if (!ret)
            ret = __sys_connect_file(f.file, &address, addrlen, 0);
        fdput(f);
    }
    return ret;
}
inet_stream_connect locks the socket and calls __inet_stream_connect, which drives the socket-level state machine. Depending on the current sock->state, it may return -EISCONN or -EALREADY, or proceed to invoke the protocol's connect method (tcp_v4_connect for IPv4). The function also handles deferred connections, BPF pre-connect hooks, and timeouts. It sets the socket state to SS_CONNECTING once the SYN has been sent and to SS_CONNECTED once the three-way handshake completes.
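The -EISCONN branch of this state machine is easy to trigger from user space by calling connect() twice on the same socket. The sketch below is a minimal illustration: second_connect_errno is a hypothetical helper, and it uses a throwaway loopback listener as an assumption so that no external server is needed.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connects to a local listener, then calls connect()
 * a second time on the same socket. Returns the errno of the second call
 * (0 if it unexpectedly succeeds), or -1 if the setup fails. */
int second_connect_errno(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int result = -1;
    int srv = socket(AF_INET, SOCK_STREAM, 0);
    int cli = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    if (srv < 0 || cli < 0)
        goto out;
    if (bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 4) < 0 ||
        getsockname(srv, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* First connect: blocks until the handshake has completed. */
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* Second connect on an already connected socket. */
    if (connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        result = errno;
    else
        result = 0;
out:
    if (cli >= 0)
        close(cli);
    if (srv >= 0)
        close(srv);
    return result;
}
```

On Linux the helper is expected to return EISCONN, since the socket is already in SS_CONNECTED when the second call arrives.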
Practical observations and testing
The original article tests this with a simple Python TCP server that calls listen(1024) on port 8080 and a wrk command that generates high-concurrency traffic. Adjusting the net.core.somaxconn sysctl changes the maximum size of the full-connection queue, which makes the backlog clamp and the resulting queue overflow behaviour observable.
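The clamp that makes listen(1024) depend on somaxconn can be modelled in a few lines. The sketch below is not the original article's Python server. effective_backlog and read_somaxconn are hypothetical helpers: the first mirrors the unsigned comparison in __sys_listen, and the second assumes procfs is mounted at /proc.

```c
#include <stdio.h>

/* Hypothetical helper mirroring the clamp in __sys_listen. The comparison
 * is unsigned, so a negative backlog is treated as a very large value and
 * is clamped to somaxconn as well. */
int effective_backlog(int backlog, int somaxconn)
{
    if ((unsigned int)backlog > (unsigned int)somaxconn)
        backlog = somaxconn;
    return backlog;
}

/* Hypothetical helper: reads net.core.somaxconn from procfs.
 * Returns -1 if the file cannot be opened or parsed. */
int read_somaxconn(void)
{
    int value = -1;
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, with somaxconn set to 128 a request for 1024 is reduced to 128, while a request for 64 is left unchanged. Passing the result of read_somaxconn() as the second argument gives the value the running system would apply.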