Unveiling the TCP Connection Process: Inside the Linux Socket System Calls

This article dissects the Linux kernel's TCP connection workflow, explaining how the three‑way handshake prevents stale SYN packets, and walks through the socket(), bind(), listen() and connect() system calls with detailed code analysis of the underlying kernel functions and data structures.

Linux Kernel Journey

Why the three‑way handshake matters

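When network latency or failures cause an old SYN packet to linger and arrive late, a server that committed to a connection on the strength of that SYN alone would create a spurious connection and waste resources. TCP avoids this with a three‑way handshake: the server answers the SYN with a SYN‑ACK and treats the connection as established only after the client's final ACK confirms that the request is current. A client that receives a SYN‑ACK for a SYN it no longer stands behind replies with RST instead.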
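The freshness check can be pictured with a toy model. The sketch below is not kernel code, and every name in it (toy_client, toy_server_synack_ackno, toy_client_on_synack) is hypothetical; it only assumes the standard rule that a SYN-ACK acknowledges the SYN's sequence number plus one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name below is hypothetical, none of it is kernel code. */
enum client_reply { CLIENT_SENDS_ACK, CLIENT_SENDS_RST };

struct toy_client {
    uint32_t current_isn; /* sequence number of the SYN the client still wants answered */
};

/* The server acknowledges a SYN by echoing its sequence number plus one. */
static uint32_t toy_server_synack_ackno(uint32_t syn_seq)
{
    return syn_seq + 1u;
}

/* The client completes the handshake only when the SYN-ACK acknowledges
 * its current SYN; anything else is answered with a reset. */
static enum client_reply toy_client_on_synack(const struct toy_client *c,
                                              uint32_t ack_no)
{
    assert(c != NULL);
    return ack_no == c->current_isn + 1u ? CLIENT_SENDS_ACK : CLIENT_SENDS_RST;
}
```

With current_isn set to 5000, a SYN-ACK generated from a delayed SYN with sequence number 1200 fails the comparison, so the client resets it and the server can drop the half-open state.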

Socket creation (socket system call)

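The user calls socket(AF_INET, SOCK_STREAM, 0), which goes through the C library's socket() wrapper. On x86‑64 the wrapper issues the socket system call directly, entering the kernel at SYSCALL_DEFINE3(socket, ...). On architectures that multiplex socket operations through one entry point (32‑bit x86, for example), the call instead arrives at SYSCALL_DEFINE2(socketcall, ...) and is handled by its SYS_SOCKET case. Both paths end up in __sys_socket: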

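int __sys_socket(int family, int type, int protocol) {
    struct socket *sock;
    int flags;
    /* Allocate the struct socket and run the protocol family's create hook. */
    sock = __sys_socket_create(family, type, protocol);
    if (IS_ERR(sock))
        return PTR_ERR(sock);
    /* Keep only the auxiliary flags (SOCK_NONBLOCK, SOCK_CLOEXEC). */
    flags = type & ~SOCK_TYPE_MASK;
    /* Translate SOCK_NONBLOCK on architectures where it differs from O_NONBLOCK. */
    if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
        flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
    /* Install the socket in the fd table and return the new descriptor. */
    return sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK));
}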

__sys_socket_create validates the type flags, strips auxiliary flags, and calls sock_create, which forwards to __sock_create. The latter checks the protocol family, loads the appropriate module if needed, and finally invokes the family's create function (e.g., inet_create for IPv4).
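From user space, the effect of the auxiliary flags can be checked with fcntl(). The following is a minimal sketch assuming Linux and glibc; the helper names are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket with both auxiliary flags OR-ed into the type argument.
 * Returns the descriptor, or -1 with errno set. */
static int make_nonblocking_cloexec_socket(void)
{
    return socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK | SOCK_CLOEXEC, 0);
}

/* Non-zero if O_NONBLOCK is set in the file status flags of fd. */
static int fd_is_nonblocking(int fd)
{
    int fl;

    assert(fd >= 0);
    fl = fcntl(fd, F_GETFL);
    return fl != -1 && (fl & O_NONBLOCK) != 0;
}

/* Non-zero if FD_CLOEXEC is set in the descriptor flags of fd. */
static int fd_is_cloexec(int fd)
{
    int fl;

    assert(fd >= 0);
    fl = fcntl(fd, F_GETFD);
    return fl != -1 && (fl & FD_CLOEXEC) != 0;
}
```

Both checks should succeed on a descriptor returned by make_nonblocking_cloexec_socket(), which is the user-visible result of the flag handling at the end of __sys_socket.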

static const struct net_proto_family inet_family_ops = {
    .family = PF_INET,
    .create = inet_create,
    .owner  = THIS_MODULE,
};

inet_create sets the struct socket state to SS_UNCONNECTED, looks up the matching inet_protosw entry in inetsw, allocates a struct sock with sk_alloc, and links the new sock to the high‑level struct socket via sock_init_data, which calls sk_set_socket.

Binding a socket (bind system call)

The user‑level bind(sockfd, &addr, sizeof(addr)) reaches the kernel as __sys_bind. After copying the address from user space, a security check is performed, then the protocol‑specific bind operation is invoked (for TCP this is inet_bind).

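int __sys_bind(int fd, struct sockaddr __user *umyaddr, int addrlen) {
    struct socket *sock;
    struct sockaddr_storage address;
    int err, fput_needed;
    /* Resolve the descriptor to its struct socket. */
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        /* Copy the user-supplied address into kernel memory. */
        err = move_addr_to_kernel(umyaddr, addrlen, &address);
        if (!err) {
            /* Let the security module veto the bind. */
            err = security_socket_bind(sock, (struct sockaddr *)&address, addrlen);
            if (!err)
                /* Protocol-specific bind, inet_bind for TCP. */
                err = sock->ops->bind(sock, (struct sockaddr *)&address, addrlen);
        }
        /* Drop the file reference taken by the lookup, if any. */
        fput_light(sock->file, fput_needed);
    }
    return err;
}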

inet_bind checks the address length, runs the cgroup BPF bind hook, and delegates the core work to __inet_bind. That function validates the address family, classifies the address type against the routing tables to verify that the address is usable, checks port‑binding privileges, acquires the sock lock, and finally calls the protocol's get_port (for TCP this is inet_csk_get_port) to allocate a local port.
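Port allocation through get_port is easy to observe by binding to port 0 and asking the kernel which port it chose. This is a minimal user-space sketch; bind_loopback_any_port is a hypothetical helper and the loopback address is an arbitrary choice.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind fd to 127.0.0.1 with port 0 so that the kernel picks a free port.
 * Returns the chosen port in host byte order, or -1 on error. */
static int bind_loopback_any_port(int fd)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    assert(fd >= 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* 0 asks the kernel to allocate a port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        return -1;
    /* Read back the address the kernel actually assigned. */
    if (getsockname(fd, (struct sockaddr *)&addr, &len) != 0)
        return -1;
    return ntohs(addr.sin_port);
}
```

Calling the helper a second time on the same descriptor fails, because a socket that already holds a local port is rejected before get_port is reached.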

Listening for connections (listen system call)

After a successful bind, the application calls listen(sockfd, backlog). The kernel entry point is __sys_listen, which caps backlog at somaxconn, performs a security check, and then calls sock->ops->listen (for TCP this is inet_listen).

int __sys_listen(int fd, int backlog) {
    struct socket *sock;
    int err, fput_needed;
    int somaxconn;
    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        /* Per-network-namespace limit, net.core.somaxconn. */
        somaxconn = READ_ONCE(sock_net(sock->sk)->core.sysctl_somaxconn);
        /* Unsigned comparison: a negative backlog is also clamped. */
        if ((unsigned int)backlog > somaxconn)
            backlog = somaxconn;
        err = security_socket_listen(sock, backlog);
        if (!err)
            /* Protocol-specific listen, inet_listen for TCP. */
            err = sock->ops->listen(sock, backlog);
        fput_light(sock->file, fput_needed);
    }
    return err;
}

inet_listen verifies that the socket is in SS_UNCONNECTED and of type SOCK_STREAM, sets sk_max_ack_backlog, optionally enables TCP Fast Open, and finally calls inet_csk_listen_start, which allocates the accept queue, moves the sock state to TCP_LISTEN, claims the local port through get_port, and hashes the socket so that incoming SYNs can find it.
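Whether a socket has made the transition to the listening state can be queried with the SO_ACCEPTCONN socket option. A small sketch, with is_listening as a hypothetical helper name:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd is a listening socket, 0 if it is not, -1 on error. */
static int is_listening(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    assert(fd >= 0);
    if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &value, &len) != 0)
        return -1;
    return value != 0;
}
```

A freshly created TCP socket should report 0, and the same descriptor should report 1 once listen() has returned successfully.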

The accept queue is represented by struct request_sock_queue, embedded in inet_connection_sock as icsk_accept_queue. In modern kernels the half‑connection (SYN‑RECV) and full‑connection (ESTABLISHED) stages are both tracked through inet_connection_sock and request_sock_queue: reqsk_queue_alloc initializes the queue, inet_csk_reqsk_queue_hash_add registers a pending request, and inet_csk_complete_hashdance handles the migration from half‑ to full‑connection state once the handshake completes.
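One consequence of the full-connection queue is that a client's connect() can succeed before the server has called accept(), because the kernel completes the handshake on its own and parks the connection in the queue. The sketch below shows this on loopback; both helper names are hypothetical.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a TCP listener on 127.0.0.1 with a kernel-chosen port.
 * Returns the listening fd (or -1) and stores the port in *port_out. */
static int open_loopback_listener(int backlog, unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port_out != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); /* let the kernel choose the port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, backlog) != 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) != 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}

/* Blocking connect to 127.0.0.1:port. Returns the connected fd or -1. */
static int connect_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling connect_loopback() right after open_loopback_listener(), with no accept() in between, should return a connected descriptor; a later accept() on the listener then returns immediately with the queued connection.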

Establishing a connection (connect system call)

The client calls connect(fd, &addr, addrlen). The kernel path is __sys_connect, which copies the address and forwards to __sys_connect_file. After a security check, the protocol‑specific connect operation is invoked (for TCP this is inet_stream_connect).

int __sys_connect(int fd, struct sockaddr __user *uservaddr, int addrlen) {
    /* Resolve the descriptor to its struct file. */
    struct fd f = fdget(fd);
    int ret = -EBADF;
    if (f.file) {
        struct sockaddr_storage address;
        /* Copy the destination address into kernel memory. */
        ret = move_addr_to_kernel(uservaddr, addrlen, &address);
        if (!ret)
            /* Security check, then sock->ops->connect. */
            ret = __sys_connect_file(f.file, &address, addrlen, 0);
        fdput(f);
    }
    return ret;
}

inet_stream_connect locks the socket and calls __inet_stream_connect, which switches on the socket‑level state in sock->state. A socket already in SS_CONNECTED gets -EISCONN, one in SS_CONNECTING may get -EALREADY, and one in SS_UNCONNECTED proceeds to the protocol's connect method (tcp_v4_connect for IPv4) and moves to SS_CONNECTING. The function also handles deferred connections, BPF pre‑connect hooks, and the connect timeout, and it sets the state to SS_CONNECTED once the three‑way handshake completes.
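The -EISCONN branch is the easiest one to trigger from user space: call connect() twice on the same blocking socket. The sketch below assumes a loopback listener; loopback_addr, listen_on_loopback and try_connect are hypothetical helpers.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill *addr with 127.0.0.1:port (port in host byte order). */
static void loopback_addr(struct sockaddr_in *addr, unsigned short port)
{
    memset(addr, 0, sizeof(*addr));
    addr->sin_family = AF_INET;
    addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr->sin_port = htons(port);
}

/* Listener on 127.0.0.1 with a kernel-chosen port, stored in *port_out.
 * Returns the listening fd or -1. */
static int listen_on_loopback(unsigned short *port_out)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port_out != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    loopback_addr(&addr, 0);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(fd, 8) != 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) != 0) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}

/* Call connect() once. Returns 0 on success or the errno value on failure. */
static int try_connect(int fd, unsigned short port)
{
    struct sockaddr_in addr;

    loopback_addr(&addr, port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return 0;
    return errno;
}
```

The first try_connect() on a fresh blocking socket should return 0; repeating it on the same descriptor should return EISCONN, which corresponds to the SS_CONNECTED case in the kernel.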

Practical observations and testing

The original article tests this with a simple Python TCP server that calls listen(1024) on port 8080 and a wrk command that generates high‑concurrency traffic against it. Because __sys_listen caps the backlog at net.core.somaxconn, adjusting that sysctl changes the effective maximum size of the full‑connection queue, and queue overflow behaviour can then be verified under load.
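The backlog a listener actually gets can be estimated before running such a test. The sketch below reads the sysctl through procfs and mirrors the unsigned clamp from __sys_listen; read_somaxconn and effective_backlog are hypothetical helpers, and the procfs path is assumed to be mounted in the usual place.

```c
#include <assert.h>
#include <stdio.h>

/* Read net.core.somaxconn from procfs. Returns -1 if it cannot be read. */
static long read_somaxconn(void)
{
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    long value = -1;

    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Mirror the clamp in __sys_listen: the requested backlog is compared as an
 * unsigned value, so a negative request is clamped to somaxconn as well. */
static int effective_backlog(int requested, int somaxconn)
{
    assert(somaxconn >= 0);
    return (unsigned int)requested > (unsigned int)somaxconn ? somaxconn
                                                             : requested;
}
```

With somaxconn at 128, for example, a listen(1024) call ends up with a backlog of 128, so raising the sysctl is a precondition for seeing the larger queue.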
