
Inside Intel HDSLB: Architecture and Advanced Load‑Balancing Features

This article walks through Intel's High-Density Scalable Load Balancer (HDSLB), detailing its layered software architecture, elephant-flow handling, fast/slow-path separation, and packet-forwarding optimizations, and closes with a code-level analysis of its configuration, initialization, and data-plane job execution.

Introduction

In the first two articles we introduced Intel HDSLB as a next-generation, high-performance layer-4 (L4) load balancer, covering quick start, use cases, basic principles, and deployment configuration. This third part dissects the HDSLB-DPVS open-source code and highlights several advanced features.

Software Architecture

The diagram below shows the HDSLB software stack, which can be divided into five layers from top to bottom.

Control Plane : retains the LVS control plane and provides three user-space tools (ipvsadm, dpip, keepalived) that talk to the data plane over a local UNIX socket.

Load Balancer Layer : implements the scheduler, layer-4 protocol handling, connection tracking, and FastPath. FastPath (fast/slow-path separation) is a core DPVS optimization.

Lite IP‑Stack Layer : implements L2‑L3 protocols such as ARP, IPv4/IPv6, ICMP, and routing.

Net Devices Layer : manages physical NICs, bonding, VLAN, KNI, traffic control, and address lists.

Hardware Acceleration Layer : leverages Intel CPU and NIC features (FDIR, RSS, checksum offload, AVX‑512, DLB, SR‑IOV, etc.).
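
Most of these capabilities are negotiated when the ports are initialized. As a rough, generic-DPDK illustration (not HDSLB code), an application can query a port's offload and RSS capabilities before enabling them; on older DPDK releases the flags are spelled DEV_RX_OFFLOAD_* / ETH_RSS_* instead:

#include <stdio.h>
#include <rte_ethdev.h>

/* Generic DPDK sketch, not HDSLB code: check what a port can offload
 * before enabling RSS and RX checksum offload in the port configuration. */
static void probe_port_offloads(uint16_t port_id)
{
    struct rte_eth_dev_info info;

    if (rte_eth_dev_info_get(port_id, &info) != 0)
        return;

    if (info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_IPV4_CKSUM)
        printf("port %u: IPv4 RX checksum offload supported\n", port_id);

    if (info.flow_type_rss_offloads & RTE_ETH_RSS_IPV4)
        printf("port %u: RSS over IPv4 supported\n", port_id);
}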

Advanced Features

Elephant‑Flow Forwarding Optimization

Modern DPDK programs use RSS to map IP 5‑tuple traffic to specific cores. When a few heavy flows (elephant flows) dominate traffic, some cores become overloaded while others stay idle, leading to packet loss.
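
To see why this happens, recall that RSS computes a Toeplitz hash over the flow tuple, and that hash always selects the same RX queue for a given flow. Below is a minimal sketch using DPDK's software Toeplitz helper; the key, the modulo mapping, and the queue count are illustrative simplifications, not HDSLB's configuration:

#include <stdint.h>
#include <rte_thash.h>

/* Illustrative 40-byte Toeplitz key; a real deployment would use the key
 * programmed into the NIC. Remaining bytes are zero-initialized. */
static const uint8_t rss_key[40] = { 0x6d, 0x5a, 0x56, 0xda };

static uint16_t rss_queue_for_flow(uint32_t sip, uint32_t dip,
                                   uint16_t sport, uint16_t dport,
                                   uint16_t nb_rx_queues)
{
    union rte_thash_tuple tuple;

    tuple.v4.src_addr = sip;
    tuple.v4.dst_addr = dip;
    tuple.v4.sport    = sport;
    tuple.v4.dport    = dport;

    /* Toeplitz hash over the IPv4 addresses + L4 ports. */
    uint32_t hash = rte_softrss((uint32_t *)&tuple, RTE_THASH_V4_L4_LEN, rss_key);

    /* Simplified: real NICs index a redirection table (RETA) with the low
     * hash bits. Either way, the same 5-tuple always maps to the same queue,
     * so one elephant flow keeps hitting a single core however large it grows. */
    return hash % nb_rx_queues;
}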

HDSLB solves this with three steps:

Elephant‑Flow Identification : distinguish heavy from light flows.

Elephant‑Flow Splitting : distribute a heavy flow across multiple cores instead of a single core.

Elephant‑Flow Reordering : restore the original packet order of the split traffic before transmission.

The key technology is Intel's Dynamic Load Balancer (DLB) feature, which enables:

Packet reception: split heavy flows across multiple cores.

Packet transmission: aggregate and order the split flows.

Implementation details:

Main Core runs a Switch Filter using Intel NIC FDIR to identify and mark elephant versus mouse flows, reducing software lookup overhead.

Worker Cores use Intel CPU DLB to split elephant flows.

Worker Cores process the split traffic.

For more details see Intel’s official DLB guide.
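
In DPDK, DLB is exposed through the eventdev framework (the dlb2 event device driver) rather than a bespoke API. The following is a hedged sketch of the worker side of that pattern, assuming an ORDERED event queue and ports/queues configured elsewhere; EV_DEV, TX_Q, and the per-packet work are illustrative and not taken from HDSLB's source:

#include <rte_eventdev.h>
#include <rte_mbuf.h>

#define EV_DEV 0      /* event device id (assumption) */
#define TX_Q   1      /* next-stage (aggregation/TX) event queue (assumption) */

/* Worker loop pulling events from an ORDERED queue: many workers process
 * packets of the same elephant flow in parallel, and the event device
 * restores the original order before the TX stage dequeues them. */
static void dlb_worker_loop(uint8_t port_id)
{
    struct rte_event ev[32];

    for (;;) {
        uint16_t n = rte_event_dequeue_burst(EV_DEV, port_id, ev, 32, 0);

        for (uint16_t i = 0; i < n; i++) {
            struct rte_mbuf *m = ev[i].mbuf;
            (void)m;                      /* ... heavy per-packet work here ... */

            /* Forward the event to the next stage; ordering within the same
             * flow_id is reestablished by the hardware scheduler. */
            ev[i].queue_id = TX_Q;
            ev[i].op = RTE_EVENT_OP_FORWARD;
        }
        if (n)
            rte_event_enqueue_burst(EV_DEV, port_id, ev, n);
    }
}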

Fast/Slow Path Separation

Fast/slow path separation is now a standard high‑performance forwarding mode. HDSLB provides a Session/Connection fast path for latency‑sensitive TCP traffic.
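
A minimal sketch of the idea: a per-worker session table is consulted before any service matching or scheduling. The table layout and slow_path_schedule() below are simplifying assumptions, not HDSLB's real structures:

#include <stdint.h>
#include <string.h>
#include <rte_jhash.h>
#include <rte_branch_prediction.h>

struct flow_key { uint32_t sip, dip; uint16_t sport, dport; uint32_t proto; };
struct session  { struct flow_key key; uint32_t backend; int in_use; };

/* Each worker lcore would own its own table; one table is shown for brevity. */
#define SESS_TBL_SIZE 4096
static struct session sess_tbl[SESS_TBL_SIZE];

static uint32_t slow_path_schedule(const struct flow_key *k)
{
    /* Stand-in for service matching + scheduler selection (rr/wlc/...). */
    return rte_jhash(k, sizeof(*k), 0) % 8;
}

static uint32_t lookup_or_schedule(const struct flow_key *k)
{
    uint32_t h = rte_jhash(k, sizeof(*k), 0) % SESS_TBL_SIZE;
    struct session *s = &sess_tbl[h];

    /* Fast path: established session, no rule evaluation or scheduling. */
    if (likely(s->in_use && memcmp(&s->key, k, sizeof(*k)) == 0))
        return s->backend;

    /* Slow path: the first packet of a flow goes through the full lookup and
     * scheduler, then the result is cached so later packets stay fast. */
    s->key = *k;
    s->backend = slow_path_schedule(k);
    s->in_use = 1;
    return s->backend;
}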

Basic Packet Forwarding Optimizations

HDSLB improves basic forwarding with two techniques:

Vectorize : process similar packets in batches (inspired by VPP) to improve icache/dcache hit rates.

Microjobs : split the original jobs into cache-aligned microjobs and use pipelined prefetching between processing nodes to reduce cache misses (see the sketch below).
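
A generic DPDK sketch of the batching-plus-prefetch pattern (process_one() and the offset are placeholders, not HDSLB functions): while packet i is being processed, the headers of packet i+PREFETCH_OFFSET are prefetched so they are already in cache when their turn comes.

#include <rte_mbuf.h>
#include <rte_prefetch.h>

#define PREFETCH_OFFSET 3   /* how far ahead of processing to prefetch */

static void process_one(struct rte_mbuf *m) { (void)m; /* parse / forward */ }

static void process_burst(struct rte_mbuf **mbufs, uint16_t n)
{
    uint16_t i;

    /* Warm up the pipeline. */
    for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
        rte_prefetch0(rte_pktmbuf_mtod(mbufs[i], void *));

    /* Steady state: prefetch ahead, process behind. */
    for (i = 0; i + PREFETCH_OFFSET < n; i++) {
        rte_prefetch0(rte_pktmbuf_mtod(mbufs[i + PREFETCH_OFFSET], void *));
        process_one(mbufs[i]);
    }

    /* Drain the tail of the burst. */
    for (; i < n; i++)
        process_one(mbufs[i]);
}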

Code Analysis

Clone the source:

git clone https://github.com/intel/high-density-scalable-load-balancer.git

Directory Structure

high-density-scalable-load-balancer (main) $ tree -L 2
.
├── Makefile
├── conf   # configuration samples
│   ├── hdslb.bond.conf.sample
│   ├── hdslb.conf.items
│   ├── hdslb.conf.sample
│   ├── hdslb.conf.single-bond.sample
│   └── hdslb.conf.single-nic.sample
├── include # header files
│   ├── cfgfile.h
│   ├── common.h
│   ├── ctrl.h
│   ├── dpdk.h
│   ├── flow.h
│   ├── icmp.h
│   ├── ipvs
│   └── ...
├── patch   # DPDK patches
│   ├── dpdk-16.07
│   └── ...
├── scripts  # deployment scripts
│   ├── ipvs-tunnel.rs.deploy.sh
│   └── ...
├── src      # core source code
│   ├── main.c
│   ├── cfgfile.c
│   ├── ctrl.c
│   ├── dpip.c
│   ├── ipvs
│   └── ...
└── tools    # utilities (dpip, ipvsadm, keepalived, lbdebug)
    ├── Makefile
    ├── dpip
    ├── ipvsadm
    └── ...

Configuration Parsing

! global config
global_defs {
    log_level   DEBUG # convenient for debugging
    ! log_file    /var/log/hdslb.log
    ! log_async_mode    on
}

Key configuration sections include global_defs , netif_defs (DPDK NIC settings, RSS, FDIR), worker_defs (CPU core roles), timer_defs , neigh_defs , ipv4/ipv6_defs , ctrl_defs , ipvs_defs , and sa_pool .
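
For example, worker_defs is where each CPU core gets its role and its RX/TX queue bindings. The excerpt below follows the DPVS-style syntax used by the samples in conf/; the exact keywords in hdslb.conf.sample may differ slightly.

worker_defs {
    <init> worker cpu0 {
        type    master
        cpu_id  0
    }
    <init> worker cpu1 {
        type    slave
        cpu_id  1
        port    dpdk0 {
            rx_queue_ids    0
            tx_queue_ids    0
        }
    }
}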

Startup Flow Analysis

int main(int argc, char *argv[])
{
    /* Condensed excerpt: option parsing, signal handling, and cleanup are
     * omitted; cycles_per_sec and timer_sched_interval_us are globals set up
     * by the timer configuration. */
    int err;
    uint16_t nports;
    portid_t pid;
    struct netif_port *dev;
    uint64_t now_cycles, prev_cycles = 0;

    if (get_numa_nodes() > DPVS_MAX_SOCKET) {
        fprintf(stderr, "DPVS_MAX_SOCKET is smaller than system numa nodes!\n");
        return -1;
    }
    if (set_all_thread_affinity() != 0) {
        fprintf(stderr, "set_all_thread_affinity failed\n");
        exit(EXIT_FAILURE);
    }
    err = rte_eal_init(argc, argv);
    if (err < 0)
        rte_exit(EXIT_FAILURE, "Invalid EAL parameters\n");
    argc -= err; argv += err;
    RTE_LOG(INFO, DPVS, "HDSLB version: %s, build on %s\n", HDSLB_VERSION, HDSLB_BUILD_DATE);
    if ((err = cfgfile_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail init configuration file: %s\n", dpvs_strerror(err));
    if ((err = netif_virtual_devices_add()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail add virtual devices: %s\n", dpvs_strerror(err));
    if ((err = dpvs_timer_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init timer: %s\n", dpvs_strerror(err));
    if ((err = tc_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init traffic control: %s\n", dpvs_strerror(err));
    if ((err = netif_init(NULL)) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init netif: %s\n", dpvs_strerror(err));
    if ((err = ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init ctrl plane: %s\n", dpvs_strerror(err));
    if ((err = tc_ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init tc control plane: %s\n", dpvs_strerror(err));
    if ((err = vlan_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init vlan: %s\n", dpvs_strerror(err));
    if ((err = inet_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init inet: %s\n", dpvs_strerror(err));
    if ((err = sa_pool_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init sa_pool: %s\n", dpvs_strerror(err));
    if ((err = ip_tunnel_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init tunnel: %s\n", dpvs_strerror(err));
    if ((err = dp_vs_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init ipvs: %s\n", dpvs_strerror(err));
    if ((err = netif_ctrl_init()) != EDPVS_OK)
        rte_exit(EXIT_FAILURE, "Fail to init netif_ctrl: %s\n", dpvs_strerror(err));

    /* start DPDK ports */
    nports = rte_eth_dev_count_avail();
    for (pid = 0; pid < nports; pid++) {
        dev = netif_port_get(pid);
        if (!dev) {
            RTE_LOG(WARNING, DPVS, "port %d not found\n", pid);
            continue;
        }
        err = netif_port_start(dev);
        if (err != EDPVS_OK)
            rte_exit(EXIT_FAILURE, "Start %s failed\n", dev->name);
    }

    /* launch data-plane worker threads on the slave lcores */
    netif_lcore_start();

    /* master lcore loop: control-plane messages, timers, KNI and neighbor work */
    while (1) {
        try_reload();
        sockopt_ctl(NULL);
        msg_master_process(0);
        now_cycles = rte_get_timer_cycles();
        if ((now_cycles - prev_cycles) * 1000000 / cycles_per_sec > timer_sched_interval_us) {
            rte_timer_manage();
            prev_cycles = now_cycles;
        }
        kni_process_on_master();
        neigh_process_ring(NULL, 0);
        netif_update_master_loop_cnt();
    }
    return 0;
}

Data‑Plane Job Registration

During initialization, three NETIF_LCORE_JOB_LOOP jobs are registered:

recv_fwd : packet reception and forwarding.

xmit : packet transmission.

timer_manage : timer handling.

Additional jobs include control‑plane processing, IPv4 fragment handling, ARP/neighbor processing, and slow‑path jobs.
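
For reference, a registration sketch in the DPVS style that HDSLB inherits; the struct and helper names (netif_lcore_loop_job, netif_lcore_loop_job_register) follow DPVS, and the exact fields and job prototype may differ in HDSLB:

/* Sketch only: field names follow DPVS and may not match HDSLB exactly. */
static struct netif_lcore_loop_job recv_fwd_job;

static void register_recv_fwd_job(void)
{
    snprintf(recv_fwd_job.name, sizeof(recv_fwd_job.name), "%s", "recv_fwd");
    recv_fwd_job.func = lcore_job_recv_fwd;  /* runs on every worker-loop pass */
    recv_fwd_job.data = NULL;
    recv_fwd_job.type = NETIF_LCORE_JOB_LOOP;
    netif_lcore_loop_job_register(&recv_fwd_job);
}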

Forwarding Process Example

The lcore_job_recv_fwd function receives packets, processes ARP and redirect rings, updates statistics, and calls lcore_process_packets for L2/L3 handling.

static int lcore_job_recv_fwd(void *arg __rte_unused, int high_stat __rte_unused)
{
    int i, j;
    int stat = 0;          /* condensed excerpt: statistics accounting omitted */
    portid_t pid;
    lcoreid_t cid = rte_lcore_id();

    /* walk every port and RX queue bound to this worker lcore */
    for (i = 0; i < lcore_conf[lcore2index[cid]].nports; i++) {
        pid = lcore_conf[lcore2index[cid]].pqs[i].id;
        for (j = 0; j < lcore_conf[lcore2index[cid]].pqs[i].nrxq; j++) {
            struct netif_queue_conf *qconf = &lcore_conf[lcore2index[cid]].pqs[i].rxqs[j];

            /* drain packets handed over from other lcores (ARP, flow redirects) */
            lcore_process_arp_ring(qconf, cid);
            lcore_process_redirect_ring(qconf, cid);

            /* burst-receive from the NIC queue (nic_type is defined elsewhere
             * in the source), then run L2/L3 and load-balancing processing */
            qconf->len = netif_rx_burst(pid, qconf, nic_type);
            lcore_process_packets(qconf, qconf->mbufs, cid, qconf->len, 0);
        }
    }
    return stat;
}

After L2 processing, netif_deliver_mbuf removes the Ethernet header and dispatches the packet to the appropriate protocol handler (e.g., ipv4_rcv for IPv4). The IPv4 handler performs error checks and invokes the INET hook chain, where dp_vs_in and dp_vs_pre_routing perform connection lookup, scheduling, and final transmission via xmit_inbound or xmit_outbound.
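
In the DPVS lineage, those hooks are installed as an ordered array of inet_hook_ops on the IPv4 PRE_ROUTING chain. The sketch below follows that convention; the priorities and exact field names are assumptions and may differ in HDSLB's source:

#include <rte_common.h>   /* RTE_DIM */

/* Sketch of hook registration in the DPVS style; priorities are illustrative. */
static struct inet_hook_ops dp_vs_ops[] = {
    {
        .hook     = dp_vs_pre_routing,   /* runs first: defense / SYN-proxy checks */
        .hooknum  = INET_HOOK_PRE_ROUTING,
        .priority = 99,
    },
    {
        .hook     = dp_vs_in,            /* connection lookup, scheduling, xmit */
        .hooknum  = INET_HOOK_PRE_ROUTING,
        .priority = 100,
    },
};

static int dp_vs_register_hooks(void)
{
    /* typically called once from dp_vs_init() */
    return ipv4_register_hooks(dp_vs_ops, RTE_DIM(dp_vs_ops));
}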

Conclusion

This series demonstrates that HDSLB builds upon the solid DPVS data‑plane and adds Intel’s hardware‑assisted acceleration to address real‑world challenges such as elephant flows. Its comprehensive, production‑ready solution makes it a valuable reference for developers interested in DPDK‑based load‑balancing.

Tags: DPDK, Network Acceleration, Load Balancer, Elephant Flow, Fast Path, Intel HDSLB