Root Cause Analysis and Optimization of Network Packet Loss in High‑Traffic Redis Services

Through kernel‑level analysis we discovered that Redis packet loss stemmed from rx_dropped buffer exhaustion caused by interrupt‑handling backlogs, and resolved it by assigning NIC interrupts to specific cores on one NUMA node while binding Redis processes to the other, eliminating loss under dual‑10 GbE load.

Meituan Technology Team

Mar 15, 2018

Background: Since early 2017 the number of Redis users at Meituan‑Dianping has grown dramatically, pushing the total request volume from hundreds of billions of daily accesses to trillions during peak periods. This surge caused severe instability, especially frequent network‑card packet loss.

Initial investigations revealed that some Redis nodes were still equipped with 1 GbE NICs, which quickly became a bottleneck. After replacing them with 10 GbE NICs, packet loss persisted even though NIC bandwidth was far from saturated.

Locating the Source of Packet Loss

The first clue came from the net.if.in.dropped metric in the system monitor, which reported a large number of dropped packets. This metric is derived from /proc/net/dev. An example of the file is shown below:

# cat /proc/net/dev
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1234567    8901    0   12    0     0          0         0  2345678    9012    0    0    0     0       0          0

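The non-zero value in the receive-side drop column confirms that packets are being discarded on the receive path, but the file alone does not say where. To find out how this column is computed, we need to look at the kernel source.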
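To watch the receive-side drop column over time, the file can be parsed from user space. The following is a minimal sketch, not the monitoring agent's actual implementation; the function name parse_proc_net_dev and the rx_/tx_ key prefixes are illustrative assumptions, and the field order is taken from the header shown above.

```python
from typing import Dict

# Column order as printed in the /proc/net/dev header.
RX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "frame", "compressed", "multicast"]
TX_FIELDS = ["bytes", "packets", "errs", "drop", "fifo", "colls", "carrier", "compressed"]


def parse_proc_net_dev(text: str) -> Dict[str, Dict[str, int]]:
    """Return {interface: {"rx_<field>": n, "tx_<field>": n}} from /proc/net/dev text.

    Hypothetical helper: the name and the key naming are assumptions.
    """
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines()[2:]:            # skip the two header lines
        if ":" not in line:
            continue
        name, values = line.split(":", 1)
        numbers = [int(v) for v in values.split()]
        if len(numbers) != len(RX_FIELDS) + len(TX_FIELDS):
            continue                              # unexpected layout, skip the row
        row: Dict[str, int] = {}
        for field, value in zip(RX_FIELDS, numbers[:len(RX_FIELDS)]):
            row["rx_" + field] = value
        for field, value in zip(TX_FIELDS, numbers[len(RX_FIELDS):]):
            row["tx_" + field] = value
        stats[name.strip()] = row
    return stats


# Usage on a live Linux host (not executed here):
#   with open("/proc/net/dev") as f:
#       print(parse_proc_net_dev(f.read())["eth0"]["rx_drop"])
```

Sampling this value at two points in time and taking the difference gives a drop rate, which is more useful than the absolute counter because the counter only ever grows.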

Relevant Kernel Code

The dev_seq_show function formats the /proc/net/dev output:

static int dev_seq_show(struct seq_file *seq, void *v)
{
    if (v == SEQ_START_TOKEN)
        seq_puts(seq, "Inter-|   Receive                     |  Transmit\n"
                     " face |bytes   packets errs drop fifo frame "
                     "compressed multicast|bytes   packets errs "
                     "drop fifo colls carrier compressed\n");
    else
        dev_seq_printf_stats(seq, v);
    return 0;
}

static void dev_seq_printf_stats(struct seq_file *seq, struct net_device *dev)
{
    struct rtnl_link_stats64 temp;
    const struct rtnl_link_stats64 *stats = dev_get_stats(dev, &temp);
    seq_printf(seq, "%6s: %7llu %7llu %4llu %4llu %4llu %5llu %10llu %9llu "
                   "%8llu %7llu %4llu %4llu %4llu %5llu %7llu %10llu\n",
                   dev->name, stats->rx_bytes, stats->rx_packets,
                   stats->rx_errors,
                   stats->rx_dropped + stats->rx_missed_errors,
                   stats->rx_fifo_errors,
                   stats->rx_length_errors + stats->rx_over_errors +
                   stats->rx_crc_errors + stats->rx_frame_errors,
                   stats->rx_compressed, stats->multicast,
                   stats->tx_bytes, stats->tx_packets,
                   stats->tx_errors, stats->tx_dropped,
                   stats->tx_fifo_errors, stats->collisions,
                   stats->tx_carrier_errors +
                   stats->tx_aborted_errors +
                   stats->tx_window_errors +
                   stats->tx_heartbeat_errors,
                   stats->tx_compressed);
}

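The drop column is the sum of stats->rx_dropped (buffer shortage) and stats->rx_missed_errors (packets the receiver missed, typically because the NIC's FIFO overflowed). Both come from dev_get_stats, which collects the driver's counters and then adds the drop counters that the network core keeps on the net_device itself: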

struct rtnl_link_stats64 *dev_get_stats(struct net_device *dev,
                                          struct rtnl_link_stats64 *storage)
{
    const struct net_device_ops *ops = dev->netdev_ops;
    if (ops->ndo_get_stats64) {
        memset(storage, 0, sizeof(*storage));
        ops->ndo_get_stats64(dev, storage);
    } else if (ops->ndo_get_stats) {
        netdev_stats_to_stats64(storage, ops->ndo_get_stats(dev));
    } else {
        netdev_stats_to_stats64(storage, &dev->stats);
    }
    storage->rx_dropped += (unsigned long)atomic_long_read(&dev->rx_dropped);
    storage->tx_dropped += (unsigned long)atomic_long_read(&dev->tx_dropped);
    storage->rx_nohandler += (unsigned long)atomic_long_read(&dev->rx_nohandler);
    return storage;
}

The rtnl_link_stats64 structure (defined in /usr/include/linux/if_link.h) contains fields such as rx_dropped, rx_fifo_errors, rx_missed_errors, etc.

struct rtnl_link_stats64 {
    __u64 rx_packets;    /* total packets received */
    __u64 tx_packets;    /* total packets transmitted */
    __u64 rx_bytes;      /* total bytes received */
    __u64 tx_bytes;      /* total bytes transmitted */
    __u64 rx_errors;    /* bad packets received */
    __u64 tx_errors;    /* packet transmit problems */
    __u64 rx_dropped;   /* no space in Linux buffers */
    __u64 tx_dropped;   /* no space available in Linux */
    __u64 multicast;
    __u64 collisions;
    /* detailed rx_errors */
    __u64 rx_length_errors;
    __u64 rx_over_errors;      /* receiver ring buffer overflow */
    __u64 rx_crc_errors;       /* received pkt with CRC error */
    __u64 rx_frame_errors;     /* recv'd frame alignment error */
    __u64 rx_fifo_errors;      /* recv'r fifo overrun */
    __u64 rx_missed_errors;    /* receiver missed packet */
    /* detailed tx_errors */
    __u64 tx_aborted_errors;
    __u64 tx_carrier_errors;
    __u64 tx_fifo_errors;
    __u64 tx_heartbeat_errors;
    __u64 tx_window_errors;
    __u64 rx_compressed;
    __u64 tx_compressed;
};

In our environment the drops were entirely due to rx_dropped, i.e., kernel buffer exhaustion, not FIFO overflow.
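One way to make the same distinction on a running host is to read the individual counters that Linux exposes under /sys/class/net/<interface>/statistics/, since /proc/net/dev only shows their sum. The sketch below is an illustration under assumptions, not the method used in the original investigation: read_rx_drop_breakdown and its sysfs_root parameter are hypothetical names, and sysfs_root exists only so the function can be pointed at a test directory.

```python
import os
from typing import Dict


def read_rx_drop_breakdown(iface: str, sysfs_root: str = "/sys/class/net") -> Dict[str, int]:
    """Read the counters behind the rx 'drop' and 'fifo' columns of /proc/net/dev.

    Hypothetical helper; sysfs_root is an assumption added for testability.
    """
    stats_dir = os.path.join(sysfs_root, iface, "statistics")
    result: Dict[str, int] = {}
    for counter in ("rx_dropped", "rx_missed_errors", "rx_fifo_errors"):
        with open(os.path.join(stats_dir, counter)) as f:
            result[counter] = int(f.read().strip())
    # dev_seq_printf_stats prints rx_dropped + rx_missed_errors as "drop".
    result["proc_drop_column"] = result["rx_dropped"] + result["rx_missed_errors"]
    return result


# Usage on a live Linux host (not executed here):
#   print(read_rx_drop_breakdown("eth0"))
```

If rx_dropped accounts for the whole drop column while rx_missed_errors and rx_fifo_errors stay at zero, the loss is happening in kernel buffers rather than in the NIC.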

Packet Reception Path

The reception process can be summarised in five steps:

1. The NIC receives a packet.

2. DMA copies the packet into a kernel buffer (sk_buff) via the Rx ring.

3. The NIC raises a hardware interrupt.

4. The interrupt handler schedules a soft‑interrupt (NAPI) to process the packet.

5. The TCP/IP stack processes the packet, and the application finally reads it from the socket buffer.

The Rx ring consists of descriptors that point to pre‑allocated sk_buffs. If the driver cannot replenish descriptors quickly enough, the NIC’s internal FIFO overflows, leading to rx_fifo_errors. When the kernel’s per‑CPU input queue (softnet_data.input_pkt_queue) exceeds netdev_max_backlog, packets are dropped, contributing to rx_dropped.
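The per‑CPU backlog drops described here can be observed from user space through /proc/net/softnet_stat, which prints one row of hexadecimal counters per CPU. The following is a hedged sketch: parse_softnet_stat is a hypothetical helper, and it reads only the first three columns (packets processed, packets dropped because the backlog was full, and the number of times the softirq ran out of budget or time), because later columns differ between kernel versions.

```python
from typing import Dict, List


def parse_softnet_stat(text: str) -> List[Dict[str, int]]:
    """Parse /proc/net/softnet_stat text into one dict per row.

    Hypothetical helper. Rows follow the order of online CPUs, so "row"
    equals the CPU number only when no CPU is offline.
    """
    rows: List[Dict[str, int]] = []
    for index, line in enumerate(text.splitlines()):
        cols = line.split()
        if len(cols) < 3:
            continue                       # blank or malformed line
        rows.append({
            "row": index,
            "processed": int(cols[0], 16),     # values are hexadecimal
            "dropped": int(cols[1], 16),
            "time_squeeze": int(cols[2], 16),
        })
    return rows


# Usage on a live Linux host (not executed here):
#   with open("/proc/net/softnet_stat") as f:
#       for row in parse_softnet_stat(f.read()):
#           if row["dropped"]:
#               print("row", row["row"], "dropped", row["dropped"])
```

A dropped count that grows on only one or two rows suggests that receive processing is concentrated on a few CPUs.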

Interrupt Handling Code

Key functions involved in interrupt registration and processing:

static int ixgbe_request_irq(struct ixgbe_adapter *adapter)
{
    struct net_device *netdev = adapter->netdev;
    int err;

    /* Prefer MSI-X (one vector per queue), then MSI, then a shared legacy IRQ. */
    if (adapter->flags & IXGBE_FLAG_MSIX_ENABLED)
        err = ixgbe_request_msix_irqs(adapter);
    else if (adapter->flags & IXGBE_FLAG_MSI_ENABLED)
        err = request_irq(adapter->pdev->irq, &ixgbe_intr, 0,
                         netdev->name, adapter);
    else
        err = request_irq(adapter->pdev->irq, &ixgbe_intr, IRQF_SHARED,
                         netdev->name, adapter);
    if (err)
        e_err(probe, "request_irq failed, Error %d\n", err);
    return err;
}

static int ixgbe_request_msix_irqs(struct ixgbe_adapter *adapter)
{
    int vector, err;

    /* Register one handler per queue vector. */
    for (vector = 0; vector < adapter->num_q_vectors; vector++) {
        struct ixgbe_q_vector *q_vector = adapter->q_vector[vector];
        struct msix_entry *entry = &adapter->msix_entries[vector];
        err = request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
                         q_vector->name, q_vector);
        if (err) {
            e_err(probe, "request_irq failed for MSIX interrupt '%s' Error: %d\n",
                  q_vector->name, err);
            goto free_queue_irqs;
        }
    }
    return 0;

free_queue_irqs:
    /* Unwind: release the vectors requested so far (cleanup elided). */
    return err;
}

static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
    struct ixgbe_q_vector *q_vector = data;
    if (q_vector->rx.ring || q_vector->tx.ring)
        napi_schedule(&q_vector->napi);
    return IRQ_HANDLED;
}

void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;
    local_irq_save(flags);
    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
    local_irq_restore(flags);
}

static void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = &__get_cpu_var(softnet_data);
    unsigned long time_limit = jiffies + 2;
    int budget = netdev_budget;
    while (!list_empty(&sd->poll_list)) {
        if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
            break;
        // poll the NAPI instance
        // ...
    }
}

The soft‑interrupt net_rx_action processes packets from the per‑CPU backlog. If the backlog length exceeds netdev_max_backlog, enqueue_to_backlog drops the packet:

static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
                               unsigned int *qtail)
{
    struct softnet_data *sd = &per_cpu(softnet_data, cpu);
    if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
        __skb_queue_tail(&sd->input_pkt_queue, skb);
        // schedule NAPI if needed
        return NET_RX_SUCCESS;
    }
    sd->dropped++;
    atomic_long_inc(&skb->dev->rx_dropped);
    kfree_skb(skb);
    return NET_RX_DROP;
}
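The interaction between a bounded backlog and a consumer with a fixed budget can be illustrated with a toy model. This is a sketch under strong assumptions, not a model of real kernel timing: the discrete "tick", the constant arrival rate, and the function name simulate_backlog are all hypothetical. Only the admission test (enqueue while the queue length is at most max_backlog) mirrors the excerpt above.

```python
from typing import Tuple


def simulate_backlog(arrivals_per_tick: int, drain_per_tick: int,
                     max_backlog: int, ticks: int) -> Tuple[int, int]:
    """Toy model of a bounded receive backlog. Returns (delivered, dropped).

    Hypothetical illustration: each tick, packets arrive first, then the
    consumer removes at most drain_per_tick packets.
    """
    queue_len = 0
    delivered = 0
    dropped = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            # Same admission test as the enqueue_to_backlog excerpt.
            if queue_len <= max_backlog:
                queue_len += 1
            else:
                dropped += 1
        drained = min(queue_len, drain_per_tick)
        queue_len -= drained
        delivered += drained
    return delivered, dropped


# A consumer that keeps up never drops; one that is twice too slow
# starts dropping as soon as the queue has filled.
print(simulate_backlog(10, 10, 1000, 100))
print(simulate_backlog(10, 5, 20, 10))
```

In this model, raising max_backlog only delays the first drop when the consumer is persistently slower than the arrival rate. That is consistent with the article's focus on giving interrupt handling more CPU capacity rather than only enlarging queues.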

Optimization Strategies

CPU Affinity: By writing a CPU mask to /proc/irq/<IRQ>/smp_affinity, we can bind each NIC interrupt vector to a specific core. Initially all interrupts were handled by CPU 0, causing a hotspot. Distributing the vectors across the first eight cores reduced the backlog but introduced higher Redis slow‑query counts because Redis processes were also being pre‑empted on those cores.
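The mask written to smp_affinity is a hexadecimal bitmap in which bit N stands for CPU N; on machines with more than 32 CPUs it is split into comma‑separated 32‑bit words, most significant first. A minimal sketch of building such a mask and finding an interface's IRQ numbers follows. The helper names (cpus_to_smp_affinity, irqs_for_device, set_irq_affinity) and the proc_root parameter are assumptions for illustration, and matching on the device name in /proc/interrupts assumes the driver names its vectors after the interface, as in eth0-TxRx-0.

```python
import re
from typing import Iterable, List


def cpus_to_smp_affinity(cpus: Iterable[int]) -> str:
    """Encode CPU numbers as an smp_affinity hex mask (hypothetical helper)."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    if mask == 0:
        raise ValueError("at least one CPU is required")
    words = []
    while mask:                                  # split into 32-bit words
        words.append("%08x" % (mask & 0xFFFFFFFF))
        mask >>= 32
    words.reverse()                              # most significant word first
    words[0] = words[0].lstrip("0")              # top word is never all zeros
    return ",".join(words)


def irqs_for_device(interrupts_text: str, device: str) -> List[int]:
    """Return IRQ numbers whose /proc/interrupts line mentions the device."""
    pattern = re.compile(r"\b%s\b" % re.escape(device))
    irqs = []
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        if sep and head.strip().isdigit() and pattern.search(rest):
            irqs.append(int(head.strip()))
    return irqs


def set_irq_affinity(irq: int, cpus: Iterable[int], proc_root: str = "/proc") -> None:
    """Write the mask for one IRQ; needs root on a real system."""
    with open("%s/irq/%d/smp_affinity" % (proc_root, irq), "w") as f:
        f.write(cpus_to_smp_affinity(cpus))


print(cpus_to_smp_affinity(range(8)))   # ff
```

If the irqbalance daemon is running it may later rewrite these masks, so a manual assignment usually requires stopping it or excluding the affected IRQs.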

Redis Process Affinity: Using taskset -cp we bound Redis processes to the remaining cores, separating them from the interrupt‑handling cores. This mitigated the increase in slow queries.
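The same binding can be expressed programmatically through the sched_setaffinity system call, which Python exposes on Linux as os.sched_setaffinity. The sketch below is an illustrative equivalent, not the tooling used in production; pin_process is a hypothetical name, and the CPU numbers in the usage comment are placeholders.

```python
import os
from typing import Iterable, Set


def pin_process(pid: int, cpus: Iterable[int]) -> Set[int]:
    """Restrict `pid` to `cpus` and return the affinity now in effect.

    Hypothetical helper. A pid of 0 means the calling process. The call
    acts on a single thread ID, so threads that already exist in a
    multi-threaded process keep their previous affinity.
    """
    os.sched_setaffinity(pid, set(cpus))
    return set(os.sched_getaffinity(pid))


# Usage (placeholder PID and CPU range, not executed here):
#   pin_process(12345, range(12, 24))
```

Changing the affinity of another user's process requires the appropriate privileges, as it does with taskset.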

NUMA Awareness: The server has two NUMA nodes. When interrupts are spread across both nodes, the per‑CPU softnet_data structures and sk_buffs may reside on different nodes from the Redis processes, causing cross‑node memory accesses and higher latency. By confining all NIC interrupts to a single NUMA node and keeping Redis processes on the other node, cache locality improves and wake‑affinity effects are reduced.
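Before choosing cores, it helps to know which CPUs belong to which node and which node the NIC is attached to. Linux exposes both through sysfs: /sys/devices/system/node/node<N>/cpulist and /sys/class/net/<interface>/device/numa_node. The following sketch reads them; the function names and the sysfs_root parameters are assumptions introduced for illustration and testing.

```python
import os
from typing import Dict, List


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel cpulist string such as "0-7,16-23" into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_topology(sysfs_root: str = "/sys/devices/system/node") -> Dict[int, List[int]]:
    """Return {node_id: [cpu, ...]} (hypothetical helper)."""
    topology: Dict[int, List[int]] = {}
    for entry in sorted(os.listdir(sysfs_root)):
        if entry.startswith("node") and entry[4:].isdigit():
            with open(os.path.join(sysfs_root, entry, "cpulist")) as f:
                topology[int(entry[4:])] = parse_cpulist(f.read())
    return topology


def nic_numa_node(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the NIC's NUMA node, or -1 when the platform reports none."""
    with open(os.path.join(sysfs_root, iface, "device", "numa_node")) as f:
        return int(f.read().strip())


# Usage on a live Linux host (not executed here):
#   print(read_numa_topology(), nic_numa_node("eth0"))
```

Logical CPU numbering is not always contiguous per node, which is why the node membership is read from sysfs instead of being assumed.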

Overall, the final configuration was:

Interrupt vectors → first 8 logical cores of NUMA node 0.

Redis processes → all cores of NUMA node 1.

This layout achieved near‑zero packet loss under full dual‑10 GbE load while keeping Redis slow‑query rates low.
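The final layout can be written down as a small planning function. This is a sketch under assumptions, not the team's deployment script: build_affinity_plan is a hypothetical name, the IRQ numbers and CPU lists in the example are made up, and assigning vectors to cores round‑robin is an illustrative choice, since the article only states that the vectors go to the first eight logical cores of node 0.

```python
from typing import Dict, List, Sequence, Tuple


def build_affinity_plan(irqs: Sequence[int],
                        node0_cpus: Sequence[int],
                        node1_cpus: Sequence[int],
                        irq_core_count: int = 8) -> Tuple[Dict[int, int], List[int]]:
    """Return ({irq: cpu}, redis_cpus) for the layout described above.

    Hypothetical helper. IRQs are spread round-robin (an assumption) over
    the first irq_core_count CPUs of node 0; Redis gets all CPUs of node 1.
    """
    irq_cpus = list(node0_cpus)[:irq_core_count]
    if not irq_cpus:
        raise ValueError("node 0 has no CPUs available for interrupts")
    irq_to_cpu = {irq: irq_cpus[i % len(irq_cpus)] for i, irq in enumerate(irqs)}
    return irq_to_cpu, list(node1_cpus)


# Example with made-up numbers: 10 vectors, 12 CPUs per node.
irq_to_cpu, redis_cpus = build_affinity_plan(range(98, 108), range(0, 12), range(12, 24))
print(irq_to_cpu)
print(redis_cpus)
```

Keeping the plan separate from the code that applies it makes it easy to review the mapping before anything is written to /proc or any process is re‑pinned.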
