
Implementing the Classic DCQCN Congestion Control Algorithm in NS‑3

This article explains the DCQCN end‑to‑end congestion control algorithm for RoCEv2, describes its three roles (RP, CP, NP), and provides detailed NS‑3 code implementations for ECN marking, CNP generation, and rate‑adjustment logic, including scheduling and parameter calculations.

Network Intelligence Research Center (NIRC)

RDMA deployments in data centers often use RoCEv2, which relies on Priority‑based Flow Control (PFC) to avoid packet loss, but PFC suffers from head‑of‑line blocking and unfairness. DCQCN is an end‑to‑end congestion control algorithm designed for RoCEv2 to address these issues.

DCQCN builds on QCN and DCTCP, with most functionality residing on the NIC rather than switches or the OS. Its key characteristics are: (1) operation on lossless L3 routed data‑center networks, (2) low CPU overhead on the host, and (3) rapid start‑up when no congestion is present.

The algorithm defines three roles: RP (Reaction Point) – the sender NIC, CP (Congestion Point) – the switch, and NP (Notification Point) – the receiver NIC. The following sections walk through the NS‑3 implementation of each role.

CP Implementation

When a switch egress queue exceeds a threshold, the switch probabilistically marks packets with ECN using RED-style marking at dequeue time. Since most commodity switches already support ECN/RED, no switch modification is required.

// Called when a packet leaves egress queue qIndex of port ifIndex.
void SwitchNode::SwitchNotifyDequeue(uint32_t ifIndex, uint32_t qIndex, Ptr<Packet> p){
    FlowIdTag t;
    p->PeekPacketTag(t); // the tag records the ingress port the packet arrived on
    if (qIndex != 0){ // queue 0 carries control traffic and is never marked
        uint32_t inDev = t.GetFlowId();
        // Release the buffer the packet occupied on the ingress and egress side.
        m_mmu->RemoveFromIngressAdmission(inDev, qIndex, p->GetSize());
        m_mmu->RemoveFromEgressAdmission(ifIndex, qIndex, p->GetSize());
        m_bytes[inDev][ifIndex][qIndex] -= p->GetSize();
        if (m_ecnEnabled){
            bool egressCongested = m_mmu->ShouldSendCN(ifIndex, qIndex);
            if (egressCongested){
                // Rewrite the IPv4 header to set the ECN CE (0x03) codepoint.
                PppHeader ppp;
                Ipv4Header h;
                p->RemoveHeader(ppp);
                p->RemoveHeader(h);
                h.SetEcn((Ipv4Header::EcnType)0x03); // Congestion Experienced
                p->AddHeader(h);
                p->AddHeader(ppp);
            }
        }
    }
    ...
}

The switch decides whether to send a Congestion Notification (CN) based on the egress queue's byte counter:

If the egress bytes exceed kmax, a CN is always sent.

If the bytes fall between kmin and kmax, a CN is sent with probability p = pmax * (bytes - kmin) / (kmax - kmin), drawn against a uniform random variable.

At or below kmin, no CN is sent.

bool SwitchMmu::ShouldSendCN(uint32_t ifindex, uint32_t qIndex){
    if (qIndex == 0) return false;
    if (egress_bytes[ifindex][qIndex] > kmax[ifindex]) return true;
    if (egress_bytes[ifindex][qIndex] > kmin[ifindex]){
        double p = pmax[ifindex] * double(egress_bytes[ifindex][qIndex] - kmin[ifindex]) / (kmax[ifindex] - kmin[ifindex]);
        if (UniformVariable(0, 1).GetValue() < p) return true;
    }
    return false;
}

NP Implementation

When a packet marked with ECN reaches the receiver NIC, it indicates congestion. The NIC converts this information into a Congestion Notification Packet (CNP) and sends it back to the sender. The algorithm may generate a CNP for every ECN‑marked packet or limit generation to one CNP per 50 µs.

int RdmaHw::ReceiveUdp(Ptr<Packet> p, CustomHeader &ch){
    uint8_t ecnbits = ch.GetIpv4EcnBits();
    ...
    int x = ReceiverCheckSeq(ch.udp.seq, rxQp, payload_size);
    if (x == 1 || x == 2){ // generate ACK or NACK
        ...
        if (ecnbits) seqh.SetCnp();
        ...
        Ptr<Packet> newp = Create<Packet>(std::max(60-14-20-(int)seqh.GetSerializedSize(), 0)); // pad so the frame meets a 60-byte minimum (less 14B L2 and 20B IP headers)
        newp->AddHeader(seqh);
        ...
    }
}

The CNP is realized by setting a flag in the ACK/NACK header:

void qbbHeader::SetCnp(){
    flags |= 1 << FLAG_CNP;
}

RP Implementation

The sender NIC reacts to received CNPs. Two events can trigger a rate increase: a byte‑counter‑based event after receiving B bytes, and a timer‑based event after T time units. The three increase strategies, applied in order, are:

Fast recovery – set the current rate to the average of the current and target rates.

Additive increase – raise the target rate by m_rai, capped by the NIC’s maximum data rate.

Hyper increase – raise the target rate by m_rhai, also capped by the NIC’s maximum.

void RdmaHw::FastRecoveryMlx(Ptr<RdmaQueuePair> q){
    q->m_rate = (q->m_rate / 2) + (q->mlx.m_targetRate / 2);
}
void RdmaHw::ActiveIncreaseMlx(Ptr<RdmaQueuePair> q){
    uint32_t nic_idx = GetNicIdxOfQp(q);
    Ptr<QbbNetDevice> dev = m_nic[nic_idx].dev;
    q->mlx.m_targetRate += m_rai;
    if (q->mlx.m_targetRate > dev->GetDataRate())
        q->mlx.m_targetRate = dev->GetDataRate();
    q->m_rate = (q->m_rate / 2) + (q->mlx.m_targetRate / 2);
}
void RdmaHw::HyperIncreaseMlx(Ptr<RdmaQueuePair> q){
    uint32_t nic_idx = GetNicIdxOfQp(q);
    Ptr<QbbNetDevice> dev = m_nic[nic_idx].dev;
    q->mlx.m_targetRate += m_rhai;
    if (q->mlx.m_targetRate > dev->GetDataRate())
        q->mlx.m_targetRate = dev->GetDataRate();
    q->m_rate = (q->m_rate / 2) + (q->mlx.m_targetRate / 2);
}

The rate-decrease check runs periodically, every RATE_REDUCE_MONITOR_PERIOD microseconds; if a CNP arrived during the preceding period, a rate decrease is triggered.

void RdmaHw::cnp_received_mlx(Ptr<RdmaQueuePair> q){
    q->mlx.m_alpha_cnp_arrived = true; // for alpha update
    q->mlx.m_decrease_cnp_arrived = true; // for rate decrease
    if (q->mlx.m_first_cnp){
        q->mlx.m_alpha = 1;
        q->mlx.m_alpha_cnp_arrived = false;
        ScheduleUpdateAlphaMlx(q);
        ScheduleDecreaseRateMlx(q, 1);
        q->mlx.m_targetRate = q->m_rate = m_rateOnFirstCNP * q->m_rate;
        q->mlx.m_first_cnp = false;
    }
}

Alpha, a running estimate of the extent of congestion, is updated periodically as an exponentially weighted moving average of CNP arrivals with gain m_g:

void RdmaHw::UpdateAlphaMlx(Ptr<RdmaQueuePair> q){
    if (q->mlx.m_alpha_cnp_arrived){
        q->mlx.m_alpha = (1 - m_g) * q->mlx.m_alpha + m_g;
    } else {
        q->mlx.m_alpha = (1 - m_g) * q->mlx.m_alpha;
    }
    q->mlx.m_alpha_cnp_arrived = false;
    ScheduleUpdateAlphaMlx(q);
}
void RdmaHw::ScheduleUpdateAlphaMlx(Ptr<RdmaQueuePair> q){
    q->mlx.m_eventUpdateAlpha = Simulator::Schedule(MicroSeconds(m_alpha_resume_interval), &RdmaHw::UpdateAlphaMlx, this, q);
}

When a decrease is needed, the new rate is computed as rate = max(minRate, rate * (1 - alpha/2)), and related state is reset before scheduling the next increase event.

void RdmaHw::CheckRateDecreaseMlx(Ptr<RdmaQueuePair> q){
    ScheduleDecreaseRateMlx(q, 0);
    if (q->mlx.m_decrease_cnp_arrived){
        bool clamp = true;
        if (!m_EcnClampTgtRate){
            if (q->mlx.m_rpTimeStage == 0) clamp = false;
        }
        if (clamp) q->mlx.m_targetRate = q->m_rate;
        q->m_rate = std::max(m_minRate, q->m_rate * (1 - q->mlx.m_alpha / 2));
        q->mlx.m_rpTimeStage = 0;
        q->mlx.m_decrease_cnp_arrived = false;
        Simulator::Cancel(q->mlx.m_rpTimer);
        q->mlx.m_rpTimer = Simulator::Schedule(MicroSeconds(m_rpgTimeReset), &RdmaHw::RateIncEventTimerMlx, this, q);
    }
}


References

Zhu Y., Eran H., Firestone D., et al. Congestion Control for Large-Scale RDMA Deployments. ACM SIGCOMM, 2015. DOI: 10.1145/2785956.2787484.

NVIDIA Enterprise Support Portal – DCQCN CC Algorithm.

GitHub – alibaba-edu/High-Precision-Congestion-Control.

Written by: Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real-world problems, creating top-tier systems, publishing high-impact papers, and contributing to the advancement of China's network technology.
