Boosting Packet Forwarding with DPDK Graph Pipeline: L3FWD Example and ACL Node Performance
This article demonstrates how to use DPDK's Graph API for L3 packet forwarding, introduces a custom ACL node, and presents performance comparisons of different packet‑transfer mechanisms and batch sizes within the graph pipeline.
Example: l3fwd-graph
DPDK provides the sample application l3fwd-graph to illustrate how the Graph API can be used for L3 packet forwarding.
Packet flow in l3fwd-graph
The processing chain consists of the following nodes:
ethdev‑rx: receive packets on port 0 via rte_eth_rx_burst
pkt cls: classify packets (IPv4, IPv6, etc.) and forward unknown types to pkt drop
ipv4 lookup: route lookup based on destination IP, miss goes to pkt drop
ipv4 rewrite: modify headers such as TTL and checksum
ethdev‑tx: transmit packets on port 1, failures go to pkt drop
pkt drop: release packets
Overall, packets are received on port 0, processed through the node chain, and transmitted on port 1.
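Conceptually, each node is a callback that decides the next edge for every packet, and rte_graph_walk drives bursts through the chain. The flow above can be modeled with a minimal, self-contained sketch (plain C, no DPDK; the struct fields and function names are illustrative):

```c
#include <stddef.h>

/* Toy packet: only the fields the nodes inspect. */
struct pkt { int is_ipv4; int route_ok; unsigned char ttl; int dropped; };

enum node_id { ETHDEV_RX, PKT_CLS, IPV4_LOOKUP, IPV4_REWRITE, ETHDEV_TX, PKT_DROP, DONE };

/* One node's decision, mirroring the l3fwd-graph chain above. */
static enum node_id step(enum node_id cur, struct pkt *p)
{
	switch (cur) {
	case ETHDEV_RX:    return PKT_CLS;
	case PKT_CLS:      return p->is_ipv4 ? IPV4_LOOKUP : PKT_DROP;
	case IPV4_LOOKUP:  return p->route_ok ? IPV4_REWRITE : PKT_DROP;
	case IPV4_REWRITE: p->ttl--; return ETHDEV_TX; /* TTL/checksum rewrite */
	case ETHDEV_TX:    return DONE;                /* out on port 1 */
	case PKT_DROP:     p->dropped = 1; return DONE;
	default:           return DONE;
	}
}

/* Walk one packet from rx to completion, the way rte_graph_walk
 * drives whole bursts through the real nodes. */
static void walk_one(struct pkt *p)
{
	enum node_id n = ETHDEV_RX;
	while (n != DONE)
		n = step(n, p);
}
```

A real node processes a whole burst at once and enqueues packets to per-edge streams; this per-packet walk only illustrates the routing decisions.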
ACL node design
An ACL node is added to demonstrate custom node creation. The node implements a simple rule set (accept or drop) based on five‑tuple matching.
```c
struct {
	char mapped[NB_SOCKETS];
	struct rte_acl_ctx *acx_ipv4[NB_SOCKETS];
} acl_config;

static uint16_t pkt_acl_node_process(struct rte_graph *graph, struct rte_node *node,
				     void **objs, uint16_t nb_objs);

static struct rte_acl_ctx *setup_acl(struct rte_acl_rule *route_base,
				     struct rte_acl_rule *acl_base, unsigned int route_num,
				     unsigned int acl_num, int socketid);

int rte_node_acl_rules_setup(const char *rule_path, int numa_on,
			     uint32_t enabled_port_mask);

static int pkt_acl_node_init(const struct rte_graph *graph, struct rte_node *node);

struct rte_node_register pkt_acl_node = {
	.process = pkt_acl_node_process,
	.name = "pkt_acl",
	.init = pkt_acl_node_init,
	.nb_edges = PKT_ACL_NEXT_MAX,
	.next_nodes = {
		[PKT_ACL_NEXT_PKT_CLS] = "pkt_cls",
		[PKT_ACL_NEXT_PKT_DROP] = "pkt_drop",
	},
};
RTE_NODE_REGISTER(pkt_acl_node);
```

The .process function classifies packets with rte_acl_classify and enqueues them to either pkt_cls or pkt_drop based on the ACL result.
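The accept/drop branch inside the process function boils down to interpreting the classification result. A self-contained sketch of that decision, assuming deny rules mark their result with a high ACL_DENY_SIGNATURE bit (the bit value here is illustrative) and that a result of 0 means no rule matched:

```c
#include <stdint.h>

/* Illustrative encoding: deny rules set the top bit of their userdata. */
#define ACL_DENY_SIGNATURE 0x80000000u

enum pkt_acl_next { PKT_ACL_NEXT_PKT_CLS, PKT_ACL_NEXT_PKT_DROP };

/* Map one classification result to a next edge: a result of 0 ("no rule
 * matched") is dropped, as is any match carrying the deny bit. */
static enum pkt_acl_next acl_next_edge(uint32_t acl_res)
{
	if ((acl_res & ACL_DENY_SIGNATURE) == 0 && acl_res != 0)
		return PKT_ACL_NEXT_PKT_CLS;   /* accept: continue to classifier */
	return PKT_ACL_NEXT_PKT_DROP;          /* deny or no match: drop */
}
```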
```c
rte_acl_classify(acl_config.acx_ipv4[socketid], acl_search.data_ipv4,
		 acl_search.res_ipv4, acl_search.num_ipv4, DEFAULT_MAX_CATEGORIES);
for (i = 0; i < acl_search.num_ipv4; i++) {
	pkt = acl_search.m_ipv4[i];
	acl_res = acl_search.res_ipv4[i];
	/* Forward only packets that matched a rule without the deny bit. */
	if (likely((acl_res & ACL_DENY_SIGNATURE) == 0 && acl_res != 0))
		rte_node_enqueue_x1(graph, node, PKT_ACL_NEXT_PKT_CLS, pkt);
	else
		rte_node_enqueue_x1(graph, node, PKT_ACL_NEXT_PKT_DROP, pkt);
}
```

Performance testing
The experiment compares three packet‑transfer mechanisms—pointer swap, memory copy, and pointer assignment—and evaluates the impact of batch splitting on throughput.
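Before the results, it helps to pin down what the three mechanisms do with a burst of packet pointers. A minimal, self-contained sketch in plain C (names are illustrative; in DPDK, the swap corresponds to what rte_node_next_stream_move does):

```c
#include <string.h>

struct stream { void **objs; unsigned int count; };

/* Pointer swap: exchange the buffers themselves, O(1) per burst. */
static void xfer_swap(struct stream *src, struct stream *dst)
{
	void **tmp = dst->objs;
	dst->objs = src->objs;
	dst->count = src->count;
	src->objs = tmp;
	src->count = 0;
}

/* Memory copy: bulk-copy the pointer array into the next node's buffer. */
static void xfer_memcpy(struct stream *src, struct stream *dst)
{
	memcpy(dst->objs, src->objs, src->count * sizeof(void *));
	dst->count = src->count;
	src->count = 0;
}

/* Pointer assignment: move the pointers one at a time. */
static void xfer_assign(struct stream *src, struct stream *dst)
{
	for (unsigned int i = 0; i < src->count; i++)
		dst->objs[i] = src->objs[i];
	dst->count = src->count;
	src->count = 0;
}
```

The swap costs the same regardless of burst size, while the other two scale with the number of packets; the copy can move several pointers per instruction where assignment moves one at a time.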
Test environment
Intel(R) Xeon(R) Platinum 8336C CPU @ 2.30 GHz
128 cores (1 core bound to the graph program)
54 MB LLC cache
Node design
All nodes are kept minimal. The default node size is 256 packets per graph walk.
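Judging from the node code in this test, the fan-out lives in the node context: ctx[0] holds the number of next edges, ctx[1..8] the edge ids, and ctx[9..16] the percentage of the burst sent to each edge. The per-edge share then works out as in this small sketch (the helper name is illustrative; RTE_GRAPH_BURST_SIZE defaults to 256 in DPDK):

```c
#include <stdint.h>

#define RTE_GRAPH_BURST_SIZE 256  /* DPDK's default burst per graph walk */

/* Share of a burst sent to one next edge, mirroring the
 * (ctx[i + 9] * nb_objs) / 100 arithmetic in the test nodes. */
static uint16_t edge_batch(uint16_t pct, uint16_t nb_objs)
{
	return (uint16_t)((pct * nb_objs) / 100);
}
```

Since the worker copies 8 pointers at a time, the percentages are presumably chosen so that each share is a multiple of 8.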
Source node
```c
static uint16_t test_perf_node_source(struct rte_graph *graph, struct rte_node *node,
				      void **objs, uint16_t nb_objs)
{
	uint16_t count;
	int i;

	RTE_SET_USED(objs);
	RTE_SET_USED(nb_objs);
	/* ctx[0]: number of next edges; ctx[i + 1]: edge id;
	 * ctx[i + 9]: percentage of the burst sent to that edge. */
	for (i = 0; i < node->ctx[0]; i++) {
		count = (node->ctx[i + 9] * RTE_GRAPH_BURST_SIZE) / 100;
		rte_node_next_stream_get(graph, node, node->ctx[i + 1], count);
		rte_node_next_stream_put(graph, node, node->ctx[i + 1], count);
	}
	return RTE_GRAPH_BURST_SIZE;
}
```

Worker node
```c
static uint16_t test_perf_node_worker(struct rte_graph *graph, struct rte_node *node,
				      void **objs, uint16_t nb_objs)
{
	uint16_t next = 0;
	uint16_t count;
	void **to_next;
	int i;

	/* Single next edge: hand the whole stream over with a pointer swap. */
	if (node->ctx[0] == 1) {
		rte_node_next_stream_move(graph, node, node->ctx[1]);
		return nb_objs;
	}
	/* Multiple next edges: copy each edge's share, 8 pointers at a time. */
	for (i = 0; i < node->ctx[0]; i++) {
		next = node->ctx[i + 1];
		count = (node->ctx[i + 9] * nb_objs) / 100;
		to_next = rte_node_next_stream_get(graph, node, next, nb_objs);
		while (count) {
			rte_memcpy(to_next, objs, 8 * sizeof(objs[0]));
			to_next += 8;
			objs += 8;
			count -= 8;
			rte_node_next_stream_put(graph, node, next, 8);
		}
	}
	return nb_objs;
}
```

Destination node
```c
static uint16_t test_perf_node_sink(struct rte_graph *graph, struct rte_node *node,
				    void **objs, uint16_t nb_objs)
{
	return nb_objs;
}
```

Results
Images below illustrate the throughput of each transfer method and the effect of splitting batches.
Pointer swap yields the highest throughput, followed by memory copy and then pointer assignment; memory copy's advantage over assignment varies with the packet count.
Batch splitting scales predictably: each doubling of the number of batches costs a comparable percentage of throughput, indicating that per-batch overhead is roughly constant and that the graph pipeline behaves well as work is divided across more, smaller batches.
Conclusion
The DPDK Graph Pipeline provides a flexible way to build packet‑processing graphs. Adding a custom ACL node demonstrates modularity, and performance tests reveal that pointer swap is the most efficient packet‑transfer mechanism, while memory copy remains a solid alternative. The pipeline also scales predictably when the workload is divided into multiple batches, making it advantageous over traditional linear pipelines.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.