Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs
RoCE, a high-performance implementation of RDMA over Ethernet networks, offloads flow control and congestion control to the NIC hardware. These algorithms assume a lossless network so that they stay simple enough to implement in hardware; thus, we should do our best to prevent packet loss and keep the network lossless. DCQCN (congestion control) + PFC (flow control) is a common choice in many data centers. We observed severe performance fluctuations in our system when they were disabled.
(Updated on Mar 31, 2024) You may not need PFC for modern NICs (ConnectX-6 or newer), which support NVIDIA RTTCC congestion control that doesn't rely on PFC and ECN.
Discussion
L2 PFC vs. L3 PFC
Nowadays it is recommended to set up L3 PFC rather than L2 PFC, probably because L3 PFC is easier to configure: the DSCP tag lives in the IP header, so it does not require VLAN tagging the way 802.1p (L2 PFC) does.
Note: 802.1p / `pcp` refers to L2 PFC; `dscp` refers to L3 PFC.
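For ConnectX NICs, the choice between the two is just the NIC's "trust" mode. A minimal sketch with the MLNX_OFED `mlnx_qos` tool (using the `$IF_NAME` variable defined in the Steps section below):

```
# Show the current QoS state, including the priority trust mode
sudo mlnx_qos -i $IF_NAME

# Trust the PCP bits in the VLAN tag (L2 PFC)
sudo mlnx_qos -i $IF_NAME --trust pcp

# Trust the DSCP bits in the IP header (L3 PFC, used in this post)
sudo mlnx_qos -i $IF_NAME --trust dscp
```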
Terminology
- Type of Service (ToS): A value to distinguish applications. RoCE is one of the applications.
- DSCP / dot1p Tag: A value embedded in the packet header.
- Buffer: Receive buffer (SRAM) on NICs. Can be partitioned into multiple regions.
- Traffic Class (TC): An intermediate value between PFC Priority and Queue ID.
- Queue: Send queue on NICs.
- Note: Service Level is a concept in InfiniBand networks, not related to RoCE networks.
[Diagram: how ToS / DSCP values map to PFC Priority, Traffic Class, and Queue on the NIC]
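To see the first part of this chain on a real NIC, `mlnx_qos` can print and edit the DSCP-to-Priority table. A hedged sketch (the `--dscp2prio` syntax may differ across MLNX_OFED versions, and `$IF_NAME` is defined in the Steps section below):

```
# Print the dscp2prio table, the Priority-to-TC mapping, and per-TC settings
sudo mlnx_qos -i $IF_NAME

# Example: map DSCP 26 to PFC Priority 3 (the values used later in this post)
sudo mlnx_qos -i $IF_NAME --dscp2prio set,26,3
```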
Prerequisite
If network traffic goes through network switches, ensure L3 PFC and ECN are enabled on the switches. We omit the detailed switch configuration steps here since they depend on the particular switch SKU and OS.
Note: we can set up a direct (back-to-back) connection between two NICs to test whether our NIC configuration works.
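For example, with two hosts cabled back to back, a plain RDMA bandwidth run exercises the NIC configuration without any switch involved. A sketch using the `perftest` suite; `mlx5_1` is the device name identified in the Steps section, and `192.168.0.10` is a placeholder for the server host's address:

```
# Host A: start the bandwidth test server on our RDMA device
ib_write_bw -d mlx5_1

# Host B: connect to host A and run the test
ib_write_bw -d mlx5_1 192.168.0.10
```

While the test runs, the per-priority counters shown in the Verification section should increase for the priority RoCE traffic is mapped to.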
Steps
Enable DCQCN
DCQCN is enabled by default on Mellanox ConnectX NICs, but we need to ensure ECN marking is enabled on all network switches.
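If you want to double-check (or re-enable) DCQCN per priority, MLNX_OFED exposes its state through sysfs. A hedged sketch, assuming these paths exist in your driver version and using the `$IF_NAME` variable defined in the next step:

```
# 1 = DCQCN reaction point (sender side) enabled on Priority 3
cat /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3

# 1 = DCQCN notification point (receiver side) enabled on Priority 3
cat /sys/class/net/$IF_NAME/ecn/roce_np/enable/3

# Re-enable both roles on Priority 3 if needed
echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_rp/enable/3
echo 1 | sudo tee /sys/class/net/$IF_NAME/ecn/roce_np/enable/3
```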
Identify Interface Name and Device Name
We can check the interface and device name with `show_gids`, which should output something like the following.

```
DEV     PORT    INDEX   GID     IPv4    VER     DEV
```
Here `mlx5_1` is the device name, which is usually used to refer to the RDMA device, and `enp216s0f1np1` is the interface name, which Linux manages as an ordinary Ethernet interface. We can save both as environment variables.

```
export IF_NAME=enp216s0f1np1
export DEV_NAME=mlx5_1
```
Tune PFC Headroom Size
We should reserve enough buffer space to absorb in-flight packets, since senders (or requestors) take time to respond to PFC pause frames. That latency is affected by the cable length (i.e., the propagation delay). Mellanox requires us to set the cable length manually, and the NIC then calculates the correct headroom size automatically.
Fortunately, the cable length is recorded in the transceiver’s EEPROM.
```
sudo mlxlink -d $DEV_NAME -m -c -e
```
From the output, we can find the cable length in the "Module Info" section.

```
Module Info
```
Then, apply this parameter to our QoS setting.
```
sudo mlnx_qos -i $IF_NAME --cable_len=50
```
Enable L3 PFC
Execute the following commands to activate PFC and apply the PFC setting to RoCE traffic. Note: the configuration is NOT persistent; you may need to re-run all of these commands to re-enable PFC each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.
(Updated on Mar 31, 2024) You may need to use a different PFC priority or DSCP value depending on the configuration/restriction of network switches. Conventionally, we use Priority 3 for lossless networks.
```
# use L3 PFC, default=pcp (L2 PFC)
sudo mlnx_qos -i $IF_NAME --trust dscp
```
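The remaining commands of a typical DSCP-based setup are sketched below: enable PFC only on Priority 3 and make RoCE traffic carry DSCP 26 (ToS 106). They follow NVIDIA's DSCP-based lossless RoCE guide (see References); the `tc/1` index and the `cma_roce_tos` helper are assumptions that may vary with your OFED version:

```
# Enable PFC only on Priority 3 (one flag per priority 0-7)
sudo mlnx_qos -i $IF_NAME --pfc 0,0,0,1,0,0,0,0

# Default ToS for kernel RoCE traffic: 106 = DSCP 26 + ECN-capable bits
echo 106 | sudo tee /sys/class/infiniband/$DEV_NAME/tc/1/traffic_class

# Default ToS for RDMA-CM connections
sudo cma_roce_tos -d $DEV_NAME -t 106
```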
Verification
Show PFC Setting
```
sudo mlnx_qos -i $IF_NAME
```
```
DCBX mode: OS controlled
```

In the output, confirm that the priority trust state is dscp and that PFC is enabled for Priority 3 only.
Check DCQCN is functioning
```
# Check DCQCN is enabled on Prio 3
cat /sys/class/net/$IF_NAME/ecn/roce_np/enable/3
```
Note: Two cases triggering NP to send CNP packets:
- NP’s NIC receives a packet with an ECN mark (marked by the switch indicating the switch’s buffer is about to be out of capacity).
- NP’s NIC receives an out-of-order packet (packet loss occurred).
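A hedged way to watch this exchange is through the mlx5 `hw_counters` sysfs entries (see the counters article in References); the counter names below assume a recent mlx5 driver:

```
# CNPs sent by this NIC acting as NP (receiver of ECN-marked packets)
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_cnp_sent

# CNPs received and handled by this NIC acting as RP (sender being throttled)
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/rp_cnp_handled

# RoCE packets that arrived with the ECN CE mark
cat /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/np_ecn_marked_roce_packets
```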
Check PFC is functioning
```
ethtool -S $IF_NAME | grep prio3
```

```
rx_prio3_bytes: 462536457742
```
Note: `tx_prio3_pause` refers to the number of PFC pause frames sent from this NIC when the server cannot absorb the incoming traffic quickly enough.
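To watch PFC behavior under load, it can help to refresh these counters while a benchmark runs; a small sketch:

```
# Refresh the Priority 3 PFC pause counters every second
watch -n 1 "ethtool -S $IF_NAME | grep prio3_pause"
```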
Miscellaneous
Advanced QoS Settings
The default values of the parameters below should be fine…
```
# mapping Priority to TC
sudo mlnx_qos -i $IF_NAME --prio_tc=0,1,2,3,4,5,6,7  # example: identity mapping
```
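If you do need to change them, the receive-buffer partitioning can also be adjusted with `mlnx_qos`; a hedged sketch (the option names follow MLNX_OFED's `mlnx_qos`, and the values are placeholders, not recommendations):

```
# mapping Priority to receive buffer (example: isolate Priority 3 in buffer 1)
sudo mlnx_qos -i $IF_NAME --prio2buffer=0,0,0,1,0,0,0,0

# per-buffer sizes in bytes (placeholder values)
sudo mlnx_qos -i $IF_NAME --buffer_size=131072,131072,0,0,0,0,0,0
```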
Performance Counters on NICs
Refer to this: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters.
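In practice, two locations cover most debugging needs; a short sketch assuming the standard mlx5 layout:

```
# Ethernet-level counters, including per-priority PFC statistics
ethtool -S $IF_NAME

# RoCE / congestion-control hardware counters (CNPs, ECN marks, ...)
ls /sys/class/infiniband/$DEV_NAME/ports/1/hw_counters/
```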
Failed to map RoCE traffic to specified PFC Priority
- Consider upgrading OFED drivers.
- `ibv_qp_attr.ah_attr.grh.traffic_class` may override the default ToS value; see the counter check sketched below to confirm which priority the traffic actually uses.
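A hedged way to confirm the mapping: run an RDMA benchmark and watch the per-priority byte counters; only the counters of the intended priority (3 in this post) should grow.

```
# While RDMA traffic is running, watch which per-priority byte counters grow
watch -n 1 "ethtool -S $IF_NAME | grep -E 'prio[0-7]_bytes'"
```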
Configuration for BlueField DPU
QoS settings can be set on the host using `mlnx_qos`, but Traffic Class values must be set individually on the DPU side.
References
- https://baiwei0427.github.io/papers/lumina-sigcomm2023.pdf
- Many thanks to Wei Bai for helping us configure our networks!
- https://docs.nvidia.com/networking/pages/viewpage.action?pageId=39264632
- https://enterprise-support.nvidia.com/s/article/lossless-roce-configuration-for-linux-drivers-in-dscp-based-qos-mode
- https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
- https://support.huawei.com/enterprise/zh/doc/EDOC1100197616/3dfff4ec
- https://www.rdmamojo.com/2013/01/12/ibv_modify_qp/
- https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf