Trace of Verb Calls in RDMA Communication in Linux (How does RDMA communication work)
For a university project at TU Dresden I had to work with the RDMA implementation inside Linux. I think that RDMA in Linux is poorly documented and there are only a few good resources. I want to share the findings that I wish I had had at the beginning: a few findings that help to understand how the RDMA stack works and what an RDMA connection actually is.
I can’t guarantee that everything is correct, because RDMA in Linux is not the easiest to understand. 🙂
The RDMA stack in Linux is based on OFED, the OpenFabrics Enterprise Distribution, which provides the kernel modules and userland libraries necessary for RDMA on Linux. An example of a kernel module is rdma_rxe. Each RDMA implementation (Mellanox, SoftRoCE, iWARP) requires a driver pair. The kernel drivers are located in the Linux source tree under /drivers/infiniband. The name is a bit misleading: InfiniBand is one possible interconnect for RDMA (Mellanox uses it), but RDMA can also work over Ethernet (RoCE) or iWARP. All drivers/kernel objects are located inside this directory.
One driver of each driver pair lives in the kernel and the other one in userland. The userland drivers are called providers and are organized inside the rdma-core project (libibverbs).
This is what I wish I could have found somewhere online at the beginning of my project. A (reliable) RDMA connection is a connection between two QPs (queue pairs). A QP is a kernel object and is configured on each machine before the connection starts. To find out more about how data is sent and received, read up on the QP state machine.
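To make the idea of "configuring a QP before the connection starts" more concrete, here is a minimal toy model of the setup path of the QP state machine (RESET, INIT, RTR, RTS). It is only a sketch and not a libibverbs binding: `ToyQP`, `bring_up`, and the queue lists are hypothetical names made up for illustration, and `modify` merely stands in for what `ibv_modify_qp()` does on a real QP. The model is deliberately simplified: it leaves out further states of the real state machine (such as SQD and SQE) and simply rejects work requests in states where a real device would not process them.

```python
from enum import Enum


class QPState(Enum):
    RESET = "RESET"  # state of a freshly created QP
    INIT = "INIT"    # receive buffers may already be posted
    RTR = "RTR"      # Ready To Receive
    RTS = "RTS"      # Ready To Send
    ERR = "ERR"      # error state


# Setup path only; the real state machine has more states and transitions.
NEXT_ON_SETUP_PATH = {
    QPState.RESET: QPState.INIT,
    QPState.INIT: QPState.RTR,
    QPState.RTR: QPState.RTS,
}


class ToyQP:
    """Hypothetical model of one queue pair (assumption, not a real API)."""

    def __init__(self):
        self.state = QPState.RESET
        self.send_queue = []
        self.recv_queue = []

    def modify(self, new_state):
        # Stands in for ibv_modify_qp(): RESET and ERR are reachable from
        # every state, otherwise only the next step on the setup path is.
        allowed = {QPState.RESET, QPState.ERR}
        next_state = NEXT_ON_SETUP_PATH.get(self.state)
        if next_state is not None:
            allowed.add(next_state)
        if new_state not in allowed:
            raise ValueError(
                f"illegal transition {self.state.name} -> {new_state.name}"
            )
        self.state = new_state

    def post_recv(self, buf):
        # Simplification: receive work requests are accepted from INIT on.
        if self.state not in (QPState.INIT, QPState.RTR, QPState.RTS):
            raise RuntimeError(f"cannot post receive in {self.state.name}")
        self.recv_queue.append(buf)

    def post_send(self, msg):
        # Simplification: send work requests are accepted only in RTS.
        if self.state is not QPState.RTS:
            raise RuntimeError(f"cannot post send in {self.state.name}")
        self.send_queue.append(msg)


def bring_up(qp):
    """Walk a QP through the setup path RESET -> INIT -> RTR -> RTS."""
    for state in (QPState.INIT, QPState.RTR, QPState.RTS):
        qp.modify(state)
    return qp


if __name__ == "__main__":
    receiver, sender = ToyQP(), ToyQP()
    receiver.modify(QPState.INIT)
    receiver.post_recv("recv-buffer-0")  # posted before the QP is connected
    receiver.modify(QPState.RTR)
    receiver.modify(QPState.RTS)
    bring_up(sender)
    sender.post_send("hello")
    print(receiver.state.name, sender.state.name)
```

In this model both sides have to walk their own QP through the setup path before any data moves, which mirrors the point above that the QP is configured on each machine before the connection starts.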
What verbs are called in what order when a connection is established and libibverbs is used? (Trace)
I think this is valuable information that everyone should know when starting to work on the RDMA code inside Linux, to understand what’s going on. I used the tool ib_send_bw from linux-rdma/perftest. I obtained the trace from an RXE-to-RXE device RDMA connection. It should look the same for a Mellanox-to-Mellanox device connection etc. The dark parts in the figure show verbs that exist inside libibverbs as well as in the corresponding kernel module. They are called control verbs or slow path verbs. The userland verb does an ioctl syscall to the ib_uverbs kernel module, which forwards the request to the proper verb in the corresponding kernel module. The light parts in the figure show userland-only verbs. They are called data verbs or fast path verbs. On the left you can see the trace of the receiver, on the right the trace of the sender.
How to reproduce what I did
I created two VMs in VirtualBox and connected them via a virtual internal network. On each VM I set up an RXE device. RXE is an RDMA implementation over a regular Ethernet device. It doesn’t have the performance benefit of real RDMA but is useful for development and prototyping.
On each machine:
$ sudo modprobe ib_uverbs
$ sudo modprobe ib_core
$ sudo modprobe rdma_rxe
$ sudo rdma link add rxe0 type rxe netdev <interface>(on Ubuntu e.g.
- clone, build, and install linux-rdma/perftest
On machine A:
$ ib_send_bw -d rxe0
On machine B:
$ ib_send_bw -d rxe0 <IP of other VM; 192.168.1.2 for example>