RDMA Programming Guide
References
Linux rdma-core
Nvidia
-
RDMA Architecture Overview
-
RDMA Programming Overview
-
RDMA Connection Manager (CM) API
-
RDMA Verbs API
RDMAmojo
RDMA Objects
RDMA Address Info
Stores RDMA device address information, similar to socket’s sockaddr
.
rdma_addrinfo
struct rdma_addrinfo {
int ai_flags;
int ai_family;
int ai_qp_type;
int ai_port_space;
socklen_t ai_src_len;
socklen_t ai_dst_len;
struct sockaddr *ai_src_addr;
struct sockaddr *ai_dst_addr;
char *ai_src_canonname;
char *ai_dst_canonname;
size_t ai_route_len;
void *ai_route;
size_t ai_connect_len;
void *ai_connect;
struct rdma_addrinfo *ai_next;
};
Connection Manager Event Channel
rdma_event_channel
Event channels are used to direct all events on an rdma_cm_id
.
For many clients, a single event channel may be sufficient,
however, when managing a large number of connections or `cm_id’s,
users may find it useful to direct events for different `cm_id’s
to different channels for processing.
All created event channels must be destroyed by calling
rdma_destroy_event_channel
. Users should call rdma_get_cm_event
to retrieve events on an event channel.
Each event channel is mapped to a file descriptor. The
associated file descriptor can be used and manipulated like any
other fd
to change its behavior. Users may make the fd
non-
blocking, poll or select the fd
, etc.
struct rdma_event_channel {
int fd;
};
Create a Communication Manager (CM) event channel:
struct rdma_event_channel * rdma_create_event_channel(void);
Connection Manager ID
rdma_cm_id
rdma_cm_id
is conceptually equivalent to a socket for RDMA communication.
The difference is that RDMA communication requires explicitly binding to a
specified RDMA device before communication can occur, and most operations are
asynchronous in nature. Asynchronous communication events on an rdma_cm_id
are
reported through the associated event channel. If the channel parameter is NULL
,
the rdma_cm_id
will be placed into synchronous operation. While operating
synchronously, calls that result in an event will block until the operation
completes. The event will be returned to the user through the rdma_cm_id
structure, and be available for access until another rdma_cm
call is made.
Users must release the rdma_cm_id
by calling rdma_destroy_id
.
struct rdma_cm_id {
struct ibv_context *verbs;
struct rdma_event_channel *channel;
void *context;
struct ibv_qp *qp;
struct rdma_route route;
enum rdma_port_space ps;
uint8_t port_num;
struct rdma_cm_event *event;
struct ibv_comp_channel *send_cq_channel;
struct ibv_cq *send_cq;
struct ibv_comp_channel *recv_cq_channel;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
struct ibv_pd *pd;
enum ibv_qp_type qp_type;
};
Create RDMA connection id:
int rdma_create_id(struct rdma_event_channel *channel,
struct rdma_cm_id **id,
void *context,
enum rdma_port_space ps);
IB Verbs Objects
Protection Domain (PD)
ibv_pd
: High-level container for other objects
Contains the work queues, memory regions, etc. Ensures that work queues can only access memory regions residing in the same protection domain. Applies to both local and remote operations. An incoming request can only access memory that it’s allowed to.
struct ibv_pd {
struct ibv_context *context;
uint32_t handle;
};
-
Allocate PD: https://www.rdmamojo.com/2012/08/24/ibv_alloc_pd/
-
Deallocate PD: https://www.rdmamojo.com/2012/08/31/ibv_dealloc_pd/
I/O Completion Channel (CC)
ibv_comp_channel
: Completion event channel for I/O events
Completion event channel (CC) is an object that helps handling Work Completions in a userspace process using event mode rather than polling mode.
struct ibv_comp_channel {
struct ibv_context *context;
int fd;
int refcnt;
};
Completion Queue (CQ)
ibv_cq
: Queue that receives completion notifications for send and receive
work requests; may be attached to one or more work queues.
Each work queue (WQ) is attached to a CQ. You can have multiple WQs attached to the same CQ if you want.
When an outstanding Work Request, within a Send or Receive Queue, is completed, a Work Completion is being added to the CQ of that Work Queue. This Work Completion indicates that the outstanding Work Request has been completed (and no longer considered outstanding) and provides details on it (status, direction, opcode, etc.).
A single CQ can be shared for sending, receiving, and sharing across multiple QPs. The Work Completion holds the information to specify the QP number and the Queue (Send or Receive) that it came from.
The user can define the minimum size of the CQ. The actual created size can be equal or higher than this value.
struct ibv_cq {
struct ibv_context *context;
struct ibv_comp_channel *channel;
void *cq_context;
uint32_t handle;
int cqe;
pthread_mutex_t mutex;
pthread_cond_t cond;
uint32_t comp_events_completed;
uint32_t async_events_completed;
};
Queue Pairs (QP)
Work Queues for both send and receive work requests.
ibv_qp_init_attr
describes the requested attributes of a newly created QP.
struct ibv_qp_init_attr {
void *qp_context;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
struct ibv_qp_cap cap;
enum ibv_qp_type qp_type;
int sq_sig_all;
};
Here’s a description of ibv_qp_cap
:
struct ibv_qp_cap {
uint32_t max_send_wr;
uint32_t max_recv_wr;
uint32_t max_send_sge;
uint32_t max_recv_sge;
uint32_t max_inline_data;
};
ibv_qp_ex
: Encapsulates a queue for posting receive work requests and a queue
for posting send work requests
NOTE:
Due to evolution of this stack, _ex
version is the extended, more modern
variant of old ibv_qp
API.
Really is a receive queue and a send queue. QP is RDMA jargon for the two directions of a connection.
The user can define the minimum attributes to the QP: number of Work Requests and number of scatter/gather entries per Work Request to Send and Receive queues. The actual attributes can be equal or higher than those values.
struct ibv_qp {
struct ibv_context *context;
void *qp_context;
struct ibv_pd *pd;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
uint32_t handle;
uint32_t qp_num;
enum ibv_qp_state state;
enum ibv_qp_type qp_type;
pthread_mutex_t mutex;
pthread_cond_t cond;
uint32_t events_completed;
};
Memory Regions (MR)
ibv_mr
: Represents a memory buffer that can be targeted by work requests; has
a local key (L_Key
) for use in local work requests and a remote key (R_Key
)
that can be shared with a peer for use in remote one-sided operations.
Simplest form of memory registration. When registered, you can decide whether to allow remote access, like reads and writes.
When you do a registration, you get some keys back, one for local work, and
another for remote work. If remote key, you’ll have to get this R_Key
to the
remote side so it can refer to this memory.
The MR’s starting address is addr
and its size is length
. The maximum size
of the block that can be registered is limited to device_attr.max_mr_size
.
Every memory address in the virtual space of the calling process can be
registered, including, but not limited to:
-
Local memory (either variable or array)
-
Global memory (either variable or array)
-
Dynamically allocated memory (using malloc() or mmap())
-
Shared memory
-
Addresses from the text segment
The registered memory buffer doesn’t have to be page-aligned.
There isn’t any way to know what is the total size of memory that can be registered for a specific device.
struct ibv_mr {
struct ibv_context *context;
struct ibv_pd *pd;
void *addr;
size_t length;
uint32_t handle;
uint32_t lkey;
uint32_t rkey;
};
Register MR: https://www.rdmamojo.com/2012/09/07/ibv_reg_mr/
Exchanging Data via Reliable Connected (RC) QP
Key steps:
-
Register buffers that will be used for communication
-
Create and connect a QP via
librdmacm
-
Post receive work requests
-
Post send work requests
-
Poll for completion of work requests
Examples
Server Setup
Bind to an address:
int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr);
Associates a source address with an rdma_cm_id
. The address may
be wildcarded. If binding to a specific local address, the
rdma_cm_id
will also be bound to a local RDMA device.
Listen for CM events:
int rdma_listen(struct rdma_cm_id *id, int backlog);
Initiates a listen for incoming connection requests or datagram service lookup. The listen will be restricted to the locally bound source address.
Users must have bound the rdma_cm_id
to a local address by
calling rdma_bind_addr
before calling this routine. If the
rdma_cm_id
is bound to a specific IP address, the listen will be
restricted to that address and the associated RDMA device. If
the rdma_cm_id
is bound to an RDMA port number only, the listen
will occur across all RDMA devices.
However, unlike a normal TCP listen, this is a non-blocking call. When a new client is connected, a new connection management (CM) event is generated on the RDMA CM event channel from where the listening id was created. Here we have only one channel, so it is easy.
Block for client connection event:
int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event);
Retrieves a communication event. If no events are pending, by default, the call will block until an event is received.
The default synchronous behavior of this routine can be changed
by modifying the file descriptor associated with the given
channel. All events that are reported must be acknowledged by
calling rdma_ack_cm_event
. Destruction of an rdma_cm_id
will
block until related events have been acknowledged.
Acknowledge CM event:
int rdma_ack_cm_event(struct rdma_cm_event *event);
All events which are allocated by rdma_get_cm_event
must be
released, there should be a one-to-one correspondence between
successful gets and acks. This call frees the event structure
and any memory that it references.
Server Teardown
Destroy CM id:
Destroys the specified rdma_cm_id and cancels any outstanding asynchronous operation.
int rdma_destroy_id(struct rdma_cm_id *id);
Destroy CM event channel:
void rdma_destroy_event_channel(struct rdma_event_channel *channel);
Client Setup
Open a Connection Manager (CM) event channel for asynchronous communication events:
struct rdma_event_channel *cm_event_channel = rdma_create_event_channel();
Create CM id to track communication information:
int ret = rdma_create_id(cm_event_channel, &cm_client_id, NULL, RDMA_PS_TCP);
Set up a sockaddr_in
struct for the server’s RDMA address information,
and optionally one for the client’s RDMA address info.
Use these, cast to struct sockaddr*
, as the src/dst fields to
rdma_resolve_addr
. If successful, the specified rdma_cm_id
will be bound
to a local device.
const char *client_host = "192.168.0.104";
const char *server_host = "192.168.0.106";
int server_port = 20021;
struct sockaddr_in server_sockaddr;
memset(&server_sockaddr, 0, sizeof(server_sockaddr));
server_sockaddr.sin_family = AF_INET;
server_sockaddr.sinaddr.s_addr = inet_addr(server_host);
server_sockaddr.sin_port = htons(server_port);
/* Optional: set up client sockaddr_in information */
struct sockaddr_in client_sockaddr;
memset(&client_sockaddr, 0, sizeof(client_sockaddr));
client_sockaddr.sin_family = AF_INET;
client_sockaddr.sinaddr.s_addr = inet_addr(client_host);
int timeout_ms = 2000;
ret = rdma_resolve_addr(cm_client_id,
(struct sockaddr*)&client_sockaddr,
(struct sockaddr*)&server_sockaddr,
timeout_ms);
Resolve destination and optional source addresses from IP
addresses to an RDMA address. If successful, the specified
rdma_cm_id
will be bound to a local device.
int rdma_resolve_addr (struct rdma_cm_id *id, struct sockaddr *src_addr,
struct sockaddr *dst_addr, int timeout_ms);