RDMA Programming Guide

RDMA Objects

RDMA Address Info

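Stores resolved RDMA address information for a connection endpoint. It plays the role that struct addrinfo plays for sockets, and is the structure returned by rdma_getaddrinfo.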

rdma_addrinfo

struct rdma_addrinfo {
	int			ai_flags;
	int			ai_family;
	int			ai_qp_type;
	int			ai_port_space;
	socklen_t		ai_src_len;
	socklen_t		ai_dst_len;
	struct sockaddr		*ai_src_addr;
	struct sockaddr		*ai_dst_addr;
	char			*ai_src_canonname;
	char			*ai_dst_canonname;
	size_t			ai_route_len;
	void			*ai_route;
	size_t			ai_connect_len;
	void			*ai_connect;
	struct rdma_addrinfo	*ai_next;
};

Connection Manager Event Channel

rdma_event_channel

Event channels are used to direct all events on an rdma_cm_id. For many clients, a single event channel is sufficient. When managing a large number of connections or cm_ids, however, users may find it useful to direct events for different cm_ids to different channels for processing.

All created event channels must be destroyed by calling rdma_destroy_event_channel. Users should call rdma_get_cm_event to retrieve events on an event channel.

Each event channel is mapped to a file descriptor, which can be used and manipulated like any other fd to change its behavior. For example, users may make the fd non-blocking, or wait on it with poll or select.
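As an illustration of manipulating the channel's file descriptor, here is a minimal sketch that uses only POSIX calls. The helper names set_nonblocking and wait_readable are hypothetical and are not part of librdmacm; the fd passed in is assumed to be the fd member of an rdma_event_channel.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>

/* Hypothetical helper: put fd into non-blocking mode while preserving
 * its other status flags. Returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or hang-up.
 * Note: a retry after EINTR restarts the full timeout. */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc == -1 && errno == EINTR);

    if (rc <= 0)
        return rc;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

With helpers like these, a caller could wait on channel->fd with a timeout and call rdma_get_cm_event only once wait_readable reports 1, instead of blocking indefinitely inside the call.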

struct rdma_event_channel {
	int fd;
};

Create a Communication Manager (CM) event channel:

struct rdma_event_channel * rdma_create_event_channel(void);

Connection Manager ID

rdma_cm_id

rdma_cm_id is conceptually equivalent to a socket for RDMA communication. The difference is that RDMA communication requires explicitly binding to a specified RDMA device before communication can occur, and most operations are asynchronous in nature. Asynchronous communication events on an rdma_cm_id are reported through the associated event channel. If the channel parameter passed to rdma_create_id is NULL, the rdma_cm_id is placed into synchronous operation. While operating synchronously, calls that result in an event block until the operation completes. The event is returned to the user through the rdma_cm_id structure and remains available until another rdma_cm call is made. Users must release the rdma_cm_id by calling rdma_destroy_id.

struct rdma_cm_id {
	struct ibv_context		*verbs;
	struct rdma_event_channel 	*channel;
	void				*context;
	struct ibv_qp			*qp;
	struct rdma_route	 	route;
	enum rdma_port_space	 	ps;
	uint8_t			 	port_num;
	struct rdma_cm_event		*event;
	struct ibv_comp_channel 	*send_cq_channel;
	struct ibv_cq			*send_cq;
	struct ibv_comp_channel 	*recv_cq_channel;
	struct ibv_cq			*recv_cq;
	struct ibv_srq			*srq;
	struct ibv_pd			*pd;
	enum ibv_qp_type		qp_type;
};

Create RDMA connection id:

int rdma_create_id(struct rdma_event_channel *channel,
		   struct rdma_cm_id **id,
		   void *context,
		   enum rdma_port_space ps);

Connection Manager Event

rdma_cm_event

struct rdma_cm_event {
	struct rdma_cm_id	*id;
	struct rdma_cm_id	*listen_id;
	enum rdma_cm_event_type	event;
	int			status;
	union {
		struct rdma_conn_param conn;
		struct rdma_ud_param   ud;
	} param;
};

IB Verbs Objects

Protection Domain (PD)

ibv_pd: High-level container for other objects

A PD contains the work queues, memory regions, and related objects. It ensures that work queues can only access memory regions residing in the same protection domain. This applies to both local and remote operations: an incoming request can only access memory that it has been granted access to.

struct ibv_pd {
	struct ibv_context     *context;
	uint32_t		handle;
};

I/O Completion Channel (CC)

ibv_comp_channel: Completion event channel for I/O events

A completion event channel (CC) is an object that helps a userspace process handle Work Completions in event mode rather than polling mode.

struct ibv_comp_channel {
	struct ibv_context     *context;
	int			fd;
	int			refcnt;
};

Completion Queue (CQ)

ibv_cq: Queue that receives completion notifications for send and receive work requests; may be attached to one or more work queues.

Each work queue (WQ) is attached to a CQ. You can have multiple WQs attached to the same CQ if you want.

When an outstanding Work Request in a Send or Receive Queue completes, a Work Completion is added to the CQ of that Work Queue. The Work Completion indicates that the Work Request is no longer outstanding and provides details about it (status, direction, opcode, etc.).

A single CQ can be shared between send and receive queues, and across multiple QPs. Each Work Completion identifies the QP number and the queue (Send or Receive) that it came from.

The user can define the minimum size of the CQ. The actual created size can be equal to or greater than this value.

struct ibv_cq {
	struct ibv_context     *context;
	struct ibv_comp_channel *channel;
	void		       *cq_context;
	uint32_t		handle;
	int			cqe;

	pthread_mutex_t		mutex;
	pthread_cond_t		cond;
	uint32_t		comp_events_completed;
	uint32_t		async_events_completed;
};

Queue Pairs (QP)

Work Queues for both send and receive work requests.

ibv_qp_init_attr describes the requested attributes of a newly created QP.

struct ibv_qp_init_attr {
	void		       *qp_context;
	struct ibv_cq	       *send_cq;
	struct ibv_cq	       *recv_cq;
	struct ibv_srq	       *srq;
	struct ibv_qp_cap	cap;
	enum ibv_qp_type	qp_type;
	int			sq_sig_all;
};

ibv_qp_cap describes the requested capacities of the QP:

struct ibv_qp_cap {
	uint32_t		max_send_wr;
	uint32_t		max_recv_wr;
	uint32_t		max_send_sge;
	uint32_t		max_recv_sge;
	uint32_t		max_inline_data;
};

ibv_qp_ex: Encapsulates a queue for posting receive work requests and a queue for posting send work requests

NOTE: As this stack has evolved, ibv_qp_ex has been added as the extended, more recent variant of the older ibv_qp API. The structure shown below is the base ibv_qp.

A QP is really just a receive queue and a send queue; the term is RDMA jargon for the two directions of a connection.

The user can specify minimum attributes for the QP: the number of Work Requests and the number of scatter/gather entries per Work Request for the Send and Receive queues. The actual attributes can be equal to or greater than those values.

struct ibv_qp {
	struct ibv_context     *context;
	void		       *qp_context;
	struct ibv_pd	       *pd;
	struct ibv_cq	       *send_cq;
	struct ibv_cq	       *recv_cq;
	struct ibv_srq	       *srq;
	uint32_t		handle;
	uint32_t		qp_num;
	enum ibv_qp_state       state;
	enum ibv_qp_type	qp_type;

	pthread_mutex_t		mutex;
	pthread_cond_t		cond;
	uint32_t		events_completed;
};

Memory Regions (MR)

ibv_mr: Represents a memory buffer that can be targeted by work requests; has a local key (L_Key) for use in local work requests and a remote key (R_Key) that can be shared with a peer for use in remote one-sided operations.

This is the simplest form of memory registration. At registration time, you decide whether to allow remote access, such as reads and writes.

When you register memory, you get two keys back: one for local work requests (L_Key) and one for remote access (R_Key). If the memory is to be accessed remotely, you must send the R_Key to the remote side so that it can refer to this memory.
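How the R_Key reaches the peer is left to the application. The following is a minimal sketch of one possible approach, not something prescribed by the verbs API: a small fixed-size message that carries the buffer address, length, and rkey in big-endian byte order. The struct mr_advert, its 16-byte layout, and the pack/unpack helpers are all hypothetical, and the 32-bit length field assumes buffers smaller than 4 GiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical wire format: 8-byte address, 4-byte length, 4-byte rkey,
 * all big-endian. Not defined by libibverbs or librdmacm. */
#define MR_ADVERT_WIRE_SIZE 16

struct mr_advert {
    uint64_t addr;    /* virtual address of the registered buffer (mr->addr) */
    uint32_t length;  /* buffer length in bytes; assumes < 4 GiB */
    uint32_t rkey;    /* remote key of the registration (mr->rkey) */
};

/* Write the low nbytes of v to dst, most significant byte first. */
static void put_be(uint8_t *dst, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

/* Read nbytes from src as a big-endian unsigned integer. */
static uint64_t get_be(const uint8_t *src, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | src[i];
    return v;
}

/* Serialize an advertisement into a 16-byte buffer. */
void mr_advert_pack(const struct mr_advert *a, uint8_t out[MR_ADVERT_WIRE_SIZE])
{
    assert(a != NULL && out != NULL);
    put_be(out, a->addr, 8);
    put_be(out + 8, a->length, 4);
    put_be(out + 12, a->rkey, 4);
}

/* Parse a 16-byte buffer received from the peer. */
void mr_advert_unpack(const uint8_t in[MR_ADVERT_WIRE_SIZE], struct mr_advert *a)
{
    assert(in != NULL && a != NULL);
    a->addr = get_be(in, 8);
    a->length = (uint32_t)get_be(in + 8, 4);
    a->rkey = (uint32_t)get_be(in + 12, 4);
}
```

The packed bytes could be carried in an ordinary send/receive exchange once the connection is up; the receiver would then use the unpacked address and rkey as the target of its one-sided operations.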

The MR’s starting address is addr and its size is length. The maximum size of the block that can be registered is limited to device_attr.max_mr_size. Every memory address in the virtual space of the calling process can be registered, including, but not limited to:

  • Local memory (either variable or array)

  • Global memory (either variable or array)

  • Dynamically allocated memory (using malloc() or mmap())

  • Shared memory

  • Addresses from the text segment

The registered memory buffer doesn’t have to be page-aligned.

There is no way to query the total amount of memory that can be registered for a specific device.

struct ibv_mr {
	struct ibv_context     *context;
	struct ibv_pd	       *pd;
	void		       *addr;
	size_t			length;
	uint32_t		handle;
	uint32_t		lkey;
	uint32_t		rkey;
};

Exchanging Data via Reliable Connected (RC) QP

Key steps:

  1. Register buffers that will be used for communication

  2. Create and connect a QP via librdmacm

  3. Post receive work requests

  4. Post send work requests

  5. Poll for completion of work requests

Examples

Server Setup

Bind to an address:

int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr);

Associates a source address with an rdma_cm_id. The address may be wildcarded. If binding to a specific local address, the rdma_cm_id will also be bound to a local RDMA device.

Listen for CM events:

int rdma_listen(struct rdma_cm_id *id, int backlog);

Initiates a listen for incoming connection requests or datagram service lookup. The listen will be restricted to the locally bound source address.

Users must have bound the rdma_cm_id to a local address by calling rdma_bind_addr before calling this routine. If the rdma_cm_id is bound to a specific IP address, the listen will be restricted to that address and the associated RDMA device. If the rdma_cm_id is bound to an RDMA port number only, the listen will occur across all RDMA devices.

Unlike a blocking TCP accept, however, this call returns immediately. When a client connects, a connection management (CM) event is generated on the RDMA CM event channel with which the listening id was created. This example uses only one channel, so all events arrive there.

Block for client connection event:

int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event);

Retrieves a communication event. If no events are pending, by default, the call will block until an event is received.

The default synchronous behavior of this routine can be changed by modifying the file descriptor associated with the given channel. All events that are reported must be acknowledged by calling rdma_ack_cm_event. Destruction of an rdma_cm_id will block until related events have been acknowledged.

Acknowledge CM event:

int rdma_ack_cm_event(struct rdma_cm_event *event);

All events allocated by rdma_get_cm_event must be released; there should be a one-to-one correspondence between successful gets and acks. This call frees the event structure and any memory that it references.

Server Teardown

Destroy CM id:

Destroys the specified rdma_cm_id and cancels any outstanding asynchronous operation.

int rdma_destroy_id(struct rdma_cm_id *id);

Destroy CM event channel:

void rdma_destroy_event_channel(struct rdma_event_channel *channel);

Client Setup

Open a Connection Manager (CM) event channel for asynchronous communication events:

struct rdma_event_channel *cm_event_channel = rdma_create_event_channel();

Create CM id to track communication information:

struct rdma_cm_id *cm_client_id = NULL;
int ret = rdma_create_id(cm_event_channel, &cm_client_id, NULL, RDMA_PS_TCP);

Set up a sockaddr_in struct for the server’s address and, optionally, one for the client’s address. Pass these, cast to struct sockaddr *, as the src_addr and dst_addr arguments of rdma_resolve_addr.

const char *client_host = "192.168.0.104";
const char *server_host = "192.168.0.106";
int server_port = 20021;

struct sockaddr_in server_sockaddr;
memset(&server_sockaddr, 0, sizeof(server_sockaddr));
server_sockaddr.sin_family = AF_INET;
server_sockaddr.sin_addr.s_addr = inet_addr(server_host);
server_sockaddr.sin_port = htons(server_port);

/* Optional: set up client sockaddr_in information */
struct sockaddr_in client_sockaddr;
memset(&client_sockaddr, 0, sizeof(client_sockaddr));
client_sockaddr.sin_family = AF_INET;
client_sockaddr.sin_addr.s_addr = inet_addr(client_host);

int timeout_ms = 2000;
ret = rdma_resolve_addr(cm_client_id,
			(struct sockaddr*)&client_sockaddr,
			(struct sockaddr*)&server_sockaddr,
			timeout_ms);

Resolve destination and optional source addresses from IP addresses to an RDMA address. If successful, the specified rdma_cm_id will be bound to a local device.

int rdma_resolve_addr (struct rdma_cm_id *id, struct sockaddr *src_addr,
		       struct sockaddr *dst_addr, int timeout_ms);
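The client address setup uses inet_addr, which reports failure with a value that is indistinguishable from the address 255.255.255.255. As a possible alternative, the sketch below fills a sockaddr_in with inet_pton, which returns a separate error indication. The helper name make_ipv4_sockaddr is hypothetical and handles IPv4 only.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fill *sa with an IPv4 address and port.
 * host must be a dotted-quad string; port is in host byte order.
 * Returns 0 on success, -1 if host is not a valid IPv4 address. */
int make_ipv4_sockaddr(const char *host, uint16_t port, struct sockaddr_in *sa)
{
    assert(host != NULL && sa != NULL);

    memset(sa, 0, sizeof(*sa));
    sa->sin_family = AF_INET;
    sa->sin_port = htons(port);

    /* inet_pton returns 1 on success and 0 for an unparsable string. */
    return inet_pton(AF_INET, host, &sa->sin_addr) == 1 ? 0 : -1;
}
```

The resulting structure can be cast to struct sockaddr * and passed as src_addr or dst_addr, so that a malformed address is caught before address resolution is attempted.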