Configuring InfiniBand#
To configure InfiniBand, you’ll need to edit your master.yaml
file.
Prerequisites#
Before starting, ensure your system meets the following requirements:
ib0
active/up inifconfig
on the master and all the compute nodes/dev/infiniband
configured on the master and all the compute nodesInfiniBand connectivity is in place (can be verified using
ibping
)
Test if InfiniBand is Present#
To test whether InfiniBand is present on the master and on a Compute Node (e.g., cn16
):
[root@master ~]# ibv_devices
To retrieve hardware information:
[root@master ~]# ibv_devinfo -d mlx5_0
Make note of port number 1
, the number of active ports (e.g., PORT_ACTIVE (4)
) and the
port_lid
. These details are useful for the ibping
test:
[root@master ~]# ibstat mlx5_0
To test the IB ports on both master and cn16 (PING cn16 -> Master
):
Activate the
ibping
server on the master:[root@master ~]# ibping -S -C mlx5_0 -P 1
Note
-P 1 → Port 1 on device mlx5_0.
Then, with the
ibping
client oncn16
, ping the masterport = 1
five times usingport_lid=1
:[root@cn16 ~]# ibping -c 5 -C mlx5_0 -P 1 -L 47
The expected result should be no packet loss, indicating full connectivity.
To test the IB ports across the master and all compute nodes (PING cnXX -> Master
):
[root@cn16 ~]# pdsh -w cn[02-79] "ibping -c 5 -C mlx5_0 -P 1 -L 47" | dshbak -c > ./cluster_wide_ib_ping_test.txt
Review the resulting text file for any packet loss or connectivity issues.
Configure the Master to Use /dev/infiniband
in Containers#
Update the master.yaml
by adding the following lines:
task_container_defaults:
shm_size_bytes: 4294967296
dtrain_network_interface: ib0
add_capabilities:
- IPC_LOCK
devices:
- host_path: /dev/infiniband/
container_path: /dev/infiniband/
Note
The shared memory configuration isn’t mandatory for activating IB. However, it’s best practice for training large models.
Restart the master:
systemctl restart determined-master
Verify the master’s status:
systemctl status determined-master