NVLink is a communication protocol developed by Nvidia. It is a hardware technology for creating a high-bandwidth link between two of their graphics cards. It can be used for many things, from basic SLI for faster gaming to potentially pooling GPU memory for rendering large and complex scenes.
It is used for data and control code transfers in processor systems, both between CPUs and GPUs and solely between GPUs. NVLink specifies a point-to-point connection with data rates of 20 and 25 Gbit/s per differential pair (NVLink 1.0 and 2.0, respectively). Because it carries code and data directly between GPUs, NVLink helps in distributing a workload across multiple GPUs.
By the end of this post, you will be able to understand and compare the performance of two RTX 2080 Ti GPUs connected with NVLink against a single GPU, based on tests I recently ran.
Hardware Setup:
- 2 x GeForce RTX 2080 Ti
- Intel Core i9-10940X CPU
- 3.30 GHz processor base frequency
- 126 GB Memory
Software Setup:
- Ubuntu 18.04.4 LTS
- NVIDIA display driver 440.44
- CUDA 10.2
- Torch 1.4.0
Before going into NVLink, let’s first look at the workload. In this post, I use a PyTorch model called DeepNeuro. It is an open-source toolset of deep learning applications in medical imaging.
- We can check the status of the NVLink links on the available GPUs using the nvidia-smi nvlink --status command.
This gives us the current state of each link on each GPU, along with its bandwidth.
- To see the NVLink topology, use the nvidia-smi topo -m command.
In the topology matrix, an entry of the form NV# between GPU0 and GPU1 means that their connection traverses a bonded set of # NVLinks.
The matrix shows us which GPUs are connected to each other and with which technology.
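The topology matrix can also be read programmatically. The following is a minimal sketch, not part of the original setup: it assumes the usual layout of nvidia-smi topo -m (a header row of GPU names followed by one row per GPU), and the helper name nvlink_pairs is hypothetical. It collects every cell of the form NV2, NV4 and so on, and reports how many links are bonded between each GPU pair.

```python
import re
import shutil
import subprocess

# Some driver versions underline the header row with ANSI escape codes.
ANSI = re.compile(r"\x1b\[[0-9;]*m")


def nvlink_pairs(topo_text):
    """Return {(gpu_a, gpu_b): bonded_link_count} for each NV# cell.

    Assumes the layout printed by `nvidia-smi topo -m`; the function
    name and return shape are illustrative, not an official API.
    """
    lines = [ANSI.sub("", ln) for ln in topo_text.splitlines() if ln.strip()]
    if not lines:
        return {}
    # Header row: keep only the GPU columns (GPU0, GPU1, ...).
    gpus = [tok for tok in lines[0].split() if re.fullmatch(r"GPU\d+", tok)]
    pairs = {}
    for line in lines[1:]:
        cells = line.split()
        if not cells or cells[0] not in gpus:
            continue  # legend lines and non-GPU rows
        row = gpus.index(cells[0])
        for col, cell in enumerate(cells[1:len(gpus) + 1]):
            match = re.fullmatch(r"NV(\d+)", cell)
            if match and row < col:  # record each pair once
                pairs[(gpus[row], gpus[col])] = int(match.group(1))
    return pairs


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "topo", "-m"], capture_output=True, text=True
        )
        print(nvlink_pairs(result.stdout))
    else:
        print("nvidia-smi not found on this machine")
```

An empty result means no GPU pair is reported as NVLink-connected, in which case traffic between the GPUs goes over PCIe instead.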
DeepNeuro with a single GPU:
I trained a deep learning model on 3,150 images with a batch size of 150. Here I’m using only one GPU, so the entire workload runs on that single GPU. In this kind of situation, as the workload increases, a single GPU may not be able to hold it all and training can fail with out-of-memory errors.
DeepNeuro with Nvlink:
NVLink enables direct communication between multiple GPUs, which makes it practical to distribute the workload among them. To use NVLink, I parallelized the PyTorch model so that the work is split across both GPUs, with the data they exchange travelling over the NVLink connection.
Parallelizing the task this way reduces the workload on each individual GPU and also reduces the processing time.
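The post does not show the parallelization code, so the following is only a sketch of one common way to do it, assuming torch.nn.DataParallel. The TinyClassifier network and its sizes are hypothetical stand-ins for the real model. DataParallel itself does not require NVLink: PyTorch moves data between GPUs over whatever path the hardware offers, and NVLink simply makes those transfers faster.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in network; the models used in this post are not shown.
class TinyClassifier(nn.Module):
    def __init__(self, in_features=64, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 32),
            nn.ReLU(),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = TinyClassifier()

# With two or more visible GPUs, DataParallel replicates the model on each
# device and splits every input batch along dimension 0 between them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

# Batch size 150 as in the post; on two GPUs each replica sees 75 samples.
batch = torch.randn(150, 64, device=device)
logits = model(batch)
print(logits.shape)  # torch.Size([150, 2])
```

The script falls back to a single GPU or the CPU when fewer than two GPUs are visible, so the same code runs on both the single-GPU and multi-GPU setups being compared.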
The plot shows the time taken against the number of epochs. For a given number of epochs, the multi-GPU run finishes faster than the single-GPU run. With NVLink, training is nearly 20% faster.
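A figure such as "20% faster" can be read in two ways: as a reduction in wall-clock time or as a speedup factor. The short sketch below shows both calculations. The timings in it are hypothetical placeholders chosen for round numbers, not measurements from this experiment.

```python
def time_reduction_pct(single_gpu_s, multi_gpu_s):
    """Percentage of wall-clock time saved relative to the single-GPU run."""
    return 100.0 * (single_gpu_s - multi_gpu_s) / single_gpu_s


def speedup(single_gpu_s, multi_gpu_s):
    """How many times faster the multi-GPU run is."""
    return single_gpu_s / multi_gpu_s


# Hypothetical timings in seconds (placeholders, not measured values).
single_s, multi_s = 100.0, 80.0
print(time_reduction_pct(single_s, multi_s))  # 20.0
print(speedup(single_s, multi_s))             # 1.25
```

With these placeholder numbers, a 20% reduction in time corresponds to a 1.25x speedup, so it is worth stating which of the two a reported percentage refers to.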
Other Examples:
I trained another PyTorch classification model to predict whether a patient has pneumonia or not. This example shows the difference in computational speed between single and multiple GPUs more clearly.
Single GPU:
I trained a PyTorch model with 5,000 images on a single GPU, and it uses nearly 73% of the GPU's memory. If we increase the size of the dataset, it may eventually run into an out-of-memory error.
Multi GPU:
The workload is shared by both GPUs, and around 30–40% of GPU memory is in use.
The plot shows the time taken against the number of epochs. For a given number of epochs, the multi-GPU run finishes faster than the single-GPU run. With NVLink, training is nearly 50% faster.
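Memory figures like these can be collected with nvidia-smi's CSV query mode. The sketch below is an illustration rather than the method used for the numbers above. It assumes the --query-gpu and --format=csv,noheader,nounits options, and the helper name parse_memory_csv is hypothetical.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Map GPU index -> percent of memory in use.

    Expects one "index, used_MiB, total_MiB" line per GPU, which is the
    shape produced by the QUERY command above.
    """
    usage = {}
    for line in text.strip().splitlines():
        index, used_mib, total_mib = (f.strip() for f in line.split(","))
        usage[int(index)] = 100.0 * float(used_mib) / float(total_mib)
    return usage


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        result = subprocess.run(QUERY, capture_output=True, text=True)
        if result.returncode == 0:
            for gpu, pct in parse_memory_csv(result.stdout).items():
                print(f"GPU {gpu}: {pct:.0f}% of memory in use")
        else:
            print("nvidia-smi failed:", result.stderr.strip())
    else:
        print("nvidia-smi not found on this machine")
```

Running this while training is in progress gives one percentage per GPU, which makes it easy to compare the single-GPU and multi-GPU cases side by side.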
Comparing RTX 2080 Ti with other GPUs:
Boasting up to six times the performance of the older GTX 1080 series graphics cards, Nvidia’s latest RTX 2080 and RTX 2080 Ti are GPU beasts.
Comparing Average G3D Mark of GPUs:
This graph shows the relative performance of the video card compared to 10 other common video cards in terms of PassMark G3D Mark. G3D Mark is the 3D graphics score reported by PassMark's PerformanceTest benchmark, with higher numbers indicating better performance. It should not be confused with 3DMark, a separate benchmarking tool developed by UL (formerly Futuremark) to determine the performance of a computer’s 3D graphics rendering and CPU workload processing capabilities.