Multi-Process Service (MPS) is a client-server implementation of the CUDA Application Programming Interface (API) for running multiple processes concurrently on the same GPU.
The server manages GPU access, providing concurrency between clients. Clients connect to it through the client runtime, which is built into the CUDA Driver library and may be used transparently by any CUDA application.
Multi-Process control on an NVIDIA Volta architecture. Image from NVIDIA documentation.
MPS provides several benefits: it allows kernel and memory-copy operations from different processes to overlap on the GPU, which increases overall utilization, and it reduces both on-GPU context storage and context-switching overhead. It is most useful when a single process does not generate enough work to saturate the GPU on its own, for example when many small inference or training jobs each need only a fraction of the available compute capacity.
MPS consists of several components: a Control Daemon Process, a Client Runtime, and a Server Process. The Control Daemon is responsible for starting and stopping the server and for coordinating connections between clients and servers; the Client Runtime, as noted above, is built into the CUDA Driver library and is used transparently by any CUDA application; and the Server is the clients' shared connection to the GPU, providing concurrency between them. More information on the architecture of MPS is available in its official documentation.
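Because the client runtime is transparent, an MPS client is just an ordinary CUDA process. The sketch below, which assumes PyTorch and a CUDA-capable GPU are available (neither is specified in this article), spawns several worker processes that all submit work to the same GPU; the only MPS-specific step is starting the control daemon on the host beforehand with `nvidia-cuda-mps-control -d`.

```python
# Minimal sketch, assuming PyTorch and one CUDA-capable GPU.
# Start the MPS control daemon on the host first:
#   nvidia-cuda-mps-control -d
# then run this script. Each worker is a plain CUDA client process; with the
# daemon running, their kernels are funneled through the shared MPS server
# and can execute concurrently instead of being time-sliced.
import multiprocessing as mp

import torch


def worker(rank: int) -> None:
    device = torch.device("cuda:0")
    x = torch.randn(4096, 4096, device=device)
    for _ in range(50):
        x = x @ x          # ordinary CUDA kernels, no MPS-specific code
        x = x / x.norm()   # keep values bounded across iterations
    torch.cuda.synchronize(device)
    print(f"worker {rank} finished")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # each child gets its own CUDA context
    procs = [mp.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```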
It is possible to automate GPU partitioning for workloads on a Kubernetes cluster using Nos, an open-source module. Nos lets users specify the compute and memory requirements of each pod and automatically partitions the GPU so that each pod is allocated only the resources it actually needs, leaving the rest of the GPU available for other workloads.
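To make this concrete, here is a minimal sketch of requesting a GPU slice using the official Kubernetes Python client. The resource name `nvidia.com/gpu-4gb` and the container image are illustrative assumptions, not values taken from the Nos documentation; check the project docs for the exact resource names exposed on your cluster.

```python
# Minimal sketch, assuming the `kubernetes` Python client and a cluster where
# Nos dynamic GPU partitioning is enabled. The resource name and image below
# are hypothetical placeholders.
from kubernetes import client, config


def create_partitioned_pod() -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside a pod

    container = client.V1Container(
        name="inference",
        image="my-registry/inference-server:latest",  # hypothetical image
        resources=client.V1ResourceRequirements(
            # Request a 4 GB slice of a GPU instead of a whole device
            # (illustrative resource name).
            limits={"nvidia.com/gpu-4gb": "1"}
        ),
    )
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="gpu-slice-demo"),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)


if __name__ == "__main__":
    create_partitioned_pod()
```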
There are two alternatives to MPS for partitioning GPUs: time-slicing and NVIDIA Multi-Instance GPU (MIG).
Time-slicing consists of oversubscribing a GPU by leveraging its time-slicing scheduler, which executes multiple CUDA processes concurrently through temporal sharing.
The GPU divides its compute resources among the different processes in a fair-sharing manner by switching between them at regular time intervals. This continuous context switching adds a compute-time overhead, which translates into jitter and higher latency.
Time-slicing is supported by virtually every GPU architecture and is the simplest way to share a GPU in a Kubernetes cluster. However, beyond the context-switching overhead described above, time-slicing provides no memory isolation between the processes sharing a GPU and no memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.
MIG technology from NVIDIA allows a single GPU to be partitioned into up to seven completely separate instances, each with its own dedicated high-bandwidth memory, cache, and compute cores. This greatly increases the number of users who can access the GPU, and maximizes GPU usage by allocating resources efficiently.
MIG is supported on NVIDIA H100, A100, and A30 Tensor Core GPUs. H100 further enhances MIG by providing multi-tenant, multi-user configurations in virtualized environments, ensuring secure isolation of each instance with confidential computing at the hardware and hypervisor level.
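As an illustration of how MIG instances appear to software, the following sketch uses the NVML Python bindings (the `nvidia-ml-py` package) to list the MIG devices carved out of GPU 0 together with their dedicated memory. It assumes a MIG-capable GPU, such as an A100, with MIG mode already enabled; these details are assumptions, not part of this article.

```python
# Minimal sketch, assuming `pip install nvidia-ml-py` and a MIG-capable GPU
# (e.g. A100) with MIG mode already enabled on GPU 0.
import pynvml

pynvml.nvmlInit()
try:
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    current_mode, _pending_mode = pynvml.nvmlDeviceGetMigMode(gpu)
    if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
        print("MIG mode is not enabled on GPU 0")
    else:
        max_count = pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)
        for i in range(max_count):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
            except pynvml.NVMLError:
                continue  # no MIG device at this index
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"MIG device {i}: {mem.total / 1024**2:.0f} MiB dedicated memory")
finally:
    pynvml.nvmlShutdown()
```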
MIG, along with MPS, can be leveraged on Kubernetes with the open-source module Nos to automatically maximize GPU utilization in the cluster.
Overall, MPS can provide significant benefits to multi-process CUDA applications and can be a valuable tool for achieving greater levels of execution performance on NVIDIA GPUs.