NVIDIA Multi-Process Service (MPS)

Multi-Process Service (MPS) is a client-server implementation of the CUDA Application Programming Interface (API) for running multiple processes concurrently on the same GPU.

The server manages GPU access, providing concurrency between clients. Clients connect to it through the client runtime, which is built into the CUDA driver library and can be used transparently by any CUDA application.
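
As a concrete illustration, the sketch below is an ordinary CUDA program with no MPS-specific code in it: once the control daemon is running, any number of concurrently launched instances of it are routed through the shared server. This is a minimal sketch; the file name and launch commands in the comments are illustrative.

```cuda
// saxpy.cu -- an ordinary CUDA client. No MPS-specific code is needed:
// the MPS client runtime lives inside the CUDA driver library.
//
// Start the MPS control daemon once (standard MPS control CLI):
//   nvidia-cuda-mps-control -d
// Then run several instances of this binary concurrently, e.g.:
//   ./saxpy & ./saxpy & ./saxpy & wait
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %.1f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```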

Multi-process control on an NVIDIA Volta architecture. Image from NVIDIA documentation.

Benefits of MPS

MPS provides the following benefits:

  • GPU utilization: A single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.
  • Reduced on-GPU context storage: Without MPS, each CUDA process using a GPU allocates its own storage and scheduling resources on the GPU. The MPS server instead allocates one copy of GPU storage and scheduling resources shared by all of its clients. Volta MPS supports increased isolation between MPS clients, so the resource reduction there is smaller than on pre-Volta architectures (a sketch of per-client provisioning on Volta follows this list).
  • Reduced GPU context switching: Without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.
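
On Volta and newer architectures, each client's share of the GPU's execution resources can also be capped. Below is a minimal sketch using the documented CUDA_MPS_ACTIVE_THREAD_PERCENTAGE variable, which must be present in the client's environment before its first CUDA call creates a context; the 25% cap and file name are illustrative.

```cuda
// capped_client.cu -- sketch of limiting a client's SM share under Volta MPS.
// CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is a documented MPS variable; in practice
// it is usually exported by the launching shell:
//   CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=25 ./capped_client
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Setting it in-process should have the same effect, as long as no CUDA
    // call has run yet (the driver reads it when the context is created).
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "25", /*overwrite=*/0);

    cudaFree(0);  // force context creation through the MPS client runtime

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("running on %s (client capped at ~25%% of SMs)\n", prop.name);
    return 0;
}
```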

When to Use MPS

MPS can be particularly useful when:

  • Each application process does not generate enough work to saturate the GPU.
  • The application shows low GPU occupancy because of a small number of threads per grid; running several such processes side by side under MPS can recover the lost throughput (a concrete sketch follows this list).
  • The problem size is held fixed while the compute capacity (node, CPU core and/or GPU count) is increased in strong-scaling situations.
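
To make the occupancy case concrete, the sketch below launches a deliberately tiny grid: a single process like this leaves most of the GPU's SMs idle, which is exactly the situation where several processes under MPS can fill the hardware. The kernel and sizes are illustrative.

```cuda
// tiny_grid.cu -- a kernel whose grid is far too small to saturate a modern
// GPU (4 blocks vs. tens or hundreds of SMs). One such process leaves most
// of the GPU idle; several of them under MPS can run side by side.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 4 * 128;  // only 4 blocks of 128 threads
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    scale<<<4, 128>>>(data, n, 0.5f);
    cudaDeviceSynchronize();

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("4 blocks launched on a GPU with %d SMs\n", prop.multiProcessorCount);
    cudaFree(data);
    return 0;
}
```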

Architecture

MPS consists of several components: a Control Daemon Process, a Client Runtime, and a Server Process. The Control Daemon is responsible for starting and stopping the server and for coordinating connections between clients and servers; the Client Runtime, built into the CUDA driver library, is the transparent path through which applications reach the server; and the Server is the clients' shared connection to the GPU, providing concurrency between them. More detail on the MPS architecture is available in its official documentation.
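
These components find each other through a named-pipe directory. The sketch below points a client at a specific control daemon using the documented CUDA_MPS_PIPE_DIRECTORY variable (CUDA_MPS_LOG_DIRECTORY plays the analogous role for the daemon's logs); the paths shown are common defaults and purely illustrative.

```cuda
// mps_connect.cu -- sketch of pointing a client at a specific control daemon.
//
// Daemon side (standard MPS control CLI):
//   export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
//   export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
//   nvidia-cuda-mps-control -d
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    // Must be in place before the first CUDA call; exporting it in the shell
    // that launches the client is the usual approach.
    setenv("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps", 0);

    int dev = -1;
    cudaError_t err = cudaGetDevice(&dev);  // first CUDA call: the client
                                            // runtime connects to the daemon
    printf("device %d, status: %s\n", dev, cudaGetErrorString(err));
    return 0;
}
```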

Dynamic GPU Partitioning in Kubernetes

It is possible to automate GPU partitioning for workloads on a Kubernetes cluster using Nos, an open-source module. Nos lets users specify the compute and memory requirements of each pod and automatically partitions the GPUs so that each pod is allocated only the resources it actually needs, keeping the cluster's GPUs fully utilized.

Alternatives to MPS for GPU partitioning

There are two main alternatives to MPS for partitioning GPUs: time-slicing and NVIDIA Multi-Instance GPU (MIG).

Time-slicing

Time-slicing consists of oversubscribing a GPU by leveraging its time-slicing scheduler, which interleaves the execution of multiple CUDA processes through temporal sharing.

This means that the GPU shares its compute resources among the different processes in a fair-share manner, switching between processes at regular time intervals. The continuous context switching adds compute-time overhead, which translates into jitter and higher latency.

Time-slicing is supported by virtually every GPU architecture and is the simplest way to share a GPU in a Kubernetes cluster. However, it provides no memory isolation between the processes sharing a GPU, nor any memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.
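
One rough way to observe this overhead is to time the same kernel repeatedly while a second copy of the program runs on the same GPU: under time-slicing the per-iteration times spread out as contexts swap, whereas under MPS the kernels overlap instead. The sketch below is illustrative; the kernel and iteration counts are arbitrary.

```cuda
// jitter.cu -- time a fixed kernel repeatedly; run two instances at once and
// compare the spread of the timings with and without MPS.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busy(float *out, int iters) {
    float v = threadIdx.x;
    for (int i = 0; i < iters; ++i) v = v * 1.000001f + 0.5f;
    out[threadIdx.x + blockIdx.x * blockDim.x] = v;
}

int main() {
    float *out;
    cudaMalloc(&out, 1024 * 256 * sizeof(float));
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int rep = 0; rep < 20; ++rep) {
        cudaEventRecord(start);
        busy<<<1024, 256>>>(out, 1 << 16);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("rep %2d: %.3f ms\n", rep, ms);  // watch the spread across reps
    }
    cudaFree(out);
    return 0;
}
```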

NVIDIA Multi-Instance GPU (MIG)

MIG technology from NVIDIA allows a single GPU to be partitioned into up to seven completely separate instances, each with its own dedicated high-bandwidth memory, cache, and compute cores. This greatly increases the number of users who can access the GPU, and maximizes GPU usage by allocating resources efficiently.

MIG is supported on NVIDIA H100, A100, and A30 Tensor Core GPUs. H100 further enhances MIG by providing multi-tenant, multi-user configurations in virtualized environments, ensuring secure isolation of each instance with confidential computing at the hardware and hypervisor level.
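
From a CUDA program's point of view, a MIG instance made visible to a process (for example via CUDA_VISIBLE_DEVICES set to the instance's MIG UUID) appears as an ordinary device. A small sketch to enumerate what the process actually sees:

```cuda
// devices.cu -- enumerate the CUDA devices visible to this process. When a
// MIG instance is assigned (e.g. CUDA_VISIBLE_DEVICES=MIG-<uuid>), it shows
// up here as a regular device with its own memory and SM count.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, %.1f GiB, %d SMs\n", d, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.multiProcessorCount);
    }
    return 0;
}
```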

MIG, along with MPS, can be leveraged on Kubernetes with the open-source module Nos to automatically maximize GPU utilization in the cluster.

Conclusion

Overall, MPS can provide significant benefits to multi-process CUDA applications and is a valuable tool for extracting higher utilization and throughput from NVIDIA GPUs.
