Integrating NVIDIA Multi-Process Service (MPS) in Kubernetes to share GPUs among workloads for maximizing utilization and reducing infrastructure costs.
Most workloads do not require the full memory and compute resources of a GPU. Sharing a GPU among multiple processes is therefore essential for increasing GPU utilization and reducing infrastructure costs.
In Kubernetes, this can be achieved by exposing a single GPU as multiple resources (i.e. slices) of a specific memory and compute size that can be requested by individual containers. By creating GPU slices of the size strictly needed by each container, you can free up resources in the cluster. These resources can be used to schedule additional Pods, or can allow you to reduce the number of nodes of the cluster. In either case, sharing GPUs among processes enables you to reduce infrastructure costs.
GPU support in Kubernetes is provided by the NVIDIA Kubernetes Device Plugin, which at the moment supports only two sharing strategies: time-slicing and Multi-Instance GPU (MIG). However, there is a third GPU sharing strategy that balances the advantages and disadvantages of time-slicing and MIG: Multi-Process Service (MPS). Although MPS is not supported by the NVIDIA Device Plugin, there is a way to use it in Kubernetes.
In this article, we will first examine the benefits and drawbacks of all three GPU sharing technologies, and then provide a step-by-step guide on using MPS in Kubernetes. Additionally, we present a solution for automating the management of MPS resources to optimize utilization and reduce operational costs: Dynamic MPS Partitioning.
There are three approaches for sharing GPUs:

- Time-slicing
- Multi-Instance GPU (MIG)
- Multi-Process Service (MPS)
Let’s review these technologies before diving into the demo of Dynamic MPS Partitioning.
Time-slicing is a mechanism that allows workloads that land on oversubscribed GPUs to interleave with one another. Time-slicing leverages the GPU time-slicing scheduler, which executes multiple CUDA processes concurrently via temporal sharing.
When time-slicing is activated, the GPU shares its compute resources among the different processes in a fair-sharing manner by switching between processes at regular intervals of time. This generates a computing time overhead related to the continuous context switching, which translates into jitter and higher latency.
Time-slicing is supported by basically every GPU architecture and is the simplest solution for sharing a GPU in a Kubernetes cluster. However, constant switching among processes creates a computation time overhead. Also, time-slicing does not provide any level of memory isolation among the processes sharing a GPU, nor any memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.
If you want to use time-slicing in Kubernetes, all you have to do is edit the NVIDIA Device Plugin configuration. For example, you can apply the configuration below to a node with 2 GPUs. The device plugin running on that node will advertise 8 nvidia.com/gpu resources to Kubernetes, rather than 2. This allows each GPU to be shared by a maximum of 4 containers.
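A sketch of such a configuration, following the device plugin's documented time-slicing schema (the replica count of 4 is the value from the example; apply it through the GPU Operator's device-plugin ConfigMap or the plugin's config file):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
      # Each physical nvidia.com/gpu is advertised as 4 schedulable replicas,
      # so a node with 2 GPUs exposes 8 nvidia.com/gpu resources in total.
      - name: nvidia.com/gpu
        replicas: 4
```

Note that the replicas share the GPU via temporal interleaving only: there is no memory or fault isolation between the containers landing on the same GPU.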
For more information about time-slicing partitioning in Kubernetes refer to the NVIDIA GPU Operator documentation.
Multi-Instance GPU (MIG) is a technology available on the NVIDIA Ampere and Hopper architectures that allows you to securely partition a GPU into up to seven separate GPU instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores.
The isolated GPU slices are called MIG devices, and they are named with a format that indicates the compute and memory resources of the device. For example, 2g.20gb corresponds to a GPU slice with two compute slices and 20 GB of memory.
MIG does not allow you to create GPU slices of custom sizes and quantities, as each GPU model only supports a specific set of MIG profiles. This reduces the granularity with which you can partition the GPUs. Additionally, MIG devices must be created respecting certain placement rules, which further limits flexibility.
MIG is the GPU sharing approach that offers the highest level of isolation among processes. However, it lacks flexibility and is compatible with only a few GPU architectures (Ampere and Hopper).
You can create and delete MIG devices manually with the nvidia-smi CLI or programmatically with NVML. The devices are then exposed as Kubernetes resources by the NVIDIA Device Plugin using different naming strategies. For instance, with the mixed strategy the device 1g.10gb is exposed as nvidia.com/mig-1g.10gb, while the single strategy exposes it as a generic nvidia.com/gpu resource.
Managing MIG devices manually with the nvidia-smi CLI or with NVML is rather impractical: in Kubernetes the NVIDIA GPU Operator offers an easier way to use MIG, though still with limitations. The operator uses a ConfigMap defining a set of allowed MIG configurations that you can apply to each node by tagging it with a label.
You can edit this ConfigMap to define your own custom MIG configurations, as in the example shown below. In this example, a node is labeled with nvidia.com/mig.config=all-1g.5gb. Therefore, the GPU Operator will partition each GPU of that node into seven 1g.5gb MIG devices, which are then exposed to Kubernetes as nvidia.com/mig-1g.5gb resources.
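A minimal sketch of such a ConfigMap, following the format used by the GPU Operator's MIG manager (the ConfigMap name and namespace are illustrative; check the version deployed by your operator):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-mig-parted-config
  namespace: gpu-operator
data:
  config.yaml: |
    version: v1
    mig-configs:
      # Named configuration: split every GPU on the node into seven 1g.5gb devices.
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7
```

The configuration is then selected per node with a label, e.g. `kubectl label nodes <node-name> nvidia.com/mig.config=all-1g.5gb`.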
To make efficient use of the resources in the cluster with NVIDIA GPU Operator, the cluster admin would have to continuously modify the ConfigMap to adapt the MIG size to the ever-changing workload compute requirements.
This is very impractical. Although this approach is certainly better than SSH-ing into nodes and manually creating/deleting MIG devices, it is labor-intensive and time-consuming for the cluster admin. As a result, the configuration of MIG devices is often changed rarely or not applied at all, and in both cases this results in large inefficiencies in GPU utilization and thus higher infrastructure costs.
This challenge can be overcome with Dynamic GPU Partitioning. Later in this article we will see how to dynamically partition a GPU with MPS using the open source module nos, following an approach that also works with MIG.
Multi-Process Service (MPS) is a client-server implementation of the CUDA Application Programming Interface (API) for running multiple processes concurrently on the same GPU.
The server manages GPU access, providing concurrency between clients. Clients connect to it through the client runtime, which is built into the CUDA driver library and can be used transparently by any CUDA application.
MPS is compatible with basically every modern GPU and provides the highest flexibility, allowing you to create GPU slices with arbitrary limits on both the amount of allocatable memory and the available compute. However, it does not enforce full memory isolation between processes. In most cases, MPS represents a good compromise between MIG and time-slicing.
Compared to time-slicing, MPS eliminates the overhead of context switching by running processes in parallel through spatial sharing, which leads to better compute performance. Moreover, MPS provides each process with its own GPU memory address space, making it possible to enforce memory limits on the processes and overcome the limitations of time-slicing.
In MPS, however, client processes are not fully isolated from each other. Indeed, even though MPS allows limiting clients’ compute and memory resources, it does not provide error isolation or memory protection. This means that a client process can crash and cause the entire GPU to reset, impacting all other processes running on the GPU.
The NVIDIA Kubernetes Device Plugin does not support MPS partitioning, so using MPS in Kubernetes is not straightforward. In the following section, we explore how to take advantage of MPS for GPU sharing by leveraging nos and a different Kubernetes device plugin.
You can enable MPS partitioning in a Kubernetes cluster by installing this fork of the NVIDIA Device Plugin with Helm:
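An installation command along these lines should work; the chart location and version below are assumptions based on the fork's published Helm charts and may have changed, so double-check against its installation guide:

```shell
# Install the nebuly-ai fork of the NVIDIA Device Plugin (MPS-capable)
helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
  --version 0.13.0 \
  --generate-name \
  -n nebuly-nvidia \
  --create-namespace
```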
By default, the Helm chart deploys the device plugin with MPS mode enabled on all nodes labeled nos.nebuly.com/gpu-partitioning=mps. To enable MPS partitioning on the GPUs of a specific node, simply apply that label to it.
It is likely that a version of the NVIDIA Device Plugin is already installed on your cluster. If you don’t want to remove it, you can choose to install this forked plugin alongside the original NVIDIA Device Plugin and run it only on specific nodes. To do so, it is important to ensure that only one of the two plugins is running on a node at a time. As described in the installation guide, this can be achieved by editing the specification of the original NVIDIA Device Plugin and adding an anti-affinity rule in its spec.template.spec, so that it does not run on the same nodes targeted by the forked plugin:
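The anti-affinity rule can be sketched as follows, keying off the same node label used by the forked plugin (add it under spec.template.spec of the original plugin's DaemonSet):

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Keep the original NVIDIA Device Plugin off the MPS-managed nodes
            - key: nos.nebuly.com/gpu-partitioning
              operator: NotIn
              values:
                - mps
```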
After installing the device plugin, you can configure it to expose GPUs as multiple MPS resources by editing the sharing.mps section of its configuration. For example, the configuration below tells the plugin to expose to Kubernetes the GPU with index 0 as two GPU resources (named nvidia.com/gpu-4gb) with 4GB of memory each:
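A sketch of that configuration, following the fork's documented sharing.mps schema (field names may differ between plugin versions, so verify against the fork's README):

```yaml
version: v1
sharing:
  mps:
    resources:
      # Expose GPU index 0 as two replicas of 4 GB each,
      # advertised to Kubernetes as nvidia.com/gpu-4gb.
      - name: nvidia.com/gpu
        rename: gpu-4gb
        memoryGB: 4
        replicas: 2
        devices: ["0"]
```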
The resource name advertised to Kubernetes, the partition size, and the number of replicas can be configured as needed. Going back to the example above, a container can request a 4 GB slice of GPU memory as follows:
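For example, with a Pod along these lines (the image name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-example-pod
spec:
  containers:
    - name: app
      image: my-cuda-app:latest   # placeholder: any CUDA workload image
      resources:
        limits:
          # Request one of the 4 GB MPS slices advertised by the device plugin
          nvidia.com/gpu-4gb: 1
```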
Note that Pods with containers requesting MPS resources are subject to a few constraints; most notably, the containers typically must run with the same user ID as the MPS server, since MPS requires clients and the server to share the same user. Refer to the plugin’s documentation for the full list.
Overall, managing MPS resources through the Device Plugin configuration is complex and time-consuming. Moreover, as described here, the static nature of the configurations results in poor GPU utilization. It would be better to simply create Pods requesting specific GPU memory resources, and let something else automatically provision and manage them. We’ll see exactly how to do that in the next section.
Dynamic MPS Partitioning automates the creation and deletion of MPS resources based on real-time requirements of the workloads in the cluster, ensuring the optimal sharing configuration is always applied to the available GPUs.
To apply dynamic partitioning, we need to use nos, an open-source module that runs alongside the NVIDIA GPU Operator.
You can think of nos as a Cluster Autoscaler for GPUs: instead of scaling up the number of nodes and GPUs, it dynamically partitions them to maximize their utilization, leading to spare GPU capacity. Then, you can schedule more Pods or reduce the number of GPU nodes needed, reducing infrastructure costs.
nos manages MPS resources using the format nvidia.com/gpu-<size>gb. For instance, if a container requires a GPU slice of 10 GB of memory, which would correspond to requesting a nvidia.com/gpu-10gb resource, nos will automatically expose it on one of the available GPUs.
With nos, there is no need to manually configure the Device Plugin for advertising MPS resources. You can simply submit your Pods to the cluster and the requested MPS resources are automatically provisioned.
Let’s explore how nos and Dynamic MPS Partitioning work in practice.
nos does not replace the NVIDIA GPU Operator; rather, it works alongside it. Hence, you first need to install the operator as follows:
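The standard Helm installation from NVIDIA's documentation looks like this:

```shell
# Add NVIDIA's Helm repository and install the GPU Operator
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```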
As already mentioned, the Device Plugin deployed by the NVIDIA GPU Operator does not support MPS. To use MPS, you must install the forked version of the Device Plugin, following the steps outlined in the previous section.
Once you have installed the NVIDIA GPU Operator and enabled MPS, you can install nos as follows:
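A sketch of the installation command; the chart location and version are assumptions based on nos's published Helm charts, so verify them against the nos documentation:

```shell
# Install nos from its OCI Helm chart
helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
  --version 0.1.0 \
  --namespace nebuly-nos \
  --generate-name \
  --create-namespace
```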
That’s it! Now you are ready to activate Dynamic MPS Partitioning on your nodes.
First, you need to tell nos which nodes it should manage GPU partitioning for with MPS. Label those nodes as follows:
kubectl label nodes <node-names> "nos.nebuly.com/gpu-partitioning=mps"
This label marks a node as an “MPS node”, delegating the management of the MPS resources of all the node’s GPUs to nos.
After that, you can submit workloads requesting MPS resources. nos will automatically create the missing MPS resources requested by Pods and delete the unused ones.
Let’s take a look at a simple example of nos in action.
Assume we are operating a simple cluster with two nodes, one of which has a single NVIDIA Tesla T4 GPU with 16 GB of memory. We can enable automatic MPS partitioning for that node as follows:
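Using the labeling command introduced earlier, applied to the T4 node:

```shell
# Delegate MPS partitioning of this node's GPUs to nos
kubectl label node aks-gput4-31021156-vmss000000 \
  "nos.nebuly.com/gpu-partitioning=mps"
```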
The output of kubectl describe node aks-gput4-31021156-vmss000000 shows that the node does not have any available MPS resources, since no MPS resources have been requested yet:
Let’s now create some Pods that need to run on a GPU. We assume these Pods are small inference servers that each need only, say, 2 GB of GPU memory. Without partitioning, we would be able to schedule only one such Pod, since we have a single Tesla T4 in our cluster.
However, with MPS partitioning, the Pods can request only the necessary resources, allowing for up to 8 Pods to be scheduled on our single Tesla T4.
In this example, we create a deployment with 8 replicas of a Pod with a container requesting a GPU slice of 2GB of memory:
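The Deployment can be sketched as follows (the Deployment name and container image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mps-inference-demo     # illustrative name
  namespace: demo
spec:
  replicas: 8
  selector:
    matchLabels:
      app: mps-inference-demo
  template:
    metadata:
      labels:
        app: mps-inference-demo
    spec:
      containers:
        - name: inference-server
          image: my-inference-server:latest   # placeholder image
          resources:
            limits:
              # Each replica requests a 2 GB MPS slice of the GPU
              nvidia.com/gpu-2gb: 1
```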
There are now 8 pending Pods in the namespace demo, requesting a total of eight nvidia.com/gpu-2gb resources, which are not yet available in the cluster:
In a few seconds, nos will detect these pending Pods and create the requested resources. Let’s check the output of kubectl describe node aks-gput4-31021156-vmss000000 again:
If we check the state of the Pods once again, we can see that they are now in the Running state:
Each container of the running Pods can allocate only up to 2 GB of memory on the shared GPU. If it tries to allocate more memory, it will crash with an Out-Of-Memory (OOM) error without affecting the other Pods.
However, it is important to point out that nvidia-smi accesses the NVIDIA drivers bypassing the MPS client runtime. As a result, running nvidia-smi within a container will display the entire GPU resources in its output:
The possibility of requesting GPU slices is crucial for improving GPU utilization and cutting down infrastructure costs.
There are three ways to achieve that: time-slicing, Multi-Instance GPU (MIG) and Multi-Process Service (MPS). Time-slicing is the simplest technology for sharing a GPU, but it lacks memory isolation and introduces overhead that degrades workload performance. On the other hand, MIG offers the highest level of isolation, but its limited set of supported configurations and “slice” sizes makes it inflexible.
MPS is a valid compromise between MIG and time-slicing. Unlike MIG, it allows creating GPU slices of arbitrary sizes. Unlike time-slicing, it enforces memory allocation limits, reducing the Out-Of-Memory (OOM) errors that may occur when multiple containers compete for shared GPU resources.
Currently, the NVIDIA Device Plugin does not support MPS. Nevertheless, MPS can be enabled by simply installing another Device Plugin that supports it.
Static MPS configurations, however, do not automatically adjust to the changing demands of workloads, and are therefore inadequate for providing every Pod with the GPU resources it requires, especially in scenarios where workloads demand a variety of slices, in terms of memory and compute, that change over time.
nos overcomes the limitations of static MPS configurations through Dynamic GPU Partitioning, which increases GPU utilization and reduces the operational burden of manually defining and applying MPS configurations to the Device Plugin instances running on the cluster’s nodes.
In conclusion, we should point out that there are situations where the flexibility of MPS is not necessary, while the full isolation provided by MIG is crucial. In these cases, it is still possible to take advantage of Dynamic GPU Partitioning through nos, since it supports both partitioning modes. You can find out more about it here.