What makes NVIDIA Multi-Instance GPU (MIG) so special? Well, with MIG, a single GPU can be partitioned into as many as seven fully isolated instances, each with its own high-bandwidth memory, cache, and compute cores, expanding GPU access to more users and optimizing GPU utilization.
MIG enables administrators to support every workload, from the smallest to the largest, with guaranteed quality of service (QoS), allowing simultaneous mixed workloads to run on a single GPU with deterministic latency and throughput. This means researchers and developers have more resources and flexibility than ever before, while data center investments are maximized.
MIG has two main benefits:
MIG provides the flexibility to choose many different instance sizes, which allows for the provisioning of the right-sized GPU instance for each workload. This ultimately optimizes utilization and maximizes data center investment.
Moreover, MIG lets simultaneous mixed workloads run side by side. It enables inference, training, and high-performance computing (HPC) workloads to run at the same time on a single GPU with deterministic latency and throughput. Unlike time-slicing, each workload runs in parallel, delivering high performance.
MIG technology works by partitioning a GPU into different-sized MIG instances. For example, in an NVIDIA A100 40GB, an administrator could create two instances with 20 gigabytes (GB) of memory each, three instances with 10GB each, or seven instances with 5GB each. MIG instances can also be dynamically reconfigured, enabling administrators to shift GPU resources in response to changing user and business demands.
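As a concrete illustration, the sketch below drives nvidia-smi from Python to enable MIG mode and create the seven-way split described above, then reconfigure it into a coarser layout. It is a minimal sketch, assuming an A100 40GB where profile ID 19 corresponds to the 1g.5gb slice and profile ID 9 to the 3g.20gb slice; check `nvidia-smi mig -lgip` on your own GPU for the exact profile IDs it supports.

```python
import subprocess

def run(cmd):
    """Run a command and echo it first (sketch: no error handling beyond check=True)."""
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Enable MIG mode on GPU 0 (requires admin privileges and may need a GPU reset).
run(["sudo", "nvidia-smi", "-i", "0", "-mig", "1"])

# List the GPU instance profiles supported by this GPU (IDs differ per model).
run(["nvidia-smi", "mig", "-lgip"])

# Example partition for an A100 40GB: seven 1g.5gb instances (profile ID 19).
# "-C" also creates the corresponding compute instances.
run(["sudo", "nvidia-smi", "mig", "-cgi", "19,19,19,19,19,19,19", "-C"])

# Reconfigure later by destroying the instances and creating a new layout,
# e.g. two 3g.20gb instances (profile ID 9).
run(["sudo", "nvidia-smi", "mig", "-dci"])   # destroy compute instances
run(["sudo", "nvidia-smi", "mig", "-dgi"])   # destroy GPU instances
run(["sudo", "nvidia-smi", "mig", "-cgi", "9,9", "-C"])
```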
With a dedicated set of hardware resources for compute, memory, and cache, each MIG instance delivers guaranteed QoS and fault isolation. This means that failure in an application running on one instance doesn’t impact applications running on other instances. It also means that different instances can run different types of workloads—interactive model development, deep learning training, AI inference, or HPC applications. Since the instances run in parallel, the workloads also run in parallel—but separate and isolated—on the same physical GPU.
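Because each instance is exposed as its own device, it can be inspected and monitored independently. The snippet below is a rough sketch using the nvidia-ml-py (pynvml) bindings to list the MIG devices on GPU 0 along with their dedicated memory; treat it as illustrative rather than production monitoring code.

```python
import pynvml

pynvml.nvmlInit()
try:
    parent = pynvml.nvmlDeviceGetHandleByIndex(0)
    current_mode, _pending = pynvml.nvmlDeviceGetMigMode(parent)
    if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
        # Each MIG instance shows up as its own device with its own memory,
        # which is what backs the QoS and fault-isolation guarantees.
        for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(parent)):
            try:
                mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(parent, i)
            except pynvml.NVMLError:
                continue  # no MIG device at this index
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
            print(f"MIG device {i}: {mem.total / 1024**2:.0f} MiB total, "
                  f"{mem.used / 1024**2:.0f} MiB used")
finally:
    pynvml.nvmlShutdown()
```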
MIG is available on NVIDIA H100, A100, and A30 Tensor Core GPUs, with H100 enhancing MIG by supporting multi-tenant, multi-user configurations in virtualized environments, securely isolating each instance with confidential computing at the hardware and hypervisor level.
Dedicated video decoders for each MIG instance deliver secure, high-throughput intelligent video analytics (IVA) on shared infrastructure. With Hopper’s concurrent MIG profiling, administrators can monitor right-sized GPU acceleration and allocate resources for multiple users.
For workloads running on a Kubernetes cluster, the GPU partitioning process can be automated with nos, an open-source module that lets users specify the compute and memory requirements of each pod. nos then partitions the GPUs automatically, allocating to each pod only the resources it strictly needs so that the GPUs are always used to their full capacity.
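The snippet below sketches what such a pod request might look like. It assumes the MIG resource naming used by the NVIDIA device plugin's mixed strategy (nvidia.com/mig-1g.5gb); the pod name, container image, and resource name are illustrative, so consult the nos documentation for the exact resources and node labels it expects.

```python
import yaml  # pip install pyyaml

# Hypothetical pod requesting a single 1g.5gb MIG slice. A dynamic partitioner
# such as nos watches for pending pods like this and carves out a matching
# MIG instance on a suitable node.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "mig-demo"},
    "spec": {
        "containers": [
            {
                "name": "inference",
                "image": "nvcr.io/nvidia/pytorch:23.01-py3",
                "resources": {"limits": {"nvidia.com/mig-1g.5gb": 1}},
            }
        ],
    },
}

print(yaml.safe_dump(pod, sort_keys=False))  # pipe into `kubectl apply -f -`
```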
There are two alternatives to MIG to partition GPUs: time-slicing and NVIDIA MPS.
Time-slicing consists of oversubscribing a GPU by leveraging its time-slicing scheduler, which executes multiple CUDA processes concurrently through temporal sharing.
This means that the GPU shares its compute resources among the different processes in a fair-sharing manner by switching between processes at regular intervals of time. This generates a computing time overhead related to the continuous context switching, which translates into jitter and higher latency.
Time-slicing is supported by basically every GPU architecture and is the simplest solution for sharing a GPU in a Kubernetes cluster. However, constant switching among processes creates a computation time overhead. Also, time-slicing does not provide any level of memory isolation among the processes sharing a GPU, nor any memory allocation limits, which can lead to frequent Out-Of-Memory (OOM) errors.
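For concreteness, time-slicing is typically configured through the NVIDIA Kubernetes device plugin by advertising each physical GPU as several replicas of the nvidia.com/gpu resource. The sketch below simply renders such a configuration from Python; it assumes the plugin's documented sharing.timeSlicing config format, which should be verified against the plugin version you deploy.

```python
import yaml  # pip install pyyaml

# Advertise each physical GPU as 4 schedulable replicas. Pods that land on the
# same device still share the whole GPU memory: there is no isolation or limit.
time_slicing_config = {
    "version": "v1",
    "sharing": {
        "timeSlicing": {
            "resources": [
                {"name": "nvidia.com/gpu", "replicas": 4},
            ]
        }
    },
}

print(yaml.safe_dump(time_slicing_config, sort_keys=False))
```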
Multi-Process Service (MPS) is a client-server implementation of the CUDA Application Programming Interface (API) for running multiple processes concurrently on the same GPU.
The main advantage of MPS is that it provides fine-grained control over the GPU assigned to each client, making it possible to specify arbitrary limits on both the amount of allocatable memory and the available compute. The Nebuly k8s-device-plugin takes advantage of this feature to expose GPU resources to Kubernetes with an arbitrary amount of allocatable memory defined by the user. This plugin is integrated into the open-source nos to enable dynamic GPU partitioning.
Compared to time-slicing, MPS eliminates the overhead of context switching by running processes in parallel through spatial sharing, and therefore delivers better compute performance. Moreover, MPS provides each client with its own GPU memory address space, which makes it possible to enforce memory limits on processes and overcome the limitations of time-slicing.
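In practice, these per-client limits are set through environment variables that the CUDA driver reads when a client connects to the MPS server. The following is a minimal sketch assuming a node where the MPS control daemon can be started manually and where train.py stands in for any CUDA workload; in a Kubernetes setup, a device plugin such as Nebuly's is expected to handle this wiring for you.

```python
import os
import subprocess

# Start the MPS control daemon (normally done once per node by an administrator
# or by a device plugin). Requires the MPS binaries shipped with the driver.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=True)

# Launch a CUDA client with capped resources:
#  - CUDA_MPS_ACTIVE_THREAD_PERCENTAGE caps the fraction of SMs the client may use
#  - CUDA_MPS_PINNED_DEVICE_MEM_LIMIT caps device memory, keyed by device index
env = dict(os.environ)
env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = "25"   # at most ~25% of the SMs
env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = "0=8G"  # 8 GB on GPU 0

# "train.py" is a placeholder for any CUDA workload.
subprocess.Popen(["python", "train.py"], env=env)
```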
It is important to point out, however, that processes sharing a GPU through MPS are not fully isolated from each other. Even though MPS allows limiting each client's compute and memory resources, it does not provide error isolation or memory protection. This means that a client process can crash and cause the entire GPU to reset, impacting all other processes running on the GPU. In practice, this issue can often be mitigated by properly handling CUDA errors and SIGTERM signals.
Overall, MIG technology offers a powerful solution to expand GPU access and optimize GPU utilization. With the ability to run simultaneous mixed workloads and dynamically reconfigure MIG instances, administrators can easily shift resources in response to changing demands. MIG provides the flexibility and performance required to meet the needs of every workload, from the smallest to the largest.