Automatically maximize the utilization and performance of GPU resources in your Kubernetes cluster.


nos is an open-source module for running AI workloads on Kubernetes in an optimized way, both in terms of hardware utilization and workload performance.

The module handles workload scheduling and hardware abstraction. It orchestrates workloads while taking into account the specific needs of AI/ML workloads and leveraging techniques typical of High-Performance Computing (HPC).

Currently, this module provides two features: Automatic GPU Partitioning and Elastic Resource Quota Management.

The Automatic GPU Partitioner lets you schedule Pods that request fractions of a GPU without partitioning the GPUs manually. Partitioning is performed dynamically by the GPU Partitioner component based on the pending and running Pods in your cluster, so that the GPUs are always fully utilized: the component constantly watches the cluster's GPU resources and computes the best possible partitioning of the available GPUs. By scheduling fractions of GPUs, you can run Pods that would otherwise remain pending due to a lack of available resources.

You can think of the GPU Partitioner as a sort of Cluster Autoscaler for GPUs: instead of scaling up the number of nodes and GPUs, it dynamically partitions the existing GPUs to maximize their utilization. Partitioning is performed with either Multi-Instance GPU (MIG) or Multi-Process Service (MPS), depending on the partitioning mode you choose for each node. You can find more information about the partitioning modes in the section below.
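As a rough sketch of how this looks in practice, a Pod can request a GPU slice as an extended resource, and the partitioner takes care of creating that slice on a suitably labeled node. The label key `nos.nebuly.com/gpu-partitioning` and the `nvidia.com/mig-1g.10gb` resource name below are assumptions (the MIG profile name depends on your GPU model); check the nos documentation for the exact names in your version:

```yaml
# Assumption: nodes opt in to automatic partitioning via a label that
# selects the mode, e.g.:
#   kubectl label node <node-name> nos.nebuly.com/gpu-partitioning=mig
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-pod
spec:
  containers:
    - name: inference
      image: my-inference-image:latest   # hypothetical image
      resources:
        limits:
          # Request a 1g.10gb MIG slice instead of a whole GPU; the
          # GPU Partitioner creates the slice dynamically if needed.
          nvidia.com/mig-1g.10gb: 1
```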

Elastic Resource Quota Management extends standard Kubernetes Resource Quotas by implementing the Capacity Scheduling KEP and adding more flexibility through two custom resources: ElasticQuotas and CompositeElasticQuotas. While standard Kubernetes Resource Quotas only let you cap the overall resource allocation of each namespace, nos elastic quotas let you define two different limits:

  1. min: the minimum resources that are guaranteed to the namespace
  2. max: the upper bound of the resources that the namespace can consume
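For illustration, a minimal ElasticQuota could look like the following. The field layout follows the Capacity Scheduling KEP, but the exact `apiVersion` is an assumption and may differ in your nos installation:

```yaml
apiVersion: nos.nebuly.com/v1alpha1   # assumption: check your installed CRDs
kind: ElasticQuota
metadata:
  name: quota-team-a
  namespace: team-a
spec:
  # min: resources guaranteed to the namespace.
  min:
    cpu: 2
    nvidia.com/gpu: 1
  # max: upper bound the namespace can reach by borrowing unused quota.
  max:
    cpu: 8
    nvidia.com/gpu: 4
```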

In this way, a namespace can borrow unused quota from other namespaces, as long as it does not exceed its max limit (if any) and the lending namespaces do not need those resources. When a namespace claims back its guaranteed min resources, Pods that are borrowing resources from other namespaces (over-quota Pods) are preempted to free up space.

Overall, nos provides a simple and effective way to maximize the utilization of GPU resources in your Kubernetes cluster, allowing you to schedule more Pods and get the most out of your GPUs. Try it out today, and reach out if you have any feedback!

Learn more about nos