Accelerating Workloads with NVIDIA MIG and Slurm on Azure

As modern AI, machine learning, and high-performance computing (HPC) workloads grow more complex, the need for flexibility, scalability, and efficient resource utilization becomes paramount. NVIDIA’s Multi-Instance GPU (MIG) feature allows a single GPU to be partitioned into multiple instances, offering a highly flexible way to share GPU resources among multiple jobs. When combined with a powerful workload manager like Slurm on Azure, you can further enhance your HPC environments by efficiently allocating and scheduling compute jobs across these GPU instances.

In this blog post, we’ll dive into how to configure and leverage NVIDIA MIG with Slurm on Azure to optimize and accelerate compute-heavy workloads. We’ll explore what MIG is, why it’s beneficial for resource management, and walk through the steps to set it up in an Azure-based Slurm environment.


What is NVIDIA MIG (Multi-Instance GPU)?

MIG is a feature that NVIDIA introduced with the A100 and carried forward to the H100. It splits a single GPU into multiple independent instances. Each instance acts as a smaller, isolated GPU with its own dedicated compute cores, memory, and memory bandwidth. This lets you allocate GPU resources to several smaller tasks or users at once instead of leaving most of a GPU idle.

For example, instead of dedicating a full, powerful A100 GPU to a task that requires only a fraction of the GPU’s resources, you can create several MIG instances on a single GPU. This approach maximizes hardware utilization, particularly in multi-user or multi-tenant environments.
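The arithmetic behind "several MIG instances on a single GPU" can be sketched in a few lines of shell. This is a minimal sketch that assumes the GPU exposes seven compute slices (the source of the seven-instance limit on A100 and H100) and that all instances are the same size. The function name max_instances is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Assumption: the GPU exposes 7 compute slices, which is where the
# "up to 7 instances" limit on A100/H100 comes from.
TOTAL_COMPUTE_SLICES=7

# max_instances SLICES_PER_INSTANCE
# Prints how many equal-sized instances fit on one GPU,
# counting compute slices only.
max_instances() {
    local per_instance="$1"
    case "$per_instance" in
        [1-7]) ;;
        *) echo "slices per instance must be between 1 and 7" >&2; return 1 ;;
    esac
    echo $(( TOTAL_COMPUTE_SLICES / per_instance ))
}

for slices in 1 2 3 7; do
    echo "${slices}-slice instances per GPU: $(max_instances "$slices")"
done
```

Treat the result as an upper bound: real profiles also reserve memory slices, so the profile list that nvidia-smi mig -lgip reports for your GPU is the authoritative source.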


Why Use NVIDIA MIG with Slurm on Azure?

Combining NVIDIA MIG with Slurm on Azure offers several advantages for organizations running AI, machine learning, and HPC workloads:

  1. Improved GPU Utilization: MIG allows multiple users or jobs to share a single GPU without interfering with each other, ensuring that GPU resources are fully utilized. This is ideal for running smaller, concurrent workloads.
  2. Granular Resource Allocation: Slurm is a robust workload manager that can efficiently schedule jobs and allocate compute resources. With MIG, Slurm can assign specific GPU instances to each job, giving you fine-grained control over how GPU resources are allocated.
  3. Cost Efficiency: Instead of purchasing multiple GPUs or spinning up entire VM instances with single GPUs for smaller workloads, MIG enables you to split a single GPU, reducing infrastructure costs while still maximizing performance.
  4. Scalability on Azure: With Azure, you can easily scale out your GPU-accelerated Slurm clusters, adding or removing MIG-enabled GPU instances based on workload demands.

Prerequisites

Before getting started, ensure you have the following:

  • An Azure subscription.
  • Azure CLI installed and configured.
  • A basic understanding of Slurm and NVIDIA GPU drivers.
  • Access to NVIDIA A100 or H100 GPUs in Azure, which support MIG.
  • CUDA installed for GPU-accelerated workloads.

Step 1: Set Up an Azure Virtual Machine with A100 or H100 GPUs

The first step is to set up an Azure VM with an NVIDIA A100 or H100 GPU that supports the MIG feature. Here’s how to provision a VM in Azure with a supported GPU:

  1. Log in to the Azure CLI:

      az login

  2. Create a Resource Group: Create a resource group where your VM and resources will reside:

      az group create --name myResourceGroup --location eastus

  3. Provision a VM with NVIDIA GPUs: Deploy a VM with A100 or H100 GPUs. The Standard_ND96asr_v4 SKU, which provides eight A100 40 GB GPUs, is a good choice for MIG workloads. The older UbuntuLTS image alias has been retired from the Azure CLI, so this example uses Ubuntu2204:

      az vm create \
        --resource-group myResourceGroup \
        --name myMIGVM \
        --image Ubuntu2204 \
        --size Standard_ND96asr_v4 \
        --admin-username azureuser \
        --generate-ssh-keys

  4. Install NVIDIA Drivers: Once the VM is running, SSH into it and install the NVIDIA drivers to enable GPU support:

      sudo apt update
      sudo apt install -y nvidia-driver-525 nvidia-utils-525

  5. Reboot the VM: Reboot the VM so that the new driver is loaded:

      sudo reboot
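After the reboot, it can help to confirm that the driver sees the GPUs and to check their current MIG mode before moving on. The sketch below is one way to do that, assuming the CSV output of nvidia-smi's --query-gpu option with the index, name, and mig.mode.current fields. The function name gpus_without_mig is hypothetical, and the sample rows are illustrative rather than captured from a real VM.

```shell
#!/usr/bin/env bash
# gpus_without_mig
# Reads "index, name, mig.mode.current" CSV rows on stdin and prints the
# index of every GPU whose MIG mode is anything other than "Enabled".
gpus_without_mig() {
    awk -F', *' '$3 != "Enabled" { print $1 }'
}

# Illustrative sample rows (not real output from a specific VM).
sample_rows='0, NVIDIA A100-SXM4-40GB, Enabled
1, NVIDIA A100-SXM4-40GB, Disabled'

printf '%s\n' "$sample_rows" | gpus_without_mig
```

On the VM itself, you would feed the function live data instead of the sample: nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv,noheader | gpus_without_mig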

Step 2: Enable and Configure MIG on Your GPU

Now that the VM is set up and has GPU drivers installed, you can enable the MIG feature on your A100 or H100 GPU. Here’s how:

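  1. Enable MIG Mode: Use the nvidia-smi tool to enable MIG mode. The Standard_ND96asr_v4 VM has eight GPUs, so to keep the example small the commands below target only GPU 0 with the -i option:

      sudo nvidia-smi -i 0 -mig 1

     If the GPU is in use, the change stays pending until the GPU is reset or the VM is rebooted.

  2. Create MIG Instances: Once MIG mode is enabled, you can create up to 7 GPU instances per GPU, depending on your workload needs. Here’s an example of creating two instances:

      sudo nvidia-smi mig -i 0 -cgi 19,19 -C

     The 19 is a GPU instance profile ID. Each profile defines how much of the GPU (memory and compute cores) is assigned to an instance, and the -C flag creates the matching compute instances in the same step. On the 40 GB A100, profile 19 is 1g.5gb, the smallest slice, so this command creates two small instances with roughly 5 GB of memory each. On 80 GB GPUs the same ID maps to a 1g.10gb profile with roughly 10 GB. Run nvidia-smi mig -lgip to list the profiles and IDs available on your GPU.

  3. Verify MIG Instances: To confirm that your MIG instances were created, run the following:

      nvidia-smi

     You should see the two newly created MIG instances listed.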
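If you want to script the verification step rather than read the output by eye, one option is to count the MIG entries in the device listing. This is a minimal sketch that assumes the listing format printed by nvidia-smi -L, where each MIG device appears on an indented line beginning with "MIG". The function name count_mig_devices is hypothetical, and the sample listing is illustrative with placeholder UUIDs.

```shell
#!/usr/bin/env bash
# count_mig_devices
# Reads a device listing on stdin and prints the number of indented
# lines that start with "MIG", one per MIG device.
count_mig_devices() {
    # grep -c exits non-zero when the count is 0, so normalise the status.
    grep -cE '^[[:space:]]+MIG[[:space:]]' || true
}

# Illustrative sample listing with placeholder UUIDs.
sample_listing='GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-aaaa)
  MIG 1g.5gb      Device  0: (UUID: MIG-bbbb)
  MIG 1g.5gb      Device  1: (UUID: MIG-cccc)'

printf '%s\n' "$sample_listing" | count_mig_devices
```

On the VM, nvidia-smi -L | count_mig_devices should report the number of instances you created, which is two in this walkthrough.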

Step 3: Install and Configure Slurm on Azure

With MIG enabled on your GPU, the next step is to install and configure Slurm to manage jobs and allocate GPU resources.

  1. Install Slurm: SSH into your VM and install Slurm:

      sudo apt install -y slurm-wlm

  2. Configure Slurm for MIG: Update your Slurm configuration to account for the GPU instances. Add the following lines to your slurm.conf file:

      GresTypes=gpu
      NodeName=myMIGVM Gres=gpu:2 CPUs=48 RealMemory=180000 Sockets=1 CoresPerSocket=24 ThreadsPerCore=2
      PartitionName=gpu_partition Nodes=myMIGVM Default=YES MaxTime=INFINITE State=UP

     This tells Slurm that your VM has two GPU instances (as configured with MIG) and defines the number of CPUs and the amount of memory available for job scheduling. The CPU and memory values are examples; slurmd -C prints the values detected on the node, which you can use to adjust them. Slurm also needs a gres.conf that maps the gpu GRES to the MIG devices. With a Slurm build that includes NVML support (21.08 or later), a single AutoDetect=nvml line in gres.conf does this.

  3. Set Up cgroup for GRES: To ensure that each job can access only the GPU instances it was allocated, enable device constraints. Add the following to your cgroup.conf file:

      ConstrainDevices=yes

     The GresTypes=gpu setting belongs in slurm.conf, where it was added in the previous step, not in cgroup.conf. Device constraints only take effect when slurm.conf also sets TaskPlugin=task/cgroup.

  4. Restart Slurm: After making these changes, restart the Slurm controller and node daemons:

      sudo systemctl restart slurmctld slurmd
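Once the services are back up, you can check that the controller registered the GRES you configured. The sketch below pulls the Gres field out of node information in the key=value layout that scontrol show node prints. The function names node_gres and gres_matches are hypothetical, and the sample text is an abbreviated, illustrative stand-in for real output.

```shell
#!/usr/bin/env bash
# node_gres
# Reads node information on stdin and prints the value of the first
# whitespace-separated field that starts with "Gres=".
node_gres() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^Gres=/) {
                sub(/^Gres=/, "", $i)
                print $i
                exit
            }
        }
    }'
}

# gres_matches EXPECTED
# Succeeds when the Gres value read from stdin equals EXPECTED.
gres_matches() {
    [ "$(node_gres)" = "$1" ]
}

# Abbreviated, illustrative sample of node information.
sample_node='NodeName=myMIGVM Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.00
   Gres=gpu:2
   State=IDLE ThreadsPerCore=2'

if printf '%s\n' "$sample_node" | gres_matches "gpu:2"; then
    echo "node advertises the expected GRES"
fi
```

On the cluster, scontrol show node myMIGVM | gres_matches gpu:2 performs the same check against the live controller.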

Step 4: Submit a Job to Slurm Using MIG GPUs

Now that Slurm is configured and running, you can submit a job to the MIG-enabled GPUs.

  1. Create a Job Script: Here’s an example job script that runs a GPU-accelerated task using one of the MIG instances:

      #!/bin/bash
      #SBATCH --job-name=gpu_test
      #SBATCH --gres=gpu:1        # Request 1 GPU instance
      #SBATCH --time=00:30:00
      #SBATCH --output=output.log

      module load cuda/11.3
      ./my_gpu_program

  2. Submit the Job: Submit the job to Slurm using the following command:

      sbatch my_job_script.sh

  3. Monitor the Job: You can monitor the progress of your job using the squeue command:

      squeue -u azureuser

  4. Check GPU Utilization: Once the job starts, use nvidia-smi to confirm that the job is utilizing the correct MIG instance:

      nvidia-smi
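It can also be useful to record, from inside the job, which device the scheduler handed over. The sketch below assumes that Slurm exports SLURM_JOB_ID to the job and, when GPU GRES devices are mapped in the Slurm configuration, sets CUDA_VISIBLE_DEVICES for it. The function name describe_allocation is hypothetical.

```shell
#!/usr/bin/env bash
# describe_allocation
# Prints the job ID and the devices visible to the job, falling back to
# placeholders when the variables are not set (for example, outside Slurm).
describe_allocation() {
    echo "job=${SLURM_JOB_ID:-none} devices=${CUDA_VISIBLE_DEVICES:-unset}"
}

describe_allocation
```

Calling the function near the top of the job script puts this line in output.log, so you can match each job to the instance it used after the fact.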

Step 5: Optimize and Scale MIG Workloads

With MIG and Slurm set up, you can now optimize the workload distribution across the GPU instances. For example:

  1. Multiple Concurrent Jobs: Once a GPU is split into multiple MIG instances, Slurm can schedule a separate job on each instance. Several GPU jobs then run concurrently on the same physical GPU, each with its own isolated share of compute and memory.
  2. Auto-Scaling in Azure: Use Azure’s autoscaling capabilities to add GPU VMs as workload demand increases and remove them when it drops. For example, you can configure an Azure Virtual Machine Scale Set to scale out the GPU nodes as needed.
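To illustrate the concurrent-jobs pattern, the sketch below submits one job per MIG instance, each requesting a single GPU GRES. It uses only the sbatch options --gres and --job-name. The function name submit_per_instance and the DRY_RUN switch are hypothetical conveniences for this example; with DRY_RUN set, the commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# submit_per_instance COUNT SCRIPT
# Submits COUNT copies of SCRIPT, each asking for one GPU GRES.
# When DRY_RUN is set to a non-empty value, prints the commands instead.
submit_per_instance() {
    local count="$1" script="$2" i=0
    while [ "$i" -lt "$count" ]; do
        # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
        ${DRY_RUN:+echo} sbatch --gres=gpu:1 --job-name="mig_${i}" "$script"
        i=$((i + 1))
    done
}

# Print the two submissions that would fill both MIG instances.
DRY_RUN=1
submit_per_instance 2 my_job_script.sh
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so each submission gets its own job name. Unset DRY_RUN to submit for real.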

Conclusion

Combining the power of NVIDIA MIG with Slurm on Azure provides a flexible, scalable, and efficient way to manage GPU resources for HPC, AI, and machine learning workloads. MIG’s ability to partition a GPU into smaller instances allows for granular control and optimal resource usage, especially in multi-tenant or multi-job environments.
