Accelerating Workloads with Azure Batch and NVIDIA H100 GPUs
The demand for high-performance computing (HPC) and AI applications is rapidly increasing, and cloud services like Microsoft Azure offer the flexibility to scale such workloads efficiently. One of Azure’s most powerful services for batch processing is Azure Batch, which allows you to run large-scale parallel jobs on a pool of Virtual Machines (VMs). To further accelerate your workloads, Azure supports GPU-enabled VMs, including those equipped with NVIDIA H100 GPUs, which are ideal for AI training, deep learning, and high-performance computing tasks.
In this blog post, we’ll walk through how to set up and use Azure Batch with NVIDIA H100 GPUs to run compute-heavy tasks efficiently.
Why Use Azure Batch with NVIDIA H100 GPUs?
- Scalability: Azure Batch manages the pool of VMs for you and can grow or shrink it based on pending work once you define an autoscale formula. Pairing that with H100 GPUs lets you run many GPU-bound tasks in parallel instead of queuing them on a single machine.
- Efficiency: You pay for GPU VMs only while nodes are allocated, so you can bring up H100 capacity for a job and release it afterward. The H100 is designed for AI workloads, which shortens the time those nodes need to stay allocated.
- Flexibility: The H100 GPU can accelerate both AI training and inference, and with Azure Batch, you can handle a wide variety of workloads including large-scale simulations, model training, rendering, and more.
Prerequisites
Before you get started, make sure you have the following:
- An Azure subscription.
- Azure CLI installed.
- An Azure Batch account (Step 1 shows how to create one).
- Familiarity with NVIDIA GPU drivers and software like CUDA for running GPU-based applications.
Step 1: Set Up an Azure Batch Account
To use Azure Batch, you first need to create a Batch account.
- Log in to the Azure portal and navigate to Batch accounts.
- Click Create, then select the resource group and specify the location where you want the Batch account to reside.
- Define a Batch account name and click Review + Create to finish the process.
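The same portal steps can be scripted with the Azure CLI from the prerequisites. The following is a minimal sketch, not the only way to do it: the resource group name, account name, and region are placeholders, and the `run` helper is a convenience of this sketch (not part of `az`) that prints each command unless you set `DRY_RUN=0`.

```bash
#!/bin/bash
# Sketch: create a resource group and a Batch account with the Azure CLI.
# All names and the region below are placeholders, not values from this post.
set -eu

RESOURCE_GROUP="${RESOURCE_GROUP:-batch-h100-rg}"
BATCH_ACCOUNT="${BATCH_ACCOUNT:-mybatchh100}"   # 3-24 lowercase letters and digits
LOCATION="${LOCATION:-eastus}"                   # use a region where you have GPU quota

# Helper for this sketch: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

create_batch_account() {
  run az group create --name "$RESOURCE_GROUP" --location "$LOCATION"
  run az batch account create --name "$BATCH_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --location "$LOCATION"
  # Authenticate later "az batch ..." commands against this account.
  run az batch account login --name "$BATCH_ACCOUNT" \
    --resource-group "$RESOURCE_GROUP" --shared-key-auth
}

create_batch_account
```

Run it once as-is to review the commands, then again with `DRY_RUN=0` after `az login` to create the resources.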
Step 2: Create a Pool of Virtual Machines with NVIDIA H100 GPUs
Now, let’s create a pool of VMs that includes NVIDIA H100 GPUs.
- Navigate to the Batch Account you created and select Pools under Batch Service.
- Click Add to create a new pool.
- Choose VM Size:
  - For GPU-accelerated workloads, select a size from the NC-series or ND-series. For the NVIDIA H100, choose an ND H100 v5 size such as Standard_ND96isr_H100_v5, which provides eight H100 GPUs per VM. Make sure your Batch account has quota for this VM family in the chosen region.
- Set Node Count and Scaling:
  - Set the Target number of nodes based on your expected job workload. You can resize the pool later or let an autoscale formula adjust it.
- Select Image:
  - Choose an image that Batch supports for this VM size and that the NVIDIA drivers support, such as Ubuntu 20.04 LTS or Windows Server 2022. ND H100 v5 sizes require a Generation 2 image.
- Install GPU Drivers:
  - Use a start task to install the NVIDIA GPU drivers so that the nodes can use the H100 GPUs. The start task must run with elevated (admin) privileges.
Here’s an example start task for Linux that installs the GPU drivers:
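    #!/bin/bash
    # Install NVIDIA drivers on Ubuntu (run this start task with admin privileges)
    set -e
    # Batch reruns the start task after every reboot, so stop here once the driver works
    if nvidia-smi > /dev/null 2>&1; then
        exit 0
    fi
    sudo apt-get update
    sudo apt-get install -y nvidia-driver-525
    sudo reboot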
For Windows, install the NVIDIA driver from a start task in the same way, or use the NVIDIA GPU Driver Extension that Microsoft provides for N-series VMs.
- Create the Pool: Once you’ve configured your pool, click Create to finish setting it up.
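A pool like this can also be described in a JSON file and created from the Azure CLI. The sketch below makes several assumptions: the pool ID and file path are placeholders, the image SKU (`20_04-lts-gen2`) and node agent SKU should be confirmed against the images your account supports, and the start task installs the driver without rebooting, which may not be enough on every image. The `run` helper is specific to this sketch and prints commands unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Sketch: write a pool specification to JSON, then create the pool from it.
# Pool ID, file path, image SKU, and node agent SKU are assumptions to verify.
set -eu

POOL_ID="${POOL_ID:-h100-pool}"
VM_SIZE="${VM_SIZE:-Standard_ND96isr_H100_v5}"        # ND H100 v5, 8 GPUs per VM
IMAGE_SKU="${IMAGE_SKU:-20_04-lts-gen2}"              # assumed Generation 2 SKU
POOL_SPEC="${POOL_SPEC:-${TMPDIR:-/tmp}/h100-pool.json}"

# Helper for this sketch: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

write_pool_spec() {
  cat > "$POOL_SPEC" <<EOF
{
  "id": "${POOL_ID}",
  "vmSize": "${VM_SIZE}",
  "targetDedicatedNodes": 1,
  "virtualMachineConfiguration": {
    "imageReference": {
      "publisher": "canonical",
      "offer": "0001-com-ubuntu-server-focal",
      "sku": "${IMAGE_SKU}",
      "version": "latest"
    },
    "nodeAgentSKUId": "batch.node.ubuntu 20.04"
  },
  "startTask": {
    "commandLine": "/bin/bash -c 'apt-get update && apt-get install -y nvidia-driver-525'",
    "userIdentity": { "autoUser": { "scope": "pool", "elevationLevel": "admin" } },
    "waitForSuccess": true
  }
}
EOF
}

write_pool_spec
# Check which images and node agent SKUs your Batch account supports.
run az batch pool supported-images list --output table
# Create the pool from the JSON specification.
run az batch pool create --json-file "$POOL_SPEC"
```

Keeping the pool in a JSON file makes the elevated start task explicit and lets you version the configuration alongside your job scripts.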
Step 3: Submit a Job to Azure Batch Using GPUs
Once your pool of VMs with NVIDIA H100 GPUs is ready, you can submit jobs to the Batch account.
- Create a Job: Navigate to Jobs under your Batch account and click Add to create a new job. Assign the job to the pool of GPU-enabled VMs you just created.
- Define Tasks: A job consists of one or more tasks, each representing a unit of work. Define each task by specifying the command line that should run on a GPU-enabled node.
For example, if you are running an AI training task, the command might look like this:
    python train_model.py --data_path /data --epochs 100
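Batch schedules each task onto a node in your pool, so independent tasks run in parallel across nodes. A single task does not span nodes unless you configure it as a multi-instance task, and train_model.py must itself use the node's GPUs (for example through a CUDA-enabled framework) to benefit from the H100s.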
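Job and task creation can be scripted as well. This sketch assumes a pool and job ID of your choosing and a hypothetical data layout in which the training data is split into `/data/shard-N` directories, one task per shard. It also assumes `train_model.py` is already available on the node, for example through resource files or an application package. The `run` helper is specific to this sketch and prints commands unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Sketch: create a job on the GPU pool and add one training task per data shard.
# The IDs and the /data/shard-N layout are hypothetical.
set -eu

POOL_ID="${POOL_ID:-h100-pool}"
JOB_ID="${JOB_ID:-h100-training-job}"
SHARDS="${SHARDS:-3}"

# Helper for this sketch: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

submit_job() {
  run az batch job create --id "$JOB_ID" --pool-id "$POOL_ID"
  i=1
  while [ "$i" -le "$SHARDS" ]; do
    # Batch runs each task's command line on one node of the pool.
    run az batch task create --job-id "$JOB_ID" --task-id "train-$i" \
      --command-line "/bin/bash -c 'python train_model.py --data_path /data/shard-$i --epochs 100'"
    i=$((i + 1))
  done
}

submit_job
```

Splitting the work into several tasks is what lets Batch use more than one node at a time; a single task would occupy only one.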
Step 4: Monitor Job Execution
Azure Batch provides tools to monitor your job and pool in real-time.
- Navigate to the Jobs section to check the status of your running jobs.
- You can monitor the node status, task progress, and even set up alerts for task completions or failures.
- If your workload grows, you can resize the pool manually or enable autoscaling with a formula so that Azure Batch adjusts the number of nodes based on demand.
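The same checks are available from the Azure CLI, along with a way to turn on autoscaling. The sketch below assumes placeholder pool and job IDs and a simple formula of this sketch's own design: one node per pending task, capped at `MAX_NODES`. Treat the formula as a starting point to adapt, not a recommended policy. The `run` helper prints commands unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Sketch: inspect pool, node, and task state, then enable autoscaling.
# IDs, MAX_NODES, and the formula itself are assumptions to adapt.
set -eu

POOL_ID="${POOL_ID:-h100-pool}"
JOB_ID="${JOB_ID:-h100-training-job}"
MAX_NODES="${MAX_NODES:-4}"

# Helper for this sketch: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

# Autoscale formula: one node per pending task, capped at MAX_NODES.
autoscale_formula() {
  cat <<EOF
maxNodes = ${MAX_NODES};
pending = max(0, \$PendingTasks.GetSample(1));
\$TargetDedicatedNodes = min(maxNodes, pending);
\$NodeDeallocationOption = taskcompletion;
EOF
}

show_status() {
  run az batch pool show --pool-id "$POOL_ID" --query allocationState --output tsv
  run az batch node list --pool-id "$POOL_ID" \
    --query "[].{node:id, state:state}" --output table
  run az batch task list --job-id "$JOB_ID" \
    --query "[].{task:id, state:state, exitCode:executionInfo.exitCode}" --output table
}

enable_autoscale() {
  # PT5M is an ISO 8601 duration: re-evaluate the formula every five minutes.
  run az batch pool autoscale enable --pool-id "$POOL_ID" \
    --auto-scale-formula "$(autoscale_formula)" \
    --auto-scale-evaluation-interval PT5M
}

show_status
enable_autoscale
```

With `taskcompletion` as the deallocation option, Batch waits for running tasks to finish before removing a node, which matters for long training runs.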
Step 5: Clean Up Resources
Once your jobs are complete, it’s important to clean up the resources to avoid unnecessary charges. You can delete the pool of VMs or the entire Batch account depending on your needs.
- To delete the pool, go to the Pools section and select Delete for the relevant pool.
- Alternatively, scale the pool down to zero nodes to stop compute charges while keeping the pool configuration for later use.
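Both cleanup options can be scripted. This sketch assumes placeholder pool and job IDs and wraps the two choices in a hypothetical `cleanup` function: `pause` resizes the pool to zero nodes, and `delete` removes the job and the pool. The `run` helper prints commands unless `DRY_RUN=0` is set.

```bash
#!/bin/bash
# Sketch: either pause the pool (zero nodes) or delete the job and pool.
# The IDs are placeholders; "cleanup" is a helper of this sketch.
set -eu

POOL_ID="${POOL_ID:-h100-pool}"
JOB_ID="${JOB_ID:-h100-training-job}"

# Helper for this sketch: print each command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; else echo "+ $*"; fi
}

cleanup() {
  case "${1:-pause}" in
    pause)
      # Keep the pool definition but release every VM.
      run az batch pool resize --pool-id "$POOL_ID" --target-dedicated-nodes 0
      ;;
    delete)
      # Remove the job first, then the pool it ran on.
      run az batch job delete --job-id "$JOB_ID" --yes
      run az batch pool delete --pool-id "$POOL_ID" --yes
      ;;
    *)
      echo "usage: cleanup pause|delete" >&2
      return 1
      ;;
  esac
}

cleanup pause
```

If autoscaling is enabled on the pool, disable it first (`az batch pool autoscale disable --pool-id ...`), because a manual resize is rejected while a formula controls the node count.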
Best Practices for Using NVIDIA H100 GPUs with Azure Batch
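- Optimize Workload Distribution: Size your tasks so that each one keeps its GPUs busy. An ND H100 v5 node has eight GPUs, so either run one multi-GPU task per node or allow several tasks per node and pin each to its own GPU. Tasks that are too small spend more time on startup and data loading than on computation.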
- Use GPU-Optimized Libraries: Ensure that your workloads leverage GPU-optimized libraries like cuDNN or TensorRT to maximize performance.
- Scale Dynamically: Set up auto-scaling rules to dynamically scale the number of VMs in your pool based on workload demands. This helps in balancing cost and performance, especially when using expensive GPUs like the NVIDIA H100.
- Monitor GPU Utilization: Use tools like nvidia-smi to monitor GPU usage and confirm that tasks are keeping the GPUs busy. Under-utilized GPUs mean you are paying for capacity that your jobs do not use.
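As one way to act on the monitoring advice, the sketch below flags GPUs whose utilization falls under a threshold. The `flag_idle_gpus` function and the 50% default are assumptions of this sketch; it reads the `index, utilization` lines that `nvidia-smi` prints in CSV mode, so it can run inside a task on the node or against saved output.

```bash
#!/bin/bash
# Sketch: report GPUs whose utilization is below a threshold.
# flag_idle_gpus and the 50% default are illustrative choices.
set -eu

# Reads "index, utilization" lines on stdin, as printed by:
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
flag_idle_gpus() {
  awk -F', *' -v t="${1:-50}" \
    '$2 + 0 < t + 0 { print "GPU " $1 ": " $2 "% utilization (below " t "%)" }'
}

# Only query the driver when nvidia-smi is actually installed on this machine.
if command -v nvidia-smi > /dev/null 2>&1; then
  nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits \
    | flag_idle_gpus 50
fi
```

A single reading can be misleading during data loading or checkpointing, so sample several times over the life of a task before concluding that a GPU is idle.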
Conclusion
By combining the scalability of Azure Batch with NVIDIA H100 GPUs, you can accelerate AI, machine learning, and other compute-heavy workloads while paying for GPU capacity only when you need it. Azure Batch handles job scheduling, pool scaling, and task execution, and the H100 supplies the compute for each task. Whether you’re running AI training or simulations, the same pattern applies: create a GPU pool, submit tasks, monitor them, and release the nodes when the work is done.