GPU quickstart

This guide walks through provisioning GPU bare metal servers as private nodes in a tenant cluster. It covers GPU-specific prerequisites, node type configuration, and workload verification.

The Getting Started guide covers NodeProvider setup, BMC credentials, and server registration.

Prerequisites

Complete the Getting Started guide first. You need a working NodeProvider with Metal3 deployed and at least one BareMetalHost in available state.

You also need:

  • GPU servers that have passed Metal3 inspection and are in available state.
  • One of the following: an OS image with GPU drivers pre-installed, a standard cloud image plus a cloud-init script that installs the drivers, or a standard cloud image with the NVIDIA GPU Operator deployed after provisioning. The GPU Operator is the most common approach for production fleets.
  • SSH keys configured as SSHKey resources if you need post-provision server access.

1. Label your GPU servers

Add labels to your BareMetalHost resources. The platform uses these labels to match servers to node types.

kubectl label baremetalhost server-01 -n metal3-system \
  gpu-model=h100 \
  gpu-count=8

These label keys are useful for GPU fleets:

  • gpu-model — GPU model, such as h100, a100, or l40s.
  • gpu-count — Number of GPUs per server.
  • rack or datacenter — Location for topology-aware scheduling.

2. Create an OSImage

An OSImage resource stores the OS image URL and checksum. Multiple node types can reference the same OSImage by name.

apiVersion: management.loft.sh/v1
kind: OSImage
metadata:
  name: ubuntu-noble-gpu
spec:
  properties:
    metal3.vcluster.com/image-url: https://your-registry.example.com/ubuntu-noble-gpu-amd64.img
    metal3.vcluster.com/image-checksum: sha256:abc123...
    metal3.vcluster.com/image-checksum-type: sha256

kubectl apply -f os-image.yaml

If you don't have a GPU-ready image, use a standard cloud image and install the NVIDIA driver via vcluster.com/user-data. See Configuration for user data options.
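As a sketch of the cloud-init route, user data passed through the vcluster.com/user-data property might look like the following. The driver package name and version here are assumptions; check NVIDIA's driver documentation for the branch that matches your GPU and OS.

```yaml
properties:
  vcluster.com/user-data: |
    #cloud-config
    package_update: true
    packages:
      # Assumed package name; pick the driver branch for your GPU and OS
      - nvidia-driver-570-server
    runcmd:
      # Confirm the driver loads on first boot
      - nvidia-smi
```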

3. Configure a GPU node type

Add a GPU node type to your NodeProvider. The resources field defines what the platform advertises to the Kubernetes scheduler.

nodeTypes:
  - name: "h100-8x"
    displayName: "H100 8x GPU"
    resources:
      cpu: "64"
      memory: 256Gi
      nvidia.com/gpu: "8"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          gpu-count: "8"
    properties:
      vcluster.com/os-image: ubuntu-noble-gpu
      vcluster.com/ssh-keys: my-ssh-key
Note: The platform validates CPU and memory against the hardware inventory collected during Metal3 inspection. It does not validate GPU resources against hardware. The nvidia.com/gpu value in resources is exactly what the scheduler sees, so make sure it matches the actual hardware.

4. Create a tenant cluster with GPU private nodes

Configure a vCluster to use the GPU node type as a private node.

privateNodes:
  enabled: true
  autoNodes:
    - provider: metal3-provider
      static:
        - name: gpu-nodes
          quantity: 1
          nodeTypeSelector:
            - property: vcluster.com/node-type
              value: h100-8x

The platform selects an available h100-8x server, installs the OS, and joins it to the tenant cluster as a worker node.

5. Verify GPU access

Once the node joins, verify the GPU is visible to the scheduler.

vcluster connect my-cluster
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'

The output should show "nvidia.com/gpu": "8" in the node capacity. If it's missing, check that the NVIDIA device plugin or GPU Operator is installed on the node.
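To go beyond scheduler capacity and confirm the driver works end to end, you can run a short smoke-test pod. This is a sketch; the CUDA image tag is an assumption, so pick one compatible with the driver installed on your nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      # Assumed image tag; use one that matches your installed driver version
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

After the pod completes, kubectl logs gpu-smoke-test should print the familiar nvidia-smi table listing the allocated GPU.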

What the platform handles

  • Selecting a BareMetalHost that matches the node type label selector.
  • Allocating an IP and generating network configuration.
  • Generating cloud-init and joining the server to the tenant cluster.
  • Releasing the IP and returning the server to the pool on deprovision.

What you manage

  • Enrolling BareMetalHost resources and applying GPU labels.
  • Providing an OS image with GPU driver support, or installing drivers via cloud-init.
  • Deploying the NVIDIA device plugin or GPU Operator inside the tenant cluster.
  • Setting the correct nvidia.com/gpu count in the node type to match the hardware.
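For the device plugin item above, a common approach is installing the GPU Operator with Helm inside the tenant cluster. The release name and namespace below are illustrative.

```shell
# Run against the tenant cluster, e.g. after `vcluster connect my-cluster`
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Deploys the driver (if needed), container toolkit, and device plugin
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```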