GPU quickstart

This guide walks through provisioning GPU bare metal servers as private nodes in a tenant cluster. It covers GPU-specific prerequisites, node type configuration, and workload verification.

The Getting Started guide covers NodeProvider setup, BMC credentials, and server registration.

Prerequisites

Complete the Getting Started guide first. You need a working NodeProvider with Metal3 deployed and at least one BareMetalHost in available state.

You also need:

  • GPU servers that have passed Metal3 inspection and are in available state.
  • One of the following: an OS image with GPU drivers pre-installed, a standard cloud image plus a cloud-init script that installs the drivers, or a standard cloud image with the NVIDIA GPU Operator deployed after provisioning. The GPU Operator is the most common approach for production fleets.
  • SSH keys configured as SSHKey resources if you need post-provision server access.

1. Label your GPU servers

Add labels to your BareMetalHost resources. The platform uses these labels to match servers to node types.

kubectl label baremetalhost server-01 -n metal3-system \
  gpu-model=h100 \
  gpu-count=8

These label keys are useful for GPU fleets:

  • gpu-model — GPU model, such as h100, a100, or l40s.
  • gpu-count — Number of GPUs per server.
  • rack or datacenter — Location for topology-aware scheduling.

2. Create an OSImage

An OSImage resource stores the OS image URL and checksum. Multiple node types can reference the same OSImage by name.

apiVersion: management.loft.sh/v1
kind: OSImage
metadata:
  name: ubuntu-noble-gpu
spec:
  properties:
    metal3.vcluster.com/image-url: https://your-registry.example.com/ubuntu-noble-gpu-amd64.img
    metal3.vcluster.com/image-checksum: sha256:abc123...
    metal3.vcluster.com/image-checksum-type: sha256

kubectl apply -f os-image.yaml

If you don't have a GPU-ready image, use a standard cloud image and install the NVIDIA driver via vcluster.com/user-data. See Configuration for user data options.
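As a sketch of the cloud-init route, user data passed through the vcluster.com/user-data property might look like the following. The driver package name and version here are assumptions; check NVIDIA's driver documentation for the branch that matches your GPU and OS.

```yaml
properties:
  vcluster.com/user-data: |
    #cloud-config
    package_update: true
    packages:
      # Assumed package name; pick the driver branch for your GPU and OS
      - nvidia-driver-570-server
    runcmd:
      # Confirm the driver loads on first boot
      - nvidia-smi
```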

3. Configure a GPU node type

Add a GPU node type to your NodeProvider. The resources field defines what the platform advertises to the Kubernetes scheduler.

nodeTypes:
  - name: "h100-8x"
    displayName: "H100 8x GPU"
    resources:
      cpu: "64"
      memory: 256Gi
      nvidia.com/gpu: "8"
    bareMetalHosts:
      selector:
        matchLabels:
          gpu-model: h100
          gpu-count: "8"
    properties:
      vcluster.com/os-image: ubuntu-noble-gpu
      vcluster.com/ssh-keys: my-ssh-key
Note: The platform validates CPU and memory against the hardware inventory collected during Metal3 inspection. It does not validate GPU resources against hardware. The nvidia.com/gpu value in resources is exactly what the scheduler sees, so make sure it matches the actual hardware.

4. Create a tenant cluster with GPU private nodes

Configure a vCluster to use the GPU node type as a private node.

privateNodes:
  enabled: true
  autoNodes:
    - provider: metal3-provider
      static:
        - name: gpu-nodes
          quantity: 1
          nodeTypeSelector:
            - property: vcluster.com/node-type
              value: h100-8x

The platform selects an available h100-8x server, installs the OS, and joins it to the tenant cluster as a worker node.

5. Verify GPU access

Once the node joins, verify the GPU is visible to the scheduler.

vcluster connect my-cluster
kubectl get nodes -o json | jq '.items[].status.capacity | select(."nvidia.com/gpu")'

The output should show "nvidia.com/gpu": "8" in the node capacity. If it's missing, check that the NVIDIA device plugin or GPU Operator is installed on the node.
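To go beyond scheduler capacity and confirm the driver works end to end, you can run a short smoke-test pod. This is a sketch; the CUDA image tag is an assumption, so pick one compatible with the driver installed on your nodes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      # Assumed image tag; use one that matches your installed driver version
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: "1"
```

After the pod completes, kubectl logs gpu-smoke-test should print the familiar nvidia-smi table listing the allocated GPU.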

What the platform handles

  • Selecting a BareMetalHost that matches the node type label selector.
  • Allocating an IP and generating network configuration.
  • Generating cloud-init and joining the server to the tenant cluster.
  • Releasing the IP and returning the server to the pool on deprovision.

What you manage

  • Enrolling BareMetalHost resources and applying GPU labels.
  • Providing an OS image with GPU driver support, or installing drivers via cloud-init.
  • Deploying the NVIDIA device plugin or GPU Operator inside the tenant cluster.
  • Setting the correct nvidia.com/gpu count in the node type to match the hardware.
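For the device plugin item above, a common approach is installing the GPU Operator with Helm inside the tenant cluster. The release name and namespace below are illustrative.

```shell
# Run against the tenant cluster, e.g. after `vcluster connect my-cluster`
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Deploys the driver (if needed), container toolkit, and device plugin
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace
```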