Install the NVIDIA driver:

Add the apt PPA. This is the community-maintained NVIDIA graphics driver repository, which usually carries newer versions than the drivers shipped in Ubuntu's official repositories.

add-apt-repository ppa:graphics-drivers/ppa -y

Update the apt package index

apt update

Install a specific driver version. Use the "apt-cache search nvidia-driver" command to list the available driver versions.

apt-cache search nvidia-driver | grep 570

Install the 570 series driver (using the full package names)

apt install -y nvidia-driver-570 nvidia-dkms-570 nvidia-utils-570

Reboot and verify

reboot

Check that the kernel modules are loaded

lsmod | grep nvidia

Check the driver version

nvidia-smi

Verify the device nodes

ls -l /dev/nvidia*

Pin the driver packages so that future upgrades do not break the setup

apt-mark hold nvidia-driver-570 nvidia-dkms-570 nvidia-utils-570
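
Confirm that the hold took effect (a quick check):

apt-mark showhold | grep nvidia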


In the output below, the "CUDA Version: 12.8" in the top-right corner is the highest CUDA version supported by this driver.

~# nvidia-smi 
Thu Jan 15 16:45:34 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   22C    P8             10W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
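
For scripted checks, nvidia-smi can also print just the fields of interest; a small example (the query fields can be adjusted as needed):

nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv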


1. Initialize the system

2. Install containerd


containerd configuration:

    nvidia-container-toolkit is the key bridge that gives containers access to GPU resources.

Create the configuration directory

mkdir /etc/containerd

Generate the default configuration file

containerd config default > /etc/containerd/config.toml

Modify the sandbox_image setting (line 67 of the generated file)

vim /etc/containerd/config.toml
sandbox_image = "registry.k8s.io/pause:3.10"   # changed from 3.9 to 3.10

Set SystemdCgroup in the runc options

vim /etc/containerd/config.toml
SystemdCgroup = true   # changed from false to true
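
The same two edits can also be made non-interactively; a sketch with sed, assuming the generated default references pause:3.9 and SystemdCgroup = false as noted above:

sed -i 's#registry.k8s.io/pause:3.9#registry.k8s.io/pause:3.10#' /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml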

Enable containerd at boot and start it now

systemctl enable --now containerd

Verify the version

containerd --version

Some systems do not have this directory by default and it must be created first (it is needed for the apt source added below)

mkdir -p /etc/apt/sources.list.d


Install the NVIDIA Container Toolkit:

        The Toolkit automatically injects the host's GPU driver into containers; the application-level CUDA libraries come from the container image, so the image itself does not need to ship a GPU driver.

Other features:

        1. GPU resources can be allocated on demand.

        2. Parameterized configuration, which simplifies management.

        3. Containers are granted only minimal privileges.

Add the nvidia-container-toolkit apt repository:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Install nvidia-container-toolkit:

apt-get update
apt-get install -y nvidia-container-toolkit

The nvidia-ctk command configures the container runtime:

containerd: a new file, /etc/containerd/conf.d/99-nvidia.toml, is created

# nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Using config version 2                       
INFO[0000] Using CRI runtime plugin name "io.containerd.grpc.v1.cri" 
INFO[0000] Wrote updated config to /etc/containerd/conf.d/99-nvidia.toml
INFO[0000] It is recommended that containerd daemon be restarted.

docker: the runtime is appended to daemon.json.

# nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.

Inspect daemon.json: a runtimes section has been added.

# cat /etc/docker/daemon.json 
{
    "registry-mirrors": [
        "https://docker.m.daocloud.io",
        "https://docker.nju.edu.cn",
        "http://hub-mirror.c.163.com"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Restart containerd:

systemctl restart containerd

Or restart docker:

systemctl restart docker

Check the version:

# nvidia-container-runtime --version
NVIDIA Container Runtime version 1.17.5
commit: f785e908a7f72149f8912617058644fd84e38cde
spec: 1.2.1
runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.21.8
libseccomp: 2.5.3
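
A quick smoke test of the toolkit (assuming Docker is installed and was configured with the nvidia runtime above; the CUDA image tag is only an example):

docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

If the driver injection works, this prints the same nvidia-smi table as on the host.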


Generate the CDI (Container Device Interface) specification file:

    This file lets the containerd runtime recognize the host's NVIDIA GPUs and inject them safely into containers. In short, CDI provides a standardized way to tell a container runtime how to hand host devices (such as GPUs) over to a container: it declares which device files and library files must be mounted into the container and with what permissions. Once nvidia.yaml has been generated successfully, containers can use the GPU via CDI.

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

Verify that the file exists

cat /etc/cdi/nvidia.yaml
kind: nvidia.com/gpu
...
devices:
- name: gpu0
  ...

List the CDI devices:

nvidia-ctk cdi list

Expected output:
nvidia.com/gpu=0
nvidia.com/gpu=1
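
As an illustration of how these CDI device names are consumed outside Kubernetes: a runtime with native CDI support, such as Podman 4.1+ (not part of this setup, shown only as an example), can request a device directly by name:

podman run --rm --device nvidia.com/gpu=0 nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi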


3. Install etcd, Kubernetes, and the other components.


After the cluster has been installed:

Create a RuntimeClass:

cat > runtimeclass.yaml <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

kubectl apply -f runtimeclass.yaml

kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    120m
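
The handler name must match the runtime name that nvidia-ctk registered with containerd; a quick way to double-check, using the file written earlier:

grep -A 3 'nvidia' /etc/containerd/conf.d/99-nvidia.toml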


Deploy the NVIDIA Device Plugin:

        The NVIDIA Device Plugin is the official core component for managing and scheduling NVIDIA GPU resources in a Kubernetes cluster.

1. The plugin discovers the GPUs on the node and registers them as resources with the kubelet.

2. It continuously monitors GPU health and reports it to the kubelet in real time.

3. Based on the scheduling decision, the kubelet asks the plugin to allocate specific GPU devices.

4. The plugin fulfils the request and configures the container runtime environment to use the assigned GPUs.

Method 1: deploy with a YAML manifest

wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.4/nvidia-device-plugin.yml

After downloading, the manifest needs runtimeClassName: nvidia added; a corrected version:

cat > nvidia-device-plugin-corrected.yml <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia  # key point: use the nvidia runtime
      hostNetwork: true
      hostPID: true
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.4
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
          - name: NVIDIA_VISIBLE_DEVICES
            value: "all"
          - name: NVIDIA_DRIVER_CAPABILITIES
            value: "compute,utility"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
EOF

After the changes:

kubectl apply -f nvidia-device-plugin-corrected.yml
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx

Method 2: deploy with Helm

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm repo list


# helm search repo
NAME                       	CHART VERSION	APP VERSION	DESCRIPTION                                       
nvdp/gpu-feature-discovery 	0.18.0       	0.18.0     	A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin  	0.18.0       	0.18.0     	A Helm chart for the nvidia-device-plugin on Ku...


# helm search repo nvdp/nvidia-device-plugin
NAME                     	CHART VERSION	APP VERSION	DESCRIPTION                                       
nvdp/nvidia-device-plugin	0.18.0       	0.18.0     	A Helm chart for the nvidia-device-plugin on Ku...


helm pull nvdp/nvidia-device-plugin --version 0.18.0

tar -xf nvidia-device-plugin-0.18.0.tgz

Edit values.yaml

cd nvidia-device-plugin

# Change runtimeClassName: null to runtimeClassName: nvidia
sed -i 's/runtimeClassName: null/runtimeClassName: nvidia/g' values.yaml

vim values.yaml

# Key point: use the nvidia runtime
runtimeClassName: nvidia

Label the nodes:

kubectl label nodes <node-name> feature.node.kubernetes.io/pci-10de.present=true

kubectl label nodes <node-name> feature.node.kubernetes.io/cpu-model.vendor_id=NVIDIA

kubectl label nodes <node-name> nvidia.com/gpu.present=true
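
Verify that the labels were applied (a quick check):

kubectl describe node <node-name> | grep -E 'nvidia.com/gpu.present|pci-10de'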

Or comment out the affinity section:

vim values.yaml

# In a test environment the affinity section can be commented out; otherwise the DaemonSet will not run on nodes that do not match these conditions
#affinity:
#  nodeAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      nodeSelectorTerms:
#      - matchExpressions:
#        # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
#        - key: feature.node.kubernetes.io/pci-10de.present
#          operator: In
#          values:
#          - "true"
#      - matchExpressions:
#        # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
#        - key: feature.node.kubernetes.io/cpu-model.vendor_id
#          operator: In
#          values:
#          - "NVIDIA"
#      - matchExpressions:
#        # We allow a GPU deployment to be forced by setting the following label to "true"
#        - key: "nvidia.com/gpu.present"
#          operator: In
#          values:
#          - "true"
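
With values.yaml edited, install the chart from the unpacked directory (the release name and namespace here are examples; adjust as needed):

helm upgrade --install nvidia-device-plugin . \
  --namespace kube-system \
  -f values.yaml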


After deploying nvidia-device-plugin, check that it is running:

# kubectl get pods -n kube-system nvidia-device-plugin-qrnk8 
NAME                         READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-qrnk8   1/1     Running   0          18m

kubectl logs -f -n kube-system nvidia-device-plugin-qrnk8

Once the device plugin is up, the node reports the GPU count (in the Capacity, Allocatable, and Allocated resources sections):

kubectl describe node 10-60-137-81 | grep nvidia.com/gpu
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  nvidia.com/gpu     0           0

Test whether the GPU can be used from inside a container:

cat > nvidia-runtime-test.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-runtime-test
  namespace: default
spec:
  runtimeClassName: nvidia
  containers:
  - name: test
    image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["sh", "-c", "nvidia-smi && sleep 3600"]
  restartPolicy: Never
EOF
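
Apply the manifest:

kubectl apply -f nvidia-runtime-test.yaml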

Check the output:

kubectl logs -f  nvidia-runtime-test
Fri Jan 16 05:02:18 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   22C    P8              9W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
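
The test pod above relies only on the nvidia RuntimeClass. In normal workloads you would also request the GPU through the extended resource registered by the device plugin, so the scheduler and kubelet handle allocation; a minimal sketch (the pod name and image are examples):

cat > gpu-request-test.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-request-test
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF

kubectl apply -f gpu-request-test.yaml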

Reference:

https://www.xcloudnote.com/d/fAAAAfOJX-9xp3