Deploying the NVIDIA driver and deploying a large model on Kubernetes
Install the NVIDIA driver:
Add the apt source. This is the community-maintained NVIDIA graphics driver PPA, which is usually newer than the version shipped in the official Ubuntu repositories.
add-apt-repository ppa:graphics-drivers/ppa -y
Update the apt package index
apt update
Install a specific driver version; use the "apt-cache search nvidia-driver" command to list the available driver versions
apt-cache search nvidia-driver | grep 570
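If the ubuntu-drivers-common package is installed, the following also shows which driver Ubuntu recommends for the detected GPU (an optional cross-check, not part of the original steps):
ubuntu-drivers devices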
Install the 570-series driver (using the full package names)
apt install -y nvidia-driver-570 nvidia-dkms-570 nvidia-utils-570
Reboot and verify
reboot
Check that the kernel modules are loaded
lsmod | grep nvidia
Check the driver version
nvidia-smi
Verify the device nodes
ls -l /dev/nvidia*
Pin the driver version so that future upgrades do not break it
apt-mark hold nvidia-driver-570 nvidia-dkms-570 nvidia-utils-570
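To confirm the hold took effect, list the held packages:
apt-mark showhold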
In the output below, the "CUDA Version: 12.8" field in the top-right corner indicates the highest CUDA version this driver supports.
~# nvidia-smi
Thu Jan 15 16:45:34 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   22C    P8             10W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
1. Initialize the system
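A rough sketch of what system initialization typically covers on an Ubuntu node prepared for kubeadm (illustrative commands, not from the original post; adjust to your environment):
# disable swap now and on every boot
swapoff -a
sed -ri 's/^([^#].*\sswap\s)/#\1/' /etc/fstab
# load the kernel modules containerd/Kubernetes need
cat > /etc/modules-load.d/k8s.conf <<EOF
overlay
br_netfilter
EOF
modprobe overlay && modprobe br_netfilter
# sysctl parameters required for bridged traffic and IP forwarding
cat > /etc/sysctl.d/k8s.conf <<EOF
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF
sysctl --system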
2. Install containerd
containerd configuration:
nvidia-container-toolkit is the key bridge that allows containers to access GPU resources.
Create the configuration directory
mkdir /etc/containerd
Generate the default configuration file
containerd config default > /etc/containerd/config.toml
Edit the sandbox_image setting (line 67)
vim /etc/containerd/config.toml
sandbox_image = "registry.k8s.io/pause:3.10"   # changed from 3.9 to 3.10
Change SystemdCgroup under the runc options
vim /etc/containerd/config.toml
SystemdCgroup = true   # changed from false to true
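Both edits can also be applied non-interactively; a sketch against the default config generated above (verify the result before restarting containerd):
sed -i 's#pause:3.9#pause:3.10#' /etc/containerd/config.toml
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
grep -E 'sandbox_image|SystemdCgroup' /etc/containerd/config.toml   # confirm both changes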
Enable containerd to start on boot and start it now
systemctl enable --now containerd
Verify the version
containerd --version
Some systems do not ship this directory by default, so create it first
mkdir -p /etc/apt/sources.list.d
Install the NVIDIA Container Toolkit:
The toolkit automatically injects the host's graphics driver into containers; the application-level CUDA libraries are shipped inside the container image, so the image itself does not need to install a graphics driver.
Other features:
1. GPU resources are allocated on demand.
2. Parameterized configuration makes management easier.
3. Containers are granted only minimal privileges.
Add the nvidia-container-toolkit apt source:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
Install nvidia-container-toolkit:
apt-get update
apt-get install -y nvidia-container-toolkit
The nvidia-ctk command is used to configure the container runtime:
containerd: a new file /etc/containerd/conf.d/99-nvidia.toml appears
# nvidia-ctk runtime configure --runtime=containerd
INFO[0000] Using config version 2
INFO[0000] Using CRI runtime plugin name "io.containerd.grpc.v1.cri"
INFO[0000] Wrote updated config to /etc/containerd/conf.d/99-nvidia.toml
INFO[0000] It is recommended that containerd daemon be restarted.
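Note: containerd only reads a drop-in file like this if its directory is imported from the main config. If /etc/containerd/config.toml does not already import it, an entry along these lines (path assumed to match the file above) can be added near the top of that file:
imports = ["/etc/containerd/conf.d/*.toml"]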
docker: the configuration is appended to daemon.json.
# nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json
INFO[0000] Wrote updated config to /etc/docker/daemon.json
INFO[0000] It is recommended that docker daemon be restarted.
Check daemon.json: a runtimes section has been added.
# cat /etc/docker/daemon.json
{
    "registry-mirrors": [
        "https://docker.m.daocloud.io",
        "https://docker.nju.edu.cn",
        "http://hub-mirror.c.163.com"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
Restart containerd:
systemctl restart containerd
Or restart docker:
systemctl restart docker
Check the version:
# nvidia-container-runtime --version
NVIDIA Container Runtime version 1.17.5
commit: f785e908a7f72149f8912617058644fd84e38cde
spec: 1.2.1

runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.21.8
libseccomp: 2.5.3
Generate the CDI (Container Device Interface) specification file:
This file lets the container runtime (containerd) recognize the host's NVIDIA GPUs and inject them into containers safely. In short, CDI is a standardized way of telling the container runtime how to hand host devices (such as GPUs) over to a container: it defines which device files and library files must be mounted into the container, and with which permissions. Once nvidia.yaml has been generated successfully, you can use the GPU through CDI when running containers.
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
Verify that the file exists
cat /etc/cdi/nvidia.yaml
kind: nvidia.com/gpu
...
devices:
- name: gpu0
...
Inspect the CDI configuration:
nvidia-ctk cdi list
Expected output:
nvidia.com/gpu=0
nvidia.com/gpu=1
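If Docker was configured above, a quick host-level smoke test can be run before moving on to Kubernetes (the image tag matches the one used in the Pod test later; the flags are standard Docker GPU options):
docker run --rm --runtime=nvidia --gpus all nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi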
3. Install etcd, Kubernetes, and the remaining components.
After the cluster installation is complete:
Create the RuntimeClass:
cat > runtimeclass.yaml <<EOF
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
kubectl apply -f runtimeclass.yaml
kubectl get runtimeclass
NAME     HANDLER   AGE
nvidia   nvidia    120m
Deploy the NVIDIA Device Plugin:
The NVIDIA Device Plugin is the official core component for managing and scheduling NVIDIA GPU resources in a Kubernetes cluster.
1. The plugin discovers the GPUs on the node and registers the nvidia.com/gpu resource with the kubelet.
2. It continuously monitors GPU health and reports it to the kubelet in real time.
3. Based on the scheduler's decision, the kubelet asks the plugin to allocate specific GPU devices.
4. The plugin responds and configures the container runtime environment to use the assigned GPUs (see the Pod example after this list).
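For reference, a minimal sketch of how a workload consumes the registered resource: the Pod requests nvidia.com/gpu in its resource limits and the scheduler places it on a node with a free GPU (the Pod name here is illustrative; the image is the same CUDA base image used later in this post):
cat > gpu-request-example.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-request-example
spec:
  runtimeClassName: nvidia
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request one whole GPU
EOF
kubectl apply -f gpu-request-example.yaml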
Method 1: deploy with a YAML manifest
wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.4/nvidia-device-plugin.yml
After downloading, modify it so that it uses runtimeClassName: nvidia
cat > nvidia-device-plugin-corrected.yml <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia   # key: use the nvidia runtime
      hostNetwork: true
      hostPID: true
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.4
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        securityContext:
          privileged: true
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
        - name: dev
          mountPath: /dev
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
      - name: dev
        hostPath:
          path: /dev
EOF
After making the change:
kubectl apply -f nvidia-device-plugin-corrected.yml
kubectl logs -n kube-system nvidia-device-plugin-daemonset-xxxxx
Method 2: deploy with Helm
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm repo list
# helm search repo
NAME                          CHART VERSION   APP VERSION   DESCRIPTION
nvdp/gpu-feature-discovery    0.18.0          0.18.0        A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin     0.18.0          0.18.0        A Helm chart for the nvidia-device-plugin on Ku...
# helm search repo nvdp/nvidia-device-plugin
NAME                          CHART VERSION   APP VERSION   DESCRIPTION
nvdp/nvidia-device-plugin     0.18.0          0.18.0        A Helm chart for the nvidia-device-plugin on Ku...
helm pull nvdp/nvidia-device-plugin --version 0.18.0
tar -xf nvidia-device-plugin-0.18.0.tgz
Edit values.yaml
cd nvidia-device-plugin
# change runtimeClassName: null to runtimeClassName: nvidia
sed -i 's/runtimeClassName: null/runtimeClassName: nvidia/g' values.yaml
vim values.yaml
# key point: use the nvidia runtime
runtimeClassName: nvidia
Label the nodes so they match the chart's node affinity (the terms are OR-ed, so any one of them is enough):
kubectl label nodes <node-name> feature.node.kubernetes.io/pci-10de.present=true
kubectl label nodes <node-name> feature.node.kubernetes.io/cpu-model.vendor_id=NVIDIA
kubectl label nodes <node-name> nvidia.com/gpu.present=true
Or comment out the affinity section:
vim values.yaml
# In a test environment the affinity block can be commented out; if it is kept and the
# node does not match any of the terms, the DaemonSet will not run.
#affinity:
#  nodeAffinity:
#    requiredDuringSchedulingIgnoredDuringExecution:
#      nodeSelectorTerms:
#        - matchExpressions:
#          # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
#          - key: feature.node.kubernetes.io/pci-10de.present
#            operator: In
#            values:
#              - "true"
#        - matchExpressions:
#          # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
#          - key: feature.node.kubernetes.io/cpu-model.vendor_id
#            operator: In
#            values:
#              - "NVIDIA"
#        - matchExpressions:
#          # We allow a GPU deployment to be forced by setting the following label to "true"
#          - key: "nvidia.com/gpu.present"
#            operator: In
#            values:
#              - "true"
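Then install the chart from the edited local copy; a minimal sketch (the release name and namespace are assumptions, adjust as needed):
helm upgrade --install nvidia-device-plugin . -n kube-system   # run from inside the extracted chart directory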
After deploying nvidia-device-plugin, check its status:
# kubectl get pods -n kube-system nvidia-device-plugin-qrnk8
NAME                         READY   STATUS    RESTARTS   AGE
nvidia-device-plugin-qrnk8   1/1     Running   0          18m
kubectl logs -f -n kube-system nvidia-device-plugin-qrnk8
Once the NVIDIA device plugin is running, the node reports its GPU count:
kubectl describe node 10-60-137-81 | grep nvidia.com/gpu
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  nvidia.com/gpu     0           0
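To see the same information for all nodes at once, a custom-columns query can be used (the backslash escapes the dots inside the resource name):
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'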
Test whether the GPU can be used from inside a container:
cat > nvidia-runtime-test.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-runtime-test
  namespace: default
spec:
  runtimeClassName: nvidia
  containers:
  - name: test
    image: nvcr.io/nvidia/cuda:12.1.0-base-ubuntu22.04
    command: ["sh", "-c", "nvidia-smi && sleep 3600"]
  restartPolicy: Never
EOF
kubectl apply -f nvidia-runtime-test.yaml
Check the output:
kubectl logs -f nvidia-runtime-test
Fri Jan 16 05:02:18 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.195.03             Driver Version: 570.195.03     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla P40                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   22C    P8              9W /  250W |       0MiB /  23040MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
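With GPU scheduling verified, a model-serving workload can be deployed the same way. A minimal sketch using the public ollama/ollama image purely as an illustration (image, ports, and resource sizes are assumptions; substitute your own serving stack such as vLLM):
cat > ollama-gpu.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      runtimeClassName: nvidia
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434      # Ollama API port
        resources:
          limits:
            nvidia.com/gpu: 1       # schedule onto a node with a free GPU
EOF
kubectl apply -f ollama-gpu.yaml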
Reference:
https://www.xcloudnote.com/d/fAAAAfOJX-9xp3
