Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://blog.csdn.net/caroline_wendy/article/details/136499768
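PyTorchJob is a Kubernetes custom resource for running PyTorch training jobs on Kubernetes. It is part of Kubeflow and has stable status. With PyTorchJob you can create and manage PyTorch jobs the same way you manage Kubernetes built-in resources. To use PyTorchJob, you first need to install the PyTorch Operator. By default, the PyTorch Operator is deployed as a controller in the training operator.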
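The YAML configuration is shown below, where:
- `kind` is `PyTorchJob`.
- `metadata/name` is the name of the Job to run; it must not clash with an existing Job name.
- The nodes use the `Worker` replica type. `replicas` is the number of replicated nodes, and `resources` sets the number of GPUs per node, so both 2 machines with 1 GPU each and 1 machine with 2 GPUs are supported.
- `command` is the command to run.
Source: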
yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple-001
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            file-mount: "true"
            user-mount: "true"
        spec:
          # hostNetwork: false # New
          containers:
            - name: pytorch
              command:
                - /bin/sh
                - -cl
                - "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"
              image: "harbor.[xxx].com/cryoem:v1.3.1"
              imagePullPolicy: Always
              securityContext: # New
                privileged: false
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  rdma/hca: 1
                  cpu: 12
                  memory: "100G"
                  nvidia.com/gpu: 2
              workingDir: "workspace/cryoem-project/"
              volumeMounts:
                - name: cache-volume # change the name to your volume on k8s
                  mountPath: /dev/shm
          nodeSelector:
            gpu.device: "a100" # support 'a10' or 'a100'
            group: "algo2"
          tolerations:
            - effect: NoSchedule
              key: role
              operator: Equal
              value: "algo2"
          volumes:
            - name: cache-volume # change the name to your volume on k8s
              emptyDir:
                medium: Memory
                sizeLimit: "960G"
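The relationship between `replicas` and `nvidia.com/gpu` can be sketched in a few lines of Python. This is only an illustration: the helper names `total_gpus` and `worker_spec` are hypothetical, and the dictionary keeps just the fields that control the layout, not the full manifest. Note that each replica is a pod, so whether two replicas really land on two different machines depends on scheduling.

```python
def total_gpus(replicas: int, gpus_per_replica: int) -> int:
    """Total number of GPUs requested by the Worker replicas."""
    return replicas * gpus_per_replica


def worker_spec(replicas: int, gpus_per_replica: int) -> dict:
    """Return only the Worker fields that decide the replica/GPU layout."""
    return {
        "replicas": replicas,
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "resources": {
                            "limits": {"nvidia.com/gpu": gpus_per_replica}
                        },
                    }
                ]
            }
        },
    }


# Layout used in the manifest above: 1 replica with 2 GPUs.
one_by_two = worker_spec(replicas=1, gpus_per_replica=2)
# Alternative layout: 2 replicas with 1 GPU each.
two_by_one = worker_spec(replicas=2, gpus_per_replica=1)

for label, (replicas, gpus) in {"1 x 2": (1, 2), "2 x 1": (2, 1)}.items():
    print(label, "->", total_gpus(replicas, gpus), "GPUs in total")
```

Both layouts request the same total number of GPUs; they differ only in how the GPUs are spread across pods.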
Check the job status:
bash
kubectl get pytorchjobs
# kubectl delete pytorchjobs pytorch-simple-001
kubectl get pods
kubectl exec -it -n [your name] pytorch-simple-001-worker-0 -- bash
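The pod name `pytorch-simple-001-worker-0` follows the pattern job name, replica type, index. As a rough sketch, assuming that pattern holds for every Worker replica, the pod names and the matching `kubectl exec` command lines can be generated as follows. The helpers `worker_pod_names` and `exec_command` and the namespace `my-namespace` are hypothetical placeholders.

```python
import shlex
from typing import List


def worker_pod_names(job_name: str, replicas: int) -> List[str]:
    """Pod names for the Worker replicas, assuming '<job>-worker-<index>'."""
    return [f"{job_name}-worker-{i}" for i in range(replicas)]


def exec_command(namespace: str, pod: str, shell: str = "bash") -> List[str]:
    """Build (but do not run) a kubectl exec command line for one pod."""
    return ["kubectl", "exec", "-it", "-n", namespace, pod, "--", shell]


# 'my-namespace' is a placeholder; substitute your own namespace.
for pod in worker_pod_names("pytorch-simple-001", replicas=2):
    print(shlex.join(exec_command("my-namespace", pod)))
```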
Result (`nvidia-smi` output):
bash
Thu Mar  7 07:39:13 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   52C    P0   259W / 400W |   7833MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   52C    P0   235W / 400W |  12917MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
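To confirm programmatically that both GPUs are busy, the memory and utilization columns can be pulled out of this output with a regular expression. The following is a minimal sketch rather than a robust parser: the pattern `ROW` and the helper `parse_usage` are hypothetical, and they assume the table layout shown above, with one memory/utilization row per GPU listed in index order.

```python
import re

# Matches the "used MiB / total MiB | util%" part of a per-GPU row.
ROW = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def parse_usage(text: str):
    """Return a list of (used_mib, total_mib, util_percent), one per GPU row."""
    return [(int(u), int(t), int(p)) for u, t, p in ROW.findall(text)]


# Two rows copied from the output above.
sample = """
| N/A   52C    P0   259W / 400W |   7833MiB / 81920MiB |     93%      Default |
| N/A   52C    P0   235W / 400W |  12917MiB / 81920MiB |     93%      Default |
"""

# The index comes from row order, which is assumed to match the GPU index.
for index, (used, total, util) in enumerate(parse_usage(sample)):
    print(f"GPU {index}: {used}/{total} MiB used, utilization {util}%")
```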