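I. References
II. Important Notes
1. Dependency conflicts
If you run into a package conflict, you can resolve it by installing without dependencies: pip install --no-deps -e .
2. DCU acceleration
PyTorch builds that have not been adapted for DCU do not support DCU acceleration.
To test whether the installed PyTorch supports DCU: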
bash
(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python
Python 3.10.8 (main, Nov 4 2022, 13:48:29) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
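The interactive check above can also be run as a script. The following is a minimal sketch, assuming (as the session above suggests) that a DCU-adapted PyTorch build reports the card through the standard `torch.cuda` API. The function name `dcu_status` is hypothetical, and the code returns a "not installed" result instead of failing when PyTorch is missing.

```python
import importlib.util


def dcu_status():
    """Report whether PyTorch is installed and whether it sees an accelerator."""
    status = {
        "torch_installed": False,
        "accelerator_available": False,
        "device_count": 0,
    }
    # Skip the import entirely when PyTorch is absent from the environment.
    if importlib.util.find_spec("torch") is None:
        return status
    import torch

    status["torch_installed"] = True
    # Assumption: a DCU-adapted build exposes the card via torch.cuda,
    # as in the interactive session above.
    status["accelerator_available"] = bool(torch.cuda.is_available())
    if status["accelerator_available"]:
        status["device_count"] = torch.cuda.device_count()
    return status


if __name__ == "__main__":
    print(dcu_status())
```

If `accelerator_available` is False even though a DCU is allocated, the installed PyTorch is probably a build that was not adapted for DCU.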
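3. pip packages
Dependencies missing from the environment can be found in the 光合社区 (Guanghe) community, or in the platform's directory of preinstalled common packages, /public/software/apps/DeepLearning/whl/dtk-24.04. Copy the package into your project directory and install it with pip.
To avoid package conflicts, tell pip to install only the named package and skip its dependencies.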
bash
pip install --no-dependencies modelscope
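The two steps above (locate a package in the platform's wheel directory, then install it without dependencies) can be sketched in Python. This is an illustration under assumptions: the helper names `find_wheels` and `pip_command` are hypothetical, the internal layout of the wheel directory is not known, so the search is recursive, and the script only prints the command rather than running it.

```python
import sys
from pathlib import Path

# Directory quoted in the text above; adjust it to the dtk version in use.
WHEEL_DIR = Path("/public/software/apps/DeepLearning/whl/dtk-24.04")


def find_wheels(package, wheel_dir=WHEEL_DIR):
    """Return wheel files under wheel_dir whose name starts with the package name."""
    root = Path(wheel_dir)
    if not root.is_dir():
        return []
    # Wheel file names use '_' where the distribution name has '-'.
    prefix = package.replace("-", "_").lower() + "-"
    return sorted(p for p in root.rglob("*.whl") if p.name.lower().startswith(prefix))


def pip_command(target):
    """Build a pip command that installs only the target, without its dependencies."""
    return [sys.executable, "-m", "pip", "install", "--no-deps", str(target)]


if __name__ == "__main__":
    wheels = find_wheels("torch")
    # Fall back to the package name from the example above when no wheel is found.
    print(pip_command(wheels[-1] if wheels else "modelscope"))
```

To actually install, pass the returned list to `subprocess.run(..., check=True)`.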
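III. Background
1. DCU
A DCU (Deep Computing Unit) is a deep-computing processor designed on general-purpose GPU principles and aimed at supplying compute for AI workloads. It supports deep-learning training, including the training of complex neural networks. First-time users can think of it as analogous to an NVIDIA GPU: both are compute-acceleration hardware.
2. DTK/dtk
DTK/dtk (DCU Toolkit) is a complete software toolchain optimized for DCU hardware. It is the counterpart of the CUDA software stack and gives developers facilities for running, compiling, debugging, and profiling code. To run AI workloads on the cluster's DCU queues, the environment must be configured with DTK.
A new dtk release and its dependencies ship every April and October, with optimized operators and bug fixes. Releases are named dtk-<year><month>, for example dtk-2304 and dtk-2310; choose the driver and environment that match your needs. For each dtk release the platform provides commonly used dependencies already compiled and adapted, and the AI ecosystem packages can be obtained quickly from the 光合社区 (Guanghe) community.
IV. Common Operations
1. SSH login
Look up the account and password for SSH login, then log in with an SSH client.
2. Uploading data and files
Upload local data to the server through the console panel.
If the server cannot download an installation package directly, download it to your local machine first and then upload it to the server.
3. Installing ping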
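The release naming scheme lends itself to a small parser, for example to pick the newest release from a list of names. This is a sketch under assumptions: names are taken to be `dtk-` followed by a two-digit year and a two-digit month, the dotted spelling seen in the wheel path (`dtk-24.04`) is accepted too, and the helper names are hypothetical.

```python
import re

# Assumption: "dtk-YYMM" (e.g. dtk-2304); the dotted form "dtk-YY.MM"
# (e.g. dtk-24.04) is accepted as well.
_DTK_NAME = re.compile(r"^dtk-(\d{2})\.?(\d{2})$", re.IGNORECASE)


def parse_dtk_version(name):
    """Return (year, month) for a dtk release name, or raise ValueError."""
    match = _DTK_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognised dtk release name: {name!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in dtk release name: {name!r}")
    return year, month


def newest_release(names):
    """Return the name with the latest (year, month)."""
    return max(names, key=parse_dtk_version)


if __name__ == "__main__":
    print(newest_release(["dtk-2304", "dtk-2310", "dtk-24.04"]))
```

Comparing `(year, month)` tuples rather than raw strings keeps the ordering correct even when the two spellings are mixed.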
bash
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# sudo apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
gir1.2-glib-2.0 gir1.2-gst-plugins-bad-1.0 gir1.2-gst-plugins-base-1.0 gir1.2-gstreamer-1.0 libblkid-dev
libegl1-mesa-dev libgirepository-1.0-1 libgles2-mesa-dev libglib2.0-bin libglib2.0-dev-bin libgstreamer-opencv1.0-0
libmount-dev liborc-0.4-dev liborc-0.4-dev-bin libpcre16-3 libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpcre3-dev
libpcre32-3 libpcrecpp0v5 libselinux1-dev libsepol-dev libx11-xcb-dev python3-distutils python3-lib2to3
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed:
iputils-ping
0 upgraded, 1 newly installed, 0 to remove and 646 not upgraded.
Need to get 44.3 kB of archives.
After this operation, 124 kB of additional disk space will be used.
Get:1 https://mirrors.tuna.tsinghua.edu.cn/ubuntu noble/main amd64 iputils-ping amd64 3:20240117-1build1 [44.3 kB]
Fetched 44.3 kB in 0s (205 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package iputils-ping.
(Reading database ... 48168 files and directories currently installed.)
Preparing to unpack .../iputils-ping_3%3a20240117-1build1_amd64.deb ...
Unpacking iputils-ping (3:20240117-1build1) ...
Setting up iputils-ping (3:20240117-1build1) ...
Processing triggers for man-db (2.9.1-1) ...
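V. Server Resources
1. Whole-server specifications
2. Heterogeneous accelerator card (AI), 64 GB VRAM, PCIe
2.1 Single card
Single-card specifications
A single-card instance has 15 CPU cores, 110 GB of memory, and 64 GB of VRAM.
Notebook instance specifications for a single card
2.2 Multiple cards
4-card specifications
A 4-card instance has 15*4 = 60 CPU cores, 110*4 = 440 GB of memory, and 64*4 = 256 GB of VRAM.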
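The multi-card figures follow from the single-card ones by simple multiplication. A minimal sketch, assuming resources scale linearly with the number of cards from the single-card values stated above (15 cores, 110 GB of memory, 64 GB of VRAM); the names `SINGLE_CARD` and `instance_resources` are hypothetical.

```python
# Single-card figures from the text above.
SINGLE_CARD = {"cpu_cores": 15, "memory_gb": 110, "vram_gb": 64}


def instance_resources(cards):
    """Scale the single-card figures linearly to an instance with `cards` cards."""
    if cards < 1:
        raise ValueError("cards must be at least 1")
    return {name: value * cards for name, value in SINGLE_CARD.items()}


if __name__ == "__main__":
    print(instance_resources(4))
```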
3. CPU
bash
root@notebook-1813389960667746306-scnlbe5oi5-50216:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 45 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-254
Off-line CPU(s) list: 255
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: HygonGenuine
CPU family: 24
Model: 4
Model name: Hygon C86 7490 64-core Processor
Stepping: 1
CPU MHz: 2700.017
BogoMIPS: 5400.03
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-254
Vulnerability L1tf: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; Load fences, __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full retpoline
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht
syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc
extd_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe
popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse
3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb hw_pstate sme
retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni
xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale
vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov
succor smca
4. Accelerator card
bash
Every 1.0s: rocm-smi notebook-1813389960667746306-scnlbe5oi5-50216: Tue Jul 30 12:39:23 2024
============================ System Management Interface =============================
======================================================================================
DCU Temp AvgPwr Perf PwrCap VRAM% DCU% Mode
0 44.0C 112.0W auto 300.0W 2% 0% Normal
======================================================================================
=================================== End of SMI Log ===================================
5. Memory
bash
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# free -h
total used free shared buff/cache available
Mem: 1.0Ti 64Gi 115Gi 11Mi 827Gi 942Gi
Swap: 0B 0B 0B
6. Disk
bash
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# df -h
Filesystem Size Used Avail Use% Mounted on
overlay 11T 1.4T 8.5T 14% /
tmpfs 64M 0 64M 0% /dev
tmpfs 504G 0 504G 0% /sys/fs/cgroup
ks_p300s_public 53P 37P 16P 70% /etc/sugon_motd
/dev/md0 11T 1.4T 8.5T 14% /etc/hosts
/dev/mapper/centos-root 3.5T 21G 3.5T 1% /etc/tmp
tmpfs 110G 36K 110G 1% /dev/shm
tmpfs 110G 12K 110G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 504G 0 504G 0% /proc/acpi
tmpfs 504G 0 504G 0% /proc/scsi
tmpfs 504G 0 504G 0% /sys/firmware
7. System information
bash
(hugface) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/envs# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
VI. FAQ
Q: E: Unable to locate package iputils-ping
bash
root@notebook-1813056950361673730-scnlbe5oi5-39425:~# sudo apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package iputils-ping
Solution:
bash
# Refresh the package index, then install again
sudo apt-get update
sudo apt install iputils-ping
Q: ping: www.baidu.com: Temporary failure in name resolution
bash
root@notebook-1813056950361673730-scnlbe5oi5-39425:~# ping www.baidu.com
ping: www.baidu.com: Temporary failure in name resolution
Solution: configure the IP addresses of the DNS servers.
bash
sudo vi /etc/resolv.conf
# Add the following lines
nameserver 1.1.1.1
nameserver 8.8.8.8
bash
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# vi /etc/resolv.conf
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# ping www.baidu.com
PING www.wshifen.com (103.235.47.188) 56(84) bytes of data.
64 bytes from 103.235.47.188: icmp_seq=1 ttl=46 time=37.1 ms
64 bytes from 103.235.47.188: icmp_seq=2 ttl=46 time=37.0 ms
^C
--- www.wshifen.com ping statistics ---
3 packets transmitted, 2 received, 33.3333% packet loss, time 2002ms
rtt min/avg/max/mdev = 37.002/37.033/37.065/0.031 ms
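On a machine where ping is not installed yet, the same DNS fix can be verified from Python. The sketch below lists the `nameserver` entries in resolv.conf-style text and tests whether a hostname resolves using the standard `socket` module. The function names are hypothetical, and this checks name resolution only, not ICMP reachability.

```python
import socket


def parse_nameservers(resolv_conf_text):
    """Return the addresses on 'nameserver' lines, ignoring comment lines."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # resolv.conf treats lines starting with '#' or ';' as comments.
        if not stripped or stripped.startswith(("#", ";")):
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def can_resolve(hostname):
    """Return True if the system resolver can resolve hostname."""
    try:
        socket.getaddrinfo(hostname, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
    return True
```

For example, `parse_nameservers(open("/etc/resolv.conf").read())` should list `1.1.1.1` and `8.8.8.8` after the edit, and `can_resolve("www.baidu.com")` should then return True if the servers are reachable.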