DCU, the Chinese-Made Heterogeneous Accelerator Card on Sugon's Supercomputing Internet Platform (SCNet)

I. References

Supercomputing Internet Platform (SCNet)

Heterogeneous accelerator card AI, 64 GB VRAM, PCIe

Guanghe Community (光合社区)

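II. Important Notes

1. Dependency conflicts

When package dependencies conflict, run pip install --no-deps -e . to install the project itself without installing (and possibly overwriting) its dependencies.

2. DCU acceleration

PyTorch builds that have not been adapted for the DCU cannot use DCU acceleration.

To test whether the installed PyTorch supports the DCU: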

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python
Python 3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
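
The same check can also be run as a script. The sketch below assumes that a DTK-adapted PyTorch build exposes the DCU through the standard torch.cuda API, as the interactive session above suggests. The helper name dcu_status is made up for this illustration.

```python
def dcu_status():
    """Report whether PyTorch is importable and whether it sees an accelerator."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False, "available": False, "devices": []}

    available = torch.cuda.is_available()
    devices = []
    if available:
        # Collect the name of every device the torch.cuda API can see.
        devices = [
            torch.cuda.get_device_name(i)
            for i in range(torch.cuda.device_count())
        ]
    return {"torch_installed": True, "available": available, "devices": devices}


if __name__ == "__main__":
    print(dcu_status())
```

If available comes back False on a DCU instance, the installed PyTorch is most likely a build that has not been adapted for the DCU.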

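3. pip packages

Dependencies missing from the environment can be found in the Guanghe Community (光合社区), or in the platform's preset directory of common packages, /public/software/apps/DeepLearning/whl/dtk-24.04. Copy the wheel into your project directory and install it with pip.

To avoid conflicts, tell pip to install only the named package and skip its dependencies: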

pip install --no-dependencies modelscope
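
Looking up a wheel in the preset directory and installing it without dependencies can be sketched as follows. The layout of the directory is an assumption (the search simply recurses through it), and the helper names find_wheels and no_deps_install_command are hypothetical.

```python
import sys
from pathlib import Path

# Preset wheel directory named in the text above.
WHEEL_DIR = Path("/public/software/apps/DeepLearning/whl/dtk-24.04")


def find_wheels(package, wheel_dir=WHEEL_DIR):
    """Return wheel files under wheel_dir that belong to the given package."""
    root = Path(wheel_dir)
    if not root.is_dir():
        return []
    # Wheel file names look like "<name>-<version>-...whl"; normalizing "-" to
    # "_" lets "torch" match "torch-2.1.0..." but not "torchvision-0.16...".
    prefix = package.lower().replace("-", "_") + "_"
    return sorted(
        path for path in root.rglob("*.whl")
        if path.name.lower().replace("-", "_").startswith(prefix)
    )


def no_deps_install_command(wheel):
    """Build the pip command that installs one wheel without its dependencies."""
    return [sys.executable, "-m", "pip", "install", "--no-deps", str(wheel)]


if __name__ == "__main__":
    for wheel in find_wheels("torch"):
        print(" ".join(no_deps_install_command(wheel)))
```

The script only prints the commands; passing one of them to subprocess.run would perform the installation.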

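III. Background

1. DCU

The DCU (Deep Computing Unit) is a deep-computing processor built on general-purpose GPU (GPGPU) design principles and aimed at AI workloads. It supports deep learning training, including the training of complex neural networks. First-time users can think of it as the counterpart of an NVIDIA GPU: both are compute-acceleration hardware.

2. DTK/dtk

DTK/dtk (DCU Toolkit) is a complete software toolchain optimized for DCU hardware and comparable to the CUDA software stack. It gives developers runtime, compilation, debugging, and profiling tools. To run accelerated AI workloads on the cluster's DCU queues, you need to set up the environment with DTK.

A new dtk release and its related dependencies come out every April and October, with optimized operators and bug fixes. Releases are named dtk-YYMM, for example dtk-2304 and dtk-2310, so you can pick the driver and environment that fit your needs. For each dtk release, the platform provides commonly used dependency packages that have been compiled and adapted for it; the AI ecosystem packages can be obtained quickly from the Guanghe Community (光合社区).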
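
Because the release name encodes year and month, it can be parsed and compared. The sketch below is an illustration only: the function name parse_dtk_version is hypothetical, and accepting the dotted form (dtk-24.04) is an assumption based on the wheel directory path quoted earlier.

```python
import re


def parse_dtk_version(name):
    """Split a DTK name such as 'dtk-2304' or 'dtk-24.04' into (year, month)."""
    match = re.fullmatch(r"dtk-(\d{2})\.?(\d{2})", name.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized DTK version name: {name!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in DTK version name: {name!r}")
    return year, month


if __name__ == "__main__":
    releases = ["dtk-2310", "dtk-24.04", "dtk-2304"]
    # Sorting by (year, month) orders the releases chronologically.
    print(sorted(releases, key=parse_dtk_version))
```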

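IV. Common Operations

1. SSH login

Look up the account and password for SSH login.

Then log in with an SSH client using those credentials.

2. Uploading data and files

Use the console panel to upload local data to the server.

If an installation package cannot be downloaded directly on the server, download it to your local machine first and then upload it.

3. Installing ping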

root@notebook-1813389960667746306-scnlbe5oi5-17811:~# sudo apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages were automatically installed and are no longer required:
  gir1.2-glib-2.0 gir1.2-gst-plugins-bad-1.0 gir1.2-gst-plugins-base-1.0 gir1.2-gstreamer-1.0 libblkid-dev
  libegl1-mesa-dev libgirepository-1.0-1 libgles2-mesa-dev libglib2.0-bin libglib2.0-dev-bin libgstreamer-opencv1.0-0
  libmount-dev liborc-0.4-dev liborc-0.4-dev-bin libpcre16-3 libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpcre3-dev
  libpcre32-3 libpcrecpp0v5 libselinux1-dev libsepol-dev libx11-xcb-dev python3-distutils python3-lib2to3
Use 'sudo apt autoremove' to remove them.
The following NEW packages will be installed:
  iputils-ping
0 upgraded, 1 newly installed, 0 to remove and 646 not upgraded.
Need to get 44.3 kB of archives.
After this operation, 124 kB of additional disk space will be used.
Get:1 https://mirrors.tuna.tsinghua.edu.cn/ubuntu noble/main amd64 iputils-ping amd64 3:20240117-1build1 [44.3 kB]
Fetched 44.3 kB in 0s (205 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package iputils-ping.
(Reading database ... 48168 files and directories currently installed.)
Preparing to unpack .../iputils-ping_3%3a20240117-1build1_amd64.deb ...
Unpacking iputils-ping (3:20240117-1build1) ...
Setting up iputils-ping (3:20240117-1build1) ...
Processing triggers for man-db (2.9.1-1) ...

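V. Server Resources

1. Whole-server specifications

2. Heterogeneous accelerator card AI, 64 GB VRAM, PCIe

2.1 Single card

Single-card specifications

A single-card instance has 15 CPU cores, 110 GB of memory, and 64 GB of VRAM.

Notebook instance specifications for a single card

2.2 Multiple cards

4-card specifications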

A 4-card instance has 15*4 CPU cores, 110*4 GB of memory, and 64*4 GB of VRAM (60 cores, 440 GB, and 256 GB in total).
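
The arithmetic can be written out as a small helper. This is a sketch that assumes resources scale linearly from the single-card figures (15 CPU cores, 110 GB of memory, 64 GB of VRAM per card); the names InstanceSpec and instance_spec are hypothetical.

```python
from dataclasses import dataclass

# Per-card figures taken from the single-card specification above.
CPU_CORES_PER_CARD = 15
MEMORY_GB_PER_CARD = 110
VRAM_GB_PER_CARD = 64


@dataclass(frozen=True)
class InstanceSpec:
    cards: int
    cpu_cores: int
    memory_gb: int
    vram_gb: int


def instance_spec(cards):
    """Scale the per-card resources linearly to an instance with `cards` cards."""
    if cards < 1:
        raise ValueError("cards must be at least 1")
    return InstanceSpec(
        cards=cards,
        cpu_cores=CPU_CORES_PER_CARD * cards,
        memory_gb=MEMORY_GB_PER_CARD * cards,
        vram_gb=VRAM_GB_PER_CARD * cards,
    )


if __name__ == "__main__":
    print(instance_spec(4))
```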

3. CPU

root@notebook-1813389960667746306-scnlbe5oi5-50216:~# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   45 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       HygonGenuine
CPU family:                      24
Model:                           4
Model name:                      Hygon C86 7490 64-core Processor
Stepping:                        1
CPU MHz:                         2700.017
BogoMIPS:                        5400.03
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-254
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; Load fences, __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full retpoline
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht
                                 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc ext
                                 d_apicid amd_dcm aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe
                                 popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dno
                                 wprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb hw_pstate sme retp
                                 oline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni
                                  xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_cle
                                 an flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov su
                                 ccor smca
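
The CPU lists in the lscpu output use a compact range notation. A minimal sketch for expanding it, useful for counting the CPUs in each NUMA node (the function name expand_cpu_list is hypothetical):

```python
def expand_cpu_list(spec):
    """Expand an lscpu-style CPU list such as '0-15,128-143' into CPU ids."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "a-b" includes both endpoints.
            start, end = part.split("-", 1)
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


if __name__ == "__main__":
    print(len(expand_cpu_list("0-15,128-143")))     # NUMA node0
    print(len(expand_cpu_list("112-127,240-254")))  # NUMA node7
```

On the output above, node0 expands to 32 CPUs while node7 expands to 31, because CPU 255 is listed as offline.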

4. Accelerator card

Every 1.0s: rocm-smi                                           notebook-1813389960667746306-scnlbe5oi5-50216: Tue Jul 30 12:39:23 2024


============================ System Management Interface =============================
======================================================================================
DCU     Temp     AvgPwr     Perf     PwrCap     VRAM%      DCU%      Mode
0       44.0C    112.0W     auto     300.0W     2%         0%        Normal
======================================================================================
=================================== End of SMI Log ===================================
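
For scripted monitoring, the table can be turned into dictionaries. The parser below is a sketch that assumes the exact text layout shown above (a header row starting with DCU, one whitespace-separated row per card); other rocm-smi versions may print a different layout. The function name parse_smi_table is hypothetical.

```python
def parse_smi_table(text):
    """Parse the rocm-smi summary table into one dict per card."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "====" separator and banner lines.
        if not fields or line.strip().startswith("="):
            continue
        if fields[0] == "DCU":
            header = fields
        elif header and fields[0].isdigit() and len(fields) == len(header):
            rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    sample = """\
DCU     Temp     AvgPwr     Perf     PwrCap     VRAM%      DCU%      Mode
0       44.0C    112.0W     auto     300.0W     2%         0%        Normal
"""
    for row in parse_smi_table(sample):
        print(row["DCU"], row["Temp"], row["VRAM%"], row["DCU%"])
```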

5. Memory

root@notebook-1813389960667746306-scnlbe5oi5-17811:~# free -h
              total        used        free      shared  buff/cache   available
Mem:          1.0Ti        64Gi       115Gi        11Mi       827Gi       942Gi
Swap:            0B          0B          0B

6. Disk

root@notebook-1813389960667746306-scnlbe5oi5-17811:~# df -h
Filesystem               Size  Used Avail Use% Mounted on
overlay                   11T  1.4T  8.5T  14% /
tmpfs                     64M     0   64M   0% /dev
tmpfs                    504G     0  504G   0% /sys/fs/cgroup
ks_p300s_public           53P   37P   16P  70% /etc/sugon_motd
/dev/md0                  11T  1.4T  8.5T  14% /etc/hosts
/dev/mapper/centos-root  3.5T   21G  3.5T   1% /etc/tmp
tmpfs                    110G   36K  110G   1% /dev/shm
tmpfs                    110G   12K  110G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs                    504G     0  504G   0% /proc/acpi
tmpfs                    504G     0  504G   0% /proc/scsi
tmpfs                    504G     0  504G   0% /sys/firmware

7. System information

(hugface) root@notebook-1813389960667746306-scnlbe5oi5-50216:~/envs# cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=20.04
DISTRIB_CODENAME=focal
DISTRIB_DESCRIPTION="Ubuntu 20.04.1 LTS"
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
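
The release files are plain KEY=value text, so a script can read the same fields. A minimal sketch (the function name parse_os_release is hypothetical, and quoting is handled only for the simple double-quoted values seen above):

```python
def parse_os_release(text):
    """Parse KEY=value lines from /etc/os-release or /etc/lsb-release."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Ignore blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')
    return info


if __name__ == "__main__":
    with open("/etc/os-release", encoding="utf-8") as handle:
        release = parse_os_release(handle.read())
    print(release.get("PRETTY_NAME"), release.get("VERSION_ID"))
```

On Python 3.10 and later, the standard library's platform.freedesktop_os_release() returns the same information for /etc/os-release without a hand-written parser.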

VI. FAQ

Q: E: Unable to locate package iputils-ping

root@notebook-1813056950361673730-scnlbe5oi5-39425:~# sudo apt install iputils-ping
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package iputils-ping

Solution

# Refresh the package index, then install again
sudo apt-get update
sudo apt install iputils-ping

Q: ping: www.baidu.com: Temporary failure in name resolution

root@notebook-1813056950361673730-scnlbe5oi5-39425:~# ping www.baidu.com
ping: www.baidu.com: Temporary failure in name resolution

Solution: configure the IP addresses of the DNS servers.

sudo vi /etc/resolv.conf

# Add the following lines
nameserver 1.1.1.1
nameserver 8.8.8.8

After saving the file, ping can resolve the hostname:

root@notebook-1813389960667746306-scnlbe5oi5-17811:~# vi /etc/resolv.conf
root@notebook-1813389960667746306-scnlbe5oi5-17811:~# ping www.baidu.com
PING www.wshifen.com (103.235.47.188) 56(84) bytes of data.
64 bytes from 103.235.47.188: icmp_seq=1 ttl=46 time=37.1 ms
64 bytes from 103.235.47.188: icmp_seq=2 ttl=46 time=37.0 ms
^C
--- www.wshifen.com ping statistics ---
3 packets transmitted, 2 received, 33.3333% packet loss, time 2002ms
rtt min/avg/max/mdev = 37.002/37.033/37.065/0.031 ms
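
Name resolution can also be checked from Python without ping. The sketch below lists the configured name servers and tries to resolve a hostname; the helper names read_nameservers and resolve are hypothetical.

```python
import socket


def read_nameservers(resolv_conf_text):
    """Return the addresses on the 'nameserver' lines of resolv.conf content."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def resolve(hostname):
    """Return an IPv4 address for hostname, or None if resolution fails."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a lookup failure.
        return None


if __name__ == "__main__":
    with open("/etc/resolv.conf", encoding="utf-8") as handle:
        print("name servers:", read_nameservers(handle.read()))
    print("www.baidu.com ->", resolve("www.baidu.com"))
```

A result of None corresponds to the "Temporary failure in name resolution" error shown earlier.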