MindSpore Advanced Tutorial: Single-Node Data Cache (Part 2)

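Hi everyone, I'm Liu Ming, founder of Mingzhi Technology (明志科技) and a Huawei MindSpore evangelist.

On the technical side I focus on front-end development, HarmonyOS development, and AI algorithm research.

I try to keep the technical posts coming, so if you like my articles, feel free to follow.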

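Cache sharing

In single-node, multi-device distributed training, the cache can also be shared: several instances of the same training script attach to one cache and read from and write to it together.

Start the cache server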

$ cache_admin --start
Cache server startup completed successfully!
The cache server daemon has been created as process id 39337 and listening on port 50052
Recommendation:
Since the server is detached into its own daemon process, monitor the server logs (under /tmp/mindspore/cache/log) for any issues that may happen after startup

Create a cache session

Create a shell script, cache.sh, that will launch the Python training. The following commands generate a cache session id:

#!/bin/bash
# This shell script will launch parallel pipelines

# get path to dataset directory
if [ $# != 1 ]
then
    echo "Usage: sh cache.sh DATASET_PATH"
    exit 1
fi
dataset_path=$1

# generate a session id that these parallel pipelines can share
result=$(cache_admin -g 2>&1)
rc=$?
if [ $rc -ne 0 ]; then
    # print what cache_admin reported so the failure can be diagnosed
    echo "cache_admin -g failed: $result"
    exit 1
fi

# grab the session id from the result string
session_id=$(echo $result | awk '{print $NF}')
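The awk step above keeps only the last whitespace-separated field of whatever cache_admin -g printed. As a minimal sketch, the same extraction can be done in Python. The function name parse_session_id and the sample string are illustrative assumptions, not output captured from a real run.

```python
def parse_session_id(output: str) -> int:
    """Return the last whitespace-separated token of `output` as an integer.

    Mirrors `echo $result | awk '{print $NF}'`: the unquoted echo joins all
    lines into one, so the last token of the whole text is what survives.
    """
    tokens = output.split()
    if not tokens:
        raise ValueError("cache_admin produced no output")
    return int(tokens[-1])


# Illustrative output string (an assumption, not taken from a real run).
sample = "Session created for server on port 50052: 780643335"
print(parse_session_id(sample))
```

If the last token is not a number, int() raises ValueError, which surfaces a malformed reply early instead of passing a bad id to the training scripts.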

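Pass the session id to the training script

Continue the shell script by adding the commands below, which pass session_id and the other parameters to the Python training script at launch: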

# make the session_id available to the python scripts
num_devices=4

for p in $(seq 0 $((${num_devices}-1))); do
    python my_training_script.py --num_devices "$num_devices" --device "$p" --session_id "$session_id" --dataset_path "$dataset_path"
done
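As written, the loop above waits for each script to finish before starting the next one. A minimal Python sketch of the same launch step follows, assuming you would rather start all pipelines at once and then wait for them. The helper names build_commands and launch_all are hypothetical; the command-line flags are the ones used in the loop above.

```python
import subprocess
import sys


def build_commands(num_devices, session_id, dataset_path,
                   script="my_training_script.py"):
    """Build one argument list per device, mirroring the shell loop."""
    return [
        [sys.executable, script,
         "--num_devices", str(num_devices),
         "--device", str(device),
         "--session_id", str(session_id),
         "--dataset_path", dataset_path]
        for device in range(num_devices)
    ]


def launch_all(commands):
    """Start every command without waiting, then collect the exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


# Example (not executed here): launch_all(build_commands(4, session_id, "cifar-10-batches-bin/"))
```

Passing the arguments as a list avoids shell word-splitting, so a dataset path that contains spaces is handled without extra quoting.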

Create and apply a cache instance

The example below uses the CIFAR-10 dataset, laid out as follows:

├─cache.sh
├─my_training_script.py
└─cifar-10-batches-bin
    ├── batches.meta.txt
    ├── data_batch_1.bin
    ├── data_batch_2.bin
    ├── data_batch_3.bin
    ├── data_batch_4.bin
    ├── data_batch_5.bin
    ├── readme.html
    └── test_batch.bin

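Create the Python script my_training_script.py. The code below receives the session_id passed in from the shell script and uses it as an argument when defining the cache instance.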

import argparse
import mindspore.dataset as ds

parser = argparse.ArgumentParser(description='Cache Example')
parser.add_argument('--num_devices', type=int, default=1, help='Device num.')
parser.add_argument('--device', type=int, default=0, help='Device id.')
parser.add_argument('--session_id', type=int, default=1, help='Session id.')
parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
args_opt = parser.parse_args()

# apply cache to dataset
test_cache = ds.DatasetCache(session_id=args_opt.session_id, size=0, spilling=False)
dataset = ds.Cifar10Dataset(dataset_dir=args_opt.dataset_path, num_samples=4, shuffle=False, num_parallel_workers=1,
                            num_shards=args_opt.num_devices, shard_id=args_opt.device, cache=test_cache)
num_iter = 0
for _ in dataset.create_dict_iterator():
    num_iter += 1
print("Got {} samples on device {}".format(num_iter, args_opt.device))

Run the training script

Run the shell script cache.sh to start distributed training:

$ sh cache.sh cifar-10-batches-bin/
Got 4 samples on device 0
Got 4 samples on device 1
Got 4 samples on device 2
Got 4 samples on device 3

Running cache_admin --list_sessions shows that the session holds only one set of cached data, which indicates that the cache was shared successfully.

$ cache_admin --list_sessions
Listing sessions for server on port 50052

   Session   Cache Id  Mem cached  Disk cached  Avg cache size  Numa hit
3392558708  821590605          16          n/a            3227        16
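The "only one row" check can also be scripted. The sketch below parses the listing shown above and counts the distinct (session id, cache id) pairs. parse_sessions is a hypothetical helper, and it assumes data rows begin with two numeric fields, as in the sample. Rows where either field is not numeric are skipped.

```python
def parse_sessions(text):
    """Extract (session_id, cache_id) pairs from a --list_sessions listing."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with two numeric fields; banner and header lines do not.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            rows.append((int(fields[0]), int(fields[1])))
    return rows


# The listing printed in this article, copied verbatim.
listing = """Listing sessions for server on port 50052

Session    Cache Id  Mem cached Disk cached  Avg cache size  Numa hit
3392558708   821590605          16         n/a            3227        16
"""

rows = parse_sessions(listing)
shared = len(set(rows)) == 1  # one pair means every pipeline used the same cache
print(rows, shared)
```

If the pipelines had not shared the cache, you would expect more than one row here. That expectation follows from the article's reasoning and has not been verified separately.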

Destroy the cache session

Once training has finished, you can destroy the current cache session to release its memory:

$ cache_admin --destroy_session 3392558708
Drop session successfully for server on port 50052

Shut down the cache server

When you are done, you can shut down the cache server:

$ cache_admin --stop
Cache server on port 50052 has been stopped successfully.
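The two cleanup steps can be wrapped in a small helper so they always run in the same order. This is a sketch under stated assumptions: teardown_commands and run_teardown are hypothetical names, and only the --destroy_session and --stop flags shown above are used. The runner parameter exists so the logic can be exercised without a cache server.

```python
import subprocess


def teardown_commands(session_id, stop_server=True):
    """Return the cleanup commands in order: destroy the session, then stop."""
    commands = [["cache_admin", "--destroy_session", str(session_id)]]
    if stop_server:
        commands.append(["cache_admin", "--stop"])
    return commands


def run_teardown(session_id, stop_server=True, runner=subprocess.run):
    """Run each cleanup command; with the default runner a failure raises."""
    for cmd in teardown_commands(session_id, stop_server):
        runner(cmd, check=True)


# Example (not executed here): run_teardown(3392558708)
```

Set stop_server=False when other jobs still use the same server, so that only this session's memory is released.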