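Using Datasets in torchvision
- [1. Using datasets in torchvision](#1. Using datasets in torchvision)
- [2. Using DataLoader](#2. Using DataLoader)
  - Official documentation
  - Parameter explanation
  - What DataLoader returns
  - Code
  - Output
    - [1. shuffle=False: no reshuffling, data arrives in the same order](#1. shuffle=False: no reshuffling, data arrives in the same order)
    - [2. shuffle=True: reshuffled, data arrives in a different order](#2. shuffle=True: reshuffled, data arrives in a different order)
    - [3. drop_last=False: the final incomplete batch is kept](#3. drop_last=False: the final incomplete batch is kept)
Study notes from the Xiaotudui (小土堆) video course on Bilibili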
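1. Using datasets in torchvision
Official documentation
Check the version shown in the top-left corner of the documentation page.
Note 1: do not forget the parentheses when instantiating ToTensor
Write transforms.ToTensor(), not transforms.ToTensor. Without the parentheses, an error is raised later when the dataset is indexed.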
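Note 2: download can always stay True
Leaving download=True is safe: after the first download the dataset is already in the target directory, so the code does not download it again. You can also put a manually downloaded archive into that directory, and the code extracts it automatically.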
Code
py
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms
# Usage 1
# If the download is slow, fetch the archive with a download manager such as Xunlei (迅雷), which pulls from multiple sources and tends to be faster: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
train_set = datasets.CIFAR10(root='./dataset', train=True, download=True)
test_set = datasets.CIFAR10(root='./dataset', train=False, download=True)
# Without a transform, each sample is a PIL image; step through in a debugger to inspect the data
print(test_set[0])  # __getitem__ returns (img, target)
print(type(test_set[0]))
img, target = test_set[0]
print(target)
print(test_set.classes[target])
print(img)
# A PIL image can be displayed directly with show()
img.show()
# Usage 2
# Apply transforms to every sample so that the dataset yields tensors
# trans_compose = transforms.Compose([transforms.ToTensor])  # wrong: missing parentheses, raises an error later
trans_compose = transforms.Compose([transforms.ToTensor()])
train_set2 = datasets.CIFAR10(root='./dataset', train=True, transform=trans_compose, download=True)
test_set2 = datasets.CIFAR10(root='./dataset', train=False, transform=trans_compose, download=True)
print(type(test_set2[2]))
img, target = test_set2[0]
print(target)
print(test_set2.classes[target])
print(type(img))
writer = SummaryWriter("logs")
for i in range(10):
    img_tensor, target = test_set2[i]
    writer.add_image('tensor dataset', img_tensor, i)
writer.close()
Output
sh
> p11_torchvision_dataset.py
Files already downloaded and verified
Files already downloaded and verified
(<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>, 3)
<class 'tuple'>
3
cat
<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>
Files already downloaded and verified
Files already downloaded and verified
<class 'tuple'>
3
cat
<class 'torch.Tensor'>
Process finished with exit code 0
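The `3` / `cat` pair in the output comes from indexing the dataset's `classes` list with the integer target. The sketch below reproduces that lookup without downloading anything. The hard-coded list is an assumption: it is the CIFAR-10 class order as I understand torchvision exposes it, so print `test_set.classes` to confirm on your install.

```python
# Assumed CIFAR-10 class order; verify against test_set.classes.
classes = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

# Reverse mapping from class name to integer label.
class_to_idx = {name: idx for idx, name in enumerate(classes)}

target = 3  # the label printed for test_set[0] in the output above
print(classes[target])
print(class_to_idx["cat"])
```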
2. Using DataLoader
Official documentation
https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader
Parameter explanation
Parameters:
- dataset (Dataset) -- dataset from which to load the data.
  Note: the dataset to load samples from.
- batch_size (int, optional) -- how many samples per batch to load (default: 1).
  Note: how many samples to fetch per batch.
- shuffle (bool, optional) -- set to True to have the data reshuffled at every epoch (default: False).
  Note: with False, every epoch returns the data in the same order (no randomness is involved); with True, the order is reshuffled each epoch.
- sampler (Sampler or Iterable, optional) -- defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.
- batch_sampler (Sampler or Iterable, optional) -- like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.
- num_workers (int, optional) -- how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)
  Note: loads data with multiple subprocesses; 0 means the main process does the loading. On Windows, a nonzero value may raise BrokenPipeError.
- collate_fn (Callable, optional) -- merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.
- pin_memory (bool, optional) -- If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.
- drop_last (bool, optional) -- set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)
  Note: when the dataset size is not divisible by batch_size, True discards the leftover samples so they are not used; False keeps them as a smaller final batch. The default is False.
- timeout (numeric, optional) -- if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)
- worker_init_fn (Callable, optional) -- If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)
- generator (torch.Generator, optional) -- If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)
- prefetch_factor (int, optional, keyword-only arg) -- Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise if value of num_workers>0 default is 2).
- persistent_workers (bool, optional) -- If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)
- pin_memory_device (str, optional) -- the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.
On Windows, num_workers > 0 may raise BrokenPipeError; if it does, try setting num_workers to 0.
https://blog.csdn.net/Ginomica_xyx/article/details/113745596
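To make the batch_size and drop_last descriptions concrete, here is a small arithmetic sketch in plain Python. `batch_count` is a hypothetical helper written for these notes, not part of PyTorch; it mirrors how many batches one pass produces. The CIFAR10 test split has 10,000 images, so with batch_size=64 the final batch holds only 16 of them.

```python
import math


def batch_count(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches produced by one pass over the data (hypothetical helper)."""
    if drop_last:
        # The incomplete final batch is discarded.
        return num_samples // batch_size
    # The incomplete final batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


num_samples = 10_000  # size of the CIFAR10 test split
batch_size = 64

kept = batch_count(num_samples, batch_size, drop_last=False)
dropped = batch_count(num_samples, batch_size, drop_last=True)
leftover = num_samples % batch_size

print(kept)      # 157
print(dropped)   # 156
print(leftover)  # 16
```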
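What DataLoader returns
Iterating over a DataLoader yields one batch at a time; for CIFAR10 each batch is a pair (imgs, targets), not two iterators. With the default shuffle=False, samples are drawn in order by a sequential sampler; random sampling is used only when shuffle=True.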
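A quick way to check which sampler is in use is to inspect the loader's `sampler` attribute. This is a sketch on a toy dataset of ten integers (an assumption standing in for CIFAR10), with `epoch_order` as a hypothetical helper that flattens one pass into a list.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: the integers 0..9, one per sample.
dataset = TensorDataset(torch.arange(10))

ordered = DataLoader(dataset, batch_size=4, shuffle=False)
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)

print(type(ordered.sampler).__name__)   # SequentialSampler
print(type(shuffled.sampler).__name__)  # RandomSampler


def epoch_order(loader):
    """Flatten one full pass over the loader into a list of ints."""
    return [int(x) for (batch,) in loader for x in batch]


first = epoch_order(ordered)
second = epoch_order(ordered)
print(first == second)  # True: the order is identical in every epoch

# Shuffling changes the order but still visits every sample exactly once.
print(sorted(epoch_order(shuffled)) == first)  # True
```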
Code
py
from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets
from torchvision import transforms
test_set = datasets.CIFAR10(root='./dataset', train=False, transform=transforms.ToTensor(), download=True)
"""
Parameters worth experimenting with:
1. batch_size: how many images to fetch per batch
2. shuffle: whether the order differs between epochs; False means no shuffling, so the order stays the same
3. drop_last: whether to discard the final incomplete batch; False keeps it and processes it like any other batch
"""
# data = DataLoader(dataset=test_set, batch_size=4, shuffle=False, num_workers=0, drop_last=False)
data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=False)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=True)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=True, num_workers=0, drop_last=True)
writer = SummaryWriter("logs")
for epoch in range(2):
    step = 0
    for one in data:
        imgs, targets = one
        # print(imgs.shape)
        # print(targets)
        # add_images shows a whole batch on one canvas: as many images as imgs contains
        writer.add_images("step_dropFalse: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue_shufTrue: {}".format(epoch), imgs, step)
        step += 1
writer.close()
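In the loop above, each `one` is a pair of stacked tensors. The sketch below shows the shapes on a synthetic stand-in for CIFAR10 (130 all-zero "images", an assumption chosen so nothing has to be downloaded); the last batch is smaller because drop_last=False.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 130 images of shape 3x32x32 with integer labels.
images = torch.zeros(130, 3, 32, 32)
labels = torch.zeros(130, dtype=torch.long)
dataset = TensorDataset(images, labels)

loader = DataLoader(dataset, batch_size=64, shuffle=False, drop_last=False)

# Each iteration yields (imgs, targets): imgs is N x C x H x W, targets has N entries.
shapes = [(tuple(imgs.shape), tuple(targets.shape)) for imgs, targets in loader]
for img_shape, target_shape in shapes:
    print(img_shape, target_shape)
```

The batched N x C x H x W layout is what `add_images` expects by default, which is why the loop can pass `imgs` straight to it.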