Learning PyTorch 6: Using datasets in torchvision

Using datasets in torchvision

  • 1. Using datasets in torchvision
    • Official documentation
    • Note 1: don't forget the parentheses when instantiating ToTensor
    • Note 2: download can always stay True
    • Code
    • Execution results
  • 2. Using DataLoader
    • Official documentation
    • Parameter explanation
    • What DataLoader returns: batches of (imgs, targets)
    • Code
    • Execution results
      • 1. shuffle=False: no reshuffling, data arrives in the same order
      • 2. shuffle=True: reshuffled, data arrives in a different order
      • 3. drop_last=False: samples in the final incomplete batch are kept

Study notes from the 小土堆 (Xiaotudui) video course on Bilibili

1. Using datasets in torchvision

Official documentation

Note the version selector in the top-left corner of the page.

https://pytorch.org/vision/0.9/

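Note 1: don't forget the parentheses when instantiating ToTensor

Don't forget the parentheses when instantiating ToTensor. Without them, Compose receives the class rather than an instance, and indexing into the dataset later raises an error.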

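Note 2: download can always stay True

download can always stay True. After the first download, the dataset is already in the specified directory and the code does not download it again. You can also place a downloaded dataset archive in that directory yourself, and the code will extract it automatically.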

Code

from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

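# Usage 1
# If the download is slow, a download manager such as Xunlei (迅雷) can help; its properties panel shows it fetching from multiple sources, which is faster: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz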
train_set = datasets.CIFAR10(root='./dataset', train=True, download=True)
test_set = datasets.CIFAR10(root='./dataset', train=False, download=True)
# The downloaded dataset holds PIL images; use the debugger to inspect the data
print(test_set[0])  # __getitem__ return img, target
print(type(test_set[0]))
img, target = test_set[0]
print(target)
print(test_set.classes[target])
print(img)
# A PIL image can be displayed directly with show()
img.show()

# Usage 2
# Apply transforms to every sample in the dataset so images come back as tensors
# trans_compose = transforms.Compose([transforms.ToTensor])  # wrong: causes an error later
trans_compose = transforms.Compose([transforms.ToTensor()])
train_set2 = datasets.CIFAR10(root='./dataset', train=True, transform=trans_compose, download=True)
test_set2 = datasets.CIFAR10(root='./dataset', train=False, transform=trans_compose, download=True)
print(type(test_set2[2]))
img, target = test_set2[0]
print(target)
print(test_set2.classes[target])
print(type(img))
writer = SummaryWriter("logs")
for i in range(10):
    img_tensor, target = test_set2[i]
    writer.add_image('tensor dataset', img_tensor, i)
writer.close()

Execution results

> p11_torchvision_dataset.py
Files already downloaded and verified
Files already downloaded and verified
(<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>, 3)
<class 'tuple'>
3
cat
<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>
Files already downloaded and verified
Files already downloaded and verified
<class 'tuple'>
3
cat
<class 'torch.Tensor'>

Process finished with exit code 0

2. Using DataLoader

Official documentation

https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader

Parameter explanation

Parameters:

  • dataset (Dataset) -- dataset from which to load the data.

    Which dataset to load.

  • batch_size (int, optional) -- how many samples per batch to load (default: 1).

    How many samples to fetch per batch.

  • shuffle (bool, optional) -- set to True to have the data reshuffled at every epoch (default: False).

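    When set to False, every epoch returns the data in the same order (the dataset's own order, with no randomness involved). When set to True, the order is reshuffled at every epoch, so each pass sees the data in a different order.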

  • sampler (Sampler or Iterable, optional) -- defines the strategy to draw samples from the dataset. Can be any Iterable with len implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) -- like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) -- how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

    Loads data using multiple processes; when set to 0, the main process does the loading. On Windows, a nonzero value may raise BrokenPipeError.

  • collate_fn (Callable, optional) -- merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) -- If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.

  • drop_last (bool, optional) -- set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

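    When the total number of samples is not evenly divisible by batch_size: True drops the leftover samples so they take no part in training; False keeps them as a final, smaller batch. Defaults to False.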

  • timeout (numeric, optional) -- if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Callable, optional) -- If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • generator (torch.Generator, optional) -- If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) -- Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise if value of num_workers>0 default is 2).

  • persistent_workers (bool, optional) -- If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)

  • pin_memory_device (str, optional) -- the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.

    When num_workers > 0 on Windows, a BrokenPipeError may be raised; in that case, try setting num_workers to 0.

https://blog.csdn.net/Ginomica_xyx/article/details/113745596

https://www.ngui.cc/el/1916356.html?action=onClick
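Besides setting num_workers to 0, a commonly recommended pattern for multi-process loading on Windows is to put the code that starts the workers behind an `if __name__ == "__main__":` guard, since worker processes there are started by re-importing the script. The sketch below illustrates that pattern only; it may not resolve every BrokenPipeError. The in-memory `TensorDataset` and the helper name `count_batches` are assumptions made for illustration, not part of the original notes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def count_batches(num_workers):
    # Tiny in-memory stand-in dataset (assumption): 8 fake 3x32x32 images with labels.
    dataset = TensorDataset(torch.zeros(8, 3, 32, 32), torch.arange(8))
    loader = DataLoader(dataset, batch_size=4, num_workers=num_workers)
    # Iterating the loader is what actually starts the worker processes.
    return sum(1 for _ in loader)


if __name__ == "__main__":
    # Worker processes may re-import this script, so anything that
    # iterates a multi-worker DataLoader sits behind this guard.
    print(count_batches(num_workers=2))
```

If the error persists, `count_batches(num_workers=0)` runs the same loop in the main process, which matches the workaround described above.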

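What DataLoader returns: batches of (imgs, targets)

A DataLoader is an iterable; each iteration yields one batch, which unpacks into imgs and targets. By default (shuffle=False) samples are drawn by a sequential sampler in dataset order; a random sampler is used only when shuffle=True.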
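A small sketch of what one iteration returns and which sampler is selected. The `TensorDataset` of zero-filled images is a stand-in for the CIFAR-10 test set (an assumption, so nothing has to be downloaded); the shapes mirror CIFAR-10 images after `ToTensor`.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, SequentialSampler, RandomSampler

# Stand-in for the CIFAR-10 test set (assumption): 10 fake 3x32x32 images with labels 0..9.
fake_set = TensorDataset(torch.zeros(10, 3, 32, 32), torch.arange(10))

loader = DataLoader(fake_set, batch_size=4, shuffle=False)

# Each iteration yields one batch; the default collate_fn stacks samples along a new first dimension.
imgs, targets = next(iter(loader))
print(imgs.shape)    # torch.Size([4, 3, 32, 32])
print(targets)       # tensor([0, 1, 2, 3])

# The sampler is chosen from the shuffle flag.
shuffled_loader = DataLoader(fake_set, batch_size=4, shuffle=True)
print(type(loader.sampler).__name__)           # SequentialSampler
print(type(shuffled_loader.sampler).__name__)  # RandomSampler
```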

Code

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets
from torchvision import transforms


test_set = datasets.CIFAR10(root='./dataset', train=False, transform=transforms.ToTensor(), download=True)
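"""
Parameters worth adjusting:
1. batch_size: how many images to fetch at a time
2. shuffle: whether the data order differs between epochs; False means no shuffling, so the order stays the same
3. drop_last: whether to drop the leftover samples that do not fill a batch; False keeps them and processes them like the rest
"""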
# data = DataLoader(dataset=test_set, batch_size=4, shuffle=False, num_workers=0, drop_last=False)
data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=False)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=True)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

writer = SummaryWriter("logs")
for epoch in range(2):
    step = 0
    for one in data:
        imgs, targets = one
        # print(imgs.shape)
        # print(targets)
        """add_images shows multiple images on one canvas; it displays as many images as imgs contains"""
        writer.add_images("step_dropFalse: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue_shufTrue: {}".format(epoch), imgs, step)
        step += 1

writer.close()

Execution results

1. shuffle=False: no reshuffling, data arrives in the same order

2. shuffle=True: reshuffled, data arrives in a different order
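The difference between the two settings can be checked without TensorBoard. The sketch below uses a hypothetical 8-sample `TensorDataset` whose "image" is just its own index (an assumption for illustration), and a helper `epoch_order` that flattens one epoch into a list of indices.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset (assumption): 8 samples, each holding its own index.
fake_set = TensorDataset(torch.arange(8))


def epoch_order(loader):
    # Flatten all batches of one epoch into a list of sample indices.
    return [int(x) for (batch,) in loader for x in batch]


ordered = DataLoader(fake_set, batch_size=4, shuffle=False)
first, second = epoch_order(ordered), epoch_order(ordered)
print(first, second)  # both epochs follow dataset order: 0..7

shuffled = DataLoader(fake_set, batch_size=4, shuffle=True)
for epoch in range(2):
    # A fresh permutation is drawn at the start of every epoch.
    print(epoch, epoch_order(shuffled))
```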

3. drop_last=False: samples in the final incomplete batch are kept
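For the setup used here (the CIFAR-10 test set has 10,000 images and batch_size is 64), the effect of drop_last can be worked out with plain arithmetic. This is only a sketch of the counting; the number of steps logged per epoch by the loop above should match these values.

```python
import math

num_samples = 10_000   # size of the CIFAR-10 test set
batch_size = 64

full_batches, leftover = divmod(num_samples, batch_size)
print(full_batches, leftover)        # 156 16

batches_keep = math.ceil(num_samples / batch_size)  # drop_last=False: final batch holds 16 images
batches_drop = num_samples // batch_size            # drop_last=True: the 16 leftover images are skipped
print(batches_keep, batches_drop)    # 157 156
```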
