Learning PyTorch 6: Using datasets in torchvision

Using datasets in torchvision

  • 1. Using datasets in torchvision
    • Official documentation
    • Note 1: don't forget the parentheses when instantiating ToTensor
    • Note 2: download can stay set to True
    • Code
    • Output
  • 2. Using DataLoader
    • Official documentation
    • Parameter notes
    • What DataLoader returns: two items per batch
    • Code
    • Output
      • 1. shuffle=False: no reshuffling, data arrives in the same order
      • 2. shuffle=True: reshuffled, data arrives in a different order
      • 3. drop_last=False: samples left over from an incomplete batch are kept

Study notes on the 小土堆 (Xiaotudui) video course on Bilibili

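1. Using datasets in torchvision

Official documentation

Check the version selector in the top-left corner of the page so that the docs match your installed torchvision: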

https://pytorch.org/vision/0.9/

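Note 1: don't forget the parentheses when instantiating ToTensor

Write transforms.ToTensor() with parentheses so that an instance is created. If you pass the bare class transforms.ToTensor, indexing into the dataset later raises an error.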

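Note 2: download can stay set to True

You can leave download=True permanently. After the first download the dataset is already in the target directory, so it is not downloaded again. You can also place a manually downloaded archive in that directory and the code will extract it automatically.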

Code

from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets, transforms

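# Usage 1
# If the download is slow, fetch the archive with a download manager such as Xunlei (迅雷), which pulls from multiple sources and is usually faster: https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz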
train_set = datasets.CIFAR10(root='./dataset', train=True, download=True)
test_set = datasets.CIFAR10(root='./dataset', train=False, download=True)
# Samples in the downloaded dataset are PIL images; inspect them in a debugger if needed
print(test_set[0])  # __getitem__ returns (img, target)
print(type(test_set[0]))
img, target = test_set[0]
print(target)
print(test_set.classes[target])
print(img)
# A PIL image can be displayed directly with show()
img.show()

# Usage 2
# Apply transforms to every sample so the dataset returns tensors
# trans_compose = transforms.Compose([transforms.ToTensor])  # wrong: missing parentheses, raises an error later
trans_compose = transforms.Compose([transforms.ToTensor()])
train_set2 = datasets.CIFAR10(root='./dataset', train=True, transform=trans_compose, download=True)
test_set2 = datasets.CIFAR10(root='./dataset', train=False, transform=trans_compose, download=True)
print(type(test_set2[2]))
img, target = test_set2[0]
print(target)
print(test_set2.classes[target])
print(type(img))
writer = SummaryWriter("logs")
for i in range(10):
    img_tensor, target = test_set2[i]
    writer.add_image('tensor dataset', img_tensor, i)
writer.close()

Output

> p11_torchvision_dataset.py
Files already downloaded and verified
Files already downloaded and verified
(<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>, 3)
<class 'tuple'>
3
cat
<PIL.Image.Image image mode=RGB size=32x32 at 0x1CF47DA9E20>
Files already downloaded and verified
Files already downloaded and verified
<class 'tuple'>
3
cat
<class 'torch.Tensor'>

Process finished with exit code 0

2. Using DataLoader

Official documentation

https://pytorch.org/docs/stable/data.html?highlight=dataloader#torch.utils.data.DataLoader

Parameter notes

Parameters:

  • dataset (Dataset) -- dataset from which to load the data.

    Which dataset to load from.

  • batch_size (int, optional) -- how many samples per batch to load (default: 1).

    How many samples to fetch per batch.

  • shuffle (bool, optional) -- set to True to have the data reshuffled at every epoch (default: False).

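    With False, samples are read sequentially, so every epoch yields the data in the same order (no randomness is involved, so the effect resembles fixing a random seed). With True, the order is reshuffled for each epoch.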

  • sampler (Sampler or Iterable, optional) -- defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

  • batch_sampler (Sampler or Iterable, optional) -- like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

  • num_workers (int, optional) -- how many subprocesses to use for data loading. 0 means that the data will be loaded in the main process. (default: 0)

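    Loads data with multiple processes; 0 means the main process does the loading. On Windows, a non-zero value may raise the BrokenPipeError described below.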

  • collate_fn (Callable, optional) -- merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

  • pin_memory (bool, optional) -- If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.

  • drop_last (bool, optional) -- set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

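    When the dataset size is not evenly divisible by batch_size: True discards the leftover samples so they take no part in training; False keeps them as a final, smaller batch. The default is False.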

  • timeout (numeric, optional) -- if positive, the timeout value for collecting a batch from workers. Should always be non-negative. (default: 0)

  • worker_init_fn (Callable, optional) -- If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading. (default: None)

  • generator (torch.Generator, optional) -- If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers. (default: None)

  • prefetch_factor (int, optional, keyword-only arg) -- Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise if value of num_workers>0 default is 2).

  • persistent_workers (bool, optional) -- If True, the data loader will not shutdown the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive. (default: False)

  • pin_memory_device (str, optional) -- the data loader will copy Tensors into device pinned memory before returning them if pin_memory is set to true.

    With num_workers > 0, running on Windows may raise BrokenPipeError; if it does, try setting num_workers to 0.

https://blog.csdn.net/Ginomica_xyx/article/details/113745596

https://www.ngui.cc/el/1916356.html?action=onClick
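
Besides setting num_workers to 0, a commonly suggested mitigation is to keep the loading code behind an `if __name__ == "__main__":` guard, because Windows starts worker processes by launching a fresh interpreter that re-imports the script. The sketch below illustrates that layout under stated assumptions: the helper count_batches and the small TensorDataset of zeros are hypothetical stand-ins, not part of the original notes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def count_batches(num_workers: int) -> int:
    # Hypothetical stand-in data: 10 fake 3x32x32 "images" with integer labels.
    dataset = TensorDataset(torch.zeros(10, 3, 32, 32), torch.arange(10))
    loader = DataLoader(dataset, batch_size=4, num_workers=num_workers)
    return sum(1 for _ in loader)


if __name__ == "__main__":
    # Worker processes on Windows re-import this file; the guard keeps them
    # from re-running the loading code themselves.
    try:
        print(count_batches(num_workers=2))  # 3 batches: 4 + 4 + 2 samples
    except (BrokenPipeError, RuntimeError):
        # Fall back to loading in the main process, as the note above suggests.
        print(count_batches(num_workers=0))
```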

What DataLoader returns: two items per batch

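A DataLoader is iterable, and for this dataset each iteration yields one batch that unpacks into imgs and targets. When no sampler is passed, the sampler is chosen from shuffle: SequentialSampler for shuffle=False (the default) and RandomSampler for shuffle=True.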
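
Both points can be checked on a small in-memory dataset. This is a minimal sketch: the TensorDataset of zeros is a hypothetical stand-in for the CIFAR10 test set, chosen so the example runs without a download.

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset

# Hypothetical stand-in for the CIFAR10 test set: 10 fake images with labels 0..9.
dataset = TensorDataset(torch.zeros(10, 3, 32, 32), torch.arange(10))

sequential = DataLoader(dataset, batch_size=4)              # shuffle defaults to False
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)

print(type(sequential.sampler).__name__)  # SequentialSampler
print(type(shuffled.sampler).__name__)    # RandomSampler

# Each iteration yields one batch that unpacks into (imgs, targets).
imgs, targets = next(iter(sequential))
print(imgs.shape)  # torch.Size([4, 3, 32, 32])
print(targets)     # tensor([0, 1, 2, 3])
```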

Code

from torch.utils.data import DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchvision import datasets
from torchvision import transforms


test_set = datasets.CIFAR10(root='./dataset', train=False, transform=transforms.ToTensor(), download=True)
"""
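Parameters worth experimenting with:
1. batch_size: how many images to fetch at a time
2. shuffle: whether the order changes between epochs; False means no shuffling, so the order stays the same
3. drop_last: whether to discard the samples left over after the last full batch; False keeps them and processes them like any other batch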
"""
# data = DataLoader(dataset=test_set, batch_size=4, shuffle=False, num_workers=0, drop_last=False)
data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=False)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=False, num_workers=0, drop_last=True)
# data = DataLoader(dataset=test_set, batch_size=64, shuffle=True, num_workers=0, drop_last=True)

writer = SummaryWriter("logs")
for epoch in range(2):
    step = 0
    for one in data:
        imgs, targets = one
        # print(imgs.shape)
        # print(targets)
        # add_images draws several images on one canvas; every image in imgs is shown
        writer.add_images("step_dropFalse: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue: {}".format(epoch), imgs, step)
        # writer.add_images("step_dropTrue_shufTrue: {}".format(epoch), imgs, step)
        step += 1

writer.close()

Output

1. shuffle=False: no reshuffling, data arrives in the same order

2. shuffle=True: reshuffled, data arrives in a different order
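
The effect of shuffle can also be checked numerically rather than visually. The following is a small sketch on a hypothetical dataset whose only field is the sample index (the dataset and the helper epoch_order are assumptions made for illustration); it compares the order in which samples arrive over two epochs.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in dataset: each sample is just its own index, 0..99.
dataset = TensorDataset(torch.arange(100))


def epoch_order(loader):
    # Flatten all batches of one epoch into a single list of sample indices.
    return [int(x) for (batch,) in loader for x in batch]


torch.manual_seed(0)  # makes the shuffled runs reproducible
fixed = DataLoader(dataset, batch_size=10, shuffle=False)
mixed = DataLoader(dataset, batch_size=10, shuffle=True)

fixed_epochs = [epoch_order(fixed) for _ in range(2)]
mixed_epochs = [epoch_order(mixed) for _ in range(2)]

print(fixed_epochs[0] == fixed_epochs[1])          # True: same order every epoch
print(mixed_epochs[0] == mixed_epochs[1])          # False: reshuffled each epoch
print(sorted(mixed_epochs[0]) == fixed_epochs[0])  # True: same samples, new order
```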

3. drop_last=False: samples left over from an incomplete batch are kept
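
The number of steps per epoch follows from simple arithmetic. The CIFAR10 test split has 10,000 images, so with batch_size=64 the last batch is incomplete. The helper below (num_batches is a hypothetical name, written only for this illustration) works through the two cases.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool) -> int:
    # drop_last=True keeps only full batches; otherwise the remainder forms one more batch.
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


num_samples = 10_000  # size of the CIFAR10 test split
batch_size = 64

print(num_batches(num_samples, batch_size, drop_last=False))  # 157
print(num_batches(num_samples, batch_size, drop_last=True))   # 156
print(num_samples % batch_size)  # 16 images in the final, smaller batch
```

With drop_last=False the final step of each epoch therefore shows only 16 images in TensorBoard, while with drop_last=True that step is absent.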
