机器学习 - 准备数据

"Data" in machine learning can be almost anything you can imagine. A table of big Excel spreadsheet, images, videos, audio files, text and more.

机器学习其实可以分为两部分

  1. 将不管是什么data,都转成numbers.
  2. 挑选或者建立一个模型来学习这些numbers as best as possible.

下面是代码展示,创建一个straight line data

python 复制代码
import torch 
from torch import nn  # nn: neural networks. This package contains the building blocks for creating neural networks 
import matplotlib.pyplot as plt 

# Create linear regression parameters
weight = 0.7
bias = 0.3 

# Create data 
start = 0
end = 1
step = 0.02 
X = torch.arange(start, end, step).unsqueeze(dim=1)  # X is features
y = weight * X + bias   # y is labels
print(X[:10])
print(y[:10])

# 结果如下
tensor([[0.0000],
        [0.0200],
        [0.0400],
        [0.0600],
        [0.0800],
        [0.1000],
        [0.1200],
        [0.1400],
        [0.1600],
        [0.1800]])
tensor([[0.3000],
        [0.3140],
        [0.3280],
        [0.3420],
        [0.3560],
        [0.3700],
        [0.3840],
        [0.3980],
        [0.4120],
        [0.4260]])

将上面获取到的数据进行拆分,每部分数据带有不同的意思。

Split Purpose Amount of total data How often is it used?
Training set The model learns from this data (like the course materials you study during the semester) ~60-80% Always
Validation set The model gets tuned on this data (like the practice exam you take before the final exam). ~10-20% Often but not always
Testing set The model gets evaluated on this data to test what it has leanred (like the final exam you take at the end of the semester). ~10-20% Always

When dealing with real-world data, this step is typically done right at the start of a project (the test set should always be kept separate from all other data). Let the model learn on training data and then evaluate the model on test data to get an indication of how well it generalizes to unseen examples.

下面是代码。

python 复制代码
# Create train/test split 
train_split = int(0.8 * len(X))
X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]

# Learn the relationship between X_train and y_train
print(f"X_train length: {len(X_train)}")
print(f"y_train length: {len(y_train)}")
# Learn the relationship between X_test and y_test
print(f"X_test length: {len(X_test)}")
print(f"y_test length: {len(y_test)}")

# 输出如下
X_train length: 40
y_train length: 40
X_test length: 10
y_test length: 10

通过将各个数字显示出来,更直观

python 复制代码
plt.figure(figsize=(10, 7))

# s 代表是散点的大小
plt.scatter(X_train, y_train, c="b", s=4, label="Training data")
plt.scatter(X_test, y_test, c="r", s=4, label="Testing data")

plt.legend(prop={"size": 14})
plt.show()

都看到这了,给个赞呗~

相关推荐
美狐美颜sdk3 小时前
跨平台直播美颜SDK集成实录:Android/iOS如何适配贴纸功能
android·人工智能·ios·架构·音视频·美颜sdk·第三方美颜sdk
DeepSeek-大模型系统教程3 小时前
推荐 7 个本周 yyds 的 GitHub 项目。
人工智能·ai·语言模型·大模型·github·ai大模型·大模型学习
郭庆汝3 小时前
pytorch、torchvision与python版本对应关系
人工智能·pytorch·python
IT古董3 小时前
【第二章:机器学习与神经网络概述】03.类算法理论与实践-(3)决策树分类器
神经网络·算法·机器学习
小雷FansUnion5 小时前
深入理解MCP架构:智能服务编排、上下文管理与动态路由实战
人工智能·架构·大模型·mcp
资讯分享周5 小时前
扣子空间PPT生产力升级:AI智能生成与多模态创作新时代
人工智能·powerpoint
思则变6 小时前
[Pytest] [Part 2]增加 log功能
开发语言·python·pytest
叶子爱分享6 小时前
计算机视觉与图像处理的关系
图像处理·人工智能·计算机视觉
鱼摆摆拜拜6 小时前
第 3 章:神经网络如何学习
人工智能·神经网络·学习
一只鹿鹿鹿6 小时前
信息化项目验收,软件工程评审和检查表单
大数据·人工智能·后端·智慧城市·软件工程