机器学习 - 准备数据

"Data" in machine learning can be almost anything you can imagine. A table of big Excel spreadsheet, images, videos, audio files, text and more.

机器学习其实可以分为两部分

  1. 将不管是什么data,都转成numbers.
  2. 挑选或者建立一个模型来学习这些numbers as best as possible.

下面是代码展示,创建一个straight line data

python 复制代码
import torch 
from torch import nn  # nn: neural networks. This package contains the building blocks for creating neural networks 
import matplotlib.pyplot as plt 

# Create linear regression parameters
weight = 0.7
bias = 0.3 

# Create data 
start = 0
end = 1
step = 0.02 
X = torch.arange(start, end, step).unsqueeze(dim=1)  # X is features
y = weight * X + bias   # y is labels
print(X[:10])
print(y[:10])

# 结果如下
tensor([[0.0000],
        [0.0200],
        [0.0400],
        [0.0600],
        [0.0800],
        [0.1000],
        [0.1200],
        [0.1400],
        [0.1600],
        [0.1800]])
tensor([[0.3000],
        [0.3140],
        [0.3280],
        [0.3420],
        [0.3560],
        [0.3700],
        [0.3840],
        [0.3980],
        [0.4120],
        [0.4260]])

将上面获取到的数据进行拆分,每部分数据带有不同的意思。

Split Purpose Amount of total data How often is it used?
Training set The model learns from this data (like the course materials you study during the semester) ~60-80% Always
Validation set The model gets tuned on this data (like the practice exam you take before the final exam). ~10-20% Often but not always
Testing set The model gets evaluated on this data to test what it has leanred (like the final exam you take at the end of the semester). ~10-20% Always

When dealing with real-world data, this step is typically done right at the start of a project (the test set should always be kept separate from all other data). Let the model learn on training data and then evaluate the model on test data to get an indication of how well it generalizes to unseen examples.

下面是代码。

python 复制代码
# Create train/test split 
train_split = int(0.8 * len(X))
X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]

# Learn the relationship between X_train and y_train
print(f"X_train length: {len(X_train)}")
print(f"y_train length: {len(y_train)}")
# Learn the relationship between X_test and y_test
print(f"X_test length: {len(X_test)}")
print(f"y_test length: {len(y_test)}")

# 输出如下
X_train length: 40
y_train length: 40
X_test length: 10
y_test length: 10

通过将各个数字显示出来,更直观

python 复制代码
plt.figure(figsize=(10, 7))

# s 代表是散点的大小
plt.scatter(X_train, y_train, c="b", s=4, label="Training data")
plt.scatter(X_test, y_test, c="r", s=4, label="Testing data")

plt.legend(prop={"size": 14})
plt.show()

都看到这了,给个赞呗~

相关推荐
ai_xiaogui1 分钟前
AIStarter:全网唯一跨平台桌面AI管理工具,支持Windows、Mac和Linux一键部署
linux·人工智能·macos·跨平台ai项目一键部署工具·comfyui模型库·高效管理2.19tb模型库·一键配置comfyui模型库
未来之窗软件服务8 分钟前
繁花深处:花店建设的时代意义与多元应用—仙盟创梦IDE
人工智能·小程序·仙盟创梦ide·东方仙盟
suke14 分钟前
前脚说为你好,后脚就把你“出卖”!“沉浸式翻译”上演史诗级“神操作”
人工智能·开源·产品
Waitind_22 分钟前
Python数据分析常规步骤整理
开发语言·python·数据分析
这里有鱼汤43 分钟前
Python 虚拟环境是什么?它到底是怎么工作的?
后端·python
知了一笑1 小时前
AI编程:代码多,效果好?
人工智能·大模型·kimi·千问·deepseek
过往入尘土1 小时前
全面了解机器语言之kmeans
人工智能·机器学习·支持向量机
m0_677034351 小时前
机器学习-增加样本、精确率与召回率
人工智能·机器学习
乔公子搬砖1 小时前
计算机视觉全景指南:从OpenCV预处理到YOLOv8实战,解锁多模态AI时代(第五章)
人工智能·opencv·计算机视觉·ai·语言模型·大模型