机器学习 - 准备数据

"Data" in machine learning can be almost anything you can imagine. A table of big Excel spreadsheet, images, videos, audio files, text and more.

机器学习其实可以分为两部分

  1. 将不管是什么data,都转成numbers.
  2. 挑选或者建立一个模型来学习这些numbers as best as possible.

下面是代码展示,创建一个straight line data

python 复制代码
import torch 
from torch import nn  # nn: neural networks. This package contains the building blocks for creating neural networks 
import matplotlib.pyplot as plt 

# Create linear regression parameters
weight = 0.7
bias = 0.3 

# Create data 
start = 0
end = 1
step = 0.02 
X = torch.arange(start, end, step).unsqueeze(dim=1)  # X is features
y = weight * X + bias   # y is labels
print(X[:10])
print(y[:10])

# 结果如下
tensor([[0.0000],
        [0.0200],
        [0.0400],
        [0.0600],
        [0.0800],
        [0.1000],
        [0.1200],
        [0.1400],
        [0.1600],
        [0.1800]])
tensor([[0.3000],
        [0.3140],
        [0.3280],
        [0.3420],
        [0.3560],
        [0.3700],
        [0.3840],
        [0.3980],
        [0.4120],
        [0.4260]])

将上面获取到的数据进行拆分,每部分数据带有不同的意思。

Split Purpose Amount of total data How often is it used?
Training set The model learns from this data (like the course materials you study during the semester) ~60-80% Always
Validation set The model gets tuned on this data (like the practice exam you take before the final exam). ~10-20% Often but not always
Testing set The model gets evaluated on this data to test what it has leanred (like the final exam you take at the end of the semester). ~10-20% Always

When dealing with real-world data, this step is typically done right at the start of a project (the test set should always be kept separate from all other data). Let the model learn on training data and then evaluate the model on test data to get an indication of how well it generalizes to unseen examples.

下面是代码。

python 复制代码
# Create train/test split 
train_split = int(0.8 * len(X))
X_train, y_train = X[:train_split], y[:train_split]
X_test, y_test = X[train_split:], y[train_split:]

# Learn the relationship between X_train and y_train
print(f"X_train length: {len(X_train)}")
print(f"y_train length: {len(y_train)}")
# Learn the relationship between X_test and y_test
print(f"X_test length: {len(X_test)}")
print(f"y_test length: {len(y_test)}")

# 输出如下
X_train length: 40
y_train length: 40
X_test length: 10
y_test length: 10

通过将各个数字显示出来,更直观

python 复制代码
plt.figure(figsize=(10, 7))

# s 代表是散点的大小
plt.scatter(X_train, y_train, c="b", s=4, label="Training data")
plt.scatter(X_test, y_test, c="r", s=4, label="Testing data")

plt.legend(prop={"size": 14})
plt.show()

都看到这了,给个赞呗~

相关推荐
Android出海3 分钟前
Android 15重磅升级:16KB内存页机制详解与适配指南
android·人工智能·新媒体运营·产品运营·内容运营
cyyt4 分钟前
深度学习周报(9.1~9.7)
人工智能·深度学习
悟乙己5 分钟前
使用 Python 中的强化学习最大化简单 RAG 性能
开发语言·python·agent·rag·n8n
聚客AI7 分钟前
🌸万字解析:大规模语言模型(LLM)推理中的Prefill与Decode分离方案
人工智能·llm·掘金·日新计划
max5006009 分钟前
图像处理:实现多图点重叠效果
开发语言·图像处理·人工智能·python·深度学习·音视频
AI原吾21 分钟前
玩转物联网只需十行代码,可它为何悄悄停止维护
python·物联网·hbmqtt
麦麦麦造23 分钟前
国外网友的3个步骤,实现用Prompt来写Prompt!超简单!
人工智能
云动雨颤28 分钟前
Python单元测试入门:3个核心断言方法,帮你快速定位代码bug
python·单元测试
闲看云起36 分钟前
从BERT到T5:为什么说T5是NLP的“大一统者”?
人工智能·语言模型·transformer
SunnyDays10111 小时前
Python 实现 HTML 转 Word 和 PDF
python·html转word·html转pdf·html转docx·html转doc