Graph Neural Networks: PubMed

PubMed

1. Imports:

import time

import torch
from torch_geometric.datasets import Planetoid
from torch_geometric.transforms import NormalizeFeatures
import matplotlib.pyplot as plt
import torch.nn as nn
from torch_geometric.nn import GCNConv
import torch.nn.functional as F

2. Load the dataset

dataset = Planetoid(root='data/Planetoid',name='PubMed',transform=NormalizeFeatures())
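
The `NormalizeFeatures()` transform passed above rescales each node's feature vector so that its entries sum to one. Below is a rough torch-only sketch of that row normalization on a made-up 3x3 feature matrix. The helper name `row_normalize` is hypothetical, and the library's handling of edge cases may differ from this version.

```python
import torch

def row_normalize(x: torch.Tensor) -> torch.Tensor:
    # Divide every row by its sum; the clamp avoids division by zero
    # for rows that contain only zeros.
    row_sum = x.sum(dim=-1, keepdim=True).clamp(min=1e-12)
    return x / row_sum

# Made-up features for three nodes; the last row is all zeros on purpose.
x = torch.tensor([[1.0, 1.0, 2.0],
                  [0.0, 3.0, 1.0],
                  [0.0, 0.0, 0.0]])
x_norm = row_normalize(x)
print(x_norm)              # each non-zero row now sums to 1
print(x_norm.sum(dim=-1))
```

Normalizing per row keeps nodes with many non-zero features from dominating the aggregated neighborhood representation.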

Inspect the dataset

print(f'Dataset: {dataset}')
print('==================')
print(f'Number of graphs:{len(dataset)}')
print(f'Number of features:{dataset.num_features}')
print(f'Number of classes:{dataset.num_classes}')

data = dataset[0]  # PubMed contains a single graph

print()
# Data(x=[19717, 500], edge_index=[2, 88648], y=[19717], train_mask=[19717], val_mask=[19717], test_mask=[19717])
print(data)
print('=====================================================')
print(f'Number of nodes:{data.num_nodes}')
print(f'Number of edges:{data.num_edges}')
print(f'Average node degree: {data.num_edges / data.num_nodes:.2f}')
print(f'Number of training nodes: {data.train_mask.sum()}')
print(f'Training node label rate: {int(data.train_mask.sum()) / data.num_nodes:.2f}')
print(f'Has isolated nodes: {data.has_isolated_nodes()}')
print(f'Has self-loops:{data.has_self_loops()}')
print(f'Is undirected:{data.is_undirected()}')

# train_mask is a boolean tensor with one entry per node; True marks a training node.
# Summing a mask therefore counts the nodes that belong to that split.
print(int(data.train_mask.sum()))
print(int(data.val_mask.sum()))
print(int(data.test_mask.sum()))
print('===========================')

Output:

Dataset: PubMed()
Number of graphs:1
Number of features:500
Number of classes:3
Data(x=[19717, 500], edge_index=[2, 88648], y=[19717], train_mask=[19717], val_mask=[19717], test_mask=[19717])
Number of nodes:19717
Number of edges:88648
Average node degree: 4.50
Number of training nodes: 60
Training node label rate: 0.00
Has isolated nodes: False
Has self-loops:False
Is undirected:True
60
500
1000
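
The label rate printed as 0.00 is a rounding effect: only 60 of 19717 nodes are labeled for training. A small sketch using the numbers from the output above makes this visible. It also assumes, as PyG does for undirected graphs, that every edge is stored once per direction in `edge_index`.

```python
num_nodes = 19717
num_edges = 88648   # entries in edge_index, as reported above
num_train = 60

label_rate = num_train / num_nodes
print(f'{label_rate:.2f}')   # 0.00   (two decimals hide the value)
print(f'{label_rate:.4f}')   # 0.0030
print(f'{label_rate:.2%}')   # 0.30%

# Assumption: each undirected edge appears twice (i -> j and j -> i).
undirected_edges = num_edges // 2
print(undirected_edges)      # 44324
```

So roughly 0.3% of the nodes carry a training label, which is why this benchmark is usually described as semi-supervised.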

3. Count the classes and plot the distribution

# Count how many nodes belong to each class.
# torch.unique deduplicates the labels in data.y (shape [19717]) and, with
# return_counts=True, also returns how often each unique label occurs.
# Example: data.y = torch.tensor([0, 1, 0, 2, 1, 2, 2, 0])
# unique labels:   unique_labels = tensor([0, 1, 2])
# occurrences:     counts        = tensor([3, 2, 3])
unique_labels,counts = torch.unique(data.y,return_counts=True)

# Print the number of samples per class
for label,count in zip(unique_labels,counts):
    print(f'Class {label.item()}: {count.item()} samples')

# Visualize the class distribution as a bar chart
plt.bar(unique_labels.numpy(),counts.numpy())
plt.xlabel('labels')
plt.ylabel('counts')

# Write the count above each bar
for label,count in zip(unique_labels,counts):
    plt.text(label.item(),count.item(),str(count.item()),ha='center',va='bottom')
# plt.show()  # uncomment to display the figure
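
To check the behavior of `torch.unique` without loading the dataset, the toy labels from the comment above can be run directly. This sketch also derives class proportions, which is an addition for illustration and not part of the original script.

```python
import torch

# Toy labels taken from the comment in the code above
y = torch.tensor([0, 1, 0, 2, 1, 2, 2, 0])

unique_labels, counts = torch.unique(y, return_counts=True)
proportions = counts.float() / counts.sum()

for label, count, p in zip(unique_labels.tolist(),
                           counts.tolist(),
                           proportions.tolist()):
    print(f'class {label}: {count} samples ({p:.1%})')
```

The proportions give a quick read on class imbalance, which matters when interpreting plain accuracy later on.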

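4. Define the model:

class GCN(nn.Module):
    def __init__(self,output_channels=3):
        super(GCN,self).__init__()

        # 500 input features -> 32 hidden units -> one score per class
        self.conv1 = GCNConv(500,32)
        self.conv2 = GCNConv(32,output_channels)

        self.dp = nn.Dropout(0.5)
    def forward(self,x,edge_index):
        x = self.conv1(x,edge_index)
        x = F.relu(x)
        x = self.dp(x)  # dropout is only active in training mode
        x = self.conv2(x,edge_index)
        return x  # raw class scores (logits); CrossEntropyLoss applies softmax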
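
For intuition about what each graph convolution does, the propagation rule usually associated with GCN is H' = D^{-1/2} (A + I) D^{-1/2} X W, where A is the adjacency matrix, I adds self-loops, and D is the degree matrix of A + I. The following is a dense toy sketch of that rule on a three-node path graph. The helper `dense_gcn_layer` is hypothetical, omits the bias term, and is not how the library implements `GCNConv`, which works on the sparse edge list.

```python
import torch

def dense_gcn_layer(x, edge_index, weight):
    """Toy dense version of one GCN propagation step (hypothetical helper)."""
    n = x.size(0)
    adj = torch.zeros(n, n)
    adj[edge_index[0], edge_index[1]] = 1.0   # build A from the edge list
    adj = adj + torch.eye(n)                  # A + I: add self-loops
    deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)   # diagonal of D^{-1/2}
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
    adj_norm = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
    return adj_norm @ x @ weight

# Path graph 0 - 1 - 2, each edge stored in both directions
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
x = torch.eye(3)        # identity features, so the output equals the normalized adjacency
weight = torch.eye(3)   # identity weights, for the same reason
out = dense_gcn_layer(x, edge_index, weight)
print(out)
```

With identity features and weights, the printed matrix is the normalized adjacency itself: each node mixes its own features with those of its neighbors, weighted by the degrees of both endpoints.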

5. Define the training and test functions

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GCN().to(device)
optimizer = torch.optim.Adam(model.parameters(),lr=0.01,weight_decay=1e-4)
criterion = nn.CrossEntropyLoss(reduction='mean').to(device)
data = data.to(device)

def train():
    model.train() # training mode (dropout enabled)
    optimizer.zero_grad()
    # Full-graph forward pass: every node takes part in message passing
    train_outputs = model(data.x,data.edge_index)
    # The loss is computed on the labeled training nodes only
    loss = criterion(train_outputs[data.train_mask],data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad() # no gradients are needed for evaluation
def test():
    model.eval() # evaluation mode (dropout disabled)
    outputs = model(data.x,data.edge_index)
    # Predicted class = index of the highest score
    preds = outputs.argmax(dim=1)
    accs = []
    for mask in [data.train_mask,data.val_mask,data.test_mask]:
        correct = preds[mask] == data.y[mask]
        accs.append(int(correct.sum()) / int(mask.sum()))
    return accs

A quick note:

How should the test function above be read?

Take a toy example:

data = {
    'x': torch.randn(10, 5), # 10 nodes, 5 features each
    'edge_index': torch.tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
                                [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]]), # 10 edges forming a ring
    'y': torch.tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]), # labels of the 10 nodes
    'train_mask': torch.tensor([True, True, True, True, False, False, False, False, False, False]), # first 4 nodes: training set
    'val_mask': torch.tensor([False, False, False, False, True, True, True, False, False, False]), # next 3 nodes: validation set
    'test_mask': torch.tensor([False, False, False, False, False, False, False, True, True, True]) # last 3 nodes: test set
}

Note that train_mask, val_mask and test_mask all have the same shape, with one entry per node. They differ only in which positions are set to True.

Suppose that outputs = model(data.x,data.edge_index) returns:

torch.tensor([
    [0.1, 0.9], [0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.6, 0.4],
    [0.5, 0.5], [0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]
]) # shape (10, 2): one score per class for each node, shown here as probabilities

Taking preds = outputs.argmax(dim=1) then gives (the tie [0.5, 0.5] resolves to index 0):

preds = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]

For the training set (train_mask):

preds      = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
train_mask = [True, True, True, True, False, False, False, False, False, False]
data.y     = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

preds[mask] keeps the predictions for the training nodes, here the first four: [1, 0, 1, 1].
data.y[mask] keeps the matching training labels: [0, 1, 0, 1].

correct = preds[mask] == data.y[mask] compares them element by element:

correct = [False, False, False, True]

So correct.sum() = 1 and mask.sum() = 4, and accs.append(1/4) records a training accuracy of 0.25.

The validation and test sets are then evaluated in the same way.
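
The walkthrough above can be reproduced as a short runnable sketch. It starts from the `preds` and labels given in the example (rather than from a model) and builds the three masks by slicing, which is an assumption made here for brevity.

```python
import torch

# Predictions and labels from the toy example above
preds = torch.tensor([1, 0, 1, 1, 0, 0, 0, 1, 0, 1])
y = torch.tensor([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Boolean masks: first 4 nodes train, next 3 validation, last 3 test
train_mask = torch.zeros(10, dtype=torch.bool)
val_mask = torch.zeros(10, dtype=torch.bool)
test_mask = torch.zeros(10, dtype=torch.bool)
train_mask[:4] = True
val_mask[4:7] = True
test_mask[7:] = True

accs = []
for mask in [train_mask, val_mask, test_mask]:
    correct = preds[mask] == y[mask]   # element-wise comparison within the split
    accs.append(int(correct.sum()) / int(mask.sum()))

print(accs)   # train 1/4, validation 2/3, test 3/3
```

Because the masks never overlap, each node contributes to exactly one of the three accuracies.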

6. Training:

start_time = time.time()
for epoch in range(101):
    loss = train()
    train_acc,val_acc,test_acc = test()
    if epoch % 10 == 0:  # log every 10th epoch
        print('Epoch#{:03d},Loss:{:.4f},Train_Accuracy:{:.4f},Val_Accuracy:{:.4f},Test_Accuracy:{:.4f}'.format(epoch,loss,train_acc,val_acc,test_acc))

end_time = time.time()
elapsed_time = end_time - start_time
print(f'Elapsed_time:{elapsed_time}  seconds')

Output:

Epoch#000,Loss:1.0993,Train_Accuracy:0.6333,Val_Accuracy:0.6020,Test_Accuracy:0.6020
Epoch#010,Loss:0.9639,Train_Accuracy:0.9167,Val_Accuracy:0.7160,Test_Accuracy:0.7180
Epoch#020,Loss:0.7399,Train_Accuracy:0.9500,Val_Accuracy:0.7520,Test_Accuracy:0.7430
Epoch#030,Loss:0.4626,Train_Accuracy:0.9500,Val_Accuracy:0.7600,Test_Accuracy:0.7520
Epoch#040,Loss:0.3323,Train_Accuracy:0.9500,Val_Accuracy:0.7660,Test_Accuracy:0.7640
Epoch#050,Loss:0.2175,Train_Accuracy:0.9833,Val_Accuracy:0.7820,Test_Accuracy:0.7710
Epoch#060,Loss:0.1686,Train_Accuracy:1.0000,Val_Accuracy:0.7840,Test_Accuracy:0.7740
Epoch#070,Loss:0.1014,Train_Accuracy:1.0000,Val_Accuracy:0.7820,Test_Accuracy:0.7730
Epoch#080,Loss:0.0863,Train_Accuracy:1.0000,Val_Accuracy:0.7880,Test_Accuracy:0.7810
Epoch#090,Loss:0.0832,Train_Accuracy:1.0000,Val_Accuracy:0.7760,Test_Accuracy:0.7780
Epoch#100,Loss:0.0750,Train_Accuracy:1.0000,Val_Accuracy:0.7800,Test_Accuracy:0.7840
Elapsed_time:1.9149479866027832 seconds
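
In this log the training accuracy reaches 1.0 from epoch 60 onward while the validation accuracy stays around 0.78, so the model fits the 60 training nodes far better than it generalizes. One common convention, which the script above does not implement, is to report the test accuracy at the epoch with the best validation accuracy instead of at the last epoch. A minimal sketch over the logged values:

```python
# (epoch, val_acc, test_acc) copied from the log above
log = [
    (0, 0.6020, 0.6020), (10, 0.7160, 0.7180), (20, 0.7520, 0.7430),
    (30, 0.7600, 0.7520), (40, 0.7660, 0.7640), (50, 0.7820, 0.7710),
    (60, 0.7840, 0.7740), (70, 0.7820, 0.7730), (80, 0.7880, 0.7810),
    (90, 0.7760, 0.7780), (100, 0.7800, 0.7840),
]

# Select the logged epoch with the highest validation accuracy
best_epoch, best_val, test_at_best = max(log, key=lambda row: row[1])
print(f'best epoch: {best_epoch}, val: {best_val:.4f}, test: {test_at_best:.4f}')
# best epoch: 80, val: 0.7880, test: 0.7810
```

Only every tenth epoch was logged, so this picks among the printed rows; tracking the best validation score inside the training loop would cover every epoch.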
