On platforms like Yelp, users influencing each other's reviews refers to a social-contagion phenomenon: one user's rating behavior (star rating, review text) can shape how other users perceive and review the same business. This influence can be captured through several factors and features. For analyzing how users influence each other's reviews, the most relevant dataset is:
- yelp_academic_dataset_review.json: contains users' reviews of businesses, including star ratings, review text, and timestamps, and is well suited to studying how reviews propagate between users.

In addition, the other datasets provide useful context:
- yelp_academic_dataset_user.json: basic user information such as user ID, signup date, and friend list, which can be used to build the social network between users.
- yelp_academic_dataset_business.json: basic business information such as business ID, categories, and location, important for understanding a business's attributes and reach.

Primary dataset: yelp_academic_dataset_review.json
Supporting datasets: yelp_academic_dataset_user.json and yelp_academic_dataset_business.json
def load_data():
    # Load the three Yelp datasets (JSON Lines format)
    business_df = pd.read_json('yelp_academic_dataset_business.json', lines=True)
    user_df = pd.read_json('yelp_academic_dataset_user.json', lines=True)
    review_df = pd.read_json('yelp_academic_dataset_review.json', lines=True)
    return business_df, user_df, review_df
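The review file in the Yelp Academic Dataset is several gigabytes, so a single pd.read_json call can exhaust memory. A hedged sketch of streaming it with the chunksize parameter instead, demonstrated on an in-memory sample (the helper name load_reviews_chunked is hypothetical):

```python
import io
import pandas as pd

def load_reviews_chunked(path_or_buf, chunksize=100_000, usecols=None):
    # Stream a JSON Lines file chunk by chunk instead of loading it whole;
    # keeping only the needed columns cuts peak memory further.
    chunks = []
    for chunk in pd.read_json(path_or_buf, lines=True, chunksize=chunksize):
        if usecols is not None:
            chunk = chunk[usecols]
        chunks.append(chunk)
    return pd.concat(chunks, ignore_index=True)

# Demo on an in-memory sample with the same shape as the review file:
sample = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 4, "text": "ok", "date": "2020-01-01"}\n'
    '{"user_id": "u2", "business_id": "b1", "stars": 5, "text": "great", "date": "2020-01-02"}\n'
)
df = load_reviews_chunked(sample, chunksize=1, usecols=['user_id', 'business_id', 'stars'])
print(df.shape)  # (2, 3)
```

The same pattern applies to the user and business files, which are smaller but benefit from dropping unused columns early.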
General feature extraction
def extract_features(user_df, business_df, review_df):
    # User features (.copy() avoids pandas' SettingWithCopyWarning below)
    user_features = user_df[['user_id', 'friends', 'review_count', 'yelping_since']].copy()
    user_features['friend_count'] = user_features['friends'].apply(lambda x: len(x.split(',')) if x else 0)
    # Business features
    business_features = business_df[['business_id', 'categories', 'city', 'state', 'stars']]
    # Review features
    review_features = review_df[['user_id', 'business_id', 'stars', 'text', 'date']]
    # Join reviews with user features
    merged_df = pd.merge(review_features, user_features, on='user_id')
    # Join the result with business features
    final_df = pd.merge(merged_df, business_features, on='business_id')
    return final_df
Key features to sample:

- User features:
  - User ID: uniquely identifies the user.
  - Signup date: how long the user has been active on the platform, which may affect their reviewing behavior.
  - Friend count: the size of the user's social network, which may affect how much information reaches them.
  - Historical review count: the number of past reviews, reflecting how active the user is.
- Business features:
  - Business ID: uniquely identifies the business.
  - Categories: the type of business (restaurant, shop, etc.), which may affect users' rating preferences.
  - Location: where the business is, which may affect who visits and reviews it.
  - Overall rating: the business's average star rating, which may shape users' expectations and their own ratings.
- Review features:
  - Star rating: the user's rating of the business, a direct measure of satisfaction.
  - Review text: the written review, which may carry opinions that influence other users.
  - Timestamp: when the review was posted, useful for analyzing temporal dynamics.

Beyond these we also need a social graph whose nodes are users and whose edges are friendship ties.

- Social-network features:
  - Friendship ties: the friend relationships between users, for analyzing how users base decisions on friends' reviews.
  - Review interactions: other users' votes or replies on a review, reflecting its influence.
def build_social_network(user_df):
    # Build the friendship graph: nodes are users, edges are friendships
    G = nx.Graph()
    # Add a node for every user
    for _, row in user_df.iterrows():
        G.add_node(row['user_id'], review_count=row['review_count'])
    # Add an edge for every friendship
    for _, row in user_df.iterrows():
        friends = row['friends'].split(',') if row['friends'] else []
        for friend in friends:
            if friend:  # skip empty friend IDs
                G.add_edge(row['user_id'], friend.strip())
    return G
- Iterate over each row of user_df with iterrows(); each row represents one user.
- For each user, add a node with G.add_node(), using the unique user_id as the node name and storing the user's review_count as a node attribute.
- Iterate over user_df a second time to add the edges (friendships).
- Get each user's friend list by splitting the friends string on ','; users with no friends get an empty list.
- For each friend, check that the ID is non-empty, then add an edge with G.add_edge() to record the friendship, using strip() to remove surrounding whitespace from the friend ID.

This graph structure supports later analysis such as influence estimation and social diffusion; studying the network reveals how users shape each other's reviewing behavior through their social ties.
When extracting features, combine the social graph so that each user's friends' average rating (avg_friend_rating) is extracted as well.
def extract_network_features(G, user_df):
    # Per-user degree (friend count in the graph) and average friend rating
    user_degrees = dict(G.degree())
    avg_friend_ratings = {}
    for user in G.nodes:
        friends = list(G.neighbors(user))
        if friends:
            # The Yelp user file stores each user's mean rating as 'average_stars'
            avg_friend_ratings[user] = user_df[user_df['user_id'].isin(friends)]['average_stars'].mean()
        else:
            avg_friend_ratings[user] = 0  # no friends: default the average to 0
    return user_degrees, avg_friend_ratings
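The neighbor-averaging step can be sanity-checked on a toy graph. Note the assumption here that each user's mean rating lives in the Yelp user file's 'average_stars' field; the toy frame below mirrors that:

```python
import pandas as pd
import networkx as nx

# Toy data: u1 is friends with u2 and u3; u4 is isolated.
user_df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u4"],
    "average_stars": [4.0, 3.0, 5.0, 2.0],
})
G = nx.Graph()
G.add_nodes_from(user_df["user_id"])
G.add_edges_from([("u1", "u2"), ("u1", "u3")])

# Same logic as extract_network_features, inlined for the check
avg_friend_ratings = {}
for user in G.nodes:
    friends = list(G.neighbors(user))
    if friends:
        avg_friend_ratings[user] = user_df[user_df["user_id"].isin(friends)]["average_stars"].mean()
    else:
        avg_friend_ratings[user] = 0  # isolated users default to 0

print(avg_friend_ratings["u1"])  # 4.0 (mean of 3.0 and 5.0)
print(avg_friend_ratings["u4"])  # 0 (no friends)
```

Defaulting isolated users to 0 is a modeling choice; imputing the global mean rating instead would avoid conflating "no friends" with "friends who rate everything 0 stars".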
def extract_features(user_df, business_df, review_df, G):
    # User features (.copy() avoids pandas' SettingWithCopyWarning below)
    user_features = user_df[['user_id', 'friends', 'review_count', 'yelping_since']].copy()
    user_features['friend_count'] = user_features['friends'].apply(lambda x: len(x.split(',')) if x else 0)
    # Business features
    business_features = business_df[['business_id', 'categories', 'city', 'state', 'stars']]
    # Review features
    review_features = review_df[['user_id', 'business_id', 'stars', 'text', 'date']]
    # Join reviews with user features
    merged_df = pd.merge(review_features, user_features, on='user_id')
    # Join the result with business features
    final_df = pd.merge(merged_df, business_features, on='business_id')
    # Network features
    user_degrees, avg_friend_ratings = extract_network_features(G, user_df)
    # Map the network features onto the merged DataFrame
    final_df['user_degree'] = final_df['user_id'].map(user_degrees)
    final_df['avg_friend_rating'] = final_df['user_id'].map(avg_friend_ratings)
    return final_df
Define the model architecture
import torch
import torch.nn as nn
import torch.optim as optim
class AdvancedNN(nn.Module):
    def __init__(self):
        super(AdvancedNN, self).__init__()
        self.fc1 = nn.Linear(3, 128)    # input layer -> 128 units
        self.bn1 = nn.BatchNorm1d(128)  # batch normalization
        self.dropout1 = nn.Dropout(0.3) # dropout layer
        self.fc2 = nn.Linear(128, 64)   # hidden layer, 64 units
        self.bn2 = nn.BatchNorm1d(64)   # batch normalization
        self.dropout2 = nn.Dropout(0.3) # dropout layer
        self.fc3 = nn.Linear(64, 32)    # hidden layer, 32 units
        self.bn3 = nn.BatchNorm1d(32)   # batch normalization
        self.dropout3 = nn.Dropout(0.3) # dropout layer
        self.fc4 = nn.Linear(32, 1)     # output layer, single regression output

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = self.dropout1(x)
        x = torch.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        x = torch.relu(self.bn3(self.fc3(x)))
        x = self.dropout3(x)
        x = self.fc4(x)
        return x
Input Layer (3 Features)
|
v
+-----------------+
| Linear (128) | <--- Fully Connected Layer 1
+-----------------+
|
v
+-----------------+
| Batch Norm | <--- Batch Normalization
+-----------------+
|
v
+-----------------+
| ReLU | <--- Activation Function
+-----------------+
|
v
+-----------------+
| Dropout | <--- Dropout Regularization
+-----------------+
|
v
+-----------------+
| Linear (64) | <--- Fully Connected Layer 2
+-----------------+
|
v
+-----------------+
| Batch Norm | <--- Batch Normalization
+-----------------+
|
v
+-----------------+
| ReLU | <--- Activation Function
+-----------------+
|
v
+-----------------+
| Dropout | <--- Dropout Regularization
+-----------------+
|
v
+-----------------+
| Linear (32) | <--- Fully Connected Layer 3
+-----------------+
|
v
+-----------------+
| Batch Norm | <--- Batch Normalization
+-----------------+
|
v
+-----------------+
| ReLU | <--- Activation Function
+-----------------+
|
v
+-----------------+
| Dropout | <--- Dropout Regularization
+-----------------+
|
v
+-----------------+
| Linear (1) | <--- Output Layer
+-----------------+
|
v
Output (Predicted Rating)
A more complex architecture can improve model performance. Some suggestions for this regression task (predicting user ratings):

- Deeper fully connected network: more layers and more units per layer, to capture more complex feature interactions.
- Dropout: Dropout layers in the network reduce overfitting.
- Batch Normalization: Batch Normalization after each hidden layer speeds up training and improves stability.
- Residual Connections: ResNet-style skip connections make deeper networks easier to train.
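The last suggestion can be sketched as a small fully connected residual block. This is a generic ResNet-style sketch, not part of the model above; the class name ResidualBlock is illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A fully connected residual block: out = relu(x + F(x))."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.bn1 = nn.BatchNorm1d(dim)
        self.fc2 = nn.Linear(dim, dim)
        self.bn2 = nn.BatchNorm1d(dim)

    def forward(self, x):
        out = torch.relu(self.bn1(self.fc1(x)))
        out = self.bn2(self.fc2(out))
        return torch.relu(x + out)  # skip connection preserves the input signal

block = ResidualBlock(64)
x = torch.randn(8, 64)
print(block(x).shape)  # torch.Size([8, 64])
```

Because the block maps dim -> dim, several of them can be stacked between the existing input projection and output head without any shape bookkeeping.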
Training code
def train_model(final_df):
    # Feature selection; after the merges, 'stars_y' is the business's average
    # rating and 'stars_x' is the review's own rating (the prediction target)
    features = final_df[['review_count', 'friend_count', 'stars_y']].values
    target = final_df['stars_x'].values
    # Standardize the features
    scaler = StandardScaler()
    features = scaler.fit_transform(features)
    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
    # Convert to tensors
    X_train_tensor = torch.FloatTensor(X_train)
    y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)
    X_test_tensor = torch.FloatTensor(X_test)
    y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)
    # Model, loss, and optimizer (fall back to CPU when no GPU is available)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = AdvancedNN().to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Train (full-batch gradient descent)
    model.train()
    for epoch in range(50):
        optimizer.zero_grad()
        outputs = model(X_train_tensor.to(device))
        loss = criterion(outputs, y_train_tensor.to(device))
        loss.backward()
        optimizer.step()
        print(f'Epoch {epoch+1}, Loss: {loss.item()}')
    # Predict on the held-out set
    model.eval()
    with torch.no_grad():
        predictions = model(X_test_tensor.to(device)).cpu()
    return predictions
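Full-batch training as above will not fit in GPU memory once the complete review set is loaded. A hedged sketch of mini-batch training with PyTorch's DataLoader, run here on synthetic stand-in data (shapes and names are illustrative, not the real features):

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for the scaled feature matrix and targets
X = torch.randn(1000, 5)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=256, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1)).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(3):
    epoch_loss = 0.0
    for xb, yb in loader:
        # Only one batch lives on the GPU at a time
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * xb.size(0)
    print(f"epoch {epoch+1}: loss {epoch_loss / len(loader.dataset):.4f}")
```

Swapping AdvancedNN in for the stand-in nn.Sequential requires no other change; only the batches, not the full tensors, need to fit in device memory.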
To print graph statistics, use nx.info(social_network) (removed in NetworkX 3.0, where print(social_network) gives the same one-line summary). Building G is memory-intensive, so saving it as a .pkl file for later reuse is recommended.
The full code discussed above follows.
import pandas as pd
import networkx as nx
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle
from tqdm import tqdm  # progress bars

# Improved neural-network model
class AdvancedNN(nn.Module):
    def __init__(self):
        super(AdvancedNN, self).__init__()
        self.fc1 = nn.Linear(5, 128)    # input layer -> 128 units (now 5 features)
        self.bn1 = nn.BatchNorm1d(128)  # batch normalization
        self.dropout1 = nn.Dropout(0.3) # dropout layer
        self.fc2 = nn.Linear(128, 64)   # hidden layer, 64 units
        self.bn2 = nn.BatchNorm1d(64)   # batch normalization
        self.dropout2 = nn.Dropout(0.3) # dropout layer
        self.fc3 = nn.Linear(64, 32)    # hidden layer, 32 units
        self.bn3 = nn.BatchNorm1d(32)   # batch normalization
        self.dropout3 = nn.Dropout(0.3) # dropout layer
        self.fc4 = nn.Linear(32, 1)     # output layer, single regression output

    def forward(self, x):
        x = torch.relu(self.bn1(self.fc1(x)))
        x = self.dropout1(x)
        x = torch.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        x = torch.relu(self.bn3(self.fc3(x)))
        x = self.dropout3(x)
        x = self.fc4(x)
        return x
def load_data():
    # Load the three Yelp datasets (JSON Lines format)
    business_df = pd.read_json('yelp_academic_dataset_business.json', lines=True)
    user_df = pd.read_json('yelp_academic_dataset_user.json', lines=True)
    review_df = pd.read_json('yelp_academic_dataset_review.json', lines=True)
    return business_df, user_df, review_df
def extract_network_features(G, user_df):
    # Per-user degree (friend count in the graph) and average friend rating
    user_degrees = dict(G.degree())
    avg_friend_ratings = {}
    for user in G.nodes:
        friends = list(G.neighbors(user))
        if friends:
            # The Yelp user file stores each user's mean rating as 'average_stars'
            avg_friend_ratings[user] = user_df[user_df['user_id'].isin(friends)]['average_stars'].mean()
        else:
            avg_friend_ratings[user] = 0  # no friends: default the average to 0
    return user_degrees, avg_friend_ratings
def extract_features(user_df, business_df, review_df, G):
    # User features (.copy() avoids pandas' SettingWithCopyWarning below)
    user_features = user_df[['user_id', 'friends', 'review_count', 'yelping_since']].copy()
    user_features['friend_count'] = user_features['friends'].apply(lambda x: len(x.split(',')) if x else 0)
    # Business features
    business_features = business_df[['business_id', 'categories', 'city', 'state', 'stars']]
    # Review features
    review_features = review_df[['user_id', 'business_id', 'stars', 'text', 'date']]
    # Join reviews with user features
    merged_df = pd.merge(review_features, user_features, on='user_id')
    # Join the result with business features
    final_df = pd.merge(merged_df, business_features, on='business_id')
    # Network features
    user_degrees, avg_friend_ratings = extract_network_features(G, user_df)
    # Map the network features onto the merged DataFrame
    final_df['user_degree'] = final_df['user_id'].map(user_degrees)
    final_df['avg_friend_rating'] = final_df['user_id'].map(avg_friend_ratings)
    return final_df
def save_network(G, filename):
with open(filename, 'wb') as f:
pickle.dump(G, f)
def load_network(filename):
with open(filename, 'rb') as f:
G = pickle.load(f)
return G
def build_social_network(user_df):
    # Build the friendship graph: nodes are users, edges are friendships
    G = nx.Graph()
    # Add a node for every user
    for _, row in tqdm(user_df.iterrows(), total=user_df.shape[0], desc="Building Social Network"):
        G.add_node(row['user_id'], review_count=row['review_count'])
    # Add an edge for every friendship
    for _, row in tqdm(user_df.iterrows(), total=user_df.shape[0], desc="Adding Edges"):
        friends = row['friends'].split(',') if row['friends'] else []
        for friend in friends:
            if friend:  # skip empty friend IDs
                G.add_edge(row['user_id'], friend.strip())
    return G
def train_model(final_df):
    # Feature selection; after the merges, 'stars_y' is the business's average
    # rating and 'stars_x' is the review's own rating (the prediction target)
    features = final_df[['review_count', 'friend_count', 'stars_y', 'user_degree', 'avg_friend_rating']].values
    target = final_df['stars_x'].values
    # Standardize the features
    scaler = StandardScaler()
    features = scaler.fit_transform(features)
    # Train/test split
    X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
    # Convert to tensors
    X_train_tensor = torch.FloatTensor(X_train)
    y_train_tensor = torch.FloatTensor(y_train).view(-1, 1)
    X_test_tensor = torch.FloatTensor(X_test)
    y_test_tensor = torch.FloatTensor(y_test).view(-1, 1)
    # Model, loss, and optimizer (fall back to CPU when no GPU is available)
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = AdvancedNN().to(device)
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Train (full-batch gradient descent)
    model.train()
    for epoch in range(50):
        optimizer.zero_grad()
        outputs = model(X_train_tensor.to(device))
        loss = criterion(outputs, y_train_tensor.to(device))
        loss.backward()
        optimizer.step()
        print(f'Epoch {epoch+1}, Loss: {loss.item()}')
    # Predict on the held-out set
    model.eval()
    with torch.no_grad():
        predictions = model(X_test_tensor.to(device)).cpu()
    return predictions
def main():
    # Load the datasets
    business_df, user_df, review_df = load_data()
    # Reuse a previously saved social graph if one exists
    try:
        social_network = load_network('social_network.pkl')
        print("Loaded Social Network from 'social_network.pkl'.")
    except FileNotFoundError:
        # Build the social graph
        social_network = build_social_network(user_df)
        # Save it for later runs
        save_network(social_network, 'social_network.pkl')
        print("Social Network saved as 'social_network.pkl'.")
    # Extract features
    final_df = extract_features(user_df, business_df, review_df, social_network)
    print("Merged DataFrame:")
    print(final_df.head())
    # Train the model and predict
    predictions = train_model(final_df)
    print("Predictions:")
    print(predictions)

if __name__ == "__main__":
    main()