scikit-surprise Recommender Module: Usage Guide

Contents

1. Preface

2. Algorithms

3. Datasets

3.1 Three built-in datasets are available

3.2 Load a dataset from a pandas dataframe

3.3 Load a dataset from a (custom) file

3.4 Load a dataset where folds (for cross-validation) are predefined by some files

4. Prediction

4.1 SVD & load_builtin("ml-100k")

4.2 KNNBasic & load_builtin("ml-100k")

4.3 BaselineOnly & custom dataset

5. Accuracy evaluation


1. Preface

Surprise provides a set of built-in recommendation algorithms, along with datasets for practicing with them.

Reference: The model_selection package, Surprise 1 documentation

Installation: pip install scikit-surprise -i https://pypi.org/simple

2. Algorithms

The available prediction algorithms are:

| Algorithm | Description |
|---|---|
| random_pred.NormalPredictor | Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal. |
| baseline_only.BaselineOnly | Algorithm predicting the baseline estimate for given user and item. |
| knns.KNNBasic | A basic collaborative filtering algorithm. |
| knns.KNNWithMeans | A basic collaborative filtering algorithm, taking into account the mean ratings of each user. |
| knns.KNNWithZScore | A basic collaborative filtering algorithm, taking into account the z-score normalization of each user. |
| knns.KNNBaseline | A basic collaborative filtering algorithm taking into account a baseline rating. |
| matrix_factorization.SVD | The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize. |
| matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD taking into account implicit ratings. |
| matrix_factorization.NMF | A collaborative filtering algorithm based on Non-negative Matrix Factorization. |
| slope_one.SlopeOne | A simple yet accurate collaborative filtering algorithm. |
| co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |

3. Datasets

3.1 Three built-in datasets are available

The three built-in datasets (ml-100k, ml-1m and jester) can all be loaded (or downloaded if you haven't already) using the Dataset.load_builtin() method. Summary:

| Method | Description |
|---|---|
| Dataset.load_builtin | Load a built-in dataset. |

Class method:

load_builtin(name='ml-100k', prompt=True)

Example:

from surprise import accuracy, Dataset, SVD
from surprise.model_selection import train_test_split
# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin("ml-100k")
# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)

3.2 Load a dataset from a pandas dataframe.

You can use a custom dataset that is stored in a pandas dataframe.

Class method:

load_from_df(df, reader)

Example:

import pandas as pd
from surprise import Dataset, NormalPredictor, Reader
from surprise.model_selection import cross_validate
# Creation of the dataframe. Column names are irrelevant.
ratings_dict = {
    "itemID": [1, 1, 1, 2, 2],
    "userID": [9, 32, 2, 45, "user_foo"],
    "rating": [3, 2, 4, 3, 1],
}
df = pd.DataFrame(ratings_dict)
# A reader is still needed but only the rating_scale param is required.
reader = Reader(rating_scale=(1, 5))
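# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)
# The dataset can now be used as usual, e.g. in a 2-fold cross-validation.
cross_validate(NormalPredictor(), data, cv=2)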

3.3 Load a dataset from a (custom) file.

Class method:

load_from_file(file_path, reader)

Use this if you want to use a custom dataset and all of the ratings are stored in one file. You then split the dataset yourself, for example with train_test_split or a cross-validation iterator from surprise.model_selection.

Parameters:

  • file_path (string) -- The path to the file containing ratings.

  • reader (Reader) -- A reader to read the file.

Example:

import os
from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import cross_validate
# path to dataset file
file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")
# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format="user item rating timestamp", sep="\t")
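data = Dataset.load_from_file(file_path, reader=reader)
# The dataset can now be used as usual, e.g. with cross_validate.
cross_validate(BaselineOnly(), data, verbose=True)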

3.4 Load a dataset where folds (for cross-validation) are predefined by some files.

Class method:

load_from_folds(folds_files, reader)

The purpose of this method is to cover a common use case where a dataset is already split into predefined folds, such as the movielens-100k dataset, which defines files u1.base, u1.test, u2.base, u2.test, etc. It can also be used when you don't want to perform cross-validation but still want to specify your training and testing data (which comes down to 1-fold cross-validation anyway).

Parameters:

  • folds_files (iterable of tuples) -- The list of the folds. A fold is a tuple of the form (path_to_train_file, path_to_test_file).

  • reader (Reader) -- A reader to read the files.

class surprise.dataset.DatasetAutoFolds(ratings_file=None, reader=None, df=None)

A derived class from Dataset for which folds (for cross-validation) are not predefined (or for when there are no folds at all).

build_full_trainset()

Do not split the dataset into folds; just return a trainset built from the whole dataset.

After fitting an algorithm on this trainset, you can query it for predictions.

4. Prediction

4.1 SVD & load_builtin("ml-100k")

from surprise import accuracy, Dataset, SVD
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset (download it if needed).
data = Dataset.load_builtin("ml-100k")

# Sample a random trainset and testset;
# the test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=0.25)

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset.
algo.fit(trainset)
predictions = algo.test(testset)  # test() takes a whole testset

# Evaluate accuracy (RMSE) on the test predictions.
accuracy.rmse(predictions)

# predict() handles a single sample. Raw ids in ml-100k are strings;
# r_ui is the (optional) true rating.
uid, iid = str(196), str(302)
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

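4.2 KNNBasic & load_builtin("ml-100k")

from surprise import Dataset, KNNBasic

# Load the movielens-100k dataset.
data = Dataset.load_builtin("ml-100k")

# Retrieve the trainset.
trainset = data.build_full_trainset()

# Build an algorithm, and train it.
algo = KNNBasic()
algo.fit(trainset)

# algo.test(testset) would predict a whole testset;
# algo.predict(uid, iid) predicts a single (user, item) pair.
uid, iid = str(196), str(302)  # raw ids are strings in ml-100k
pred = algo.predict(uid, iid, r_ui=4, verbose=True)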

4.3 BaselineOnly & custom dataset

import os

from surprise import BaselineOnly, Dataset, Reader
from surprise.model_selection import train_test_split

# Path to the dataset file.
file_path = os.path.expanduser("~/.surprise_data/ml-100k/ml-100k/u.data")

# As we're loading a custom dataset, we need to define a reader. In the
# movielens-100k dataset, each line has the following format:
# 'user item rating timestamp', separated by '\t' characters.
reader = Reader(line_format="user item rating timestamp", sep="\t")
data = Dataset.load_from_file(file_path, reader=reader)
trainset, testset = train_test_split(data, test_size=0.25)

# fit() returns the algorithm itself, so the calls can be chained.
algo = BaselineOnly()
predictions = algo.fit(trainset).test(testset)

# algo.predict(uid, iid) predicts a single (user, item) pair.

5. Accuracy evaluation

Available accuracy metrics:

| Metric | Description |
|---|---|
| rmse | Compute RMSE (Root Mean Squared Error). |
| mse | Compute MSE (Mean Squared Error). |
| mae | Compute MAE (Mean Absolute Error). |
| fcp | Compute FCP (Fraction of Concordant Pairs). |

accuracy.rmse(predictions, verbose=True)  # RMSE
accuracy.mae(predictions, verbose=True)  # MAE
accuracy.mse(predictions, verbose=True)  # MSE
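
For reference, the three error metrics above can be reproduced by hand from the true ratings (r_ui) and the estimates (est) of a list of predictions. The sketch below does this in plain Python on made-up (true, estimated) pairs; it is an illustration of the formulas, not Surprise's own implementation.

```python
import math

# Hypothetical (true rating, estimated rating) pairs.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]

# Per-prediction error: true rating minus estimate.
errors = [true - est for true, est in pairs]

# MSE: mean of squared errors. RMSE: its square root. MAE: mean absolute error.
mse = sum(e * e for e in errors) / len(errors)
rmse = math.sqrt(mse)
mae = sum(abs(e) for e in errors) / len(errors)

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f}")
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank algorithms differently.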


