ALS Recommendation Algorithm Demo (Python)

Math background: matrices

In summary:

A(m×k) × B(k×n) = C(m×n) (the rules of matrix multiplication are a math detail; it is enough to know that the product is defined, without computing it by hand)

Conversely:

C(m×n) = A(m×k) × B(k×n) (how to split/factorize a matrix is likewise a math detail; it is enough to know that such a decomposition exists)
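To make this concrete, here is a minimal sketch using NumPy (ALS itself computes the factors differently, but the shape of the factorization is the same): a small rating matrix C is split into A (m×k) and B (k×n) via a truncated SVD, and multiplying the factors back together assigns a value to every cell, including the ones that were originally 0 (blank):

import numpy as np

# A small user x movie rating matrix; 0 stands for "not rated yet"
C = np.array([[5.0, 3.0, 0.0, 1.0],
              [4.0, 0.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 5.0]])        # m=3 users, n=4 movies

# A truncated SVD gives one concrete factorization C ≈ A x B with k latent factors
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
A = U[:, :k] * s[:k]                        # m x k user-factor matrix
B = Vt[:k, :]                               # k x n item-factor matrix

print(np.round(A @ B, 2))                   # every cell now has a value, blanks included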

ALS recommendation case study: movie recommendations

Requirements:

The data analyst decides to use Spark MLlib's ALS (Alternating Least Squares) recommendation algorithm. This approach handles the sparse-matrix (SparseMatrix) problem well: even with large numbers of users and products, the computation finishes in a reasonable time. After training on historical data, a model is created.
With the model in hand, recommendations can be made. We designed the following recommendation features,

which can increase how often members watch movies:

Recommend movies of interest to users: for each member, periodically send an SMS or e-mail, or show recommendations when the member logs in, suggesting movies he/she may be interested in.

Recommend interested users for movies: when promoting certain movies, find the members who are likely to be interested in them and send them an SMS or e-mail.

Data sources:

Type 1: explicit rating data

Now that we have users' ratings of movies, we can use ALS, the collaborative-filtering recommendation algorithm based on a latent-factor model that Spark MLlib provides.
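The file data/input/u.data read by the code below is the MovieLens 100k rating file: each line holds four tab-separated fields, userId, movieId, rating, timestamp. Its first lines look like this:

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116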

Type 2: implicit ratings (implicit rating)

Sometimes a site is not designed to ask users to rate a product, but it does record whether a user clicked on one. A click suggests the user may be interested in that product, even though we do not know how many stars they would give it. This is called implicit rating: a 1 means the user showed interest in the product.
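In Spark MLlib, implicit feedback is handled by the same ALS estimator with implicitPrefs=True. Below is a minimal sketch, assuming a hypothetical DataFrame named clicks with columns userId, movieId, clicked, and an illustrative alpha value:

from pyspark.ml.recommendation import ALS

implicit_als = ALS(userCol="userId",
                   itemCol="movieId",
                   ratingCol="clicked",     # 1 = the user clicked the product
                   implicitPrefs=True,      # treat the values as implicit feedback, not star ratings
                   alpha=40.0,              # confidence weight for observed interactions (illustrative value)
                   rank=10,
                   maxIter=10)
# implicit_model = implicit_als.fit(clicks)  # clicks is a hypothetical DataFrame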

The approach

Factorize the rating matrix into two smaller factor matrices:

Then multiply the factors back together to compute values for the blank cells:

This fills in the blanks, but it raises a question: why should the filled-in numbers represent a user's predicted rating for a movie?

The ALS algorithm in Spark MLlib is a collaborative-filtering algorithm based on a latent-factor model. It assumes that:

in the factorization above,

matrix A is the hidden (latent) feature matrix of the users,

matrix B is the hidden (latent) feature matrix of the items,

and a user gives an item a particular rating precisely because the user and the item carry these hidden features.
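In this view, a single predicted rating is simply the dot product of one user's latent-feature vector (a row of A) and one item's latent-feature vector (a column of B). A tiny illustration with made-up numbers:

import numpy as np

user_factors = np.array([1.2, 0.8])     # one user's row of A (made-up values)
movie_factors = np.array([2.0, 1.5])    # one movie's column of B (made-up values)

predicted_rating = float(user_factors @ movie_factors)
print(predicted_rating)                 # 1.2*2.0 + 0.8*1.5 = 3.6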

Writing the code (first in Scala):

import org.apache.spark.SparkContext
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object ALSMovieDemoTest {
  def main(args: Array[String]): Unit = {
    //TODO 0. Set up the environment
    val spark: SparkSession = SparkSession.builder().appName("BatchAnalysis").master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")//keep the shuffle partition count small for this local test; the default is 200, tune it to the cluster size in production
    .getOrCreate()
    val sc: SparkContext = spark.sparkContext
    sc.setLogLevel("WARN")
    import spark.implicits._
    import org.apache.spark.sql.functions._

    //TODO 1. Load and prepare the data
    val fileDS: Dataset[String] = spark.read.textFile("data/input/u.data")
    val ratingDF: DataFrame = fileDS.map(line => {
      val arr: Array[String] = line.split("\t")
      (arr(0).toInt, arr(1).toInt, arr(2).toDouble)
    }).toDF("userId", "movieId", "score")

    val Array(trainSet,testSet) = ratingDF.randomSplit(Array(0.8,0.2))//split into training and test sets at an 8:2 ratio

    //TODO 2. Build and train the ALS recommendation model
    val als: ALS = new ALS()
    .setUserCol("userId") //which column holds the user id
    .setItemCol("movieId") //which column holds the item id
    .setRatingCol("score") //which column holds the rating
    .setRank(10) //the k in C(m×n) = A(m×k) × B(k×n)
    .setMaxIter(10) //maximum number of iterations
    .setAlpha(1.0)//confidence parameter; it only takes effect with implicit feedback (implicitPrefs=true)

    //Train the model on the training set
    val model: ALSModel = als.fit(trainSet)

    //Test the model on the test set
    //val testResult: DataFrame = model.recommendForUserSubset(testSet,5)
    //Compute the model error -- model evaluation
    //......

    //TODO 3. Make recommendations
    val result1: DataFrame = model.recommendForAllUsers(5)//recommend 5 movies to every user
    val result2: DataFrame = model.recommendForAllItems(5)//recommend 5 users for every movie

    val result3: DataFrame = model.recommendForUserSubset(sc.makeRDD(Array(196)).toDF("userId"),5)//recommend 5 movies to the given user
    val result4: DataFrame = model.recommendForItemSubset(sc.makeRDD(Array(242)).toDF("movieId"),5)//recommend 5 users for the given movie

    result1.show(false)
    result2.show(false)
    result3.show(false)
    result4.show(false)
  }
}

The same requirement written in Python:

import os

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import col

# TODO 0. Prepare the environment
# Configure the runtime environment
if __name__ == '__main__':

    os.environ['JAVA_HOME'] = 'C:/Program Files/Java/jdk1.8.0_241'
    # Path to the Hadoop installation (the directory unpacked earlier)
    os.environ['HADOOP_HOME'] = 'D:/hadoop-3.3.1'
    # Path to the Python interpreter of the base environment
    os.environ['PYSPARK_PYTHON'] = 'C:/ProgramData/Miniconda3/python.exe'
    os.environ['PYSPARK_DRIVER_PYTHON'] = 'C:/ProgramData/Miniconda3/python.exe'
    os.environ['HADOOP_USER_NAME'] = 'root'
    os.environ['file.encoding'] = 'UTF-8'

    # Set up the SparkSession
    spark = SparkSession.builder.appName("StreamingAnalysis")\
        .master("local[*]").config("spark.sql.shuffle.partitions","4").getOrCreate()
    sc = spark.sparkContext
    sc.setLogLevel("WARN")

    # TODO 1. Load and prepare the data
    fileDS = spark.read.text("data/input/u.data")
    fileDS.printSchema()
    print(fileDS.take(5))
    ratingDF = fileDS.rdd.map(lambda row: row.value.split("\t")) \
        .map(lambda x: (int(x[0]), int(x[1]), float(x[2]))) \
        .toDF(["userId", "movieId", "score"])

    train_set, test_set = ratingDF.randomSplit([0.8, 0.2])  # split into training and test sets at an 8:2 ratio

    # TODO 2. Build and train the ALS recommendation model
    als = ALS(userCol="userId",    # which column holds the user id
              itemCol="movieId",   # which column holds the item id
              ratingCol="score",   # which column holds the rating
              rank=10,             # the k in C(m×n) = A(m×k) × B(k×n)
              maxIter=10,          # maximum number of iterations
              alpha=1.0)           # confidence parameter, only takes effect with implicitPrefs=True

    # Train the model on the training set
    model = als.fit(train_set)

    # Test the model on the test set
    # test_result = model.recommendForUserSubset(test_set, 5)
    # Compute the model error -- model evaluation (a sketch follows after this script)
    # ...

    # TODO 3. Make recommendations
    result1 = model.recommendForAllUsers(5)  # recommend 5 movies to every user
    result2 = model.recommendForAllItems(5)  # recommend 5 users for every movie

    result3 = model.recommendForUserSubset(spark.createDataFrame([(196,)], ["userId"]), 5)  # recommend 5 movies to the given user
    result4 = model.recommendForItemSubset(spark.createDataFrame([(242,)], ["movieId"]), 5)  # recommend 5 users for the given movie

    result1.show(truncate=False)
    result2.show(truncate=False)
    result3.show(truncate=False)
    result4.show(truncate=False)

    # Stop the Spark session
    spark.stop()
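The model-evaluation step was left as a stub above. A minimal sketch of how it could be filled in, reusing the model and test_set variables from the script (the na.drop() discards rows where ALS returns NaN because a user or movie appears only in the test set):

from pyspark.ml.evaluation import RegressionEvaluator

predictions = model.transform(test_set).na.drop()   # score the held-out ratings, drop cold-start NaNs
evaluator = RegressionEvaluator(metricName="rmse",
                                labelCol="score",
                                predictionCol="prediction")
print("RMSE =", evaluator.evaluate(predictions))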

The final output looks like this:

root
 |-- value: string (nullable = true)

[Row(value='196\t242\t3\t881250949'), Row(value='186\t302\t3\t891717742'), Row(value='22\t377\t1\t878887116'), Row(value='244\t51\t2\t880606923'), Row(value='166\t346\t1\t886397596')]

+------+---------------------------------------------------------------------------------------------+
|userId|recommendations                                                                              |
+------+---------------------------------------------------------------------------------------------+
|12    |[{1643, 5.44792}, {1463, 5.249074}, {1450, 5.1887774}, {64, 5.0688186}, {318, 5.0383205}]    |
|13    |[{1643, 4.8755937}, {814, 4.873669}, {963, 4.7418056}, {867, 4.725667}, {1463, 4.6931405}]   |
|14    |[{1463, 5.1732297}, {1643, 5.1153564}, {1589, 5.0040984}, {1367, 4.984417}, {1524, 4.955745}]|
|18    |[{1643, 5.213776}, {1463, 5.1320825}, {1398, 4.819699}, {483, 4.6260805}, {1449, 4.6111727}] |
|25    |[{1643, 5.449965}, {1589, 5.017608}, {1463, 4.9372115}, {169, 4.6056967}, {963, 4.5825796}]  |
|37    |[{1643, 5.3220835}, {1589, 4.695943}, {1268, 4.610497}, {42, 4.4597883}, {169, 4.4325438}]   |
|38    |[{143, 5.9212527}, {1472, 5.595081}, {1075, 5.4555163}, {817, 5.4316535}, {1463, 5.2957745}] |
|46    |[{1643, 5.9912925}, {1589, 5.490053}, {320, 5.175288}, {958, 5.080977}, {1131, 5.067922}]    |
|50    |[{838, 4.6296134}, {324, 4.6239386}, {962, 4.567323}, {987, 4.5356846}, {1386, 4.5315967}]   |
|52    |[{1643, 5.800831}, {1589, 5.676579}, {1463, 5.6091275}, {1449, 5.2481527}, {1398, 5.164145}] |
|56    |[{1643, 5.2523932}, {1463, 4.8217216}, {174, 4.561838}, {50, 4.5330524}, {313, 4.5247965}]   |
|65    |[{1643, 5.009448}, {1463, 4.977561}, {1450, 4.7058015}, {496, 4.6496506}, {318, 4.6017523}]  |
|67    |[{1589, 6.091304}, {1643, 5.8771777}, {1268, 5.4765506}, {169, 5.2630634}, {645, 5.1223965}] |
|70    |[{1643, 4.903953}, {1463, 4.805949}, {318, 4.3851447}, {50, 4.3817987}, {64, 4.3547297}]     |
|73    |[{1643, 4.8607855}, {1449, 4.804972}, {1589, 4.7613616}, {1463, 4.690458}, {853, 4.6646543}] |
|83    |[{1643, 4.6920056}, {1463, 4.6447496}, {22, 4.567131}, {1278, 4.505245}, {1450, 4.4618435}]  |
|93    |[{1643, 5.4505115}, {1463, 5.016514}, {1160, 4.83699}, {1131, 4.673481}, {904, 4.6326823}]   |
|95    |[{1643, 4.828537}, {1463, 4.8062463}, {318, 4.390673}, {64, 4.388152}, {1064, 4.354666}]     |
|97    |[{1589, 5.1252556}, {963, 5.0905123}, {1643, 5.014373}, {793, 4.8556504}, {169, 4.851328}]   |
|101   |[{1643, 4.410446}, {1463, 4.167996}, {313, 4.1381097}, {64, 3.9999022}, {174, 3.9533536}]    |
+------+---------------------------------------------------------------------------------------------+
only showing top 20 rows

+-------+------------------------------------------------------------------------------------------+
|movieId|recommendations                                                                           |
+-------+------------------------------------------------------------------------------------------+
|12     |[{118, 5.425505}, {808, 5.324106}, {628, 5.2948637}, {173, 5.2587204}, {923, 5.2580886}]  |
|13     |[{928, 4.5580163}, {808, 4.484994}, {239, 4.4301133}, {9, 4.3891873}, {157, 4.256134}]    |
|14     |[{928, 4.7927723}, {686, 4.784753}, {240, 4.771472}, {252, 4.7258406}, {310, 4.719638}]   |
|18     |[{366, 3.5298047}, {270, 3.5042968}, {118, 3.501615}, {115, 3.4122925}, {923, 3.407579}]  |
|25     |[{732, 4.878368}, {928, 4.8120456}, {688, 4.765749}, {270, 4.7419496}, {811, 4.572586}]   |
|37     |[{219, 3.8507814}, {696, 3.5646195}, {366, 3.4811506}, {75, 3.374816}, {677, 3.3565707}]  |
|38     |[{507, 4.79451}, {127, 4.5993023}, {137, 4.4605145}, {849, 4.3109775}, {688, 4.298151}]   |
|46     |[{270, 4.6816626}, {928, 4.5854187}, {219, 4.4919205}, {34, 4.4880714}, {338, 4.484614}]  |
|50     |[{357, 5.366201}, {640, 5.2883763}, {287, 5.244199}, {118, 5.222288}, {507, 5.2122903}]   |
|52     |[{440, 4.7918897}, {565, 4.592798}, {252, 4.5657616}, {697, 4.5496006}, {4, 4.52615}]     |
|56     |[{628, 5.473441}, {808, 5.3515406}, {252, 5.2790856}, {4, 5.197684}, {118, 5.146353}]     |
|65     |[{770, 4.4615817}, {242, 4.3993964}, {711, 4.3992624}, {928, 4.3836145}, {523, 4.365783}] |
|67     |[{887, 4.6947756}, {511, 4.151247}, {324, 4.1026692}, {849, 4.0851464}, {688, 4.0792685}] |
|70     |[{928, 4.661159}, {688, 4.5623326}, {939, 4.527151}, {507, 4.5014353}, {810, 4.4822607}]  |
|73     |[{507, 4.8688984}, {688, 4.810653}, {849, 4.727747}, {810, 4.6686435}, {127, 4.6246667}]  |
|83     |[{939, 5.135272}, {357, 5.12999}, {523, 5.071391}, {688, 5.034591}, {477, 4.9770975}]     |
|93     |[{115, 4.5568433}, {581, 4.5472555}, {809, 4.5035434}, {819, 4.477037}, {118, 4.467347}]  |
|95     |[{507, 5.097106}, {688, 4.974432}, {810, 4.950163}, {849, 4.9388885}, {152, 4.897256}]    |
|97     |[{688, 5.1705074}, {628, 5.0447206}, {928, 4.9556565}, {810, 4.8580494}, {849, 4.8418307}]|
|101    |[{495, 4.624121}, {67, 4.5662155}, {550, 4.5428996}, {472, 4.47312}, {347, 4.4586687}]    |
+-------+------------------------------------------------------------------------------------------+
only showing top 20 rows

+------+------------------------------------------------------------------------------------------+
|userId|recommendations                                                                           |
+------+------------------------------------------------------------------------------------------+
|196   |[{1463, 5.5212154}, {1643, 5.4587097}, {318, 4.763221}, {50, 4.7338095}, {1449, 4.710921}]|
+------+------------------------------------------------------------------------------------------+

+-------+-----------------------------------------------------------------------------------------+
|movieId|recommendations                                                                          |
+-------+-----------------------------------------------------------------------------------------+
|242    |[{928, 5.2815547}, {240, 4.958071}, {147, 4.9559183}, {909, 4.7904325}, {252, 4.7793174}]|
+-------+-----------------------------------------------------------------------------------------+


Process finished with exit code 0