Item-Based Recommendations with Hadoop

Mahout implements Item-Based Collaborative Filtering on top of MapReduce. Here I try running it.

  1. Install Hadoop

  2. Download Mahout and extract it

  3. Prepare the data

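    Download the MovieLens 1M Dataset and extract it to get ratings.dat (one UserID::MovieID::Rating::Timestamp record per line), then run

    sed 's/::\([0-9]\{1,\}\)::\([0-9]\{1\}\)::[0-9]\{1,\}$/,\1,\2/' ratings.dat

    to convert it to the userID,itemID,rating format that the job expects.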
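To check the substitution before converting the whole file, it can be tried on a couple of lines first. This is a minimal sketch, assuming every line of ratings.dat has the UserID::MovieID::Rating::Timestamp layout with a single-digit rating. The function name convert_ratings and the sample values are only illustrative.

```shell
#!/bin/sh
# convert_ratings (hypothetical helper name): read MovieLens-style lines on
# stdin and write userID,itemID,rating lines on stdout.
convert_ratings() {
    # Keep the user ID as-is, turn "::item::rating" into ",item,rating",
    # and drop the trailing "::timestamp".
    sed 's/::\([0-9]\{1,\}\)::\([0-9]\{1\}\)::[0-9]\{1,\}$/,\1,\2/'
}

# Two made-up sample lines in the UserID::MovieID::Rating::Timestamp layout.
printf '%s\n' '1::1193::5::978300760' '1::661::3::978302109' | convert_ratings
```

For the real file, the same function can read ratings.dat and have its output redirected to a file of your choice, for example `convert_ratings < ratings.dat > ratings.csv` (ratings.csv is an arbitrary name).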

  4. Run

    mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output -n 25

    Options:

    MAHOUT-JOB: /home/laxe/apple/mahout/mahout-examples-0.11.0-job.jar
    Job-Specific Options:
    --input (-i) input Path to job input directory.
    --output (-o) output The directory pathname for output.
    --numRecommendations (-n) numRecommendations Number of recommendations per user.
    --usersFile usersFile File of users to recommend for.
    --itemsFile itemsFile File of items to recommend for.
    --filterFile (-f) filterFile File containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user(optional).
    --userItemFile (-uif) userItemFile File containing comma-separated userID,itemID pairs(optional). Used to include only these items into recommendations. Cannot be used together with usersFile or itemsFile.
    --booleanData (-b) booleanData Treat input as without prefvalues.
    --maxPrefsPerUser (-mxp) maxPrefsPerUser Maximum number of preferences considered per user in final recommendation phase.
    --minPrefsPerUser (-mp) minPrefsPerUser Ignore users with less preferences than this in the similarity computation (default: 1).
    --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem Maximum number of similarities considered per item.
    --maxPrefsInItemSimilarity (-mpiis) maxPrefsInItemSimilarity Max number of preferences to consider per user or item in the item similarity computation phase, users or items with more preferences will be sampled down(default: 500).
    --similarityClassname (-s) similarityClassname Name of distributed similarity measures class to instantiate,
    alternatively use one of the predefined similarities([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE])
    --threshold (-tr) threshold Discard item pairs with a similarity value below this.
    --outputPathForSimilarityMatrix (-opfsm) outputPathForSimilarityMatrix Write the item similarity matrix to this path(optional).
    --randomSeed randomSeed Use this seed for sampling.
    --sequencefileOutput Write the output into a Sequence File instead of a text file.
    --help (-h) Print out help.
    --tempDir tempDir Intermediate output directory.
    --startPhase startPhase First phase to run.
    --endPhase endPhase Last phase to run.
    Specify HDFS directories while running on hadoop; else specify local file system directories.
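The note at the end of the option list says that the input and output paths are HDFS directories when the job runs on Hadoop. Below is a rough sketch of the round trip under that assumption. The function names and the three arguments are placeholders. The `userID<TAB>[itemID:score,...]` line layout is what I would expect from the text output, not something the help text states.

```shell
#!/bin/sh
# run_recommender (hypothetical helper): copy the converted ratings into HDFS
# and start the job. Needs `hadoop` and `mahout` on PATH; it is only defined
# here, not called.
# $1 = local CSV file, $2 = HDFS input dir, $3 = HDFS output dir (placeholders).
run_recommender() {
    hadoop fs -mkdir -p "$2" &&
    hadoop fs -put "$1" "$2" &&
    mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i "$2" -o "$3" -n 25
}

# first_recommendation (hypothetical helper): read result lines on stdin and
# print, for each user, the first listed item and its score as
# "userID itemID score".
# Assumes lines look like: userID<TAB>[itemID:score,itemID:score,...]
first_recommendation() {
    tr -d '[]' | awk -F'\t' '{
        split($2, recs, ",")       # recs[1] is the first itemID:score pair
        split(recs[1], pair, ":")  # pair[1] = itemID, pair[2] = score
        print $1, pair[1], pair[2]
    }'
}

# Try the parser on a made-up line; this part does not need Hadoop.
printf '1\t[1566:5.0,1036:4.8]\n' | first_recommendation
```

After a real run, the result files could be piped through the parser with something like `hadoop fs -cat /path/to/desired/output/part-r-* | first_recommendation`. The part-r-* file naming is the usual MapReduce reducer output convention and is also an assumption here.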

