Item-Based Recommendations with Hadoop

Mahout implements item-based collaborative filtering on MapReduce; here I try running it.

  1. Install Hadoop

  2. Download Mahout and unpack it

  3. Prepare the data

    Download the MovieLens 1M Dataset, unpack it to get ratings.dat, and use

    sed 's/::\([0-9]\{1,\}\)::\([0-9]\{1\}\)::[0-9]\{1,\}$/,\1,\2/' ratings.dat

    to convert it to the required userID,itemID,rating format.
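    Before converting the whole file, it can help to check the substitution on a single line. The sketch below wraps the sed expression in a shell function; the function name convert_ratings, the sample line, and the output file name ratings.csv are illustrative assumptions, not part of Mahout or the dataset tooling.

```shell
# Convert "UserID::MovieID::Rating::Timestamp" lines (the ratings.dat layout)
# into "UserID,MovieID,Rating" by rewriting the last three fields.
# convert_ratings is a hypothetical helper name used only in this sketch.
convert_ratings() {
    sed 's/::\([0-9]\{1,\}\)::\([0-9]\{1\}\)::[0-9]\{1,\}$/,\1,\2/'
}

# Try it on one sample line in the ratings.dat layout.
printf '1::1193::5::978300760\n' | convert_ratings
# prints: 1,1193,5

# Whole file (ratings.csv is an assumed output name):
# convert_ratings < ratings.dat > ratings.csv
```

    The pattern only accepts a single-digit rating, which matches the 1 to 5 star scale of this dataset; data with other rating ranges would need a wider pattern.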

  4. Run the job

    mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output -n 25
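    Once the job finishes, the text output can be flattened for further processing. The sketch below assumes each output line has the form userID, a tab, then a bracketed list of itemID:score pairs (for example "1<TAB>[10:4.5,20:3.0]"); check this against your own output before relying on it. The function name flatten_recommendations is hypothetical.

```shell
# Turn lines like "1<TAB>[10:4.5,20:3.0]" into one "user,item,score" row per
# recommendation. The input layout is an assumption about the job's text output.
flatten_recommendations() {
    awk -F'\t' '{
        recs = $2
        gsub(/\[/, "", recs)              # drop the opening bracket
        gsub(/\]/, "", recs)              # drop the closing bracket
        n = split(recs, pairs, ",")       # one element per itemID:score pair
        for (i = 1; i <= n; i++) {
            split(pairs[i], kv, ":")
            print $1 "," kv[1] "," kv[2]
        }
    }'
}

# Example with a made-up line:
printf '1\t[10:4.5,20:3.0]\n' | flatten_recommendations

# Against real job output in HDFS (path taken from the command above):
# hadoop fs -cat /path/to/desired/output/part-* | flatten_recommendations
```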

    Options:

    MAHOUT-JOB: /home/laxe/apple/mahout/mahout-examples-0.11.0-job.jar
    Job-Specific Options:
    --input (-i) input Path to job input directory.
    --output (-o) output The directory pathname for output.
    --numRecommendations (-n) numRecommendations Number of recommendations per user.
    --usersFile usersFile File of users to recommend for.
    --itemsFile itemsFile File of items to recommend for.
    --filterFile (-f) filterFile File containing comma-separated userID,itemID pairs. Used to exclude the item from the recommendations for that user(optional).
    --userItemFile (-uif) userItemFile File containing comma-separated userID,itemID pairs(optional). Used to include only these items into recommendations. Cannot be used together with usersFile or itemsFile.
    --booleanData (-b) booleanData Treat input as without prefvalues.
    --maxPrefsPerUser (-mxp) maxPrefsPerUser Maximum number of preferences considered per user in final recommendation phase.
    --minPrefsPerUser (-mp) minPrefsPerUser Ignore users with less preferences than this in the similarity computation (default: 1).
    --maxSimilaritiesPerItem (-m) maxSimilaritiesPerItem Maximum number of similarities considered per item.
    --maxPrefsInItemSimilarity (-mpiis) maxPrefsInItemSimilarity Max number of preferences to consider per user or item in the item similarity computation phase, users or items with more preferences will be sampled down(default: 500).
    --similarityClassname (-s) similarityClassname Name of distributed similarity measures class to instantiate,
    alternatively use one of the predefined similarities([SIMILARITY_COOCCURRENCE, SIMILARITY_LOGLIKELIHOOD, SIMILARITY_TANIMOTO_COEFFICIENT, SIMILARITY_CITY_BLOCK, SIMILARITY_COSINE, SIMILARITY_PEARSON_CORRELATION, SIMILARITY_EUCLIDEAN_DISTANCE])
    --threshold (-tr) threshold Discard item pairs with a similarity value below this.
    --outputPathForSimilarityMatrix (-opfsm) outputPathForSimilarityMatrix Write the item similarity matrix to this path (optional).
    --randomSeed randomSeed Use this seed for sampling.
    --sequencefileOutput Write the output into a Sequence File instead of a text file.
    --help (-h) Print out help.
    --tempDir tempDir Intermediate output directory.
    --startPhase startPhase First phase to run.
    --endPhase endPhase Last phase to run.
    Specify HDFS directories while running on hadoop; else specify local file system directories.

References
Introduction to Item-Based Recommendations with Hadoop
Distributed Mahout: Item-based Recommendation
