Credit Card Fraud Detection Dataset

Credit Card Fraud Detection

Anonymized credit card transactions labeled as fraudulent or genuine

About Dataset

Context

It is important that credit card companies are able to recognize fraudulent credit card transactions so that customers are not charged for items that they did not purchase.

Content

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.
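As a rough illustration of the layout described above, the sketch below builds the expected column list and recomputes the fraud share from the quoted counts. The helper names (`class_balance`, `fraud_rate_percent`) and the tiny in-memory sample are assumptions made for illustration; they are not part of the dataset or of any official loader.

```python
import csv
import io

# Column layout described in the text: Time, V1..V28, Amount, Class.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def class_balance(csv_file):
    """Count genuine (0) and fraud (1) rows in a CSV file-like object."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = {0: 0, 1: 0}
    for row in reader:
        counts[int(float(row["Class"]))] += 1
    return counts


def fraud_rate_percent(n_fraud, n_total):
    """Share of the positive class, in percent."""
    return 100.0 * n_fraud / n_total


# Figures quoted in the text: 492 frauds out of 284,807 transactions.
rate = fraud_rate_percent(492, 284_807)
print(f"fraud rate: {rate:.4f}%")

# Tiny in-memory stand-in for the real file (all values are made up).
header = ",".join(EXPECTED_COLUMNS)
genuine_row = ",".join(["0"] * 30 + ["0"])
fraud_row = ",".join(["1"] * 30 + ["1"])
sample = io.StringIO("\n".join([header, genuine_row, genuine_row, fraud_row]) + "\n")
sample_counts = class_balance(sample)
print(sample_counts)
```

To use `class_balance` on the real data, pass it an open file handle for the downloaded CSV instead of the in-memory sample.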

Given the class imbalance ratio, we recommend measuring performance with the Area Under the Precision-Recall Curve (AUPRC). Accuracy computed from the confusion matrix is not meaningful for unbalanced classification.
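The following minimal sketch shows why accuracy can mislead on data this skewed. It uses made-up toy labels (2 frauds among 1,000 transactions), not the real dataset, and a hand-written step-wise average precision as one common way to summarize the area under the precision-recall curve. The function names are illustrative; to our knowledge scikit-learn's `sklearn.metrics.average_precision_score` follows the same step-wise definition and would normally be used instead.

```python
def average_precision(y_true, y_score):
    """Step-wise area under the precision-recall curve.

    Sums the precision at each distinct score threshold, weighted by the
    increase in recall at that threshold. Tied scores share one threshold.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(y_score, y_true), key=lambda p: p[0], reverse=True)
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up toy data: 2 frauds among 1,000 transactions.
y_true = [1, 1] + [0] * 998

# A model that never flags fraud still scores very high on accuracy.
always_genuine = [0] * 1000
acc = accuracy(y_true, always_genuine)

# Its constant score carries no ranking information, so average
# precision collapses to the positive-class prevalence.
ap_constant = average_precision(y_true, [0.0] * 1000)

# A scorer that ranks both frauds first reaches the maximum.
ap_perfect = average_precision(y_true, [0.9, 0.8] + [0.1] * 998)

print(acc, ap_constant, ap_perfect)
```

On this toy example the never-flag model gets 99.8% accuracy while catching no fraud at all, whereas its average precision equals the 0.2% prevalence, which is the gap the recommendation above is pointing at.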

Update (03/05/2021)

A simulator for transaction data has been released as part of the practical handbook on Machine Learning for Credit Card Fraud Detection - https://fraud-detection-handbook.github.io/fraud-detection-handbook/Chapter_3_GettingStarted/SimulatedDataset.html. We invite all practitioners interested in fraud detection datasets to also check out this data simulator, and the methodologies for credit card fraud detection presented in the book.

Acknowledgements

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.

Please cite the following works:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015

Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. Learned lessons in credit card fraud detection from a practitioner perspective, Expert systems with applications,41,10,4915-4928,2014, Pergamon

Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

Dal Pozzolo, Andrea. Adaptive Machine learning for credit card fraud detection, ULB MLG PhD thesis (supervised by G. Bontempi)

Carcillo, Fabrizio; Dal Pozzolo, Andrea; Le Borgne, Yann-Aël; Caelen, Olivier; Mazzer, Yannis; Bontempi, Gianluca. Scarff: a scalable framework for streaming credit card fraud detection with Spark, Information fusion,41, 182-194,2018,Elsevier

Carcillo, Fabrizio; Le Borgne, Yann-Aël; Caelen, Olivier; Bontempi, Gianluca. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization, International Journal of Data Science and Analytics, 5,4,285-300,2018,Springer International Publishing

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, Gianluca Bontempi. Deep-Learning Domain Adaptation Techniques for Credit Cards Fraud Detection, INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp 78-88, 2019

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, Gianluca Bontempi. Combining Unsupervised and Supervised Learning in Credit Card Fraud Detection, Information Sciences, 2019

Yann-Aël Le Borgne, Gianluca Bontempi. Reproducible machine Learning for Credit Card Fraud Detection - Practical Handbook

Bertrand Lebichot, Gianmarco Paldino, Wissam Siblini, Liyun He, Frederic Oblé, Gianluca Bontempi. Incremental learning strategies for credit cards fraud detection, International Journal of Data Science and Analytics

