Python NLP: Common Datasets 0.1.022

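MRPC Dataset

Microsoft Research Paraphrase Corpus, about 3,600 sentence pairs (roughly the size of the GLUE training split; the full corpus has 5,801 pairs)

  1. Download: https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e
  2. Format: tab-separated text with a header row; Quality is 1 when the two sentences are paraphrases and 0 otherwise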
Quality	#1 ID	#2 ID	#1 String	#2 String
1	1089874	1089925	PCCW 's chief operating officer , Mike Butcher , and Alex Arena , the chief financial officer , will report directly to Mr So .	Current Chief Operating Officer Mike Butcher and Group Chief Financial Officer Alex Arena will report to So .
1	3019446	3019327	The world 's two largest automakers said their U.S. sales declined more than predicted last month as a late summer sales frenzy caused more of an industry backlash than expected .	Domestic sales at both GM and No. 2 Ford Motor Co. declined more than predicted as a late summer sales frenzy prompted a larger-than-expected industry backlash .
1	1945605	1945824	According to the federal Centers for Disease Control and Prevention ( news - web sites ) , there were 19 reported cases of measles in the United States in 2002 .	The Centers for Disease Control and Prevention said there were 19 reported cases of measles in the United States in 2002 .
0	1430402	1430329	A tropical storm rapidly developed in the Gulf of Mexico Sunday and was expected to hit somewhere along the Texas or Louisiana coasts by Monday night .	A tropical storm rapidly developed in the Gulf of Mexico on Sunday and could have hurricane-force winds when it hits land somewhere along the Louisiana coast Monday night .
0	3354381	3354396	The company didn 't detail the costs of the replacement and repairs .	But company officials expect the costs of the replacement work to run into the millions of dollars .
1	1390995	1391183	The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added .	Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added .
0
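
The rows above can be read with Python's standard csv module. The following is a minimal sketch, not the corpus authors' loader: the helper name read_mrpc is hypothetical, and the sample string holds the header plus two rows copied from the excerpt above.

```python
import csv
import io

# Header plus two rows copied from the excerpt above; "\t" marks the tab separators.
SAMPLE = (
    "Quality\t#1 ID\t#2 ID\t#1 String\t#2 String\n"
    "0\t3354381\t3354396\t"
    "The company didn 't detail the costs of the replacement and repairs .\t"
    "But company officials expect the costs of the replacement work to run into the millions of dollars .\n"
    "1\t1390995\t1391183\t"
    "The settling companies would also assign their possible claims against the underwriters to the investor plaintiffs , he added .\t"
    "Under the agreement , the settling companies will also assign their potential claims against the underwriters to the investors , he added .\n"
)


def read_mrpc(stream):
    """Yield (label, sentence1, sentence2) tuples from an MRPC-style TSV stream.

    read_mrpc is a hypothetical helper name used only for this sketch.
    """
    # QUOTE_NONE stops double quotes inside a sentence from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield int(row["Quality"]), row["#1 String"], row["#2 String"]


pairs = list(read_mrpc(io.StringIO(SAMPLE)))
for label, s1, s2 in pairs:
    print(label, "|", s1[:40], "|", s2[:40])
```

For a file on disk, the same function should work with a handle opened as open(path, encoding="utf-8", newline=""), assuming the file is UTF-8 encoded.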

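XNLI Dataset

A corpus for language translation and cross-lingual classification

  1. Official site: https://github.com/facebookresearch/XNLI
  2. Download: https://dl.fbaipublicfiles.com/XNLI/XNLI-1.0.zip
  3. Note: the data is provided in both JSON format (one object per line) and tab-separated txt format
  4. Data format

txt format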

language	gold_label	sentence1_binary_parse	sentence2_binary_parse	sentence1_parse	sentence2_parse	sentence1	sentence2	promptID	pairID	genre	label1	label2	label3	label4	label5	sentence1_tokenized	sentence2_tokenized	match
ar	neutral					وقال، ماما، لقد عدت للمنزل.	اتصل بأمه حالما أوصلته حافلة المدرسية.	1	1	facetoface	neutral	contradiction	neutral	neutral	neutral	وقال ، ماما ، لقد عدت للمنزل .	اتصل بأمه حالما أوصلته حافلة المدرسية .	True
ar	contradiction					وقال، ماما، لقد عدت للمنزل.	لم ينطق ببنت شفة.	1	2	facetoface	contradiction	contradiction	contradiction	contradiction	contradiction	وقال ، ماما ، لقد عدت للمنزل .	لم ينطق ببنت شفة .	True
ar	entailment					وقال، ماما، لقد عدت للمنزل.	أخبر أمه أنه قد عاد للمنزل.	1	3	facetoface	entailment	entailment	neutral	entailment	entailment	وقال ، ماما ، لقد عدت للمنزل .	أخبر أمه أنه قد عاد للمنزل .	True
ar	neutral	

JSON format

{"annotator_labels": ["neutral", "contradiction", "neutral", "neutral", "neutral"], "genre": "facetoface", "gold_label": "neutral", "language": "ar", "match": "True", "pairID": "1", "promptID": "1", "sentence1": "\u0648\u0642\u0627\u0644\u060c \u0645\u0627\u0645\u0627\u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644.", "sentence1_tokenized": "\u0648\u0642\u0627\u0644 \u060c \u0645\u0627\u0645\u0627 \u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644 .", "sentence2": "\u0627\u062a\u0635\u0644 \u0628\u0623\u0645\u0647 \u062d\u0627\u0644\u0645\u0627 \u0623\u0648\u0635\u0644\u062a\u0647 \u062d\u0627\u0641\u0644\u0629 \u0627\u0644\u0645\u062f\u0631\u0633\u064a\u0629.", "sentence2_tokenized": "\u0627\u062a\u0635\u0644 \u0628\u0623\u0645\u0647 \u062d\u0627\u0644\u0645\u0627 \u0623\u0648\u0635\u0644\u062a\u0647 \u062d\u0627\u0641\u0644\u0629 \u0627\u0644\u0645\u062f\u0631\u0633\u064a\u0629 ."}
{"annotator_labels": ["contradiction", "contradiction", "contradiction", "contradiction", "contradiction"], "genre": "facetoface", "gold_label": "contradiction", "language": "ar", "match": "True", "pairID": "2", "promptID": "1", "sentence1": "\u0648\u0642\u0627\u0644\u060c \u0645\u0627\u0645\u0627\u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644.", "sentence1_tokenized": "\u0648\u0642\u0627\u0644 \u060c \u0645\u0627\u0645\u0627 \u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644 .", "sentence2": "\u0644\u0645 \u064a\u0646\u0637\u0642 \u0628\u0628\u0646\u062a \u0634\u0641\u0629.", "sentence2_tokenized": "\u0644\u0645 \u064a\u0646\u0637\u0642 \u0628\u0628\u0646\u062a \u0634\u0641\u0629 ."}
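
Because the JSON file stores one object per line, it can be parsed line by line with the standard json module. The sketch below assumes that layout; the helper name read_jsonl is hypothetical, and the two sample records are taken from the excerpt above but trimmed to five keys each for brevity.

```python
import json
from collections import Counter

# Two records from the excerpt above, trimmed to five keys each.
# Raw strings keep the \uXXXX escapes intact so that json.loads decodes them.
LINES = [
    r'{"gold_label": "neutral", "language": "ar", "pairID": "1", "promptID": "1", '
    r'"sentence1": "\u0648\u0642\u0627\u0644\u060c \u0645\u0627\u0645\u0627\u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644."}',
    r'{"gold_label": "contradiction", "language": "ar", "pairID": "2", "promptID": "1", '
    r'"sentence1": "\u0648\u0642\u0627\u0644\u060c \u0645\u0627\u0645\u0627\u060c \u0644\u0642\u062f \u0639\u062f\u062a \u0644\u0644\u0645\u0646\u0632\u0644."}',
]


def read_jsonl(lines):
    """Yield one dict per non-empty line (hypothetical helper for this sketch)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_jsonl(LINES))

# Count the gold labels and show that the escaped text decodes to Arabic.
label_counts = Counter(record["gold_label"] for record in records)
print(label_counts)
print(records[0]["sentence1"])
```

An open file handle can be passed in place of LINES, since iterating over a text file also yields one line at a time.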

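SQuAD Dataset

  1. Official site: https://rajpurkar.github.io/SQuAD-explorer/
  2. Download: https://rajpurkar.github.io/SQuAD-explorer/
  3. Note: the test set is not released; to get a test score you submit your model through the official site, and the platform runs it against the test set
  4. Data format: see https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json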

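The data consists of multiple articles.

Each title identifies one article.

Each article contains a list of paragraphs.

Each paragraph has one context (the passage text) and a list of qas.

Each entry in qas has a question, an id, and a list of answers; each answer gives its text and its answer_start position in the context.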
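
The nesting described above can be walked with three loops. This is a minimal sketch rather than the official SQuAD tooling: iter_examples is a hypothetical helper, and the context is a shortened excerpt of the Super_Bowl_50 paragraph, so the answer_start values are recomputed with str.index instead of being copied from the dataset.

```python
# Shortened excerpt of the Super_Bowl_50 context (an assumption made to keep
# the example small); offsets are recomputed for this excerpt.
CONTEXT = (
    "The American Football Conference (AFC) champion Denver Broncos defeated "
    "the National Football Conference (NFC) champion Carolina Panthers."
)

squad = {
    "data": [{
        "title": "Super_Bowl_50",
        "paragraphs": [{
            "context": CONTEXT,
            "qas": [{
                "id": "56be4db0acb8001400a502ec",
                "question": "Which NFL team represented the AFC at Super Bowl 50?",
                "answers": [{
                    "answer_start": CONTEXT.index("Denver Broncos"),
                    "text": "Denver Broncos",
                }],
            }, {
                "id": "56be4db0acb8001400a502ed",
                "question": "Which NFL team represented the NFC at Super Bowl 50?",
                "answers": [{
                    "answer_start": CONTEXT.index("Carolina Panthers"),
                    "text": "Carolina Panthers",
                }],
            }],
        }],
    }]
}


def iter_examples(dataset):
    """Flatten articles -> paragraphs -> qas into one dict per question."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "title": article["title"],
                    "id": qa["id"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    "answers": [answer["text"] for answer in qa["answers"]],
                }


examples = list(iter_examples(squad))
for example in examples:
    print(example["id"], example["question"], example["answers"])
```

With the real file, the squad dictionary would come from json.load on dev-v1.1.json, assuming the file has been downloaded locally.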

Excerpt:

{
	"data": [{
		"title": "Super_Bowl_50",
		"paragraphs": [{
			"context": "Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24\u201310 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the \"golden anniversary\" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as \"Super Bowl L\"), so that the logo could prominently feature the Arabic numerals 50.",
			"qas": [{
				"answers": [{
					"answer_start": 177,
					"text": "Denver Broncos"
				}, {
					"answer_start": 177,
					"text": "Denver Broncos"
				}, {
					"answer_start": 177,
					"text": "Denver Broncos"
				}],
				"question": "Which NFL team represented the AFC at Super Bowl 50?",
				"id": "56be4db0acb8001400a502ec"
			}, {
				"answers": [{
					"answer_start": 249,
					"text": "Carolina Panthers"
				}, {
					"answer_start": 249,
					"text": "Carolina Panthers"
				}, {
					"answer_start": 249,
					"text": "Carolina Panthers"
				}],
				"question": "Which NFL team represented the NFC at Super Bowl 50?",
				"id": "56be4db0acb8001400a502ed"
			}, {
				"answers": [{
					"answer_start": 403,
					"text": "Santa Clara, California"
				}, {
					"answer_start": 355,
					"text": "Levi's Stadium"
				}, {
					"answer_start": 355,
					"text": "Levi's Stadium in the San Francisco Bay Area at Santa Clara, California."
				}],
				"question": "Where did Super Bowl 50 take place?",
				"id": "56be4db0acb8001400a502ee"
			}, {
				"answers": [{
					"answer_start": 177,
					"text": "Denver Broncos"
				}, {
					"answer_start": 177,
					"text": "Denver Broncos"
				}, {
					"answer_start": 177,
					"text": "Denver Broncos"
				}],
				"question": "Which NFL team won Super Bowl 50?",
				"id": "56be4db0acb8001400a502ef"
			}, {
				"answers": [{
					"answer_start": 488,
					"text": "gold"
				}, {
					"answer_start": 488,
					"text": "gold"
				}, {
					"answer_start": 521,
					"text": "gold"
				}],
				"question": "What color was used to emphasize the 50th anniversary of the Super Bowl?",
				"id": "56be4db0acb8001400a502f0"
			}
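
In the excerpt above, answer_start is a character offset into the context string, so slicing the context at that offset should reproduce the answer text. The check below is a small sketch of that idea; answer_span_ok is a hypothetical helper, and the context is a single sentence taken from the paragraph above, with the offset recomputed for that sentence rather than copied from the dataset.

```python
def answer_span_ok(context, answer):
    """Return True if context[answer_start : answer_start + len(text)] equals text."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]


# One sentence from the Super_Bowl_50 context; the offset is recomputed here
# because the sentence is shorter than the full paragraph.
CONTEXT = (
    "The game was played on February 7, 2016, at Levi's Stadium in the "
    "San Francisco Bay Area at Santa Clara, California."
)

good = {"answer_start": CONTEXT.index("Levi's Stadium"), "text": "Levi's Stadium"}
# Shifting the offset by one character simulates a misaligned annotation.
bad = {"answer_start": good["answer_start"] + 1, "text": "Levi's Stadium"}

print(answer_span_ok(CONTEXT, good))
print(answer_span_ok(CONTEXT, bad))
```

A check like this is useful after any preprocessing that changes the context (for example, whitespace normalization), since such changes shift the offsets.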