UCI Heart Disease Data Set: An Introduction

The UCI Heart Disease Data Set is a well-known dataset used for machine learning and statistical analysis, particularly in the context of predicting heart disease. Here is a brief introduction:

UCI Heart dataset download

Overview

The UCI Heart Disease Data Set contains data collected at four locations: Cleveland, Hungary, Switzerland, and VA Long Beach. The goal is to predict the presence of heart disease in a patient from a set of medical attributes.

Attributes

The dataset includes the following attributes:

  1. Age: Age of the patient
  2. Sex: Gender of the patient (1 = male; 0 = female)
  3. CP: Chest pain type (4 values)
  4. Trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
  5. Chol: Serum cholesterol in mg/dl
  6. Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  7. Restecg: Resting electrocardiographic results (values 0, 1, 2)
  8. Thalach: Maximum heart rate achieved
  9. Exang: Exercise-induced angina (1 = yes; 0 = no)
  10. Oldpeak: ST depression induced by exercise relative to rest
  11. Slope: The slope of the peak exercise ST segment
  12. Ca: Number of major vessels (0-3) colored by fluoroscopy
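  13. Thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect); often mislabeled as thalassemia
  14. Target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease present, with higher values indicating greater severity; often collapsed to 0/1 for binary classification)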

Usage

The dataset is commonly used for classification tasks to predict the presence of heart disease. It is available for download from the UCI Machine Learning Repository.
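Because the processed Cleveland file stores the diagnosis as an integer from 0 to 4, a binary "disease present" label usually has to be derived before training a classifier. The snippet below is a minimal sketch of that step. It uses only the five target values that appear in the `data.head()` output shown later in this article, and the name `has_disease` is an illustrative choice, not part of the dataset.

```python
import pandas as pd

# Target values of the first five Cleveland rows, taken from the
# data.head() output printed in this article.
target = pd.Series([0, 2, 1, 0, 0], name="target")

# Collapse severity levels 1-4 into a single "disease present" class.
# `has_disease` is a hypothetical name used only for this sketch.
has_disease = (target > 0).astype(int)

print(has_disease.tolist())  # [0, 1, 1, 0, 0]
```

On the full DataFrame the same expression would be applied to `data["target"]`.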

Example Code

Here is an example of how to load and explore the dataset using Python and pandas:

import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

This code snippet loads the Cleveland subset of the Heart Disease Data Set and displays the first few rows.
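One detail worth handling at load time: the Cleveland file marks missing entries with `?`, which is why pandas reads `ca` and `thal` as object columns in the `data.info()` output shown in this article. The sketch below shows one possible way to deal with that, using the `na_values` parameter of `pd.read_csv`. To stay runnable offline it reads from an in-memory sample rather than the URL; the first two rows come from the output printed in this article, while the third row is made up for illustration.

```python
import io
import pandas as pd

column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# In-memory stand-in for the remote file (an assumption made so the sketch
# runs without network access). Rows 1-2 match the article's printed output;
# row 3 is a hypothetical row with a missing `ca` value.
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,1.0,3.0,130.0,220.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0\n"
)

# Treat "?" as missing so that ca and thal are parsed as numbers.
data = pd.read_csv(sample, names=column_names, na_values="?")

print(data[["ca", "thal"]].dtypes)        # both columns are now float64
print(data[["ca", "thal"]].isna().sum())  # count of missing values per column

# One simple option: drop rows that contain any missing value.
clean = data.dropna()
print(len(clean))
```

To apply this to the real file, pass the `url` from the snippet above in place of `sample`. Dropping rows is only one choice; imputing the missing values is another.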

The dataset described next is separate from the UCI data: it appears to be a simulated dataset used to demonstrate cross-validation techniques. Its structure and variables are explained below:

Dataset Explanation

The dataset is generated using random values and is used to demonstrate various cross-validation techniques in machine learning. It consists of multiple predictors and a response variable.

Variables

  1. Q: A 2D array where each row represents a different predictor, and each column represents a different observation. The values are generated using a Gaussian distribution.
  2. N: The response variable, which is a binary variable with values 'Yes' and 'No'. It is shuffled to create randomness in the dataset.
  3. b: An array that stores the correlation coefficients between the response variable N and each predictor in Q.
  4. Index: An array that stores the indices of the top nc predictors based on their correlation with the response variable.
  5. mydata: A combined dataset that includes the response variable N and the selected predictors from Q.
  6. tt: An array used to create test and train splits for cross-validation.
  7. cv_error: An array to store cross-validation errors.
  8. cv_true: An array to store the true values of cross-validation predictions.
  9. final_cv: An array to store the final cross-validation results.
  10. final_corr: A 3D array to store the correlation coefficients of the selected predictors with the outcome for each cross-validation fold.

These variables are used to perform cross-validation and evaluate the performance of different machine learning models, such as K-Nearest Neighbors (KNN) and Decision Tree classifiers.
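The variable list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the original script: the sizes (200 predictors, 50 observations, `nc = 10`, 5 folds) are hypothetical, the names `Q`, `N`, `b`, `Index`, `mydata`, `tt`, `cv_error` and `final_cv` follow the list, and a hand-written 1-nearest-neighbor rule stands in for the KNN classifier mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical sizes, chosen only for illustration.
n_pred, n_obs, nc, n_folds = 200, 50, 10, 5

# Q: one row per predictor, one column per observation, Gaussian noise.
Q = rng.normal(size=(n_pred, n_obs))

# N: balanced 'Yes'/'No' labels, shuffled so they are independent of Q.
N = rng.permutation(np.repeat(["Yes", "No"], n_obs // 2))
y = (N == "Yes").astype(float)

# b: correlation between the response and each predictor.
b = np.array([np.corrcoef(Q[i], y)[0, 1] for i in range(n_pred)])

# Index: the nc predictors with the largest absolute correlation.
Index = np.argsort(-np.abs(b))[:nc]

# mydata: response in column 0, selected predictors in the other columns.
mydata = np.column_stack([y, Q[Index].T])

def one_nn_predict(X_train, y_train, X_test):
    """Label each test point with the label of its nearest training point."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]

# tt: fold assignment for each observation.
tt = rng.permutation(np.arange(n_obs) % n_folds)

# cv_error: misclassification rate on each held-out fold.
cv_error = np.empty(n_folds)
for k in range(n_folds):
    test = tt == k
    pred = one_nn_predict(mydata[~test, 1:], mydata[~test, 0], mydata[test, 1:])
    cv_error[k] = np.mean(pred != mydata[test, 0])

final_cv = cv_error.mean()
print(f"mean CV error: {final_cv:.2f}")
```

Note that this sketch selects the predictors using all observations before the folds are formed. Since `N` is pure noise, any error rate well below 50% would reflect that selection step rather than real signal; repeating the selection inside each fold is the usual way to avoid this.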

import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

Output:
    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   

   slope   ca thal  target  
0    3.0  0.0  6.0       0  
1    2.0  3.0  3.0       2  
2    2.0  2.0  7.0       1  
3    3.0  0.0  3.0       0  
4    1.0  0.0  3.0       0

# Display summary statistics for the numeric columns
print(data.describe())

Output:
              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.158416  131.689769  246.693069    0.148515   
std      9.038662    0.467299    0.960126   17.599748   51.776918    0.356198   
min     29.000000    0.000000    1.000000   94.000000  126.000000    0.000000   
25%     48.000000    0.000000    3.000000  120.000000  211.000000    0.000000   
50%     56.000000    1.000000    3.000000  130.000000  241.000000    0.000000   
75%     61.000000    1.000000    4.000000  140.000000  275.000000    0.000000   
max     77.000000    1.000000    4.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope      target  
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  
mean     0.990099  149.607261    0.326733    1.039604    1.600660    0.937294  
std      0.994971   22.875003    0.469794    1.161075    0.616226    1.228536  
min      0.000000   71.000000    0.000000    0.000000    1.000000    0.000000  
25%      0.000000  133.500000    0.000000    0.000000    1.000000    0.000000  
50%      1.000000  153.000000    0.000000    0.800000    2.000000    0.000000  
75%      2.000000  166.000000    1.000000    1.600000    2.000000    2.000000  
max      2.000000  202.000000    1.000000    6.200000    3.000000    4.000000

# Print column types and non-null counts
print(data.info())

Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  target    303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
None

import pandas as pd

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

# Display dataset summary
print(data.describe())

# Output dataset information
print(data.info())

# Manually create a description of the dataset
description = """
This dataset contains 303 instances of heart disease data with 14 attributes:
1. age: Age in years
2. sex: Sex (1 = male; 0 = female)
3. cp: Chest pain type (1 to 4)
4. trestbps: Resting blood pressure (in mm Hg)
5. chol: Serum cholesterol in mg/dl
6. fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: Resting electrocardiographic results (0 to 2)
8. thalach: Maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: The slope of the peak exercise ST segment (1 to 3)
12. ca: Number of major vessels (0-3) colored by fluoroscopy
13. thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease severity)
"""
print(description)