UCI Heart Disease Data Set: An Introduction

The UCI Heart Disease Data Set is a well-known dataset used for machine learning and statistical analysis, particularly in the context of predicting heart disease. Here is a brief introduction:

Overview

The UCI Heart Disease Data Set contains data collected from four different locations: Cleveland, Hungary, Switzerland, and the VA Long Beach. The goal is to predict the presence of heart disease in a patient based on various medical attributes.

Attributes

The dataset includes the following attributes:

Age: Age of the patient
Sex: Gender of the patient (1 = male; 0 = female)
CP: Chest pain type (4 values, coded 1 to 4)
Trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
Chol: Serum cholesterol in mg/dl
Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
Restecg: Resting electrocardiographic results (values 0, 1, 2)
Thalach: Maximum heart rate achieved
Exang: Exercise-induced angina (1 = yes; 0 = no)
Oldpeak: ST depression induced by exercise relative to rest
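Slope: The slope of the peak exercise ST segment (values 1 to 3)
Ca: Number of major vessels (0-3) colored by fluoroscopy
Thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect); the name is often misread as thalassemia
Target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease present, with higher values indicating greater severity). Many analyses collapse this to a binary label (0 = no disease; 1 = disease).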
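
In the Cleveland file the target column takes integer values from 0 to 4, so a binary label has to be derived from it before a yes/no classifier can be trained. The following is a minimal sketch, assuming that any value above 0 counts as disease. The helper name binarize_target and the column has_disease are illustrative and not part of the dataset; the five sample rows are the first rows of processed.cleveland.data.

```python
import io

import pandas as pd

COLUMN_NAMES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# First five rows of processed.cleveland.data, inlined so the sketch runs offline.
SAMPLE_ROWS = """\
63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
37.0,1.0,3.0,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
41.0,0.0,2.0,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
"""


def binarize_target(frame):
    """Return a copy with an extra 'has_disease' column (1 if target > 0, else 0)."""
    result = frame.copy()
    # Assumption: every non-zero diagnosis value is treated as "disease present".
    result["has_disease"] = (result["target"] > 0).astype(int)
    return result


sample = pd.read_csv(io.StringIO(SAMPLE_ROWS), names=COLUMN_NAMES)
labeled = binarize_target(sample)
print(labeled[["target", "has_disease"]])
```

Keeping the original target column next to the derived one makes it easy to switch back to the five-level label later.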

Usage

The dataset is commonly used for classification tasks to predict the presence of heart disease. It is available for download from the UCI Machine Learning Repository.

Example Code

Here is an example of how to load and explore the dataset using Python and pandas:

import pandas as pd

# Load the dataset. The file has no header row, so column names are supplied explicitly.
# Missing values are written as "?", which is why ca and thal load as strings.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

This code snippet loads the Cleveland subset of the Heart Disease Data Set and displays the first few rows.

Separately from the UCI data, the dataset described next appears to be a simulated dataset used to demonstrate cross-validation techniques. Its structure and variables are explained below:

Dataset Explanation

The dataset is generated using random values and is used to demonstrate various cross-validation techniques in machine learning. It consists of multiple predictors and a response variable.

Variables

Q: A 2D array where each row represents a different predictor, and each column represents a different observation. The values are generated using a Gaussian distribution.
N: The response variable, which is a binary variable with values 'Yes' and 'No'. It is shuffled to create randomness in the dataset.
b: An array that stores the correlation coefficients between the response variable N and each predictor in Q.
Index: An array that stores the indices of the top nc predictors based on their correlation with the response variable.
mydata: A combined dataset that includes the response variable N and the selected predictors from Q.
tt: An array used to create test and train splits for cross-validation.
cv_error: An array to store cross-validation errors.
cv_true: An array to store the true values of cross-validation predictions.
final_cv: An array to store the final cross-validation results.
final_corr: A 3D array to store the correlation coefficients of the selected predictors with the outcome for each cross-validation fold.

These variables are used to perform cross-validation and evaluate the performance of different machine learning models, such as K-Nearest Neighbors (KNN) and Decision Tree classifiers.
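
The variable list above can be turned into a small runnable sketch. The original simulation code is not shown here, so everything below is an assumption: the sizes (200 predictors, 50 observations, nc = 10, 5 folds), the use of NumPy only, a hand-rolled k-nearest-neighbors vote in place of a library classifier, and the choice to select predictors inside each fold. The names Q, N, b, Index, tt, and cv_error follow the list; the helper functions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes; the source does not state them.
n_predictors, n_obs, nc, n_folds, k = 200, 50, 10, 5, 3

# Q: one row per predictor, one column per observation (Gaussian noise).
Q = rng.normal(size=(n_predictors, n_obs))

# N: balanced binary response, shuffled so that it is unrelated to Q.
N = np.array(["Yes"] * (n_obs // 2) + ["No"] * (n_obs - n_obs // 2))
rng.shuffle(N)
y = (N == "Yes").astype(float)  # numeric copy used for the computations


def top_predictors(Q_part, y_part, n_keep):
    """Indices of the n_keep predictors with the largest absolute correlation to y."""
    Qc = Q_part - Q_part.mean(axis=1, keepdims=True)
    yc = y_part - y_part.mean()
    b = (Qc @ yc) / (np.linalg.norm(Qc, axis=1) * np.linalg.norm(yc))
    return np.argsort(-np.abs(b))[:n_keep]


def knn_predict(X_train, y_train, X_test, n_neighbors):
    """Majority vote among the nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(float)


# tt: a fold label (0 .. n_folds - 1) for every observation.
tt = rng.permutation(np.arange(n_obs) % n_folds)

cv_error = np.zeros(n_folds)
for fold in range(n_folds):
    test, train = tt == fold, tt != fold
    # Select predictors from the training part only, so the held-out fold
    # never influences which predictors are kept.
    Index = top_predictors(Q[:, train], y[train], nc)
    pred = knn_predict(Q[Index][:, train].T, y[train], Q[Index][:, test].T, k)
    cv_error[fold] = np.mean(pred != y[test])

print("per-fold error:", cv_error)
print("mean CV error:", cv_error.mean())
```

Because N is shuffled independently of Q, an honest cross-validation estimate should sit roughly at chance level. Selecting the predictors once on all observations before splitting would typically make the error look much better than it really is.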

Loading the Cleveland subset and printing the first five rows:

import pandas as pd

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

Output:

    age  sex   cp  trestbps   chol  fbs  restecg  thalach  exang  oldpeak  \
0  63.0  1.0  1.0     145.0  233.0  1.0      2.0    150.0    0.0      2.3   
1  67.0  1.0  4.0     160.0  286.0  0.0      2.0    108.0    1.0      1.5   
2  67.0  1.0  4.0     120.0  229.0  0.0      2.0    129.0    1.0      2.6   
3  37.0  1.0  3.0     130.0  250.0  0.0      0.0    187.0    0.0      3.5   
4  41.0  0.0  2.0     130.0  204.0  0.0      2.0    172.0    0.0      1.4   

   slope   ca thal  target  
0    3.0  0.0  6.0       0  
1    2.0  3.0  3.0       2  
2    2.0  2.0  7.0       1  
3    3.0  0.0  3.0       0  
4    1.0  0.0  3.0       0

# Display summary statistics (ca and thal are omitted because they are not numeric)
print(data.describe())

Output:

              age         sex          cp    trestbps        chol         fbs  \
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000   
mean    54.438944    0.679868    3.158416  131.689769  246.693069    0.148515   
std      9.038662    0.467299    0.960126   17.599748   51.776918    0.356198   
min     29.000000    0.000000    1.000000   94.000000  126.000000    0.000000   
25%     48.000000    0.000000    3.000000  120.000000  211.000000    0.000000   
50%     56.000000    1.000000    3.000000  130.000000  241.000000    0.000000   
75%     61.000000    1.000000    4.000000  140.000000  275.000000    0.000000   
max     77.000000    1.000000    4.000000  200.000000  564.000000    1.000000   

          restecg     thalach       exang     oldpeak       slope      target  
count  303.000000  303.000000  303.000000  303.000000  303.000000  303.000000  
mean     0.990099  149.607261    0.326733    1.039604    1.600660    0.937294  
std      0.994971   22.875003    0.469794    1.161075    0.616226    1.228536  
min      0.000000   71.000000    0.000000    0.000000    1.000000    0.000000  
25%      0.000000  133.500000    0.000000    0.000000    1.000000    0.000000  
50%      1.000000  153.000000    0.000000    0.800000    2.000000    0.000000  
75%      2.000000  166.000000    1.000000    1.600000    2.000000    2.000000  
max      2.000000  202.000000    1.000000    6.200000    3.000000    4.000000

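# Print column dtypes and non-null counts.
# info() prints the report itself and returns None, which is why the output ends with "None".
print(data.info())

Output: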

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    float64
 1   sex       303 non-null    float64
 2   cp        303 non-null    float64
 3   trestbps  303 non-null    float64
 4   chol      303 non-null    float64
 5   fbs       303 non-null    float64
 6   restecg   303 non-null    float64
 7   thalach   303 non-null    float64
 8   exang     303 non-null    float64
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    float64
 11  ca        303 non-null    object 
 12  thal      303 non-null    object 
 13  target    303 non-null    int64  
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
None
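
The info() report lists ca and thal as object columns because the UCI file writes missing values as "?", so pandas reads those columns as strings. The sketch below shows one way to handle this with the na_values parameter of pd.read_csv. It runs on a small inline sample: the first two rows come from the head() output above, while the last two rows are made-up values added only to illustrate the "?" marker.

```python
import io

import pandas as pd

COLUMN_NAMES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Rows 1-2: taken from the head() output. Rows 3-4: hypothetical values
# that use "?" for a missing 'ca' or 'thal' entry.
SAMPLE = """\
63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
50.0,0.0,3.0,120.0,210.0,0.0,0.0,160.0,0.0,0.0,1.0,?,3.0,0
58.0,1.0,2.0,125.0,220.0,0.0,0.0,144.0,0.0,0.4,2.0,1.0,?,1
"""

# Without na_values, the "?" markers make both columns load as strings.
raw = pd.read_csv(io.StringIO(SAMPLE), names=COLUMN_NAMES)
print(raw[["ca", "thal"]].dtypes)

# With na_values="?", the markers become NaN and the columns parse as floats.
clean = pd.read_csv(io.StringIO(SAMPLE), names=COLUMN_NAMES, na_values="?")
print(clean[["ca", "thal"]].dtypes)
print(clean[["ca", "thal"]].isna().sum())

# One simple (and lossy) option: drop incomplete rows before modeling.
complete = clean.dropna()
print(len(clean), "->", len(complete), "rows")
```

The same na_values="?" argument can be passed when reading from the UCI URL. Dropping rows is only one choice; imputing the missing entries is an alternative when losing observations is a concern.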

The complete script combines the steps above and adds a hand-written description of the columns:

import pandas as pd

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)

# Display the first few rows of the dataset
print(data.head())

# Display dataset summary
print(data.describe())

# Output dataset information
print(data.info())

# Manually create a description of the dataset
description = """
This dataset contains 303 instances of heart disease data with 14 attributes:
1. age: Age in years
2. sex: Sex (1 = male; 0 = female)
3. cp: Chest pain type (1 to 4)
4. trestbps: Resting blood pressure (in mm Hg)
5. chol: Serum cholesterol in mg/dl
6. fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: Resting electrocardiographic results (0 to 2)
8. thalach: Maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: The slope of the peak exercise ST segment (1 to 3)
12. ca: Number of major vessels (0-3) colored by fluoroscopy
13. thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease severity)
"""
print(description)

Running the script prints the same head(), describe(), and info() output shown above, followed by the description text.