The UCI Heart Disease Data Set is a well-known dataset used for machine learning and statistical analysis, particularly in the context of predicting heart disease. Here is a brief introduction:
Overview
The UCI Heart Disease Data Set contains data collected from four different locations: Cleveland, Hungary, Switzerland, and the VA Long Beach. The goal is to predict the presence of heart disease in a patient based on various medical attributes.
Attributes
The dataset includes the following attributes:
- Age: Age of the patient
- Sex: Gender of the patient (1 = male; 0 = female)
- CP: Chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
- Trestbps: Resting blood pressure (in mm Hg on admission to the hospital)
- Chol: Serum cholesterol in mg/dl
- Fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- Restecg: Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = probable or definite left ventricular hypertrophy)
- Thalach: Maximum heart rate achieved
- Exang: Exercise-induced angina (1 = yes; 0 = no)
- Oldpeak: ST depression induced by exercise relative to rest
- Slope: The slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
- Ca: Number of major vessels (0-3) colored by fluoroscopy
- Thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect); the column name is often misread as thalassemia
- Target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease severity, commonly collapsed to 1 = disease)
Usage
The dataset is commonly used for classification tasks to predict the presence of heart disease. It is available for download from the UCI Machine Learning Repository.
Example Code
Here is an example of how to load and explore the dataset using Python and pandas:
python
import pandas as pd
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)
# Display the first few rows of the dataset
print(data.head())
This code snippet loads the Cleveland subset of the Heart Disease Data Set and displays the first few rows.
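The processed files mark missing values with "?", which is why pandas reads the ca and thal columns as text rather than numbers, and the target column is stored as 0 to 4 rather than 0/1. The sketch below shows one way to handle both. The helper names `load_heart` and `binarize_target` are made up for this illustration, and the inline sample stands in for the downloaded file: its first three rows are real rows from the file, while the last row is invented to show a "?" entry.

```python
import io

import pandas as pd

COLUMN_NAMES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]

# Three real rows from processed.cleveland.data, plus one invented row
# (the last) whose "ca" value is "?" to illustrate the missing-value marker.
SAMPLE = """\
63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
67.0,1.0,4.0,120.0,229.0,0.0,2.0,129.0,1.0,2.6,2.0,2.0,7.0,1
50.0,0.0,3.0,120.0,200.0,0.0,0.0,150.0,0.0,0.0,1.0,?,3.0,0
"""


def load_heart(source):
    """Read a headerless heart-disease file, treating '?' as missing."""
    return pd.read_csv(source, names=COLUMN_NAMES, na_values="?")


def binarize_target(df):
    """Return a copy with target collapsed to 0 (no disease) or 1 (any disease)."""
    out = df.copy()
    out["target"] = (out["target"] > 0).astype(int)
    return out


data = binarize_target(load_heart(io.StringIO(SAMPLE)))
print(data.dtypes)            # ca and thal are now numeric
print(data.isna().sum())      # number of missing entries per column
print(data["target"].tolist())
```

Passing the URL from the snippet above to `load_heart` instead of the in-memory sample should work the same way, since `pd.read_csv` accepts both. Whether to drop or impute the rows with missing values is a separate modelling decision that this sketch leaves open.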
Dataset Explanation
The dataset in question appears to be a simulated dataset used for demonstrating cross-validation techniques. Here is an explanation of the dataset and its variables:
The dataset is generated using random values and is used to demonstrate various cross-validation techniques in machine learning. It consists of multiple predictors and a response variable.
Variables
- Q: A 2D array where each row represents a different predictor, and each column represents a different observation. The values are generated using a Gaussian distribution.
- N: The response variable, which is a binary variable with values 'Yes' and 'No'. It is shuffled to create randomness in the dataset.
- b: An array that stores the correlation coefficients between the response variable N and each predictor in Q.
- Index: An array that stores the indices of the top nc predictors based on their correlation with the response variable.
- mydata: A combined dataset that includes the response variable N and the selected predictors from Q.
- tt: An array used to create test and train splits for cross-validation.
- cv_error: An array to store cross-validation errors.
- cv_true: An array to store the true values of cross-validation predictions.
- final_cv: An array to store the final cross-validation results.
- final_corr: A 3D array to store the correlation coefficients of the selected predictors with the outcome for each cross-validation fold.
These variables are used to perform cross-validation and evaluate the performance of different machine learning models, such as K-Nearest Neighbors (KNN) and Decision Tree classifiers.
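The variables above match a familiar experiment: choose the predictors most correlated with a purely random response, then compare cross-validation done with the selection outside the folds against selection repeated inside each fold. The original code is not shown here, so the following is only a sketch under assumptions: the sizes (1000 predictors, 50 observations, nc = 20, 5 folds), the hand-written 1-nearest-neighbour classifier, and the function names `top_correlated`, `one_nn_predict`, and `cross_val_error` are all choices made for this illustration.

```python
import numpy as np


def top_correlated(Q, y, nc):
    """Return indices of the nc rows of Q most correlated with y, and all correlations."""
    Qc = Q - Q.mean(axis=1, keepdims=True)
    yc = y - y.mean()
    b = (Qc @ yc) / (np.linalg.norm(Qc, axis=1) * np.linalg.norm(yc))
    return np.argsort(-np.abs(b))[:nc], b


def one_nn_predict(X_train, y_train, X_test):
    """Predict with 1-nearest-neighbour using squared Euclidean distance."""
    d = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d.argmin(axis=1)]


def cross_val_error(Q, y, nc, n_folds, select_inside, rng):
    """Mean misclassification rate over n_folds folds.

    select_inside=False picks predictors once from all observations, so
    information about the test labels leaks into the model.
    select_inside=True repeats the selection on each training split.
    """
    tt = rng.permutation(y.size) % n_folds  # fold label for every observation
    if not select_inside:
        index, _ = top_correlated(Q, y, nc)
    errors = []
    for k in range(n_folds):
        test = tt == k
        train = ~test
        if select_inside:
            index, _ = top_correlated(Q[:, train], y[train], nc)
        X_train = Q[index][:, train].T
        X_test = Q[index][:, test].T
        pred = one_nn_predict(X_train, y[train], X_test)
        errors.append(np.mean(pred != y[test]))
    return float(np.mean(errors))


rng = np.random.default_rng(0)
p, n, nc, n_folds = 1000, 50, 20, 5           # assumed sizes
Q = rng.standard_normal((p, n))               # rows = predictors, columns = observations
N = rng.permutation(np.repeat(["Yes", "No"], n // 2))
y = (N == "Yes").astype(float)

Index, b = top_correlated(Q, y, nc)
err_outside = cross_val_error(Q, y, nc, n_folds, select_inside=False, rng=rng)
err_inside = cross_val_error(Q, y, nc, n_folds, select_inside=True, rng=rng)
print("selection outside the folds:", err_outside)
print("selection inside the folds: ", err_inside)
```

Because N is independent of Q, an honest estimate should sit near 50% error. Selecting predictors before splitting typically reports a much lower, overly optimistic error, which is the point such demonstrations usually make.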
python
import pandas as pd
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)
# Display the first few rows of the dataset
print(data.head())
age sex cp trestbps chol fbs restecg thalach exang oldpeak \
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6
3 37.0 1.0 3.0 130.0 250.0 0.0 0.0 187.0 0.0 3.5
4 41.0 0.0 2.0 130.0 204.0 0.0 2.0 172.0 0.0 1.4
slope ca thal target
0 3.0 0.0 6.0 0
1 2.0 3.0 3.0 2
2 2.0 2.0 7.0 1
3 3.0 0.0 3.0 0
4 1.0 0.0 3.0 0
python
# Display summary statistics (ca and thal are omitted because they load as text columns)
print(data.describe())
age sex cp trestbps chol fbs \
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 54.438944 0.679868 3.158416 131.689769 246.693069 0.148515
std 9.038662 0.467299 0.960126 17.599748 51.776918 0.356198
min 29.000000 0.000000 1.000000 94.000000 126.000000 0.000000
25% 48.000000 0.000000 3.000000 120.000000 211.000000 0.000000
50% 56.000000 1.000000 3.000000 130.000000 241.000000 0.000000
75% 61.000000 1.000000 4.000000 140.000000 275.000000 0.000000
max 77.000000 1.000000 4.000000 200.000000 564.000000 1.000000
restecg thalach exang oldpeak slope target
count 303.000000 303.000000 303.000000 303.000000 303.000000 303.000000
mean 0.990099 149.607261 0.326733 1.039604 1.600660 0.937294
std 0.994971 22.875003 0.469794 1.161075 0.616226 1.228536
min 0.000000 71.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 133.500000 0.000000 0.000000 1.000000 0.000000
50% 1.000000 153.000000 0.000000 0.800000 2.000000 0.000000
75% 2.000000 166.000000 1.000000 1.600000 2.000000 2.000000
max 2.000000 202.000000 1.000000 6.200000 3.000000 4.000000
python
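# Print a concise summary of the DataFrame (column dtypes and non-null counts).
# info() writes its report directly and returns None, so wrapping it in print() also emits a trailing "None".
print(data.info())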
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 303 non-null float64
1 sex 303 non-null float64
2 cp 303 non-null float64
3 trestbps 303 non-null float64
4 chol 303 non-null float64
5 fbs 303 non-null float64
6 restecg 303 non-null float64
7 thalach 303 non-null float64
8 exang 303 non-null float64
9 oldpeak 303 non-null float64
10 slope 303 non-null float64
11 ca 303 non-null object
12 thal 303 non-null object
13 target 303 non-null int64
dtypes: float64(11), int64(1), object(2)
memory usage: 33.3+ KB
None
python
import pandas as pd
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
column_names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
data = pd.read_csv(url, names=column_names)
# Display the first few rows of the dataset
print(data.head())
# Display dataset summary
print(data.describe())
# Output dataset information
print(data.info())
# Manually create a description of the dataset
description = """
This dataset contains 303 instances of heart disease data with 14 attributes:
1. age: Age in years
2. sex: Sex (1 = male; 0 = female)
3. cp: Chest pain type (1 to 4)
4. trestbps: Resting blood pressure (in mm Hg)
5. chol: Serum cholesterol in mg/dl
6. fbs: Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: Resting electrocardiographic results (0 to 2)
8. thalach: Maximum heart rate achieved
9. exang: Exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: The slope of the peak exercise ST segment (1 to 3)
12. ca: Number of major vessels (0-3) colored by fluoroscopy
13. thal: Thallium stress test result (3 = normal; 6 = fixed defect; 7 = reversible defect)
14. target: Diagnosis of heart disease (0 = no disease; 1 to 4 = disease severity)
"""
print(description)