R Exploratory Data Analysis (EDA) Study Notes

Download RStudio Desktop: https://posit.co/download/rstudio-desktop/

I. Packages

1. Installing packages
install.packages("tidyverse")
install.packages("patchwork")

2. Loading packages

library(package_name): loads a package

library(tidyverse)
library(patchwork)

II. Data Wrangling

1. Reading the data and assigning it
employees <- read.csv("D:\\NetdiskDownload\\Employee_dataset.csv")

read.csv() reads the CSV file at the given path and returns a data frame, which is assigned to the variable employees.

2. Listing column names
colnames(employees)

3. Dropping columns
employees <- select(employees, -c("X", "Name", "Address", "Dependents", "HRA", "DA", "PF", "Insurance", "Gross.Salary"))

select(data, columns): keeps or drops columns
-: drops the listed columns instead of keeping them
c(): combines one or more values into a vector
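
As a minimal sketch of the two modes of select(), the following uses a made-up data frame named toy (its columns are hypothetical and not taken from the employee data set). It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(X = 1:3, Name = c("a", "b", "c"), Salary = c(10, 20, 30))

# A leading minus drops the listed columns
dropped <- select(toy, -c("X", "Name"))

# Without the minus, only the listed columns are kept
kept <- select(toy, Salary)

colnames(dropped)
colnames(kept)
```

Both calls leave toy itself unchanged; the result has to be assigned, as the notes do with employees.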

4. Dropping rows
other_sex_row_nos <- employees$Sex == "Other"

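This flags the rows whose Sex column equals "Other".

The comparison returns TRUE where the column equals "Other" and FALSE otherwise; the resulting logical vector is stored in other_sex_row_nos.
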
employees <- employees[!other_sex_row_nos, ]

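employees[row_index, column_index]
employees[other_sex_row_nos, ] would keep the rows where the vector is TRUE and drop the rows where it is FALSE. The ! negates the vector, so the rows that were FALSE (everything except "Other") become TRUE and are the ones kept.

The result no longer contains the rows whose sex is "Other".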
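
The logical-indexing idea can be tried on a small made-up data frame (toy and is_other are illustrative names, not part of the original script). This sketch uses base R only.

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Sex = c("Male", "Other", "Female"),
                  Salary = c(10, 20, 30))

is_other <- toy$Sex == "Other"   # FALSE TRUE FALSE

toy[is_other, ]                  # keeps only the "Other" row
toy_clean <- toy[!is_other, ]    # keeps everything except "Other"

nrow(toy_clean)                  # 2
```

One caveat worth keeping in mind: if the column contains NA, the comparison yields NA for those rows and logical indexing then returns rows filled with NA, so a check such as !(toy$Sex %in% "Other") may be safer in that case.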

5. Renaming columns
employees |> rename(Joining_Date = DOJ, Birth_Date = DOB, Gender = Sex, Marital_Status = Marital.Status, In_Company_Years = In.Company.Years, Year_of_Experience = Year.of.Experience) -> employees

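|>: passes the value on its left as the first argument of the function on its right
rename(new_name = old_name): renames columns
-> employees: assigns the result of the whole pipeline back to employees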
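
A short sketch of what the pipe does, on a made-up data frame (toy and the result names are hypothetical). It assumes dplyr is installed and R 4.1 or later, where the native |> operator is available.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(DOB = c("1990-01-01", "1985-06-30"),
                  Sex = c("Male", "Female"))

# x |> f(y) is evaluated as f(x, y)
renamed_piped  <- toy |> rename(Birth_Date = DOB, Gender = Sex)
renamed_direct <- rename(toy, Birth_Date = DOB, Gender = Sex)

identical(renamed_piped, renamed_direct)   # TRUE
colnames(renamed_piped)
```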

6. Types
6.1 Checking the type

class(): returns the class (data type) of an object

Check the type of a specific column:

> class(employees$Gender)
Result: [1] "character"

6.2 Changing the type
employees$Gender <- as.factor(employees$Gender)
employees$Marital_Status <- as.factor(employees$Marital_Status)
employees$Department <- as.factor(employees$Department)
employees$Position <- as.factor(employees$Position)

as.factor(): converts a vector to a factor

In other words, a free-text column becomes a categorical variable with a fixed set of levels.
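
To see what the conversion changes, here is a small base R sketch on a made-up character vector (status and status_f are illustrative names).

```r
# Hypothetical values, used only for illustration
status <- c("Single", "Married", "Single", "Divorced")
class(status)          # "character"

status_f <- as.factor(status)
class(status_f)        # "factor"
levels(status_f)       # "Divorced" "Married" "Single" (sorted alphabetically)

# Internally each value is stored as an index into the levels
as.integer(status_f)   # 3 2 3 1

# Functions such as table() and summary() then count rows per level
table(status_f)
```

This is why summary() prints counts for the factor columns but only "Length / Class / Mode" for the columns that remain character.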

6.3 Displaying the structure
str(employees)

6.4 Factor levels

List the existing levels:

levels(employees$Marital_Status)
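Result: [1] ""         "Divorced" "Married"  "Single"   "Widowed"

The empty string "" is also a level: it comes from rows with no marital status recorded. The complete script at the end removes those rows before the conversion.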

III. Statistical Analysis

1. Summary statistics

summary(x): returns basic summary statistics for x

> summary(employees)
     Salary       Joining_Date        Birth_Date             Age           Gender    
 Min.   : 15215   Length:3332        Length:3332        Min.   :21.00   Female:1674  
 1st Qu.: 70556   Class :character   Class :character   1st Qu.:30.75   Male  :1658  
 Median : 96309   Mode  :character   Mode  :character   Median :40.00                
 Mean   : 94013                                         Mean   :40.38                
 3rd Qu.:121410                                         3rd Qu.:50.00                
 Max.   :149991                                         Max.   :60.00                
                                                                                     
  Marital_Status In_Company_Years Year_of_Experience           Department 
         :  87   Min.   :-1.000   Min.   : 0.00      Finance        :657  
 Divorced: 524   1st Qu.: 3.000   1st Qu.: 9.75      Human Resources:696  
 Married :1969   Median : 7.000   Median :19.00      IT             :659  
 Single  : 515   Mean   : 9.739   Mean   :19.38      Marketing      :647  
 Widowed : 237   3rd Qu.:15.000   3rd Qu.:29.00      Sales          :673  
                 Max.   :39.000   Max.   :39.00                           
                                                                          
                       Position   
 National Sales Manager    : 136  
 QA Lead                   : 136  
 Senior HR                 : 136  
 Regional Account Head     : 131  
 Senior Marketing Executive: 128  
 Software Engineer III     : 128  
 (Other)                   :2537

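Min: minimum
Max: maximum
Mean: arithmetic mean
Median: median
1st Qu.: first quartile, the value at the 25% position of the sorted data
3rd Qu.: third quartile, the value at the 75% position of the sorted data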
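
The quartiles can be checked by hand on a tiny vector. This base R sketch uses the made-up vector x; with nine sorted values, the 25%, 50%, and 75% positions fall exactly on the 3rd, 5th, and 7th values.

```r
x <- 1:9

quantile(x, 0.25)   # 3  (1st Qu.)
median(x)           # 5
quantile(x, 0.75)   # 7  (3rd Qu.)

# summary() reports the same values together with Min, Mean, and Max
summary(x)
```

When a position falls between two values, quantile() interpolates between them by default, so the result does not have to be a value that occurs in the data.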

2. Visualization: data distribution
2.1 Histogram
p1 <- ggplot(employees, aes(x = Salary)) + 
  geom_histogram(fill = "orange3", color = "white", bins = 10) + 
  labs(title = "Salary Distribution of employees",
       subtitle = "From some fake company",
       x = "Salary",
       y = "Frequency") + 
  theme(plot.title = element_text(face = "bold"))

print(p1)

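The + operators split the code above into four steps.
Step 1 -->
ggplot(employees, aes(x = Salary)): creates the plot; the first argument, employees, is the data

The second argument, aes(x = xxx), is the aesthetic mapping; aes(x = Salary) maps the Salary column to the x-axis

Taken together, this step puts the Salary column of employees on the x-axis
Step 2 -->
geom_histogram(): draws a histogram
fill = "orange3": fill color of the bars
color = "white": border color of the bars
bins = 10: splits the range of the data into 10 bins (intervals)
Step 3 -->
labs(): sets the axis, legend, and plot labels
title: main title
subtitle: subtitle
x: x-axis title
y: y-axis title
Step 4 -->
theme(): adjusts the style of the non-data elements of the plot
plot.title: the plot title
element_text(): controls how text is styled
face = "bold": bold font face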
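
The same four-step structure can be tried without the employee file. The sketch below applies it to the built-in mtcars data set (an assumption made only so the example runs anywhere) and then inspects the binned data with layer_data(). It assumes ggplot2 is installed.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +                   # step 1: data and mapping
  geom_histogram(fill = "orange3", color = "white",
                 bins = 10) +                         # step 2: geometry
  labs(title = "Fuel economy",
       x = "mpg",
       y = "Frequency") +                             # step 3: labels
  theme(plot.title = element_text(face = "bold"))     # step 4: styling

# layer_data() returns the values computed for the first layer (the bins)
bin_counts <- layer_data(p)$count

# Every observation lands in exactly one bin
sum(bin_counts) == nrow(mtcars)
```

Changing bins and re-running the last two lines is a quick way to see how the bin count affects the shape of the histogram without the total changing.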

2.2 Box plot
p2 = ggplot(employees, aes(x = Department, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Department",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary")
print(p2)

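geom_boxplot(): draws a box plot

From top to bottom, the box marks: 3rd Qu. (top edge) -> Median (inner line) -> 1st Qu. (bottom edge)

Height of the box (the middle 50% of the data) --> interquartile range (IQR) = Q3 - Q1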
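
The numbers behind a box can be computed directly. This base R sketch uses a made-up vector x; the 1.5 * IQR rule for the whiskers is the default of geom_boxplot(), which is stated here as background rather than something the notes cover.

```r
# Hypothetical values, used only for illustration
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

q1 <- unname(quantile(x, 0.25))   # 4, bottom edge of the box
q3 <- unname(quantile(x, 0.75))   # 10, top edge of the box
iqr <- q3 - q1                    # 6, height of the box

iqr == IQR(x)                     # TRUE: IQR() computes the same quantity

# Points more than 1.5 * IQR above the box are drawn as outliers
upper_fence <- q3 + 1.5 * iqr     # 19
x[x > upper_fence]                # 30
```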


Showing the mean in the box plot:

p2 = ggplot(employees, aes(x = Department, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Department",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white")
print(p2)

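fun = xxxx: the summary function to apply to each group
mean(x): computes the arithmetic mean
geom = "point": draws the mean as a point
shape = 20: a small solid circle
size = 8: size of the point
color, fill: color sets the color of the point; fill only affects shapes 21 to 25, so it has no visible effect with shape = 20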


If the x-axis labels overlap because there are too many categories, rotate them:

p2 = ggplot(employees, aes(x = Position, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Position",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p2)

axis.text.x: x-axis tick labels
angle: rotation angle in degrees
vjust: vertical justification
hjust: horizontal justification

3. dplyr
employees |> summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble1

summarise(): collapses the data into one row of summary values
min(): minimum
quantile(Salary, 0.25): 1st Qu.
quantile(Salary, 0.75): 3rd Qu.
mean(): arithmetic mean
median(): median
max(): maximum
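
A minimal sketch of summarise() on a made-up data frame (toy and toy_summary are illustrative names) shows that the result is a single row with one column per summary. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Salary = c(10, 20, 30, 40, 50))

toy_summary <- toy |>
  summarise(MIN = min(Salary),
            LQ = quantile(Salary, 0.25),
            MEDIAN = median(Salary),
            AVG = mean(Salary),
            HQ = quantile(Salary, 0.75),
            MAX = max(Salary))

# One row: MIN 10, LQ 20, MEDIAN 30, AVG 30, HQ 40, MAX 50
toy_summary
```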


Grouping

employees |> group_by(Gender)|> 
                       summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble2

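group_by(Gender): groups the rows by gender, so summarise() returns one row per gender


Filtering and counting

# Employees with an annual salary of at least 100,000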
employees |> group_by(Gender)|> tally(Salary >= 100000) -> tibble3

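tally(): counts rows per group; when it is given a logical condition, as here, it sums the condition within each group, which counts the rows where the condition is TRUE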
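
Passing a condition to tally() is compact but easy to misread, so here is a sketch comparing it with a more explicit summarise() call. The data frame toy is made up for illustration; dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Gender = c("Female", "Female", "Male", "Male", "Male"),
                  Salary = c(120000, 90000, 100000, 80000, 150000))

# The condition is used as a weight: TRUE counts as 1, FALSE as 0
by_tally <- toy |> group_by(Gender) |> tally(Salary >= 100000)

# Equivalent, more explicit form
by_sum <- toy |> group_by(Gender) |> summarise(n = sum(Salary >= 100000))

by_tally$n   # 1 2 (Female, Male)
by_sum$n     # 1 2
```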

> str(tibble3)
tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
 $ Gender: Factor w/ 2 levels "Female","Male": 1 2
 $ n     : int [1:2] 774 730

tibble: a modern take on the data frame

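4. Correlation

Correlation measures the linear relationship between a pair of continuous numeric variables.
It can be positive or negative.

Linear correlation coefficient:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$

  • ρX,Y: correlation coefficient of X and Y
  • cov(X,Y): covariance of X and Y
  • σX, σY: standard deviations of X and Y
  • μX, μY: expected values (means) of X and Y
  • E[⋅]: expectation operator

The coefficient lies in [-1, 1]: the closer it is to 1, the stronger the positive correlation; the closer it is to -1, the stronger the negative correlation.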
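
The definition can be checked numerically in base R. The vectors x and y below are made up for illustration; cov() and sd() are the sample versions, which is what cor() uses as well.

```r
# Hypothetical values, used only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Covariance divided by the product of the standard deviations
rho_manual <- cov(x, y) / (sd(x) * sd(y))

# cor() computes the Pearson coefficient by default
rho_builtin <- cor(x, y)

all.equal(rho_manual, rho_builtin)   # TRUE
round(rho_manual, 2)                 # 0.77, a fairly strong positive correlation
```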

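Correlation does not imply causation.

Case                              Relationship
A causes B                        Direct causation
B causes A                        Reverse causation
A and B are both caused by C      Common cause
A causes B and B causes A         Bidirectional or cyclic causation
A and B are unrelated             Coincidence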

4.1 Numeric view
employees |> select(c(Salary, Age, In_Company_Years, Year_of_Experience)) ->
  cor_employees

This selects the four columns above and assigns them to cor_employees.

cor_matrix <- cor(cor_employees)

cor(): computes the correlation coefficient for every pair of columns

cor_table <- melt(cor_matrix)

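melt(): reshapes the correlation matrix from the previous step into long format (one row per pair of variables), which is convenient for plotting. melt() comes from the reshape2 package, so library(reshape2) has to be loaded first.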
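
To see what the long format looks like without installing reshape2, the sketch below builds an equivalent table with base R. This is an alternative shown for illustration, not the method used in the notes; it runs on the built-in mtcars data set, and the column names are set by hand to match the Var1, Var2, and value names used in the heat map code.

```r
# 3 x 3 correlation matrix from a built-in data set
m <- cor(mtcars[, c("mpg", "hp", "wt")])

# One row per (row name, column name) pair
long <- as.data.frame(as.table(m))
colnames(long) <- c("Var1", "Var2", "value")

dim(m)       # 3 3
nrow(long)   # 9
head(long)
```

Each of the 9 cells of the matrix becomes one row, which is the shape that aes(x = Var1, y = Var2, fill = value) expects.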


4.2 Heat map
pc <- ggplot(cor_table, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(low = "steelblue4", mid = "white", high = "red4") + 
  labs(title = "Correlation Matrix", 
       subtitle = "Correlation Coefficients between 4 variables", 
       x = "", 
       y = "", 
       fill = "Correlation\nCoefficient", 
       caption = "Source: Employee data set") + 
  theme(plot.title = element_text(size = 18, face = "bold"), 
        legend.title = element_text(face = "bold", color = "brown", size = 10), 
        axis.text.x = element_text(size = 14, face = "bold", angle = 45, vjust = 0.7), 
        axis.text.y = element_text(size = 14, face = "bold")) + 
  geom_text(aes(x = Var1, y = Var2, label = round(value, 2)), 
            color = "white", fontface = "bold", size = 5)

print(pc)

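geom_tile(): draws one tile per cell
scale_fill_gradient2(): fills the tiles with a diverging color gradient (low, mid, high)
legend.title: the legend title
geom_text(): adds text labels; a fixed color such as "white" goes outside aes(), otherwise it is treated as a data value and gets its own legend entry
label: the text to display
round(value, 2): rounds to two decimal places
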
scale_fill_gradient2(low = "steelblue4", mid = "white", high = "red4")

Adding the midpoint argument can increase the contrast. Here the value that maps to the mid color is set to 0.5 (the default is 0):

scale_fill_gradient2(midpoint = 0.5, low = "steelblue4", mid = "white", high = "red4")

5. Displaying all plots together
temp <- p1 + p2 + plot_layout(ncol = 2)
print(temp)

p3 used below is a third plot that is defined in the complete script at the end of these notes:
temp <- p1 + p2 + p3 + pc + plot_layout(ncol = 2)
print(temp)
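
Besides + with plot_layout(), patchwork also offers the | and / operators for arranging plots. The sketch below uses three small plots built from the built-in mtcars data set (a, b, and d are illustrative names); it assumes ggplot2 and patchwork are installed.

```r
library(ggplot2)
library(patchwork)

a <- ggplot(mtcars, aes(x = mpg)) + geom_histogram(bins = 10)
b <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
d <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# | places plots side by side, / stacks them vertically
combined <- (a | b) / d
print(combined)
```

Here a and b share the top row and d spans the full width underneath, a layout that is harder to express with ncol alone.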

6. Saving
write_csv(employees, "D:\\SystemFiles\\dawd\\cleaned_employyees.csv")
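
A CSV file stores plain text only, so factor columns come back as character columns when the file is read again. The round trip below illustrates this on a made-up data frame and a temporary file (toy, path, and restored are illustrative names); it assumes readr 2.0 or later is installed.

```r
library(readr)

# Hypothetical toy data frame, used only for illustration
toy <- data.frame(Gender = factor(c("Female", "Male")),
                  Salary = c(120000, 90000))

# A temporary file is used here instead of a fixed path
path <- tempfile(fileext = ".csv")
write_csv(toy, path)

restored <- read_csv(path, show_col_types = FALSE)
class(restored$Gender)   # "character": the factor type is not stored in the file

# Convert again after reading if the factor type is needed
restored$Gender <- as.factor(restored$Gender)
str(restored)
```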

Complete script:

# install.packages("tidyverse")
# install.packages("patchwork")
# install.packages("reshape2")

library(tidyverse)
library(patchwork)
library(reshape2)

employees <- read.csv("D:\\NetdiskDownload\\Employee_dataset.csv")

employees <- select(employees, -c("X", "Name", "Address", "DOJ", 
                                  "Dependents", "HRA", "DA", 
                                  "PF", "Insurance", "Gross.Salary"))

other_sex_row_nos <- employees$Sex == "Other"
employees <- employees[!other_sex_row_nos, ]

empty_marital_status_rows <- employees$Marital.Status == ""
employees <- employees[!empty_marital_status_rows, ]

employees |> rename(Birth_Date = DOB, 
                    Gender = Sex, 
                    Marital_Status = Marital.Status, 
                    In_Company_Years = In.Company.Years, 
                    Year_of_Experience = Year.of.Experience) -> employees

employees$Gender <- as.factor(employees$Gender)
employees$Marital_Status <- as.factor(employees$Marital_Status)
employees$Department <- as.factor(employees$Department)
employees$Position <- as.factor(employees$Position)

p1 <- ggplot(employees, aes(x = Salary)) + 
  geom_histogram(fill = "orange3", color = "white", bins = 10) + 
  labs(title = "Salary Distribution of employees",
       subtitle = "From some fake company",
       x = "Salary",
       y = "Frequency") + 
  theme(plot.title = element_text(face = "bold"))

print(p1)

p2 = ggplot(employees, aes(x = Position, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Position",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p2)

# group = Year_of_Experience draws one box per year of experience;
# with a continuous x and no group, all rows collapse into a single box
p3 = ggplot(employees, aes(x = Year_of_Experience, y = Salary, 
                           group = Year_of_Experience)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Year of Experience",
       subtitle = "From a fake company", 
       x = "Year of Experience", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p3)

# ------------------------------------------------------

employees |> summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble1

employees |> group_by(Gender)|> 
                       summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble2

employees |> group_by(Gender)|> tally(Salary >= 100000) -> tibble3

# Correlation

employees |> select(c(Salary, Age, In_Company_Years, Year_of_Experience)) ->
  cor_employees

cor_matrix <- cor(cor_employees)

cor_table <- melt(cor_matrix)

head(cor_table)

pc <- ggplot(cor_table, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(midpoint = 0.5, low = "steelblue4", mid = "white", high = "red4") + 
  labs(title = "Correlation Matrix", 
       subtitle = "Correlation Coefficients between 4 variables", 
       x = "", 
       y = "", 
       fill = "Correlation\nCoefficient", 
       caption = "Source: Employee data set") + 
  theme(plot.title = element_text(size = 18, face = "bold"), 
        legend.title = element_text(face = "bold", color = "brown", size = 10), 
        axis.text.x = element_text(size = 14, face = "bold", angle = 45, vjust = 0.7), 
        axis.text.y = element_text(size = 14, face = "bold")) + 
  geom_text(aes(x = Var1, y = Var2, label = round(value, 2)), 
            color = "white", fontface = "bold", size = 5)

print(pc)

# Others

temp <- p1 + p2 + plot_layout(ncol = 2)
print(temp)

temp <- p1 + p2 + p3 + pc + plot_layout(ncol = 2, nrow = 2)
print(temp)

# save
write_csv(employees, "D:\\SystemFiles\\dawd\\cleaned_employyees.csv")