R EDA Study Notes

RStudio Desktop download: https://posit.co/download/rstudio-desktop/

I. Packages

1. Installing packages

install.packages("tidyverse")
install.packages("patchwork")

2. Loading packages

library(package_name): attaches an installed package to the current session

library(tidyverse)

library(patchwork)

II. Data Wrangling

1. Reading the data and assigning it

employees <- read.csv("D:\\NetdiskDownload\\Employee_dataset.csv")

read.csv() reads the CSV file at the given path and returns a data frame, which is assigned to the variable employees.
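As a minimal sketch of how read.csv() treats column names, the example below writes a tiny made-up CSV to a temporary file and reads it back. The header and values are hypothetical, not the real Employee_dataset.csv. It shows that the default check.names = TRUE turns a space into a dot and names a blank header X, which would explain names such as Gross.Salary and X in these notes (an assumption about the original file).

```r
# Hypothetical CSV: a blank first header and a header containing a space
csv_path <- tempfile(fileext = ".csv")
writeLines(c(",Name,Gross Salary",
             "0,Ann,52000",
             "1,Bob,61000"), csv_path)

toy <- read.csv(csv_path)

# check.names = TRUE (the default) makes the names syntactically valid
print(names(toy))

# Forward slashes also work in Windows paths and avoid doubled backslashes,
# e.g. read.csv("D:/NetdiskDownload/Employee_dataset.csv")
```

Passing check.names = FALSE would keep the headers exactly as written in the file.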

2. Listing column names

colnames(employees)

3. Dropping columns

employees <- select(employees, -c("X", "Name", "Address", "Dependents", "HRA", "DA", "PF", "Insurance", "Gross.Salary"))

select(data, columns): keeps or drops columns
-: drops the columns that follow it
c(): combines one or more values into a vector
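The keep and drop forms of select() can be compared on a small data frame. This is only a sketch: the toy columns are made up, and all_of() is shown as one common way to select with names held in a character vector.

```r
library(dplyr)

# Toy data frame with hypothetical columns
toy <- data.frame(Name = c("Ann", "Bob"),
                  Address = c("1 Main St", "2 Oak Ave"),
                  Salary = c(52000, 61000))

# Drop columns: the minus sign negates the selection
dropped <- select(toy, -c(Name, Address))

# Keep columns: list only the ones wanted
kept <- select(toy, Name, Salary)

# Column names stored as strings can be wrapped in all_of()
cols_to_drop <- c("Name", "Address")
dropped_again <- select(toy, -all_of(cols_to_drop))

print(names(dropped))
print(names(kept))
```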

4. Dropping rows

other_sex_row_nos <- employees$Sex == "Other"

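This flags the rows whose Sex value is "Other".

The comparison returns TRUE where Sex equals "Other" and FALSE otherwise, and the resulting logical vector is stored in other_sex_row_nos.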

employees <- employees[!other_sex_row_nos, ]

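employees[row_index, column_index]
employees[other_sex_row_nos, ] would keep the rows marked TRUE and drop the rows marked FALSE. The ! operator negates the vector first, so the rows that were FALSE (every value other than "Other") become TRUE and are the ones kept.

This removes the rows whose sex is "Other".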
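The logical-indexing step can be tried on a toy data frame. The sketch below uses made-up values and adds one caveat that the notes do not cover: a missing value in the compared column yields NA in the logical vector, which produces an all-NA row. Using %in% is one way around that, since it returns FALSE for NA.

```r
# Toy data frame; the values are made up for illustration
toy <- data.frame(Sex = c("Male", "Other", "Female", NA),
                  Salary = c(50000, 42000, 61000, 58000))

is_other <- toy$Sex == "Other"   # FALSE TRUE FALSE NA
kept <- toy[!is_other, ]

# The NA in Sex produces an all-NA row, because !NA is still NA
print(kept)

# %in% never returns NA, so the row with the missing Sex is kept as it is
kept_safe <- toy[!(toy$Sex %in% "Other"), ]
print(kept_safe)
```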

5. Renaming

employees |> rename(Joining_Date = DOJ, Birth_Date = DOB, Gender = Sex, Marital_Status = Marital.Status, In_Company_Years = In.Company.Years, Year_of_Experience = Year.of.Experience) -> employees

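|>: the native pipe; passes the value on its left as the first argument of the function call on its right
->: right assignment; stores the result on its left under the name on its right
rename(new_name = old_name): renames columns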
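A short sketch of the same idea on toy data (the columns and values are hypothetical): the piped call and the nested call give identical results, so the pipe only changes how the code reads. The native pipe requires R 4.1 or later.

```r
library(dplyr)

toy <- data.frame(DOB = c("1990-01-01", "1985-06-15"),
                  Sex = c("Female", "Male"))

# Nested call: the data frame is the first argument
renamed_nested <- rename(toy, Birth_Date = DOB, Gender = Sex)

# Same call written with the native pipe and ordinary left assignment
renamed_piped <- toy |> rename(Birth_Date = DOB, Gender = Sex)

print(names(renamed_piped))
```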

6. Types

6.1 Checking types

class(): returns the class of an object

Checking the type of a single column:

> class(employees$Gender)
Result: [1] "character"

6.2 Converting types

employees$Gender <- as.factor(employees$Gender)
employees$Marital_Status <- as.factor(employees$Marital_Status)
employees$Department <- as.factor(employees$Department)
employees$Position <- as.factor(employees$Position)

as.factor(): converts a vector to a factor

That is, a free-text column becomes a categorical variable with a fixed set of levels.
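To see what the conversion changes, the sketch below converts a small made-up character vector and inspects it. The values are illustrative only.

```r
status <- c("Single", "Married", "Single", "Divorced")

status_factor <- as.factor(status)

print(class(status))              # "character"
print(class(status_factor))       # "factor"
print(levels(status_factor))      # the distinct values, sorted
print(table(status_factor))       # counts per level
print(as.integer(status_factor))  # the integer codes stored underneath
```

Because the levels are known, functions such as summary() and table() report counts per category instead of treating the column as plain text.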

6.3 Displaying the structure

str(employees)

6.4 Factor levels

Listing the existing levels:

levels(employees$Marital_Status)
Result: [1] ""         "Divorced" "Married"  "Single"   "Widowed"
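The "" level means blank Marital_Status values were turned into a category of their own. Two possible ways to handle this are sketched below on made-up values: drop the blank entries and the now unused level, or mark blanks as NA before converting. The complete script at the end of these notes takes the first route by removing the blank rows before as.factor().

```r
status <- factor(c("", "Married", "Single", "", "Widowed"))
print(levels(status))

# Option 1: drop the blank entries, then drop the unused level
non_blank <- droplevels(status[status != ""])
print(levels(non_blank))

# Option 2: keep every entry but mark blanks as missing before converting
raw <- c("", "Married", "Single", "", "Widowed")
raw[raw == ""] <- NA
with_na <- factor(raw)
print(levels(with_na))
print(sum(is.na(with_na)))
```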

III. Statistical Analysis

1. Summary statistics

summary(xxx): returns basic summary statistics for xxx

> summary(employees)
     Salary       Joining_Date        Birth_Date             Age           Gender    
 Min.   : 15215   Length:3332        Length:3332        Min.   :21.00   Female:1674  
 1st Qu.: 70556   Class :character   Class :character   1st Qu.:30.75   Male  :1658  
 Median : 96309   Mode  :character   Mode  :character   Median :40.00                
 Mean   : 94013                                         Mean   :40.38                
 3rd Qu.:121410                                         3rd Qu.:50.00                
 Max.   :149991                                         Max.   :60.00                
                                                                                     
  Marital_Status In_Company_Years Year_of_Experience           Department 
         :  87   Min.   :-1.000   Min.   : 0.00      Finance        :657  
 Divorced: 524   1st Qu.: 3.000   1st Qu.: 9.75      Human Resources:696  
 Married :1969   Median : 7.000   Median :19.00      IT             :659  
 Single  : 515   Mean   : 9.739   Mean   :19.38      Marketing      :647  
 Widowed : 237   3rd Qu.:15.000   3rd Qu.:29.00      Sales          :673  
                 Max.   :39.000   Max.   :39.00                           
                                                                          
                       Position   
 National Sales Manager    : 136  
 QA Lead                   : 136  
 Senior HR                 : 136  
 Regional Account Head     : 131  
 Senior Marketing Executive: 128  
 Software Engineer III     : 128  
 (Other)                   :2537

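Min.: minimum
Max.: maximum
Mean: arithmetic mean
Median: median
1st Qu.: first quartile, the value at the 25% position of the sorted data
3rd Qu.: third quartile, the value at the 75% position of the sorted data
The summary also exposes data-quality issues, such as the minimum of -1 for In_Company_Years and the 87 blank Marital_Status values.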
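The same statistics can be computed one at a time, which makes clear what summary() reports. The salaries below are made up, and quantile() is used with its default interpolation method.

```r
salaries <- c(30000, 40000, 50000, 60000, 70000, 110000)  # made-up values

print(summary(salaries))

# The same numbers computed one at a time
stats <- c(Min    = min(salaries),
           Q1     = unname(quantile(salaries, 0.25)),
           Median = median(salaries),
           Mean   = mean(salaries),
           Q3     = unname(quantile(salaries, 0.75)),
           Max    = max(salaries))
print(stats)
```

In this toy vector the single large value pulls the mean (60000) above the median (55000), a quick sign of right skew.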

2. Visualization - data distributions

2.1 Histogram

p1 <- ggplot(employees, aes(x = Salary)) + 
  geom_histogram(fill = "orange3", color = "white", bins = 10) + 
  labs(title = "Salary Distribution of employees",
       subtitle = "From some fake company",
       x = "Salary",
       y = "Frequency") + 
  theme(plot.title = element_text(face = "bold"))

print(p1)

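The + operators split the code above into 4 steps.
Step 1 -->
ggplot(employees, aes(x = Salary)): creates the plot canvas, with employees as the first argument (the data)

The second argument, aes(x = xxx), is the aesthetic mapping; aes(x = Salary) puts the Salary column on the x-axis

Taken together, the call maps the Salary column of employees to the x-axis
Step 2 -->
geom_histogram(): draws the histogram
fill = "orange3": fill color of the bars
color = "white": border color of the bars
bins = 10: number of bins, so the data is split into 10 intervals
Step 3 -->
labs(): sets the axis, legend, and plot labels
title: main title
subtitle: subtitle
x: x-axis title
y: y-axis title
Step 4 -->
theme(): adjusts the style of the non-data elements of the plot
plot.title: the plot title
element_text(): controls text styling
face = "bold": bold font face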
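The four steps can also be built one object at a time, which shows that each + returns a new plot object. This is a sketch on made-up salaries; the variable names step1 to step4 are only for illustration.

```r
library(ggplot2)

# Toy data; the salaries are made up
toy <- data.frame(Salary = c(30, 35, 42, 48, 55, 61, 64, 70, 88, 95) * 1000)

step1 <- ggplot(toy, aes(x = Salary))   # data and mapping only
step2 <- step1 + geom_histogram(fill = "orange3", color = "white", bins = 5)
step3 <- step2 + labs(title = "Salary distribution", x = "Salary", y = "Frequency")
step4 <- step3 + theme(plot.title = element_text(face = "bold"))

print(length(step1$layers))  # no geom has been added yet
print(length(step4$layers))  # one layer: the histogram
print(step4)
```

Only geoms and stats add layers; labs() and theme() change labels and styling without adding one.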

2.2 Boxplot

p2 = ggplot(employees, aes(x = Department, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Department",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary")
print(p2)

geom_boxplot(xxx): draws a boxplot

From top to bottom, the lines of the box are: 3rd Qu. -> Median -> 1st Qu.

The height of the box (the middle 50% of the data) --> interquartile range (IQR) = Q3 - Q1
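The quantities behind the box can be computed directly. The sketch below uses made-up salaries and also applies the 1.5 * IQR rule that geom_boxplot() uses by default to decide which points are drawn as outliers.

```r
salaries <- c(30000, 40000, 50000, 60000, 70000, 110000, 250000)  # made-up values

q1 <- unname(quantile(salaries, 0.25))
q3 <- unname(quantile(salaries, 0.75))
iqr <- q3 - q1

print(c(Q1 = q1, Q3 = q3, IQR = iqr))
print(IQR(salaries))  # built-in equivalent

# Values more than 1.5 * IQR above the box are drawn as outlier points
upper_fence <- q3 + 1.5 * iqr
print(salaries[salaries > upper_fence])
```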

Showing the mean on the boxplot:

p2 = ggplot(employees, aes(x = Department, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Department",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white")
print(p2)

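fun = xxxx: the function used to summarise the y values of each group
mean(x): computes the arithmetic mean
geom = "point": draws each mean as a point
shape = 20: a small solid circle
size = 8: size of the point
color: color of the point; fill: only applies to shapes 21-25, so it has no visible effect with shape 20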
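One way to check what stat_summary() draws is to compare its computed values with group means calculated by hand. The sketch below does this on made-up data; layer_data() is a ggplot2 helper that returns the data computed for a given layer, and layer 2 is the stat_summary() layer in this toy plot.

```r
library(ggplot2)

toy <- data.frame(Department = c("IT", "IT", "IT", "Sales", "Sales", "Sales"),
                  Salary     = c(50, 60, 100, 40, 45, 80) * 1000)

p <- ggplot(toy, aes(x = Department, y = Salary)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, color = "red3")

# Means computed directly, one per department
group_means <- tapply(toy$Salary, toy$Department, mean)
print(group_means)

# y values that the stat_summary layer (layer 2) actually plots
plotted <- layer_data(p, 2)$y
print(plotted)
```

In the IT group the mean (70000) sits above the median line (60000), which is the kind of gap the extra point makes visible.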

If there are so many categories that the x-axis labels overlap, the labels can be rotated:

p2 = ggplot(employees, aes(x = Position, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Position",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p2)

axis.text.x: x-axis tick labels
angle: rotation angle in degrees
vjust: vertical justification
hjust: horizontal justification

3. dplyr

employees |> summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble1

summarise(): collapses the data into one row of summary values
min(): minimum
quantile(Salary, 0.25): 1st Qu.
quantile(Salary, 0.75): 3rd Qu.
mean(): arithmetic mean
median(): median
max(): maximum
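A small sketch of the same summarise() call on made-up salaries shows the shape of the result: one row, with one column per summary.

```r
library(dplyr)

toy <- data.frame(Salary = c(30000, 40000, 50000, 60000, 70000))  # made-up values

toy_summary <- toy |>
  summarise(MIN = min(Salary),
            LQ = quantile(Salary, 0.25),
            HQ = quantile(Salary, 0.75),
            AVG = mean(Salary),
            MEDIAN = median(Salary),
            MAX = max(Salary))

print(toy_summary)   # one row, six columns
```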

Grouping

employees |> group_by(Gender)|> 
                       summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble2

group_by(Gender): groups the rows by gender, so summarise() returns one row per gender
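The effect of grouping is easiest to see on a few rows. In this sketch the data is made up, and n() is added as an extra summary to show the size of each group.

```r
library(dplyr)

toy <- data.frame(Gender = c("Female", "Female", "Male", "Male", "Male"),
                  Salary = c(50000, 70000, 40000, 60000, 95000))

by_gender <- toy |>
  group_by(Gender) |>
  summarise(N = n(),
            AVG = mean(Salary),
            MEDIAN = median(Salary))

print(by_gender)   # one row per gender
```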

Filtering and counting

# Count employees whose annual salary is at least 100,000
employees |> group_by(Gender)|> tally(Salary >= 100000) -> tibble3

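tally(): counts rows; when given an expression, it sums that expression within each group, so a logical condition yields the number of rows where it is TRUE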
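The conditional count can be written in more than one way. The sketch below, on made-up data, compares tally() with a logical expression against filtering first and then counting with count().

```r
library(dplyr)

toy <- data.frame(Gender = c("Female", "Female", "Male", "Male", "Male"),
                  Salary = c(120000, 70000, 40000, 150000, 100000))

# tally() with a logical expression sums the TRUE values within each group
via_tally <- toy |> group_by(Gender) |> tally(Salary >= 100000)

# Alternative: keep only the matching rows, then count them per gender
via_filter <- toy |> filter(Salary >= 100000) |> count(Gender)

print(via_tally)
print(via_filter)
```

The two differ when a group has no matching rows: tally() still reports that group with n = 0, while the filtered version omits it.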

> str(tibble3)
tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
 $ Gender: Factor w/ 2 levels "Female","Male": 1 2
 $ n     : int [1:2] 774 730

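tibble: a modern reimplementation of the data frame

4. Correlation

Correlation measures the linear relationship between a pair of continuous numeric variables.
It can be positive or negative.

Linear correlation coefficient

$\rho_{X,Y} = \dfrac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \dfrac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$

$\rho_{X,Y}$: correlation coefficient of X and Y
$\mathrm{cov}(X,Y)$: covariance of X and Y
$\sigma_X, \sigma_Y$: standard deviations of X and Y
$\mu_X, \mu_Y$: expectations (means) of X and Y
$E[\cdot]$: expectation operator
The value lies in $[-1, 1]$: the closer to 1, the stronger the positive correlation; the closer to -1, the stronger the negative correlation.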
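As a check on the definition, the coefficient can be computed by hand from the sample covariance and standard deviations and compared with cor(). The paired values below are made up.

```r
# Made-up paired observations
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Sample covariance divided by the product of the sample standard deviations
cov_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
r_manual <- cov_xy / (sd(x) * sd(y))

r_builtin <- cor(x, y)   # Pearson is the default method

print(c(manual = r_manual, builtin = r_builtin))
```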

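A correlation between A and B does not by itself show causation; it can arise in any of the following ways:
Case	Relationship
A causes B	Direct causation
B causes A	Reverse causation
A and B are both caused by C	Common cause
A causes B and B causes A	Bidirectional or cyclic causation
A and B are not actually related	Coincidence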
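The common-cause case can be illustrated with a small simulation. This is a sketch under stated assumptions: c_var, a, and b are simulated variables invented for the example, and the noise levels are arbitrary.

```r
set.seed(42)
n <- 1000

c_var <- rnorm(n)                   # hidden common cause C
a <- c_var + rnorm(n, sd = 0.5)     # A depends on C only
b <- c_var + rnorm(n, sd = 0.5)     # B depends on C only

# A and B are strongly correlated although neither causes the other
print(cor(a, b))

# After removing the part explained by C, little correlation remains
a_resid <- resid(lm(a ~ c_var))
b_resid <- resid(lm(b ~ c_var))
print(cor(a_resid, b_resid))
```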

4.1 Numeric display

employees |> select(c(Salary, Age, In_Company_Years, Year_of_Experience)) ->
  cor_employees

Selects the four columns above and assigns them to cor_employees

cor_matrix <- cor(cor_employees)

cor(): computes the correlation coefficient for every pair of columns

cor_table <- melt(cor_matrix)

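melt(): reshapes the matrix from the previous step into a long table (Var1, Var2, value) that is convenient for plotting; it comes from the reshape2 package, which must be installed and loaded with library(reshape2)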
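To see the shape of the data at each step, the sketch below builds a correlation matrix from made-up columns and reshapes it to long format. It uses a base-R stand-in, as.data.frame(as.table(...)), rather than reshape2::melt(), so that it runs without extra packages; the resulting columns have the same names.

```r
# Made-up numeric columns
toy <- data.frame(Salary = c(40, 52, 61, 75, 90),
                  Age = c(25, 31, 38, 44, 52),
                  In_Company_Years = c(1, 4, 3, 10, 12))

cor_matrix <- cor(toy)           # 3 x 3 symmetric matrix with 1 on the diagonal
print(round(cor_matrix, 2))

# Base-R stand-in for reshape2::melt(): one row per (Var1, Var2) pair
cor_table <- as.data.frame(as.table(cor_matrix), responseName = "value")
print(head(cor_table))
```

Each of the 9 rows corresponds to one tile of the heat map.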

4.2 Heat map

pc <- ggplot(cor_table, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(low = "steelblue4", mid = "white", high = "red4") + 
  labs(title = "Correlation Matrix", 
       subtitle = "Correlation Coefficients between 4 variables", 
       x = "", 
       y = "", 
       fill = "Correlation\nCoefficient", 
       caption = "Source: Employee data set") + 
  theme(plot.title = element_text(size = 18, face = "bold"), 
        legend.title = element_text(face = "bold", color = "brown", size = 10), 
        axis.text.x = element_text(size = 14, face = "bold", angle = 45, vjust = 0.7), 
        axis.text.y = element_text(size = 14, face = "bold")) + 
  geom_text(aes(x = Var1, y = Var2, label = round(value, 2)), color = "white", 
            fontface = "bold", size = 5)

print(pc)

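geom_tile(): draws one tile per cell
scale_fill_gradient2(): fills the tiles with a diverging low-mid-high color gradient
legend.title: the legend title
geom_text(): adds text labels
label: the text to display
round(value, 2): rounds to two decimal places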

scale_fill_gradient2(low = "steelblue4", mid = "white", high = "red4")

Adding the midpoint argument raises the contrast here: midpoint = 0.5 moves the white middle color from the default 0 to 0.5

scale_fill_gradient2(midpoint = 0.5, low = "steelblue4", mid = "white", high = "red4")

5. Showing all existing plots

temp <- p1 + p2 + plot_layout(ncol = 2)
print(temp)

temp <- p1 + p2 + p3 + pc + plot_layout(ncol = 2)
print(temp)
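A minimal sketch of how patchwork combines plots, using three made-up plots (a, b, and c_plot are hypothetical names). Besides + with plot_layout(), patchwork also offers | for side by side and / for stacking.

```r
library(ggplot2)
library(patchwork)

toy <- data.frame(x = 1:10, y = (1:10)^2)

a <- ggplot(toy, aes(x = x, y = y)) + geom_point()
b <- ggplot(toy, aes(x = x, y = y)) + geom_line()
c_plot <- ggplot(toy, aes(x = y)) + geom_histogram(bins = 5)

side_by_side <- a + b + plot_layout(ncol = 2)   # same form as in the notes
stacked <- (a | b) / c_plot                     # two on top, one below

print(side_by_side)
print(stacked)
```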

6. Saving

write_csv(employees, "D:\\SystemFiles\\dawd\\cleaned_employyees.csv")
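One thing worth checking after saving is what comes back when the file is read again. The sketch below writes a made-up data frame to a temporary file (an assumption for the example, instead of a fixed D:\ path) and reads it back with readr.

```r
library(readr)

toy <- data.frame(Gender = factor(c("Female", "Male")),
                  Salary = c(52000, 61000))

out_path <- tempfile(fileext = ".csv")   # temporary file instead of a fixed path
write_csv(toy, out_path)

reloaded <- read_csv(out_path, show_col_types = FALSE)

print(class(toy$Gender))       # "factor"
print(class(reloaded$Gender))  # "character": a CSV does not store factor information
```

So the as.factor() conversions need to be repeated after the cleaned file is loaded in a later session.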

Complete script:

# install.packages("tidyverse")
# install.packages("patchwork")
# install.packages("reshape2")

library(tidyverse)
library(patchwork)
library(reshape2)

employees <- read.csv("D:\\NetdiskDownload\\Employee_dataset.csv")

employees <- select(employees, -c("X", "Name", "Address", "DOJ", 
                                  "Dependents", "HRA", "DA", 
                                  "PF", "Insurance", "Gross.Salary"))

other_sex_row_nos <- employees$Sex == "Other"
employees <- employees[!other_sex_row_nos, ]

empty_marital_status_rows <- employees$Marital.Status == ""
employees <- employees[!empty_marital_status_rows, ]

employees |> rename(Birth_Date = DOB, 
                    Gender = Sex, 
                    Marital_Status = Marital.Status, 
                    In_Company_Years = In.Company.Years, 
                    Year_of_Experience = Year.of.Experience) -> employees

employees$Gender <- as.factor(employees$Gender)
employees$Marital_Status <- as.factor(employees$Marital_Status)
employees$Department <- as.factor(employees$Department)
employees$Position <- as.factor(employees$Position)

p1 <- ggplot(employees, aes(x = Salary)) + 
  geom_histogram(fill = "orange3", color = "white", bins = 10) + 
  labs(title = "Salary Distribution of employees",
       subtitle = "From some fake company",
       x = "Salary",
       y = "Frequency") + 
  theme(plot.title = element_text(face = "bold"))

print(p1)

p2 = ggplot(employees, aes(x = Position, y = Salary)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Position",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p2)

# Year_of_Experience is numeric, so group explicitly to get one box per year
p3 = ggplot(employees, aes(x = Year_of_Experience, y = Salary, 
                           group = Year_of_Experience)) + 
  geom_boxplot(color = "blue2", fill = "yellow2") + 
  labs(title = "Salary distribution of employees by Years of Experience",
       subtitle = "From a fake company", 
       x = "", 
       y = "Salary") + 
  stat_summary(fun = mean, geom = "point", shape = 20, size = 8, 
               color = "white", fill = "white") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(p3)

# ------------------------------------------------------

employees |> summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble1

employees |> group_by(Gender)|> 
                       summarise(MIN = min(Salary), 
                       LQ = quantile(Salary, 0.25), 
                       HQ = quantile(Salary, 0.75), 
                       AVG = mean(Salary), 
                       MEDIAN = median(Salary), 
                       MAX = max(Salary)) -> tibble2

employees |> group_by(Gender)|> tally(Salary >= 100000) -> tibble3

# Correlation

employees |> select(c(Salary, Age, In_Company_Years, Year_of_Experience)) ->
  cor_employees

cor_matrix <- cor(cor_employees)

cor_table <- melt(cor_matrix)

head(cor_table)

pc <- ggplot(cor_table, aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(midpoint = 0.5, low = "steelblue4", mid = "white", high = "red4") + 
  labs(title = "Correlation Matrix", 
       subtitle = "Correlation Coefficients between 4 variables", 
       x = "", 
       y = "", 
       fill = "Correlation\nCoefficient", 
       caption = "Source: Employee data set") + 
  theme(plot.title = element_text(size = 18, face = "bold"), 
        legend.title = element_text(face = "bold", color = "brown", size = 10), 
        axis.text.x = element_text(size = 14, face = "bold", angle = 45, vjust = 0.7), 
        axis.text.y = element_text(size = 14, face = "bold")) + 
  geom_text(aes(x = Var1, y = Var2, label = round(value, 2)), color = "white", 
            fontface = "bold", size = 5)

print(pc)

# Others

temp <- p1 + p2 + plot_layout(ncol = 2)
print(temp)

temp <- p1 + p2 + p3 + pc + plot_layout(ncol = 2, nrow = 2)
print(temp)

# save
write_csv(employees, "D:\\SystemFiles\\dawd\\cleaned_employyees.csv")