[R] Merging and adding info from a paper-based survey

Before merging datasets in R

1. Load Required Libraries: Ensure that the necessary libraries, such as dplyr or data.table for data manipulation, are loaded into your R environment.

2. Check the Structure of Datasets: Examine the structure of your datasets (df1 and df2) using functions like str(), summary(), or head() to understand the variables, data types, and potential issues.

3. Ensure Common Key(s) for Merging: Confirm that there is a common key (e.g., ID) in both datasets that you can use for merging.

4. Handle Duplicate Columns: Check for any columns with the same name in both datasets. If necessary, rename or remove duplicate columns to avoid conflicts during merging.

5. Handle Missing Values: Address any missing or incomplete data in both datasets before merging. Decide on an appropriate strategy, such as imputation or removal of missing values.


# Example removing rows with missing values in df1
df1 <- na.omit(df1)

6. Convert IDs to the Same Data Type: Ensure that the common key(s) used for merging are of the same data type in both datasets.

7. Handle Different Levels of Categorical Variables: If your datasets contain categorical variables with different levels, harmonize the levels to avoid issues during merging.
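The checks in steps 2 to 7 can be sketched as follows. This is a minimal illustration on made-up toy data, not a complete cleaning routine: the data frames, the Gender column, and the "F"/"M" coding are assumptions chosen for the example.

```r
# Hypothetical toy data; in practice df1 and df2 are your own survey data frames
df1 <- data.frame(ID = c(1, 2, 3),
                  Gender = factor(c("Male", "Female", "Male")))
df2 <- data.frame(ID = c("2", "3", "4"),
                  Gender = factor(c("F", "M", "F")),
                  stringsAsFactors = FALSE)

# Step 2: inspect the structure of both data frames
str(df1)
str(df2)

# Steps 3-4: columns present in both data frames (key candidates and name clashes)
shared_cols <- intersect(names(df1), names(df2))
print(shared_cols)

# Step 6: bring the key to the same type in both data frames
df1$ID <- as.character(df1$ID)
df2$ID <- as.character(df2$ID)

# Step 7: recode df2's labels so that both factors use the same levels
df2$Gender <- factor(df2$Gender, levels = c("F", "M"), labels = c("Female", "Male"))
same_levels <- identical(levels(df1$Gender), levels(df2$Gender))
print(same_levels)
```

Converting the key to character is one possible choice; converting both to numeric works equally well when every ID is a number.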

Merge answers

Assuming you have two datasets (df1 and df2) that you want to merge based on a common identifier (e.g., ID), you can use the merge function:


merged_data <- merge(df1, df2, by = "ID", all.x = TRUE, all.y = TRUE)

This command merges the two datasets (df1 and df2) based on the common column "ID" and includes all observations from both datasets (all.x = TRUE, all.y = TRUE). Adjust these parameters based on your specific merging needs.
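To see what adjusting these parameters does, here is a small sketch on made-up data (the IDs, names, and ages are invented for illustration). It compares keeping only matched IDs, keeping all rows of df1, and keeping all rows of both.

```r
# Hypothetical toy data: ID 1 exists only in df1, ID 4 only in df2
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(2, 3, 4), Age = c(28, 22, 31))

# Default: keep only IDs found in both data frames
inner_merge <- merge(df1, df2, by = "ID")

# all.x = TRUE: keep every row of df1; Age is NA where df2 has no match
left_merge <- merge(df1, df2, by = "ID", all.x = TRUE)

# all = TRUE is shorthand for all.x = TRUE, all.y = TRUE
full_merge <- merge(df1, df2, by = "ID", all = TRUE)

print(full_merge)
```

Rows without a match on the other side are filled with NA, so it is worth counting rows before and after the merge.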

Rbind

The rbind function in R combines data frames or matrices by rows. The name is short for **"row bind"** (row-wise binding, or row concatenation). It is particularly useful when you want to append new rows to an existing data frame.


combined_data <- rbind(dataframe1, dataframe2, ...)


# Create two sample data frames
df1 <- data.frame(ID = c(1, 2, 3), Name = c("Alice", "Bob", "Charlie"))
df2 <- data.frame(ID = c(4, 5, 6), Name = c("David", "Eva", "Frank"))

# Combine the data frames using rbind
combined_df <- rbind(df1, df2)

# Print the result
print(combined_df)

The two data frames must have the same variables (column names), but the columns do not have to be in the same order: rbind matches data frame columns by name.
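A short sketch of this behaviour, using invented data frames (df_a, df_b, and df_c are hypothetical names): the columns of df_b are in a different order, and df_c has a column name that does not match.

```r
# Same variables, different column order
df_a <- data.frame(ID = c(1, 2), Name = c("Alice", "Bob"))
df_b <- data.frame(Name = c("Charlie", "David"), ID = c(3, 4))

# rbind lines the columns up by name, using the order of the first data frame
stacked <- rbind(df_a, df_b)
print(stacked)

# A differently named column makes rbind stop with an error
df_c <- data.frame(ID = 5, FullName = "Eva")
bind_attempt <- tryCatch(rbind(df_a, df_c),
                         error = function(e) "names do not match")
print(bind_attempt)
```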


# Stack the two school-level data frames, then compare mean BMI overall and by school
cuhksz_students_h <- rbind(SME_students_health, SSE_students_health)
mean(cuhksz_students_h$BMI)
mean(SSE_students_health$BMI)
mean(SME_students_health$BMI)

++If one data frame has an extra variable, either delete it or create the same variable in the other data frame and fill it with NA.++
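Both options can be sketched on made-up data. The data frames survey_a and survey_b and the Height variable are hypothetical; only survey_a has the extra variable.

```r
# Hypothetical toy data: survey_a has an extra variable (Height)
survey_a <- data.frame(ID = c(1, 2), BMI = c(21.5, 23.0), Height = c(170, 165))
survey_b <- data.frame(ID = c(3, 4), BMI = c(19.8, 24.1))

# Option 1: keep only the variables that survey_b also has, then bind
option1 <- rbind(survey_a[, names(survey_b)], survey_b)

# Option 2: create the missing variable in survey_b, filled with NA, then bind
survey_b$Height <- NA_real_
option2 <- rbind(survey_a, survey_b)

print(option1)
print(option2)
```

Option 2 keeps the Height information for the respondents who have it, at the cost of NA values for the others.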

Read the variable names carefully: near-identical names such as Rank and Ranking count as different variables, so rename one of them first and then do the rbind.


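# Solution 0: rename the variable with dplyr's rename() (new_name = old_name)
library(dplyr)
# Assign the result back; rename() does not modify the data frame in place
Survey_GE_class_choice_2 <- rename(Survey_GE_class_choice_2, Rank = Ranking)
# Finally, bind the data
cuhksz_GE_choice <- rbind(Survey_GE_class_choice_1, Survey_GE_class_choice_2)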


# How to bind the two when the variable names do not match
# Solution 1: create a new variable with the matching name, then delete the old one
Survey_GE_class_choice_2$Rank <- Survey_GE_class_choice_2$Ranking
# Drop the old column by position (this assumes Ranking is column 7)
Survey_GE_class_choice_2 <- Survey_GE_class_choice_2[, -7]
# Alternative with dplyr; run this INSTEAD of the line above, not in addition to it
# Survey_GE_class_choice_2 <- select(Survey_GE_class_choice_2, -7)
# Finally, bind the data
merged_data <- rbind(Survey_GE_class_choice_1, Survey_GE_class_choice_2)

The notation [, -7] subsets a data frame by position. Inside the square brackets, the part before the comma selects rows and the part after it selects columns. Here it means: keep every row and exclude the column at position 7.

Here's a breakdown of the notation:

[ , ]: The row index before the comma is left empty, so all rows are kept.
-7: A negative column index excludes the column at position 7.


# Creating a sample data frame
df <- data.frame(A = 1:3, B = 4:6, C = 7:9, D = 10:12)

# Excluding column at position 3 (C)
df_subset <- df[, -3]

# Printing the result
print(df_subset)

In this example, df[, -3] creates a new data frame (df_subset) that includes all columns from the original data frame df except the one at position 3 (C). The result will look like:

  A B  D
1 1 4 10
2 2 5 11
3 3 6 12

Add new variables by matching the ID

It is common to add some new info to existing datasets.

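If you want to add new variables from one dataset to another based on matching IDs, you can use the merge function or the match() function along with indexing:

df1$new_variable <- df2$new_variable[match(df1$ID, df2$ID)]

This command looks up each ID of df1 in df2 and copies the corresponding value of new_variable into df1. IDs of df1 that do not appear in df2 get NA. Indexing with df2$ID %in% df1$ID instead does not align the rows: it only works when both data frames hold exactly the same IDs in the same order.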
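A small sketch of an ID-based lookup when the rows of the two data frames are not in the same order. The data are invented for illustration; match() is used because it returns, for each ID in df1, the row position of that ID in df2.

```r
# Hypothetical toy data: df1 is in a different order and contains an ID (5) missing from df2
df1 <- data.frame(ID = c(3, 1, 2, 5))
df2 <- data.frame(ID = c(1, 2, 3, 4), new_variable = c("a", "b", "c", "d"))

# Row position in df2 of each ID in df1 (NA when the ID is absent from df2)
positions <- match(df1$ID, df2$ID)

# Copy the values across, aligned by ID
df1$new_variable <- df2$new_variable[positions]
print(df1)
```

The same result can be obtained with merge(df1, df2, by = "ID", all.x = TRUE), which also sorts the output by ID.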

Before merging, make sure there is a way to match the information correctly.

In the present case, the ID is unique to each respondent and corresponds across the two data frames:

name_of_new_dataframe <- merge(dataframeA, dataframeB, by = "variable_identifying_the_respondent")
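One way to verify that the ID really is unique and present in both data frames before merging is sketched below. The contents of dataframeA and dataframeB are made up for the example; only the checks matter.

```r
# Hypothetical toy data standing in for the two survey data frames
dataframeA <- data.frame(ID = c(1, 2, 3),
                         Occupation = c("Teacher", "Engineer", "Nurse"))
dataframeB <- data.frame(ID = c(1, 2, 3),
                         Monthly_salary = c(8000, 12000, 9000))

# anyDuplicated() returns 0 when no ID appears more than once
dup_a <- anyDuplicated(dataframeA$ID)
dup_b <- anyDuplicated(dataframeB$ID)

# IDs that appear in one data frame but not in the other
only_in_a <- setdiff(dataframeA$ID, dataframeB$ID)
only_in_b <- setdiff(dataframeB$ID, dataframeA$ID)

# Merge only when every check passes
if (dup_a == 0 && dup_b == 0 && length(only_in_a) == 0 && length(only_in_b) == 0) {
  merged <- merge(dataframeA, dataframeB, by = "ID")
  print(merged)
}
```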

Notice:

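Error: object 'CUHKSZ_employment_survey_2' not found

The data was imported under the name CUHKSZ_employment_survey_2_xls. How can we change the name?

R has no rename command for objects: copy the object to the new name with a plain assignment (or with the assign function), then optionally remove the old one.

# Load the data from the Excel file with the original name
CUHKSZ_employment_survey_2_xls <- readxl::read_excel("path_to_file/CUHKSZ_employment_survey_2_xls.xlsx")

# Copy the object to the expected name
CUHKSZ_employment_survey_2 <- CUHKSZ_employment_survey_2_xls

# Equivalent with assign(), useful when the new name is stored as a string
new_name <- "CUHKSZ_employment_survey_2"
assign(new_name, CUHKSZ_employment_survey_2_xls)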


# Merge two datasets to add new variables (example from a longitudinal survey)
CUHK_employement_1<-merge(CUHKSZ_employment_survey_1,CUHKSZ_employment_survey_1b, by="ID")
#Let's practice
CUHK_employement_2<-merge(CUHKSZ_employment_survey_2,CUHKSZ_employment_survey_2b, by="ID")

Notice: (deleting a column by position)

# Remove the last column from the data frame CUHK_employement_2.
# ncol(CUHK_employement_2) returns the number of columns, so the negative index
# -ncol(CUHK_employement_2) drops the column in the last position.
CUHK_employement_2 <- CUHK_employement_2[, -ncol(CUHK_employement_2)]

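Notice: (renaming a column)

# Suppose you want to rename the last column from "old_name" to "new_name"
new_column_name <- "new_name"

# Get the column names of the data frame
column_names <- names(your_dataframe)

# Find the position of the column to rename (assumed here to be the last one)
column_index_to_change <- which(column_names == "old_name")

# Replace the column name
column_names[column_index_to_change] <- new_column_name

# Assign the new column names
names(your_dataframe) <- column_names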

Be careful with the names:


> print(names(CUHK_employement_1))
[1] "ID"                "Months_find_job"  
[3] "Occupation"        "Monthlysalary"    
[5] "FatherPC"          "Gender"           
[7] "Monthly_salary_19" "Monthly_salary_22"
> print(names(CUHK_employement_2))
[1] "ID"                "Months_find_job"  
[3] "Occupation"        "Monthlysalary"    
[5] "FatherPC"          "Gender"           
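[7] "Monthly_salary_19" "Month_salary_22"

The last column is called "Monthly_salary_22" in CUHK_employement_1 but "Month_salary_22" in CUHK_employement_2 ("Month" versus "Monthly"), so the two data frames cannot be bound until one of them is renamed.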

Manually add observations to an existing dataframe

Demi and Amy have to add the new respondents to the existing dataframe without writing an Excel file (and risking some errors).

They have to open the data editor using the command fix(dataframe).


# Assuming existing_dataframe is your current dataframe
existing_dataframe <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 28, 22)
)

# Open the data editor
fix(existing_dataframe)

Remember to input NA if there is not enough info.
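When the data editor is not available (for example in a non-interactive session), a scripted alternative is to build the new respondent as a one-row data frame and append it with rbind. This is a sketch on the same toy data; the new respondent "Carol" and the missing age are invented for illustration.

```r
# Same toy data frame as above
existing_dataframe <- data.frame(
  ID = c(1, 2, 3),
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 28, 22),
  stringsAsFactors = FALSE
)

# Hypothetical new respondent whose age is unknown, so Age is entered as NA
new_respondent <- data.frame(ID = 4, Name = "Carol", Age = NA_real_,
                             stringsAsFactors = FALSE)

# Append the new row; the column names must match the existing ones
existing_dataframe <- rbind(existing_dataframe, new_respondent)
print(existing_dataframe)
```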

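Appendix: levels()

levels() is the R function for working with factor variables. In R, a factor is a special data type for categorical variables: it maps discrete labels to integer codes. levels() gets or sets the levels of a factor, that is, the labels of its distinct categories.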

# Create a factor variable
gender <- factor(c("Male", "Female", "Male", "Female"))

# Get the levels of the factor
gender_levels <- levels(gender)
print(gender_levels)

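# Create a factor variable
gender <- factor(c("Male", "Female", "Male", "Female"))

# Set the levels of the factor. Levels are sorted alphabetically by default
# ("Female", "Male"), so the new labels must follow that order.
levels(gender) <- c("F", "M")
print(gender)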
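Because the default level order is alphabetical rather than the order of appearance, it can be safer to spell out the mapping from old labels to new ones. A minimal sketch, using the levels and labels arguments of factor():

```r
gender <- factor(c("Male", "Female", "Male", "Female"))

# Check the current order of the levels before relabelling
current_levels <- levels(gender)
print(current_levels)

# Explicit mapping: "Female" becomes "F" and "Male" becomes "M"
gender_short <- factor(gender, levels = c("Female", "Male"), labels = c("F", "M"))
print(gender_short)
```

Writing the old levels next to the new labels makes a swapped recoding much easier to spot.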