MachineLearning 23. 机器学习之岭回归预测基因型和表型 (Ridge)

简介

岭回归(英文名：ridge regression, Tikhonov regularization)是一种专用于共线性数据分析的有偏估计回归方法，实质上是一种改良的最小二乘估计法，通过放弃最小二乘法的无偏性，以损失部分信息、降低精度为代价获得回归系数更为符合实际、更可靠的回归方法，对病态数据的拟合要强于最小二乘法。

迄今为止，许多遗传变异已被确定为与不同的表型性状相关。然而，已确定的关联通常只能解释性状遗传性的一小部分，并且仅包含已知相关变异的模型的预测能力很小。多元回归是一种流行的框架，它同时考虑许多遗传变异的联合效应。由于遗传数据的高维数和预测因子之间的相关结构，普通多元回归很少适用于遗传数据。人们重新对使用惩罚性回归技术来规避这些困难产生了兴趣。在本文中着重于岭回归，一种惩罚回归方法，已被证明在多变量预测中提供良好的性能问题。岭回归应用中的一个挑战是控制量的岭参数的选择收缩的回归系数。我们提出了一种基于数据确定脊形参数的方法。

目的:在高维预测问题中获得良好的性能。为我们的方法建立了理论依据，并在模拟遗传数据和实际数据实例上验证了其性能。岭回归模型拟合数十万到数百万的基因变异同时存在，这给计算带来了挑战。

软件包安装

ridge 软件包安装非常简单，直接install.packages()即可。

go 复制代码

install.packages("ridge")
devtools::install_github("selva86/InformationValue")

数据读取

输入：自变量X至少一项或以上的定量变量或二分类定类变量，因变量Y要求为定量变量（若为定类变量，请使用逻辑回归）。输出：模型检验优度的结果，自变量对因变量的线性关系等等。

表型为数值型变量

go 复制代码

library(ridge)
data(GenCont)
head(GenCont[, 1:10])
##      Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9
## [1,]  1.4356316    1    1    1    0    0    1    0    0    0
## [2,]  2.9226960    1    0    0    0    1    0    0    1    0
## [3,]  0.5669319    0    0    0    0    0    0    0    0    0
## [4,]  4.8515051    1    0    0    0    1    0    0    0    0
## [5,]  0.1525582    1    1    1    0    0    1    0    0    0
## [6,]  2.9522701    1    0    0    0    1    0    0    0    0

表型为分类型变量

go 复制代码

data(GenBin)
head(GenBin)
##      Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,]          0    1    0    0    0    0    0    0    1    0     1     0     0
## [2,]          0    0    0    0    0    0    0    1    0    0     0     0     0
## [3,]          1    0    0    0    0    0    0    2    0    1     0     0     1
## [4,]          0    0    0    0    0    0    1    0    0    0     0     0     0
## [5,]          0    0    0    0    0    0    0    1    0    0     0     0     0
## [6,]          0    0    0    0    0    0    0    0    0    0     0     0     0
##      SNP13 SNP14
## [1,]     0     1
## [2,]     1     0
## [3,]     1     0
## [4,]     0     0
## [5,]     0     0
## [6,]     0     0

实例操作

基于乳腺癌分类的逻辑岭回归模型

这个我们采用glmnet里面的函数和参数解决实际的问题，也可以使用R包ridge，常规数据我们可以模仿Lasso回归的模式来做。

数据读取

这个数据集经常使用，在分享的机器学习里面的例子中经常出现，并且有对数据进行处理的常规步骤：

go 复制代码

library(glmnet)
library(caret)
BreastCancer <- read.csv("wisc_bc_data.csv", stringsAsFactors = FALSE)
BreastCancer = BreastCancer[, -1]
dim(BreastCancer)
## [1] 568  31
str(BreastCancer)
## 'data.frame':	568 obs. of  31 variables:
##  $ diagnosis              : chr  "M" "M" "M" "M" ...
##  $ radius_mean            : num  20.6 19.7 11.4 20.3 12.4 ...
##  $ texture_mean           : num  17.8 21.2 20.4 14.3 15.7 ...
##  $ perimeter_mean         : num  132.9 130 77.6 135.1 82.6 ...
##  $ area_mean              : num  1326 1203 386 1297 477 ...
##  $ smoothness_mean        : num  0.0847 0.1096 0.1425 0.1003 0.1278 ...
##  $ compactne_mean         : num  0.0786 0.1599 0.2839 0.1328 0.17 ...
##  $ concavity_mean         : num  0.0869 0.1974 0.2414 0.198 0.1578 ...
##  $ concave_points_mean    : num  0.0702 0.1279 0.1052 0.1043 0.0809 ...
##  $ symmetry_mean          : num  0.181 0.207 0.26 0.181 0.209 ...
##  $ fractal_dimension_mean : num  0.0567 0.06 0.0974 0.0588 0.0761 ...
##  $ radius_se              : num  0.543 0.746 0.496 0.757 0.335 ...
##  $ texture_se             : num  0.734 0.787 1.156 0.781 0.89 ...
##  $ perimeter_se           : num  3.4 4.58 3.44 5.44 2.22 ...
##  $ area_se                : num  74.1 94 27.2 94.4 27.2 ...
##  $ smoothness_se          : num  0.00522 0.00615 0.00911 0.01149 0.00751 ...
##  $ compactne_se           : num  0.0131 0.0401 0.0746 0.0246 0.0335 ...
##  $ concavity_se           : num  0.0186 0.0383 0.0566 0.0569 0.0367 ...
##  $ concave_points_se      : num  0.0134 0.0206 0.0187 0.0188 0.0114 ...
##  $ symmetry_se            : num  0.0139 0.0225 0.0596 0.0176 0.0216 ...
##  $ fractal_dimension_se   : num  0.00353 0.00457 0.00921 0.00511 0.00508 ...
##  $ radius_worst           : num  25 23.6 14.9 22.5 15.5 ...
##  $ texture_worst          : num  23.4 25.5 26.5 16.7 23.8 ...
##  $ perimeter_worst        : num  158.8 152.5 98.9 152.2 103.4 ...
##  $ area_worst             : num  1956 1709 568 1575 742 ...
##  $ smoothness_worst       : num  0.124 0.144 0.21 0.137 0.179 ...
##  $ compactne_worst        : num  0.187 0.424 0.866 0.205 0.525 ...
##  $ concavity_worst        : num  0.242 0.45 0.687 0.4 0.535 ...
##  $ concave_points_worst   : num  0.186 0.243 0.258 0.163 0.174 ...
##  $ symmetry_worst         : num  0.275 0.361 0.664 0.236 0.399 ...
##  $ fractal_dimension_worst: num  0.089 0.0876 0.173 0.0768 0.1244 ...
table(BreastCancer$diagnosis)
## 
##   B   M 
## 357 211
# 删除方差为0的变量
zerovar = nearZeroVar(BreastCancer[, -1])
zerovar
## integer(0)
# 首先删除强相关的变量
descrCorr = cor(BreastCancer[, -1])
descrCorr[1:5, 1:5]
##                 radius_mean texture_mean perimeter_mean area_mean
## radius_mean       1.0000000   0.32938305      0.9978764 0.9873442
## texture_mean      0.3293830   1.00000000      0.3359176 0.3261929
## perimeter_mean    0.9978764   0.33591759      1.0000000 0.9865482
## area_mean         0.9873442   0.32619289      0.9865482 1.0000000
## smoothness_mean   0.1680940  -0.01776898      0.2045046 0.1748380
##                 smoothness_mean
## radius_mean          0.16809398
## texture_mean        -0.01776898
## perimeter_mean       0.20450464
## area_mean            0.17483805
## smoothness_mean      1.00000000
highCorr = findCorrelation(descrCorr, 0.9)
BreastCancer = BreastCancer[, -(highCorr + 1)]
# 随后解决多重共线性，本例中不存在多重共线性问题
comboInfo = findLinearCombos(BreastCancer[, -1])
Process = preProcess(BreastCancer)
BreastCancer = predict(Process, BreastCancer)

数据分割

go 复制代码

library(tidyverse)
library(sampling)
set.seed(123)
# 每层抽取70%的数据
train_id <- strata(BreastCancer, "diagnosis", size = rev(round(table(BreastCancer$diagnosis) *
    0.7)))$ID_unit
# 训练数据
trainData <- BreastCancer[train_id, ]
# 测试数据
testData <- BreastCancer[-train_id, ]

# 查看训练、测试数据中正负样本比例
prop.table(table(trainData$diagnosis))
## 
##         B         M 
## 0.6281407 0.3718593
prop.table(table(testData$diagnosis))
## 
##         B         M 
## 0.6294118 0.3705882
prop.table(table(BreastCancer$diagnosis))
## 
##         B         M 
## 0.6285211 0.3714789

可视化重要变量

go 复制代码

#4. How to visualize the importance of variables using featurePlot()
featurePlot(x = trainData[, 2:21], 
            y = as.factor(trainData$diagnosis), 
            plot = "box", #For classification:box, strip, density, pairs or ellipse. For regression, pairs or scatter
            strip=strip.custom(par.strip.text=list(cex=.7)),
            scales = list(x = list(relation="free"), 
                          y = list(relation="free"))
)

构建输入数据

go 复制代码

x = as.matrix(trainData[, -1])
y = ifelse(trainData$diagnosis == "M", 1, 0)

构建逻辑岭回归模型

go 复制代码

ridge <- glmnet(x, y, family = "binomial", alpha = 0)
plot(ridge, xvar = "lambda", label = TRUE)

交叉验证

go 复制代码

ridge_model <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
plot(ridge_model)

go 复制代码

plot(ridge_model$glmnet.fit, "lambda", label = T)

确定模型均方误差最小的情况下参数lambda的取值

在cv.glmnet()函数的输出结果中，包含一个lambda.min取值，且ridge_model$lambda.min = 0.15，该值为在模型均方误差最小的情况下参数lambda的取值，可以使用该值建立Ridge回归模型。下面使用该参数对全部的训练数据集建立Ridge模型。

go 复制代码

ridge_min <- ridge_model$lambda.min
## 使用ridge_min 拟合ridge模型
ridge_best <- glmnet(x, y, alpha = 0, lambda = ridge_min)
coef(ridge_best)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                                    s0
## (Intercept)              0.3729332067
## area_mean                0.0885358564
## smoothness_mean          0.0052045287
## compactne_mean          -0.0003435624
## symmetry_mean            0.0075012310
## fractal_dimension_mean  -0.0839607648
## radius_se                0.0636875678
## texture_se               0.0027998059
## smoothness_se            0.0388690882
## compactne_se            -0.0551225591
## concavity_se            -0.0524564964
## concave_points_se        0.0598000796
## symmetry_se             -0.0177510728
## fractal_dimension_se     0.0119540616
## texture_worst            0.0557146338
## smoothness_worst         0.0310525197
## compactne_worst         -0.0004504186
## concavity_worst          0.0892093379
## concave_points_worst     0.1296407274
## symmetry_worst           0.0614660535
## fractal_dimension_worst  0.0430986166

验证集验证

go 复制代码

ridge.coef <- predict(ridge_model, s = 0.05, type = "coefficients")
ridge.coef
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)              0.3729040858
## area_mean                0.0894265533
## smoothness_mean          0.0055832621
## compactne_mean           0.0009547588
## symmetry_mean            0.0076406106
## fractal_dimension_mean  -0.0794539760
## radius_se                0.0632058362
## texture_se               0.0022559684
## smoothness_se            0.0356625085
## compactne_se            -0.0512845944
## concavity_se            -0.0465064056
## concave_points_se        0.0570207458
## symmetry_se             -0.0175198162
## fractal_dimension_se     0.0076460181
## texture_worst            0.0559848075
## smoothness_worst         0.0332178942
## compactne_worst          0.0054483603
## concavity_worst          0.0835264079
## concave_points_worst     0.1249131201
## symmetry_worst           0.0592973698
## fractal_dimension_worst  0.0381440680
ridge.y <- predict(ridge_model, newx = as.matrix(testData[, -1]), type = "response",
    s = 0.05)

绘制ROC曲线

go 复制代码

library(InformationValue)
actuals <- ifelse(testData$diagnosis == "M", 1, 0)
misClassError(actuals, ridge.y)
## [1] 0.0588
plotROC(actuals, ridge.y)

基于基因型和表型数据构建岭回归模型

这样的数据我们使用特殊的软件包 ridge 来做，非常方便，并且里面内置了基因型预测函数，非常方便对基因型和表型的数据进行处理。

构建逻辑岭回归模型

go 复制代码

data(GenBin)
head(GenBin)
##      Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,]          0    1    0    0    0    0    0    0    1    0     1     0     0
## [2,]          0    0    0    0    0    0    0    1    0    0     0     0     0
## [3,]          1    0    0    0    0    0    0    2    0    1     0     0     1
## [4,]          0    0    0    0    0    0    1    0    0    0     0     0     0
## [5,]          0    0    0    0    0    0    0    1    0    0     0     0     0
## [6,]          0    0    0    0    0    0    0    0    0    0     0     0     0
##      SNP13 SNP14
## [1,]     0     1
## [2,]     1     0
## [3,]     1     0
## [4,]     0     0
## [5,]     0     0
## [6,]     0     0
table(GenBin[, 1])
## 
##   0   1 
## 250 250
beta_logisticRidge <- logisticRidge(Phenotypes ~ ., data = as.data.frame(GenBin))
summary(beta_logisticRidge)
## 
## Call:
## logisticRidge(formula = Phenotypes ~ ., data = as.data.frame(GenBin))
## 
## 
## Coefficients:
##              Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept)  0.002782              NA                  NA               NA
## SNP1         0.166067        1.914369            0.666082            2.874
## SNP2         0.110745        0.269635            0.963936            0.280
## SNP3        -0.284802       -0.491809            0.999027           -0.492
## SNP4        -1.007615       -1.422131            0.990998           -1.435
## SNP5         0.911082        0.910170            0.986458            0.923
## SNP6        -0.026877       -0.355500            0.959424           -0.371
## SNP7        -0.068540       -1.022005            0.922686           -1.108
## SNP8         0.149821        1.729257            0.667675            2.590
## SNP9        -0.159520       -0.795170            0.948859           -0.838
## SNP10       -0.207470       -1.214400            0.893062           -1.360
## SNP11        0.027842        0.067788            0.999575            0.068
## SNP12        0.090563        0.651787            0.958376            0.680
## SNP13       -0.051310       -0.705257            0.947386           -0.744
## SNP14       -0.196030       -0.990720            0.904846           -1.095
##             Pr(>|t|)   
## (Intercept)       NA   
## SNP1         0.00405 **
## SNP2         0.77969   
## SNP3         0.62252   
## SNP4         0.15127   
## SNP5         0.35618   
## SNP6         0.71098   
## SNP7         0.26802   
## SNP8         0.00960 **
## SNP9         0.40202   
## SNP10        0.17389   
## SNP11        0.94593   
## SNP12        0.49644   
## SNP13        0.45662   
## SNP14        0.27356   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Ridge paramter: 0.1244006, chosen automatically, computed using 3 PCs
## 
## Degrees of freedom: model 7.02 , variance 4.066
pvalsMod <- pvals(beta_logisticRidge)
print(pvalsMod)
## 
## 	lambda 0.124401 chosen automatically using 3 PCs
## 
##       Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)   
## SNP1            1.91437             0.66608            2.874  0.00405 **
## SNP2            0.26964             0.96394            0.280  0.77969   
## SNP3           -0.49181             0.99903           -0.492  0.62252   
## SNP4           -1.42213             0.99100           -1.435  0.15127   
## SNP5            0.91017             0.98646            0.923  0.35618   
## SNP6           -0.35550             0.95942           -0.371  0.71098   
## SNP7           -1.02201             0.92269           -1.108  0.26802   
## SNP8            1.72926             0.66767            2.590  0.00960 **
## SNP9           -0.79517             0.94886           -0.838  0.40202   
## SNP10          -1.21440             0.89306           -1.360  0.17389   
## SNP11           0.06779             0.99958            0.068  0.94593   
## SNP12           0.65179             0.95838            0.680  0.49644   
## SNP13          -0.70526             0.94739           -0.744  0.45662   
## SNP14          -0.99072             0.90485           -1.095  0.27356   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pvalsMod)

go 复制代码

pred_phen <- predict(beta_logisticRidge, type = "response")
res <- cbind(GenBin, pred_phen)
head(res)
##   Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## 1          0    1    0    0    0    0    0    0    1    0     1     0     0
## 2          0    0    0    0    0    0    0    1    0    0     0     0     0
## 3          1    0    0    0    0    0    0    2    0    1     0     0     1
## 4          0    0    0    0    0    0    1    0    0    0     0     0     0
## 5          0    0    0    0    0    0    0    1    0    0     0     0     0
## 6          0    0    0    0    0    0    0    0    0    0     0     0     0
##   SNP13 SNP14          
## 1     0     1 0.4788050
## 2     1     0 0.4707666
## 3     1     0 0.4367005
## 4     0     0 0.4939766
## 5     0     0 0.4835665
## 6     0     0 0.5006956

预测基因型

go 复制代码

genotypesfile <- system.file("extdata", "GenBin_genotypes.txt", package = "ridge")
phenotypesfile <- system.file("extdata", "GenBin_phenotypes.txt", package = "ridge")
beta_logisticRidgeGenotypes <- logisticRidgeGenotypes(genotypesfilename = genotypesfile,
    phenotypesfilename = phenotypesfile)
beta_logisticRidgeGenotypes
##                   B
## Intercept  0.002782
## SNP1       0.166067
## SNP2       0.110745
## SNP3      -0.284802
## SNP4      -1.007615
## SNP5       0.911082
## SNP6      -0.026877
## SNP7      -0.068540
## SNP8       0.149821
## SNP9      -0.159520
## SNP10     -0.207470
## SNP11      0.027842
## SNP12      0.090563
## SNP13     -0.051310
## SNP14     -0.196030
## compare to output of logisticRidge
results <- as.data.frame(cbind(round(coef(beta_logisticRidge), 6), beta_logisticRidgeGenotypes))
results
##             round(coef(beta_logisticRidge), 6)         B
## (Intercept)                           0.002782  0.002782
## SNP1                                  0.166067  0.166067
## SNP2                                  0.110745  0.110745
## SNP3                                 -0.284802 -0.284802
## SNP4                                 -1.007615 -1.007615
## SNP5                                  0.911082  0.911082
## SNP6                                 -0.026877 -0.026877
## SNP7                                 -0.068540 -0.068540
## SNP8                                  0.149821  0.149821
## SNP9                                 -0.159520 -0.159520
## SNP10                                -0.207470 -0.207470
## SNP11                                 0.027842  0.027842
## SNP12                                 0.090563  0.090563
## SNP13                                -0.051310 -0.051310
## SNP14                                -0.196030 -0.196030

构建线性岭回归模型

go 复制代码

data(GenCont)
head(GenCont)
##      Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## [1,]  1.4356316    1    1    1    0    0    1    0    0    0     0     1     0
## [2,]  2.9226960    1    0    0    0    1    0    0    1    0     0     1     0
## [3,]  0.5669319    0    0    0    0    0    0    0    0    0     0     2     0
## [4,]  4.8515051    1    0    0    0    1    0    0    0    0     0     1     0
## [5,]  0.1525582    1    1    1    0    0    1    0    0    0     0     1     0
## [6,]  2.9522701    1    0    0    0    1    0    0    0    0     0     0     0
beta_linearRidge <- linearRidge(Phenotypes ~ ., data = as.data.frame(GenCont))
summary(beta_linearRidge)
## 
## Call:
## linearRidge(formula = Phenotypes ~ ., data = as.data.frame(GenCont))
## 
## 
## Coefficients:
##              Estimate Scaled estimate Std. Error (scaled) t value (scaled)
## (Intercept)  1.533386              NA                  NA               NA
## SNP1         0.277296        4.045409            0.266120           15.201
## SNP2        -0.110458       -1.256154            0.216332            5.807
## SNP3        -0.110458       -1.256154            0.216332            5.807
## SNP4         0.005230        0.011635            0.371693            0.031
## SNP5         0.531173        6.323006            0.315368           20.050
## SNP6        -0.119164       -1.373227            0.223047            6.157
## SNP7         0.113844        0.113730            0.372181            0.306
## SNP8        -0.099149       -1.028581            0.355807            2.891
## SNP9        -0.008321       -0.008312            0.372386            0.022
## SNP10        0.058562        0.101128            0.371567            0.272
## SNP11       -0.096526       -1.495699            0.329250            4.543
## SNP12       -0.334279       -0.333945            0.372248            0.897
##             Pr(>|t|)    
## (Intercept)       NA    
## SNP1         < 2e-16 ***
## SNP2        6.38e-09 ***
## SNP3        6.38e-09 ***
## SNP4         0.97503    
## SNP5         < 2e-16 ***
## SNP6        7.43e-10 ***
## SNP7         0.75993    
## SNP8         0.00384 ** 
## SNP9         0.98219    
## SNP10        0.78549    
## SNP11       5.55e-06 ***
## SNP12        0.36966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Ridge parameter: 2.206848, chosen automatically, computed using 1 PCs
## 
## Degrees of freedom: model 3.121 , variance 1.205 , residual 5.036
pvalsMod <- pvals(beta_linearRidge)
print(pvalsMod)
## 
## 	lambda 2.206848 chosen automatically using 1 PCs
## 
##       Estimate (scaled) Std. Error (scaled) t value (scaled) Pr(>|t|)    
## SNP1           4.045409            0.266120           15.201  < 2e-16 ***
## SNP2          -1.256154            0.216332            5.807 6.38e-09 ***
## SNP3          -1.256154            0.216332            5.807 6.38e-09 ***
## SNP4           0.011635            0.371693            0.031  0.97503    
## SNP5           6.323006            0.315368           20.050  < 2e-16 ***
## SNP6          -1.373227            0.223047            6.157 7.43e-10 ***
## SNP7           0.113730            0.372181            0.306  0.75993    
## SNP8          -1.028581            0.355807            2.891  0.00384 ** 
## SNP9          -0.008312            0.372386            0.022  0.98219    
## SNP10          0.101128            0.371567            0.272  0.78549    
## SNP11         -1.495699            0.329250            4.543 5.55e-06 ***
## SNP12         -0.333945            0.372248            0.897  0.36966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(pvalsMod)

go 复制代码

pred_phen <- predict(beta_linearRidge)
res <- cbind(GenCont, pred_phen)
head(res)
##   Phenotypes SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 SNP11 SNP12
## 1  1.4356316    1    1    1    0    0    1    0    0    0     0     1     0
## 2  2.9226960    1    0    0    0    1    0    0    1    0     0     1     0
## 3  0.5669319    0    0    0    0    0    0    0    0    0     0     2     0
## 4  4.8515051    1    0    0    0    1    0    0    0    0     0     1     0
## 5  0.1525582    1    1    1    0    0    1    0    0    0     0     1     0
## 6  2.9522701    1    0    0    0    1    0    0    0    0     0     0     0
##   pred_phen
## 1  1.374077
## 2  2.146179
## 3  1.340333
## 4  2.245328
## 5  1.374077
## 6  2.341855

预测基因型

go 复制代码

genotypesfile <- system.file("extdata", "GenCont_genotypes.txt", package = "ridge")
phenotypesfile <- system.file("extdata", "GenCont_phenotypes.txt", package = "ridge")
betafile <- tempfile(pattern = "beta", fileext = ".dat")
beta_linearRidgeGenotypes <- linearRidgeGenotypes(genotypesfilename = genotypesfile,
    phenotypesfilename = phenotypesfile, betafilename = betafile)
pred_phen_geno <- linearRidgeGenotypesPredict(genotypesfilename = genotypesfile,
    betafilename = betafile)

## compare to output of linearRidge
results <- cbind(pred_phen_geno, pred_phen)
head(results)
##   PredictedPhenotypes pred_phen
## 1            1.374076  1.374077
## 2            2.146180  2.146179
## 3            1.340334  1.340333
## 4            2.245329  2.245328
## 5            1.374076  1.374077
## 6            2.341855  2.341855

Reference

Significance testing in ridge regression for genetic data. Cule, E. et al (2011) BMC Bioinformatics, 12:372 A semi-automatic method to guide the choice of ridge parameter in ridge regression. Cule, E. and De Iorio, M. (2012) arXiv:1205.0686v1
Cule, Erika, and Maria De Iorio. "Ridge regression in prediction problems: automatic choice of the ridge parameter." Genetic epidemiology 37.7 (2013): 704-714.

MachineLearning 23. 机器学习之岭回归预测基因型和表型 (Ridge)

简 介

软件包安装

数据读取

表型为数值型变量

表型为分类型变量

实例操作

基于乳腺癌分类的逻辑岭回归模型

数据读取

数据分割

可视化重要变量

构建输入数据

构建逻辑岭回归模型

交叉验证

确定模型均方误差最小的情况下参数lambda的取值

验证集验证

绘制ROC曲线

基于基因型和表型数据构建岭回归模型

构建逻辑岭回归模型

预测基因型

构建线性岭回归模型

预测基因型

Reference

简介