一、数据描述
数据来源2013年各地区水泥制造业规模以上企业的各主要经济指标,原始数据来源于2014年(《中国水泥统计年鉴》),试对用主成分和因子进行经济效益评价。
地区,企业个数(亿元),流动资产合计(亿元),资产总额(亿元),负债总额(亿元),主营业务收入(亿元),利润总额(亿元),销售利润率(%)
北京,8,17.6,43.8,17.8,26.6,-1.4,-5.2
天津,24,43.8,91.7,33.7,35.9,1.5,4.1
河北,231,281.4,993.8,647,565.1,22.7,4
山西,113,103.4,317.4,238.5,124,-2.1,-1.7
内蒙古,116,135.9,384.4,256.8,245.8,11.9,4.8
辽宁,151,151.4,417.6,247.9,350.3,23,6.6
吉林,69,333.7,627.7,415.2,539.8,25.4,4.7
黑龙江,96,142.1,331.6,234.7,183.2,13.5,7.4
上海,14,21.5,28.3,12.6,31.6,1.2,4
江苏,254,300.3,680,435.7,713.3,62.6,8.8
浙江,192,259.8,561.9,300.1,473.9,42.1,8.9
安徽,169,217.2,591.9,305.2,518.8,64.9,12.5
福建,111,93.2,276.4,163.9,284.8,11.2,3.9
江西,138,143.8,398.1,208.4,400.3,47.5,11.9
山东,295,351.8,792.7,412.5,878.3,80.3,9.1
河南,238,388.5,804.2,475.2,673.7,58.3,8.7
湖北,151,193,619.7,360.7,570.5,49.1,8.6
湖南,220,86.4,398.8,212.3,434.1,33.6,7.7
广东,204,217,592.1,345.3,474.3,40.5,8.5
广西,148,116,387.2,178.7,344,49.6,14.4
海南,15,53.1,102.1,52.9,80.7,5.6,6.9
重庆,78,158.3,419.8,294.1,185.1,8.4,4.5
四川,196,218.2,739.1,433.3,465.2,37.1,8
贵州,133,91.5,367.5,244.2,224.7,28.2,12.6
云南,149,134.2,434.7,290.2,251,11.3,4.5
西藏,10,11.3,26.5,5.4,17.4,4.1,23.7
陕西,116,82.2,312.6,203.8,253.2,14.4,5.7
甘肃,68,61.8,213.2,126.8,124.3,13.3,10.7
青海,20,39.5,152.7,123.1,44.4,3,6.7
宁夏,27,36.1,90.1,49.2,45.1,3.4,7.4
新疆,86,220.6,602.7,353.4,136.1,1.5,1.1
二、读入数据
df<-read.csv('f:/桌面/水泥制造企业.csv')attach(df)
attach(df)
names(df)<-c('areas','x1','x2','x3','x4','x5','x6','x7')
df
df
areas x1 x2 x3 x4 x5 x6 x7
1 北京 8 17.6 43.8 17.8 26.6 -1.4 -5.2
2 天津 24 43.8 91.7 33.7 35.9 1.5 4.1
3 河北 231 281.4 993.8 647.0 565.1 22.7 4.0
4 山西 113 103.4 317.4 238.5 124.0 -2.1 -1.7
5 内蒙古 116 135.9 384.4 256.8 245.8 11.9 4.8
6 辽宁 151 151.4 417.6 247.9 350.3 23.0 6.6
7 吉林 69 333.7 627.7 415.2 539.8 25.4 4.7
8 黑龙江 96 142.1 331.6 234.7 183.2 13.5 7.4
9 上海 14 21.5 28.3 12.6 31.6 1.2 4.0
10 江苏 254 300.3 680.0 435.7 713.3 62.6 8.8
11 浙江 192 259.8 561.9 300.1 473.9 42.1 8.9
12 安徽 169 217.2 591.9 305.2 518.8 64.9 12.5
13 福建 111 93.2 276.4 163.9 284.8 11.2 3.9
14 江西 138 143.8 398.1 208.4 400.3 47.5 11.9
15 山东 295 351.8 792.7 412.5 878.3 80.3 9.1
16 河南 238 388.5 804.2 475.2 673.7 58.3 8.7
...
三、对样本数据进行主成分分析
1.提取主成分
df.pr<-princomp(df[,c('x1','x2','x3','x4','x5','x6','x7')],cor=TRUE,scores=TRUE)
summary(df.pr,loadings = TRUE)
运行得到:
summary(df.pr,loadings = TRUE)
Importance of components:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
Standard deviation 2.272309 1.099611 0.584717 0.4413558 0.22149843 0.18479654 0.08698225
Proportion of Variance 0.737627 0.172735 0.048842 0.0278278 0.00700879 0.00487854 0.00108084
Cumulative Proportion 0.737627 0.910362 0.959204 0.9870318 0.99404062 0.99891916 1.00000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
x1 0.407 0.340 0.681 0.478 0.151
x2 0.410 -0.155 -0.227 -0.591 0.596 0.224
x3 0.421 -0.178 -0.301 0.135 -0.338 0.190 -0.729
x4 0.400 -0.269 -0.450 0.216 -0.311 0.643
x5 0.426 0.272 -0.183 -0.831 -0.118
x6 0.377 0.360 0.479 -0.283 -0.439 0.433 0.196
x7 0.858 -0.486 0.102
说明:a、前两个主成分累计贡献率为91.036%,所有可以用前两个主成分来降维。
b、主成分的表达式:
Y1=0.407x1+0.410x2+0.421x3+0.400x4+0.426x5+0.377x6
Y2=-0.155x2-0.178x3-0.269x4+0.360x6+0.858x7
主成分的经济意义各线性组合中系数较大的几个指标的综合意义确定,成分Y1在x1,x2,x3,x4,x5这这个指标系数较大,在x6,x7系数较小,所有Y1主要反映的是水泥企业的整体规模和收入水平。主成分Y2在x6,x7上面系数较大,在其他指标上面系数较小,因此主成分2主要反映的是水泥企业的盈利能力。
2、计算主成分得分
主成分得分就是把各样本数据代入两个主成分的线下表达式,从而计算主成分得分。
df.pr$scores
df1<-data.frame(dfareas,df.prscores)
df1
df1<-data.frame(df$areas,df.pr$scores)
> df1
df.areas Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
1 北京 -3.515640 -1.82793290 1.2312186 -0.42043074 -0.20977172 0.058936541 -0.02181273
2 天津 -3.003073 -0.27388041 0.3057919 -0.26320901 0.07346600 0.064959941 -0.08878631
3 河北 3.492990 -1.76504930 -1.1354352 0.83091321 -0.27063999 -0.224951646 -0.06432118
4 山西 -1.366870 -1.86572957 0.2729763 0.44984962 0.17070907 0.104080572 0.11184461
5 内蒙古 -0.502928 -0.61534857 -0.1095937 0.19265129 0.09068896 -0.045207363 0.04420153
6 辽宁 0.179882 -0.10846130 0.1796696 0.21988421 0.17106575 -0.106211797 -0.03572307
7 吉林 1.642415 -1.10452228 -0.8629399 -1.39822126 -0.01338830 -0.426576464 0.02692693
8 黑龙江 -0.781065 -0.10578258 -0.3740632 0.00035975 0.15311967 0.125397343 0.13894098
9 上海 -3.320677 -0.18636325 0.4490147 -0.28078672 0.01768420 -0.026870111 -0.00445479
10 江苏 3.615331 0.32645483 0.4863494 -0.08531633 0.13592028 -0.091041741 0.23298086
11 浙江 1.788433 0.29379000 0.1195645 -0.19348952 0.40015949 0.170807670 -0.08634148
12 安徽 2.083389 1.31107200 0.2515930 -0.38424689 -0.41459944 0.278439752 -0.01419167
13 福建 -1.077652 -0.46565494 0.4875302 0.16515444 0.14027990 -0.316436643 -0.07381221
14 江西 0.527362 1.29710347 0.3464652 -0.17054810 -0.22114898 0.076110671 0.01699948
15 山东 4.770580 0.61728673 1.0270857 -0.34787078 0.21396830 -0.065610349 -0.09328715
16 河南 4.049670 -0.07207734 -0.1732655 -0.52129861 0.38658167 0.203415790 0.04829138
17 湖北 1.862018 0.31272534 0.1341054 -0.21897171 -0.59398630 -0.258499835 -0.02868519
18 湖南 0.503000 0.48817164 0.9677778 0.94672161 0.11604429 -0.224571843 -0.08705109
19 广东 1.818569 0.16884189 0.1008128 0.24875567 0.11485019 0.069973843 -0.00926963
20 广西 0.338820 1.85055716 0.2862296 0.09014668 -0.22736954 0.255088948 -0.04460838
21 海南 -2.750894 0.22608263 0.0454661 -0.39954238 -0.00993694 -0.059410368 -0.03427266
22 重庆 -0.628661 -0.88772794 -0.5940808 -0.10450225 -0.06405085 0.092947440 0.09545077
23 四川 2.181136 -0.24175065 -0.4090686 0.41866229 -0.24938816 0.090796994 -0.09391571
24 贵州 -0.304797 1.08921175 -0.3175461 0.51661253 -0.16962058 0.166790217 0.15655822
25 云南 -0.170936 -0.74929228 -0.1002991 0.56045970 0.14910267 0.006228440 0.04195803
26 西藏 -3.091577 3.26751381 -1.3939541 0.05508742 0.30015544 -0.222173606 -0.06625040
27 陕西 -0.908212 -0.19240249 0.2268532 0.36462016 -0.04504990 -0.175866164 0.02455321
28 甘肃 -1.793673 0.82109182 -0.2513493 0.10918441 -0.01922901 0.037035723 0.00660759
29 青海 -2.623050 0.00150422 -0.2510706 -0.09292648 -0.21900016 0.001762138 0.10736844
30 宁夏 -2.882470 0.31321061 0.0222172 -0.14500038 0.04669932 0.000181659 -0.01947260
31 新疆 -0.131422 -1.92264209 -0.9680548 -0.14270184 0.04668469 0.440474245 -0.18642580
从而得到各个主成分在每个样本点的得分。
3、绘制主成分得分的样本散点图
attach(df1)
plot(df1Comp.2\~df1Comp.1)
text(df1Comp.1,df1Comp.2+0.02,df1$df.areas,pos=3)
运行得到:
因为主成分只有两个,将样本主成分得分绘制出来,就可以得到样本的分布情况,从而可以对样本进行分类。
四、对样本数据进行因子分析
1、提取因子
在这里使用函数factanal()
fa<-factanal(df[,c('x1','x2','x3','x4','x5','x6','x7')],factors=2,rotation = 'varimax',scores = 'regression')
fa
在这里使用最大方差旋转方法
运行得到:
fa<-factanal(df[,c('x1','x2','x3','x4','x5','x6','x7')],factors=2,rotation = 'varimax',scores = 'regression')
> fa
Call:
factanal(x = df[, c("x1", "x2", "x3", "x4", "x5", "x6", "x7")], factors = 2, scores = "regression", rotation = "varimax")
Uniquenesses:
x1 x2 x3 x4 x5 x6 x7
0.170 0.133 0.009 0.007 0.059 0.005 0.677
Loadings:
Factor1 Factor2
x1 0.824 0.388
x2 0.915 0.176
x3 0.992
x4 0.994
x5 0.845 0.477
x6 0.635 0.770
x7 0.568
Factor1 Factor2
SS loadings 4.605 1.335
Proportion Var 0.658 0.191
Cumulative Var 0.658 0.849
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 15.17 on 8 degrees of freedom.
The p-value is 0.0559
前两个因子累计方差贡献率为84.9%,提取两个因子是合适的,跟进运行得到的因子载荷矩阵可以得到因子f1和f2与原始变量的关系:
x1=0.824f1+0.388f2
x2=0.915f1+0.176f2
x3=0.922f1
x4=0.944f1
x5=0.845f1+0.477f2
x6=0.635f1+0.770f2
x7=0.568f2
如果使用斜交旋转Promax的方法,将得到:
fa<-factanal(df[,c('x1','x2','x3','x4','x5','x6','x7')],factors=2,rotation = 'promax',scores = 'regression')
> fa
Call:
factanal(x = df[, c("x1", "x2", "x3", "x4", "x5", "x6", "x7")], factors = 2, scores = "regression", rotation = "promax")
Uniquenesses:
x1 x2 x3 x4 x5 x6 x7
0.170 0.133 0.009 0.007 0.059 0.005 0.677
Loadings:
Factor1 Factor2
x1 0.685 0.353
x2 0.878 0.101
x3 1.001
x4 1.072 -0.181
x5 0.665 0.452
x6 0.309 0.809
x7 -0.286 0.650
Factor1 Factor2
SS loadings 4.011 1.450
Proportion Var 0.573 0.207
Cumulative Var 0.573 0.780
Factor Correlations:
Factor1 Factor2
Factor1 1.000 -0.489
Factor2 -0.489 1.000
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is 15.17 on 8 degrees of freedom.
The p-value is 0.0559
斜交旋转的情况下,因子表达式为:
x1=0.568f1+0.353f2
x2=0.878f1+0.101f2
x3=1.001f1
x4=1.072f1-0.181f2
x5=0.665f1+0.452 f2
x6=0.309f1+0.809 f2
x7=-0.286f2+ 0.650f2
斜交旋转的情况下,更能说明因子的因子具体经济含有,从斜交旋转的载荷矩阵可以看到,变量x2,x3,x4在在第一个公共因子的载荷较大,因此,第一个公共因子主要反映企业的规模,变量x6,x7在第二个公共因子的载荷较大,说明第二个公共因子反映企业的盈利能力。
2、计算因子得分
fa$scores
df2<-data.frame(dfareas,fascores)
df2
df2
df.areas Factor1 Factor2
1 北京 -1.4143765 -0.1508311
2 天津 -1.3043410 -0.0889328
3 河北 3.5399545 -2.6706415
4 山西 0.4281325 -1.5294351
5 内蒙古 0.3150193 -0.8072694
6 辽宁 0.0608492 -0.1041206
7 吉林 1.4151234 -0.9917456
8 黑龙江 0.0320639 -0.5454919
9 上海 -1.5548788 0.0749349
10 江苏 0.6880203 1.2042261
11 浙江 0.1966863 0.6611251
12 安徽 -0.3187238 2.0231134
13 福建 -0.4136153 -0.2779676
14 江西 -0.8398720 1.6310805
15 山东 0.3714166 2.2721740
16 河南 1.3016494 0.5713237
17 湖北 0.4532688 0.7642714
18 湖南 -0.4557515 0.7670075
19 广东 0.5397438 0.3251183
20 广西 -1.0926484 1.9075718
21 海南 -1.2837594 0.0763441
22 重庆 0.6843438 -1.2564094
23 四川 1.4231037 -0.4835207
24 贵州 -0.2352031 0.2985555
25 云南 0.6242982 -1.0586728
26 西藏 -1.7056631 0.3287434
27 陕西 -0.2081481 -0.3079663
28 甘肃 -0.8373975 0.0894333
29 青海 -0.7299168 -0.4712487
30 宁夏 -1.2786471 -0.0284181
31 新疆 1.5992689 -2.2223515
3、跟据计算得出的因子得分,把样本点绘制的散点图上