ale,28.785,0,no,northeast,3385.39915
19,female,28.3,0,yes,southwest,17081.08
52,female,37.4,0,no,southwest,9634.538
32,female,17.765,2,yes,northwest,32734.1863
38,male,34.7,2,no,southwest,6082.405
59,female,26.505,0,no,northeast,12815.44495
61,female,22.04,0,no,northeast,13616.3586
53,female,35.9,2,no,southwest,11163.568
19,male,25.555,0,no,northwest,1632.56445
20,female,28.785,0,no,northeast,2457.21115
22,female,28.05,0,no,southeast,2155.6815
19,male,34.1,0,no,southwest,1261.442
22,male,25.175,0,no,northwest,2045.68525
54,female,31.9,3,no,southeast,27322.73386
22,female,36,0,no,southwest,2166.732
34,male,22.42,2,no,northeast,27375.90478
26,male,32.49,1,no,northeast,3490.5491
34,male,25.3,2,yes,southeast,18972.495
29,male,29.735,2,no,northwest,18157.876 ......
Execution walkthrough:
> insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE) # read the data
> str(insurance) # inspect the structure of the data.frame
'data.frame': 1338 obs. of 7 variables:
 $ age     : int 19 18 28 33 32 31 46 37 37 60 ...
 $ sex     : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
 $ bmi     : num 27.9 33.8 33 22.7 28.9 ...
 $ children: int 0 1 3 0 0 0 1 3 2 0 ...
 $ smoker  : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
 $ region  : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
 $ charges : num 16885 1726 4449 21984 3867 ...
> library("psych") # load the psych package (lm() itself comes from base R's stats package and does not need it)
> ins_model <- lm(charges ~ age + children + bmi + sex + smoker + region, data=insurance) # fit a linear regression on the dataset with lm()
> summary(ins_model) # view the summary of the fitted model
Call:
lm(formula = charges ~ age + children + bmi + sex + smoker +
    region, data = insurance)

Residuals:
     Min       1Q   Median       3Q      Max
-11304.9  -2848.1   -982.1   1393.9  29992.8

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***  # more asterisks mean a smaller p-value, i.e. a more significant feature
children           475.5      137.8   3.451 0.000577 ***
bmi                339.2       28.6  11.860  < 2e-16 ***
sexmale           -131.3      332.9  -0.394 0.693348
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
regionsouthwest   -960.0      477.9  -2.009 0.044765 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
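To make the coefficient table easier to read, here is a minimal sketch that hard-codes the rounded estimates printed above and combines them by hand for one hypothetical customer: a 70-year-old female smoker from the northeast with 4 children and a BMI of 31.5. These values are chosen only for illustration. `female` and `northeast` are the reference levels of their factors, so their dummy variables are 0. The variable names `coefs`, `x` and `predicted` are assumptions of this sketch, not part of the original session.

```r
# Rounded estimates copied from the summary(ins_model) output above.
coefs <- c(
  intercept       = -11938.5,
  age             =    256.9,
  children        =    475.5,
  bmi             =    339.2,
  sexmale         =   -131.3,
  smokeryes       =  23848.5,
  regionnorthwest =   -353.0,
  regionsoutheast =  -1035.0,
  regionsouthwest =   -960.0
)

# Hypothetical customer, encoded the way lm() encodes factors:
# one 0/1 dummy per non-reference level.
x <- c(
  intercept       = 1,
  age             = 70,
  children        = 4,
  bmi             = 31.5,
  sexmale         = 0,   # female is the reference level
  smokeryes       = 1,
  regionnorthwest = 0,   # northeast is the reference level
  regionsoutheast = 0,
  regionsouthwest = 0
)

# The fitted value is the sum of coefficient * predictor.
predicted <- sum(coefs * x[names(coefs)])
print(predicted)
```

Because the estimates are rounded to one decimal place, the result only approximates what `predict()` would return for the full model.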
> lmstep <- step(ins_model) # stepwise selection by AIC; drops terms whose removal lowers the AIC
Start:  AIC=23316.43
charges ~ age + children + bmi + sex + smoker + region

           Df  Sum of Sq        RSS   AIC
- sex       1 5.7164e+06 4.8845e+10 23315  # removing sex gives the lowest AIC, so sex is dropped
<none>                   4.8840e+10 23316
- region    3 2.3343e+08 4.9073e+10 23317
- children  1 4.3755e+08 4.9277e+10 23326
- bmi       1 5.1692e+09 5.4009e+10 23449
- age       1 1.7124e+10 6.5964e+10 23717
- smoker    1 1.2245e+11 1.7129e+11 24993

Step:  AIC=23314.58  # candidate models are compared by AIC; the lowest value wins
charges ~ age + children + bmi + smoker + region

           Df  Sum of Sq        RSS   AIC
<none>                   4.8845e+10 23315
- region    3 2.3320e+08 4.9078e+10 23315
- children  1 4.3596e+08 4.9281e+10 23325
- bmi       1 5.1645e+09 5.4010e+10 23447
- age       1 1.7151e+10 6.5996e+10 23715
- smoker    1 1.2301e+11 1.7186e+11 24996
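The AIC values in the step() output can be roughly reproduced from the printed RSS. For a linear model, step() scores each candidate as n * log(RSS / n) + 2 * k (up to an additive constant), where n is the number of observations and k the number of estimated coefficients. The sketch below applies that formula to the numbers shown above. The helper name `aic_from_rss` is an assumption of this sketch, and the values of k (9 for the full model, 8 after dropping sex) are counted from the coefficient table.

```r
# AIC as used by step() for lm fits, up to an additive constant:
# n * log(RSS / n) + 2 * k
aic_from_rss <- function(rss, n, k) {
  n * log(rss / n) + 2 * k
}

n <- 1338  # observations, from the str(insurance) output

# Full model: intercept + 8 slopes = 9 coefficients.
aic_full <- aic_from_rss(4.8840e10, n, 9)

# After dropping sex: 8 coefficients.
aic_reduced <- aic_from_rss(4.8845e10, n, 8)

print(c(full = aic_full, reduced = aic_reduced))
```

The results should land close to the reported 23316.43 and 23314.58. Small differences are expected because the RSS values in the table are printed to only five significant digits. Dropping sex barely changes the RSS but saves one parameter, which is why the AIC goes down.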
> predict.lm(lmstep,data.frame(age=70,children=4,bmi=31.5,smoker='yes',region='northeast' |