Data: 31 cherry trees
V – volume, cubic feet
H – height, feet
D – diameter, inches (1 feet = 12 inches)
##
## Call:
## lm(formula = Volume ~ Diameter + Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4065 -2.6493 -0.2876 2.2003 8.4847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -57.9877 8.6382 -6.713 2.75e-07 ***
## Diameter 4.7082 0.2643 17.816 < 2e-16 ***
## Height 0.3393 0.1302 2.607 0.0145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared: 0.948, Adjusted R-squared: 0.9442
## F-statistic: 255 on 2 and 28 DF, p-value: < 2.2e-16
## CV= 18.1578 GCV= 16.6831
## Cross-validation R-squared= 0.9306
Interaction is very significant but still doesn’t improve much the residuals
Cube-root transformation makes clear sense – \(Volume^{1/3}\) is in feets
##
## Call:
## lm(formula = Volume^(1/3) ~ Diameter + Height, y = T)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.159602 -0.050200 -0.006827 0.069649 0.133981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.085388 0.184315 -0.463 0.647
## Diameter 0.151516 0.005639 26.871 < 2e-16 ***
## Height 0.014472 0.002777 5.211 1.56e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08283 on 28 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
## F-statistic: 612.5 on 2 and 28 DF, p-value: < 2.2e-16
## CV= 0.0075 GCV= 0.0076
## Cross-validation R-squared= 0.9731
Interaction is non-significant (\(p-value=0.95\))
Residuals are not “perfect”, more general model: \({\bf \sigma^2=\sigma^2(x)}\) (GLM)
trees.new<-predict(trees.fit,data.frame(Height=80,Diameter=15),
interval=c("confidence"))
print(trees.new)
## fit lwr upr
## 1 3.345092 3.307977 3.382207
trees.new<-predict(trees.fit,data.frame(Height=80,Diameter=15),
interval=c("prediction"))
print(trees.new)
## fit lwr upr
## 1 3.345092 3.171417 3.518768
\(\widehat{Volume}=3.3451^3=37.43\)
Confidence interval for \({\bf Med}(Volume)\) is \((3.308^3;3.382^3)=(36.20;38.69)\)
Prediction interval for \(Volume\) is \((3.171^3;3.519^3)=(31.90;43.57)\)
Another way to solve dimensionality problem is log-transforms of all variables.
##
## Call:
## lm(formula = log(Volume) ~ log(Diameter) + log(Height))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168561 -0.048488 0.002431 0.063637 0.129223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.63162 0.79979 -8.292 5.06e-09 ***
## log(Diameter) 1.98265 0.07501 26.432 < 2e-16 ***
## log(Height) 1.11712 0.20444 5.464 7.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared: 0.9777, Adjusted R-squared: 0.9761
## F-statistic: 613.2 on 2 and 28 DF, p-value: < 2.2e-16
## CV= 0.0071 GCV= 0.0073
## Cross-validation R-squared= 0.9737
Interaction is non-significant (\(p-value=0.7035\))
Residuals are not “perfect”, more general model: \({\bf \sigma^2=\sigma^2(x)}\) (GLM)
## fit lwr upr
## 1 3.632763 3.595401 3.670124
## fit lwr upr
## 1 3.632763 3.461916 3.80361
Prediction on the original scale: \(\widehat{Volume}=e^{3.633}=37.82\)
Confidence interval for \({\bf Med}(Volume)\) is \((e^{3.619};e^{3.646})=(37.30;38.32)\)
Prediction interval for \(Volume\) is \((e^{3.462};e^{3.801})=(31.88;44.88)\)
Note that \(\hat{\beta}_1 = 1.98265\;(0.07501) \approx 2;\;\;\;\hat{\beta}_2 =1.11712\;(0.20444) \approx 1\).
It makes perfect physical sense: \[ Volume=C \cdot Diameter^2 \cdot Height = C_1 \cdot \left(\frac{Diameter}{12}\right)^2\cdot Height, \] where \(C_1=\pi/12\) (cone) and \(C_1=\pi/4\) (cylinder).
Hence, \[ \log(Volume)= \log(C_1)+2\cdot \log(Diameter/12)+\log(Height) \]
treesf.fit <- lm(log(Volume) ~ 1+offset(2. * log(Diameter/12.) + log(Height)))
print(summary(treesf.fit, correlation = F))
##
## Call:
## lm(formula = log(Volume) ~ 1 + offset(2 * log(Diameter/12) +
## log(Height)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.168446 -0.047355 -0.003518 0.066308 0.136467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.19935 0.01421 -84.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0791 on 30 degrees of freedom
r2 <- cor(log(trees.dat[, "Volume"]), predict(treesf.fit))^2.
cat("Multiple R-squared:", round(r2, 4.), "\n")
## Multiple R-squared: 0.9774
\[ -1.340=\log(\pi/12) < \hat{\beta_0} < \log(\pi/4)=-0.242 \]
## CV= 0.0065 GCV= 0.0065
## Cross-validation R-squared= 0.9759
## Analysis of Variance Table
##
## Model 1: log(Volume) ~ 1 + offset(2 * log(Diameter/12) + log(Height))
## Model 2: log(Volume) ~ log(Diameter) + log(Height)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 0.18769
## 2 28 0.18546 2 0.0022224 0.1678 0.8464
\(\widehat{\log(Volume)}=3.629\); \(\widehat{Volume}=37.67\) but prediction intervals – from a “full” model
Model | \(R^2\) | CV | GCV | \(R^2_{CV}\) |
---|---|---|---|---|
Linear | 0.9480 | 18.578 | 16.6831 | 0.9306 |
Cube Root | 0.9777 | 0.0075 | 0.0076 | 0.9731 |
Log-Log | 0.9777 | 0.0071 | 0.0073 | 0.9737 |
Final | 0.9774 | 0.0065 | 0.0065 | 0.9759 |