Trees example

Data: 31 cherry trees

V – volume, cubic feet

H – height, feet

D – diameter, inches (1 feet = 12 inches)

Linear model

## 
## Call:
## lm(formula = Volume ~ Diameter + Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Diameter      4.7082     0.2643  17.816  < 2e-16 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
## CV= 18.1578 GCV= 16.6831 
## Cross-validation R-squared= 0.9306

Interaction is very significant but still doesn’t improve much the residuals

Cube-root transformation makes clear sense – \(Volume^{1/3}\) is in feets

Cube-root model

## 
## Call:
## lm(formula = Volume^(1/3) ~ Diameter + Height, y = T)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.159602 -0.050200 -0.006827  0.069649  0.133981 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.085388   0.184315  -0.463    0.647    
## Diameter     0.151516   0.005639  26.871  < 2e-16 ***
## Height       0.014472   0.002777   5.211 1.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08283 on 28 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9761 
## F-statistic: 612.5 on 2 and 28 DF,  p-value: < 2.2e-16
## CV= 0.0075 GCV= 0.0076 
## Cross-validation R-squared= 0.9731

Interaction is non-significant (\(p-value=0.95\))

Residuals are not “perfect”, more general model: \({\bf \sigma^2=\sigma^2(x)}\) (GLM)

Prediction and forecasting on the cube-root scale

trees.new<-predict(trees.fit,data.frame(Height=80,Diameter=15),
interval=c("confidence"))
print(trees.new)
##        fit      lwr      upr
## 1 3.345092 3.307977 3.382207
trees.new<-predict(trees.fit,data.frame(Height=80,Diameter=15),
interval=c("prediction"))
print(trees.new)
##        fit      lwr      upr
## 1 3.345092 3.171417 3.518768

Prediction and forecasting on the original scale:

\(\widehat{Volume}=3.3451^3=37.43\)

Confidence interval for \({\bf Med}(Volume)\) is \((3.308^3;3.382^3)=(36.20;38.69)\)

Prediction interval for \(Volume\) is \((3.171^3;3.519^3)=(31.90;43.57)\)


Another way to solve dimensionality problem is log-transforms of all variables.

Log-log Model

## 
## Call:
## lm(formula = log(Volume) ~ log(Diameter) + log(Height))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168561 -0.048488  0.002431  0.063637  0.129223 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.63162    0.79979  -8.292 5.06e-09 ***
## log(Diameter)  1.98265    0.07501  26.432  < 2e-16 ***
## log(Height)    1.11712    0.20444   5.464 7.81e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08139 on 28 degrees of freedom
## Multiple R-squared:  0.9777, Adjusted R-squared:  0.9761 
## F-statistic: 613.2 on 2 and 28 DF,  p-value: < 2.2e-16
## CV= 0.0071 GCV= 0.0073 
## Cross-validation R-squared= 0.9737

Interaction is non-significant (\(p-value=0.7035\))

Residuals are not “perfect”, more general model: \({\bf \sigma^2=\sigma^2(x)}\) (GLM)

Prediction and forecasting on the log-scale

##        fit      lwr      upr
## 1 3.632763 3.595401 3.670124
##        fit      lwr     upr
## 1 3.632763 3.461916 3.80361

Prediction on the original scale: \(\widehat{Volume}=e^{3.633}=37.82\)

Confidence interval for \({\bf Med}(Volume)\) is \((e^{3.619};e^{3.646})=(37.30;38.32)\)

Prediction interval for \(Volume\) is \((e^{3.462};e^{3.801})=(31.88;44.88)\)


Note that \(\hat{\beta}_1 = 1.98265\;(0.07501) \approx 2;\;\;\;\hat{\beta}_2 =1.11712\;(0.20444) \approx 1\).

It makes perfect physical sense: \[ Volume=C \cdot Diameter^2 \cdot Height = C_1 \cdot \left(\frac{Diameter}{12}\right)^2\cdot Height, \] where \(C_1=\pi/12\) (cone) and \(C_1=\pi/4\) (cylinder).

Hence, \[ \log(Volume)= \log(C_1)+2\cdot \log(Diameter/12)+\log(Height) \]

Final model

treesf.fit <- lm(log(Volume) ~ 1+offset(2. * log(Diameter/12.) + log(Height)))
print(summary(treesf.fit, correlation = F))
## 
## Call:
## lm(formula = log(Volume) ~ 1 + offset(2 * log(Diameter/12) + 
##     log(Height)))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168446 -0.047355 -0.003518  0.066308  0.136467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.19935    0.01421  -84.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0791 on 30 degrees of freedom
r2 <- cor(log(trees.dat[, "Volume"]), predict(treesf.fit))^2.
cat("Multiple R-squared:", round(r2, 4.), "\n")
## Multiple R-squared: 0.9774

\[ -1.340=\log(\pi/12) < \hat{\beta_0} < \log(\pi/4)=-0.242 \]

## CV= 0.0065 GCV= 0.0065 
## Cross-validation R-squared= 0.9759
## Analysis of Variance Table
## 
## Model 1: log(Volume) ~ 1 + offset(2 * log(Diameter/12) + log(Height))
## Model 2: log(Volume) ~ log(Diameter) + log(Height)
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     30 0.18769                           
## 2     28 0.18546  2 0.0022224 0.1678 0.8464

\(\widehat{\log(Volume)}=3.629\); \(\widehat{Volume}=37.67\) but prediction intervals – from a “full” model


Summary

Model \(R^2\) CV GCV \(R^2_{CV}\)
Linear 0.9480 18.578 16.6831 0.9306
Cube Root 0.9777 0.0075 0.0076 0.9731
Log-Log 0.9777 0.0071 0.0073 0.9737
Final 0.9774 0.0065 0.0065 0.9759