- Show that the Average Mean Squared Error (AMSE) is
  $AMSE=\frac{1}{n}\sum_{i=1}^nE(u_i-g_i)^2=\frac{1}{n}\left({\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr(AA')\right)$
- Show that
  $E(RSS)=\sum_{i=1}^n E(y_i-u_i)^2={\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr\left((I-A)'(I-A)\right)$
- Based on the previous results, find an unbiased estimate of the AMSE.
- What is the matrix $A$ for the OLS estimator in linear regression with *p* explanatory variables? What is the unbiased estimate of the AMSE in this case?
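
As a numerical sanity check of the two identities above, the following R sketch simulates a linear estimator, assuming (as the formulas suggest) that ${\bf u}=A{\bf y}$ and ${\bf y}={\bf g}+\epsilon$ with uncorrelated errors of variance $\sigma^2$. The ridge-type matrix `A`, the mean vector `g`, the noise level and the sample size are illustrative choices, not part of the exercise, and the simulation is no substitute for the proof.

```r
set.seed(1)
n <- 50; sigma <- 1
x <- seq(0, 1, length.out = n)
g <- sin(2 * pi * x)                  # hypothetical true mean vector
X <- cbind(1, x, x^2)
lambda <- 0.5                         # arbitrary ridge penalty (assumption)
A <- X %*% solve(crossprod(X) + lambda * diag(ncol(X)), t(X))  # a linear smoother
I <- diag(n)

# Right-hand sides of the two identities
bias2 <- drop(t(g) %*% t(I - A) %*% (I - A) %*% g)
amse_theory <- (bias2 + sigma^2 * sum(diag(A %*% t(A)))) / n
erss_theory <- bias2 + sigma^2 * sum(diag(t(I - A) %*% (I - A)))

# Monte Carlo approximations of the left-hand sides
B <- 20000
sims <- replicate(B, {
  y <- g + rnorm(n, sd = sigma)
  u <- drop(A %*% y)
  c(mean((u - g)^2), sum((y - u)^2))
})
amse_mc <- mean(sims[1, ])
erss_mc <- mean(sims[2, ])

round(c(amse_theory = amse_theory, amse_mc = amse_mc,
        erss_theory = erss_theory, erss_mc = erss_mc), 3)
```

The simulated and theoretical values should agree up to Monte Carlo error; any other matrix `A` can be substituted to repeat the check.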

- Show that the generalized likelihood ratio test (GLRT) for testing $H_0: \beta_j=0$ is the $\chi^2_1$ test. What is the corresponding test statistic $T_j$?
- Suppose now that we want to check the significance of $x_j$ in the model by Mallows' $C_p=RSS/\sigma^2-(n-2p)$ criterion (or, equivalently, by AIC). Show that $C_{p-1}=T_j+C_p-2$.
- Using the result above, show that $x_j$ is not significant and may be dropped from the model (according to Mallows' $C_p$ criterion) iff $T_j<2$, and find the corresponding significance level.
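
The relation between $C_{p-1}$, $C_p$ and $T_j$ can be checked numerically before proving it. The R sketch below assumes that $\sigma^2$ is known and that $T_j$ is the drop in RSS divided by $\sigma^2$ when $x_j$ is removed; both are assumptions made for illustration, as are the simulated data and the variable names.

```r
set.seed(2)
n <- 40; sigma2 <- 1                       # sigma^2 treated as known (assumption)
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n, sd = sqrt(sigma2))

full <- lm(y ~ x1 + x2)                    # p = 3 coefficients
red  <- lm(y ~ x1)                         # x2 dropped, p - 1 = 2 coefficients

rss <- function(m) sum(resid(m)^2)
Cp  <- function(m) rss(m) / sigma2 - (n - 2 * length(coef(m)))

Tj <- (rss(red) - rss(full)) / sigma2      # assumed form of the test statistic
c(Cp_reduced = Cp(red), Tj_plus_Cp_minus_2 = Tj + Cp(full) - 2)

# Tail probability of a chi-squared(1) variable beyond the threshold 2
level <- pchisq(2, df = 1, lower.tail = FALSE)
level
```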

- Compare the sample variances within cells and comment on the results.
- Fit the full model *Type\*Treat*, first choosing an appropriate scale for the dependent variable. Calculate the sample variances within cells on the chosen scale and compare these results with those from the previous item.
- Produce the ANOVA table and fit the resulting model. Can changing the order in which terms are dropped in the ANOVA table influence the final model in this case? What happens in the general case?
- Estimate the survival time for a rat poisoned by the second type of poison and treated with the first antidote. Give a 95% prediction interval for the survival time of such a rat and a 95% confidence interval for the median survival time of all rats with such a "fate".
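
The steps above might be organised in R roughly as follows. This is only a sketch on simulated data: the data frame `rats`, the response name `Time` and the log scale are assumptions made for illustration (choosing the scale is part of the exercise), and only the factor names *Type* and *Treat* come from the text.

```r
set.seed(3)
# Hypothetical balanced design: 3 poisons x 4 antidotes, 4 rats per cell
rats <- expand.grid(rep = 1:4, Type = factor(1:3), Treat = factor(1:4))
rats$Time <- exp(rnorm(nrow(rats),
                       mean = -1 - 0.3 * as.integer(rats$Type) +
                              0.1 * as.integer(rats$Treat),
                       sd = 0.2))

# Within-cell sample variances on the raw and on the transformed scale
cell_var_raw <- with(rats, tapply(Time, list(Type, Treat), var))
cell_var_log <- with(rats, tapply(log(Time), list(Type, Treat), var))

# Full model with interaction on the assumed (log) scale
fit <- lm(log(Time) ~ Type * Treat, data = rats)
anova(fit)

# Second poison, first antidote
new <- data.frame(Type  = factor("2", levels = levels(rats$Type)),
                  Treat = factor("1", levels = levels(rats$Treat)))

# Back-transforming the endpoints gives intervals on the original scale;
# for a monotone transformation the confidence interval refers to the median
pred_int <- exp(predict(fit, new, interval = "prediction", level = 0.95))
med_int  <- exp(predict(fit, new, interval = "confidence", level = 0.95))
```

Under a different transformation, `exp` would be replaced by the corresponding inverse function.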

| Variable | Description |
|---|---|
| Price | Selling price of house in thousands of dollars |
| Bdr | Number of bedrooms |
| Flr | Floor space in sq. ft. |
| Fp | Number of fireplaces |
| Rms | Number of rooms |
| St | Storm windows (0 if absent, 1 if present) |
| Lot | Front footage of lot in feet |
| Tax | Annual taxes |
| Bth | Number of bathrooms |
| Con | Construction (0 if frame, 1 if brick) |
| Gar | Garage size (0 = no garage, 1 = one-car garage, etc.) |
| Cdn | Condition (1 = "needs work", 0 otherwise) |
| L1 | Location (L1 = 1 if property is in zone A, 0 otherwise) |
| L2 | Location (L2 = 1 if property is in zone B, 0 otherwise) |

- Fit the main effects model of Price on all explanatory variables (don't forget to define suitable factor variables first where necessary!). If the model doesn't seem appropriate to you, consider possible transformations.
- Try to add all pairwise interactions to the model. Are the results surprising?
- Choose the best models among all possible regressions w.r.t. AIC, BIC, RIC and the adjusted $R^2$. Compare the results.
- Simplify the main effects model (after possible transformations, see 4.1) w.r.t. AIC using various strategies:
  - backward elimination starting from the main effects model
  - forward selection starting from the null model without any predictors
  - a stepwise procedure that tries to drop non-significant main effects and to add significant interactions

- Apply LASSO to the data and comment on the results.
- Try to reduce dimensionality by principal component regression and partial least squares. Comment on the results. Are there any "conceptual" problems in using these methods for this data?
- Summarize all the results and choose the model that seems most adequate to you. Analyse the goodness-of-fit of the chosen model.
- Estimate the selling price of a house with 1000 square feet of floor area, 8 rooms, 4 bedrooms, 2 bathrooms, no fireplaces and no storm windows, 40-foot frontage, brick construction, a 2-car garage, that doesn't "need work", with $1000 annual taxes, in the L1 area. Give the corresponding 95% prediction interval and the 95% confidence interval for the median price of such houses in this area. Point out possible "conceptual" problems (if any) in deriving these intervals in this case.
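
Defining factor variables and predicting for a new house could look roughly like the R sketch below. It uses simulated data with only a few of the listed variables, and the data frame name `houses`, the merged factor `Zone` and the log-price model are assumptions made for illustration. Merging `L1` and `L2` into one three-level factor is one possible coding, not necessarily the intended one.

```r
set.seed(4)
n <- 80
zone <- sample(c("A", "B", "other"), n, replace = TRUE)
houses <- data.frame(
  Flr = round(runif(n, 600, 2000)),
  Bdr = sample(2:5, n, replace = TRUE),
  Con = rbinom(n, 1, 0.3),
  L1  = as.integer(zone == "A"),
  L2  = as.integer(zone == "B")
)
houses$Price <- exp(3 + 0.0005 * houses$Flr + 0.05 * houses$Bdr +
                    0.1 * houses$Con + 0.2 * houses$L1 +
                    rnorm(n, sd = 0.1))          # simulated prices

# Recode the 0/1 indicator as a factor; merge L1/L2 into a single factor
houses$Con  <- factor(houses$Con, levels = 0:1, labels = c("frame", "brick"))
houses$Zone <- factor(ifelse(houses$L1 == 1, "A",
                      ifelse(houses$L2 == 1, "B", "other")),
                      levels = c("other", "A", "B"))

fit <- lm(log(Price) ~ Flr + Bdr + Con + Zone, data = houses)

# New house: factor values must use the same levels as in the fitted data
new <- data.frame(Flr  = 1000, Bdr = 4,
                  Con  = factor("brick", levels = levels(houses$Con)),
                  Zone = factor("A", levels = levels(houses$Zone)))

pred_int <- exp(predict(fit, new, interval = "prediction", level = 0.95))
med_int  <- exp(predict(fit, new, interval = "confidence", level = 0.95))
```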

- To get an ANOVA table use the **anova** command.
- The **predict.lm** command can be used for prediction (see *help* for details).
- The command **tapply** applies a function (like *mean*, *var*, etc.) to each cell of a table (see *help* for details).
- The function **regsubsets(..., nbest=1,...)** from the package *leaps* allows one to perform all possible regressions and stepwise searches w.r.t. several criteria (read its *help* carefully for details and for its output).
- For various stepwise model selection strategies you can also use the function **step**. Read its *help* carefully for details in each specific case. By default, **step** performs *backward elimination* starting from your original model w.r.t. the AIC criterion. To run *forward selection* or a *stepwise procedure*, you should also define the maximal model you want in the *scope* parameter:
  - > step.lm <- step(model.lm, direction="forward", scope=list(upper=max.lm), scale=...) (use direction="both" for the stepwise procedure)
  - See *help* for more details and options.
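
The three strategies can be tried on any data set. The sketch below uses the built-in `mtcars` data and an arbitrary choice of predictors, both of which are illustrative assumptions unrelated to the exercise data. The maximal model is passed to *scope* as a formula, and `trace = 0` only suppresses the printed search path.

```r
null.lm <- lm(mpg ~ 1, data = mtcars)                      # no predictors
max.lm  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # maximal model

# Backward elimination from the maximal model (the default direction here)
back.lm <- step(max.lm, direction = "backward", trace = 0)

# Forward selection from the null model up to the maximal model
forw.lm <- step(null.lm, direction = "forward",
                scope = list(upper = formula(max.lm)), trace = 0)

# Stepwise search: terms may be added and dropped at each step
both.lm <- step(null.lm, direction = "both",
                scope = list(upper = formula(max.lm)), trace = 0)

# Compare the selected models
sapply(list(backward = back.lm, forward = forw.lm, both = both.lm), AIC)
lapply(list(backward = back.lm, forward = forw.lm, both = both.lm), formula)
```

The strategies need not end at the same model, which is why comparing them is worthwhile.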

- To perform LASSO you'll probably need the function **glmnet(..., alpha=1,...)** from the package *glmnet*. The LASSO tuning parameter λ can be chosen by CV using the function **cv.glmnet(...,alpha=1,...)**.
- To fit principal components regression and partial least squares you can use the functions **pcr** and **plsr** from the package *pls*. It's recommended to use the scaled data (*scale=T*). Read the *help* carefully for details.
- I wrote a short R function CrossVal(lm.object) that calculates CV, GCV and $R^2_{CV}$ for a given linear model (*lm.object*). It is free although kind donations will be appreciated (-:).
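
For readers who want to see what such quantities might look like, here is a minimal base-R sketch. It is not the author's `CrossVal` function: the name `loo_cv_summary` is hypothetical, and the definitions used (leave-one-out CV via the hat values, GCV with the trace of the hat matrix equal to the number of coefficients, and $R^2_{CV}$ as one minus PRESS over the total sum of squares) are assumptions that may differ from those in the course.

```r
# Hypothetical helper, NOT the author's CrossVal
loo_cv_summary <- function(lm.object) {
  e <- resid(lm.object)
  h <- hatvalues(lm.object)                 # diagonal of the hat matrix
  n <- length(e)
  p <- length(coef(lm.object))
  y <- model.response(model.frame(lm.object))
  press <- sum((e / (1 - h))^2)             # leave-one-out prediction errors
  c(CV   = press / n,
    GCV  = mean(e^2) / (1 - p / n)^2,
    R2cv = 1 - press / sum((y - mean(y))^2))
}

# Example on the built-in mtcars data (illustrative choice)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- loo_cv_summary(fit)
res
```

The hat-value shortcut avoids refitting the model n times and gives the same CV value as explicit leave-one-out refits for a linear model.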