Exercise 3

Question 1.

Consider the general regression model:
$y_i=g_i+\epsilon_i, \;\;\;\; i=1,\ldots,n,$
where the εi are i.i.d. random variables with zero mean and (known) variance σ². Let u be an arbitrary linear estimator of g, i.e. u=Ay for some square matrix A.
  1. Show that the Average Mean Squared Error (AMSE) is
    $AMSE=\frac{1}{n}\sum_{i=1}^nE(u_i-g_i)^2=\frac{1}{n}\left({\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr(AA')\right)$
  2. Show that
    $E(RSS)=\sum_{i=1}^n E(y_i-u_i)^2={\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr\left((I-A)'(I-A)\right)$
  3. Based on the previous results find the unbiased estimate for AMSE.
  4. What is the matrix A for the OLS estimator in linear regression with p explanatory variables? What is the unbiased estimate for AMSE in this case?
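  Hint (a sketch, not a full solution): write
    ${\bf u}-{\bf g}=(A-I){\bf g}+A{\boldsymbol \epsilon}$ and ${\bf y}-{\bf u}=(I-A){\bf g}+(I-A){\boldsymbol \epsilon}$.
  Since $E{\boldsymbol \epsilon}=0$ and $E{\boldsymbol \epsilon}{\boldsymbol \epsilon}'=\sigma^2 I$, the cross terms vanish and $E\|A{\boldsymbol \epsilon}\|^2=\sigma^2 tr(AA')$, which yields both identities in parts 1 and 2. Subtracting them suggests the candidate
    $\widehat{AMSE}=\frac{1}{n}\left(RSS+\sigma^2\left(tr(AA')-tr\left((I-A)'(I-A)\right)\right)\right)$
  for the unbiased estimate in part 3.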

Question 2.

Consider the linear regression model with p explanatory variables. Let σ²=Var(yi) be known.
  1. Show that the generalized likelihood ratio test (GLRT) for testing H0: βj=0 is the $\chi^2_1$ test. What is the corresponding test statistic Tj?
  2. Suppose now that we want to check the significance of xj in the model by the Mallows' $C_p=RSS/\sigma^2-(n-2p)$ criterion (or, equivalently, by AIC). Show that $C_{p-1}=C_p+T_j-2$, where $C_{p-1}$ is the criterion value for the model with xj removed.
  3. Using the result above, show that xj is not significant and may be dropped from the model (according to the Mallows' Cp criterion) iff Tj < 2, and find the corresponding significance level.
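  Hint (a sketch under the definitions above; here $RSS_p$ and $RSS_{p-1}$ denote the residual sums of squares with and without $x_j$): with $\sigma^2$ known, part 1 should give $T_j=(RSS_{p-1}-RSS_p)/\sigma^2$, so
    $C_{p-1}=\frac{RSS_{p-1}}{\sigma^2}-\left(n-2(p-1)\right)=C_p+T_j-2.$
  Hence dropping $x_j$ decreases the criterion iff $T_j<2$; for reference, $P(\chi^2_1>2)\approx 0.157$.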

Question 3.

The file Pois.dat contains the survival times of rats after poisoning with one of three types of poison, and treatment with one of four antidotes. The design is an orthogonal 3x4 factorial design with four observations per cell.
  1. Compare the sample variances within cells and comment on the results.
  2. Fit the full model Type*Treat, choosing first an appropriate scale for the dependent variable. Calculate the sample variances within cells on the chosen scale and compare these results with those from the previous part.
  3. Construct the ANOVA table and fit the resulting model. Can changing the order in which terms are dropped in the ANOVA table influence the final model in this case? What happens in the general case?
  4. Estimate the survival time for a rat poisoned by the second type of poison and treated with the first antidote. Give a 95% prediction interval for the survival time of such a rat and a 95% confidence interval for the median survival time of all rats with such a "fate".
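A possible R workflow for this question (a sketch only: it assumes Pois.dat loads into a data frame with columns Time, Type and Treat, and that the reciprocal is the appropriate scale — both assumptions must be checked against the actual data):

```r
# Sketch: assumes Pois.dat has columns Time (survival time),
# Type (poison, 3 levels) and Treat (antidote, 4 levels);
# adjust the names to the actual file layout.
pois <- read.table("Pois.dat", header = TRUE)
pois$Type  <- factor(pois$Type)
pois$Treat <- factor(pois$Treat)

# Part 1: within-cell sample variances on the original scale
with(pois, tapply(Time, list(Type, Treat), var))

# Part 2: the reciprocal (death-rate) scale is a natural candidate,
# but verify that it stabilises the within-cell variances yourself
with(pois, tapply(1/Time, list(Type, Treat), var))

full.lm <- lm(1/Time ~ Type * Treat, data = pois)
anova(full.lm)   # Part 3: ANOVA table for the full model

# Part 4: prediction for poison type 2, antidote 1; the intervals are
# on the chosen (reciprocal) scale and must be transformed back
new.rat <- data.frame(Type = factor(2, levels = levels(pois$Type)),
                      Treat = factor(1, levels = levels(pois$Treat)))
predict(full.lm, new.rat, interval = "prediction", level = 0.95)
predict(full.lm, new.rat, interval = "confidence", level = 0.95)
```
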

Question 4.

The prostate cancer data in the file Prostate.dat come from a study that examined the correlation between the level of prostate specific antigen (PSA) and the following clinical measurements in 97 men who were about to receive a radical prostatectomy:
lcavol - log(cancer volume)
lweight - log(prostate weight)
age - age
lbph - log(benign prostatic hyperplasia amount)
svi - seminal vesicle invasion
lcp - log(capsular penetration)
gleason - Gleason score
pgg45 - percentage Gleason scores 4 or 5

The goal is to predict the log of PSA (lpsa) from these measurements.

  1. Analyse the data to get some first impression. Make some preliminary comments.
  2. Check for the presence of multicollinearity among the explanatory variables. What methodological and computational problems might it cause?
  3. Split the data randomly (why?) into training and test sets of 75 and 22 patients respectively. Put the test set aside for now and work with the training set:
    1. Start from the main effects model, verify its adequacy.
    2. Select the `best' model by adding/removing variables and their interactions w.r.t. several model selection criteria. Compare the resulting models (also with the main effects model) and comment on the results.
  4. Apply LASSO to the data and comment on the results.
  5. Try to reduce dimensionality by principal component regression and partial least squares. Comment on the results. Are there any "conceptual" problems in using these methods for these data?
  6. How would you test and compare the goodness-of-fit of the different models from the previous parts on the test set? Apply your ideas and comment on the results.
  7. Split the initial data randomly again into training and test sets of the same sizes and repeat steps 3-5. Compare and comment on the results for the two different splits. Are they surprising? (Explain; answers like "Nothing can surprise me in this world anymore" won't be accepted!)
  8. Make final conclusions and point out the measurements relevant for predicting the prostate specific antigen level.
Computational Notes for R users:
  • To get an ANOVA table, use the anova command.
  • The predict.lm command can be used for prediction (see its help for details).
  • The command tapply applies a function (like mean, var, etc.) to each cell of a table (see its help for details).
  • The function regsubsets(..., nbest=1, ...) from the package leaps allows one to perform all possible regressions and stepwise searches w.r.t. several criteria (read its help carefully for details and for its output).
  • For various stepwise model selection strategies you can also use the function step. Read its help carefully for details in each specific case. By default, step performs backward elimination starting from your original model w.r.t. the AIC criterion. To run forward selection or a stepwise procedure, you should also define the maximal model you want in the scope parameter:
    >step.lm <- step(model.lm, direction="forward", scope=list(upper=max.lm), scale=...)   # or direction="both"
    See help for more details and options.
  • To perform LASSO you'll probably need the function glmnet(..., alpha=1, ...) from the package glmnet. The LASSO tuning parameter λ can be chosen by CV using the function cv.glmnet(..., alpha=1, ...).
  • To fit principal components regression and partial least squares you can use the functions pcr and plsr from the package pls. It's recommended to use the scaled data (scale=TRUE). Read their help carefully for details.
  • I wrote a short R function CrossVal(lm.object) that calculates CV, GCV and R2CV for a given linear model (lm.object). It is free although kind donations will be appreciated (-:).
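To fix ideas, the notes above might be combined for Question 4 roughly as follows (a sketch: it assumes Prostate.dat reads into a data frame containing the response lpsa and the eight predictors listed above; check the actual file layout first):

```r
library(leaps)    # regsubsets: all-subsets and stepwise searches
library(glmnet)   # glmnet, cv.glmnet: LASSO with CV-chosen lambda

# Sketch: assumes Prostate.dat has the response lpsa and the
# eight predictors listed above as columns
pros <- read.table("Prostate.dat", header = TRUE)

set.seed(1)                        # make the random split reproducible
train <- sample(nrow(pros), 75)    # 75 training, 22 test patients

# All-subsets search (one best model of each size) on the training set
subs <- regsubsets(lpsa ~ ., data = pros[train, ], nbest = 1)
summary(subs)

# LASSO with the tuning parameter lambda chosen by cross-validation
x <- model.matrix(lpsa ~ ., data = pros[train, ])[, -1]  # drop intercept column
y <- pros$lpsa[train]
cv.fit <- cv.glmnet(x, y, alpha = 1)
coef(cv.fit, s = "lambda.min")     # coefficients at the CV-chosen lambda
```
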