Exercise 3

Question 1.

Consider the general regression model:
$y_i=g_i+\epsilon_i, \;\;\;\; i=1,\ldots,n,$
where $\epsilon_i$ are i.i.d. random variables with zero mean and known variance $\sigma^2$. Let ${\bf u}$ be an arbitrary linear estimator of ${\bf g}$, i.e. ${\bf u}=A{\bf y}$ for some square matrix $A$.
  1. Show that the Average Mean Squared Error (AMSE) is
    $AMSE=\frac{1}{n}\sum_{i=1}^nE(u_i-g_i)^2=\frac{1}{n}\left({\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr(AA')\right)$
  2. Show that
    $E(RSS)=\sum_{i=1}^n E(y_i-u_i)^2={\bf g}'(I-A)'(I-A){\bf g}+\sigma^2 tr\left((I-A)'(I-A)\right)$
  3. Based on the previous results find the unbiased estimate for AMSE.
  4. What is the matrix A for the OLS estimator in linear regression with p explanatory variables? What is the unbiased estimate for AMSE in this case?
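The identities in parts 1-3 can be sanity-checked numerically. The sketch below (the simulated design and all names are illustrative, not part of the exercise data) verifies by Monte Carlo that, for the OLS hat matrix, the candidate estimate that follows from comparing the two expectations above is indeed unbiased for the AMSE:

```r
# Monte Carlo sanity check (a sketch, not a proof) of parts 1-3 for OLS:
# with A = X (X'X)^{-1} X' we have tr(AA') = p and tr((I-A)'(I-A)) = n - p,
# so (RSS + (2p - n) * sigma^2) / n should be unbiased for AMSE.
set.seed(1)
n <- 50; p <- 3; sigma <- 2
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
g <- as.vector(X %*% c(1, -2, 0.5))        # true mean vector (lies in col(X))
A <- X %*% solve(crossprod(X)) %*% t(X)    # OLS hat matrix

R <- 2000
est <- ase <- numeric(R)
for (r in 1:R) {
  y   <- g + rnorm(n, sd = sigma)
  u   <- as.vector(A %*% y)                # fitted values u = A y
  RSS <- sum((y - u)^2)
  est[r] <- (RSS + (2 * p - n) * sigma^2) / n  # candidate unbiased estimate
  ase[r] <- mean((u - g)^2)                    # realised average squared error
}
# both averages should be close to the true AMSE = sigma^2 * p / n
c(mean(est), mean(ase), sigma^2 * p / n)
```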

Question 2.

Consider the linear regression model with p explanatory variables. Let σ2=Var(yi) be known.
  1. Show that the generalized likelihood ratio test (GLRT) for testing H0: βj=0 is the $\chi^2_1$ test. What is the corresponding test statistic Tj?
  2. Suppose now that we want to check the significance of xj in the model by Mallows' $C_p=RSS/\sigma^2-(n-2p)$ criterion (or, equivalently, by AIC). Show that $C_{p-1}=T_j+C_p-2$.
  3. Using the result above, show that xj is not significant and may be dropped from the model (according to Mallows' $C_p$ criterion) iff $T_j<2$, and find the corresponding significance level.
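The significance level asked for in part 3 can be evaluated directly in R; the snippet below simply computes the tail probability $P(\chi^2_1>2)$ implied by the dropping rule:

```r
# The implicit significance level of the Cp / AIC dropping rule for a
# single coefficient: x_j is dropped iff T_j < 2, and T_j ~ chi^2_1
# under H0: beta_j = 0, so the implied level is the chi^2_1 tail at 2.
alpha <- pchisq(2, df = 1, lower.tail = FALSE)
alpha   # approximately 0.157 -- far more liberal than the usual 0.05
```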

Question 3.

The file Pois.dat contains the survival times of rats after poisoning with one of three types of poison, and treatment with one of four antidotes. The design is an orthogonal 3x4 factorial design with four observations per cell.
  1. Compare the sample variances within cells and comment on the results.
  2. Fit the full model Type*Treat, first choosing an appropriate scale for the dependent variable. Calculate the sample variances within cells on the chosen scale and compare these results with those from the previous part.
  3. Construct the ANOVA table and fit the resulting model. Can changing the order in which terms are dropped in the ANOVA table influence the final model in this case? What happens in the general case?
  4. Estimate the survival time for a rat poisoned by the second type of poison and treated with the first antidote. Give a 95% prediction interval for the survival time of such a rat and a 95% confidence interval for the median survival time of all rats with such a "fate".
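A possible R workflow for this question is sketched below. Since Pois.dat is not included here, simulated data stand in for the real file, and the column names Time, Type and Treat are assumptions about it:

```r
# Sketch of the Question 3 workflow on simulated stand-in data; the real
# Pois.dat file and its column names may differ.
set.seed(2)
pois <- expand.grid(rep = 1:4, Type = factor(1:3), Treat = factor(1:4))
pois$Time <- 1 / rgamma(nrow(pois), shape = 8,
                        rate = 8 * (2 + as.numeric(pois$Type)))  # skewed times

# 1. sample variances within cells (on the raw scale they vary with the mean)
with(pois, tapply(Time, list(Type, Treat), var))

# 2. a transformed scale, e.g. the reciprocal ("death rate"), often
#    stabilises the within-cell variances
pois$Rate <- 1 / pois$Time
with(pois, tapply(Rate, list(Type, Treat), var))

# 3. ANOVA table for the full factorial model on the chosen scale
fit <- aov(Rate ~ Type * Treat, data = pois)
anova(fit)

# 4. prediction for poison type 2, antidote 1; inverting back to the time
#    scale swaps the order of the interval endpoints
new <- data.frame(Type = factor(2, levels = 1:3),
                  Treat = factor(1, levels = 1:4))
1 / predict(fit, new, interval = "prediction")
```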

Question 4.

The file Prices.dat contains the following data on selling prices of houses in one of Chicago's areas:
Price - Selling price of house in thousands of dollars
Bdr - Number of bedrooms
Flr - Floor space in sq. ft.
Fp - Number of fireplaces
Rms - Number of rooms
St - Storm windows (0 if absent, 1 if present)
Lot - Front footage of lot in feet
Tax - Annual taxes
Bth - Number of bathrooms
Con - Construction (0 if frame, 1 if brick)
Gar - Garage size (0 = no garage, 1 = one-car garage, etc.)
Cdn - Condition (1 = "needs work", 0 otherwise)
L1 - Location (L1 = 1 if property is in zone A, 0 otherwise)
L2 - Location (L2 = 1 if property is in zone B, 0 otherwise)
  1. Fit the main effects model of Price on all explanatory variables (don't forget first to define suitable factor variables where necessary!). If the model doesn't seem appropriate to you, take care of possible transformations.
  2. Try to add all pairwise interactions to the model. Are the results surprising?
  3. Choose the best models among all possible regressions w.r.t. AIC, BIC, RIC and the adjusted R2. Compare the results.
  4. Simplify the main effects model (after possible transformations -- see 4.1) w.r.t. AIC using various strategies:
    1. backward elimination starting from the main effects model
    2. forward selection starting from the null model without any predictors
    3. try to drop non-significant main effects and to add significant interactions by a stepwise procedure
    Compare the resulting AIC-based models from 4.3 and 4.4.1-4.4.3 by CV, GCV, AIC, and the multiple and cross-validation correlation coefficients. Can you provide any statistical inference (test) to choose among these models?
  5. Apply LASSO to the data and comment on the results.
  6. Try to reduce dimensionality by principal component regression and partial least squares. Comment on the results. Are there any "conceptual" problems in using these methods for this data?
  7. Summarize all the results and choose the model that seems to be the most adequate to you. Analyse the goodness-of-fit of the chosen model.
  8. Estimate the selling price of a house with 1000 square feet of floor area, 8 rooms, 4 bedrooms, 2 bathrooms, no fireplaces or storm windows, 40-foot frontage, brick construction, a 2-car garage, that doesn't "need work", with $1000 annual taxes, in the L1 area. Give the corresponding 95% prediction interval and the 95% confidence interval for the median price of such houses in this area. Point out possible "conceptual" problems (if any) in deriving these intervals in this case.
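The interval mechanics asked for in part 8 can be illustrated with predict.lm. Since Prices.dat is not included here, the sketch below uses the built-in mtcars data as a stand-in:

```r
# Prediction vs. confidence intervals from predict.lm, shown on mtcars
# as a stand-in for the Prices data; variable names here are illustrative.
fit <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = 3.0, hp = 150)

# 95% prediction interval: for the response of a single new observation
predict(fit, new, interval = "prediction", level = 0.95)

# 95% confidence interval: for the mean response at these covariates; note
# that if the response is log-transformed, exponentiating this interval
# yields an interval for the median (not the mean) on the original scale --
# which is why the question asks about the median price
predict(fit, new, interval = "confidence", level = 0.95)
```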
Computational Notes for R users:
  • To get an ANOVA table, use the anova command.
  • The predict.lm command can be used for prediction (see help for details).
  • The command tapply applies a function (like mean, var, etc.) to each cell of a table (see help for details).
  • The function regsubsets(..., nbest=1,...) from the package leaps allows one to perform all possible regressions and stepwise searches w.r.t. several criteria (read its help carefully for details on its output).
  • For various stepwise model selection strategies you can also use the function step. Read carefully its help for details in each specific case. By default, step performs backward elimination starting from your original model w.r.t. AIC criterion. To run forward selection or stepwise procedure, you should also define the maximal model you want in the scope parameter:
    >step.lm <- step(model.lm, direction="forward", scope=list(upper=max.lm), scale=...)   # or direction="both"
    See help for more details and options.
  • To perform LASSO you'll probably need the function glmnet(..., alpha=1,...) from the package glmnet. The LASSO tuning parameter λ can be chosen by CV using the function cv.glmnet(...,alpha=1,...).
  • To fit principal components regression and partial least squares you can use the functions pcr and plsr from the package pls. It's recommended to use the scaled data (scale=T). Read carefully help for details.
  • I wrote a short R function CrossVal(lm.object) that calculates CV, GCV and R2CV for a given linear model (lm.object). It is free although kind donations will be appreciated (-:).
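The CrossVal function itself is not reproduced here; a minimal sketch of what such a helper might compute (leave-one-out CV via the hat-matrix identity, GCV, and the cross-validated correlation) could look like the following. The name CrossValSketch and the exact output format are hypothetical and may differ from the real function:

```r
# Hypothetical sketch of a CrossVal-style helper.  It uses the exact
# leave-one-out identity e_i / (1 - h_i) for linear models, so the model
# never has to be refitted n times.
CrossValSketch <- function(lm.object) {
  y <- model.response(model.frame(lm.object))
  e <- residuals(lm.object)
  h <- hatvalues(lm.object)           # diagonal of the hat matrix
  n <- length(y)
  p <- length(coef(lm.object))
  loo <- e / (1 - h)                  # leave-one-out prediction errors
  list(CV   = mean(loo^2),            # leave-one-out cross-validation score
       GCV  = mean(e^2) / (1 - p / n)^2,   # generalized cross-validation
       R2CV = cor(y, y - loo)^2)      # squared corr. of y with LOO predictions
}

CrossValSketch(lm(mpg ~ wt + hp, data = mtcars))
```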