Exercise 3

Question 1.

Consider the general regression model:
y_i = g_i + ε_i, i = 1,...,n
where the ε_i are i.i.d. random variables with zero mean and known variance σ^2. Let u be an arbitrary linear estimator of g, i.e. u = Ay for some square matrix A.
  1. Show that the Average Mean Squared Error (AMSE) is

    AMSE = (1/n) Σ_i E(u_i - g_i)^2 = (1/n) (g'(I-A)'(I-A)g + σ^2 tr(AA'))
  2. Show that

    E(RSS) = Σ_i E(u_i - y_i)^2 = g'(I-A)'(I-A)g + σ^2 tr((I-A)(I-A)')
  3. Based on the previous results, find an unbiased estimate of the AMSE.
  4. What is the matrix A for the OLS estimator in linear regression with p explanatory variables? What is the unbiased estimate of the AMSE in this case?
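
The identities in parts 1 and 2 can be sanity-checked numerically before attempting the proofs. The following is a minimal sketch in plain Python (standard library only) and is not part of the original exercise: the function name check_identities, the randomly drawn g and A, the Gaussian noise, and the default values n = 5 and σ = 1.5 are all illustrative assumptions. It compares the closed-form expressions with Monte Carlo averages over simulated samples.

```python
import random


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sq_norm(v):
    """Squared Euclidean norm v'v."""
    return sum(x * x for x in v)


def trace_of_gram(M):
    """tr(M M'), which equals the sum of squared entries of M."""
    return sum(x * x for row in M for x in row)


def check_identities(n=5, sigma=1.5, reps=20000, seed=0):
    rng = random.Random(seed)
    # Arbitrary (assumed) true mean vector g and square matrix A.
    g = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    A = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    I_minus_A = [[(1.0 if i == j else 0.0) - A[i][j] for j in range(n)]
                 for i in range(n)]

    bias_sq = sq_norm(matvec(I_minus_A, g))  # g'(I-A)'(I-A)g
    amse_formula = (bias_sq + sigma ** 2 * trace_of_gram(A)) / n
    erss_formula = bias_sq + sigma ** 2 * trace_of_gram(I_minus_A)

    amse_sim = 0.0
    rss_sim = 0.0
    for _ in range(reps):
        y = [gi + rng.gauss(0.0, sigma) for gi in g]  # y = g + eps
        u = matvec(A, y)                              # u = Ay
        amse_sim += sq_norm([ui - gi for ui, gi in zip(u, g)]) / n
        rss_sim += sq_norm([ui - yi for ui, yi in zip(u, y)])
    return amse_formula, amse_sim / reps, erss_formula, rss_sim / reps


amse_f, amse_s, erss_f, erss_s = check_identities()
print("AMSE   formula vs simulation:", amse_f, amse_s)
print("E(RSS) formula vs simulation:", erss_f, erss_s)
```

The simulated averages should agree with the formulas up to Monte Carlo error. Agreement for one choice of g and A is only a plausibility check, not a proof.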

Question 2.

Consider the linear regression model with p explanatory variables. Let σ^2 = Var(y_i) be known.
  1. Show that the likelihood ratio test (LRT) for testing H_0: β_j = 0 is a χ^2 test with 1 degree of freedom (χ^2_1). What is the corresponding test statistic T_j?
  2. Suppose now that we want to check the significance of x_j in the model by Mallows' criterion C_p = RSS/σ^2 - (n - 2p) (or, equivalently, by AIC). Show that C_{p-1} = T_j + C_p - 2, where C_{p-1} is the criterion for the model without x_j.
  3. Using the result above, show that x_j is not significant and may be dropped from the model (according to Mallows' C_p criterion) iff T_j < 2, and find the corresponding significance level.
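
For part 3, a tail probability of the χ^2_1 distribution is needed. The following small helper is a sketch in plain Python and is not part of the original exercise (the function name chi2_1_tail is an assumption). It uses the fact that a χ^2_1 variable is the square of a standard normal one, so that P(X > t) = P(|Z| > sqrt(t)) = erfc(sqrt(t/2)).

```python
import math


def chi2_1_tail(t):
    """Return P(X > t) for X ~ chi-squared with 1 degree of freedom."""
    if t <= 0:
        return 1.0
    # X = Z^2 with Z standard normal, hence P(X > t) = erfc(sqrt(t / 2)).
    return math.erfc(math.sqrt(t / 2.0))


# Check against the familiar 5% critical value of the chi-squared(1) law.
print(round(chi2_1_tail(3.841459), 3))
```

Evaluating the same function at the threshold derived in part 3 gives the significance level implied by the C_p rule.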

Question 3.

The file Pois.dat contains the survival times of rats after poisoning with one of three types of poison, and treatment with one of four antidotes. The design is an orthogonal 3x4 factorial design with four observations per cell.
  1. Compare the sample variances within cells and comment on the results.
  2. Fit the full model Type*Treat, first choosing an appropriate scale for the dependent variable. Calculate the within-cell sample variances on the chosen scale and compare them with those from the previous part.
  3. Construct the ANOVA table and fit the resulting model. Can changing the order in which terms are dropped in the ANOVA table affect the final model in this case? What happens in the general case?
  4. Estimate the survival time for a rat poisoned by the second type of poison and treated by the first antidote. Give a 95% prediction interval for the survival time of such a rat and a 95% confidence interval for the median survival time of all rats with such a "fate".
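
The within-cell comparison in parts 1 and 2 can be organised as in the following sketch. It is plain Python and not part of the original exercise: the function name cell_variances, the record layout (poison, antidote, time) and the toy numbers are assumptions made for illustration. The numbers are made up and are not taken from Pois.dat, and the logarithm is used only as one example of a candidate scale.

```python
import math
from collections import defaultdict
from statistics import variance


def cell_variances(records, transform=lambda t: t):
    """Sample variance of transform(time) within each (poison, antidote) cell."""
    cells = defaultdict(list)
    for poison, antidote, time in records:
        cells[(poison, antidote)].append(transform(time))
    return {cell: variance(values) for cell, values in cells.items()}


# Made-up numbers for illustration only; NOT the contents of Pois.dat.
toy = [
    (1, "A", 1.0), (1, "A", 2.0), (1, "A", 3.0), (1, "A", 4.0),
    (2, "A", 2.0), (2, "A", 4.0), (2, "A", 6.0), (2, "A", 8.0),
]

raw = cell_variances(toy)                        # original scale
logged = cell_variances(toy, transform=math.log)  # one candidate scale

for cell in sorted(raw):
    print(cell, "raw:", raw[cell], "log:", logged[cell])
```

In the toy data the second cell is the first one multiplied by 2, so the raw variances differ by a factor of 4 while the variances on the log scale coincide. For the real data, whichever scale is chosen should be judged by the same kind of comparison across all twelve cells.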

Question 4.

The file Prices.dat contains the following data on selling prices of houses in one of Chicago's areas:

Price - Selling price of house in thousands of dollars
Bdr - Number of bedrooms
Flr - Floor space in sq. ft.
Fp - Number of fireplaces
Rms - Number of rooms
St - Storm windows (0 if absent, 1 if present)
Lot - Front footage of lot in feet
Tax - Annual taxes
Bth - Number of bathrooms
Con - Construction (0 if frame, 1 if brick)
Gar - Garage size (0 = no garage, 1 = one-car garage, etc.)
Cdn - Condition (1 = "needs work", 0 otherwise)
L1 - Location (L1 = 1 if property is in zone A, 0 otherwise)
L2 - Location (L2 = 1 if property is in zone B, 0 otherwise)

  1. Fit the main effects model of Price on all explanatory variables (don't forget to define suitable factor variables first where necessary!). If the model doesn't seem appropriate to you, consider possible transformations.
  2. Try to add all pairwise interactions to the model. Are the results surprising?
  3. Simplify the main effects model from 4.1 according to AIC using various strategies:
    1. backward elimination starting from the main effects model
    2. forward selection starting from the null model without any predictors
    3. try to drop non-significant main effects and to add significant interactions by a stepwise procedure
    Compare the resulting models in 4.3.1-4.3.3 by CV, GCV, AIC, multiple correlation and cross-validation correlation coefficients. Can you provide any statistical inference to choose among these models? Fit the model that seems to you the most adequate one and analyse the fit.
  4. Estimate the selling price of a house with 1000 square feet of floor area, 8 rooms, 4 bedrooms, 2 bathrooms, no fireplaces or storm windows, a 40-foot frontage, brick construction and a 2-car garage, that doesn't "need work", with $1000 annual taxes, in the L1 area. Give the corresponding 95% prediction interval and the 95% confidence interval for the median price of such houses in this area. Point out possible "conceptual" problems (if any) in deriving these intervals in this case.
Computational Notes for R users: