Exercise 2

Prelude and Fugue in F-test Major

Theoretical Prelude

Question 1.

Suppose we have a linear model with an intercept, p explanatory variables and i.i.d. normal errors:

y = β0 + β1 x1 + ... + βp xp + ε

Transform the original response y to y'=y-c, where c is a fixed constant.

  1. What happens to the OLS estimates of the β's after the transformation?
  2. Show that the residual sums of squares (RSS) for y and y' are the same.
  3. Show that the F-statistics for testing H0: β1 = ... = βs = 0 are also the same in both cases.
  4. Where did you use the assumption of normal errors? How would you test the hypothesis H0 (see above) when the distribution of the errors ε is not normal (though known)?
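
Before attempting the proofs, it can help to check the claims numerically. The sketch below does this for the simplest case p = 1, where H0 reduces to "slope = 0". It is a sanity check under stated assumptions, not a proof: the data are invented for illustration, and the helper names ols_simple and f_statistic are hypothetical.

```python
# Numerical sanity check (not a proof) for Question 1, using p = 1.
# The data below are made up purely for illustration.

def ols_simple(x, y):
    """Return (intercept, slope, RSS) of the least-squares line y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

def f_statistic(x, y):
    """F-statistic for H0: slope = 0 in simple regression with an intercept."""
    n = len(y)
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)  # RSS of the intercept-only model
    _, _, rss = ols_simple(x, y)
    return (tss - rss) / (rss / (n - 2))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
c = 10.0
y_shift = [yi - c for yi in y]  # the transformed response y' = y - c

b0, b1, rss = ols_simple(x, y)
b0s, b1s, rss_s = ols_simple(x, y_shift)
F, F_s = f_statistic(x, y), f_statistic(x, y_shift)

# Differences between the original and the shifted fit.
print("intercept:", b0 - b0s, "slope:", b1 - b1s)
print("RSS:", rss - rss_s, "F:", F - F_s)
```

Comparing the printed differences with the constant c suggests which quantities the shift affects; the exercise is to show why this holds for general p.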

Question 2.

Consider a simple linear regression with a single explanatory variable x. Show that the leverages hii (the diagonal elements of the hat matrix) satisfy:
  1. If all n observations xi are equidistant from their average, then hii = 2/n.
  2. If all but one of the observations xi are identical, the identical ones have hii = 1/(n-1), while the remaining observation has hii = 1.
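
Both statements can be checked numerically before proving them. The sketch below relies on the standard leverage formula for simple regression with an intercept, hii = 1/n + (xi - x̄)² / Σ(xj - x̄)². The two small data sets are invented for illustration and the function name leverages is hypothetical.

```python
# Leverages in simple linear regression with an intercept:
#   h_ii = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
# The x values below are made up purely for illustration.

def leverages(x):
    """Return the list of leverages h_ii for the design (1, x)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]

# Case 1: every x_i lies at the same distance (here 2) from the mean (here 5).
x_equidistant = [3.0, 7.0, 3.0, 7.0, 3.0, 7.0]
h_equidistant = leverages(x_equidistant)

# Case 2: all observations but the last one are identical.
x_one_off = [2.0, 2.0, 2.0, 2.0, 9.0]
h_one_off = leverages(x_one_off)

print(h_equidistant)  # compare with 2/n for n = 6
print(h_one_off)      # compare with 1/(n-1) and 1 for n = 5
```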

Fugue in Medical Data

Question 3.

The file Feigl.dat gives the survival times (Time) in weeks from initial diagnosis of 33 patients with acute myelogenous leukaemia, with two covariates: WBC (white blood cell count in thousands) and AG-factor at the time of diagnosis (1=Pos, 2=Neg).
  1. Plot Time against WBC for each level of AG. Does the plot indicate that a linear model will be appropriate? Try the effect of log-transforming Time and WBC on this plot.
  2. Fit a full linear regression model (with interaction) of Time on WBC and AG. Comment on the results. Test for parallel regression lines. Does this model fit the data?
  3. Re-fit the model on the log-log scale. Does the effect of log(WBC) on log(Time) depend on the presence of the AG-factor? Check the adequacy of the resulting model and try to think of possible reasons for any problems you find. Compare this model with that of the previous step.
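
The test for parallel regression in step 2 compares the full model (a separate intercept and slope for each AG level) with the reduced model (separate intercepts, one common slope). A minimal sketch of that comparison follows. The numbers are invented and are NOT the Feigl data, and the names centred_sums and parallel_f_test are hypothetical. Under H0 and normal errors, the statistic is referred to the F distribution with (df1, df2) degrees of freedom.

```python
# Sketch of the F-test for parallel regression lines (equal slopes across groups).
# All numbers below are invented for illustration; they are NOT the Feigl data.

def centred_sums(x, y):
    """Return (Sxx, Sxy, Syy), the centred sums of squares and cross-products."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    syy = sum((b - ybar) ** 2 for b in y)
    return sxx, sxy, syy

def parallel_f_test(groups):
    """groups: list of (x, y) pairs, one per level of the factor.
    Returns (F, df1, df2) for H0: all groups share a common slope."""
    sums = [centred_sums(x, y) for x, y in groups]
    n = sum(len(x) for x, _ in groups)
    g = len(groups)
    # Full model: separate intercept and slope in every group.
    rss_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in sums)
    # Reduced model: separate intercepts, one pooled slope.
    pooled_sxx = sum(s[0] for s in sums)
    pooled_sxy = sum(s[1] for s in sums)
    rss_parallel = sum(s[2] for s in sums) - pooled_sxy ** 2 / pooled_sxx
    df1, df2 = g - 1, n - 2 * g
    f_stat = ((rss_parallel - rss_full) / df1) / (rss_full / df2)
    return f_stat, df1, df2

# Hypothetical (x, y) values for the two levels of a factor.
group_pos = ([1.0, 2.0, 3.0, 4.0, 5.0], [9.8, 9.1, 7.9, 7.2, 6.1])
group_neg = ([1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 5.9, 5.4, 5.6, 5.1])

f_stat, df1, df2 = parallel_f_test([group_pos, group_neg])
print(f_stat, df1, df2)
```

The same F-statistic is obtained by fitting both models with indicator variables and comparing their residual sums of squares; the closed-form sums are used here only to keep the sketch free of matrix algebra.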

Question 4.

The file Charges.dat contains data on the sex, the attending physician (A, B or C), severity of illness (1-4), total hospital charges (Chrg) and age for 49 patients, all of whom had an identical diagnosis, from Northwestern Memorial Hospital, Chicago.
  1. Fit the main-effects model for the charges against age and the other variables (don't forget first to express them as suitable indicator variables where necessary). Is the linear model adequate for these data?
  2. Find the appropriate transformation of the dependent variable from the Box-Cox transformation family, re-fit the model and comment on its adequacy.
  3. Test the hypothesis that the attending physician has no effect on hospital charges (on the chosen scale).
  4. Some feminist organizations claim that there is sex discrimination in the hospital and that women suffer from higher hospital charges. Does their claim have any statistical grounds?
  5. Point out any influential observation(s) that strongly affected your model. Remove them from the data and re-fit the model. Comment on the results. Repeat Step 3 and Step 4.
  6. Repeat Step 2 without the influential observations you found. Did you get the same scale for the response variable as before? Try to explain this phenomenon.
  7. Are you completely satisfied with the resulting model(s)? If "yes", mazal tov!; if "no", suggest how to improve it.
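
For step 2, one common way to choose a member of the Box-Cox family is to maximise the profile log-likelihood L(λ) = -(n/2) log(RSS(λ)/n) + (λ - 1) Σ log yi over a grid of λ values, where RSS(λ) is the residual sum of squares after transforming the response. The sketch below illustrates this with a single predictor. The data are invented (NOT the Charges data), the function names are hypothetical, and the real exercise needs the full set of explanatory variables.

```python
import math

# Sketch: choosing a Box-Cox power by profile log-likelihood on a grid.
# Invented data with a single predictor; NOT the Charges data.

def box_cox(y, lam):
    """Box-Cox transform: (y^lam - 1) / lam, with log(y) at lam = 0."""
    if lam == 0:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def rss_simple(x, z):
    """RSS of the least-squares line of z on x (with an intercept)."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
    szz = sum((b - zbar) ** 2 for b in z)
    return szz - sxz ** 2 / sxx

def profile_loglik(x, y, lam):
    """Profile log-likelihood of lam, up to an additive constant."""
    n = len(y)
    rss = rss_simple(x, box_cox(y, lam))
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(rss / n) + jacobian

# Made-up positive response, roughly linear in x on the log scale.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.05, -0.04, 0.03, -0.06, 0.04, -0.02, 0.05, -0.05]
y = [math.exp(1.0 + 0.4 * xi + e) for xi, e in zip(x, noise)]

grid = [k / 2 for k in range(-4, 5)]  # lam = -2, -1.5, ..., 2
best_lam = max(grid, key=lambda lam: profile_loglik(x, y, lam))
print(best_lam)
```

In practice a finer grid is used, and an approximate confidence interval for λ is read off the likelihood curve so that a convenient value (such as 0, 1/2 or 1) can be chosen.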

Computational Notes for R users: