Exercise 2

Prelude and Fugue in F-test Major

Theoretical Prelude

Question 1.

Suppose we have a linear model with an intercept, p explanatory variables and i.i.d. normal errors: y = β₀ + β₁x₁ + ... + β_p x_p + ε.

Transform the original response y to y'=a(y-c), where a and c are fixed constants.

1. What happens to the OLS estimates of the β's and to the residual sum of squares (RSS) after this linear transformation?
2. Show that the F-statistic for testing H₀: β₁ = ... = β_s = 0 (1 ≤ s ≤ p) is the same in both cases.
3. Where did you use the assumption of normal errors? How would you test H₀ (at least asymptotically) when the distribution of the errors ε is not normal?
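
As a numerical sanity check (not a proof) of the claims in Question 1, one can fit the original and the transformed response to simulated data in R and compare the output. This is only a sketch: the data, the constants a and c0 (c0 stands in for c, to avoid masking R's c function), and the choice p = 3, s = 2 are all made up for illustration.

```r
set.seed(1)                        # simulated data, purely illustrative
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

a  <- 2.5; c0 <- 4                 # arbitrary constants (assumption)
y2 <- a * (y - c0)                 # transformed response y' = a(y - c)

full  <- lm(y  ~ x1 + x2 + x3)     # full model, original response
full2 <- lm(y2 ~ x1 + x2 + x3)     # full model, transformed response
red   <- lm(y  ~ x3)               # reduced model under H0: beta1 = beta2 = 0
red2  <- lm(y2 ~ x3)

rbind(original = coef(full), transformed = coef(full2))  # compare estimates
rss  <- deviance(full)             # for an lm fit, deviance() is the RSS
rss2 <- deviance(full2)
c(rss = rss, rss2 = rss2, ratio = rss2 / rss)

F1 <- anova(red,  full)$F[2]       # partial F-statistic, original scale
F2 <- anova(red2, full2)$F[2]      # partial F-statistic, transformed scale
all.equal(F1, F2)
```

Comparing the two rows of coefficients and the RSS ratio with a and c0 suggests what the algebraic answer should look like.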

Question 2.

Consider a simple linear regression with a single explanatory variable x. Show that

1. If all n observations x_i are equidistant from their average, then h_ii = 2/n for every i.
2. If all but one of the x_i are identical, then the identical observations have h_ii = 1/(n-1), while the remaining observation has h_ii = 1.
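
Both leverage statements in Question 2 can be checked numerically before proving them. The sketch below uses hatvalues on two small made-up designs. The values of n, x_eq and x_one are arbitrary, and y is a placeholder, since the leverages depend only on x.

```r
n <- 10                                  # arbitrary (even) sample size

x_eq  <- rep(c(-1, 1), each = n / 2)     # every x_i at distance 1 from the mean 0
x_one <- c(rep(3, n - 1), 7)             # all but the last observation identical
y     <- seq_len(n)                      # placeholder response

h_eq  <- hatvalues(lm(y ~ x_eq))         # diagonal of the hat matrix
h_one <- hatvalues(lm(y ~ x_one))

rbind(h_eq, target = 2 / n)              # compare with 2/n
rbind(h_one, target = c(rep(1 / (n - 1), n - 1), 1))
```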

Fugue in Medical Data

Question 3.

The file Feigl.dat gives the survival times (Time), in weeks from initial diagnosis, of 33 patients with acute myelogenous leukaemia, together with two covariates recorded at the time of diagnosis: WBC (white blood cell count, in thousands) and AG-factor (1=Pos, 2=Neg).

1. Plot Time against WBC for each level of AG. Does the plot indicate that a linear model is appropriate? Examine the effect of log transformations of Time and WBC on this plot.
2. Fit a full linear regression model (with interaction) of Time on WBC and AG. Comment on the results. Test for parallel regression lines. Does this model fit the data?
3. Re-fit the model on the log-log scale. Does the effect of log(WBC) on log(Time) depend on the presence of the AG-factor? Check the adequacy of the resulting model and try to think of possible reasons for any problems you find. Compare this model with that of the previous paragraph.

Question 4.

The file Charges.dat contains data on the sex, the attending physician (A, B or C), the severity of illness (1-4), the total hospital charges (Chrg) and the age of 49 patients, all of whom had an identical diagnosis, from Northwestern Memorial Hospital, Chicago.

1. Fit the main-effects model of the charges on age and the other variables (don't forget first to express them as suitable indicator variables where necessary). Is the linear model adequate for these data?
2. Find the appropriate transformation of the dependent variable from the Box-Cox family, re-fit the model and comment on its adequacy.
3. Test the hypothesis that the attending physician has no effect on hospital charges (on the chosen scale).
4. Some feminist organizations claim that there is sex discrimination in the hospital and that women are charged more. Does their claim have any statistical grounds?
5. Point out any influential observation(s) that strongly affected your model. Remove them from the data and re-fit the model. Comment on the results. Repeat Step 3 and Step 4.
6. Repeat Step 2 without the influential observations you've found. Did you get the same scale for the response variable as before? Try to explain this phenomenon.
7. Are you completely satisfied with the resulting model(s)? If "yes", mazal tov! If "no", suggest how to improve it.

Computational Notes for R users:

The function lm, used for fitting linear models, returns an object (called lm.object below) that contains a lot of useful information you may need for the analysis of your model. See the Value section of help(lm) for more details.
After fitting a linear model with lm and storing its output as lm.object, plot(lm.object) gives useful diagnostic plots, such as residuals vs. fitted values, a normal Q-Q plot and Cook's distance.
To find the optimal Box-Cox transformation, use the function boxcox from the package MASS, which you must attach first with library(MASS) (install it if necessary).
To define factor variables, use the factor or ordered functions (see help for details).
To use only part of the data in fitting models, use the arguments subset (preferable) or weights of the lm function (see help(lm) for details).
The functions update, add1 and drop1 may be useful for modifying models (see help for details).
To put several plots on the same page, use par(mfrow=c(...,...)), which sets the number of rows and columns of plots on the page.
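
The notes above can be strung together roughly as follows. This is a minimal sketch on simulated data, not on the course files: the data frame dat, its columns x, g and y, and the choice to drop observation 1 are all hypothetical and serve only to show the calls.

```r
library(MASS)                          # provides boxcox()

set.seed(2)                            # simulated data, NOT Feigl.dat or Charges.dat
n   <- 60
dat <- data.frame(
  x = runif(n, 1, 10),
  g = factor(sample(c("A", "B", "C"), n, replace = TRUE))  # categorical covariate
)
dat$y <- exp(0.5 + 0.2 * dat$x + 0.3 * (dat$g == "B") + rnorm(n, sd = 0.3))

fit <- lm(y ~ x + g, data = dat)       # main-effects model; g enters via indicators
summary(fit)                           # coefficients, R^2, overall F-test

op <- par(mfrow = c(2, 2))             # 2 rows x 2 columns of plots on one page
plot(fit)                              # standard diagnostic plots for an lm fit
par(op)                                # restore the previous graphics settings

# Profile log-likelihood of the Box-Cox parameter on a grid (no plot drawn)
bc <- boxcox(y ~ x + g, data = dat,
             lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]    # grid value maximising the log-likelihood

fit_log <- update(fit, log(y) ~ .)     # same predictors, transformed response
drop1(fit_log, test = "F")             # F-test for dropping each term in turn
fit_sub <- update(fit_log, subset = -1)  # re-fit without observation 1
```

Calling boxcox with plotit = TRUE (the default) draws the profile log-likelihood with a confidence interval for lambda, which is usually more informative than the single maximising value.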