Prelude and Fugue in F-test Major
Suppose we have a linear model with an intercept, p
explanatory variables and i.i.d. normal errors:
y=β0+β1 x1+...+βp xp+ε
Transform the original response y to y'=a(y-c), where a and c are fixed constants and a≠0.
- What will happen to the OLS estimates of β's and to the residual sum of squares (RSS) after this linear transformation?
- Show that the F-statistic for testing H0 : β1=...=βs=0 (1 ≤ s ≤ p) is the same in both cases.
- Where did you use the assumption of normality for errors? How will you test the hypotheses H0 (see above) when the distribution of errors ε is different from normal (at least
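Before attempting the proof, the first two items can be checked numerically. The sketch below uses synthetic data and arbitrary constants (the names x1, x2, x3, a, c0 and the choice s=2, p=3 are assumptions made for illustration only); it is a sanity check, not a substitute for the derivation.

```r
# Synthetic data for illustration only (not part of the assignment).
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

a  <- 3; c0 <- 10                # arbitrary constants, a != 0
y2 <- a * (y - c0)               # transformed response y' = a(y - c)

full1 <- lm(y  ~ x1 + x2 + x3)   # full model, original response
full2 <- lm(y2 ~ x1 + x2 + x3)   # full model, transformed response
red1  <- lm(y  ~ x3)             # reduced model under H0: beta1 = beta2 = 0
red2  <- lm(y2 ~ x3)

# F-statistics for H0 from the comparison of nested models
F1 <- anova(red1, full1)$F[2]
F2 <- anova(red2, full2)$F[2]

# How the slope estimates and the RSS change under the transformation
coef_ratio <- coef(full2)[-1] / coef(full1)[-1]
rss_ratio  <- deviance(full2) / deviance(full1)

print(c(F1 = F1, F2 = F2))
print(coef_ratio)
print(rss_ratio)
```

Comparing the printed ratios with a and a^2 suggests what the algebraic argument should deliver.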
Consider a simple linear regression with a single explanatory variable x, and let hii denote the leverages (the diagonal elements of the hat matrix). Show that
- If all n observations xi are equidistant from their average, then hii=2/n.
- If all but one of the observations xi are identical, these will have hii=1/(n-1), while for the remaining observation hii=1.
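Both claims can be checked numerically with hatvalues() before proving them. The sketch below is only an illustration: the sample size n=10, the design points and the random response are assumptions chosen for convenience (leverages depend on x only, so any response will do).

```r
set.seed(1)
n <- 10
y <- rnorm(n)                        # arbitrary response; leverages ignore it

# Case 1: all x_i at the same distance from their mean
# (two values, each taken by half of the observations)
x_equi <- rep(c(-1, 1), each = n / 2)
h_equi <- hatvalues(lm(y ~ x_equi))

# Case 2: all but one x_i identical
x_one <- c(rep(0, n - 1), 5)
h_one <- hatvalues(lm(y ~ x_one))

print(h_equi)                        # compare with 2/n
print(h_one)                         # compare with 1/(n-1) and 1
```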
Fugue in Medical Data
The data set gives the survival times (Time) in weeks from initial diagnosis of 33 patients with acute myelogenous leukaemia, with two covariates: WBC (white blood cell count in thousands) and AG-factor at the time of diagnosis (1=Pos, 2=Neg).
- Plot Time against WBC for each level of AG. Does the plot indicate that a linear model will be appropriate? Try the effect of log-transformations of Time and WBC on this plot.
- Fit a full linear regression model (with interaction) of Time on WBC and AG. Comment on the results. Test for parallel regression. Does this model fit the data?
- Re-fit the model on the log-log scale. Does the effect of log(WBC) on log(Time) depend on the presence of the AG-factor? Check the adequacy of the resulting model and try to think of possible reasons for any problems you find. Compare this model with that of the previous item.
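The model comparisons asked for above can be sketched as follows. Assumption: the leuk data frame from the package MASS (33 patients, columns time, wbc, ag) is used here as a stand-in for the course data file; its variable names, the coding of ag (a factor with levels absent/present) and the units of wbc may differ from the Time, WBC and AG described above, so adapt the names to your own data.

```r
library(MASS)   # recommended package; provides the leuk data frame

# Stand-in data (assumption): MASS::leuk with columns time, wbc, ag
fit_int <- lm(time ~ wbc * ag, data = leuk)   # separate slope for each AG level
fit_par <- lm(time ~ wbc + ag, data = leuk)   # parallel regression lines
par_test <- anova(fit_par, fit_int)           # F-test of the interaction term

# The same comparison on the log-log scale
fit_log_int <- lm(log(time) ~ log(wbc) * ag, data = leuk)
fit_log_par <- lm(log(time) ~ log(wbc) + ag, data = leuk)
log_par_test <- anova(fit_log_par, fit_log_int)

print(par_test)
print(log_par_test)
```

Testing for parallel regression amounts to comparing the additive model with the interaction model; the residual plots from plot() on each fit then help judge adequacy.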
A second data set records the sex, the attending physician (A), severity of illness (1-4), total hospital charges (Chrg) and age for 49 patients, all of whom had an identical diagnosis, from Northwestern Memorial Hospital, Chicago.
1. Fit the main effects model for the charges against age and the other variables (don't forget first to express them as suitable indicator variables where necessary). Is the linear model adequate for these data?
2. Find the appropriate transformation of the dependent variable from the Box-Cox transformation family, re-fit the model and comment on its adequacy.
3. Test the hypothesis that the attending physician has no effect on hospital charges (on the chosen scale).
4. Some feminist organizations claim that there is sexual discrimination in the hospital and that women suffer from higher hospital charges. Does their claim have any statistical grounds?
5. Point out influential observation(s) that strongly affected your model (if any). Remove them from the data and re-fit the model. Comment on the results. Repeat Step 3 and Step 4.
6. Repeat Step 2 without the influential observations you've found. Did you get the same scale for the response variable as before? Try to explain this phenomenon.
7. Are you completely satisfied with the resulting model(s)? If "yes", mazal tov!; if "no", suggest ways of improving it.
Computational Notes for R users:
- the function lm used for fitting linear models creates an object (called lm.object below) as its output that contains a lot of useful information you may need for the analysis of your model. See the Value section of help(lm) for more details
- after fitting a linear model by lm and creating the object lm.object as its output, the function plot(lm.object) gives useful diagnostic plots, such as residuals vs. fitted values, a normal Q-Q plot and Cook's distance
- to find the optimal Box-Cox transformation, you can use the function boxcox from the package MASS, which you should install (if necessary) and attach first
- to define factor variables use the factor or ordered functions (see help for details)
- to use only part of the data in fitting models, use the arguments subset (preferable) or weights of the lm function (see help(lm) for details)
- the functions update, add1, drop1 may be useful for modifying models (see help for details)
- if you want to put several plots on the same page you can use par(mfrow=c(...,...)) to set the number of rows and columns of plots per page
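The notes above can be put together in one short session. The sketch below runs on synthetic data; the data frame dat, its column names (chrg, age, sex, sev) and the cut-off age < 75 are hypothetical and only stand in for your own data and choices.

```r
library(MASS)                        # provides boxcox()

# Synthetic stand-in data (hypothetical names, not the course files)
set.seed(2)
n   <- 60
dat <- data.frame(
  age = round(runif(n, 20, 80)),
  sex = factor(sample(c("F", "M"), n, replace = TRUE)),
  sev = ordered(sample(1:4, n, replace = TRUE))
)
dat$chrg <- exp(6 + 0.01 * dat$age + 0.3 * as.integer(dat$sev) +
                rnorm(n, sd = 0.3))

fit <- lm(chrg ~ age + sex + sev, data = dat)
names(fit)                           # components stored in the fitted object

par(mfrow = c(2, 2))                 # 2 rows x 2 columns of plots on one page
plot(fit)                            # standard diagnostic plots

# Profile log-likelihood of the Box-Cox parameter over the default grid
bc <- boxcox(fit, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]  # grid value maximising the likelihood

fit_log <- update(fit, log(chrg) ~ .)           # same terms, log response
drop1(fit_log, test = "F")                      # F-test for dropping each term
fit_sub <- update(fit_log, subset = age < 75)   # refit on part of the data
```

Using update with subset keeps the original data frame intact, which makes it easy to compare fits with and without the observations you decide to exclude.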