• The deadline for submission is 4 March, 2018. Give comprehensive and convincing but at the same time brief and clearly written answers (it is not necessarily a contradiction!).
  • Before starting, please sign (by email) the following declaration, valid for both the practical and theoretical exams.

Question 1.

The data in the file Books.dat are compiled from the catalogue of American Government books for Spring 1988. The file lists the price Price, the number of pages P and the binding B (p - paperback, c - cloth) of books published by a certain publisher.
  1. Find a reasonable linear model for these data using price as the dependent variable and performing appropriate transformations of variables if necessary. Examine the goodness-of-fit of your final model and comment on the results.
  2. Although most of the data are for books published in 1988, in fact, two of the cloth-bound books were published in the 1970s, one of the paperbacks in 1989 and another in 1984. Can you identify them? Delete them from the data and find an adequate linear model for the reduced data set. Did the omitted observations strongly affect the model?
  3. Another possible way to reduce the influence of outliers is robust regression. Fit robust regression(s) and comment on the results.
  4. What model(s) would you present to a client? How would you interpret your results for him/her (a complete "amateur" in statistics)?
  5. Estimate the price of a 200-page book for the two types of binding and give the corresponding 95% prediction intervals.
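
As a reminder for item 5, the prediction interval for a new observation can be sketched as follows. This assumes an ordinary least-squares fit with independent normal errors of constant variance; the symbols (an n-by-p design matrix X, a covariate vector x_0 for the new book, the residual standard error \hat{\sigma}) are notation introduced here, not taken from the assignment.

```latex
\hat{y}_0 \;\pm\; t_{n-p,\,0.975}\;\hat{\sigma}\,
\sqrt{1 + x_0^{\top}\,(X^{\top}X)^{-1}\,x_0},
\qquad \hat{y}_0 = x_0^{\top}\hat{\beta}
```

If the price was modelled on a transformed scale (for example, its logarithm), the interval is computed on that scale and its endpoints are then back-transformed.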

Question 2.

The file Girls.dat contains the data on the exercise histories of 138 teenaged girls hospitalized for eating disorders, and a group of 93 "control" subjects. The variables are
subject - an identification code; there are several observations for each subject, but because the girls were hospitalized at different ages, the number of observations, and the age at the last observation, vary
age - the subject's age in years at the time of observation; all but the last observation for each subject were collected retrospectively at intervals of two years, starting at 8.
exercise - the amount of exercise in which the subject engaged, expressed as estimated hours per week
group - a factor indicating whether the subject is "patient" or "control"
  1. Perform initial examination of the data and make preliminary conclusions about the relationship of exercise to age for the two groups.
  2. Fit an appropriate model, performing transformations of the original variables if necessary. Comment on the results.
  3. Is the relationship of exercise to age different between the two groups?
  4. Is it plausible that the weekly hours of exercise do not change with age for the control group?
  5. Estimate the expected difference in the weekly hours of exercise between the two groups of girls at age 15.
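
To fix ideas for items 3-5, one possible (not the only) parametrisation of the fixed-effects part is sketched below. It assumes a mean that is linear in age with a group-by-age interaction; g = 1 for a patient and g = 0 for a control, y is the (possibly transformed) amount of exercise, and all symbols are notation introduced here. Since there are several observations per subject, subject-level random effects would typically be added on top of this mean structure.

```latex
\mathrm{E}[\,y \mid \mathrm{age}, g\,]
  = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,g + \beta_3\,g\cdot\mathrm{age},
\qquad
\Delta(15) = \beta_2 + 15\,\beta_3,
\qquad
\mathrm{Var}\bigl(\hat{\Delta}(15)\bigr)
  = \mathrm{Var}(\hat{\beta}_2) + 225\,\mathrm{Var}(\hat{\beta}_3)
  + 30\,\mathrm{Cov}(\hat{\beta}_2, \hat{\beta}_3)
```

Under this parametrisation, item 3 corresponds to testing \beta_3 = 0, item 4 to testing \beta_1 = 0, and item 5 to estimating \Delta(15). With a transformed response, \Delta(15) is a difference on the transformed scale.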

Question 3.

The prostate cancer data in the file Prostate.dat come from a study that examined the correlation between the level of prostate specific antigen (PSA) and the following clinical measurements in 97 men who were about to receive a radical prostatectomy:
lcavol - log(cancer volume)
lweight - log(prostate weight)
age - age
lbph - log(benign prostatic hyperplasia amount)
svi - seminal vesicle invasion
lcp - log(capsular penetration)
gleason - Gleason score
pgg45 - percentage Gleason scores 4 or 5

The goal is to predict the log of PSA (lpsa) from these measurements.

  1. Analyse the data to get some first impression. Make some preliminary comments.
  2. Check for the presence of multicollinearity among the explanatory variables. What methodological and computational problems might it cause?
  3. Split the data randomly (why?) into training and test sets of 75 and 22 patients respectively. Put the test set aside for now and consider the training set:
    1. Start from the main effects model, verify its adequacy.
    2. Select the "best" model by adding/removing variables and their interactions w.r.t. several model selection criteria. Compare the resulting models (also with the main effects model) and comment on the results.
  4. How would you test and compare the goodness-of-fit of different models from the previous paragraph on the test set? Apply your ideas and comment the results.
  5. Randomly split the initial data again into training and test sets of the same sizes and repeat steps 3-4. Compare and comment on the results for the two different splits. Are they surprising? (explain, answers like "Nothing can surprise me in this world anymore" won't be accepted!)
  6. Draw final conclusions and point out the measurements relevant for predicting the prostate specific antigen.
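
The mechanics of the random split in items 3 and 5 can be sketched in a few lines of Python using only the standard library. The function names, the seeds and the choice of mean squared prediction error as the test-set criterion are illustrative assumptions, not requirements of the assignment; any software and any sensible criterion may be used.

```python
import random


def split_indices(n_total, n_train, seed):
    """Randomly partition range(n_total) into training and test index lists."""
    rng = random.Random(seed)  # seeded so that the split is reproducible
    shuffled = list(range(n_total))
    rng.shuffle(shuffled)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


def mean_squared_prediction_error(observed, predicted):
    """Average squared difference between held-out responses and predictions."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Two independent splits of the 97 patients into 75 training and 22 test rows
# (the seeds 1 and 2 are arbitrary, hypothetical choices).
train_a, test_a = split_indices(97, 75, seed=1)
train_b, test_b = split_indices(97, 75, seed=2)
```

Fixing the seed makes each split reproducible, so the comparison between the two splits in item 5 can be rerun exactly.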

Good Luck!