PRACTICAL EXAM


  • The deadline for submission is 13 March, 2016.
  • Give comprehensive and convincing, yet brief and clearly written, answers (this is not necessarily a contradiction!).

Question 1.

The data in the file Books.dat are compiled from a catalogue of American Government books from Spring 1988. The file lists the price (Price), the number of pages (P) and the binding (B: p = paperback, c = cloth) of books published by a certain publisher.
  1. Fit a reasonable linear model to these data, using price as the dependent variable and transforming variables where appropriate. Examine the goodness of fit of your final model and comment on the results.
  2. Although most of the data are for books published in 1988, two of the cloth-bound books were in fact published in the 1970s, one of the paperbacks in 1989 and another in 1984. Can you identify them? Delete them from the data and find an adequate linear model for the reduced data set. Did the omitted observations strongly affect the model?
  3. What model(s) would you present to a client? How would you interpret your results for him/her? (the client is a complete "amateur" in statistics)
  4. Estimate the price of a 200-page book for the two types of binding and give the corresponding 95% prediction intervals.
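For item 4, the mechanics of a prediction interval can be sketched in the simplest setting. The block below is a minimal illustration, assuming a simple linear regression with one predictor and normal errors; the function name `prediction_interval`, the toy numbers and the supplied critical value `t_crit` are assumptions made for illustration and are not taken from Books.dat. A model with pages, binding and transformed variables needs the corresponding multi-predictor version.

```python
import math


def prediction_interval(x, y, x0, t_crit):
    """Prediction interval for a new response at x0 (simple linear regression).

    t_crit is the upper 2.5% quantile of Student's t with len(x) - 2 degrees
    of freedom for a 95% interval; it is supplied by the caller (tables or
    statistical software), since the standard library has no t quantile.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least-squares estimates of slope and intercept.
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance estimate on n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)

    fitted = intercept + slope * x0
    # The leading "1 +" accounts for the noise in the new observation itself.
    half_width = t_crit * math.sqrt(s2 * (1 + 1 / n + (x0 - x_bar) ** 2 / sxx))
    return fitted - half_width, fitted, fitted + half_width


# Toy numbers, invented purely to exercise the function.
x_toy = [1, 2, 3, 4, 5]
y_toy = [2, 4, 5, 4, 5]
lower, fitted, upper = prediction_interval(x_toy, y_toy, x0=3, t_crit=3.182)
```

If the response was transformed before fitting (for example, to log price), the interval is computed on the transformed scale and its endpoints are then back-transformed.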

Question 2.

The file Girls.dat contains the data on the exercise histories of 138 teenaged girls hospitalized for eating disorders, and a group of 93 "control" subjects. The variables are
subject - an identification code; there are several observations for each subject, but because the girls were hospitalized at different ages, the number of observations, and the age at the last observation, vary
age - the subject's age in years at the time of observation; all but the last observation for each subject were collected retrospectively at intervals of two years, starting at age 8
exercise - the amount of exercise in which the subject engaged, expressed as estimated hours per week
group - a factor indicating whether the subject is "patient" or "control"
  1. Perform an initial examination of the data and draw preliminary conclusions about the relationship of exercise to age for the two groups.
  2. Fit an appropriate model, transforming the original variables if necessary. Comment on the results.
  3. Does the relationship of exercise to age differ between the two groups?
  4. Is it plausible that the weekly hours of exercise do not change with age in the control group?
  5. Estimate the expected difference in weekly hours of exercise between the two groups of girls at age 15.
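As a starting point for the initial examination in item 1, the repeated-measures layout can be summarised per subject and per group. The sketch below assumes the records have already been read into tuples in the order (subject, age, exercise, group); the rows and the helper names `observations_per_subject` and `mean_exercise_by_group` are invented for illustration and are not values from Girls.dat.

```python
from collections import defaultdict

# Made-up rows in the (subject, age, exercise, group) layout described above.
# These are NOT values from Girls.dat.
rows = [
    ("s1", 8.0, 2.0, "patient"),
    ("s1", 10.0, 3.0, "patient"),
    ("s1", 11.5, 6.0, "patient"),
    ("s2", 8.0, 1.0, "control"),
    ("s2", 10.0, 1.5, "control"),
]


def observations_per_subject(rows):
    """Count how many observations each subject contributes."""
    counts = defaultdict(int)
    for subject, _age, _exercise, _group in rows:
        counts[subject] += 1
    return dict(counts)


def mean_exercise_by_group(rows):
    """Average weekly exercise hours over all observations in each group."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [sum of hours, count]
    for _subject, _age, exercise, group in rows:
        totals[group][0] += exercise
        totals[group][1] += 1
    return {group: total / n for group, (total, n) in totals.items()}


counts = observations_per_subject(rows)
group_means = mean_exercise_by_group(rows)
```

Such pooled means ignore the fact that observations from the same girl are correlated, so they serve only as a descriptive first look and not as a substitute for the model in item 2.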

Question 3.

The prostate cancer data in the file Prostate.dat come from a study that examined the correlation between the level of prostate specific antigen (PSA) and the following clinical measurements in 97 men who were about to receive a radical prostatectomy:
lcavol - log(cancer volume)
lweight - log(prostate weight)
age - age
lbph - log(benign prostatic hyperplasia amount)
svi - seminal vesicle invasion
lcp - log(capsular penetration)
gleason - Gleason score
pgg45 - percentage Gleason scores 4 or 5

The goal is to predict the log of PSA (lpsa) from these measurements.

  1. Analyse the data to get a first impression. Make some preliminary comments.
  2. Check for multicollinearity among the explanatory variables. What methodological and computational problems might it cause?
  3. Randomly split (why?) the data into a training set and a test set of 75 and 22 patients respectively. Put the test set aside for the moment and consider the training set:
    1. Start from the main effects model and verify its adequacy.
    2. By adding/removing variables and interactions, find the "best" parsimonious model with respect to the AIC, BIC and RIC criteria and the Lasso (you can check other model selection criteria as well!). Compare the resulting models (also with the main effects model) and comment on the results.
  4. How would you test and compare the goodness of fit of the different models from the previous item on the test set? Apply your ideas and comment on the results.
  5. Randomly split the initial data again into training and test sets of the same sizes and repeat steps 3-4. Compare and comment on the results for the two different splits. Are they surprising? (explain; answers like "Nothing can surprise me in this world anymore" won't be accepted!)
  6. Draw final conclusions and point out the measurements relevant for predicting the prostate specific antigen.
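The random 75/22 split and a test-set comparison can be sketched as follows. This is a minimal illustration using only the standard library; the helper names `split_indices` and `mean_squared_prediction_error`, the seed value and the choice of mean squared prediction error as the comparison measure are assumptions, not requirements of the exam.

```python
import random


def split_indices(n_total, n_train, seed):
    """Randomly partition row indices 0..n_total-1 into training and test sets."""
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    indices = list(range(n_total))
    rng.shuffle(indices)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def mean_squared_prediction_error(observed, predicted):
    """Average squared difference between observed and predicted responses."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# 97 patients split into 75 training and 22 test rows, as in item 3.
train_idx, test_idx = split_indices(97, 75, seed=1)

# Tiny made-up vectors, only to show how the error measure is called.
example_error = mean_squared_prediction_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Calling `split_indices` with a different seed gives the second split asked for in item 5; each candidate model is fitted on the training rows only and then scored on the test rows.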

Question 4.

The data below are the number of cases of lung cancer and the number of "man-years at risk" in a very large British study of smoking in men and its effect on lung cancer. The table is classified by number of years of smoking in five-year intervals, from 15-19 up to 55-59, and by the equivalent number of cigarettes smoked per day, in the intervals shown in the table below. The data are in the form r/n, where r is the number of lung cancer cases and n the number of man-years at risk.
                    Cigarettes per day
Years of smoking    1-9       10-14     15-19     20-24     25-34     35+
15-19               0/3121    0/3577    0/4317    0/5683    0/3042    0/670
20-24               0/2937    1/3286    0/4214    1/6385    1/4050    0/1166
25-29               0/2288    1/2546    0/3185    1/5483    4/4290    0/1482
30-34               0/2015    2/2219    4/2560    6/4687    9/4268    4/1580
35-39               1/1648    0/1826    0/1893    5/3646    9/3529    6/1336
40-44               2/1310    1/1886    2/1334    12/2411   11/2424   10/924
45-49               0/927     2/988     2/849     9/1567    10/1409   7/556
50-54               3/710     4/684     2/470     7/857     5/663     4/255
55-59               0/606     3/449     5/280     7/416     3/284     1/104
  1. For these data, find a well-fitting parsimonious model relating the proportion suffering from lung cancer to smoking rate and years of smoking. Give the interpretation of your model in terms of the risk of developing lung cancer.
  2. What are the chances of developing lung cancer for a man who has smoked 20 cigarettes per day for the last 40 years? (give a point estimate and the corresponding confidence interval)
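Before any model is fitted, the r/n table has to be turned into one record per cell. The sketch below shows one possible way to do that in plain Python; the names `parse_table` and `crude_rate` are hypothetical helpers, and the crude rate is a purely descriptive summary, not the model asked for in item 1.

```python
YEARS = ["15-19", "20-24", "25-29", "30-34", "35-39",
         "40-44", "45-49", "50-54", "55-59"]
CIGS = ["1-9", "10-14", "15-19", "20-24", "25-34", "35+"]

# Cells copied from the table above: rows are years of smoking,
# columns are cigarettes per day, each cell is cases/at-risk.
TABLE = """
0/3121 0/3577 0/4317 0/5683 0/3042 0/670
0/2937 1/3286 0/4214 1/6385 1/4050 0/1166
0/2288 1/2546 0/3185 1/5483 4/4290 0/1482
0/2015 2/2219 4/2560 6/4687 9/4268 4/1580
1/1648 0/1826 0/1893 5/3646 9/3529 6/1336
2/1310 1/1886 2/1334 12/2411 11/2424 10/924
0/927 2/988 2/849 9/1567 10/1409 7/556
3/710 4/684 2/470 7/857 5/663 4/255
0/606 3/449 5/280 7/416 3/284 1/104
"""


def parse_table(text):
    """Turn the r/n grid into one dictionary per (years, cigarettes) cell."""
    records = []
    for years, line in zip(YEARS, text.strip().splitlines()):
        for cigs, cell in zip(CIGS, line.split()):
            cases, at_risk = cell.split("/")
            records.append({"years": years, "cigs": cigs,
                            "cases": int(cases), "at_risk": int(at_risk)})
    return records


def crude_rate(records, years, cigs):
    """Observed cases divided by the at-risk total for a single cell."""
    for rec in records:
        if rec["years"] == years and rec["cigs"] == cigs:
            return rec["cases"] / rec["at_risk"]
    raise KeyError((years, cigs))


records = parse_table(TABLE)
rate_40_44_at_20_24 = crude_rate(records, "40-44", "20-24")
```

The long format (one row per cell with its count and its at-risk total) is the shape that most regression routines for count or proportion data expect as input.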

Good Luck!