PRACTICAL EXAM
 The deadline for submission is 4 March, 2018. Give comprehensive and convincing but at the same time brief and clearly written answers (it is not necessarily a contradiction!).
 Before starting, please sign (by email) the following declaration, valid for both the practical and theoretical exams.
Question 1.
The data in the file Books.dat are compiled from the catalogue of American Government books for Spring 1988. They list the price (Price), the number of pages (P) and the binding (B: p = paperback, c = cloth) of books published by a certain publisher.
 Find a reasonable linear model for these data, using price as the dependent variable and applying appropriate transformations of variables if necessary. Examine the goodness-of-fit of your final model and comment on the results.
 Although most of the data are for books published in 1988, two of the clothbound books were in fact published in the 1970s, one of the paperbacks in 1989 and another in 1984. Can you identify them? Delete them from the data and find an adequate linear model for the reduced data set. Did the omitted observations strongly affect the model?
 Another possible way to reduce the influence of outliers is robust regression. Fit robust regression(s) and comment on the results.
 Which model(s) would you present to a client? How would you interpret your results for him/her? (The client is a complete "amateur" in statistics.)
 Estimate the price of a 200-page book for the two types of binding and give the corresponding 95% prediction intervals.
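As a rough illustration of the robust-regression item above, the sketch below compares ordinary least squares with a Huber M-estimator fitted by iteratively reweighted least squares. It is only one possible approach, not the expected answer. The data are synthetic and invented purely for illustration (they are not Books.dat), the function names `ols` and `huber_irls` are hypothetical, and the tuning constant 1.345 and the MAD-based scale estimate are common conventions assumed here.

```python
import numpy as np

def ols(X, y):
    """Ordinary least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def huber_irls(X, y, c=1.345, n_iter=100, tol=1e-10):
    """Huber M-estimate via iteratively reweighted least squares (IRLS)."""
    beta = ols(X, y)  # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust scale: median absolute deviation, rescaled for normal errors
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.maximum(np.abs(resid) / scale, 1e-12)
        weights = np.minimum(1.0, c / u)  # weight 1 for small residuals, c/|u| otherwise
        sw = np.sqrt(weights)
        beta_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Synthetic data (NOT Books.dat): a known line plus small deterministic noise
i = np.arange(20)
pages = 100.0 + 20.0 * i
y = 2.0 + 0.005 * pages + 0.1 * np.sin(i)
y[18] += 5.0  # two artificial outliers at the largest page counts
y[19] += 5.0

X = np.column_stack([np.ones_like(pages), pages])
beta_ols = ols(X, y)
beta_huber = huber_irls(X, y)
print("OLS   intercept, slope:", beta_ols)
print("Huber intercept, slope:", beta_huber)
```

On the real data the design matrix would also need an indicator column for the binding, and possibly transformed variables. Note that Huber weights limit the effect of large residuals but do not protect against high-leverage points.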
Question 2.
The file Girls.dat contains data on the exercise histories of 138 teenaged girls hospitalized for eating disorders, and a group of 93 "control" subjects. The variables are
subject: an identification code; there are several observations for each subject, but because the girls were hospitalized at different ages, the number of observations and the age at the last observation vary
age: the subject's age in years at the time of observation; all but the last observation for each subject were collected retrospectively at intervals of two years, starting at age 8
exercise: the amount of exercise in which the subject engaged, expressed as estimated hours per week
group: a factor indicating whether the subject is a "patient" or a "control"
 Perform an initial examination of the data and draw preliminary conclusions about the relationship of exercise to age for the two groups.
 Fit an appropriate model, transforming the original variables if necessary. Comment on the results.
 Does the relationship of exercise to age differ between the two groups?
 Is it true that the weekly amount of exercise does not change with age for the control group?
 Estimate the expected difference in weekly hours of exercise between the two groups of girls at age 15.
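The last three items above can be read as statements about the coefficients of a model with a group-by-age interaction. The sketch below is a deliberately simplified illustration: it fits exercise = b0 + b1*(age - 8) + b2*patient + b3*patient*(age - 8) by ordinary least squares. In this parameterization "no change with age for controls" corresponds to b1 = 0, "the same relationship in both groups" to b3 = 0, and the patient-minus-control difference at age 15 to b2 + 7*b3. Plain OLS ignores the repeated measurements on each subject, which a mixed-effects model with subject-level random effects would account for, and a transformation of exercise may also be needed. The data are synthetic and noise-free (not Girls.dat), and the function names are hypothetical.

```python
import numpy as np

def fit_interaction_model(age, is_patient, exercise, age0=8.0):
    """OLS fit of exercise ~ 1 + (age - age0) + group + group:(age - age0)."""
    t = age - age0
    X = np.column_stack([np.ones_like(t), t, is_patient, is_patient * t])
    return np.linalg.lstsq(X, exercise, rcond=None)[0]

def group_difference(beta, age, age0=8.0):
    """Expected patient-minus-control difference at a given age."""
    return beta[2] + beta[3] * (age - age0)

# Synthetic, noise-free data (NOT Girls.dat), used only to check the algebra
age = np.tile(np.array([8.0, 10.0, 12.0, 14.0, 16.0]), 2)
is_patient = np.repeat(np.array([0.0, 1.0]), 5)
exercise = 1.0 + 0.1 * (age - 8.0) + is_patient * (0.2 + 0.3 * (age - 8.0))

beta = fit_interaction_model(age, is_patient, exercise)
diff_at_15 = group_difference(beta, 15.0)
print("estimated coefficients:", beta)
print("patient - control at age 15:", diff_at_15)
```

Centring age at 8 makes the intercept and the group coefficient refer to the first observation age rather than to age 0.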
Question 3.
The prostate cancer data in the file Prostate.dat come from a study that examined the correlation between the level of prostate specific antigen (PSA) and the following clinical measurements in 97 men who were about to receive a radical prostatectomy:
lcavol: log(cancer volume)
lweight: log(prostate weight)
age: age
lbph: log(benign prostatic hyperplasia amount)
svi: seminal vesicle invasion
lcp: log(capsular penetration)
gleason: Gleason score
pgg45: percentage Gleason scores 4 or 5
The goal is to predict the log of PSA (lpsa) from these measurements.
 Analyse the data to get a first impression. Make some preliminary comments.
 Check for multicollinearity among the explanatory variables. What methodological and computational problems might it cause?
 Randomly (why?) split the data into training and test sets of 75 and 22 patients respectively. Put the test set aside for the time being and consider the training set:
 Start from the main effects model and verify its adequacy.
 Select the `best' model by adding/removing variables and their interactions w.r.t. several model selection criteria. Compare the resulting models (also with the main effects model) and comment on the results.
 How would you test and compare the goodness-of-fit of the different models from the previous paragraph on the test set? Apply your ideas and comment on the results.
 Randomly split the initial data again into training and test sets of the same sizes and repeat steps 3-4. Compare and comment on the results for the two different splits. Are they surprising? (Explain; answers like "Nothing can surprise me in this world anymore" won't be accepted!)
 Make final conclusions and point out the measurements relevant for predicting the prostate specific antigen.
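The multicollinearity check and the train/test comparison above could be sketched roughly as follows. This is an illustration under stated assumptions rather than a prescribed solution. Variance inflation factors (VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns) are one common collinearity diagnostic, and mean squared prediction error on the held-out set is one possible comparison criterion. The data are synthetic (not Prostate.dat), and the helper names `vif`, `random_split` and `holdout_mse` are hypothetical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n x p, no intercept column)."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef = np.linalg.lstsq(others, target, rcond=None)[0]
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def random_split(n, n_train, seed):
    """Random permutation of 0..n-1 cut into training and test indices."""
    perm = np.random.default_rng(seed).permutation(n)
    return perm[:n_train], perm[n_train:]

def holdout_mse(X, y, train, test, columns):
    """Fit OLS (with intercept) on the training rows; return test-set MSE."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    beta = np.linalg.lstsq(A[train], y[train], rcond=None)[0]
    return float(np.mean((y[test] - A[test] @ beta) ** 2))

# Synthetic data (NOT Prostate.dat): column 2 is almost a copy of column 0
rng = np.random.default_rng(0)
n = 97
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)

train, test = random_split(n, 75, seed=1)
print("VIF:", vif(X))
print("test MSE, columns [0, 1]:", holdout_mse(X, y, train, test, [0, 1]))
print("test MSE, column  [1]   :", holdout_mse(X, y, train, test, [1]))
```

Changing `seed` in `random_split` produces a different split, which is a simple way to see how much the selected model and its test error depend on the particular partition.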
Good Luck!