- The deadline for submission is 4 March, 2018. Give comprehensive and convincing but at the same time brief and clearly written answers (it is not necessarily a contradiction!).
- Before starting, please sign (by email) the following declaration, valid for both the practical and theoretical exams.
The data in the file is compiled from the catalogue of American Government books for Spring 1988. It lists the price, the number of pages P and the binding B (paperback, or c for cloth) of books published by a certain publisher.
- Find a reasonable linear model for these data using price as the dependent variable, performing appropriate
transformations of variables if necessary. Examine the goodness-of-fit of your final model and comment on the results.
- Although most of the data are for books published in 1988, two of the cloth-bound books were in fact published in
the 1970s, one of the paperbacks in 1989 and another in 1984. Can you identify them? Delete them from the data and find an
adequate linear model for the reduced data set. Did the omitted observations strongly affect the model?
- Another possible way to reduce the influence of outliers is robust regression. Fit robust regression(s) and comment on the results.
- What model(s) would you present to a client? How would you interpret your results for him/her? (The client is a complete
"amateur" in statistics.)
- Estimate the price of a 200-page book for the two types of binding and give the corresponding 95% prediction intervals.
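To see why a robust fit can be worth comparing with least squares, here is a minimal sketch contrasting an ordinary least squares line with a Theil-Sen line (slope taken as the median of all pairwise slopes). Theil-Sen is only one of several robust estimators and is chosen here for simplicity; it is not prescribed by the assignment. The numbers are synthetic and are not the book data, and the names `pages` and `price` are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, median


def ols_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b


def theil_sen_line(x, y):
    """Theil-Sen fit: slope is the median of pairwise slopes,
    intercept is the median of y - slope*x; returns (a, b)."""
    slopes = [
        (y[j] - y[i]) / (x[j] - x[i])
        for i, j in combinations(range(len(x)), 2)
        if x[j] != x[i]
    ]
    b = median(slopes)
    a = median(yi - b * xi for xi, yi in zip(x, y))
    return a, b


# Synthetic illustration only (NOT the book data): points lie on
# price = 2 + 0.05 * pages, except for one grossly mispriced observation.
pages = [100, 150, 200, 250, 300, 350, 400, 450]
price = [2 + 0.05 * p for p in pages]
price[-1] = 60.0  # a single outlier

ols = ols_line(pages, price)
robust = theil_sen_line(pages, price)
print("OLS       intercept, slope:", ols)
print("Theil-Sen intercept, slope:", robust)
```

In this toy example the single outlier roughly doubles the least squares slope, while the Theil-Sen line stays close to the pattern followed by the other seven points. With real data the two fits should be compared alongside residual diagnostics rather than on slope alone.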
The file contains data on the exercise histories of 138 teenaged girls hospitalized for eating
disorders, and a group of 93 "control" subjects. The variables are:
||an identification code; there are several observations for each subject, but because the girls were hospitalized at
different ages, the number of observations, and the age at the last observation, vary
||the subject's age in years at the time of observation; all but the last observation for each subject were collected
retrospectively at intervals of two years, starting at 8.
||the amount of exercise in which the subject engaged, expressed as estimated hours per week
||a factor indicating whether the subject is "patient" or "control"
- Perform an initial examination of the data and draw preliminary conclusions about the relationship of exercise to age for the two groups.
- Fit an appropriate model, performing transformations of the original variables if necessary. Comment on the results.
- Does the relationship of exercise to age differ between the two groups?
- Is it true that the weekly hours of exercise do not change with age for the control group?
- Estimate the expected difference in weekly hours of exercise between the two groups of girls at age
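For the initial examination of repeated measurements like these, one crude but transparent device is the summary-measures approach: fit a separate least squares line of exercise on age for each subject, then compare the average slopes between groups. The sketch below assumes hypothetical field names (`subject`, `age`, `exercise`, `group`) and uses made-up records, not the real data. It is an exploratory aid only and does not replace a model that accounts for the within-subject correlation.

```python
from collections import defaultdict
from statistics import mean


def slope(xs, ys):
    """Least squares slope of ys on xs."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx


# Made-up records for illustration: (subject, age, exercise hours/week, group).
records = [
    ("s1", 8, 2.0, "patient"), ("s1", 10, 3.0, "patient"), ("s1", 12, 5.0, "patient"),
    ("s2", 8, 1.0, "patient"), ("s2", 10, 2.5, "patient"), ("s2", 12, 4.0, "patient"),
    ("s3", 8, 1.5, "control"), ("s3", 10, 1.5, "control"), ("s3", 12, 2.0, "control"),
    ("s4", 8, 2.0, "control"), ("s4", 10, 2.5, "control"), ("s4", 12, 2.5, "control"),
]

# Collect each subject's observations.
by_subject = defaultdict(lambda: {"age": [], "exercise": [], "group": None})
for subject, age, exercise, group in records:
    entry = by_subject[subject]
    entry["age"].append(age)
    entry["exercise"].append(exercise)
    entry["group"] = group

# One slope per subject (needs at least two distinct ages), pooled by group.
group_slopes = defaultdict(list)
for entry in by_subject.values():
    if len(set(entry["age"])) >= 2:
        group_slopes[entry["group"]].append(slope(entry["age"], entry["exercise"]))

mean_slopes = {g: mean(s) for g, s in group_slopes.items()}
print(mean_slopes)
```

Subjects with a single observation contribute no slope here, which is one reason this summary is only a first look: the number of observations per subject varies, so the per-subject slopes are not equally reliable.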
The prostate cancer data in the file
come from a study that examined the correlation between the level of prostate
specific antigen (PSA) and the following clinical measurements in 97 men who were about to receive a radical prostatectomy:
||log(benign prostatic hyperplasia amount)
||seminal vesicle invasion
||percentage Gleason scores 4 or 5
The goal is to predict the log of PSA (lpsa) from these measurements.
- Analyse the data to get a first impression. Make some preliminary comments.
- Check for the presence of multicollinearity among the explanatory variables. What methodological and computational
problems might it cause?
- Randomly (why?) split the data into training and test sets of 75 and 22 patients, respectively. Put the test set aside
for now and consider the training set:
- Start from the main effects model and verify its adequacy.
- Select the `best' model by adding/removing variables and their interactions w.r.t. several model selection criteria.
Compare the resulting models (also with the main effects model) and comment on the results.
- How would you test and compare the goodness-of-fit of the different models from the previous paragraph on the test set?
Apply your ideas and comment on the results.
- Randomly split the initial data again into training and test sets of the same sizes and repeat steps 3-4. Compare
and comment on the results for the two different splits. Are they surprising? (Explain; answers like "Nothing can surprise me in
this world anymore" won't be accepted!)
- Draw final conclusions and point out the measurements relevant for predicting the prostate specific antigen.
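The random 75/22 split and the comparison on the held-out set can be sketched as follows. This is an illustrative outline only: the helper names `train_test_split` and `holdout_mse` and the seeds are assumptions, and mean squared prediction error is just one possible criterion for comparing models on the test set.

```python
import random


def train_test_split(n_total, n_train, seed):
    """Randomly partition the indices 0..n_total-1 into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    idx = list(range(n_total))
    rng.shuffle(idx)
    return sorted(idx[:n_train]), sorted(idx[n_train:])


def holdout_mse(predict, rows, targets):
    """Mean squared prediction error of `predict` on held-out observations."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(targets)


# Two different random splits of 97 patients into 75 training and 22 test cases.
train_a, test_a = train_test_split(97, 75, seed=1)
train_b, test_b = train_test_split(97, 75, seed=2)
print(len(train_a), len(test_a), test_a[:5])
print(len(train_b), len(test_b), test_b[:5])

# Toy check of the error measure with a constant predictor (not a real model).
toy_mse = holdout_mse(lambda row: 0.0, [None, None], [1.0, 3.0])
print(toy_mse)
```

Any model fitted on the `train_*` indices can be passed to `holdout_mse` as a prediction function and scored on the matching `test_*` rows. Repeating this for both splits shows how much the ranking of models depends on the particular random partition.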