PRACTICAL EXAM


  • The deadline for submission is 22 March, 2020. Give comprehensive and convincing but at the same time brief and clearly written answers (it is not necessarily a contradiction!).

Question 1.

The data in the file Books.dat are compiled from a catalogue of American Government books for Spring 1988. The file lists the price (Price), the number of pages (P) and the binding (B: p - paperback, c - cloth) of books published by a certain publisher.
  1. Find a reasonable linear model for these data, using price as the dependent variable and applying appropriate transformations of variables if necessary. Examine the goodness-of-fit of your final model and comment on the results.
  2. Although most of the data are for books published in 1988, two of the cloth-bound books were in fact published in the 1970s, one of the paperbacks in 1989 and another in 1984. Can you identify them? Delete them from the data and find an adequate linear model for the reduced data set. Did the omitted observations strongly affect the model?
  3. Another possible way to reduce the influence of outliers is robust regression. Fit robust regression(s) and comment on the results.
  4. What model(s) would you present to a client? How would you explain your results to him or her (the client is a complete "amateur" in statistics)?
  5. Estimate the price of a 200-page book for the two types of binding and give the corresponding 95% prediction intervals.
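
As a reminder for item 5, the following is a sketch of the standard prediction interval for a new observation in a least-squares linear model with normal errors. The notation is an assumption and is not taken from the exam: $X$ is the $n \times p$ design matrix of the final model, $x_0$ is the covariate vector of the new book (200 pages, given binding), $\hat{y}_0 = x_0^\top \hat{\beta}$ is the fitted value and $s^2 = \mathrm{RSS}/(n-p)$.

```latex
\hat{y}_0 \;\pm\; t_{n-p,\,0.975}\; s\,\sqrt{1 + x_0^\top (X^\top X)^{-1} x_0}
```

If the final model uses a transformed response, for example $\log(\mathrm{Price})$, the interval is computed on the transformed scale and its endpoints are then back-transformed (here with $\exp$) to obtain an interval for the price itself.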

Question 2.

The file Girls.dat contains data on the exercise histories of 138 teenaged girls hospitalized for eating disorders, and of a group of 93 "control" subjects. The variables are
  • subject -- an identification code; there are several observations for each subject, but because the girls were hospitalized at different ages, the number of observations and the age at the last observation vary
  • age -- the subject's age in years at the time of observation; all but the last observation for each subject were collected retrospectively at intervals of two years, starting at age 8
  • exercise -- the amount of exercise in which the subject engaged, expressed as estimated hours per week
  • group -- a factor indicating whether the subject is a "patient" or a "control"
  1. Perform an initial examination of the data and draw preliminary conclusions about the relationship of exercise to age for the two groups.
  2. Fit an appropriate model, transforming the original variables if necessary. Comment on the results.
  3. Does the relationship of exercise to age differ between the two groups?
  4. Is it plausible that the weekly hours of exercise do not change with age in the control group?
  5. Estimate the expected difference in weekly hours of exercise between the two groups of girls at age 15.
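
One possible way to frame items 3 to 5 is a linear mixed-effects model with a random intercept and a random slope per subject. The sketch below is only an illustration under stated assumptions, not the required answer. The notation is hypothetical: $y_{ij}$ is the (possibly transformed) exercise of subject $i$ at occasion $j$, $a_{ij}$ is her age, $g_i = 1$ for patients and $0$ for controls, and age is centred at 8, the first observation age.

```latex
\begin{align*}
y_{ij} &= \beta_0 + \beta_1 g_i + \beta_2 (a_{ij} - 8) + \beta_3\, g_i (a_{ij} - 8)
          + b_{0i} + b_{1i} (a_{ij} - 8) + \varepsilon_{ij}, \\
\mathrm{E}[y \mid g = 1,\ a = 15] - \mathrm{E}[y \mid g = 0,\ a = 15]
       &= \beta_1 + \beta_3 (15 - 8) = \beta_1 + 7\beta_3 .
\end{align*}
```

Under this parametrisation, item 3 corresponds to testing $\beta_3 = 0$, item 4 to testing $\beta_2 = 0$, and item 5 to estimating $\beta_1 + 7\beta_3$, which is on the transformed scale if the response was transformed.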

Question 3.

The dataset Boston from the library MASS contains 506 median prices of owner-occupied homes in $1000s (medv) in various places in Boston. Alongside the price, the dataset also provides various geographic and socio-economic information, such as
  • crim -- per capita crime rate by town
  • zn -- proportion of residential land zoned for lots over 25,000 sq.ft
  • indus -- proportion of non-retail business acres per town
  • chas -- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • nox -- nitrogen oxides concentration (parts per 10 million)
  • rm -- average number of rooms per dwelling
  • age -- proportion of owner-occupied units built prior to 1940
  • dis -- weighted mean of distances to five Boston employment centres
  • rad -- index of accessibility to radial highways
  • tax -- full-value property-tax rate per $10,000
  • ptratio -- pupil-teacher ratio by town
  • black -- $1000(Bk - 0.63)^2$ where Bk is the proportion of blacks by town
  • lstat -- lower status of the population (percent)
The goal is to find the relationships between these factors and house prices.
  1. Analyze the data to get some first impressions and make some preliminary comments.
  2. Randomly split (why?) the data into training and test sets containing 80% and 20% of the data, respectively. Set the test set aside for now and work with the training set:
    1. Start from the main effects model and verify its adequacy.
    2. If you are not satisfied, try adding pairwise interactions and perform transformations if necessary.
    3. Perform model selection w.r.t. various model selection criteria. Compare the resulting models and comment on the results.
  3. Test and compare the goodness-of-fit of those models on the test set. Comment on the results, choose the `final' model and explain it.
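
The random 80%/20% split in item 2 can be sketched as follows. This is a minimal, language-neutral illustration in plain Python rather than a prescribed method. The function name `train_test_split_indices` and the seed value are hypothetical choices; only the row count 506 and the 80/20 proportions come from the text above.

```python
import random


def train_test_split_indices(n_rows, test_fraction=0.2, seed=2020):
    """Return (train_idx, test_idx), a random partition of range(n_rows)."""
    rng = random.Random(seed)      # fixed seed (arbitrary) makes the split reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)           # random order, so the split ignores the file order
    n_test = round(n_rows * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# The Boston data have 506 rows, so 20% is rounded to 101 test rows.
train_idx, test_idx = train_test_split_indices(506)
print(len(train_idx), len(test_idx))  # prints "405 101"
```

The index lists can then be used to select rows of the data; the test rows are not touched until item 3.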

Good Luck!