Exercise 1

Question 1 (theoretical warm-up).

Which of the following distributions belong to the exponential family: negative binomial NB(r,p) with given r, uniform U[0,θ], Student t_n, Gamma(α,β) with fixed α, Gamma(α,β) with fixed β, Beta(p,q) with fixed q? For those that do belong to the exponential family, find the natural parameter θ and the function b(θ).
Given a sample y₁,...,y_n from a distribution that belongs to the exponential family with a natural parameter θ, find the sufficient statistic for θ.
Consider the binomial GLM. Give the expression for the predicted probability of success at a point x₀ and the corresponding confidence interval for the logit, probit and complementary log-log (CLL) links.
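As a notation reminder for the items above, the block below writes out the exponential-family form and the three link functions. It assumes the common GLM parametrisation with dispersion function a(φ) and normalising term c(y,φ); neither symbol appears in the text above, so if the course uses a different convention, adapt accordingly.

```latex
% Exponential-family density in canonical GLM form (assumed convention):
% theta is the natural parameter, b(theta) the function asked for above.
f(y;\theta,\phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\},
\qquad E(Y) = b'(\theta), \quad \mathrm{Var}(Y) = a(\phi)\, b''(\theta).

% Link functions g(p) = \eta = x^{T}\beta for the binomial GLM:
\text{logit: } g(p) = \log\frac{p}{1-p}, \qquad
\text{probit: } g(p) = \Phi^{-1}(p), \qquad
\text{CLL: } g(p) = \log\{-\log(1-p)\}.
```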

Some questions doctors might ask you...

Question 2.

The erythrocyte sedimentation rate (ESR) is the rate at which red blood cells (erythrocytes) settle out of suspension in blood plasma. The ESR increases if the level of certain proteins in the blood plasma rises due to various diseases, which makes the ESR one of the most commonly used screening tests performed on blood samples. The data in the file ESR.dat were collected on 32 individuals to examine the extent to which ESR is related to two plasma proteins, fibrinogen and γ-globulin. The ESR for a healthy individual should be less than 20 mm/hr and, since its absolute value is not really important, the response variable is 0 if an individual is "healthy" (ESR < 20) or 1 if "unhealthy" (ESR >= 20). The aim of the study is to determine the strength of the relationship between the probability of an ESR reading of 20 mm/hr or more and the levels of the two plasma proteins.

Fit the logistic regression of Response on the explanatory variables. Comment on its goodness-of-fit.
Examine the need for an interaction.
Is Response affected by the levels of both proteins? Comment on the final model.
Identify influential observations (if any) that might affect your modelling. Remove them from the data and repeat the previous steps. Comment on the results.
Repeat all the previous steps for the probit and CLL link functions. Compare the results.
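The workflow in the steps above can be sketched in R as follows. This is a minimal illustration on simulated data, not on ESR.dat: the column names Fibrinogen and Globulin, the simulated coefficients, and the 4/n cutoff for Cook's distance (a common rule of thumb) are all assumptions made for the example.

```r
# Simulated stand-in for ESR.dat. Column names and coefficients are
# assumptions for illustration, not the real file layout or real values.
set.seed(1)
n <- 32
esr <- data.frame(
  Fibrinogen = runif(n, 2, 5),
  Globulin   = runif(n, 28, 46)
)
eta <- -8 + 2 * esr$Fibrinogen + 0.03 * esr$Globulin
esr$Response <- rbinom(n, 1, plogis(eta))

# Fit the same main-effects model under the three links.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(lk) {
  glm(Response ~ Fibrinogen + Globulin,
      family = binomial(link = lk), data = esr)
})
names(fits) <- links

# Compare the links, e.g. by AIC (same response and same covariates).
print(sapply(fits, AIC))

# Need for an interaction: likelihood-ratio test of nested models.
fit_main <- fits$logit
fit_int  <- update(fit_main, . ~ . + Fibrinogen:Globulin)
print(anova(fit_main, fit_int, test = "Chisq"))

# Candidate influential observations via Cook's distance
# (4/n is only a rule of thumb; inspect the values as well).
cd <- cooks.distance(fit_main)
print(which(cd > 4 / n))
```

To refit without flagged observations, the same calls can be repeated with `data = esr[-idx, ]`, where `idx` holds the row numbers chosen for removal.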

Question 3.

The data in the file GHQ.dat come from a psychiatric study. Each of 120 patients was administered the 12-item General Health Questionnaire (GHQ), resulting in a score between 0 and 12, and was subsequently given a full psychiatric examination by a psychiatrist who did not know the patient's GHQ score. The patient was classified by the psychiatrist as either a "Case", requiring psychiatric treatment, or a "Non-Case". The goal of the study was to establish whether the GHQ score could indicate the need for psychiatric treatment. More specifically, given a patient's GHQ and Sex, what can be said about the probability P that the patient is a psychiatric case?

The file gives the GHQ score, the number of Cases (C) and the number of Non-Cases (NC) at each GHQ score, classified by the factor Sex (1=Male, 2=Female).
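Data in this grouped layout can be passed to a binomial GLM as a two-column response of successes and failures. The sketch below uses a few made-up rows (not the values in GHQ.dat) and assumes the columns are named GHQ, Sex, C and NC as in the description above.

```r
# Made-up rows in the layout described above; NOT the contents of GHQ.dat.
ghq <- data.frame(
  GHQ = c(0, 1, 2, 3, 0, 1, 2, 3),
  Sex = c(1, 1, 1, 1, 2, 2, 2, 2),
  C   = c(0, 1, 2, 3, 1, 2, 4, 5),
  NC  = c(18, 8, 3, 1, 30, 15, 6, 2)
)

# Treat Sex as a factor with readable labels (1 = Male, 2 = Female).
ghq$Sex <- factor(ghq$Sex, levels = c(1, 2), labels = c("Male", "Female"))

# Observed proportion of cases at each GHQ score and sex.
ghq$prop <- ghq$C / (ghq$C + ghq$NC)
plot(prop ~ GHQ, data = ghq, pch = as.integer(ghq$Sex),
     ylab = "Observed proportion of cases")
legend("topleft", legend = levels(ghq$Sex), pch = 1:2)

# Grouped binomial response: cbind(successes, failures).
fit <- glm(cbind(C, NC) ~ GHQ + Sex, family = binomial, data = ghq)
print(coef(fit))
```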

Plot the proportion of patients who need psychiatric treatment as a function of GHQ for both sexes. What can you conclude from a visual inspection of the data?
Fit an appropriate model for studying the relationship between P and the explanatory variables GHQ and Sex. Analyse the results. Does the effect of GHQ on P depend on the patient's sex? Does Sex influence P at all?
Fit the final model. Plot the fitted proportions and compare them with the observed ones. Interpret the results. Do they support or contradict your preliminary conclusions based on the visual analysis of the data? Would you recommend using the GHQ as a reasonable indicator of the need for psychiatric treatment without a full, expensive and sophisticated psychiatric examination?
What can you say about the probability that a new female patient with GHQ=2 is a psychiatric case? Give both the point estimate and the corresponding confidence interval. The R functions predict or predict.glm may be helpful (see their help pages for more detail).
A question which may be of interest is the value GHQ_(0.5) of a patient's GHQ that corresponds to a probability of 0.5 that the patient is a psychiatric case. Estimate GHQ_(0.5) from your model and give the corresponding 95% confidence interval. The R function dose.p from the package MASS may be helpful.
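The last two items can be approached as sketched below: a Wald interval is built on the link scale and then back-transformed, and dose.p supplies the estimate and standard error of GHQ_(0.5). The data are made up (not GHQ.dat), the model GHQ + Sex is used only for illustration and is not necessarily the final model, and the normal-approximation interval for GHQ_(0.5) is one possible choice.

```r
library(MASS)  # provides dose.p

# Made-up grouped data; NOT the contents of GHQ.dat.
ghq <- data.frame(
  GHQ = c(0, 1, 2, 3, 0, 1, 2, 3),
  Sex = factor(rep(c("Male", "Female"), each = 4),
               levels = c("Male", "Female")),
  C   = c(0, 1, 2, 3, 1, 2, 4, 5),
  NC  = c(18, 8, 3, 1, 30, 15, 6, 2)
)
fit <- glm(cbind(C, NC) ~ GHQ + Sex, family = binomial, data = ghq)

# Prediction for a new female patient with GHQ = 2, on the link scale.
new <- data.frame(GHQ = 2, Sex = factor("Female", levels = levels(ghq$Sex)))
pr  <- predict(fit, newdata = new, type = "link", se.fit = TRUE)

# 95% Wald interval on the link scale, back-transformed to probabilities.
z     <- qnorm(0.975)
p_hat <- plogis(pr$fit)
ci_p  <- plogis(pr$fit + c(-1, 1) * z * pr$se.fit)
print(c(estimate = unname(p_hat), lower = ci_p[1], upper = ci_p[2]))

# GHQ value giving P = 0.5. cf picks the intercept and the GHQ slope,
# so with Sex in the model this refers to the baseline level (Male).
d     <- dose.p(fit, cf = c(1, 2), p = 0.5)
ghq50 <- as.numeric(d)
se50  <- as.numeric(attr(d, "SE"))
ci50  <- ghq50 + c(-1, 1) * z * se50
print(c(estimate = ghq50, lower = ci50[1], upper = ci50[2]))
```

Back-transforming the link-scale interval keeps the limits inside (0, 1), which a symmetric interval computed directly on the probability scale does not guarantee. For the probit or CLL links, plogis would be replaced by the inverse of the corresponding link, available as `family(fit)$linkinv`.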