Exercise 1
Question 1 (theoretical warm-up).
- Which of the following distributions belong to the exponential family:
negative binomial NB(r,p) with given r, uniform U[0,θ],
Student t_n, Gamma(α,β) with fixed α,
Gamma(α,β) with fixed β, Beta(p,q)
with fixed q? For those that do belong to the exponential family,
find their natural parameters θ and functions b(θ).
- Given a sample y1,...,yn from a distribution
that belongs to the exponential family with a natural parameter θ,
find the sufficient statistic for θ.
- Consider the binomial GLM. Give the expression for the predicted probability
of success at a point x0 and the corresponding confidence
interval for the logit, probit and complementary log-log (CLL) links.
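As a reminder for the first two items, and assuming the course uses the usual GLM parametrisation of the exponential family (with a dispersion parameter φ and functions a, b and c), the density can be written as below. The Poisson distribution, which is not among the distributions listed above, is worked only as an illustration of how θ and b(θ) are read off.

```latex
% General form (assumed parametrisation):
\[
  f(y;\theta,\phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y,\phi) \right\}
\]
% Illustration: Poisson with mean \lambda
\[
  f(y;\lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}
               = \exp\left\{ y\log\lambda - \lambda - \log y! \right\},
\]
\[
  \theta = \log\lambda, \qquad b(\theta) = e^{\theta}, \qquad a(\phi) = 1, \qquad
  \mathrm{E}(Y) = b'(\theta) = e^{\theta} = \lambda .
\]
```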
Some questions doctors might ask you...
Question 2.
The erythrocyte sedimentation rate (ESR) is the rate at which red blood cells
(erythrocytes) settle out of suspension in blood plasma. The ESR increases if
the levels of certain proteins in the blood plasma rise due to various diseases,
which makes the ESR one of the most commonly used screening tests performed on
samples of blood.
The data in the file ESR.dat were collected on 32 individuals to examine
the extent to which ESR is related to two plasma proteins, fibrinogen
and γ-globulin. The ESR for a `healthy' individual should be less
than 20 mm/hr and, since its absolute value is not really important, the
response variable is 0 if an individual is `healthy' (ESR < 20) or 1 if
`unhealthy' (ESR >= 20). The aim of the study is to determine the strength
of the relationship between the probability of an ESR reading greater than 20
mm/hr and the levels of the two plasma proteins.
- Fit the logistic regression of Response on the explanatory variables.
Comment on its goodness-of-fit.
- Examine the need for an interaction.
- Is Response affected by the levels of both proteins?
Comment on the final model.
- Identify any influential observations that might affect your
modelling. Remove them from the data and repeat the previous steps.
Comment on the results.
- Repeat all the previous steps for the probit and CLL link functions.
Compare the results.
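The workflow in Question 2 might be organised along the following lines. This is only a sketch: the data are simulated (not ESR.dat), and the column names Fibrinogen, Globulin and Response are assumptions about how the file is laid out.

```r
# Simulated stand-in for ESR.dat; values and column names are assumptions.
set.seed(1)
n <- 32
esr <- data.frame(
  Fibrinogen = runif(n, 2, 5),
  Globulin   = runif(n, 28, 46)
)
esr$Response <- rbinom(n, 1, plogis(-7 + 1.8 * esr$Fibrinogen))

# Fit the same linear predictor under the three link functions.
links <- c("logit", "probit", "cloglog")
fits <- lapply(links, function(lnk)
  glm(Response ~ Fibrinogen + Globulin,
      family = binomial(link = lnk), data = esr))
names(fits) <- links

# Residual deviance and AIC for each link, side by side.
link_summary <- sapply(fits, function(f) c(deviance = deviance(f), AIC = AIC(f)))
print(link_summary)

# Need for an interaction: analysis of deviance between nested models.
fit_int <- update(fits$logit, . ~ . + Fibrinogen:Globulin)
print(anova(fits$logit, fit_int, test = "Chisq"))

# Candidate influential observations: largest Cook's distances.
cd <- cooks.distance(fits$logit)
print(head(sort(cd, decreasing = TRUE), 3))
```

With a binary (ungrouped) response the residual deviance is not a reliable goodness-of-fit measure on its own, so the comparison between nested models usually carries more weight than the deviance itself.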
Question 3.
The data in the file GHQ.dat come from a psychiatric study.
Each of 120 patients was administered the 12-item General Health Questionnaire (GHQ),
resulting in a score between 0 and 12, and was subsequently given a full
psychiatric examination by a psychiatrist who did not know the patient's GHQ
score. The patient was classified by the psychiatrist as either a "Case",
requiring psychiatric treatment, or a "Non-Case". The goal of the research was
to establish whether the GHQ score could indicate the need for psychiatric
treatment. More specifically, given a patient's GHQ score and
Sex, what can be said about the probability P that the patient is a psychiatric
case?
The file gives the GHQ score and the numbers of Cases C and Non-Cases NC
at each GHQ score, classified by the factor Sex (1=Male, 2=Female).
- Plot the proportion of patients who need psychiatric treatment as a function
of GHQ for both sexes. Can you say something from a visual analysis of the data?
- Fit an appropriate model for studying the relationship between P and
the explanatory variables GHQ and Sex. Analyse the results. Does the
effect of GHQ on P depend on the patient's sex?
Does Sex influence P at all?
- Fit the final model. Plot the fitted
proportions and compare them with the observed ones.
Interpret the results. Do they support or contradict your preliminary
conclusions based on the visual analysis of the data? Would you recommend using
the GHQ as a reasonable indicator of the need for psychiatric treatment
without a full, expensive and sophisticated psychiatric examination?
- What can you say about the probability that a new female patient with
GHQ=2 is a psychiatric case? Give both the point estimate and
the corresponding confidence interval. The R functions predict or
predict.glm may be helpful (see their help pages for more detail).
- A question which may be of interest is the value GHQ(0.5) of a patient's GHQ
score that corresponds to a probability of 0.5 that the patient is a psychiatric case.
Estimate GHQ(0.5) from your model and give the corresponding
95% confidence interval. The R function dose.p from the package
MASS may be helpful.
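The last two items might be approached as sketched below. The counts are invented purely for illustration (they are not the contents of GHQ.dat), the column names GHQ, Sex, C and NC follow the description above, and the logit link is assumed. The interval for P is built on the link scale and then back-transformed, which keeps it inside (0,1).

```r
library(MASS)  # provides dose.p

# Invented grouped data standing in for GHQ.dat (illustration only).
ghq <- data.frame(
  GHQ = rep(0:5, times = 2),
  Sex = factor(rep(c(1, 2), each = 6), labels = c("Male", "Female")),
  C   = c(0, 1, 2, 3, 3, 4,   1, 2, 4, 5, 6, 6),
  NC  = c(15, 8, 4, 2, 1, 0,  20, 10, 5, 2, 1, 0)
)

# Grouped binomial response: cbind(successes, failures).
fit <- glm(cbind(C, NC) ~ GHQ + Sex, family = binomial, data = ghq)

# Point estimate and 95% interval for a new female patient with GHQ = 2.
new <- data.frame(GHQ = 2, Sex = factor("Female", levels = levels(ghq$Sex)))
pr  <- predict(fit, newdata = new, type = "link", se.fit = TRUE)
z   <- qnorm(0.975)
p_hat <- plogis(pr$fit)
ci    <- plogis(pr$fit + c(-1, 1) * z * pr$se.fit)
print(c(estimate = unname(p_hat), lower = ci[1], upper = ci[2]))

# GHQ(0.5) from a model with GHQ only; cf = 1:2 selects intercept and slope.
fit_ghq <- glm(cbind(C, NC) ~ GHQ, family = binomial, data = ghq)
ld50    <- dose.p(fit_ghq, cf = 1:2, p = 0.5)
ld50_ci <- as.numeric(ld50) + c(-1, 1) * z * as.numeric(attr(ld50, "SE"))
print(ld50)
print(ld50_ci)
```

If Sex is kept in the final model, the intercept used for GHQ(0.5) has to include the Sex effect for the group of interest, so the simple cf = 1:2 call above would then refer to the baseline level only.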