Exercise 4

Question 1.

The data set bacteria (part of MASS library - see help(bacteria)) contains records on the presence of the bacteria H.influenzae in a series of observations on 50 children with otitis media in the Northern Territory of Australia. Children were randomly allocated to three groups treated by a placebo and an active drug with and without extra effort to ensure that it was taken. The presence of the bacteria was tested on weeks 0, 2, 4, 6 and 11 (but some observations are missing). The (binary) response y is the presence (y) or absence (n) of the bacteria. The relevant explonatory variables are

week -- week of test
trt -- a factor with levels placebo, drug and drug+
ID -- subject ID

Fit a logit GLM for y as a function of week and trt.
1. Does the week effect depend on the treatment?
2. Is there a significant diffence between treatments?
3. Was extra effort worth?
Fit a GLMM model adding a (random) effect of a subject.
1. Fit a random intercept model. Compare it with a random intercepts model, where the intercepts' variances may be different for each treatment group. Is the assumption of their equality reasonable?
2. Consider a model with random intercept and slope.
3. For a chosen GLMM model reply to the same questions about the efficiency of treatments (see GLM).
What is your final model? Make final comments on it.

Welcome to the Mysterious World of Missing Data!

Warning: once you miss something during the journey, don't panic - apply the EM-algorithm

Question 2.

Observations y_i are drawn from a truncated Poisson distribution with an unknown mean λ in which zero values are not observable:

	1	2	3	4	5
Frequency	132	42	21	3	2

Find a probability function of y.
Give the form of iterations and write a short code to find the MLE of λ by Newton-Raphson and Fisher scoring algorithms (you can, for example, plot the (log-)likelihood function as a function of λ to get a reasonable starting value).
Show how the EM algorithm can be used to estimate λ from "complete data". Perform the EM algorithm (to make a fair comparison use the same starting value).
Compare estimates obtained by different methods and the rates of convergence.

Question 3.

The data shown below give the number of death notices for women aged 80 years and over, from the London Times for each day in the three-year period 1910-1912 (it seems men simply didn't reach such age at all...at least those times...):

Number of deaths	0	1	2	3	4	5	6	7	8	9
Frequency	162	267	271	185	111	61	27	8	3	1

Fit a Poisson distribution to the number of deaths. Is the fit satisfactory?
Write down a two-component Poisson mixture model for the data.
Describe the EM algorithm for fitting the model.
Starting from initial estimates p=0.5, λ₁=1, λ₂=2, run the EM algorithm for the first five iterations and compute the log-likelihood on each iteration.
The convergence of the EM algorithm can be slow especially for mixture models and sometimes needs about 1000 iterations, so I'll help you a little bit. The MLEs you'll get after a lot of iterations in this case are: p=0.360, λ₁=1.256, λ₂=2.663. Calculate the log-likelihood for MLEs and compare it with your previous results. Is the fit satisfactory now?
Convert the frequencies of each value of y to proportions and plot these observed proportions together with the fitted probablities from the Poisson distribution and the Poisson mixture distribution. Comment the results.

Computational Notes for R users: there are several ways to run GLMM models in R, e.g.

glmmPQL from the MASS library
lmer from the lme4 package

The methods mostly differ in numerical approximations for integrals and maximization.