FINAL PROJECT
- The deadline is Wednesday, 25 July 2012, 15:00.
- Try to provide comprehensive but concise arguments; use all the techniques you have learned (various tests, plots, etc.) to justify your conclusions.
- Before starting the exam, please sign (by email) the following declaration.
Question 1 (theoretical warm-up).
A hospital planned to carry out a medical study on a large sample of patients to investigate a possible association between a certain disease D and patients' characteristics x (e.g. age, sex, smoking status, etc.). However, due to budget cuts, it was decided to select a smaller sample from the original one. Let
- D_i = 1 if the i-th patient has the disease, D_i = 0 otherwise;
- x_i be the vector of characteristics of the i-th patient (fixed and known);
- S_i = 1 if the i-th patient is selected into the smaller sample for the study, S_i = 0 otherwise.
For the selected patients, a logistic regression model has been fitted:
log( P(D_i=1 | x_i, S_i=1) / P(D_i=0 | x_i, S_i=1) ) = a + b'x_i
Were it not for the budget limitations, one would naturally be interested in fitting the logistic regression to the whole data set:
log( P(D_i=1 | x_i) / P(D_i=0 | x_i) ) = a* + (b*)'x_i
- Suppose the proportion of selected patients was the same in both groups (say, r). What is the connection between the coefficients in the two models, i.e. between a, b and a*, b*?
- Repeat the previous paragraph for the case where the proportions of selected patients are different for patients with and without the disease (say, P(S_i=1 | D_i=1) = r_1, while P(S_i=1 | D_i=0) = r_0).
- Comment on the results and draw conclusions. How will the decreased sample size affect the model fit?
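As a numerical sanity check on whatever relation is derived, the sketch below applies Bayes' rule to a population logistic model and inspects how the logit changes after selection. It assumes that, given disease status, selection does not depend on x; the coefficient values, the sampling fractions and the helper names are purely illustrative and are not part of the question.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def selected_sample_prob(p_pop, r1, r0):
    """P(D=1 | x, S=1) by Bayes' rule, assuming selection depends on D only."""
    return r1 * p_pop / (r1 * p_pop + r0 * (1.0 - p_pop))


# Hypothetical population coefficients and sampling fractions (illustration only).
a_star, b_star = -2.0, 0.5
r1, r0 = 0.8, 0.2

shifts = []
for x in (-1.0, 0.0, 1.0, 2.5):
    p_pop = 1.0 / (1.0 + math.exp(-(a_star + b_star * x)))
    p_sel = selected_sample_prob(p_pop, r1, r0)
    # Difference between the selected-sample logit and the population logit
    shifts.append(logit(p_sel) - logit(p_pop))

print(shifts)              # one shift per value of x
print(math.log(r1 / r0))   # compare with the log-ratio of sampling fractions
```

Setting r1 equal to r0 in the same script covers the equal-proportion case.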
Question 2.
The data below are the numbers of lung cancer cases and of "man-years at risk" in a very large British study of smoking among men (sorry, girls...) and its effect on lung cancer. The table is classified by the number of years of smoking, in five-year intervals from 15-19 up to 55-59, and by the equivalent number of cigarettes smoked per day, in the intervals shown in the table below. The data are in the form r/n, where r is the number of lung cancer cases and n is the number of men at risk.
Years of smoking | Cigs/day
                 | 1-9    | 10-14  | 15-19  | 20-24   | 25-34   | 35+
15-19            | 0/3121 | 0/3577 | 0/4317 | 0/5683  | 0/3042  | 0/670
20-24            | 0/2937 | 1/3286 | 0/4214 | 1/6385  | 1/4050  | 0/1166
25-29            | 0/2288 | 1/2546 | 0/3185 | 1/5483  | 4/4290  | 0/1482
30-34            | 0/2015 | 2/2219 | 4/2560 | 6/4687  | 9/4268  | 4/1580
35-39            | 1/1648 | 0/1826 | 0/1893 | 5/3646  | 9/3529  | 6/1336
40-44            | 2/1310 | 1/1886 | 2/1334 | 12/2411 | 11/2424 | 10/924
45-49            | 0/927  | 2/988  | 2/849  | 9/1567  | 10/1409 | 7/556
50-54            | 3/710  | 4/684  | 2/470  | 7/857   | 5/663   | 4/255
55-59            | 0/606  | 3/449  | 5/280  | 7/416   | 3/284   | 1/104
- For these data, find a well-fitting parsimonious model relating the proportion suffering from lung cancer to smoking rate and years of smoking. Interpret your model in terms of the risk of developing lung cancer.
- What are the chances of developing lung cancer for a man who has smoked 20 cigarettes per day for the last 40 years? (Give a point estimate and the corresponding confidence interval.)
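Before any modelling, it can help to hold the table in a long format (one record per cell) and to look at the raw cell that contains the requested profile. The sketch below does that in plain Python and attaches a Wilson score interval to the raw proportion. This is only a crude single-cell benchmark for comparison with a fitted model, not the model-based answer the question asks for; the variable names and the choice of the Wilson interval are assumptions made for illustration.

```python
import math

years = ["15-19", "20-24", "25-29", "30-34", "35-39",
         "40-44", "45-49", "50-54", "55-59"]
cigs = ["1-9", "10-14", "15-19", "20-24", "25-34", "35+"]

# cases[i][j], at_risk[i][j]: row i = years of smoking, column j = cigarettes/day
cases = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 4, 0],
    [0, 2, 4, 6, 9, 4],
    [1, 0, 0, 5, 9, 6],
    [2, 1, 2, 12, 11, 10],
    [0, 2, 2, 9, 10, 7],
    [3, 4, 2, 7, 5, 4],
    [0, 3, 5, 7, 3, 1],
]
at_risk = [
    [3121, 3577, 4317, 5683, 3042, 670],
    [2937, 3286, 4214, 6385, 4050, 1166],
    [2288, 2546, 3185, 5483, 4290, 1482],
    [2015, 2219, 2560, 4687, 4268, 1580],
    [1648, 1826, 1893, 3646, 3529, 1336],
    [1310, 1886, 1334, 2411, 2424, 924],
    [927, 988, 849, 1567, 1409, 556],
    [710, 684, 470, 857, 663, 255],
    [606, 449, 280, 416, 284, 104],
]

# Long format: one (years, cigs, r, n) record per cell, ready for a GLM routine.
rows = [(y, c, cases[i][j], at_risk[i][j])
        for i, y in enumerate(years) for j, c in enumerate(cigs)]


def wilson_interval(r, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion r/n."""
    p = r / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


# Raw cell containing 20 cigarettes/day and 40 years of smoking.
i, j = years.index("40-44"), cigs.index("20-24")
r, n = cases[i][j], at_risk[i][j]
rate = r / n
low, high = wilson_interval(r, n)
print(r, n, rate, (low, high))
```

A model-based estimate borrows strength from neighbouring cells, so its interval would typically be narrower than this single-cell one.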
Question 3.
In a study, male and female drivers were interviewed about the importance of various features of vehicle safety to them when they were buying a car. The table below shows the ratings for air conditioning according to the sex and age of the driver.
Sex   | Age   | No or little importance | Important | Very important | Total
Women | 18-23 | 26                      | 12        | 7              | 45
      | 24-40 | 9                       | 21        | 15             | 45
      | >40   | 5                       | 14        | 41             | 60
Men   | 18-23 | 40                      | 17        | 8              | 65
      | 24-40 | 17                      | 15        | 12             | 44
      | >40   | 8                       | 15        | 18             | 41
Total |       | 105                     | 94        | 101            | 300
- Look at the data and try to draw some preliminary conclusions (conjectures?).
- Fit an appropriate model for these data. Do the ratings change with age similarly in both sex groups? Does sex have any influence at all? Can you say that the ratings do not change with age?
- Can you exploit the fact that the response variable for these data is an ordinal categorical variable? If "yes", fit the corresponding model. Is it adequate for the data? Return to all the questions from the second paragraph.
- Compare the results from both models. In particular, compare the estimated probabilities with the observed proportions. Draw final conclusions and comment on the results of the study.
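For the preliminary look at the data and for the later comparison with fitted probabilities, the observed proportions and the empirical cumulative logits can be tabulated directly. The sketch below is one possible way to do this; the dictionary layout and the function name are illustrative choices, and the cumulative logit log(P(Y <= j) / P(Y > j)) is just one common parametrisation for an ordinal response.

```python
import math

ratings = ("No or little importance", "Important", "Very important")
counts = {
    ("Women", "18-23"): (26, 12, 7),
    ("Women", "24-40"): (9, 21, 15),
    ("Women", ">40"): (5, 14, 41),
    ("Men", "18-23"): (40, 17, 8),
    ("Men", "24-40"): (17, 15, 12),
    ("Men", ">40"): (8, 15, 18),
}


def summarize(row):
    """Return observed proportions and empirical cumulative logits for one group."""
    total = sum(row)
    props = [c / total for c in row]
    logits = []
    cum = 0
    for c in row[:-1]:
        cum += c
        # log odds of a rating at or below this category
        logits.append(math.log(cum / (total - cum)))
    return props, logits


summary = {group: summarize(row) for group, row in counts.items()}
for (sex, age), (props, logits) in summary.items():
    print(sex, age,
          [round(p, 3) for p in props],
          [round(l, 3) for l in logits])
```

If the gap between the two cumulative logits is roughly the same in every group, that is informal support for a common-slope (proportional odds) structure; a formal check still requires fitting the model.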
Question 4.
In an experiment to investigate the social behavior of hornets, different numbers of hornets were placed in boxes, and the number of cells built by the hornets was counted. The data from 38 boxes are given below: the number of cells built for each number of hornets.
No. of hornets | No. of cells
1              | 0, 1, 2, 2, 4, 4, 5, 10, 11, 18
2              | 0, 4, 5, 7, 8, 13, 18, 29
5              | 7, 8, 17, 18, 19
6              | 17
10             | 12, 17, 18, 23, 25, 32
12             | 21
16             | 12
19             | 23
20             | 21, 23, 30, 31
41             | 30
- Assuming a normal model with constant variance, find the appropriate transformation of the number of cells using log(no. of hornets) as an explanatory variable. Analyse the fitted model and point out any problems you find.
- If you are not satisfied with the resulting fit in the previous paragraph, you can modify the model by allowing heterogeneity of the variance, assuming that it is also a function of no. of hornets or log(no. of hornets) respectively. Compare your final model with that from the previous paragraph. Is the assumption of equal variances reasonable?
- Assume the Poisson model for the number of cells and fit the corresponding regression model on no. of hornets or log(no. of hornets). Comment on the fit of the Poisson model, and compare the results with those from the previous paragraphs. Is there overdispersion? If "yes", modify your original Poisson model. Draw final conclusions.
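Because several hornet counts have replicate boxes, a quick informal look at both the constant-variance assumption and possible overdispersion is to compare the sample mean and variance within each replicated group. The sketch below does this with the standard library only; it is a descriptive check under the assumption that boxes with the same number of hornets are exchangeable, not a substitute for the model-based diagnostics requested above.

```python
from statistics import mean, variance

# Number of cells built, keyed by the number of hornets in the box.
cells = {
    1: [0, 1, 2, 2, 4, 4, 5, 10, 11, 18],
    2: [0, 4, 5, 7, 8, 13, 18, 29],
    5: [7, 8, 17, 18, 19],
    6: [17],
    10: [12, 17, 18, 23, 25, 32],
    12: [21],
    16: [12],
    19: [23],
    20: [21, 23, 30, 31],
    41: [30],
}

summary = {}
for hornets, ys in cells.items():
    if len(ys) > 1:  # a sample variance needs at least two boxes
        m, v = mean(ys), variance(ys)
        # Under a Poisson model the variance-to-mean ratio should be near 1.
        summary[hornets] = (m, v, v / m)

for hornets, (m, v, ratio) in summary.items():
    print(hornets, round(m, 2), round(v, 2), round(ratio, 2))
```

Ratios well above 1 in several groups would point towards overdispersion relative to the Poisson model, and variances that change with the mean would argue against a constant-variance normal model on the original scale.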
Question 5.
Observations of a Poisson process are recorded on a counter, which records the number of events occurring in independent consecutive 1-second periods. The counter has a random fault, and in some random 1-second periods it does not record the correct count, but records a zero count instead.
In a one-minute period, the following distribution of counts is observed:
Count  | 0  | 1  | 2  | 3 | 4 | 5
Number | 24 | 21 | 11 | 3 | 0 | 1
Let λ be the mean number of events occurring per second, and p the probability that the counter is not operating.
- Write down the likelihood function.
- Find the MLEs for λ and p. If they are not available in closed form, provide relevant numerical procedure(s) and apply them to the data at hand.
- Assess the evidence that the counter was faulty over the data recording period. State any theoretical difficulties in answering this question.
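One possible numerical procedure is sketched below, under the assumption that the setup is read as a zero-inflated Poisson model: a recorded zero arises either from a fault (probability p) or from a genuine zero count, while positive counts can only come from a working counter. The EM iteration, the function names and the starting values are illustrative choices, not a prescribed method; a direct Newton or general-purpose optimiser on the log-likelihood would serve equally well.

```python
import math

# Observed distribution over the 60 one-second periods: count -> number of periods.
observed = {0: 24, 1: 21, 2: 11, 3: 3, 4: 0, 5: 1}
n = sum(observed.values())                        # number of periods
n0 = observed[0]                                  # periods with a recorded zero
total = sum(k * m for k, m in observed.items())   # total recorded events


def loglik(lam, p):
    """Log-likelihood under the assumed zero-inflated Poisson model."""
    ll = 0.0
    for k, m in observed.items():
        if m == 0:
            continue
        pois = math.exp(-lam) * lam ** k / math.factorial(k)
        prob = p + (1.0 - p) * pois if k == 0 else (1.0 - p) * pois
        ll += m * math.log(prob)
    return ll


def fit_em(lam=1.0, p=0.5, tol=1e-12, max_iter=10000):
    """EM iterations for (lam, p); starting values are arbitrary."""
    for _ in range(max_iter):
        # E-step: probability that a recorded zero is due to the fault
        z = p / (p + (1.0 - p) * math.exp(-lam))
        # M-step: update p and lam given the expected number of faulty periods
        p_new = n0 * z / n
        lam_new = total / (n - n0 * z)
        converged = abs(p_new - p) + abs(lam_new - lam) < tol
        lam, p = lam_new, p_new
        if converged:
            break
    return lam, p


lam_hat, p_hat = fit_em()
lam_null = total / n  # Poisson MLE when the counter never fails (p = 0)
lrt = 2.0 * (loglik(lam_hat, p_hat) - loglik(lam_null, 0.0))
print(lam_hat, p_hat, lrt)
```

When interpreting the likelihood ratio statistic, note that p = 0 lies on the boundary of the parameter space, so the usual chi-squared reference distribution with one degree of freedom does not apply directly.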
Good Luck!