Final Project

The deadline is Wednesday, 25 July 2012, 15:00.
Try to provide comprehensive but concise arguments; use all the techniques you have learned (various tests, plots, etc.) to justify your conclusions.
Before starting the exam, please sign (by email) the following declaration.

Question 1 (theoretical warm-up).

A hospital planned to carry out a medical study on a large sample of patients to investigate a possible association between a certain disease D and patients' characteristics x (e.g. age, sex, smoking status, etc.). However, due to budget cuts it was decided to select a smaller sample from the original one. Let

D_i = 1 if the i-th patient has the disease, D_i = 0 otherwise;
x_i is the vector of characteristic values for the i-th patient (fixed and known);
S_i = 1 if the i-th patient is selected for the smaller sample, S_i = 0 otherwise.
For the selected patients, a logistic regression model was fitted:

log( P(D_i=1 | x_i, S_i=1) / P(D_i=0 | x_i, S_i=1) ) = a + b'x_i

Had there been no budget limitations, one would naturally be interested in fitting the logistic regression to the whole data set:

log( P(D_i=1 | x_i) / P(D_i=0 | x_i) ) = a^* + (b^*)'x_i

Suppose the proportion of selected patients was the same in both groups, those with and those without the disease (say, r). What is the connection between the coefficients of the two models, i.e. between a, b and a^*, b^*?
Repeat the previous paragraph for the case where the proportions of selected patients differ between patients with and without the disease (say, P(S_i=1 | D_i=1) = r_1, while P(S_i=1 | D_i=0) = r_0).
Comment on the results and draw conclusions. How will the decreased sample size affect the model fit?
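A derived relation between (a, b) and (a^*, b^*) can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not part of the question: a single binary covariate x (so that the fitted logistic coefficients coincide with empirical log-odds), and arbitrary hypothetical values for a_star, b_star, r1 and r0. The function names are made up for this example.

```python
import math
import random


def log_odds(cases, controls):
    """Empirical log-odds of disease from counts of diseased / non-diseased."""
    return math.log(cases / controls)


def simulate(n=200_000, a_star=-1.0, b_star=0.8, r1=0.5, r0=0.1, seed=1):
    """Simulate the full sample and the selected subsample.

    Returns {"full": (intercept, slope), "selected": (intercept, slope)},
    where the pairs are the log-odds at x=0 and the log-odds difference
    between x=1 and x=0 (the saturated logistic fit for a binary covariate).
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # counts[sample][x][d] = number of patients with covariate x and status d
    counts = {"full": [[0, 0], [0, 0]], "selected": [[0, 0], [0, 0]]}
    for _ in range(n):
        x = rng.randint(0, 1)
        p_disease = 1.0 / (1.0 + math.exp(-(a_star + b_star * x)))
        d = 1 if rng.random() < p_disease else 0
        counts["full"][x][d] += 1
        # Selection probability depends on disease status only.
        keep_prob = r1 if d == 1 else r0
        if rng.random() < keep_prob:
            counts["selected"][x][d] += 1
    result = {}
    for name, c in counts.items():
        intercept = log_odds(c[0][1], c[0][0])
        slope = log_odds(c[1][1], c[1][0]) - intercept
        result[name] = (intercept, slope)
    return result


if __name__ == "__main__":
    for name, (intercept, slope) in simulate().items():
        print(f"{name:9s} intercept={intercept:.3f} slope={slope:.3f}")
```

Running it for several choices of r1 and r0 (including r1 = r0) gives a quick empirical check of whatever relation you derive analytically.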

Question 2.

The data below are the number of cases of lung cancer and the number of "man-years at risk" in a very large British study of smoking among men (sorry, girls...) and its effect on lung cancer. The table is classified by number of years of smoking in five-year intervals, beginning at 15-19 and up to 55-59, and by the equivalent number of cigarettes smoked per day, in the intervals shown in the table below. The data are in the form r/n, where r is the number of lung cancer cases and n the number of man-years at risk.

Years of smoking			Cigs/day
	1-9	10-14	15-19	20-24	25-34	35+
15-19	0/3121	0/3577	0/4317	0/5683	0/3042	0/670
20-24	0/2937	1/3286	0/4214	1/6385	1/4050	0/1166
25-29	0/2288	1/2546	0/3185	1/5483	4/4290	0/1482
30-34	0/2015	2/2219	4/2560	6/4687	9/4268	4/1580
35-39	1/1648	0/1826	0/1893	5/3646	9/3529	6/1336
40-44	2/1310	1/1886	2/1334	12/2411	11/2424	10/924
45-49	0/927	2/988	2/849	9/1567	10/1409	7/556
50-54	3/710	4/684	2/470	7/857	5/663	4/255
55-59	0/606	3/449	5/280	7/416	3/284	1/104

For these data, find a well-fitting parsimonious model relating the proportion suffering from lung cancer to smoking rate and years of smoking. Give the interpretation of your model in terms of the risk of developing lung cancer.
What are the chances of developing lung cancer for a man who has smoked 20 cigarettes per day for the last 40 years? (Give a point estimate and the corresponding confidence interval.)
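As a starting point for the analysis, the table can be turned into one record per cell. The following is a minimal data-entry sketch, not a model fit; the variable and field names (records, cases, at_risk, crude_rate) are hypothetical choices for this example.

```python
# Row and column labels exactly as in the table above.
YEARS = ["15-19", "20-24", "25-29", "30-34", "35-39",
         "40-44", "45-49", "50-54", "55-59"]
CIGS = ["1-9", "10-14", "15-19", "20-24", "25-34", "35+"]

# Cells in the form r/n, one table row per line.
RAW = """
0/3121 0/3577 0/4317 0/5683 0/3042 0/670
0/2937 1/3286 0/4214 1/6385 1/4050 0/1166
0/2288 1/2546 0/3185 1/5483 4/4290 0/1482
0/2015 2/2219 4/2560 6/4687 9/4268 4/1580
1/1648 0/1826 0/1893 5/3646 9/3529 6/1336
2/1310 1/1886 2/1334 12/2411 11/2424 10/924
0/927 2/988 2/849 9/1567 10/1409 7/556
3/710 4/684 2/470 7/857 5/663 4/255
0/606 3/449 5/280 7/416 3/284 1/104
"""

records = []
for years, line in zip(YEARS, RAW.strip().splitlines()):
    for cigs, cell in zip(CIGS, line.split()):
        r, n = (int(part) for part in cell.split("/"))
        records.append({
            "years": years,
            "cigs": cigs,
            "cases": r,
            "at_risk": n,
            "crude_rate": r / n,  # observed proportion in this cell
        })

total_cases = sum(rec["cases"] for rec in records)
total_at_risk = sum(rec["at_risk"] for rec in records)

if __name__ == "__main__":
    print(len(records), "cells,", total_cases, "cases in total")
```

Keeping the cells as r/n strings and parsing them avoids retyping the numbers into two separate arrays.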

Question 3.

In a study, male and female drivers were interviewed about the importance to them of various vehicle safety features when buying a car. The table below shows the ratings for air conditioning according to the sex and age of the driver.

Sex	Age	No or little importance	Important	Very important	Total
Women	18-23	26	12	7	45
	24-40	9	21	15	45
	>40	5	14	41	60
Men	18-23	40	17	8	65
	24-40	17	15	12	44
	>40	8	15	18	41
Total		105	94	101	300

Look at the data and try to draw some preliminary conclusions (conjectures?).
Fit an appropriate model for these data. Do the ratings change with age in the same way in both sex groups? Does sex have any influence at all? Can you say that the ratings do not change with age?
Can you exploit the fact that the response variable for these data is an ordinal categorical variable? If "yes", fit the corresponding model. Is it adequate for the data? Return to all the questions from the second paragraph.
Compare the results from both models. In particular, compare the estimated probabilities with the observed proportions. Make final conclusions and comment on the results of the study.
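The observed proportions needed for that comparison can be computed directly from the table. The sketch below only tabulates the data; the names COUNTS, observed_proportions and cumulative_proportions are hypothetical, and the cumulative version is included on the assumption that it is a convenient summary when the ordinal structure is used.

```python
RATINGS = ("no_or_little", "important", "very_important")

# Counts per (sex, age group), in the order of RATINGS, as in the table above.
COUNTS = {
    ("Women", "18-23"): (26, 12, 7),
    ("Women", "24-40"): (9, 21, 15),
    ("Women", ">40"): (5, 14, 41),
    ("Men", "18-23"): (40, 17, 8),
    ("Men", "24-40"): (17, 15, 12),
    ("Men", ">40"): (8, 15, 18),
}


def observed_proportions(counts):
    """Row-wise proportions of each rating within every sex/age group."""
    return {
        group: tuple(c / sum(row) for c in row)
        for group, row in counts.items()
    }


def cumulative_proportions(counts):
    """Row-wise cumulative proportions P(rating <= category)."""
    result = {}
    for group, row in counts.items():
        total = sum(row)
        running = 0
        cumulative = []
        for c in row:
            running += c
            cumulative.append(running / total)
        result[group] = tuple(cumulative)
    return result


if __name__ == "__main__":
    for group, props in observed_proportions(COUNTS).items():
        print(group, [round(p, 3) for p in props])
```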

Question 4.

In an experiment to investigate the social behavior of hornets, different numbers of hornets were placed in boxes, and the number of cells built by the hornets was counted. Below are given the data from 38 boxes: the number of cells built for each number of hornets.

No. of hornets	No. of cells
1	0, 1, 2, 2, 4, 4, 5, 10, 11, 18
2	0, 4, 5, 7, 8, 13, 18, 29
5	7, 8, 17, 18, 19
6	17
10	12, 17, 18, 23, 25, 32
12	21
16	12
19	23
20	21, 23, 30, 31
41	30

Assuming a normal model with constant variance, find the appropriate transformation of the number of cells using log(no. of hornets) as an explanatory variable. Analyse the results of fitting and point out problems you have found (if any).
If you are not satisfied with the resulting fit in the previous paragraph, you can probably modify the model by allowing heterogeneity of the variance, assuming that it is also a function of no. of hornets or log(no. of hornets) respectively. Compare your final model with that from the previous paragraph. Is the assumption of equal variances reasonable?
Assume the Poisson model for the number of cells and fit the corresponding regression model for no. of hornets or log(no. of hornets). Comment on the fit of the Poisson model, and compare the results with those from previous paragraphs. Is there overdispersion? If "yes", modify your original Poisson model. Make final conclusions.
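Before fitting any of these models, it may help to put the data in long format and look at the sample mean and variance within each group that has replicates. The sketch below does only that; the names CELLS_BY_HORNETS, records and group_summary are hypothetical, and the summary is an exploratory aid rather than a formal test of equal variances or overdispersion.

```python
import math
import statistics

# Number of cells built, keyed by the number of hornets in the box.
CELLS_BY_HORNETS = {
    1: [0, 1, 2, 2, 4, 4, 5, 10, 11, 18],
    2: [0, 4, 5, 7, 8, 13, 18, 29],
    5: [7, 8, 17, 18, 19],
    6: [17],
    10: [12, 17, 18, 23, 25, 32],
    12: [21],
    16: [12],
    19: [23],
    20: [21, 23, 30, 31],
    41: [30],
}

# Long format: one (hornets, log_hornets, cells) tuple per box.
records = [
    (hornets, math.log(hornets), cells)
    for hornets, cell_counts in CELLS_BY_HORNETS.items()
    for cells in cell_counts
]

# Sample mean and (unbiased) sample variance for groups with replicates.
group_summary = {
    hornets: (statistics.mean(cell_counts), statistics.variance(cell_counts))
    for hornets, cell_counts in CELLS_BY_HORNETS.items()
    if len(cell_counts) >= 2
}

if __name__ == "__main__":
    print(len(records), "boxes")
    for hornets, (mean, var) in group_summary.items():
        print(f"hornets={hornets:2d} mean={mean:.2f} variance={var:.2f}")
```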

Question 5.

Observations of a Poisson process are recorded on a counter, which records the number of events occurring in independent consecutive 1-second periods. The counter has a random fault, and in some random 1-second periods does not record the correct count, but records a zero count instead.

In a one-minute period, the following distribution of counts is observed:

Count	0	1	2	3	4	5
Number	24	21	11	3	0	1

Let λ be the mean number of events occurring per second, and p the probability that the counter is not operating.
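One possible reading of this description, stated here as an assumption, is a zero-inflated Poisson model: in each second the counter fails with probability p and records 0, and otherwise records a Poisson(λ) count, independently across seconds. Under that assumption the log-likelihood of the observed table can be sketched as follows; the function names are hypothetical.

```python
import math

# Observed frequencies: count value -> number of 1-second periods.
OBSERVED = {0: 24, 1: 21, 2: 11, 3: 3, 4: 0, 5: 1}


def zip_pmf(k, lam, p):
    """P(recorded count = k) under the assumed zero-inflated Poisson model."""
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    inflation = p if k == 0 else 0.0
    return inflation + (1.0 - p) * poisson


def log_likelihood(lam, p, observed=OBSERVED):
    """Log-likelihood of (lam, p) given the table of observed frequencies."""
    return sum(
        freq * math.log(zip_pmf(k, lam, p))
        for k, freq in observed.items()
        if freq > 0  # cells with zero frequency contribute nothing
    )


if __name__ == "__main__":
    n_periods = sum(OBSERVED.values())
    sample_mean = sum(k * f for k, f in OBSERVED.items()) / n_periods
    print(n_periods, "periods, sample mean", sample_mean)
    print("log-likelihood at lam=1.0, p=0.1:", log_likelihood(1.0, 0.1))
```

Setting p = 0 reduces the function to the ordinary Poisson log-likelihood, which gives a simple sanity check on the implementation.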