THEORETICAL EXAM


  • The deadline for submission is 2 February 2016, 14:00.
  • Before starting the exam, please sign (by email) the following declaration valid for the theoretical exam and the final project as well.

Question 1

Consider a linear regression model with p explanatory variables:
$y = X\beta + \varepsilon$,

where $X$ is an $n \times p$ matrix, $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 I_n$ ($\sigma$ is unknown). We want to test a general linear hypothesis $H_0: A\beta = c$, where $A$ is an $r \times p$ matrix ($r \le p$) and $c$ is an $r$-dimensional vector.

  1. Show that the following null-hypotheses are particular cases of this general case (derive the corresponding matrix A, r and vector c):
    1. $\beta_1 = \dots = \beta_k = 0$
    2. $\beta_1 = \dots = \beta_k$
    3. $\beta_1 = 8$
    4. $\beta_1 = 8$, $\beta_3 = 2\beta_2 - 5$, $\beta_{24} = 4\beta_5$
  2. Find the OLS estimators of $\beta$ under $H_0$.
  3. Using the results of the previous item and, in addition, assuming that the $\varepsilon_i$ are normally distributed, derive the corresponding test of $H_0$ (try to simplify the final formula as much as you can).
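
As an illustration of the notation only: the hypothesis below is not one of the four listed above, and $p = 3$ is an assumption made purely for the example. The pair of restrictions $\beta_1 + \beta_2 = 1$ and $\beta_3 = 0$ can be written in the form $A\beta = c$ with $r = 2$:

```latex
\[
A = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad
c = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
A\beta = \begin{pmatrix} \beta_1 + \beta_2 \\ \beta_3 \end{pmatrix} = c .
\]
```

Each row of $A$ encodes one linear restriction, and the matching entry of $c$ is its right-hand side.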

Question 2

Suppose
$y = X\beta + \varepsilon$,

where the $\varepsilon_i$ are i.i.d. with zero means and common variance $\sigma^2$. We wish to estimate $x_0'\beta$ at some point $x_0$ based on the ordinary least squares (OLS) estimator $\beta^*$ of $\beta$, that is, by $x_0'\beta^*$.

  1. Find the mean squared error (MSE) of the resulting estimator.
  2. Show that there does not exist any other linear unbiased estimator of $x_0'\beta$ with a smaller variance, i.e. that this estimator is BLUE.
  3. Can you claim that this is the best unbiased estimator among all possible unbiased estimators? Will the additional assumption of normality of ε's be helpful?
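
To fix ideas about what the estimator $x_0'\beta^*$ looks like in the simplest case, here is a minimal sketch for a straight-line model with an intercept, so that $x_0 = (1, t_0)'$. The data values and the helper names `ols_simple` and `predict` are made up for illustration; the sketch only computes the point estimate and says nothing about its MSE or optimality.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the least-squares line through the points (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(t0, coef):
    """Evaluate x0' beta* for x0 = (1, t0)' and beta* = (intercept, slope)'."""
    intercept, slope = coef
    return intercept + slope * t0


# Made-up data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
coef = ols_simple(x, y)
y0_hat = predict(2.5, coef)  # estimate of x0' beta at t0 = 2.5
```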

Question 3

Consider the one-way analysis of variance model with the same numbers of observations n in each of m groups:
$y_{ij} = \mu + \alpha_j + \varepsilon_{ij}, \quad i=1,\dots,n;\ j=1,\dots,m,$

where the $\varepsilon_{ij}$ are i.i.d. normal variables with zero means and variance $\sigma^2$.

  1. Treat the group factor as a fixed effect.
    1. Find the MLE for $\mu$, $\alpha_j$ and $\sigma^2$.
    2. Formulate the hypothesis for testing the homogeneity among groups. What is the corresponding test?
  2. Treat the group factor as a random effect.
    1. Define an appropriate model and find the MLE for its parameters.
    2. Formulate the hypothesis for testing the homogeneity among groups in terms of the random effects model and derive an appropriate test statistic.
  3. Explain the `conceptual' differences between fixed and random effects models. Give examples where it is reasonable to treat a group factor as a fixed effect and as a random effect, respectively.
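
For numerical experiments with the balanced layout above, the familiar ratio of between-group to within-group mean squares can be computed as in the following sketch. It is an illustration under stated assumptions, not a derivation: the function name `one_way_f` and the data are made up, and the ratio is simply the standard one-way ANOVA statistic, with $m-1$ and $m(n-1)$ degrees of freedom.

```python
def one_way_f(groups):
    """F ratio for a balanced one-way layout.

    groups is a list of m lists, each holding the n observations of one group.
    """
    m = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("a balanced design (equal group sizes) is expected")
    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / m  # equals the overall mean when the design is balanced
    ss_between = n * sum((gm - grand_mean) ** 2 for gm in group_means)
    ss_within = sum((v - gm) ** 2 for g, gm in zip(groups, group_means) for v in g)
    ms_between = ss_between / (m - 1)
    ms_within = ss_within / (m * (n - 1))
    return ms_between / ms_within


# Made-up data: m = 3 groups, n = 3 observations each.
f_stat = one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```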

Question 4

Consider the following linear growth curve model with random intercept and slope, where n measurements are taken repeatedly on each of m individuals over time, that is
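$y_{ij} \mid \beta_{0j}, \beta_{1j} = \beta_{0j} + \beta_{1j} t_i + \varepsilon_{ij}, \quad i=1,\dots,n;\ j=1,\dots,m,$

where $\varepsilon_{ij} \sim N(0, \sigma^2)$, $\beta_{0j} \sim N(\beta_0^*, \sigma_0^2)$, $\beta_{1j} \sim N(\beta_1^*, \sigma_1^2)$, and all $\varepsilon_{ij}$, $\beta_{0j}$ and $\beta_{1j}$ are independent.

Find the joint marginal distribution of the data $y_{ij}$, $i=1,\dots,n;\ j=1,\dots,m$.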
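
A simulation of the model can serve as a sanity check for whatever distribution is derived, for example by comparing empirical means and covariances with the theoretical ones. The sketch below only generates data from the model as stated. The function name `simulate_growth`, the parameter values and the seed are assumptions chosen for illustration.

```python
import random


def simulate_growth(times, m, beta0_star, beta1_star, sigma0, sigma1, sigma, seed=0):
    """Draw y[j][i] = b0_j + b1_j * t_i + e_ij for m individuals at the given times.

    b0_j ~ N(beta0_star, sigma0^2), b1_j ~ N(beta1_star, sigma1^2), e_ij ~ N(0, sigma^2),
    all independent. Note that random.gauss takes standard deviations, not variances.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        b0 = rng.gauss(beta0_star, sigma0)  # individual intercept
        b1 = rng.gauss(beta1_star, sigma1)  # individual slope
        data.append([b0 + b1 * t + rng.gauss(0.0, sigma) for t in times])
    return data


# Illustrative values only: n = 4 time points, m = 5 individuals.
y = simulate_growth(times=[0.0, 1.0, 2.0, 3.0], m=5,
                    beta0_star=10.0, beta1_star=2.0,
                    sigma0=1.0, sigma1=0.5, sigma=0.3)
```

Repeating the simulation many times with different seeds gives Monte Carlo estimates against which a derived mean vector and covariance matrix can be compared.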

Question 5

A hospital planned to carry out a medical study on a large sample of people to investigate a possible association between a certain disease D and personal characteristics x (e.g. age, sex, smoking status, etc.). However, due to budget cuts it was decided to select a smaller sample from the original one. Let
$D_i = 1$ if the $i$-th person has the disease, $D_i = 0$ otherwise;
$x_i$ be the vector of characteristics of the $i$-th person (fixed and known);
$S_i = 1$ if the $i$-th person is selected into the smaller sample for the study, $S_i = 0$ otherwise.
For the selected sample, a logistic regression model was fitted:
$\log \frac{P(D_i=1 \mid x_i, S_i=1)}{P(D_i=0 \mid x_i, S_i=1)} = a + b'x_i$

Were it not for the budget limitations, one would naturally be interested in fitting the logistic regression to the whole original large sample:

$\log \frac{P(D_i=1 \mid x_i)}{P(D_i=0 \mid x_i)} = a^* + (b^*)'x_i$
  1. Suppose the proportion of people selected into the smaller sample was the same in both groups (say, $r$). What is the connection between the coefficients of the two models, i.e. between $a, b$ and $a^*, b^*$?
  2. Repeat the previous item for the case where the proportions of selected people differ between people with and without the disease (say, $P(S_i=1 \mid D_i=1) = r_1$, while $P(S_i=1 \mid D_i=0) = r_0$).
  3. Comment on the results and draw conclusions. How will the decreased sample size affect the model fit?
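
A quick numerical check with Bayes' rule can be used to test a conjectured relationship between the two sets of coefficients before proving it. The sketch below assumes that selection depends on the disease status only (not on $x_i$ once $D_i$ is given). The function names and the probabilities are made up for illustration.

```python
import math


def log_odds(p):
    """Log-odds of an event with probability p."""
    return math.log(p / (1.0 - p))


def selected_log_odds(p_disease, r1, r0):
    """Log-odds of disease within the selected sample, by Bayes' rule.

    p_disease = P(D=1 | x), r1 = P(S=1 | D=1), r0 = P(S=1 | D=0).
    Assumption: given D, selection does not depend on x.
    """
    # P(D=1 | x, S=1) is proportional to p_disease * r1, P(D=0 | x, S=1) to (1 - p_disease) * r0.
    return math.log((p_disease * r1) / ((1.0 - p_disease) * r0))


p = 0.2                                      # illustrative P(D=1 | x) in the full sample
full = log_odds(p)                           # log-odds in the full sample
equal = selected_log_odds(p, 0.3, 0.3)       # same selection rate in both groups
unequal = selected_log_odds(p, 0.5, 0.1)     # different selection rates
```

Evaluating `equal - full` and `unequal - full` for several values of `p` shows how the log-odds change, if at all, under each selection scheme.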

And for dessert, here is the story of Tom Statman, a statistician from London, which essentially could happen to any of us... Enjoy the reading, but each time a question appears, please stop there for a moment - your consulting is badly needed. Your short but comprehensive assistance will be highly appreciated.

One day from Mr. Statman's statistical practice

It was one of those days when everything goes wrong right from the morning... It started with a telephone call at 6.00 am. Tom Statman's client, Dr. Fleming, had been in the hospital all night waiting impatiently for the test results of his patients treated according to his new method. The moment he got them from the laboratory he called Tom at home, though it was 6.00 am, begging him to perform a statistical analysis of these results and to compare them with those from the control group as soon as possible. "Please, wake up! It's urgent! It should be done before 10.00 am, I'm sending the results to your office by email right now!" - Dr. Fleming was so excited he could hardly speak. Sleepy Tom, who could hardly understand what was going on, mumbled: "Yeh... t-test... Office..." swearing in his mind at this idiot Dr. Fleming, his stupid results, all damn statistical tests and statistics in general, and hung up. But he also understood that he wouldn't get back to sleep now anyway... Tom Statman had recently started this job and each client was quite important, especially a client such as Dr. Fleming, who supplied much work and paid well... "No way... I'd better go to the office and start analysing his data" - thought Tom. The day had started in a wrong way and it was clear to him that this was not the end... He looked out of the window - it was a dark-grey drizzling cold London winter morning. Tom put coffee on the gas and looked through the window. He was still half asleep and was nodding off a little bit when he suddenly heard hissing on the gas. It was too late - the coffee had spilt on to the cooker and covered it with large dirty spots. "Damn it! My coffee! The cleaning lady was here only yesterday" - groaned Tom. He became angry at this coffee, at this damn morning, at this stupid life and started dressing...

In about ten minutes he went out of his house in a suburb of London, sat in his car and tried to start it. The engine was silent and didn't react to Tom's desperate efforts. "Come on! Come on!"... No response... It was too much... Tom put his head on the wheel and started crying... "Cab! I'll take a cab and this stupid Dr. Fleming will pay me for it!" - decided Tom.

He was in his office in the City in about half an hour. No one was there at such an early hour. Tom switched on his computer: "New mail has arrived" - it was Dr. Fleming's data. "OK, let's go!" - said Tom to himself. He had got his M.Sc. in statistics at Oxford and was a real professional. Moreover, unlike some of his colleagues he still liked his job, he liked to analyse data, to discover "hidden" connections between variables and to look at the astonished faces of his clients: "It's unbelievable. You're a genius, Mr. Statman!". At those moments he was really happy and was proud of himself. Once he started importing Dr. Fleming's data into the file he was already enthusiastic about this project. "Please, no mis-recorded data this time, please" - prayed Tom. He remembered the incident that happened to his classmates Alan Weightman and Judy Grouppy in one of the projects during those good old student days in Oxford...

In a simple linear regression one observation had been mis-recorded. To remove it from the fit in a simple way, Alan Weightman suggested using weighted regression on the whole data set, giving zero weight to the mis-recorded observation and unit weights to all others. Judy Grouppy never agreed with Alan Weightman and proposed instead to define a new factor variable, Group=1 for the mis-recorded observation and Group=0 for all others, and then to fit the model y~x+Group to the whole data set. They argued for two hours and finally agreed to ask Prof. Wiseman to settle their dispute.

Prof. Wiseman listened carefully to both sides, thought for a moment and in his typical Jewish manner replied by posing several questions of his own: "Young colleagues, a) What do you think is the meaning of the coefficient of the extra variable in Mrs. Grouppy's model? b) Could you compare the regression coefficients of the two models, and the residual sum of squares (RSS) of Judy Grouppy's model with the weighted RSS of Alan Weightman's model?"


Answer both of Prof. Wiseman's questions and comment on the results.
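
Before arguing analytically, the two proposals can be tried side by side on a toy data set. The sketch below is an illustration only: the data, the position of the mis-recorded observation and the helper names `solve` and `wls` are all made up, and the fits are computed directly from the (weighted) normal equations.

```python
def solve(a, b):
    """Solve the square linear system a z = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            factor = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * k
    for r in reversed(range(k)):
        z[r] = (m[r][k] - sum(m[r][c] * z[c] for c in range(r + 1, k))) / m[r][r]
    return z


def wls(rows, y, w):
    """Minimise sum_i w_i * (y_i - rows_i . beta)^2; return (beta, weighted RSS)."""
    p = len(rows[0])
    xtwx = [[sum(wi * r[j] * r[k] for r, wi in zip(rows, w)) for k in range(p)]
            for j in range(p)]
    xtwy = [sum(wi * r[j] * yi for r, yi, wi in zip(rows, y, w)) for j in range(p)]
    beta = solve(xtwx, xtwy)
    rss = sum(wi * (yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi, wi in zip(rows, y, w))
    return beta, rss


# Made-up data; the third value plays the role of the mis-recorded observation.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 60.0, 8.1, 9.9]
bad = 2

# Alan's proposal: model y ~ x, weight 0 for the bad observation and 1 elsewhere.
weights = [0.0 if i == bad else 1.0 for i in range(len(x))]
beta_w, rss_w = wls([[1.0, xi] for xi in x], y, weights)

# Judy's proposal: model y ~ x + Group with unit weights.
rows_g = [[1.0, xi, 1.0 if i == bad else 0.0] for i, xi in enumerate(x)]
beta_g, rss_g = wls(rows_g, y, [1.0] * len(x))
```

Printing `beta_w`, `beta_g`, `rss_w` and `rss_g` gives the quantities Prof. Wiseman asks to compare.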

The data transfer was over. As usual Tom started with the visual analysis: "Well-well... Both samples (the results of control and treatment groups) seem to be normal with similar variances... Hope a simple two-sample t-test will solve the problem". But what had happened?! The two-sample t-test, which until this morning had always worked without a hitch, today returned a strange error message every time he tried to run it: "t-test function is not found". Tom tried to call it again and again but the result was the same. "Something has happened to this stupid computer!" - Tom was almost crying. Their System Administrator was not supposed to come until about 9.30 am. Too late... It was too much for one day... Tom was in despair. Why was this happening to him? Who could help him? It was already 7.30 am... Suddenly he remembered rumors about a famous oracle, Lin No Lin. Tom always laughed at those stories but nevertheless could not forget the one told to him by his French friend Charles d'Linear:

Charles d'Linear was looking for an adequate linear model for his data, which contained a lot of explanatory variables, some of them perhaps not relevant. He wasn't a complete `amateur' in statistics and after a while he found a reasonable model, checked its $R^2$, $R^2_{adj}$, $R^2_{cv}$, performed the analysis of residuals, etc. - everything seemed OK, but he still felt uncomfortable with these `half-heuristic' goodness-of-fit indicators and was keen to check the adequacy of his model by some statistical goodness-of-fit test.


Charles d'Linear has heard good recommendations about you and has asked you to help him find such a test. Can you help Charles?

Desperate Charles went to Lin No Lin. The oracle listened carefully to d'Linear's problems and promised to try to help him. After Charles had gone, Lin No Lin entered into deep meditation and in his vision the true value of the noise variance $\sigma^2$ was revealed to him.


Can this miraculous revelation help Charles d'Linear? If `yes', then how?

Tom already had his hand on the telephone book to start searching for Lin No Lin when a sudden idea crossed his mind: "Linear... linear regression... Yes! Linear regression! I can use the linear regression function to perform this damn t-test!" - Tom was dancing. He was now so ashamed of his momentary weakness and his readiness to call this charlatan Lin No Lin. It was not the first time that linear regression had helped Tom. In one of his projects...

In one of his projects Tom faced a nonlinear model

$y = b_0 + a x_1 x_3 + b x_2 x_3 + a k x_1 + b k x_2 + \varepsilon$,

with the unknown parameters $b_0$, $a$, $b$, $k$. He did not have any nonlinear regression function then but, anyway, managed to find OLS estimates for all the parameters by running a series of linear regressions.

Is it a non-linear model indeed? What was Tom's idea? Did he get analytical or only numerical solutions?

Tom jumped up to his computer and ran the linear regression. He saw the results, turned back on his chair and started singing his favourite Queen song "We are the Champions, my friend...". Life was not so awful, after all...
  1. What was Tom's enthusiasm based on (if anything)? What kind of linear model was Tom running, and how could he (or how did he think he could) use the results of this linear regression for his original t-test?
  2. Did he need the standard two-sample t-test assumption of equality of the variances?
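
One way to experiment with Tom's idea is to compute, on the same toy data, the classical pooled two-sample t statistic and the t statistic of the slope in a simple regression on a 0/1 group indicator. The sketch below makes that comparison possible. The data and the function names `pooled_t` and `slope_t` are made up for illustration, and the indicator coding (0 = control, 1 = treatment) is an assumption.

```python
import math


def pooled_t(sample_a, sample_b):
    """Classical two-sample t statistic (mean of b minus mean of a) with pooled variance."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    ss_a = sum((v - mean_a) ** 2 for v in sample_a)
    ss_b = sum((v - mean_b) ** 2 for v in sample_b)
    s2 = (ss_a + ss_b) / (na + nb - 2)  # pooled variance estimate
    return (mean_b - mean_a) / math.sqrt(s2 * (1.0 / na + 1.0 / nb))


def slope_t(x, y):
    """t statistic of the slope in the simple linear regression y ~ x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)  # residual variance estimate
    return slope / math.sqrt(s2 / sxx)


# Made-up measurements, for illustration only.
control = [5.1, 4.8, 5.6, 5.0]
treated = [6.2, 5.9, 6.8, 6.1, 6.4]
group = [0.0] * len(control) + [1.0] * len(treated)  # 0 = control, 1 = treatment

t_regression = slope_t(group, control + treated)
t_two_sample = pooled_t(control, treated)
```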

Good Luck!