Exercise 3

Question 1.

The file Bees.dat contains data on the "working activities" of bees in a beehive (Hebrew: "kaveret") as a function of the time of day. One important characteristic of these "working activities" is the number of bees leaving the hive for outside activities. The data, collected over several successive non-rainy days, contain the case number (irrelevant), the number of bees that left the hive and the time of day (in hours).
  1. Plot the original data, Time vs. log(Number) and any other plots you find relevant, and suggest a reasonable parametric model.
  2. Fit the model you think is appropriate from the initial data analysis. Do the results of the fit support your original model?
  3. If you are not satisfied, try a nonparametric estimate. Comment on the results.
  4. Do you think there is overdispersion? Justify your conclusions. Are there any reasonable explanations for this phenomenon? What should be changed in your model in order to allow for overdispersion? Fit the modified model(s) and compare the results with those from 1.2 and 1.3.
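A common first check for overdispersion in item 4 is the Pearson chi-square statistic divided by the residual degrees of freedom, which should be close to 1 under a Poisson model. The following is only a minimal sketch on simulated data: the column names Time and Number, the quadratic-in-time mean curve and all numeric values are assumptions made for illustration, not the contents of Bees.dat.

```r
# Simulated stand-in for Bees.dat; names and values are hypothetical.
set.seed(1)
n      <- 200
Time   <- runif(n, 6, 20)                     # time of day in hours
mu     <- exp(3 - 0.05 * (Time - 13)^2)       # assumed mean curve, peak at 13:00
Number <- rnbinom(n, size = 2, mu = mu)       # counts with extra-Poisson variation
bees   <- data.frame(Time = Time, Number = Number)

# Exploratory plot of log counts against time (1 is added to avoid log(0)).
if (interactive()) {
  plot(bees$Time, log(bees$Number + 1),
       xlab = "Time (hours)", ylab = "log(Number + 1)")
}

# Poisson log-linear model, quadratic in time.
fit.pois <- glm(Number ~ Time + I(Time^2), family = poisson, data = bees)

# Pearson estimate of the dispersion: values well above 1 suggest overdispersion.
phi.hat <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
print(phi.hat)
```

With real data, the same ratio computed from your own fitted model gives a rough indication only; it can also be inflated by a misspecified mean curve.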

Question 2.

The data set trees (part of Venables & Ripley's MASS library - see help(trees)) provides measurements of the girth, height and volume of 31 black cherry trees in Allegheny National Forest, Pennsylvania. The goal of the research was to find a suitable model for Volume as a function of Girth and Height, which can be easily measured (unlike Volume). We used this data set in class to demonstrate models with heteroscedastic variances. Now we continue the analysis of these data.
  1. Another reasonable model keeps the original explanatory variables Girth and Height (without log-transformation) but applies a cube-root transform to the response variable Volume. Justify such a model. Fit it and check its adequacy.

  2. Which assumptions of the standard linear model seem to be violated (if any)? Suggest way(s) to take these violations into account and fit the corresponding model(s). Comment on the results.
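One possible starting point for item 1 is sketched below. It relies on the observation that volume has the dimension of length cubed, so its cube root is on the same scale as Girth and Height. The object names fit.cr, res and vol.hat are introduced here for illustration only, and the diagnostics shown are a suggestion rather than a complete adequacy check.

```r
# trees is available in a standard R installation:
# Girth (inches), Height (feet), Volume (cubic feet).
data(trees)

# Linear model for the cube root of Volume in the untransformed predictors.
fit.cr <- lm(I(Volume^(1/3)) ~ Girth + Height, data = trees)
print(summary(fit.cr))

# Basic residual diagnostics: residuals vs. fitted values and a normal Q-Q plot.
res <- residuals(fit.cr)
if (interactive()) {
  par(mfrow = c(1, 2))
  plot(fitted(fit.cr), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  qqnorm(res)
  qqline(res)
}

# Fitted values back-transformed to the original Volume scale.
vol.hat <- fitted(fit.cr)^3
```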

Computational Notes for R users:
  • To fit a Negative Binomial model use either family=negative.binomial(θ) in the glm function (for a given θ) or the glm.nb function, which, in addition, finds the MLE of θ (both are in the MASS library).
  • To run quasi-likelihood models use glm(..., family=quasi(link=..., variance=...), ...) (see help(glm) for more details).
  • To fit a generalized nonparametric model use the function gam from the library mgcv.
  • To fit a χ² model use family=Gamma(link=log) when calling the glm function. As you know (or at least should know), the estimates of the regression coefficients do not depend on the degrees of freedom, but their variances and other summary statistics do. To get a correct summary for your specific degrees of freedom, define them via the dispersion parameter in the summary command:
    > model.gl <- glm(model, family=Gamma(link=log), ...)
    > summary(model.gl, dispersion=2/m)
    where m is the degrees of freedom of your specific χ² distribution (its mean is m and its variance is 2m, hence the Gamma dispersion, the variance divided by the squared mean, is 2/m).
  • The function logLik returns the maximized log-likelihood of a fitted model.
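The calls listed in these notes can be tried out on simulated data, as in the sketch below. Everything in it (the variables x, y, z and s2, the value theta = 3, the choice m = 4, the variance function "mu") is an assumption made for illustration and is unrelated to the exercise data sets. For the Gamma fit, note that a χ² variable with m degrees of freedom is Gamma with shape m/2, so the Gamma dispersion (1/shape) is 2/m.

```r
library(MASS)   # glm.nb, negative.binomial
library(mgcv)   # gam

set.seed(2)
n <- 300
x <- runif(n, 0, 2)
y <- rnbinom(n, size = 3, mu = exp(1 + 0.8 * x))   # overdispersed counts
d <- data.frame(x = x, y = y)

# Negative Binomial with theta estimated by maximum likelihood.
fit.nb <- glm.nb(y ~ x, data = d)

# Negative Binomial with theta fixed in advance (here, hypothetically, at 3).
fit.nb3 <- glm(y ~ x, family = negative.binomial(theta = 3), data = d)

# Quasi-likelihood: log link, variance proportional to the mean.
fit.q <- glm(y ~ x, family = quasi(link = "log", variance = "mu"), data = d)

# Generalized nonparametric (additive) model with a smooth term in x.
fit.gam <- gam(y ~ s(x), family = poisson, data = d)

# Maximized log-likelihoods of the two Negative Binomial fits.
print(logLik(fit.nb))
print(logLik(fit.nb3))

# Chi-square-type response: s2 = sigma^2 * chisq_m / m with log(sigma^2) linear in z.
m  <- 4
z  <- runif(n)
s2 <- exp(0.5 + z) * rchisq(n, df = m) / m
fit.gl <- glm(s2 ~ z, family = Gamma(link = "log"))

# Standard errors computed with the known dispersion 2/m instead of an estimate.
sum.gl <- summary(fit.gl, dispersion = 2 / m)
print(sum.gl)
```

Comparing summary(fit.gl) with sum.gl shows identical coefficient estimates but different standard errors, which is the point made in the note on the χ² model.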