Exercise 3
Question 1.
The file Bees.dat contains data on the "working activities" of bees in a beehive (Hebrew: "kaveret")
as a function of the time of day. One important characteristic of "working activity" is the number of bees
leaving the beehive for outside activities. The data, collected over several successive non-rainy days, contain the case
number (irrelevant), the number of bees that left the beehive and the time of day (in hours).
- Plot the original data, Time vs. log(Number) and any other plots you find relevant, and suggest a reasonable parametric model.
- Fit the model you think is appropriate from the initial data analysis. Do the results of the fit support your
original model?
- If you are not satisfied, try a nonparametric estimate. Comment on the results.
- Do you think there is overdispersion? Justify your conclusions. Are there any reasonable explanations for this
phenomenon? What should be changed in your model in order to include overdispersion? Fit the modified model(s) and
compare the results with those from 1.2 and 1.3.
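As a hedged illustration of the overdispersion check asked for above, the sketch below uses simulated counts rather than Bees.dat, whose column names are not given here. The variable names Time and Number, the quadratic-in-time mean and the negative binomial noise are all assumptions made for the example; it is not meant as the intended solution.

```r
# Simulated stand-in for Bees.dat (assumption: the real columns may be named differently).
set.seed(1)
Time   <- runif(200, 6, 20)                       # hour of the day
mu     <- exp(1 + 0.9 * Time - 0.035 * Time^2)    # hypothetical mean curve peaking near midday
Number <- rnbinom(200, mu = mu, size = 3)         # counts with extra-Poisson variation
bees   <- data.frame(Time = Time, Number = Number)

# Poisson log-linear model, quadratic in time
fit.pois <- glm(Number ~ Time + I(Time^2), family = poisson, data = bees)

# Pearson chi-square divided by the residual degrees of freedom estimates the
# dispersion; values well above 1 point to overdispersion relative to Poisson.
phi.hat <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
print(phi.hat)

# Quasi-Poisson refit: same coefficient estimates, standard errors scaled by sqrt(phi.hat)
fit.quasi <- glm(Number ~ Time + I(Time^2), family = quasipoisson, data = bees)
se.ratio  <- summary(fit.quasi)$coefficients[, "Std. Error"] /
             summary(fit.pois)$coefficients[, "Std. Error"]
print(se.ratio)
```

With real data, the same two fits can be compared after replacing the simulated data frame by one read from the file.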
Question 2.
The data set trees (part of Venables & Ripley's MASS library; see help(trees)) provides measurements of the girth,
height and volume of 31 black cherry trees in the Allegheny National Forest, Pennsylvania. The goal of the research
was to find a suitable model for Volume as a function of Girth and Height, which can be easily measured (unlike volume).
We used this data set in class to demonstrate models with heteroscedastic variances. Now we continue the analysis of these data.
- Another possible route to a reasonable model is to keep the original explanatory variables Girth and Height (without
log-transformation) but apply a cube-root transform to the response variable Volume. Justify such a model. Fit it and check its adequacy.
- Which assumptions of the standard linear model seem to be violated (if any)? Suggest way(s) to take these
violations into account and fit the corresponding model(s). Comment on the results.
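A minimal sketch of the cube-root fit follows, offered as a starting point rather than a full answer. One heuristic for the transform, stated here as an assumption, is that volume has the dimension of length cubed, so its cube root lives on the same length scale as Girth and Height. In current R distributions trees is available from the built-in datasets package, so no library call is needed.

```r
data(trees)   # 31 rows: Girth, Height, Volume

# Linear model for the cube root of Volume on the untransformed predictors
fit.cube <- lm(I(Volume^(1/3)) ~ Girth + Height, data = trees)
print(summary(fit.cube))

# Log-log model on the same data, kept only as a reference point
fit.log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Simple adequacy checks: residuals against fitted values and a normality test
res <- residuals(fit.cube)
fit <- fitted(fit.cube)
print(cor(abs(res), fit))     # a crude indicator of variance changing with the mean
print(shapiro.test(res))

# Predictions on the original scale are obtained by cubing the fitted values
vol.hat <- fit^3
```

Calling plot(fit.cube) in an interactive session gives the usual residual plots, which are more informative than the single correlation printed here.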
Computational Notes for R users:
- To fit a Negative Binomial model use either family=negative.binomial(θ) in the glm function (for
a given θ) or the glm.nb function which, in addition, finds the MLE of θ (both come with the MASS library).
- To run quasi-likelihood models use glm(..., family=quasi(link=..., variance=...)) (see help(glm) for more details).
- To fit a generalized nonparametric model use the function gam from the library mgcv.
- To fit a χ² model use family=Gamma(link=log) when calling the glm function. As you know (or at least should know), the estimates of the regression coefficients do not depend on the degrees of freedom, but their variances and other
summary statistics do. To get a correct summary for your specific degrees of freedom, set the dispersion parameter in the summary command:
- >model.gl<-glm(model,family=Gamma(link=log),...)
- >summary(model.gl, dispersion=2/m)
- where m is the degrees of freedom of your specific χ² distribution. (A χ² variable with m degrees of freedom has mean m and variance 2m, so the Gamma dispersion, the variance divided by the squared mean, is 2/m.)
- The function logLik returns the maximized log-likelihood of a fitted model.
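The calls listed in these notes can be tried out on simulated data, as in the sketch below. The data, the variable names x, y and z2, and the true parameter values are all hypothetical; the example assumes the MASS and mgcv libraries are installed (they ship with standard R distributions). The χ² part uses one degree of freedom, the case of squared normal residuals, for which the dispersion equals 2.

```r
library(MASS)   # glm.nb, negative.binomial
library(mgcv)   # gam

set.seed(2)
x <- runif(300, 0, 3)
y <- rnbinom(300, mu = exp(0.5 + 0.8 * x), size = 2)   # simulated counts, true theta = 2
d <- data.frame(x = x, y = y)

# Negative Binomial: glm.nb estimates theta by maximum likelihood ...
fit.nb  <- glm.nb(y ~ x, data = d)
# ... while glm with negative.binomial() keeps theta fixed at the supplied value
fit.nb2 <- glm(y ~ x, family = negative.binomial(fit.nb$theta), data = d)

# Quasi-likelihood with log link and variance proportional to the mean
fit.quasi <- glm(y ~ x, family = quasi(link = "log", variance = "mu"), data = d)

# Nonparametric (smooth) Poisson regression
fit.gam <- gam(y ~ s(x), family = poisson, data = d)

# Maximized log-likelihoods; a quasi fit has no full likelihood, so logLik gives NA for it
print(logLik(fit.nb))
print(logLik(fit.gam))

# Chi-square model with 1 degree of freedom: sigma^2(x) * chi^2_1, log-mean equal to x
d$z2   <- (exp(0.5 * d$x) * rnorm(300))^2
fit.gl <- glm(z2 ~ x, family = Gamma(link = log), data = d)
print(summary(fit.gl, dispersion = 2))   # dispersion fixed at 2 for one degree of freedom
```

Comparing summary(fit.gl) with and without the dispersion argument shows that only the standard errors and test statistics change, not the coefficient estimates.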