Exercise 3
Question 1.
The file Bees.dat contains data on the "working activities" of bees in
the bee-hive (Hebrew: "kaveret") as a function of the time of day.
One of the important characteristics of "working activities" is the number
of bees leaving the bee-hive for outside activities. The data, collected during
several successive non-rainy days, contain the case number (irrelevant), the
number of bees that left the bee-hive and the time of day (in hours).
- Plot the original data, Time vs. log(Number), and any other plots you find
relevant, and suggest a reasonable parametric model.
- Fit the model you think is appropriate from the initial data analysis.
Do the results of the fit support your original model?
- If you are not satisfied, try a nonparametric estimate. Comment on the results.
- Do you think there is overdispersion? Justify your conclusions.
Are there any reasonable explanations for this phenomenon? What should be changed
in your model in order to include overdispersion? Fit the modified model(s) and
compare the results with those from 1.2 and 1.3.
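As a rough illustration of the overdispersion check, the R sketch below fits a
Poisson model to simulated counts and estimates the dispersion as the Pearson
statistic divided by the residual degrees of freedom. The data, the quadratic
form of the log-mean and the variable names number and time are assumptions
made for the example; they are not taken from Bees.dat.

```r
# Simulated stand-in for the bee counts; NOT the contents of Bees.dat.
set.seed(1)
time   <- runif(200, 6, 20)                       # hour of the day (assumed range)
mu     <- exp(1 + 0.9 * time - 0.035 * time^2)    # assumed quadratic log-mean
number <- rnbinom(200, mu = mu, size = 2)         # counts with extra-Poisson variation

# Plain Poisson fit with a log link
fit.pois <- glm(number ~ time + I(time^2), family = poisson)

# Pearson statistic divided by the residual degrees of freedom;
# values well above 1 point to overdispersion
phi.hat <- sum(residuals(fit.pois, type = "pearson")^2) / df.residual(fit.pois)
phi.hat

# Quasi-Poisson gives the same coefficients as the Poisson fit,
# but its standard errors are inflated by sqrt(phi.hat)
fit.quasi <- glm(number ~ time + I(time^2), family = quasipoisson)
summary(fit.quasi)$dispersion
```

With real data, the same two commands for phi.hat can be applied to whatever
Poisson model was fitted in the earlier items.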
Question 2.
The data set trees (shipped with R in the datasets package; see help(trees))
provides measurements of girth, height and volume of 31 black cherry trees
in Allegheny National Forest, Pennsylvania. The goal of the research was
to find a suitable model for Volume as a function of Girth and Height, which
can be easily measured (unlike Volume). We have used this data set in
class to demonstrate models with heteroscedastic variances. Now we
continue the analysis of these data.
- Another possible approach to a reasonable model is to keep the original
explanatory variables Girth and Height (without log-transformation) but
apply a cube-root transformation to the response variable Volume.
Justify such a model. Fit it and check its adequacy.
- Which assumptions of the standard linear model seem to be violated (if any)?
Suggest one or more ways to account for these violations and fit the
corresponding model(s). Comment on the results.
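A minimal R sketch of the cube-root fit is given below. The geometric remark in
the comments is one possible line of justification, stated here as an
assumption rather than as the intended answer; the object name fit.cube is
arbitrary.

```r
# trees is available in a standard R installation (columns Girth, Height, Volume)
data(trees)

# Assumed motivation: if a trunk is roughly a cylinder or a cone, Volume is
# proportional to Girth^2 * Height, so Volume^(1/3) has the dimension of a
# length, like Girth and Height themselves.
fit.cube <- lm(I(Volume^(1/3)) ~ Girth + Height, data = trees)
summary(fit.cube)

# Basic adequacy checks: residuals against fitted values and a normal Q-Q plot
par(mfrow = c(1, 2))
plot(fitted(fit.cube), residuals(fit.cube), xlab = "Fitted", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(residuals(fit.cube))
qqline(residuals(fit.cube))
```

The two plots are only a starting point; any pattern in the spread of the
residuals is what the second item of this question asks you to examine.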
Computational Notes for R users:
- To fit a Negative Binomial model, use either family=negative.binomial(θ)
in the glm function (for a given θ) or the glm.nb function, which, in
addition, finds the MLE of θ. Both are in the library MASS.
- To run quasi-likelihood models, use glm(..., family=quasi(link=...,
variance=...)) (see help(glm) and help(family) for more details).
- To fit a generalized nonparametric model, use the function gam
from the library mgcv.
- To fit a χ² model, use family=Gamma(link=log) when calling the glm
function. As you know (or at least should know), the estimates of the
regression coefficients do not depend on the degrees of freedom, but their
variances and other summary statistics do. To get a correct summary for your
specific degrees of freedom, set it via the dispersion parameter in the
summary command:
    > model.gl <- glm(model, family=Gamma(link=log), ...)
    > summary(model.gl, dispersion=2/m)
  where m is the degrees of freedom of your specific χ² distribution. The
  Gamma dispersion is the variance divided by the squared mean, which for a
  χ² distribution with m degrees of freedom is 2m/m² = 2/m.
- The function logLik returns the maximized log-likelihood of a fitted
model.
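The calls mentioned in the notes above can be sketched on simulated data as
follows. The data-generating model and the object names (x, y, fit.nb and so
on) are assumptions made for the illustration only.

```r
library(MASS)   # glm.nb, negative.binomial
library(mgcv)   # gam

# Simulated counts from an assumed negative binomial model
set.seed(2)
x <- runif(150, 0, 3)
y <- rnbinom(150, mu = exp(0.5 + 0.8 * x), size = 3)

# Negative binomial: theta estimated by maximum likelihood ...
fit.nb <- glm.nb(y ~ x)
# ... and the same model refitted with theta held fixed at that estimate
fit.nb.fx <- glm(y ~ x, family = negative.binomial(fit.nb$theta))

# Quasi-likelihood with a log link and variance proportional to the mean
fit.q <- glm(y ~ x, family = quasi(link = "log", variance = "mu"))

# Smooth (nonparametric) effect of x with a Poisson response
fit.gam <- gam(y ~ s(x), family = poisson)

# Maximized log-likelihoods of the likelihood-based fits
logLik(fit.nb)
logLik(fit.gam)
```

A quasi-likelihood fit has no full likelihood behind it, so comparisons based
on logLik are meaningful only for the likelihood-based models.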