Exercise 3
Multivariate Nonparametric Regression
(warning: beware of the curse of dimensionality!)
Question 1 (theoretical warm-up).
Let y be a response variable and assume there is a
single explanatory variable x. Consider transformations of variables
F(y) and Φ(x) and, without loss of generality, assume that
F(y) and Φ(x) are centered and scaled to have zero means and unit variances.
Following the ACE methodology, we are
seeking F(y) and Φ(x) to minimize
E[(F(y) - Φ(x))²].
Show that this is equivalent to finding the so-called maximal correlation,
that is, the largest possible correlation between a function of y and
a function of x.
Question 2.
The file Air.dat contains 111
observations taken from an environmental study that measured the four variables:
ozone (surface concentration of ozone in New York, in parts per million),
radiation (solar radiation),
temperature (observed temperature, in degrees Fahrenheit) and
wind (wind speed, in miles per hour) for 111 consecutive days.
The study investigated the influence of solar radiation, temperature and wind
speed on concentration of ozone.
- Fit a linear model to the data. Does it seem adequate?
- Fit an additive nonparametric model and compare it with the linear one.
- Perform projection pursuit regression, trying various smoothing methods
and different numbers of terms (one, two, three). Choose the most reasonable
projection pursuit model.
What are the resulting explanatory variables?
Compare the results with those obtained in the previous steps.
- Apply the ACE algorithm to find transformations of the variables
that maximize the correlation. Do the resulting transformations hint at some
parametric model? (But don't try to find a black cat in a dark room, especially
if it is not there!)
- Repeat the previous item using the AVAS algorithm.
- Apply neural network and MARS (for volunteers)
estimation procedures. Comment on the results.
- Summarize the results.
Question 3.
The data in the file Diabetes.dat come from a study
of the factors affecting patterns of insulin-dependent diabetes mellitus in
children. The objective was to investigate the dependence of the level of
serum C-peptide on various other factors in order to understand the patterns
of residual insulin secretion. The response measurement is the logarithm of
C-peptide concentration (pmol/ml) at diagnosis, and the predictor measurements
are age and base deficit, a measure of acidity.
- Plot the data. Do you think that a linear model is appropriate? Verify your
initial conclusions.
- Fit additive and projection pursuit estimators, and comment on the results.
- Apply the ACE and AVAS algorithms. What response transformation does each
method suggest?
- Summarize the results.
Computational Notes for R users.
Here is a (partial) list of R functions for multivariate
nonparametric regression. See the corresponding help files
for details of their use.
- gam performs the backfitting algorithm for
additive models by spline smoothing, with an "automatically" chosen amount of
smoothing.
- ppr fits a projection pursuit estimator using smoothing splines or
the supersmoother.
Read its help page carefully.
- ace and avas from the
CRAN package acepack
perform the ACE and AVAS algorithms, respectively, with automatically chosen
smoothing parameters.
Note that to plot
the resulting transformation of any variable you should first sort its original
values. You can also get the correlation coefficients between transformed
variables (see help for details).
- nnet from the CRAN
package nnet fits neural network models.
- mars from the
CRAN
package mda fits MARS models.
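
As a rough illustration of the calls listed above, here is a minimal sketch on simulated data. It is not a solution for the exercise files. The variable names x1, x2 and y, the simulated model and the choice of two terms are assumptions made for the example, and the ACE/AVAS part only runs if the acepack package is installed. It shows a ppr fit with the supersmoother next to a linear benchmark, and the sorting step needed before plotting ACE and AVAS transformations.

```r
# Hedged sketch on simulated data; x1, x2 and y are illustrative names only.
set.seed(1)
n  <- 200
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
y  <- exp(0.5 * x1 + sin(x2) + rnorm(n, sd = 0.2))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Linear benchmark and its in-sample R-squared.
fit_lm <- lm(y ~ x1 + x2, data = dat)
r2_lm  <- summary(fit_lm)$r.squared

# Projection pursuit regression (stats package): two terms kept after
# starting from three, ridge functions estimated by the supersmoother.
fit_ppr  <- ppr(y ~ x1 + x2, data = dat, nterms = 2, max.terms = 3,
                sm.method = "supsmu")
pred_ppr <- predict(fit_ppr, newdata = dat)
r2_ppr   <- 1 - sum((dat$y - pred_ppr)^2) / sum((dat$y - mean(dat$y))^2)

# ACE and AVAS (acepack package, assumed to be installed).
if (requireNamespace("acepack", quietly = TRUE)) {
  x <- cbind(x1 = x1, x2 = x2)
  fit_ace  <- acepack::ace(x, y)
  fit_avas <- acepack::avas(x, y)

  # Sort by the original values before drawing a transformation as a line.
  o <- order(x1)
  plot(x1[o], fit_ace$tx[o, 1], type = "l",
       xlab = "x1", ylab = "ACE transformation of x1")

  oy <- order(y)
  plot(y[oy], fit_ace$ty[oy], type = "l",
       xlab = "y", ylab = "transformed y")
  lines(y[oy], fit_avas$ty[oy], lty = 2)  # AVAS response transformation

  # Correlation between the transformed response and the sum of the
  # transformed predictors.
  cor_ace <- cor(fit_ace$ty, rowSums(fit_ace$tx))
}
```

Calling summary(fit_ppr) prints the fitted projection directions, and fit_ace$rsq gives the R-squared reported by ace on the transformed scale.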