Exercise 3

Multivariate Nonparametric Regression

(Warning: beware of the curse of dimensionality!)

Question 1 (theoretical warm-up).

Let y be a response variable and assume there is a single explanatory variable x. Consider transformations F(y) and Φ(x) and, without loss of generality, assume that F(y) and Φ(x) are centered and scaled to have zero mean and unit variance. Following the ACE methodology, we seek F(y) and Φ(x) that minimize E[F(y) - Φ(x)]^2. Show that this is equivalent to finding the so-called maximal correlation, that is, the largest possible correlation between a function of y and a function of x.
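
Hint: a possible starting point, written out as a sketch rather than a full solution, is to expand the square under the standardization assumed in the question. The notation below reads the criterion as the expectation of the squared difference and writes Corr for the correlation coefficient; what remains is to argue why the standardization loses no generality.

```latex
% Assumes E F(y) = E \Phi(x) = 0 and E F(y)^2 = E \Phi(x)^2 = 1, as in the question.
\begin{align*}
\mathbb{E}\bigl[(F(y) - \Phi(x))^2\bigr]
  &= \mathbb{E}\,F(y)^2 - 2\,\mathbb{E}\bigl[F(y)\,\Phi(x)\bigr] + \mathbb{E}\,\Phi(x)^2 \\
  &= 2 - 2\,\mathbb{E}\bigl[F(y)\,\Phi(x)\bigr] \\
  &= 2 - 2\,\operatorname{Corr}\bigl(F(y), \Phi(x)\bigr).
\end{align*}
```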

Question 2.

The file Air.dat contains 111 observations from an environmental study that measured four variables on 111 consecutive days: ozone (surface concentration of ozone in New York, in parts per million), radiation (solar radiation), temperature (observed temperature, in degrees Fahrenheit) and wind (wind speed, in miles per hour). The study investigated the influence of solar radiation, temperature and wind speed on ozone concentration.
  1. Fit a linear model to the data. Does it seem adequate?
  2. Fit an additive nonparametric model and compare it with the linear one.
  3. Perform projection pursuit regression, trying various smoothing methods and different numbers of terms (one, two, three). Choose the most reasonable projection pursuit model. What are the resulting explanatory variables? Compare the results with those obtained in the previous steps.
  4. Apply the ACE algorithm to find transformations of the variables that maximize the correlation. Do the resulting transformations hint at a parametric model? (But don't try to find a black cat in a dark room, especially if it is not there!)
  5. Repeat the previous item using the AVAS algorithm.
  6. Apply neural network and MARS (for volunteers) estimation procedures. Comment on the results.
  7. Summarize the results.
  7. Summarize the results.
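
As background for items 4 and 5, the alternating idea behind ACE can be sketched in a few lines. The example below is only an illustration in Python/NumPy, not the implementation of any R package and not a solution for Air.dat: the data are synthetic, the smoother is a crude running mean over the sorted sample, and the names running_mean_smooth, standardize and ace_one_predictor are hypothetical helpers introduced here.

```python
import numpy as np


def running_mean_smooth(x, z, window=31):
    """Crude estimate of E[z | x]: moving average of z over the x-sorted sample.

    `window` must be odd so that the output has the same length as the input.
    """
    order = np.argsort(x, kind="stable")
    half = window // 2
    # Repeat the edge values so the moving average is defined at the ends.
    padded = np.pad(z[order], half, mode="edge")
    smoothed_sorted = np.convolve(padded, np.ones(window) / window, mode="valid")
    smoothed = np.empty_like(smoothed_sorted)
    smoothed[order] = smoothed_sorted  # put values back in the original order
    return smoothed


def standardize(v):
    """Center to zero mean and scale to unit variance."""
    return (v - v.mean()) / v.std()


def ace_one_predictor(x, y, n_iter=20, window=31):
    """Alternate the two conditional-expectation updates for one predictor."""
    f = standardize(y)  # start from the standardized response
    for _ in range(n_iter):
        phi = running_mean_smooth(x, f, window)  # phi(x) <- E[F(y) | x]
        phi = phi - phi.mean()
        # F(y) <- E[phi(x) | y], rescaled to zero mean and unit variance
        f = standardize(running_mean_smooth(y, phi, window))
    phi = running_mean_smooth(x, f, window)
    return f, phi - phi.mean()


# Synthetic data (an assumption for illustration): y is linear in x on the log scale.
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=500)
y = np.exp(x + 0.3 * rng.normal(size=500))

f_y, phi_x = ace_one_predictor(x, y)
raw_corr = np.corrcoef(x, y)[0, 1]
ace_corr = np.corrcoef(f_y, phi_x)[0, 1]
print("correlation of raw variables:        ", round(raw_corr, 3))
print("correlation of transformed variables:", round(ace_corr, 3))
```

Plotting f_y against y for these synthetic data should show a roughly logarithmic curve, which is the kind of hint at a parametric form that item 4 asks about. With real data the estimated transformations also depend on the smoother and its span, so they are worth reading with some caution.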

Question 3.

The data in the file Diabetes.dat come from a study of the factors affecting patterns of insulin-dependent diabetes mellitus in children. The objective was to investigate the dependence of the level of serum C-peptide on various other factors in order to understand the patterns of residual insulin secretion. The response measurement is the logarithm of C-peptide concentration (pmol/ml) at diagnosis, and the predictor measurements are age and base deficit, a measure of acidity.
  1. Plot the data. Do you think that a linear model is appropriate? Verify your initial conclusions.
  2. Fit additive and projection pursuit estimators, and comment on the results.
  3. Apply the ACE and AVAS algorithms. What response transformation does each method suggest?
  4. Summarize the results.

Computational Notes for R users.

Here is a (partial) list of R functions for multivariate nonparametric regression. See the corresponding help files for details of their use.