Exercise 1

Prelude and Fugue in Nonparametric Regression (univariate case)

Theoretical Prelude (for warming up)

Question 1.

  1. Show that the locally constant estimator coincides with the Nadaraya-Watson kernel estimator (a sketch of the starting point is given at the end of this question).
  2. Show that if we add a linear trend to the unknown function, the resulting local linear and cubic spline estimators can be obtained from the original ones simply by adding the same linear trend. Show that for a sufficiently large sample the same is also approximately true for the Priestley-Chao kernel estimator (the kernel is assumed to be normalized and symmetric, as usual). For simplicity, suppose that the design is uniform.
  3. Assume that the unknown regression function $g \in C^m$. Consider a Priestley-Chao kernel estimator with a kernel of $m$-th order.
    1. Find the corresponding (asymptotic) bias(x), Var(x), MSE(x) and IMSE.
    2. What is the optimal choice for the bandwidth λ?
    3. What is the resulting optimal (asymptotic) IMSE?
  4. Show that a cubic spline $s(x)$ with $K$ knots $\xi_1, \ldots, \xi_K$ can be written in the form

    $s(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{j=1}^{K} \theta_j (x - \xi_j)_+^3,$
    where $f(x)_+ = f(x)$ if $f(x) > 0$, and $0$ otherwise.
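
A minimal sketch of the starting point for item 1, together with the standard form of the Priestley-Chao estimator (relevant for items 2-3) under a uniform design on $[0,1]$, quoted here for convenience; $K(\cdot)$ denotes the kernel and $\lambda$ the bandwidth:

    $\hat g_{lc}(x) = \arg\min_a \sum_{i=1}^{n} K\Big(\frac{x - x_i}{\lambda}\Big)(y_i - a)^2, \qquad \hat g_{PC}(x) = \frac{1}{n\lambda} \sum_{i=1}^{n} K\Big(\frac{x - x_i}{\lambda}\Big) y_i.$

    Solving the quadratic minimization in $a$ explicitly gives weights proportional to $K\big(\frac{x - x_i}{\lambda}\big)$, i.e. the Nadaraya-Watson form.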

Fugue in Data

Question 2.

The file dat.reg contains x and y.
  1. Fit kernel estimates to the data using several kernels (depending on the package you use - see the Computational Notes for R users below), choosing the bandwidths by GCV (a minimal R sketch of such a GCV search is given after this question). Compare the kernel estimates.
  2. Repeat the previous item using local linear regression estimates.
  3. Fit a smoothing cubic spline to the data, selecting the smoothing parameter by GCV. Compare the resulting spline estimate with those obtained in the previous items. Derive and plot 95% Bayesian error bounds for the unknown function.
  4. Add the linear trend 2x+1 to the original y and derive again the corresponding estimators from 2.1-2.3. Are you surprised by the results? (You are already 'warmed up'!)
  5. After you have finished 2.1-2.4, I can reveal the secret to you (but don't tell the others who haven't done 2.1-2.4 yet!): x and y were simply generated by $y_i = (\sin(2\pi x_i^3))^3 + \varepsilon_i$, where the $\varepsilon_i$'s are random normal noise with variance 0.01. What percentage of the true points is covered by the 95% Bayesian error bounds obtained in 2.3?
  6. Using Monte Carlo, generate 100 other random samples y at the same design points x from the function $g(x) = (\sin(2\pi x^3))^3$, adding random Gaussian noise with variance 0.01 (see the simulation sketch after this question).
    1. Estimate the pointwise squared bias, variance and MSE at each $x_i$ for the kernel, local linear and smoothing spline estimators, and plot them as functions of x. Comment on the results.
    2. Estimate the corresponding global average squared bias, variance and AMSE for the above estimators. Compare their goodness-of-fit.
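
A minimal R sketch of the GCV bandwidth search meant in 2.1 and of the GCV-driven spline fit in 2.3. The file layout (header with columns x and y), the Gaussian kernel and the bandwidth grid are assumptions; adapt them to your data and to whichever kernels your package provides.

    ## Read the data (assumed layout: header with columns x and y; adjust if needed)
    dat <- read.table("dat.reg", header = TRUE)
    x <- dat$x; y <- dat$y; n <- length(y)

    ## Nadaraya-Watson smoother matrix for a Gaussian kernel and bandwidth lambda
    nw.matrix <- function(x, lambda) {
      K <- outer(x, x, function(u, v) dnorm((u - v) / lambda))
      sweep(K, 1, rowSums(K), "/")              # each row of weights sums to one
    }

    ## GCV criterion: n * RSS / (n - tr(S))^2
    gcv <- function(lambda) {
      S <- nw.matrix(x, lambda)
      yhat <- S %*% y
      n * sum((y - yhat)^2) / (n - sum(diag(S)))^2
    }

    ## Minimize GCV over a (hypothetical) bandwidth grid and refit
    grid <- seq(0.01, 0.5, length = 50)
    lambda.gcv <- grid[which.min(sapply(grid, gcv))]
    fit.nw <- nw.matrix(x, lambda.gcv) %*% y

    ## Smoothing spline: cv = FALSE selects the smoothing parameter by GCV
    fit.ss <- smooth.spline(x, y, cv = FALSE)

    plot(x, y)
    lines(sort(x), fit.nw[order(x)], col = 2)
    lines(fit.ss, col = 4)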
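
Continuing the same sketch (it reuses x, y and n from above), one possible Monte Carlo loop for 2.6, shown for the smoothing spline only; the other estimators are handled analogously, and the true function and noise variance are those stated in 2.5.

    g <- function(x) (sin(2 * pi * x^3))^3    # true regression function from 2.5
    sigma <- sqrt(0.01)                       # noise standard deviation (variance 0.01)
    B <- 100
    fits <- matrix(NA, B, n)                  # row b holds the b-th fitted curve

    set.seed(1)                               # arbitrary seed, for reproducibility
    for (b in 1:B) {
      yb <- g(x) + rnorm(n, sd = sigma)
      fits[b, ] <- predict(smooth.spline(x, yb, cv = FALSE), x)$y
    }

    bias2 <- (colMeans(fits) - g(x))^2        # pointwise squared bias
    v     <- apply(fits, 2, var)              # pointwise variance
    mse   <- bias2 + v                        # pointwise MSE

    ## Global summaries: averages over the design points
    c(ave.bias2 = mean(bias2), ave.var = mean(v), AMSE = mean(mse))

    ord <- order(x)
    plot(x[ord], mse[ord], type = "l", ylim = c(0, max(mse)), ylab = "pointwise error")
    lines(x[ord], bias2[ord], col = 2)
    lines(x[ord], v[ord], col = 4)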

Question 3.

What can be better than a glass of good dry red wine? But did you know that it is the sort of harvested grapes and their quality (together, of course, with the wine-making process and proper keeping afterwards) that distinguish wines from different areas and vintages? One cannot mix up old deep red wines from Bordeaux, mostly made from Cabernet Sauvignon and Merlot grapes, their `competitors' from Burgundy made from Pinot Noir, young fresh Beaujolais, or wonderful Chianti made from Sangiovese grapes. Yarden's Cabernet Sauvignon is also quite different from that of Carmel (can you suggest a test to check this statement? (-: ).

The file Vineyard.dat contains data from the vineyard of a certain Chateau. The vineyard is divided into 52 rows, and the 52 observations in the data set correspond to the yields of the 1989, 1990 and 1991 harvests, measured by the total number of lugs (a lug is a basket used to carry the harvested grapes and holds about 30 pounds of grapes). The row numbers are ordered, with increasing row number reflecting movement from northwest to southeast. Rows 31-52 are shorter than rows 1-30 (100 yards long versus 120 yards). Strong winds and animals (birds and raccoons) cause more damage at the outer, exposed parts of the vineyard.

The file contains: row number (first column), number of lugs for the 1989 harvest (second column), number of lugs for the 1990 harvest (third column), and number of lugs for the 1991 harvest (fourth column).

  1. Plot the total lug count (the sum of the yields of the three harvests) as a function of row. Comment on the plot using the information about the data given above.
  2. Fit kernel, local linear, smoothing spline and supersmoother estimators to the total lug count (a minimal R sketch is given after this question). Compare the resulting estimates (don't forget the analysis of residuals!).
  3. Two rows in the data appear to be possible outliers (can you guess which ones, and what the possible reason for their being outliers is?). Omit these two rows and refit all four estimators. Do the fitted curves change much?
  4. To remove the harvest effect, define for each row three values as the differences between the number of lugs for that row and harvest year and the average number of lugs per row for that harvest year. Explain the idea behind such a correction. Plot the `corrected' total lug count as a function of row and comment on the plot. Choose whichever nonparametric estimator you prefer (kernel, loess or smoothing spline) and fit it to the `corrected' total lug count (see the sketch after this question). Comment on the results.
  5. L'Haim!
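
A minimal R sketch for 3.2 and 3.4, assuming Vineyard.dat has the four columns described above with no header; ksmooth, loess, smooth.spline and supsmu are one possible choice of the four estimators, and the tuning values below are arbitrary starting points rather than recommendations.

    ## Assumed layout: four whitespace-separated columns, no header (adjust if needed)
    vy <- read.table("Vineyard.dat", header = FALSE,
                     col.names = c("row", "y1989", "y1990", "y1991"))
    r <- vy$row
    total <- vy$y1989 + vy$y1990 + vy$y1991

    ## 3.2: four smoothers fitted to the total lug count
    plot(r, total, xlab = "row", ylab = "total lug count")
    lines(ksmooth(r, total, kernel = "normal", bandwidth = 5), col = 2)   # kernel
    lines(r, fitted(loess(total ~ r, degree = 1, span = 0.5)), col = 3)   # local linear
    lines(smooth.spline(r, total, cv = FALSE), col = 4)                   # smoothing spline (GCV)
    lines(supsmu(r, total), col = 5)                                      # supersmoother

    ## 3.4: remove the harvest effect by subtracting each year's per-row average
    Y <- as.matrix(vy[, c("y1989", "y1990", "y1991")])
    total.corr <- rowSums(sweep(Y, 2, colMeans(Y)))
    plot(r, total.corr, xlab = "row", ylab = "corrected total lug count")
    lines(smooth.spline(r, total.corr, cv = FALSE), col = 4)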


Computational Notes for R users.

Here is a (partial) list of R functions for kernel estimation, local polynomial regression and spline smoothing. See the corresponding help files for details of their use.