Exercise 1
Prelude and Fugue in Nonparametric Regression (univariate case)
Theoretical Prelude (for warming up)
Question 1.
- Show that the locally constant estimator coincides with the Nadaraya-Watson
kernel estimator.
- Show that if we add a linear trend to the unknown function,
the resulting local linear and cubic spline estimators can be obtained from the original ones
simply by adding the same linear trend.
Show that
for a sufficiently large sample the same is also approximately true for the
Priestley-Chao kernel estimator
(the kernel is assumed to be normalized and symmetric, as usual). Suppose for
simplicity that the design is uniform. (Standard forms of the Nadaraya-Watson and
Priestley-Chao estimators are recalled right after this question.)
- Assume that the unknown regression function $g \in C^m$.
Consider a Priestley-Chao kernel estimator with a kernel of $m$-th order.
- Find the corresponding (asymptotic) bias(x), Var(x), MSE(x) and IMSE.
- What is the optimal choice for the bandwidth λ?
- What is the resulting optimal (asymptotic) IMSE?
- Show that a cubic spline $s(x)$ with $K$ knots
$\xi_1,\ldots,\xi_K$ can be written in the form
$$s(x)=\beta_0+\beta_1 x+\beta_2 x^2+\beta_3 x^3+\sum_{j=1}^{K}\theta_j (x-\xi_j)_+^3,$$
where $f(x)_+=f(x)$ if $f(x) > 0$ and $0$ otherwise.
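For reference, standard forms of the two kernel estimators mentioned above are given below, under the usual convention of a normalized symmetric kernel $K$ with bandwidth $\lambda$; the Priestley-Chao form assumes a uniform design on $[0,1]$, and the notation may differ slightly from that of your lecture notes:
$$\hat g_{NW}(x)=\frac{\sum_{i=1}^n K\!\left(\frac{x-x_i}{\lambda}\right)y_i}{\sum_{i=1}^n K\!\left(\frac{x-x_i}{\lambda}\right)},
\qquad
\hat g_{PC}(x)=\frac{1}{n\lambda}\sum_{i=1}^n K\!\left(\frac{x-x_i}{\lambda}\right)y_i.$$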
Fugue in Data
Question 2.
The file dat.reg contains x and y.
- Fit kernel estimates to the data using several kernels (depending on the
package you use - see Computational Notes for R users below)
and choosing bandwidths according to GCV. Compare the resulting kernel estimates
(a short R sketch of the workflow is given after this question).
- Repeat the previous item using local linear regression estimates.
- Fit a smoothing cubic spline to the data, selecting the smoothing parameter by GCV.
Compare the resulting spline estimate with those obtained in the previous
items. Derive and plot 95% Bayesian error bounds for the unknown function.
- Add the linear trend 2x+1 to the original
y and re-derive the corresponding estimators from 2.1-2.3.
Are you surprised by the results?
(You are already 'warmed up'!)
- After you've finished 2.1-2.4 I can reveal the secret to you (but don't tell
others who still haven't done 2.1-2.4!): x and y
were simply generated by
$y_i=\left(\sin(2\pi x_i^3)\right)^3+\varepsilon_i$,
where the $\varepsilon_i$'s are random normal noise with variance 0.01.
What is the percentage
of true points covered by the 95% Bayesian error bounds obtained in 2.3?
- Using Monte Carlo, generate 100 further random samples of
y at the same design points x from the
function $g(x)=\left(\sin(2\pi x^3)\right)^3$, adding
random Gaussian noise with variance 0.01.
- Estimate the pointwise squared bias, variance and MSE at each
$x_i$ for the kernel, local linear and smoothing spline estimators,
and plot them as functions of x. Comment on the results.
- Estimate the corresponding global average squared bias, variance and
AMSE for the above estimators.
Compare their goodness-of-fit.
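A minimal R sketch of one possible workflow for this question is given below. It is only an illustration under stated assumptions, not the required solution: the format of dat.reg, the bandwidth grid, the placeholder alpha = 0.4, and the choice of the np and locfit packages are all assumptions, and plain CV is used where a package offers CV rather than GCV.

    # read the data (assumed format: two columns named x and y)
    dat <- read.table("dat.reg", header = TRUE)
    x <- dat$x; y <- dat$y

    # 2.1 kernel regression; np chooses the bandwidth by cross-validation
    library(np)
    bw <- npregbw(y ~ x)
    fit.kern <- npreg(bw)
    plot(fit.kern)

    # 2.2 local linear regression; GCV over a grid of bandwidths (locfit)
    library(locfit)
    gcv <- gcvplot(y ~ x, alpha = seq(0.2, 0.9, by = 0.05), deg = 1)
    plot(gcv)                                     # inspect the GCV scores
    fit.ll <- locfit(y ~ x, alpha = 0.4, deg = 1) # 0.4 is a placeholder for the GCV choice
    plot(fit.ll, get.data = TRUE)

    # 2.3 smoothing cubic spline; GCV is the default criterion
    fit.ss <- smooth.spline(x, y, all.knots = TRUE)
    print(fit.ss)

    # Monte Carlo part: new responses at the same design points
    g <- function(x) sin(2 * pi * x^3)^3
    fits <- replicate(100, {
      ystar <- g(x) + rnorm(length(x), sd = 0.1)  # noise variance 0.01
      predict(smooth.spline(x, ystar, all.knots = TRUE), x)$y
    })
    bias2 <- (rowMeans(fits) - g(x))^2            # pointwise squared bias
    vrs   <- apply(fits, 1, var)                  # pointwise variance
    mse   <- bias2 + vrs                          # pointwise MSE

The same Monte Carlo skeleton applies to the kernel and local linear estimators; only the inner fitting call changes.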
Question 3.
What can be better than a glass of good dry red wine?
But do you know that it is the variety and quality of the harvested grapes (together, of course, with
the wine-making process and proper keeping afterwards) that distinguish
wines from different areas and vintages? One cannot mix up old deep red wines
from Bordeaux mostly made from Cabernet Sauvignon and Merlot grapes,
their `competitors' from Burgundy made from Pinot Noir,
young fresh Beaujolais, or wonderful Chianti made from
Sangiovese grapes.
Yarden's Cabernet Sauvignon is also quite different from that of Carmel
(can you suggest any test to check this statement (-:) ?).
The file Vineyard.dat contains the data from
a vineyard of some Chateau. The vineyard is divided into 52 rows,
and the 52 observations in the data set correspond to the yields of
the harvests in 1989, 1990 and 1991 measured by the total number of lugs
(a lug is a basket that is used to carry the harvest grapes and
contains about 30 pounds of grapes). The row numbers are ordered, with
increasing row number reflecting movement from northwest to southeast.
Rows 31-52 are shorter than rows 1-30 (100 yards long versus 120 yards).
Strong winds and animals (birds and raccoons) cause more damage at the outer,
exposed parts of the vineyard.
The file contains: row number (first column), number of lugs for the 1989 harvest
(second column), number of lugs for the 1990 harvest (third column), and number of
lugs for the 1991 harvest (fourth column).
- Plot the total lug count (the sum of the yields of the three harvests) as a function of row.
Comment on the plot using the information about the data given above.
- Fit kernel, local linear, smoothing spline and supersmoother
estimators to the
total lug count. Compare the resulting estimates (don't forget analysis of
residuals!).
- Two rows in the data appear to be possible outliers (can you guess which
ones and what is the possible reason for them to be outliers?). Omit these
two rows and refit all four estimates. Do the
fitted curves change a lot?
- To remove the harvest effect, for each row define three values as the differences
between the number of lugs for that row in a given harvest year and the average number
of lugs per row for that harvest year. Explain the idea behind such a correction. Plot the `corrected' total lug count as a function of row. Comment on the
plot. Choose one of the nonparametric estimators (kernel, loess or
smoothing spline) and fit it to the `corrected' total lug count. Comment on the
results. (A sketch of this correction in R appears after this question.)
- L'Haim!
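A possible R sketch for reading the data and computing the harvest-effect correction follows. The absence of a header line and the column order are assumptions based on the description above, and the super-smoother is just one choice among the four estimators.

    # Vineyard.dat: row number and lug counts for the 1989, 1990 and 1991 harvests
    vy <- read.table("Vineyard.dat", header = FALSE,
                     col.names = c("row", "y1989", "y1990", "y1991"))
    total <- vy$y1989 + vy$y1990 + vy$y1991
    plot(vy$row, total, xlab = "row", ylab = "total lug count")

    # harvest-effect correction: subtract each year's average lugs per row
    yrs <- c("y1989", "y1990", "y1991")
    corrected <- sweep(vy[, yrs], 2, colMeans(vy[, yrs]))
    total.corr <- rowSums(corrected)

    # fit one of the four estimators, e.g. Friedman's super-smoother
    fit <- supsmu(vy$row, total.corr)
    plot(vy$row, total.corr, xlab = "row", ylab = "corrected total lug count")
    lines(fit, col = 2)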
Computational Notes for R users.
Here is a (partial) list of R functions for kernel estimation, local
polynomial regression and spline smoothing. See the corresponding help files
for details of their use; a short sketch of representative calls is given at the end of these notes.
- ksmooth performs Nadaraya-Watson kernel estimation.
However, it requires
a bandwidth to be specified by the user and does not choose it "automatically"
in an optimal way. In addition, its list of possible kernels and some other
options are quite limited.
- npreg from the package np
available on CRAN
computes a kernel regression and
allows one to choose the bandwidth by cross-validation (see npregbw).
- hcv and sm.regression from the package sm
available on CRAN
allow one to choose the bandwidth by CV (hcv) and then use it for
kernel estimation (sm.regression).
- locpoly from the package KernSmooth
available on CRAN
fits local polynomial regression (though it does not choose the bandwidth
"automatically").
- To run local polynomial regression with an "automatically" chosen bandwidth
parameter you will need to install the package locfit
from CRAN.
The functions locfit and locfit.raw perform a local polynomial
regression for a given bandwidth parameter, which can be chosen by the GCV
criterion using the function gcvplot.
In fact, locfit and locfit.raw also allow one to fit
Nadaraya-Watson kernel estimators (how?).
- loess performs a local polynomial regression where the
bandwidth is defined implicitly by the percentage of data points within the
window.
Unfortunately, this percentage must either be provided by the user or is
taken as a default value.
You can choose a `reasonable' bandwidth by visual analysis of
several loess estimators with different bandwidths.
- lowess is similar to loess but performs a robust
local linear regression.
- supsmu performs smoothing by the ``super-smoother'' algorithm of
Friedman -- local polynomial regression with a variable
bandwidth chosen by local CV.
- To perform spline smoothing, use the function smooth.spline
with all.knots=TRUE.
It chooses the smoothing parameter either by GCV (the default) or by CV.
The call
print(smooth.spline(x,y,...))
will provide some useful output on the smoothing spline fit.
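For orientation only, here is a hedged sketch of representative calls to some of the functions listed above; the bandwidth and span values are placeholders rather than recommendations, and exact argument names may vary slightly between package versions.

    library(sm); library(KernSmooth)

    ksmooth(x, y, kernel = "normal", bandwidth = 0.2)   # fixed, user-chosen bandwidth
    h <- hcv(x, y)                                       # CV bandwidth (package sm)
    sm.regression(x, y, h = h)                           # kernel estimate with that bandwidth
    locpoly(x, y, degree = 1, bandwidth = 0.2)           # local linear, fixed bandwidth
    loess(y ~ x, span = 0.5)                             # span = fraction of points in the window
    lowess(x, y)                                         # robust local regression
    supsmu(x, y)                                         # super-smoother, local CV bandwidth
    print(smooth.spline(x, y, all.knots = TRUE))         # GCV-chosen smoothing parameter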