Constructing a nested sequence of models: regularization and model selection

Nathan Intrator and Inna Stainvas
Tel-Aviv University
www.math.tau.ac.il/~nin

Model selection is essential for models with many parameters and requires an efficient measure of model complexity. While regularization is very useful, it has two fundamental drawbacks: (i) it requires an independent test set for evaluating the regularization parameter (often replaced by a computationally expensive leave-one-out estimate), and (ii) a new model has to be estimated for every regularization value. The second drawback implies that the resulting models are very different, so model interpretation is not unique. A unique interpretation is highly desirable and is achieved when a nested sequence of models can be created, for example in regression or in a tree-structured model such as CART (Breiman et al., 1984).

A sequence of nested models is computationally more efficient (there is no need to re-estimate the model when the model-complexity penalty changes) and leads to a natural hierarchy of models. In regression, such a sequence is obtained by creating a nested sequence of data representations, or simply by eliminating different data coordinates, as sketched below. This data-complexity regularization, however, is not always directly correlated with the desired model-complexity regularization. While measuring model complexity by some statistic on the distribution of the model parameters (e.g., their variance or description length; Hinton and van Camp, 1993) may be very useful, it still suffers from the two drawbacks mentioned above.
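
As an illustration (not taken from the paper itself), a nested sequence in regression can be built by greedily ordering the data coordinates, so that model k uses the first k coordinates and is therefore contained in model k+1. The sketch below assumes ordinary least squares on synthetic data; all names and data are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 8))   # 8 candidate data coordinates
    y = X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(200)

    def rss(X_sub, y):
        """Residual sum of squares of an OLS fit on the chosen coordinates."""
        beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
        return np.sum((y - X_sub @ beta) ** 2)

    # Greedily order the coordinates by how much each reduces the residual
    # error; taking the first k coordinates then yields a nested sequence.
    order, remaining = [], list(range(X.shape[1]))
    while remaining:
        best = min(remaining, key=lambda j: rss(X[:, order + [j]], y))
        order.append(best)
        remaining.remove(best)

    nested_models = [order[:k] for k in range(1, len(order) + 1)]
    # each model's coordinate set contains the previous model's coordinates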

We introduce a general approach for regularizing models by constructing a nested sequence of models in which model complexity is measured by the dimensionality of the model parameters (the sequence may be only approximately nested when further optimization refinements are used). The likelihood ratio of the nested sequence has a chi-squared distribution with degrees of freedom equal to the number of additional dimensions (Silvey, 1975), which leads to a simple rule for choosing the optimal model from the properties of this distribution. Wavelet theory, in particular wavelet packets and best-basis methods (Coifman and Wickerhauser, 1992), is used to construct the nested sequence of models and simplifies parameter estimation in the case of a feed-forward network model.
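
As a concrete sketch of this selection rule (under the simplifying assumption of Gaussian errors with known variance, which need not be the exact setting of the paper), the likelihood-ratio statistic of two nested models is compared against the chi-squared quantile whose degrees of freedom equal the number of added dimensions; all function names and numbers below are illustrative.

    from scipy.stats import chi2

    def accept_larger_model(rss_small, rss_large, added_dims, sigma2, alpha=0.05):
        """Accept the extra dimensions only if the drop in residual error
        exceeds what the chi-squared null distribution predicts by chance."""
        # -2 log likelihood ratio for Gaussian errors with known variance sigma2
        stat = (rss_small - rss_large) / sigma2
        return stat > chi2.ppf(1 - alpha, df=added_dims)

    # Growing along the nested sequence stops at the first rejected expansion.
    print(accept_larger_model(rss_small=120.0, rss_large=95.0,
                              added_dims=2, sigma2=1.0))   # True: keep both dims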

A comparison of this method with linear and non-linear feed-forward architectures on several data sets suggests that it can achieve superior results on both small and large data sets.

References

  1. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone (1984) Classification and Regression Trees. Wadsworth Statistics/Probability Series.
  2. G. E. Hinton and D. van Camp (1993) Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. Proceedings of the Sixth ACM Conference on Computational Learning Theory, pp. 5-13. www.cs.toronto.edu/~drew/colt93.ps
  3. S. D. Silvey (1975) Statistical Inference. Chapman and Hall.
  4. R. R. Coifman and M. Wickerhauser (1992) Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713-719. wuarchive.wustl.edu/doc/techreports/wustl.edu/math/papers/entbb.ps.Z



    A paper version can be found in my online publications.