Searching for the Optimal Data Model: Two Strategies for Statistical Variable Selection

J.A. Stark and W.J. Fitzgerald

Technical Report CUED/F-INFENG/TR 259 (1996), Department of Engineering, University of Cambridge

Selecting the best model for a data set can be difficult when the set of candidate models is large. This is particularly so when each model may be composed of an arbitrary combination of variables. The selection task can be viewed as the optimisation of a selection criterion or (within a Bayesian methodology) as a problem of probabilistic inference. Both of these perspectives are useful in the development of search strategies.

Two strategies are presented in this paper. In the first, independent trial models are assessed. It is shown that the characterisations of a sample of models can be used to deduce the most useful model variables, especially when the trial models are large and the contribution of each component variable to the prediction is evaluated. The second strategy involves searching the candidate models in a stepwise fashion by making a sequence of small changes to a trial model. This method was found to be particularly effective in locating the optimal model in test data sets.
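To make the two strategies concrete, the following is a minimal sketch and not the authors' implementation. It assumes a model is a set of variable indices and that the selection criterion is a function to be maximised. The names `toy_score`, `TRUE_VARS`, `variable_usefulness` and `stepwise_search` are hypothetical, and the toy criterion merely stands in for a real criterion such as a penalised likelihood or a Bayesian evidence. For the first strategy, the sketch substitutes a simple marginal comparison (mean score of sampled models that contain a variable minus the mean score of those that do not) for the per-variable contribution analysis described above.

```python
import random
from typing import Callable, FrozenSet, List, Tuple

Model = FrozenSet[int]  # a model is the set of indices of its variables

# Hypothetical toy criterion: reward "true" variables, penalise the rest.
TRUE_VARS: Model = frozenset({1, 4, 6})

def toy_score(model: Model) -> float:
    return len(model & TRUE_VARS) - 0.5 * len(model - TRUE_VARS)

def variable_usefulness(n_vars: int, score: Callable[[Model], float],
                        n_samples: int = 2000, seed: int = 0) -> List[float]:
    """Strategy 1 (sketch): score independent random trial models and compare,
    per variable, the mean score with the variable against the mean without."""
    rng = random.Random(seed)
    sums = [[0.0, 0.0] for _ in range(n_vars)]   # [without, with]
    counts = [[0, 0] for _ in range(n_vars)]
    for _ in range(n_samples):
        model = frozenset(v for v in range(n_vars) if rng.random() < 0.5)
        s = score(model)
        for v in range(n_vars):
            present = int(v in model)
            sums[v][present] += s
            counts[v][present] += 1
    return [sums[v][1] / max(1, counts[v][1]) - sums[v][0] / max(1, counts[v][0])
            for v in range(n_vars)]

def stepwise_search(n_vars: int, score: Callable[[Model], float],
                    start: Model = frozenset(),
                    max_steps: int = 1000) -> Tuple[Model, float]:
    """Strategy 2 (sketch): greedy stepwise search. At each step, toggle the
    single variable that most improves the score; stop at a local optimum."""
    current, current_score = frozenset(start), score(frozenset(start))
    for _ in range(max_steps):
        best, best_score = None, current_score
        for v in range(n_vars):
            neighbour = current ^ frozenset({v})  # add v if absent, else remove it
            s = score(neighbour)
            if s > best_score:
                best, best_score = neighbour, s
        if best is None:          # no single change improves the criterion
            break
        current, current_score = best, best_score
    return current, current_score

usefulness = variable_usefulness(8, toy_score)
best_model, best_value = stepwise_search(8, toy_score)
```

On this toy criterion the stepwise search recovers `TRUE_VARS` from an empty starting model. A greedy search of this kind can stall at a local optimum when the criterion is less well behaved, which is one of the difficulties that an objective test model for search strategies would need to expose.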

An important part of this research was an investigation of the difficulties that can be encountered when searching the model space. This led to the development of a test model for objective assessment of search strategies.