Topics:
Lars Pålsson Syll considers the following as important: Statistics & Econometrics
Post-model-selection inference problems (wonkish)
It has long been recognized by some that when any parameter estimates are discarded, the sampling distribution of the remaining parameter estimates can be distorted …
For example, suppose the model a researcher selects depends on the day of the week. On Mondays it’s model A, on Tuesdays it’s model B, and so on, up to seven different models on seven different days. Each model, therefore, is the “final” model with a probability of 1/7 that has nothing to do with the values of the regression parameters. Then, if the data analysis happens to be done on a Thursday, say, it is the results from model D that are reported. All of the other model results that could have been reported are not. Those parameter estimates are summarily discarded …
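The day-of-the-week story can be mimicked with a few lines of Python. This is only a sketch: the seven (mean, standard deviation) pairs below are invented numbers standing in for the sampling distribution of one coefficient under each model, and each of those distributions is assumed to be normal. The point is that the estimate that actually gets reported is a draw from a mixture of the seven, so its spread includes the disagreement between models as well as the usual within-model noise.

```python
import random
import statistics

random.seed(0)

# Hypothetical sampling distributions of the reported coefficient under each
# of the seven models, as (mean, standard deviation). The numbers are invented.
MODELS = {
    "A": (0.50, 0.10),
    "B": (0.55, 0.10),
    "C": (0.40, 0.12),
    "D": (0.70, 0.08),
    "E": (0.30, 0.15),
    "F": (0.60, 0.10),
    "G": (0.45, 0.11),
}

def reported_estimate():
    """Pick a model with probability 1/7, then draw its coefficient estimate."""
    mean, sd = random.choice(list(MODELS.values()))
    return random.gauss(mean, sd)

draws = [reported_estimate() for _ in range(100_000)]

# Law of total variance: mixture variance equals the average within-model
# variance plus the variance of the model-specific means.
within = statistics.fmean([sd ** 2 for _, sd in MODELS.values()])
between = statistics.pvariance([mean for mean, _ in MODELS.values()])
simulated = statistics.pvariance(draws)

print(f"simulated variance of reported estimate: {simulated:.4f}")
print(f"within-model + between-model variance:   {within + between:.4f}")
print(f"average within-model variance alone:     {within:.4f}")
```

A standard error taken from model D alone reflects only the within-model part, which is why it can understate the uncertainty of what ends up being reported.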
Model selection is a procedure by which some models are chosen over others. But model selection is subject to uncertainty. Because regression parameter estimates depend on the model in which they are embedded, there is in post-model-selection estimates additional uncertainty not present when a model is specified in advance. The uncertainty translates into sampling distributions that are a mixture of distributions, whose properties can differ dramatically from those required for conventional statistical inference.
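A small simulation makes the consequence for conventional inference concrete. The setup is an assumption made for illustration, not something taken from the quoted text: two correlated regressors, no intercept, a true coefficient of 1.0 on x1 and 0.3 on x2, and a selection rule that keeps x2 only if its t-statistic exceeds 1.96 in absolute value. The code then asks how often the nominal 95% interval for the x1 coefficient covers the truth, once with the full model fixed in advance and once after the selection step.

```python
import math
import random

random.seed(1)

N, REPS = 100, 2000
B1, B2, RHO = 1.0, 0.3, 0.7   # hypothetical true values, chosen for illustration
Z = 1.96                      # normal critical value for a nominal 95% interval

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def one_replication():
    """Return (full-model CI covers B1, post-selection CI covers B1)."""
    x1 = [random.gauss(0, 1) for _ in range(N)]
    x2 = [RHO * a + math.sqrt(1 - RHO ** 2) * random.gauss(0, 1) for a in x1]
    y = [B1 * a + B2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y, syy = dot(x1, y), dot(x2, y), dot(y, y)

    # Full model (no intercept): y ~ x1 + x2, solved via the normal equations.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    sigma2 = (syy - b1 * s1y - b2 * s2y) / (N - 2)
    se_b1 = math.sqrt(sigma2 * s22 / det)
    se_b2 = math.sqrt(sigma2 * s11 / det)
    full_covers = abs(b1 - B1) <= Z * se_b1

    # Selection step: keep x2 only if it looks "significant" in the full model.
    if abs(b2 / se_b2) > Z:
        post_covers = full_covers
    else:
        # Reduced model: y ~ x1 only, reported as if it had been chosen in advance.
        b1_r = s1y / s11
        sigma2_r = (syy - b1_r * s1y) / (N - 1)
        post_covers = abs(b1_r - B1) <= Z * math.sqrt(sigma2_r / s11)
    return full_covers, post_covers

results = [one_replication() for _ in range(REPS)]
coverage_full = sum(f for f, _ in results) / REPS
coverage_post = sum(p for _, p in results) / REPS

print(f"coverage, model fixed in advance: {coverage_full:.3f}")
print(f"coverage, after model selection:  {coverage_post:.3f}")
```

With these settings the pre-specified model should cover at close to the nominal rate, while the post-selection interval should fall well short of it, because the reported estimate comes from a mixture of the full-model and reduced-model sampling distributions and the reduced model is biased whenever x2 actually matters.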