All ecologists use some form of model in nearly every paper they write, although we might not always realise it. Sometimes this might be a simple statistical model, such as the families of linear models used to test for differences between groups or to ascertain relationships between measurements made out in the field; at other times it might be a more complex mathematical or simulation model. It seems obvious that, to improve our understanding of a system or to make accurate predictions, we have to be very careful to use good models that facilitate these objectives. After teaching a lecture on ‘Bayesian Model Selection’ in the summer, I have been thinking a lot about what makes a model ‘good’, and it seems to me to be a little more involved than the maximisation of simple metrics. Here I talk about the criteria that I look out for when I am assessing my models for their quality. I don’t claim that either the criteria or the example metrics listed here are exhaustive, but they collectively represent the sort of approaches I consider when thinking about model validation.
In order for us to believe that a model is an adequate representation of a system, we must first convince ourselves that the proposed model can adequately describe the important characteristics of the data that we have collected. There are many metrics commonly applied to assess the ‘fit’ of a model. The sum of squares is one such metric, as is R squared, and, if we are willing to assume a distribution for the errors in our model, so is the likelihood. The metric of fit can be calculated for any set of values for the parameters of the model, and usually the first objective is to search the parameter space for the set of parameter values that optimise this metric (i.e. find the smallest sum of squares or maximise the likelihood). This process is often referred to as ‘fitting the model’. Usually this optimal metric value can be interpreted as the best possible fit that the given model can achieve, and it often forms the basis for comparisons of fit against other models. In frequentist statistics we take the parameter values that optimise our metric of fit as the ‘best estimators’ of those parameters (see maximum-likelihood estimators or least-squares estimators). In Bayesian statistics it all becomes a bit more complicated because the parameter values are themselves random variables, and the ‘fit’ of the model becomes an estimate of all the possible ways that the data could be generated over all possible parameterisations of the model; but, even here, we have measures such as the deviance and the ‘marginal probability of the data given the model’ used in Bayes Factors.
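As a concrete (and entirely simulated) illustration of what I mean by ‘fitting’, the sketch below finds the least-squares estimators for a straight line and reads off three of the fit metrics mentioned above. The data, seed, and true parameter values are all made up for the example.

```python
import numpy as np

# Hypothetical data: a linear trend plus normal noise.
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=x.size)

# Least-squares estimators: the parameter values that minimise
# the sum of squared errors ('fitting the model').
X = np.column_stack([np.ones_like(x), x])          # design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # intercept, slope

residuals = y - X @ beta_hat
ss_res = np.sum(residuals ** 2)                    # sum of squares (fit metric)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot                    # R squared (fit metric)

# If we assume normal errors, the same estimators maximise the
# likelihood; this is the maximised Gaussian log-likelihood.
sigma2_hat = ss_res / y.size
log_lik = -0.5 * y.size * (np.log(2 * np.pi * sigma2_hat) + 1)
```

For a linear model with normal errors the least-squares and maximum-likelihood estimators coincide, which is why the same `beta_hat` serves for both metrics here.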
At the last INTECOL/BES meeting in London there was a lot of talk about how to make ecology a more ‘predictive’ science, with special symposia centred around topics such as ‘how can we transform predictive ecology to better meet scientific and societal demands?’, ‘not just for geeks: broadening scope and participation in predictive ecology’, and ‘global change and multispecies systems: from understanding to prediction’. In addition, we heard a lot from speakers in the ‘Math, Models and Methods in Ecology’ sessions about how to squeeze the most predictivity from our models. In particular, there was a very compelling talk by Sarah Calba in Thursday’s morning session asking what we mean when we talk about ‘predictivity’. I cannot hope to even scrape the surface of the philosophical content of her talk, and this concept still seems to elude being pinned down definitively. I offer one possible definition of the ‘predictivity’ of a model: its ability to predict data that it has not seen in the ‘fitting’ process (see above). Whilst this may seem a rather shallow definition, it may prove useful at least as a starting point.
One possible measure of ‘predictivity’ is to take the best-fitting parameterisation of the model and then calculate, on the new data, the same metrics used in the ‘fitting’ of the model. Certainly, there has been much work on the derivation of predictive likelihoods, although I have seen much less on predictive sums-of-squares. In the Bayesian world we have the relatively straightforward notion of the posterior predictive density, from which we can calculate the probability of observing the newly found data. The problem with all measures of predictivity is that it is unclear what the best strategy is for generating this ‘extra’ data. One simple strategy is to collect a new dataset, possibly in a new region or time period depending on what domain we want the metric of ‘predictivity’ to measure. The problem is that this requires waiting for the acquisition of the new data, at which point the prediction may no longer be useful. Another technique is to partition your data and then use one partition to fit the model and another as a test data set for prediction. These cross-validation techniques typically have to be repeated many times with different ‘training’ data (for use in fitting) and ‘testing’ data (for calculation of the predictivity metric) so that an estimate of the uncertainty in your predictivity estimate can be calculated.
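The partitioning idea can be sketched in a few lines. This is a minimal k-fold cross-validation, using an out-of-fold predictive sum-of-squares as the ‘predictivity’ metric; the data, fold count, and seeds are all hypothetical.

```python
import numpy as np

# Hypothetical data with a genuine linear trend.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 60)
y = 1.0 + 0.8 * x + rng.normal(0, 1, size=x.size)

def fit(x_tr, y_tr):
    """Least-squares fit of intercept and slope on the training partition."""
    X = np.column_stack([np.ones_like(x_tr), x_tr])
    beta, *_ = np.linalg.lstsq(X, y_tr, rcond=None)
    return beta

def predictive_ss(k=5, seed=0):
    """Mean (and spread) of the out-of-fold sum of squared
    prediction errors over k train/test partitions."""
    idx = np.random.default_rng(seed).permutation(x.size)
    folds = np.array_split(idx, k)
    scores = []
    for test_idx in folds:
        train_idx = np.setdiff1d(idx, test_idx)     # 'training' data
        b0, b1 = fit(x[train_idx], y[train_idx])
        pred = b0 + b1 * x[test_idx]                # predict 'testing' data
        scores.append(np.sum((y[test_idx] - pred) ** 2))
    return np.mean(scores), np.std(scores)

mean_ss, sd_ss = predictive_ss()
```

Repeating the whole procedure with different seeds (i.e. different random partitions) is what gives the uncertainty estimate mentioned above.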
In science we often give more weight to hypotheses that, all other things being equal, are simpler than their competitors. Occam’s razor, when applied to modelling, means that if two models perform equally well in other areas (such as fit or predictivity) then we should prefer the simpler model. For many models, one measure of model complexity is simply the number of free parameters in the model. For example, in a simple linear model (assuming normal errors) we have parameters determining the slope and the intercept of the expectation, plus one parameter determining the error around this expectation, giving a total of three parameters. This is one fewer than a similarly structured quadratic model, which has an extra parameter for the coefficient of the quadratic term. The ‘number of parameters’ becomes a more slippery concept when applied to hierarchical models as, in these models, a number of the ‘parameters’ are actually calculated from sub-models. Bayesian inference complicates this further, as not all parameters are created equal: some are given a lot of flexibility (through the setting of wide prior distributions) and others are restricted to narrow ranges. This has led to the concept of the ‘effective number of parameters’ used in Bayesian methods of model selection. Whilst the notion of model complexity can become quite abstract, all attempts to measure model complexity are really attempts to measure the flexibility of the model, sometimes called the ‘richness of the model space’.
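The ‘effective number of parameters’ can be made concrete with the pD quantity used in DIC: the mean posterior deviance minus the deviance at the posterior mean. The sketch below fakes the ‘posterior samples’ with a normal draw rather than a real MCMC run, purely to show the arithmetic; for a single well-identified mean parameter pD should come out close to one.

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(5.0, 1.0, size=30)                 # hypothetical data

def deviance(mu):
    """Deviance (-2 * log-likelihood) for a N(mu, 1) model of y."""
    return np.sum((y - mu) ** 2) + y.size * np.log(2 * np.pi)

# Stand-in for posterior draws of mu from an MCMC run: for a flat
# prior the posterior is roughly N(ybar, 1/n).
mu_samples = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=5000)

mean_dev = np.mean([deviance(m) for m in mu_samples])
dev_at_mean = deviance(mu_samples.mean())
p_d = mean_dev - dev_at_mean                      # effective number of parameters
```

A strongly informative prior would shrink the spread of `mu_samples` and drive `p_d` below one, which is exactly the sense in which a tightly constrained parameter ‘counts for less’.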
It is a well-known phenomenon that the more flexible a model is (i.e. the more complex), the more closely it can fit the data in the fitting process. We therefore face a trade-off between fit and parsimony. The question should not be “will I get extra fit if I make my model more complex?” but “is the extra fit I will achieve worth the extra complexity?”. To find the optimal point in this trade-off we need some metric that can express parsimony and fit in comparable terms. Towards this end we have seen the development of a number of metrics based on information theory, such as AIC, BIC, and, for Bayesian inference, DIC (to name but a few), that assess the fit of the model to the data but also penalise for complexity.
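To make the trade-off concrete, the sketch below computes AIC = 2k − 2·log-likelihood for the linear and quadratic models discussed earlier, on simulated data that are genuinely linear. The quadratic model will always fit at least as well, but it pays a penalty of two AIC units for its extra parameter; the data and seed are invented for illustration.

```python
import numpy as np

# Hypothetical, genuinely linear data.
rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 80)
y = 1.0 + 0.5 * x + rng.normal(0, 0.5, size=x.size)

def gaussian_aic(X, y):
    """AIC of a least-squares fit with normal errors; k counts the
    regression coefficients plus one error-variance parameter."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    sigma2 = ss_res / y.size
    log_lik = -0.5 * y.size * (np.log(2 * np.pi * sigma2) + 1)
    k = X.shape[1] + 1
    return 2 * k - 2 * log_lik

aic_linear = gaussian_aic(np.column_stack([np.ones_like(x), x]), y)
aic_quadratic = gaussian_aic(np.column_stack([np.ones_like(x), x, x ** 2]), y)
# Lower AIC is preferred: the quadratic term must buy enough extra
# log-likelihood to overcome its complexity penalty.
```

On data like these the extra fit from the quadratic term is usually too small to justify the penalty, so AIC will typically (though not on every random draw) prefer the linear model.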
Probably the most important model selection criterion is one that doesn’t require any fancy estimation metrics but one that does require a self-critical eye and good knowledge of the system being studied. This criterion simply asks: ‘does the model proposed make sense for the system being studied?’. Does the model being proposed exhibit any strange behaviour such as predicting values that could not possibly occur? Is the prediction biologically reasonable?
For example, one might be tempted to fit a standard linear model to a data set with a dependent variable bounded below by zero (such as height or weight). This model may provide a very good fit and may even predict new data well; the problem, however, is that for some values of the independent variable the model will provide a negative expectation. We should therefore consider a better model specification (such as the same model except with a log link function) even if our other model metrics suggest that the model is performing well. Despite this seeming obvious, this type of mistake has appeared in journals as prominent as Nature (even if they did publish the appropriate response article).
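Here is a small simulated demonstration of that failure mode. A straight-line fit to strictly positive (hypothetical) weight data happily predicts impossible negative values once we extrapolate, whereas modelling on the log scale (a simple stand-in for a proper log link, not a full GLM fit) guarantees positive predictions.

```python
import numpy as np

# Hypothetical, strictly positive response (e.g. body weight).
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 40)
weight = np.exp(0.1 + 0.25 * x) * rng.lognormal(0, 0.1, size=x.size)

X = np.column_stack([np.ones_like(x), x])
beta_linear, *_ = np.linalg.lstsq(X, weight, rcond=None)          # identity scale
beta_log, *_ = np.linalg.lstsq(X, np.log(weight), rcond=None)     # log scale

# Predict outside the range of the observed data.
x_new = np.linspace(-10, 20, 100)
Xn = np.column_stack([np.ones_like(x_new), x_new])
pred_linear = Xn @ beta_linear        # can dip below zero: fails the sanity check
pred_log = np.exp(Xn @ beta_log)      # always positive by construction
```

Both models can score well on fit metrics within the observed range; only the sanity check on the predictions exposes the difference.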
However, even with a correctly specified link function, we still need to be vigilant that our predictions make sense biologically. One example of poor practice that I see regularly in the field of species distribution modelling is the specification of logistic regression models that include only a single linear term for one or more predictor variables. Such a specification may fit the data presented to the model well, but it makes no sense biologically. For example, if we develop a regression model for the dependent variable ‘probability of occurrence’ with only a linear term for temperature, then a positive regression coefficient would mean that the species is more likely to occur in hot environments than cold environments. This makes sense when we only consider a small area, but in the broader scheme of things it would also mean that you would expect the surface of the sun to be littered with the species in question. Niche theory expects there to be environmental tolerance extremes for any species, and so it seems sensible to restrict ourselves only to regression models (and parameterisations of those models) that enforce these expectations (i.e. regression models whose highest-order term has an even exponent and a negative coefficient, so that, on the logit scale, the probability of occurrence falls away at both environmental extremes).
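The niche argument is easy to visualise numerically. Below, a linear-only temperature term forces the occurrence probability ever upwards in hotter environments, whereas adding a quadratic term with a negative leading coefficient gives a unimodal response with a tolerance optimum. All coefficients are made up for illustration; they are not fitted to any data.

```python
import numpy as np

def logistic(eta):
    """Inverse logit link: maps the linear predictor onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

temps = np.linspace(-20, 60, 500)                          # degrees C

# Linear term only: probability keeps climbing with temperature.
p_linear = logistic(-2.0 + 0.2 * temps)

# Quadratic term with a negative coefficient: unimodal response
# peaking near a (hypothetical) thermal optimum of 20 C.
p_quadratic = logistic(2.0 - 0.02 * (temps - 20.0) ** 2)
```

The quadratic curve drops towards zero at both temperature extremes, which is the behaviour niche theory leads us to expect; the linear-only curve is the one that litters the surface of the sun with our species.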
Putting it all together
My career so far has seen me handle very different models in various ecological sub-disciplines, and one thing that I’ve found most interesting is that these sub-disciplines seem to place completely different values upon each of these criteria when evaluating their models. In species distribution modelling we often test our models by evaluating a metric based on classification errors, such as kappa or AUC. This can either be performed on the data used to fit the model, in which case these measures are metrics of fit, or on novel data not encountered in the fitting process, in which case they are measures of predictivity. Neither of these metrics includes a penalty for complexity, and it is rare that I’ve seen anyone try to include this in any assessment of the performance of different species distribution models (although I have seen one notable exception). Conversely, in the field of individual-based modelling, I have seen people working to create models that can capture very sophisticated elements of a species’ biology (so passing the ‘sanity’ criterion with flying colours), but this usually comes at the expense of staggering levels of complexity, and it is rare that I’ve seen examples where people have justified this complexity in terms of a robust statistical criterion such as AIC.
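For readers unfamiliar with it, AUC has a neat interpretation worth sketching: it is the probability that a randomly chosen presence receives a higher predicted score than a randomly chosen absence (the Mann-Whitney statistic). The scores and labels below are invented purely to exercise the function.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    presence/absence pairs ranked correctly (ties get half credit)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Hypothetical predictions: every presence outscores every absence,
# so the model ranks the data perfectly.
example = auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])   # -> 1.0
```

Computed on the training data this is a metric of fit; computed on held-out data it becomes a metric of predictivity. In neither case does it penalise complexity, which is the gap noted above.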
So how do we put all of these criteria together to select the ‘best’ model? Unfortunately, this is where it gets a bit more difficult. As yet I have not encountered a metric that combines fit, predictivity, and parsimony into one optimisable statistic. Some of the metrics described above already combine two of these criteria: the information criteria such as AIC, BIC, and DIC consider fit and parsimony jointly. Stone (1977) shows that model selection by AIC and by leave-one-out cross-validation are asymptotically equivalent, which suggests a link between predictivity and the joint consideration of fit and parsimony. It isn’t yet clear whether similar results hold for other information-theoretic measures such as BIC or DIC (although if you know of any, I’d be grateful for a comment). Until a ‘super metric’ that incorporates all of these criteria is available, my approach is first to exclude all models that do not pass a sanity analysis and then to calculate metrics of predictivity and of combined fit and parsimony. More often than not I find that both metrics rank the models equivalently and, on the rare occasions that the rankings differ, I quote both rankings but use the one that is most appropriate for what I intend to use the model for. If the focus of the paper I am writing is prediction then I place more emphasis on the predictivity rankings but, if the aim of the model is to describe the study system, then I place more emphasis on the fit-and-parsimony rankings.
Before I sign off, it would be remiss of me not to point out an interesting discussion about model ‘generality’ happening on Bob O’Hara’s blog, centred around a recent paper in Trends in Ecology and Evolution. In particular, there seems to be intense debate about how we expect ‘generality’ to vary with model complexity. I haven’t covered this extra possible criterion of model performance here, partly because I am finding it difficult to pin down what the different protagonists mean by ‘generality’ in this debate (sometimes the authors hint at ‘generality’ being defined as something like what I have called ‘predictivity’ here), but mainly because I couldn’t hope to cover the subject as well as they have.
If anyone has any further criteria that they use to assess model performance then I would be happy to hear them. Despite the abundance of metrics for the assessment of model performance, it appears that there is still an element of artistry to model selection.