Uncertainty, noise and the art of model-data comparison

Global Warming: With permission from RealClimate. Gavin Schmidt and Stefan Rahmstorf

John Tierney and Roger Pielke Jr. have recently discussed attempts to validate (or falsify) IPCC projections of global temperature change over the period 2000-2007.

Others have attempted to show that last year’s numbers imply that ‘Global Warming has stopped’ or that it is ‘taking a break’ (Uli Kulke, Die Welt). However, as most of our readers will realise, these comparisons are flawed since they basically compare long-term climate change to short-term weather variability.

This becomes immediately clear when looking at the following graph:

The red line is the annual global-mean GISTEMP temperature record (though any other data set would do just as well), while the blue lines are 8-year trend lines – one for each 8-year period of data in the graph. What it shows is exactly what anyone should expect: the trends over such short periods are variable; sometimes small, sometimes large, sometimes negative – depending on which year you start with. The mean of all the 8-year trends is close to the long-term trend (0.19ºC/decade), but the standard deviation is almost as large (0.17ºC/decade), implying that a trend would have to be either >0.5ºC/decade or much more negative (< -0.2ºC/decade) for it to obviously fall outside the distribution. Thus comparing short trends has very little power to distinguish between alternative expectations.

So, it should be clear that short-term comparisons are misguided, but the reasons why, and what should be done instead, are worth exploring.
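The calculation behind the graph described above is simple enough to sketch. The following is a minimal illustration, not the analysis used for the figure: it runs on a synthetic series (an assumed linear trend of 0.19ºC/decade plus assumed independent Gaussian noise of 0.1ºC), not on real GISTEMP data, and the helper names `ols_slope` and `sliding_trends` are our own.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def sliding_trends(years, temps, window):
    """Slope (per year) of every consecutive `window`-year segment."""
    return [
        ols_slope(years[i:i + window], temps[i:i + window])
        for i in range(len(years) - window + 1)
    ]

# Synthetic stand-in for an annual global-mean record (NOT real data):
# an assumed linear forced trend plus independent Gaussian "weather" noise.
rng = random.Random(0)
TREND_PER_YEAR = 0.019  # assumed: 0.19 C/decade
NOISE_SD = 0.1          # assumed interannual noise, in C
years = list(range(1975, 2008))
temps = [TREND_PER_YEAR * (y - years[0]) + rng.gauss(0.0, NOISE_SD) for y in years]

# Convert slopes from C/year to C/decade.
trends = [10 * t for t in sliding_trends(years, temps, 8)]
long_term = 10 * ols_slope(years, temps)

print(f"long-term trend: {long_term:.2f} C/decade")
print(f"8-year trends:   mean {statistics.fmean(trends):.2f}, "
      f"sd {statistics.stdev(trends):.2f} C/decade")
print(f"range of 8-year trends: {min(trends):.2f} to {max(trends):.2f} C/decade")
```

The exact numbers depend on the random seed and on the assumed noise level, but the pattern should be the same as in the graph: the individual 8-year trends scatter widely around a long-term trend that is itself well determined.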

The first point to make (and indeed the first point we always make) is that the climate system has enormous amounts of variability on day-to-day, month-to-month, year-to-year and decade-to-decade periods. Much of this variability (once you account for the diurnal cycle and the seasons) is apparently chaotic and unrelated to any external factor – it is the weather. Some aspects of weather are predictable – the location of mid-latitude storms a few days in advance, the progression of an El Niño event a few months in advance etc, but predictability quickly evaporates due to the extreme sensitivity of the weather to the unavoidable uncertainty in the initial conditions. So for most intents and purposes, the weather component can be thought of as random.

If you are interested in the forced component of the climate – and many people are – then you need to assess the size of an expected forced signal relative to the unforced weather ‘noise’. Without this, the significance of any observed change is impossible to determine. The signal to noise ratio is actually very sensitive to exactly what climate record (or ‘metric’) you are looking at, and so whether a signal can be clearly seen will vary enormously across different aspects of the climate.

An obvious example is looking at the temperature anomaly in a single temperature station. The standard deviation in New York City for a monthly mean anomaly is around 2.5ºC, for the annual mean it is around 0.6ºC, while for the global mean anomaly it is around 0.2ºC. So the longer the averaging time-period and the wider the spatial average, the smaller the weather noise and the greater chance to detect any particular signal.

In the real world, there are other sources of uncertainty which add to the ‘noise’ part of this discussion. First of all there is the uncertainty that any particular climate metric is actually representing what it claims to be. This can be due to sparse sampling or it can relate to the procedure by which the raw data is put together. It can either be random or systematic and there are a couple of good examples of this in the various surface or near-surface temperature records.

Sampling biases are easy to see in the difference between the GISTEMP surface temperature data product (which extrapolates over the Arctic region) and the HADCRUT3v product, which assumes that Arctic temperature anomalies don’t extend past the land. These are both defensible choices, but when calculating global mean anomalies in a situation where the Arctic is warming up rapidly, there is an obvious offset between the two records (and indeed GISTEMP has been trending higher). However, the long-term trends are very similar.

A more systematic bias is seen in the differences between the RSS and UAH versions of the MSU-LT (lower troposphere) satellite temperature record. Both groups are nominally trying to estimate the same thing from the same data, but because of the assumptions and methods used in tying together the different satellites involved, there can be large differences in trends. Given that we only have two examples of this metric, the true systematic uncertainty is clearly larger than simply the difference between them.

What we are really after is how to evaluate our understanding of what’s driving climate change as encapsulated in models of the climate system. Those models, though, can be as simple as an extrapolated trend, or as complex as a state-of-the-art GCM. Whatever the source of an estimate of what ‘should’ be happening, there are three issues that need to be addressed:

* Firstly, are the drivers changing as we expected? It’s all very well to predict that a pedestrian will likely be knocked over if they step into the path of a truck, but the prediction can only be validated if they actually step off the curb! In the climate case, we need to know how well we estimated forcings (greenhouse gases, volcanic effects, aerosols, solar etc.) in the projections.

* Secondly, what is the uncertainty in that prediction given a particular forcing? For instance, how often is our poor pedestrian saved because the truck manages to swerve out of the way? For temperature changes this is equivalent to the uncertainty in the long-term projected trends. This uncertainty depends on climate sensitivity, the length of time and the size of the unforced variability.

* Thirdly, we need to compare like with like and be careful about what questions are really being asked. This has become easier with the archive of model simulations for the 20th Century (but more about this in a future post).

It’s worthwhile expanding on the third point since it is often the one that trips people up. In model projections, it is now standard practice to do a number of different simulations that have different initial conditions in order to span the range of possible weather states. Any individual simulation will have the same forced climate change, but will have a different realisation of the unforced noise. By averaging over the runs, the noise (which is uncorrelated from one run to another) averages out, and what is left is an estimate of the forced signal and its uncertainty. This is somewhat analogous to the averaging of all the short trends in the figure above, and as there, you can often get a very good estimate of the forced change (or long term mean).
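The ensemble averaging described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not output from any climate model: every "run" here is just the same assumed forced trend (0.02ºC/year) plus its own realisation of assumed independent Gaussian noise (0.1ºC), and the ensemble size and run length are hypothetical.

```python
import random
import statistics

def ols_slope(ys):
    """Least-squares slope of ys against 0, 1, 2, ... (units per step)."""
    n = len(ys)
    mx = (n - 1) / 2
    my = statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in range(n))
    return sum((x - mx) * (y - my) for x, y in enumerate(ys)) / sxx

rng = random.Random(42)
FORCED_TREND = 0.02        # assumed forced warming, C per year
NOISE_SD = 0.1             # assumed unforced variability, C
N_YEARS, N_RUNS = 8, 20    # hypothetical run length and ensemble size

# Each run has the same forced signal but a different realisation of the noise.
runs = [
    [FORCED_TREND * t + rng.gauss(0.0, NOISE_SD) for t in range(N_YEARS)]
    for _ in range(N_RUNS)
]
run_trends = [ols_slope(run) for run in runs]

# Ensemble mean: average the runs year by year, then fit a trend to the result.
ensemble_mean = [statistics.fmean(year_values) for year_values in zip(*runs)]
ensemble_trend = ols_slope(ensemble_mean)

# Spread of single runs versus uncertainty of the ensemble-mean trend.
spread = statistics.stdev(run_trends)
std_error = spread / N_RUNS ** 0.5

print(f"ensemble-mean trend:      {10 * ensemble_trend:.2f} C/decade")
print(f"spread of single runs:    {10 * spread:.2f} C/decade (1 sd)")
print(f"uncertainty of the mean:  {10 * std_error:.2f} C/decade (1 se)")
```

Because a least-squares fit is linear in the data, the trend of the ensemble mean equals the mean of the individual run trends. The two uncertainty figures answer different questions: the standard error says how well the forced trend is known, while the spread says what any single realisation might look like.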

Problems can occur, though, if the estimate of the forced change is compared directly to the real trend in order to see if they are consistent. You need to remember that the real world consists of both a (potentially) forced trend and a random weather component. This was an issue with the recent Douglass et al. paper, where they claimed the observations were outside the mean model tropospheric trend and its uncertainty. They confused the uncertainty in how well we can estimate the forced signal (the mean of all the models) with the distribution of trends+noise.

This might seem confusing, but a dice-throwing analogy might be useful. If you have a bunch of normal dice (‘models’) then the mean point value is 3.5 with a standard deviation of ~1.7. Thus, the mean over 100 throws will have a distribution of 3.5 +/- 0.17, which means you’ll get a pretty good estimate. To assess whether another dice is loaded, it is not enough to just compare one throw of that dice. For instance, if you threw a 5, that is significantly outside the expected value derived from the 100 previous throws, but it is clearly within the expected distribution.
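The numbers in the dice analogy can be checked directly. The short sketch below computes them exactly for a fair six-sided dice; the helper `within` is just an illustrative name, and the choice of a two-standard-error band for the mean is an assumption made for the example.

```python
import statistics

faces = [1, 2, 3, 4, 5, 6]
mean = statistics.fmean(faces)     # expected value of one throw: 3.5
sd = statistics.pstdev(faces)      # spread of a single throw: about 1.71
n_throws = 100
std_error = sd / n_throws ** 0.5   # uncertainty of the 100-throw mean: about 0.17

def within(value, centre, half_width):
    """True if value lies inside centre +/- half_width."""
    return abs(value - centre) <= half_width

throw = 5
print(f"mean {mean:.2f}, sd {sd:.2f}, standard error {std_error:.2f}")
print("inside mean +/- 2 standard errors:", within(throw, mean, 2 * std_error))
print("inside mean +/- 1 standard deviation:", within(throw, mean, sd))
```

A throw of 5 fails the first check and passes the second. The first band describes how well the mean is known; only the second describes what a single throw can look like, and that is the one a single throw has to be compared against.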

Bringing it back to climate models, there can be strong agreement that 0.2ºC/dec is the expected value for the current forced trend, but comparing the actual trend simply to that number plus or minus the uncertainty in its value is incorrect. This is what is implicitly being done in the figure on Tierney’s post.

If that isn’t the right way to do it, what is a better way? Well, if you start to take longer trends, then the uncertainty in the trend estimate approaches the uncertainty in the expected trend, at which point it becomes meaningful to compare them since the ‘weather’ component has been averaged out. In the global surface temperature record, that happens for trends longer than about 15 years, but for smaller areas with higher noise levels (like Antarctica), the time period can be many decades.
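How quickly the 'weather' averages out of a trend can be sketched with the textbook formula for the standard error of a least-squares slope. This is an idealised illustration: it assumes independent year-to-year noise with an assumed standard deviation of 0.1ºC, whereas real temperature noise is autocorrelated, which makes the true uncertainties larger than these.

```python
def trend_standard_error(n_years, noise_sd):
    """Standard error of an OLS slope fitted to n_years equally spaced
    annual values with independent noise of standard deviation noise_sd."""
    t_mean = (n_years - 1) / 2
    sxx = sum((t - t_mean) ** 2 for t in range(n_years))
    return noise_sd / sxx ** 0.5

NOISE_SD = 0.1  # assumed interannual noise in the global mean, C

for n in (8, 15, 30):
    se_per_decade = 10 * trend_standard_error(n, NOISE_SD)
    print(f"{n:2d}-year trend: +/- {se_per_decade:.3f} C/decade (1 sigma)")
```

Under these assumptions the one-sigma uncertainty falls from roughly 0.15ºC/decade for an 8-year trend to about 0.06ºC/decade at 15 years and 0.02ºC/decade at 30 years. The steep drop comes from the fact that the slope uncertainty scales roughly as the record length to the power -3/2, so doubling the length cuts it by nearly a factor of three.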

Are people going back to the earliest projections and assessing how good they are? Yes. We’ve done so here for Hansen’s 1988 projections, Stefan and colleagues did it for CO2, temperature and sea level projections from IPCC TAR (Rahmstorf et al, 2007), and IPCC themselves did so in Fig 1.1 of AR4 Chapter 1. Each of these analyses shows that the longer-term temperature trends are indeed what is expected. Sea level rise, on the other hand, appears to be under-estimated by the models for reasons that are as yet unclear.

Finally, this subject appears to have arisen from the expectation that some short-term weather event over the next few years will definitively prove that either anthropogenic global warming is a problem or it isn’t. As the above discussion should have made clear, this is not the right question to ask. Instead, the question should be: are there analyses that will be made over the next few years that will improve the evaluation of climate models? There the answer is likely to be yes. There will be better estimates of long-term trends in precipitation, cloudiness, winds, storm intensity, ice thickness, glacial retreat, ocean warming etc. We have expectations of what those trends should be, but in many cases the ‘noise’ is still too large for those metrics to be a useful constraint. As time goes on, the noise in ever-longer trends diminishes, and what gets revealed then will determine how well we understand what’s happening.

Update: We are pleased to see such large interest in our post. Several readers asked for additional graphs.

– UK Met Office data (instead of GISS data) with 8-year trend lines
– GISS data with 7-year trend lines (instead of 8-year).
– GISS data with 15-year trend lines

These graphs illustrate that the 8-year trends in the UK Met Office data are of course just as noisy as in the GISS data; that 7-year trend lines are of course even noisier than 8-year trend lines; and that things start to stabilise (trends getting statistically robust) when 15-year averaging is used. This illustrates the key point we were trying to make: looking at only 8 years of data is looking primarily at the “noise” of interannual variability rather than at the forced long-term trend. This makes as much sense as analysing the temperature observations from 10-17 April to check whether it really gets warmer during spring.

And here is an update of the comparison of global temperature data with the IPCC TAR projections (Rahmstorf et al., Science 2007) with the 2007 values added in (for caption see that paper). With both data sets the observed long-term trends are still running in the upper half of the range that IPCC projected.
