Category: Everyday Mathematics

Some thoughts about mathematics in daily life, for non-experts

Why are there probabilities in the forecast and why do they keep changing?

The UK Met Office has been making use of the remains of Hurricane Bertha to publicise the probabilistic aspects of their forecast. In particular, they have been publishing probabilities for various tracks of the storm across the UK.

Why is the forecast being made in terms of probabilities?

The classical idea of a forecast is a prediction of the precise value of something (e.g., temperature) at a particular point in space and time. This is referred to as a deterministic forecast. In a probabilistic forecast, this is instead expressed in terms of probabilities. For example, a forecast could say that there is a 60% chance that the temperature will fall between 21 and 25 degrees Celsius, a 20% chance that it will be below this range, and a 20% chance that it will be above.
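To make the link between a forecast distribution and these “bucket” probabilities concrete, here is a minimal sketch. The normal distribution and its parameters are purely illustrative assumptions, chosen so that the printed numbers roughly match the example above; they are not anything a real forecasting centre uses.

```python
# Reading bucket probabilities off an assumed forecast distribution.
# The normal distribution with mean 23 C and spread 2.4 C is an illustrative
# assumption only, chosen to roughly reproduce the 60% / 20% / 20% example.
from scipy.stats import norm

forecast = norm(loc=23.0, scale=2.4)

p_below = forecast.cdf(21.0)                          # P(T < 21)
p_in_range = forecast.cdf(25.0) - forecast.cdf(21.0)  # P(21 <= T <= 25)
p_above = 1.0 - forecast.cdf(25.0)                    # P(T > 25)

print(f"below 21C: {p_below:.0%}, between 21C and 25C: {p_in_range:.0%}, above 25C: {p_above:.0%}")
```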

The main reason for working with probabilities is that even with the huge amount of weather data available (satellites, weather stations etc.), forecasters are never completely certain about the current (and even past) state of the global weather system. This is partially due to the limits of accuracy of measurement devices, but mostly due to the fact that measurements cannot be taken at every single point in the atmosphere. This uncertainty is best expressed using probability, which can be used to quantify how certain forecasters are about the value of any meteorological quantity at any particular point around the globe. Having accepted that forecasters are uncertain about the current state of the weather system, it becomes clear that there will also be uncertainty in predictions of its future state. It is the job of forecasters to make the best use of models and the continuous stream of observational data to minimise this uncertainty, so that they can make as precise a forecast as possible.

One of the features of forecasting a dynamical system such as the weather is that if we are initially uncertain about the system state, the uncertainty in the forecast tends to get larger as we try to make forecasts further into the future. If the system is chaotic (as the weather system is largely accepted to be), then the level of uncertainty grows exponentially in time. This growth in uncertainty makes it harder and harder to predict the weather further and further into the future; this is why the Met Office only issues specific forecasts to the general public up to around 5 days ahead.
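The exponential growth of small uncertainties is easy to see in a toy chaotic system. The sketch below uses the Lorenz 1963 equations (a famous three-variable model of convection, nothing like an operational weather model) and simply tracks how quickly two almost identical initial states drift apart.

```python
# Two trajectories of the Lorenz 1963 system starting a tiny distance apart:
# their separation grows roughly exponentially until it saturates at the size
# of the attractor. A toy illustration of chaos, not a weather model.
import numpy as np

def lorenz_step(state, dt=0.001, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One forward-Euler step of the Lorenz 63 equations."""
    x, y, z = state
    rates = np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])
    return state + dt * rates

a = np.array([1.0, 1.0, 1.0])          # "true" initial state
b = a + np.array([1e-8, 0.0, 0.0])     # almost identical initial state

dt, steps_per_time_unit = 0.001, 1000
for t in range(1, 26):
    for _ in range(steps_per_time_unit):
        a, b = lorenz_step(a, dt), lorenz_step(b, dt)
    if t % 5 == 0:
        print(f"time {t:2d}: separation = {np.linalg.norm(a - b):.3e}")
```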

This problem of growing uncertainty is compounded by what is known as model error, i.e., errors due to the fact that the model is not a perfect model of reality. In weather models, these errors are mostly due to the multiscale nature of the atmospheric system, which has features ranging from thousands of kilometres in scale (planetary Rossby waves) down to the microscopic details of turbulence in clouds, for example. These features on different scales all interact, and precise knowledge of them would be required in order to produce a perfect forecast. In weather models, the atmosphere is divided up into cells, and features which are below the size of a cell cannot be represented directly. The total number of cells is limited by the requirement to run a forecast on a supercomputer quickly enough for it to be useful, and whilst the number of cells increases as computer technology progresses, there will always be a limit to how small they can be. Instead, the impact of features below the cell size on the weather must be included through “physics parameterisations”, which often work very well, but not in all situations, and which are an important ongoing topic of research.
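As a very rough illustration of what a finite grid can and cannot represent, here is a sketch with a made-up one-dimensional “field” containing one large-scale wave plus much smaller-scale wiggles. Averaging it onto coarse cells (loosely, what a finite-resolution model state amounts to) keeps the large wave and loses the wiggles; it is the effect of that lost part that parameterisations try to account for. Everything in the sketch is a toy assumption, not a description of a real model.

```python
# Toy sketch: a hypothetical 1D "atmospheric field" made of a large-scale wave
# plus small-scale fluctuations, averaged onto coarse model cells.
import numpy as np

n_fine, n_cells = 4000, 40                         # fine "truth" points, coarse model cells
x = np.linspace(0.0, 1.0, n_fine, endpoint=False)

large_scale = np.sin(2 * np.pi * x)                # feature the grid can resolve
small_scale = 0.3 * np.sin(2 * np.pi * 320 * x)    # feature far below the cell size
truth = large_scale + small_scale

# Each model cell only "sees" the average of the truth over that cell.
cell_means = truth.reshape(n_cells, -1).mean(axis=1)
resolved = np.repeat(cell_means, n_fine // n_cells)

print("variance of the full field:     ", round(truth.var(), 4))
print("variance captured by the cells: ", round(resolved.var(), 4))
print("unresolved (sub-cell) variance: ", round((truth - resolved).var(), 4))
```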

How are probabilistic forecasts made?

To make probabilistic forecasts, the forecasters’ uncertainty in the current state of the atmosphere is represented as an ensemble. This means that instead of storing a single representation of the atmospheric state, detailing the wind speed, direction, pressure, temperature, and moisture values in each cell, the forecast is made using a collection, or ensemble, of several alternative representations of the atmospheric state. In regions where atmospheric quantities are known to a good degree of certainty, the ensemble members have very similar values. The average of a value over all the ensemble members represents the “best guess”, whilst the average squared distance from this average, i.e. the variance, gives a quantification of the uncertainty in this value. In other regions where the quantities are less well known, over the oceans for example, the values that the ensemble members take are more spread out. To make the forecast, the weather model is run several times, once for each ensemble member. This allows forecasters to compute means, variances, histograms and other statistics (such as the possible paths of a storm and their probabilities) for future times. Due to the increase in uncertainty over time that is characteristic of the weather system, these variances get larger and larger as the forecast runs further ahead.
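Here is a minimal sketch of the ensemble idea, with a toy chaotic map standing in for the weather model (purely an illustrative assumption, nothing like an operational model): the uncertain initial state is perturbed into many members, each member is run forward, and means, spreads and event probabilities are read off the ensemble.

```python
# Minimal sketch of an ensemble forecast with a toy chaotic "model"
# (the logistic map), standing in for a weather model purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

def toy_model(state):
    """One 'forecast step' of the toy model (logistic map, chaotic at r = 4)."""
    return 4.0 * state * (1.0 - state)

n_members = 1000
best_guess = 0.3
# Initial uncertainty: each ensemble member starts from a slightly perturbed state.
ensemble = np.clip(best_guess + 0.001 * rng.standard_normal(n_members), 0.0, 1.0)

for step in range(1, 16):
    ensemble = toy_model(ensemble)
    if step % 5 == 0:
        mean, spread = ensemble.mean(), ensemble.std()
        p_event = (ensemble > 0.8).mean()   # e.g. "probability that the quantity exceeds 0.8"
        print(f"step {step:2d}: mean = {mean:.3f}, spread = {spread:.3f}, P(>0.8) = {p_event:.2f}")
```

As the steps go on, the spread grows and the event probability drifts away from 0 or 1, which is the toy-model version of forecast uncertainty growing with lead time.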

What is the use of a probabilistic forecast?

It seems frustrating to receive a probabilistic forecast rather than a deterministic one. However, probabilistic forecasts are extremely useful to policymakers, businesses and even the general public: they allow a quantification of risk. Rather than just saying that we don’t know which path a storm will take and how strong it will be, a probabilistic forecast provides an assessment of how likely various different scenarios are. For example, a deep low passing over the UK can cause storm surges, and it is important for the Environment Agency to be able to trade off the cost of evacuating a flood-prone area against the risk that a flood will occur. Probabilistic forecasts can also be used to help a business minimise wastage (in managing stocks of ice cream, for example).
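As a concrete (and entirely made-up) illustration of this kind of risk trade-off, the sketch below compares the expected cost of evacuating against the expected cost of waiting, for a range of forecast flood probabilities. The cost figures are hypothetical, chosen only to show how a probability feeds directly into the decision.

```python
# Minimal cost-loss sketch with made-up numbers: evacuate a flood-prone area
# when the expected loss from waiting exceeds the cost of acting.
def expected_costs(p_flood, cost_of_evacuation, loss_if_flooded):
    """Expected cost of each option, given the forecast probability of a flood."""
    act = cost_of_evacuation            # paid whether or not the flood happens
    wait = p_flood * loss_if_flooded    # damage incurred only if the flood occurs
    return act, wait

# Hypothetical figures: evacuation costs 2 (say, million pounds), an unmitigated flood costs 30.
for p in (0.02, 0.05, 0.10, 0.30):
    act, wait = expected_costs(p, cost_of_evacuation=2.0, loss_if_flooded=30.0)
    decision = "evacuate" if wait > act else "do not evacuate"
    print(f"P(flood) = {p:.2f}: expected cost of acting = {act:.1f}, of waiting = {wait:.1f} -> {decision}")
```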

Why do the probabilities keep changing?

In the days leading up to the arrival of the low pressure zone formerly known as Hurricane Bertha, the Met Office issued several updates to their probabilistic forecasts, each time showing a different set of probabilities for various tracks. It is easy to interpret this as the Met Office getting the forecast wrong, and then trying to correct it. However, what is actually happening is that the probabilities are being updated in the light of new observational data. The crucial point here is that the probabilities represent uncertainty in the forecast given all of the information that has been available so far. In the language of probability, these are called conditional probabilities. If different observational data (and past experience of the forecasters) had been available, then the probabilities would be different. And when new observational data becomes available, the conditional probabilities must be updated to reflect this new information. There is a powerful mathematical formula for updating these probabilities known as Bayes’ Formula, named after Rev. Thomas Bayes, the 18th Century Presbyterian minister who discovered it.

To understand conditional probability, consider the following question: “What is the probability of rolling two fair 6-sided dice and getting a total score of 8?”. A bit of counting leads to the answer 5/36. However, if we receive a further piece of information, namely that the first die shows 2, then this value is updated to 1/6, which is the chance of getting a 6 on the other die. We see that the probability gets updated in the light of new information. In this example the probabilities can be computed directly (they are checked by brute force just below), but Bayes’ Formula becomes very powerful in more complex situations, such as forecasting the weather.

So, when the Met Office update their probabilities, it does not mean that the previous probabilities were wrong, just that new information became available (in the form of new observational data from later times), and so new conditional probabilities had to be computed. The mathematical process of blending new observational data with model simulations to update conditional probabilities is called data assimilation. In the Met Office operational forecast cycle, new data is incorporated into the forecast every 6 hours.
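Here is that brute-force check of the dice numbers. It counts outcomes directly and also applies the definition of conditional probability, P(A|B) = P(A and B)/P(B), which is the basic ingredient that Bayes’ Formula is built from.

```python
# Checking the dice example by brute force: P(total = 8), and the conditional
# probability P(total = 8 | first die shows 2) computed via the definition
# P(A|B) = P(A and B) / P(B).
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))    # all 36 equally likely rolls

def prob(event):
    """Probability of an event (a predicate on a roll) under two fair dice."""
    hits = sum(1 for roll in outcomes if event(roll))
    return Fraction(hits, len(outcomes))

def total_is_8(roll):
    return sum(roll) == 8

def first_is_2(roll):
    return roll[0] == 2

p_total_8 = prob(total_is_8)
p_total_8_given_first_2 = prob(lambda r: total_is_8(r) and first_is_2(r)) / prob(first_is_2)

print("P(total = 8)              =", p_total_8)                 # 5/36
print("P(total = 8 | first is 2) =", p_total_8_given_first_2)   # 1/6
```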

[The book “Probabilistic Forecasting and Bayesian Data Assimilation: A Tutorial” by Sebastian Reich and Colin Cotter will shortly be published by Cambridge University Press.]

Correlation, causation. Again.

I’m mainly posting this because I’m getting tired of explaining it repeatedly! There are plenty of other better written articles about this topic but they don’t make the combination of points that I would like to make. This post is about correlation, and what you can and can’t use it for. More generally it is about being careful about drawing conclusions from data.

It is very easy to jump to conclusions when we see changes in the world around us. For example, you might be looking at the success of a particular vaccine in protecting populations from a particular disease. Let’s say that in the countries where the population is given this vaccine, the disease levels are lower. Is the vaccine effective? The answer is: maybe. It could just be a coincidence. With this amount of information, we don’t know. It could be that the countries that can afford to provide the vaccine are richer, and their people are less susceptible to the disease because of improved nutrition. It could be that people in those countries have a natural immunity for some reason. Determining whether the vaccine is effective would require a much more carefully designed trial under controlled conditions.

Noticing a connection between two datasets like this is an example of a “correlation”. Correlations are particularly easy to spot in data which changes over time, i.e. time series. For example, a government may make some kind of change in crime policy, say increasing the minimum sentence for burglary. Suppose that burglaries then fall over the next 5 years. Is the policy working? Maybe, but we don’t know from this information alone. The number of burglaries could depend on a range of factors, including changes in policing, the state of the economy, recreational drug policy, the changing risk/reward balance of other types of theft, etc. Without more careful trials, the fall in burglaries doesn’t tell us anything. Not a thing.

There is quite a bit of confusion about this. Most people have been told that “correlation does not imply causation” i.e. spotting a pattern between two datasets or time series does not prove that one thing causes another. However, I still hear people making the argument “I know that correlation does not imply causation, but …”, i.e. they know that a causal link is not certain, but they are saying that the correlation is adding to the evidence that the causal link may exist. The point is, correlation does not add to the evidence at all, unless all other factors have been painstakingly removed in a controlled environment.

When I try to explain this, people often say “ah, but in the real world, we have to make a decision based on a lower standard of proof”. This is true: criminals can be convicted “beyond reasonable doubt”, which is definitely short of the absolute certainty you get from a mathematical proof, and in most civil cases the “balance of probabilities” is the standard. However, that does not mean that you are abandoning logic. It just means that you are making a decision to convict if there is sufficient probability (the exact amount depending on the situation) that the crime was committed. It is the job of judge and jury to try to estimate this probability (assisted by experts, and usually not quantified as a number!). And my point is that not only does a correlation between two datasets not prove that A causes B, it does not even increase the probability that A causes B by any amount. This is because using correlation to prove causation, or even to increase the probability of causation, is a logical fallacy.

I’ll illustrate this with some graphical examples. I’ve simulated a time series of a quantity X, illustrated below. Let’s say it is the price of corn in Dakar (it isn’t).
[Plot of the simulated time series X]
I’ll now take each value of X, multiply by 2, and add 10, and call that Y. This is plotted below.

[Plot of X together with Y = 2X + 10]

We see that when X goes up, Y goes up, and when X goes down, Y goes down. This is because the value of Y is simply the value of X, scaled by 2 and shifted by 10.

There is a mathematical definition of correlation between two variables, given here (not necessary to understand this post), with notation Corr(X,Y). When Y is simply X scaled by a positive number and shifted, then Corr(X,Y)=1, the maximum value, and we say X and Y are maximally correlated. If Y is X scaled by a negative number and shifted, then Corr(X,Y)=-1, the minimum value, and we say X and Y are minimally correlated.

[Plot of X together with a negatively scaled and shifted Y]

We see that Y now goes down when X goes up and vice versa. If I choose Y to be a more complicated function of X, for example each Y value is obtained by computing 10 log (X), then the correlation is lower.
[Plot of X together with Y = 10 log(X)]

Notice that Y is still correlated with X in the sense described above, but here I computed Corr(X,Y)=0.948. This is because the relationship is not linear i.e. it is not just a scaling and shifting. The more “nonlinear” the relationship is, the lower the correlation is. If I try an even more complicated function, Y = sin(sin(X)), then I get Corr(X,Y)=0.055, a very low correlation, and the plot is shown below.

[Plot of X together with Y = sin(sin(X))]
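For anyone who wants to reproduce this kind of experiment, here is a minimal sketch. The series X below is an arbitrary positive random series of my own choosing (the simulations used for the plots above are not given), so the nonlinear correlations it prints will differ from the 0.948 and 0.055 quoted above; the linear cases, however, always come out as exactly 1 and -1.

```python
# Minimal sketch: how the Pearson correlation behaves under the transformations
# discussed above. X is an arbitrary positive, drifting random series (an
# illustrative assumption, not the series used for the plots in this post).
import numpy as np

rng = np.random.default_rng(1)
X = 20.0 * np.exp(np.cumsum(rng.normal(0.0, 0.08, size=800)))  # stand-in for "the price of corn"

def corr(a, b):
    """Pearson correlation coefficient Corr(a, b)."""
    return np.corrcoef(a, b)[0, 1]

print("Corr(X, 2X + 10)    =", round(corr(X, 2 * X + 10), 3))         # exactly  1: positive scale and shift
print("Corr(X, -2X + 10)   =", round(corr(X, -2 * X + 10), 3))        # exactly -1: negative scale and shift
print("Corr(X, 10 log X)   =", round(corr(X, 10 * np.log(X)), 3))     # below 1: nonlinear but monotone
print("Corr(X, sin(sin X)) =", round(corr(X, np.sin(np.sin(X))), 3))  # much smaller: strongly nonlinear
```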

Another typical situation is that Y could be equal to X, plus some other random component which is smaller in magnitude (perhaps Y is also affected by another process, say the price of wheat in Moscow). An example of this is plotted below.

[Plot of X together with Y = X plus an independent random component]

Here there is some relationship between the data, and Corr(X,Y)=0.9.
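The effect of an added independent component can be checked in the same way. In the sketch below (again with made-up series, not the ones used for the plots), the bigger the independent part is relative to X’s own variability, the lower the correlation falls.

```python
# Y equals X plus an independent random component (the hypothetical "price of
# wheat in Moscow" effect): the larger that component, the lower the correlation.
import numpy as np

rng = np.random.default_rng(2)
X = np.cumsum(rng.normal(0.0, 1.0, size=500))      # some simulated series
noise = rng.normal(0.0, 1.0, size=X.size)          # independent of X

for scale in (0.1, 0.5, 2.0):
    Y = X + scale * X.std() * noise                # independent part scaled relative to X's spread
    print(f"noise {scale} x std(X): Corr(X, Y) = {np.corrcoef(X, Y)[0, 1]:.2f}")
```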

Finally, I’ll simulate Y to be a completely independent time series, let’s say the popularity rating of the Madagascan president (it isn’t).
[Plot of X together with an independently simulated time series Y]

As it happens (I promise this is the first simulation I did, no cherry picking), there is quite a bit of parallel drift between these two time series, and the correlation is Corr(X,Y)=0.48 which is not too small.

The point here is that I have been building a causal link into these experiments. Each time, except for the last, I simulated a variable X, and then computed Y as some transformation of X. This means that X causes the change in Y, and we know this because I defined it that way. However, if you just look at my second graph, and even compute the correlation as a number, you have no way of knowing that X caused Y. I could just as easily have simulated Y, and then obtained X by subtracting 10 and dividing by 2. It could even be that X and Y have both been computed from a third variable, Z, in which case Z causes the change in both X and Y.

Going further, in the case Y = sin(sin(X)), the changes in Y are completely caused by X, but the correlation is low because the causal relationship is complicated. The correlation also decreased when Y was partially caused by X but also had some other independent variability. Finally, even when Y was completely independent of X in the final example, there were still some similarities in the data, and the correlation was actually much bigger than in the sin(sin(X)) example.

So, what can we use correlation for? Well, if we have two independent processes (i.e. there is no causal link between them), then on average, their correlation will be zero. The two particular processes I chose in the last example turned out to have some correlation, but if I had obtained longer data series, this correlation would have been closer to zero. So if there is a lack of correlation on a large enough dataset, then you can rule out causation. To know how large is large enough, you need to ask a statistician (who will calculate for you the probability of the observed correlation occurring by chance).

If you have two time series or datasets X and Y that are correlated, all you have done is suggest a plausible hypothesis that X causes Y. You must also consider that Y causes X, or that there is a third factor Z that causes both X and Y. If you want to prove the hypothesis, or estimate the probability that it is true, correlation tells you nothing. One route is to eliminate all other plausible hypotheses, which is essentially how proof is carried out in courts; if only the balance of probability is required, then it may be that only the probable hypotheses need to be eliminated. In scientific research, if we are comparing different hypotheses, we usually seek a mechanistic process that explains how X could cause Y, and then try to calculate or estimate the probability that this could happen from theory or experiment (or a combination of the two). This probability must then be compared with other hypotheses before deciding the most likely cause.
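To see the first claim in action, here is a minimal sketch using two independent series of pure noise (an illustrative choice of process, not the series simulated above): for short series, the sample correlation is quite often noticeably different from zero just by chance, and that chance correlation shrinks towards zero as the series get longer.

```python
# Two completely independent series still show some sample correlation by
# chance; with more data, that chance correlation shrinks towards zero.
# The series here are independent Gaussian noise (an illustrative choice only).
import numpy as np

rng = np.random.default_rng(3)

for n in (20, 200, 2000, 20000):
    corrs = [np.corrcoef(rng.standard_normal(n), rng.standard_normal(n))[0, 1]
             for _ in range(200)]                  # 200 repeat experiments of length n
    typical = np.percentile(np.abs(corrs), 95)     # typical size of a chance correlation
    print(f"n = {n:6d}: 95% of |Corr| values fall below {typical:.3f}")
```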

If you are interested in how you might prove (beyond any doubt whatsoever) that two independent processes have zero correlation on average, or you are interested in how I simulated these datasets (or both), then you might be interested in doing a degree in Mathematics at Imperial College London.