But if both forecasters are imperfect, how do we use the data to tell us which forecaster was better? Or how good each one is on a scale with pure guess at one end and a crystal ball at the other end? How can we estimate the degree to which a forecaster was over-confident or under-confident in his own forecasting ability? What about bias?
Nick's question had to do with probabilistic forecasts. This problem isn't actually covered in any econometrics or probability classes. You have to go to a meteorology department to figure out how to do this without making stupid mistakes.
A good example of why good skill scores are needed is the following model, which was actually used by a real weather forecaster back in the day. It predicts, on every given day, that there will not be a tornado in a given town. The forecaster claimed that he would be 98% right, but what we care about is the day when there in fact is a tornado. He got no false positives, but far too many (100%) false negatives. His score was "biased." (This is very different from what econometricians call bias.)
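To see why the 98% figure is hollow, here's a minimal sketch with hypothetical counts (the 1000-day sample and 2% tornado rate are assumptions for illustration, not the forecaster's actual record):

```python
# Hypothetical counts for illustration (not the original forecaster's data):
# 1000 days, tornadoes on 20 of them, and a forecaster who always says "no tornado."
tornado_days = 20
quiet_days = 980

correct = quiet_days                       # right on every quiet day, wrong on every tornado day
accuracy = correct / (tornado_days + quiet_days)

print(f"accuracy = {accuracy:.0%}")               # 98% -- sounds impressive
print(f"tornadoes caught = 0 of {tornado_days}")  # but the forecast is useless when it matters
```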
The Heidke skill score is a better measure of forecasting skill.
Here's a really simple example for yes-no answers:
a = Forecast = Yes, Reality = Yes.
b = Forecast = Yes, Reality = No.
c = Forecast = No, Reality = Yes.
d = Forecast = No, Reality = No.
We come up with the Heidke skill score by comparing how our model does against random guessing, and how a perfect model does against random guessing. Then, to make the numbers nice, we take the ratio of those two results:
HSS = (number correct - expected number correct by random guessing)/(perfect model's number correct - expected number correct by random guessing)
This simplifies to:
$$HSS = \frac{2(ad - bc)}{(a+c)(c+d) + (a+b)(b+d)}$$ using the above definitions.
An HSS of one means a perfect forecaster, a zero means the forecaster has no skill, and a negative value says that random guessing is actually better than the forecaster.
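Here's a minimal sketch of the computation (the function name is mine, not a standard library's), applied to the tornado forecaster from earlier:

```python
def heidke_skill_score(a, b, c, d):
    """Heidke skill score from a 2x2 contingency table.

    a: forecast yes, reality yes    b: forecast yes, reality no
    c: forecast no,  reality yes    d: forecast no,  reality no
    """
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (a + b) * (b + d))

# A perfect forecaster over the hypothetical 1000 days above: HSS = 1.0
print(heidke_skill_score(a=20, b=0, c=0, d=980))

# The "always no tornado" forecaster: 98% accurate, but HSS = 0.0 (no skill)
print(heidke_skill_score(a=0, b=0, c=20, d=980))
```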
There are many other types of skill scores. They differ based on how they treat rare events, non-events, and systematic vs. random errors. You can extend skill scores from a 2x2 table to a larger table for more complex forecasts. This won't do for probabilistic forecasts, however.
For probabilistic forecasts, instead of weighing false positives vs false negatives, you are weighing sharpness vs reliability. Here is a skill score for probabilistic forecasts:
The Ignorance Skill Score
Let f be the predicted probability of an event occurring, lying on the open interval (0,1). (The ignorance skill score assumes that we are never 100% sure about anything.) Also, the ignorance skill score has units of "bits." Yes, it's the same thing we talk about when we speak of "bits" in a computer. It traces its foundations to information theory.
And let:
$$\text{Ignorance}_t(f_t) = -\log_2(f_t)$$ when the event happens at time period t,
$$\text{Ignorance}_t(f_t) = -\log_2(1 - f_t)$$ when the event does not happen at time period t, and
T = the number of time periods t.
The expected ignorance is computed the normal way:
$$\text{Ignorance}(f)=\frac 1 {T}\sum_t \text{Ignorance}_t(f_t)$$
Standard errors for our estimate of ignorance are also computed the normal way.
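Here's a minimal sketch of that calculation (the function and variable names are my own, chosen for illustration):

```python
import math

def ignorance(forecasts, outcomes):
    """Average ignorance, in bits per forecast.

    forecasts: predicted probabilities of the event, each strictly inside (0, 1)
    outcomes:  True/False for whether the event actually happened each period
    """
    scores = []
    for f, happened in zip(forecasts, outcomes):
        p = f if happened else 1.0 - f   # probability assigned to what actually occurred
        scores.append(-math.log2(p))     # ignorance for this time period
    return sum(scores) / len(scores)     # average over the T periods

# Compare two forecasters on the same events: lower ignorance is better.
outcomes = [True, False, True]
print(ignorance([0.8, 0.2, 0.7], outcomes))   # a sharp, reliable forecaster
print(ignorance([0.5, 0.5, 0.5], outcomes))   # pure guessing: exactly 1 bit per forecast
```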
So, back to your original question, "how do we compare probabilistic forecasters?" We can compare the two forecasters' average ignorance over the same events; the one with lower ignorance is the better forecaster.
Here is a more intuitive way to understand it:
Let's define a function that is "a measure of the information content associated with the outcome of a random variable."
Since it's a measure of information, it should have the following properties:
1) This measure of event $$A_i$$ happening depends only on the probability $$p_i$$ of $$A_i$$ happening.
2) It's a strictly decreasing function of $$p_i$$: the higher the probability of event $$A_i$$, the less information we gain when it happens.
3) It's a continuous function of $$p_i$$. We want infinitesimal changes in probability to cause only infinitesimal changes in information.
4) If an event $$A_i$$ is the intersection of two independent events $$B_i$$ and $$C_i$$, then the information we gain when we find out $$A_i$$ has happened should equal the information we gain from finding out $$B_i$$ has happened plus the information we gain from finding out $$C_i$$ has happened.
Said another way, if $$p_1 = p_2 \cdot p_3$$ then $$I(p_1) = I(p_2) + I(p_3)$$.
Luckily, there is only one class of functions that fulfill these criteria:
$$I(\text{event } x) = k \log(p(x))$$
Now, k can be any negative number, so we pick k to give us units of bits.
This gives us:
$$I(\text{event } x) = \frac{1}{\ln(2)} \ln\!\left(\frac{1}{p(x)}\right) = -\log_2(p(x))$$ where $$p(x)$$ is the probability of that event happening.
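A quick numerical check of this formula, including property 4 above (function names are mine, for illustration):

```python
import math

def info_bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

print(info_bits(0.5))    # 1 bit: a fair coin flip
print(info_bits(0.25))   # 2 bits: a rarer event carries more information

# Property 4: the information from two independent events together
# equals the sum of the information from each one.
p1, p2 = 0.25, 0.5
assert math.isclose(info_bits(p1 * p2), info_bits(p1) + info_bits(p2))
```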
Now let's define a sort of measure of our surprise. This is the information we gain from seeing the results of our predictions. If the event happened, the information we gained from our probability forecast was $$-\log_2(f_t)$$. However, if the event did not happen, what occurred is the outcome we assigned probability $$1 - f_t$$ to, so we gain $$-\log_2(1 - f_t)$$ bits instead.
Let's work this out for a series of events:
We think there's a 10% chance of Bill winning an election in 2009. Bill loses, so we gain $$-\log_2(0.9)\approx0.15$$ bits of info. We gained very little information, because the 10% chance that Bill wins means we are close to certain that he will lose, and we were right.
We think there's a 90% chance of Bill winning his election in 2010. Bill wins, so we gain $$-\log_2(0.9)\approx0.15$$ bits. Again, we gain very little information, because 90% is close to certainty.
Bill gets caught cheating on his wife with a goat before the 2011 election. We now think there's a 1% chance of Bill winning. He manages to win. We are very surprised! We gain a lot of information this time. We gain $$-\log_2(0.01)\approx6.64$$ bits.
Bill later turns out to have done a great job in office, despite the scandal. We think there is a 90% chance that he gets reelected in 2012. But we are surprised; he loses. We gain $$-\log_2(0.1)\approx3.32$$ bits of information.
Our total information gained for the 4 events is:
10.27 bits of information.
Our expected ignorance as a forecaster is about:
$$\frac {10.27} { 4 }\approx2.57$$ bits per forecast.
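The same arithmetic as a short self-contained sketch (mirroring the ignorance definition above; the layout is mine):

```python
import math

# Bill's four elections: (forecast probability that Bill wins, did he win?)
forecasts = [(0.10, False),   # 2009: expected a loss, he loses  -> ~0.15 bits
             (0.90, True),    # 2010: expected a win,  he wins   -> ~0.15 bits
             (0.01, True),    # 2011: huge surprise              -> ~6.64 bits
             (0.90, False)]   # 2012: moderate surprise          -> ~3.32 bits

scores = [-math.log2(f if won else 1 - f) for f, won in forecasts]

print(sum(scores))                  # total information gained: ~10.27 bits
print(sum(scores) / len(scores))    # average ignorance: ~2.57 bits per forecast
```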