Friday, January 18, 2013
The calibrated estimator contest
In this post I’ll talk about a game about estimating with confidence weights, and about verifying how “calibrated” the estimates are. To keep the game simpler, I will use just two sizes (Small and Large).
If the experiment succeeds, that is, if we find calibrated estimators with certainty levels closer to 100% than to 50%, then more complex experiments with more sizes can follow. The aim of the game is not to answer the question of whether estimates must be done, but only whether estimates “may” be done. A paradoxical result would be if the winner, the most “calibrated” estimator, turned out to be the one who gives the maximum uncertainty (50%), confirming that the “right answer” would have been flipping a coin.
A calibrated estimator is someone who is able to correctly weight, using confidence intervals (or probabilities in the case of binary true/false questions), her uncertainty about an unknown value.
To measure your skill as a calibrated estimator, you are asked to guess some unknown values by answering questions (like: "The internet 'Arpanet' was established as a military communications system in what year?"). Each answer must be a 90% confidence interval (an interval that you would bet contains the real value with 90% probability). You are considered a calibrated estimator if you are right about 9 times out of 10.
(Of course this applies only if you don't know or don't remember the real value!)
People are mostly overconfident (i.e. when they claim 90%, they are right only 60%, or 70% at most), at least before practicing with some calibration exercises.
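As a quick illustration of the interval check above, here is a minimal sketch in Python; the intervals and true values are made up just to show the bookkeeping.

```python
# Minimal sketch of checking 90% confidence-interval calibration:
# count how many of your stated intervals actually contain the true value.

def interval_hit_rate(estimates):
    """estimates: list of (low, high, true_value) tuples."""
    hits = sum(1 for low, high, true in estimates if low <= true <= high)
    return hits / len(estimates)

# Ten trivia answers given as 90% intervals (made-up numbers):
estimates = [(1965, 1975, 1969), (100, 300, 220), (5, 20, 30),
             (1900, 1950, 1912), (10, 50, 42), (200, 800, 650),
             (1, 10, 12), (30, 90, 75), (1800, 1870, 1859), (2, 6, 4)]
print(interval_hit_rate(estimates))  # 0.8 -> overconfident for a claimed 90% level
```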
For binary (e.g. true/false) questions the procedure is as follows: you answer the questions, give certainty levels like 50%, 60%, ... 100%, and convert those values into probabilities (0.5, 0.6, ... 1). You are considered a calibrated estimator if the sum of those probabilities is close to the number of your correct answers.
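Again just as an illustration (the numbers and the helper name are my own), the binary check boils down to comparing two totals:

```python
# Minimal sketch of the binary calibration check: compare the sum of the
# stated probabilities with the number of answers that were actually correct.

def binary_calibration(answers):
    """answers: list of (stated_probability, was_correct) tuples."""
    expected = sum(p for p, _ in answers)                 # what the confidences predict
    actual = sum(1 for _, correct in answers if correct)  # what really happened
    return expected, actual

# Ten true/false answers with their stated certainty levels (made-up data):
answers = [(0.9, True), (0.6, False), (0.8, True), (0.5, True), (1.0, True),
           (0.7, False), (0.9, True), (0.6, True), (0.5, False), (0.8, True)]
print(binary_calibration(answers))  # roughly (7.3, 7): the closer these two, the better calibrated
```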
Here I want to suggest an application in the form of a game that I would call "the calibrated estimator contest".
This applies to any situation where you (or a team) have a todo list of work items represented by cards or sticky notes.
1) Consider two sizes for any work item: Small and Large.
2) Select some high priority items from the todo list (10 of them is a good number, I think), and for each of them invite each player to secretly write her Large/Small guess together with a probability (from 50% to 100%).
3) Ask each player to put her guesses in a closed envelope with her name on it.
Start working and track the actual time.
It is suggested to track in real time using "prisoner metrics", i.e. marking a tally for each time quantum spent.
After the items are done:
Order the items by actual timeboxes in increasing order, from left to right (this should be easy if you used prisoner metrics).
Spot the median, and consider the items to the left of the median as Small and the items to the right of the median as Large (as for the median item itself: put it in the less numerous set, or flip a coin); one way to code this split is shown in the sketch after the scoring steps below.
Each player then opens the envelope, takes her guesses and does the following:
Count how many of the actual Large or Small values correspond to her guesses.
Then compute the sum of the probabilities given with her guesses.
The most calibrated player (i.e. the one whose sum of probabilities is closest to the actual number of times she was right) will be celebrated as the winner.
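Here is a minimal sketch of the whole scoring procedure, assuming my reading of it is right; the item names, the numbers, and the simplification of sending an odd-count median item to the Large set are illustrative assumptions, not part of the game definition.

```python
# Minimal sketch of the scoring: split items at the median of the actual
# timeboxes, then score each player by the gap between the probabilities
# she claimed and the number of guesses that actually matched.

def classify_by_median(actuals):
    """actuals: item -> actual timeboxes. Lower half is Small, upper half Large."""
    ordered = sorted(actuals, key=actuals.get)
    half = len(ordered) // 2   # simplification: with an odd count, the median item goes to Large
    return {item: ('S' if i < half else 'L') for i, item in enumerate(ordered)}

def calibration_score(guesses, labels):
    """guesses: item -> (guessed_label, probability); labels: item -> actual label.
    Returns the gap between claimed probabilities and actual hits: the lowest score wins."""
    hits = sum(1 for item, (guess, _) in guesses.items() if labels[item] == guess)
    claimed = sum(p for _, p in guesses.values())
    return abs(claimed - hits)

# Actual timeboxes tracked while working (made-up numbers):
actuals = {"login": 3, "report": 8, "search": 5, "export": 12, "signup": 2, "billing": 9}
labels = classify_by_median(actuals)   # signup/login/search -> 'S', report/billing/export -> 'L'

# One player's envelope (guess + probability for each item):
player = {"login": ('S', 0.8), "report": ('L', 0.6), "search": ('S', 0.9),
          "export": ('L', 1.0), "signup": ('S', 0.7), "billing": ('S', 0.5)}
print(calibration_score(player, labels))   # claimed 4.5 vs 5 hits -> score 0.5
```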
If I mark all my guesses as 100%, I’ll win if all the actuals correspond to my guesses. I should make this kind of bet if I think there is high predictability (at least when comparing the items relative to each other).
If my guesses are all 50%, then I win if I am right 50% of the time; more or less, I play this way if I think there is no better guess than flipping a coin.
This can be a fun exercise for deciding whether we “can” estimate.
In fact we “can” estimate if:
1) there are calibrated estimators
2) the certainty given by those estimators is closer to 100% than to 50%.
I said we “can”, not we “should”; “should we” is another story.
The "wisdom of the crowd" variant is another possible game activity: make an average of all the guesses, and find how good they are.
Note: I have not tried the game yet. I’d suggest anyone interested give it a try, but there are no guarantees. I’m looking for feedback about it.
Here you can see part of the book "How to Measure Anything" that inspired me, particularly the part about "calibration".
Variation for “estimating the value”: another possibility is guessing the "value" of a feature, for example providing a confidence interval for how many times a new feature will be used. The procedure essentially should not be much different; I may blog about it later.
That’s all, thanks for reading; I’d be glad to get feedback.
p.s. warning: I did not quote definitions from the book, so I may have made some mistakes. I apologize just in case. You are invited to read excerpts from the original book in case of doubt (and hopefully let me know about any mistakes).