Friday, January 18, 2013

The calibrated estimator contest


In this post I’ll talk about a game about estimating with probability weights, and verifying how “calibrated” the estimates are. I will use just two sizes (Small and Large) to keep the game simple.
If the experiment succeeds, meaning we find calibrated estimators whose certainty level is closer to 100% than to 50%, then more complex experiments with more sizes can follow. The aim of the game is not to answer the question of whether estimates must be done, but only whether estimates “may” be done. A paradoxical result would be if the winner, the most “calibrated” estimator, turned out to be the one who gives the maximum uncertainty (50%), confirming that the “right answer” would have been flipping a coin.


A calibrated estimator is someone who is able to correctly weight, using confidence intervals (or probabilities in the case of binary true/false questions), his uncertainty about an unknown value.
To measure your skill as a calibrated estimator, you are asked to guess some unknown values by answering questions (like: "The internet 'Arpanet' was established as a military communications system in what year?"). Each answer must be a 90% confidence interval (an interval that you would bet contains the real value with 90% probability). You are considered a calibrated estimator if you are right about 9 times out of 10.
(Of course this applies only if you don't know or don't remember the real value!)
People are mostly overconfident (i.e. when they claim 90%, they are right only 60% of the time, 70% at most), at least before practicing with some calibration exercises.
For binary (e.g. true/false) questions the procedure is as follows: you answer each question and give a certainty level like 50%, 60%, ... 100%, then convert those values into probabilities (0.5, 0.6, ... 1). You are considered a calibrated estimator if the sum of these probabilities is close to your number of correct answers.
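As a rough illustration, here is a minimal Python sketch of both checks described above (the data and function names are my own invention, not taken from the book): whether 90% intervals contain the real value about 9 times out of 10, and whether the sum of the stated probabilities for binary answers is close to the number of correct answers.

```python
# Hypothetical sketch of the two calibration checks described above.

def interval_calibration(answers):
    """answers: list of (low, high, real_value) for 90% confidence intervals."""
    hits = sum(1 for low, high, real in answers if low <= real <= high)
    return hits / len(answers)  # a calibrated estimator lands near 0.90

def binary_calibration(answers):
    """answers: list of (stated_probability, was_correct) for true/false questions."""
    claimed = sum(p for p, _ in answers)         # sum of the stated probabilities
    correct = sum(1 for _, ok in answers if ok)  # number of correct answers
    return claimed, correct                      # calibrated when these are close

# Example with made-up data:
intervals = [(1965, 1975, 1969), (1950, 1960, 1969)]  # the second interval misses
print(interval_calibration(intervals))                # 0.5 -> overconfident here

binary = [(0.9, True), (0.6, False), (0.7, True)]
print(binary_calibration(binary))                     # (2.2, 2) -> fairly calibrated
```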
Here I want to suggest an application in the form of a game that I would call “the calibrated estimator contest”.

This applies in any situation where you (or a team) have a todo list of work items represented by cards or sticky notes.

1) consider two sizes for any work item: Small and Large
2) select some high-priority items (10 of them is a good number, I think) from the todo list, and for each of them invite each player to secretly write her Large/Small guess together with a probability (from 50% to 100%).
3) ask each player to put her guesses in a closed envelope with her name on it.

Start working and track the actual time.
I suggest tracking in real time using "prisoner metrics", i.e. making a checkmark for each quantum of time.

After the items are done:

Order the items by actual timeboxes, increasing from left to right (this should be easier if you used prisoner metrics).
Spot the median, and consider the items to the left of the median as Small and the items to the right of the median as Large (as for the median item itself: put it in the less numerous set, or flip a coin).
Each player then opens her envelope, takes out her guesses and does the following:
Count how many of the actual Large or Small values correspond to her guesses.
Then compute the sum of the probabilities given with her guesses.

The one who is more calibrated than the others (i.e. whose sum of probabilities is closest to the actual number of times she was right) will be celebrated as the winner.
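To make the scoring concrete, here is a small Python sketch of how I imagine the end-of-game computation (the data structures and names are just my assumptions, not a prescribed format): order the items by actual timeboxes, split them at the median into Small and Large, and score each player by the distance between the sum of her stated probabilities and her number of correct guesses.

```python
# Hypothetical scoring sketch for the contest: a lower score means better calibrated.

def actual_sizes(actual_timeboxes):
    """Order items by actual timeboxes and label the lower half 'S', the upper half 'L'."""
    ranked = sorted(actual_timeboxes, key=actual_timeboxes.get)
    half = len(ranked) // 2  # simple split: with an odd count the median item goes to Large here (a coin flip also works)
    return {item: ('S' if i < half else 'L') for i, item in enumerate(ranked)}

def calibration_score(guesses, actuals):
    """guesses: {item: (size_guess, probability)}; actuals: {item: actual_size}."""
    correct = sum(1 for item, (size, _) in guesses.items() if actuals[item] == size)
    claimed = sum(p for _, p in guesses.values())
    return abs(claimed - correct)  # the winner is the player with the lowest score

# Example with made-up numbers:
timeboxes = {'A': 3, 'B': 8, 'C': 2, 'D': 5}
actuals = actual_sizes(timeboxes)  # {'C': 'S', 'A': 'S', 'D': 'L', 'B': 'L'}
alice = {'A': ('S', 0.8), 'B': ('L', 0.9), 'C': ('S', 0.6), 'D': ('S', 0.7)}
print(calibration_score(alice, actuals))  # |3.0 - 3| = 0.0
```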

Use cases:
If I mark my guesses as 100%, I win if all the actuals correspond to my guesses. I should make this kind of bet if I think there is high predictability (at least when comparing the items relative to one another).
If my guesses are all 50%, then I win if I am right 50% of the time, so more or less I play this way if I think there is no better guess than flipping a coin.
This can be a fun exercise for deciding whether we “can” estimate.
In fact we “can” estimate if:
1) there are calibrated estimators
2) the certainty given by those estimators is closer to 100% than to 50%.

I said we “can”, not we “should”. Whether we “should” is another story.

The "wisdom of the crowd" variant is another possible game activity: average all the guesses, and see how good the averages are.
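A minimal sketch of this variant, under my own assumptions about how to combine the guesses: for each item, convert every guess into a probability of being Large, average across the players, and read the result back as a single crowd guess. The crowd's guesses can then be scored exactly like those of an extra player.

```python
# Hypothetical "wisdom of the crowd" aggregation for one item.

def crowd_guess(player_guesses):
    """player_guesses: list of (size_guess, probability) pairs from the different players."""
    p_large = [p if size == 'L' else 1 - p for size, p in player_guesses]
    avg = sum(p_large) / len(p_large)
    return ('L', avg) if avg >= 0.5 else ('S', 1 - avg)

# Example: three players guessing the same item.
print(crowd_guess([('L', 0.7), ('S', 0.6), ('L', 0.9)]))  # ('L', 0.666...)
```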

Note: I have not tried the game yet. I’d like to invite anybody interested to try it, but there are no guarantees. I’m looking for feedback about it.

Here you can see some parts of the book "How to Measure Anything" that inspired me, particularly the "calibration" part.


Variation for “estimating the value”: another possibility is guessing the “value” of a feature, for example providing a confidence interval for how many times a new feature will be used. The procedure essentially should not be very different; I may blog about it later.

That’s all. Thanks for reading; I’d like to get feedback.

p.s. warning: I did not quote definitions from the book, so I may have made some mistakes. I apologize just in case. You are invited to read excerpts from the original book in case of doubt (and hopefully point out any mistakes to me).






1 comment:

Tonino Lucca said...

Let me clarify a few things:

This game is:
1. a way to validate the hypothesis that estimates cannot be done
2. a way to find if there are calibrated estimators
3. a way to do some brainstorming and retrospective analysis, in order to share our specific perception of the complexity of the different tasks with hindsight, given what our "bets" were in advance.

About 1: if we find out that all the guesses are really bad, failing even to distinguish Small from Large in advance, then there is no point in trying to do better than this, or in reporting some "story point" based velocity, and so on. It would give information of very little value. Moreover, it could create a sense of "commitment" that may end up sacrificing (internal) quality for scope.

About 2: you may find a calibrated estimator, and you may investigate what made him so good. Moreover, a "six sigma"-like approach could be used to find out how to increase the confidence level (i.e. you find that there exists a calibrated estimator whose average certainty level is 70%; you then ask this person what he would suggest in order to raise the certainty level to 80%).

About 3: even if the game gives very poor numbers, it is still possible to use all those numbers to do some retrospective analysis. Example: find the item whose actual is farthest from _any_ guess, and try to do some root cause analysis on it. Find the item whose actual is closest to the guesses and, in the same way, use it to understand the reason for that.

That could become complicated, and consider also that there is always a combination of luck and skill, according to a model like Outcome = a*Luck + (1-a)*Skill (this is a deep topic; start here for the skill-luck model: http://blogs.hbr.org/cs/2011/02/untangling_skill_and_luck.html)

Clearly the skill is a random factor too, but the a factor and the luck are the tricky part. Skill is not the same for all the members of a team, and everything also depends on the a factor, which we don't know in advance.
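As a toy illustration of that model (the numbers are made up and this is only a sketch of the idea, not the method from the article), one can simulate how much the spread of outcomes depends on the unknown a factor:

```python
# Toy simulation of Outcome = a*Luck + (1-a)*Skill with made-up numbers.
import random

def simulate_outcomes(skill, a, trials=1000):
    """skill is fixed in [0, 1]; a weights the luck term; luck is uniform in [0, 1]."""
    return [a * random.random() + (1 - a) * skill for _ in range(trials)]

for a in (0.1, 0.5, 0.9):
    outcomes = simulate_outcomes(skill=0.7, a=a)
    spread = max(outcomes) - min(outcomes)
    print(f"a={a}: outcomes spread over ~{spread:.2f}")  # more luck -> wider spread
```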

The skill matrix can be useful as a rough evaluation of the skill level for the different team members.

Note also the problem of "multitasking". The experiment should use "focused timeboxes" in order to avoid confusion about delays caused by many activities going on at the same time. That may mean, for example, that you would rather use "pomodoros" and track the pomodoros, so if you have been able to do only two pomodoros in a day, then you track just those two.
This does not change the fact that the delays create costs, and that being focused for only a short part of the day is a symptom of some problem.

Another point:
The Large/Small labels in this example are "relative", so the guesses are only about distinguishing larger from smaller items _in this set_ (and so not Large or Small in an absolute sense). Moreover, the guessers should split their guesses into two sets (Larges and Smalls) of almost equal size.