During my last couple of years at Microsoft, the Office organization was making a broad transition to a new, more agile engineering process. It wasn’t exactly textbook agile, but it was a lot more flexible when it came to changing requirements and iteration. It was an admirable change, and I’m not sure an organization the size of Office could make a wholesale change to a “truly agile” culture in a short time like that. However, a result of these changes was that we all had to rethink how we approached things like scheduling and ship-readiness. Under the new process, we were operating on a schedule of regular, more frequent “releases” (in the early days, this just meant “semi-stable builds”), each consisting of a fixed number of sprints. There were a ton of tools in place from previous releases to manage the schedule, but since our team (and the organization as a whole) was new to this, we knew it would take some tweaking to get it right.
We tried many different methods (and I won’t go through them all here), but in the feature areas I was PMing, I found some success using randomized historical models to predict delivery dates. I didn’t invent the techniques in this post (though I modified them a bit to suit our needs), but I hope it serves as a good explanation of how you can use Monte Carlo Methods (and simulation in general) to better predict the future. (Okay, that sounds ridiculous.)
A note on numbers: I would love to share the results based on real data from our work at Microsoft, but for obvious reasons I can’t share that data with you. I’ve mocked up fake data throughout this post to demonstrate the concepts. While I can’t show you the real data, I’ve done my best to make this data representative of the kinds of observations we had, and the high-level trends we noticed. I’ll also just be assuming we’re working with one developer in this post, but in practice, you want to break out tasks by developer – everyone works at a different pace and estimates with different accuracy.
In the Beginning, There Was the Estimate
Every time a work item is recorded and considered “in-plan”, the developer enters an estimate (in “work hours”) for this item. Estimation of software engineering tasks is incredibly hard to do with accuracy. There are some techniques that help improve accuracy (working in groups to do estimation; ensuring everything is broken down into very small, well-understood chunks, etc.) but all the little bits of uncertainty can add up in a way that has a big impact on the schedule.
In the software world, the question is usually: “When will we be done?” (Sometimes, if you’re lucky, they’ll also ask “and how sure are you about that?”) How do you go about answering that question? Suppose you have data that looks like this:
Task | Estimate |
Task 1 | 8 |
Task 2 | 8 |
Task 3 | 15 |
Task 4 | 8 |
Task 5 | 8 |
Task 6 | 8 |
Task 7 | 12 |
Total | 67 |
You could say “Well, all the estimates add up to 67 hours, so we should be done in about 67 hours.” (Keep in mind that “hours” means “working hours,” and we do some fuzzy math to figure out how many working hours corresponds to a period of calendar time like a week. For example, nobody actually gets 40 working hours in a 40-hour work week. For the sake of simplicity, I’m going to show these examples based on working hours, but we did some wrangling to convert these into dates.)
Is that accurate? Maybe. But what if Task 4 actually takes twice as long as you thought? Then you’re up to 75 hours. What if Task 7 also takes twice as long as you expected? Then you’re up to 87! Even if you strive to make your estimates as accurate as possible, there’s always a possibility things will take more time – in fact, it’s pretty much a guarantee that most of the estimates will be off by at least a little. I wanted to have a clearer picture of when we could expect to deliver things. The goal was not certainty – I believe that’s impossible – but I wanted to be able to better envision the range of possibilities.
Evidence-Based Scheduling
When I was studying CS in college, I read pretty much everything by Joel Spolsky I could get my hands on. When I started trying to wrangle timelines and schedules, one of the things that popped out in my memory was his essay on Evidence-Based Scheduling. EBS is summarized as follows:
- Record how long you estimate each future task will take
- As you finish work, record how long each task actually takes
- To predict the likelihood of certain outcomes in the future, use the historical data to run Monte Carlo Simulations for the future tasks
What’s a Monte Carlo Simulation?
Back in the 40’s, many of the world’s biggest brains were hard at work in Los Alamos on the Manhattan Project. A few of them (Nicholas Metropolis, Stanislaw Ulam, and John von Neumann) were trying to model the behavior of certain particles to better understand radiation. Good ol’ Johnny von Neumann had built this brand-new fancy computer to put to work on the problem, and they had collected large amounts of data on these particles. Despite these advantages, they realized that they had come across a set of problems that couldn’t be solved by a single deterministic calculation — meaning there wasn’t a single set of equations they could compute to get The Right Answer. To understand what that means, let’s walk through an example.
Gathering Data (aka Gambling “Responsibly”)
Suppose you’re in a friend’s casino, and he invites you to play his favorite game. He has a 6-sided die, and he’ll allow you to place a bet on what side will come up when it’s rolled. For a fair die, there would obviously be a 1/6 chance for each side. In that case, you only have a 1/6 chance no matter what, so you just bet on any side and cross your fingers. However, let’s suppose we can’t assume this is a fair die. (What can I say? I run with some devious gamblers.) How would you figure out the likelihood of each option for a potentially unfair die? Before you bet, you might ask to roll the die once to see what happens. However, does one roll really tell you what you need to know? Can you think of any other ways?
How about rolling it 10 times before you bet, writing down the number of times each side comes up, and choosing the one that came up the most? Or what if you rolled it 100 times? 1000 times? One Million times? (He might get tired of waiting…) Why increase the number of rolls? As you increase the number, you gather a larger sample size.
Think of it this way: each side will come up some percent of the time. If you just roll it once, you could get any result at all! One roll doesn’t tell you much about the likelihood of any side coming up. But the more times you roll it, the closer you will get to observing the correct probability distribution for the sides. A probability distribution is just the set of probabilities for each outcome. (For example, a fair coin has an even probability distribution of 50% heads, 50% tails).
Back to our die. After 10 rolls, your notes might look something like this:
Side | Count |
1 | 1 |
2 | 0 |
3 | 4 |
4 | 4 |
5 | 1 |
6 | 0 |
It doesn’t look like a fair die, but it’s hard to draw conclusions from this. After 100 rolls, you might see something like this:
Side | Count |
1 | 10 |
2 | 15 |
3 | 30 |
4 | 35 |
5 | 5 |
6 | 5 |
Ah, we’re starting to see something interesting! That’s almost certainly not a fair die at all! If it was fair, you’d expect each side to come up between 16 and 17 times. This one appears to be weighted most heavily to 3 and 4. After 100 rolls, we have more confidence that we’re getting close to the correct distribution for our die. As your sample size continues to grow, you might notice that it is converging on specific distribution, say something like this:
Side | % Chance |
1 | 10% |
2 | 16% |
3 | 31% |
4 | 34% |
5 | 4% |
6 | 5% |
You can use this data to make decisions about the future — for example, if you’re gambling and you only have one bet, you should probably bet on 4. It doesn’t guarantee that 4 is going to come up, but it helps you understand how likely each possible outcome is.
Building a Model to Simulate a Process
Suppose that the game’s a bit more complicated, and each side has a different payout. (For example, maybe 6 pays out 3-to-1 of whatever you bet, a 1-to-1 payout for choosing 4, etc.) You’re planning to play the dice game for a while: you’re going to bet and roll, bet and roll, bet and roll, and so on. Before you put your pile of money on the line, you want to see if there is a better, more complex strategy than simply choosing the safe route (Side 4) every time. You decide to go home and try out a few different combinations of strategies to see how well they might perform before you bet real money. But there’s one big problem: your friend won’t let you take the dice home with you to try different strategies. He says all those free rolls you tried were already pushing the limits of his generosity, and all you can do is take your notes about the probabilities home. What could you do to try out different strategies?
One option is to build a model of the die. A model is just a representation of a process – it isn’t the actual process itself (like rolling a weighted die, building software, or the US economy), but it acts as a stand-in for that process when running it for real would be too costly or difficult. To build a model of the weighted die, we could use something very low-tech: a box with slips of paper inside. Each slip of paper has a side number (1,2,3,4,5, or 6) written on it. To run a “simulation” of the die, we would just draw a slip from the box: whatever number was on it would be the side “rolled” by our model. The simplest model would have what’s called a “uniform distribution”. In a model like this, there would be an equal chance of choosing each option:
That doesn’t model our unfair die very well – here our model is just randomly picking a side, and each side has an even 1/6 chance of being chosen. If we want to accurately model the die, we should randomly select a side based on the historical data. Based on the rolls we observed earlier, our model should have a 10% chance of selecting a 1, a 16% chance of a 2, a 31% chance of a 3 and so on. How might we model this?
Suppose that we put 100 slips of paper inside the box, and we use more slips of paper for sides that are more likely to be rolled. We would write “1” on 10 slips, “2” on 16 slips, “3” on 31 slips, “4” on 34 slips, “5” on 4 slips, and “6” on 5 slips. Then we would shake up the box and randomly draw one. Obviously, we’re most likely to draw a 4 from a box like this:
So, in this model of the die, the odds of each side correspond to the probabilities we recorded earlier. Now we can run simulations to our heart’s content! A single simulation looks like this:
- Place Bet on a side
- Draw a slip, and look at it.
- Record the results (made money or lost the bet)
- Put the slip back, shake the box up
We can then do this over and over trying different betting strategies and observing the outcomes. This technique can be used to model all kinds of things. (Although it generally uses a weighted random number generator instead of boxes and papers!)
So anyway, these physics geeks at Los Alamos devised this technique and ran these simulations on the first general-purpose electronic computer to model aspects of the atomic bomb detonation. Since it involved probabilities and random trials it reminded them of gambling, so they named it after the Monte Carlo Casino where they’d lost an atomic dime or two. Thus, the Monte Carlo Method was born.
What Does Any of this Have to do with Scheduling?
Oh, right. Enough about the physicists, and back to the task at hand. Predicting how long programming tasks will take is also not an exact science — it’s very common for estimates to be way too low or way too high. One simple technique to account for this variability is called Three-Point Estimation. In three-point estimation, you estimate the Best Case, Worst Case, and Most Likely Case. You can either leave these as three distinct possibilities, or you can combine them based on some assumptions about the probability of each. This allows you to account for uncertainty in estimation.
Using a Monte Carlo Method (like the box of papers) to simulate project schedules also conveys uncertainty, but it is a bit different. You run many simulations (I typically ran 1,000-10,000) and look at the big picture, making a note of the big range of possibilities. In each simulation, you do the following:
- Go through the tasks that have not been started, and for each task:
- randomly select an overrun percentage from the past. (Remember, “random” doesn’t necessarily mean “with equal probabilities for all options”)
- Apply that overrun percentage to the task’s estimate
- Add all these new estimates together, this is your predicted outcome
- Then you repeat this a couple thousand times, and look at the many predicted outcomes.
If you have enough historical data, that data begins to reflect the probability distribution of overrun percentages. Each simulation should randomly select from those historical overruns, based on their historical probability distribution. For example, if 20% of the time tasks take 1.2x as long as you thought, and you have 100 data points, then you would expect 20 of those data points to be 1.2. This is important — it’s not just randomly picking an overrun with equal chance for each option, it’s randomly picking based on the probability of each. Think back to the two boxes in the previous section.
Monte Carlo simulations allow you to express outcomes as a range of possibilities. They have two primary advantages over Three-Point in this scenario:
- They show the (often huge) middle-ground between Best/Worst/Likely
- They are explicitly based on past data, whereas Best/Worst/Likely often involves some guesswork or selecting a predefined probability distribution
I still prefer Three-Point Estimation for certain types of tasks, but I decided to give Monte Carlo Simulation a go for my feature areas. Fortunately, we had already been collecting some data, so I put it to work.
Collecting Data, and Giving it a Whirl
In addition to the data we started with, I also started collecting all the data I could get my hands on – I reasoned I could always decide later what data was relevant or irrelevant, but I could never go back in time and collect something I overlooked. I started with data like this:
Task | Estimate | Actual | Overrun |
Task 1 | 8 | 10 | 25% |
Task 2 | 8 | 6 | -25% |
Task 3 | 15 | 17 | 13% |
Task 4 | 8 | 9 | 13% |
Task 5 | 8 | 8 | 0% |
Task 6 | 8 | 10 | 25% |
Task 7 | 12 | 14 | 17% |
(As I mentioned before, all data in this post is made up.)
Negative Overrun means that we overestimated how long it would take. It’s nice when that happens, but sadly it doesn’t happen often. Based on this data, I ran a few thousand trials. Each trial looks something like this:
Task | Estimate | Overrun %(randomly sampled) | Predicted “Actual” |
Task 8 | 8 | 25% | 10 |
Task 9 | 16 | 13% | 18.08 |
Task 10 | 4 | 25% | 5 |
Task 11 | 8 | 17% | 9.36 |
Task 12 | 12 | 0% | 12 |
Totals | 48 | 54 |
So in this single trial, the predicted actual work is 54 hours. Of course, in practice this was an automated process, and the computer ran thousands of these trials. After running all those trials, we collect their outcomes (in total hours) and for each outcome compute percent of time the result was less than or equal to that outcome (I’ll explain “less than or equal to” below). We can then plot a graph like this one of the results:
The y-axis in the chart above indicates how confident we are (based on our models) that we’ll complete the remaining work in the corresponding number of hours. For example, if you look for 55 hours on the x-axis, you’ll see that it corresponds to about 68% confidence on the curve. Where did 68% confidence come from? The way to think about this figure is: “68% of our models predict that we will complete the project in 55 working hours or less.” Or, more generally: “<y value>% of our models predict that we will complete the project in <x value> working hours or less.” This is where “less than or equal to” comes in – we don’t care too much about the odds that something finishes exactly on a particular prediction, we care whether or not is finished by that time. The dotted line indicates how many “Work Hours” are left if our estimates were exactly right, or too high (11% confidence we’re done by our estimate? Whoops.)
It’s worth noting that “100% Confidence” is a bit deceiving. Can you ever really be 100% confident? Given that anything can happen, I rarely reported that 100% confidence figure. However, it was generally a safe bet that we would finish in less time than that. 100% confidence means “Let’s look at the worst overrun we’ve ever had. If every single task runs over by that much, this is the number we get.” That’s obviously unlikely – some things are bound to go better than that! That said, when you’re in the early stages, it’s important not to rely too heavily on the confidence percentage — you want to be cautious until you’ve collected enough data to account for a wide range of possibilities.
This was certainly an improvement over the old way of thinking about the schedule! One of the great things about these techniques is that they continue to improve as long as you continue collecting data – your box full of papers gets more and more accurate. Using these techniques, we were able to get a handle on the likelihood of various completion dates, and schedule work accordingly. My crew found that the estimate corresponding to 80-85% confidence was a reasonable place to start scheduling dependent tasks. There was still a chance that they’d slip, but things worked out often enough that we were able to reduce the amount of last-minute schedule juggling as a result of slippage, and do some more long-term planning.
Some Python Code!
While I was at Microsoft, I started all this analysis in Excel. It was a great tool for getting this thing up and running, but it had some limitations. Since leaving Microsoft, I’ve written some Python code to run these kinds of simulations. If you’re interested, you can check out the Monte Carlo Scheduling Project on GitHub. It requires MatPlotLib.PyPlot, and NumPy. Feel free to make any suggestions, or fork the code and modify it yourself!
Next Time Some day: Accounting for Emerging Requirements and Other Stumbling Blocks
Unfortunately, I have had a hard time writing part two of this post without disclosing stuff about my time at Microsoft that felt too proprietary for my comfort. I’ve decided to hold off on writing the follow-up post for the time being. If you have any questions on the subject or how it might apply to your team, feel free to reach out via the Contact form – I’d be happy to discuss the topic in the abstract or based on data from another party.
This process certainly improved our scheduling crystal ball, but after a few sprints it became clear that there were still some challenges. The first was that as a whole, we weren’t super diligent about tracking the “actual” time. Second, there were some delays in the check-in process that didn’t prevent us from getting started on a new task, but made it so that the task wasn’t truly “done” yet. Finally, The EBS techniques as outlined in Joel’s blog didn’t account for one of our biggest headaches: emerging requirements. Since we were new to this agile thing, and we were building out a new codebase, there were a heck of a lot of things we could never have anticipated in the beginning. Of course, we added these tasks to our schedule as they came up (and it shifted our projected dates accordingly), but it didn’t feel right to leave this important aspect of our work completely unaccounted for. We tried some buffers and things like that, but they didn’t really help us understand the future. Next time, I’ll outline some of the ways I modified this technique to accommodate these problems.
No comments on "Predicting the Future: Schedules, Simulations, and Software"