Normally, when debugging, I would recommend starting from first principles and trying to understand the problem’s root cause before doing anything else. Without a complete understanding of the bug, your chances of implementing a complete and correct fix are poor. The last thing you want to do is apply a random fix that “seems to work” without understanding why it works; at that point, you’re a member of the Voodoo Magic School of Software Engineering.
So my first debugging tool, whenever possible, is an interactive debugger. Stepping through the code is the ideal way to understand when and how the program’s actual behavior starts to diverge from its expected behavior. If an interactive debugger is out of the question for some reason, there are always fallbacks like “printf debugging”, but I find these much less effective than a real debugger.
Every so often, though, you run into a bug that is truly mysterious, where you can’t seem to track down what is going wrong. It would be great to understand the full behavior of your system in every last detail, but large and complex software systems eventually display “emergent” behavior: the system as a whole behaves in ways that cannot easily be explained as the sum of the behaviors of its parts.
What then? One approach to shortcut the debugging process is to binary search for the change that introduced a bug. This requires no understanding of the system’s behavior at all. All you need to know is a revision number where the bug exists and an older revision number where the bug didn’t exist. If it worked at revision 100 and fails at revision 200, then you test at revision 150 to narrow that range by a factor of 2, and so on until you identify the exact change responsible — just like any other binary search.
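The search loop itself is simple enough to sketch in a few lines. This is a hypothetical illustration, assuming an `is_broken(rev)` oracle that builds the software at a given revision and reports whether the bug reproduces (tools like `git bisect` automate exactly this loop):

```python
def find_first_bad(good, bad, is_broken):
    """Return the first revision at which is_broken(rev) is True.

    Assumes is_broken(good) is False, is_broken(bad) is True, and that
    the bug, once introduced, stays present (i.e., a true regression).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_broken(mid):
            bad = mid   # bug already present here: search the older half
        else:
            good = mid  # still works here: search the newer half
    return bad

# With a (made-up) bug introduced at revision 137, searching between a
# known-good revision 100 and a known-bad revision 200 takes about
# log2(100), i.e. roughly 7, test runs.
print(find_first_bad(100, 200, lambda rev: rev >= 137))  # prints 137
```

The logarithmic cost is what makes this practical: doubling the size of the revision range adds only one more build-and-test cycle.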
The first obvious caveat is that this only works for regressions, not for bugs that have existed since the software was first written. Another is that you had better be able to accurately reconstruct old builds of your software; this can be a problem if you don’t have machine-independent builds. Even if you can reconstruct old builds, it can take a while to build them: syncing a tree back to an older revision has a way of flushing out makefile bugs that syncing forward to newer revisions never reveals, so incremental builds are often not trustworthy.
Another tough problem is that sometimes there are parts of the range you want to search where the software is broken: either it doesn’t build, or it builds but there is another, more severe bug that prevents you from running your test case in the first place.
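One way to cope is to let the test oracle report a third outcome, “untestable”, and have the search step around those revisions (this is the idea behind `git bisect skip`). A hypothetical sketch, assuming a `test(rev)` oracle that returns True when the bug reproduces, False when it doesn’t, and None when the revision can’t be tested at all:

```python
def bisect_with_skips(good, bad, test):
    """Narrow the range (good, bad] as far as testable revisions allow.

    test(rev) returns True if the bug reproduces at rev, False if it
    does not, or None if the revision is untestable (broken build, or
    a more severe unrelated bug prevents running the test case).
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        # Try the midpoint first, then fan outward to the nearest
        # testable revision strictly inside the current range.
        result, probe = None, mid
        for probe in sorted(range(good + 1, bad), key=lambda r: abs(r - mid)):
            result = test(probe)
            if result is not None:
                break
        if result is None:
            break  # nothing left in the range is testable; report what we have
        if result:
            bad = probe
        else:
            good = probe
    return good, bad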
A final problem is that while you may be able to track down the problem to a single commit, some engineers have a bad habit of combining many independent logical changes, each of which could have been a separate commit, into one large super-change. Continuous integration is your friend here: commit code on a daily basis rather than batching up weeks or months of changes into a single commit. Aside from individual engineers’ bad habits, this can also happen if you merge an entire branch’s worth of changes (which could easily be hundreds of changes) back into your main codeline in a single commit. On that score, my recommendation is to avoid “private branches” or “feature branches” and to commit to the mainline as frequently as possible, even if you have to leave your new code in a disabled state via #ifdef or whatnot. “Big bang integration” has a bad reputation for a good reason.
Once you’ve found the change responsible, then what? Unless the problem is extremely urgent and a fix cannot wait a second longer than it needs to, or unless it’s extremely obvious why the change is wrong from inspection alone, I do not recommend that you immediately back out the change. Instead, this is when you fire up your debugger to figure out what is really going on. The change may not have been the real culprit; it may have just perturbed the system a bit, causing a preexisting bug elsewhere to reveal itself in your test case. All that binary searching gives you is a clue. It does not give you a root cause. You still need to find the root cause to apply a correct and complete fix.
One good way to make use of binary searching for a bug is to have your QA department handle it. Since it doesn’t require understanding of the code itself, anyone who can build and run the software can do a binary search. QA can provide the result of the binary search in the bug report, thereby offloading this task from development.
Cascade can help make binary searching more effective:
- Rejecting commits that break builds and regression tests helps ensure that the search won’t break down in a region of changes where the software doesn’t build or run at all, increasing the probability of a successful binary search.
- One of the slowest and most frustrating parts of a binary search is doing all of the builds. Since Cascade builds your software on every single relevant commit and archives those builds off, you can skip this step entirely for any recent regression.
- There’s no guesswork about which changes need to be searched and which ones can be skipped. Cascade’s automatic dependency tracking knows exactly which changes affect your build.
- Cascade helps enable continuous integration, allowing your engineers to commit code directly to the main codeline with less fear that low-quality commits such as build breaks will hurt everyone’s productivity. Smaller, more frequent commits make the results of binary searches more useful.