Archive for November, 2008
I’d like to propose a fundamental law of configuration management: the cost of an integration increases over time. This is similar to the well-known software engineering observation that the cost of fixing a bug increases over time.
Let’s start with a simple example: a single project with just 2 engineers, where each engineer commits a single change once per day. Now suppose that both engineers, for some reason, decide to start committing their code in batches of 5 changes once per week instead. I’m not sure why they would do this; I see large benefits to keeping commits small.
Here are the consequences I would foresee:
- A reduction in per-commit overhead by batching up 5 changes into a single larger commit.
- Increased communication overhead: a revision control system is a formalized way for engineers to communicate without having to send emails, etc. In particular, the change descriptions, if well-written, help keep the other team members informed about what is going on. Frequent commits also make conversations like “watch out, I’m working on some big changes to file X” less necessary.
- Increased redundant work: both engineers might fix the same bug in their own local trees rather than picking up the change from the other engineer.
- A larger number of merge conflicts. At the risk of misapplying statistics and making a vast number of simplifying assumptions: if each change touches 5% of the total lines of code, and if changes are randomly distributed in the code, the probability of a merge conflict was about 1.2% weekly before and is about 5.1% weekly now.
- Incompatible changes: both engineers might choose to rewrite the same block of code in two different and inconsistent ways. This will show up as a merge conflict, but it’s worse than a plain old merge conflict. You’re not just doing a textual merge; you’re trying to reconcile two conflicting visions of how the code should work “after the fact” and throwing away a good chunk of the work. Had the first rewrite been committed more promptly, an additional rewrite might have been avoided.
- New bugs are discovered and fixed later: if the first engineer’s changes introduce a bug that impacts the second engineer’s work, the bug might be discovered a week later rather than a day later. Standard software engineering literature suggests that bugs cost more to fix over time.
- Increased probability of losing work. Once a change is committed, it’s saved in the repository and won’t be lost in an accidental “rm -rf” or a hardware failure (assuming that the repository itself is being backed up appropriately).
Unless you’re extremely worried about per-commit overhead (in which case I would suggest that you have bigger process problems you need to address), this is definitely not a good thing.
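The merge-conflict figures in the list above can be reproduced with a very simple model. This is only a sketch under an assumption the post does not spell out: the chance that two sets of changes collide is taken to be the product of the fractions of the code each one touches. The function and parameter names are made up for illustration.

```python
def weekly_conflict_probability(fraction_per_change, changes_per_week, batched):
    """Rough weekly chance that two engineers' changes collide.

    Assumed model: a collision between two sets of changes happens with
    probability equal to the product of the code fractions they touch.
    """
    if batched:
        # One big commit per week: each engineer has touched the union of
        # five random 5% slices before anything is integrated.
        touched = 1 - (1 - fraction_per_change) ** changes_per_week
        return touched ** 2
    # One small commit per day: five independent chances to collide.
    daily = fraction_per_change ** 2
    return 1 - (1 - daily) ** changes_per_week


daily_commits = weekly_conflict_probability(0.05, 5, batched=False)
weekly_batches = weekly_conflict_probability(0.05, 5, batched=True)
print(f"daily commits:  {daily_commits:.1%}")   # 1.2%
print(f"weekly batches: {weekly_batches:.1%}")  # 5.1%
```

Under this model the weekly risk roughly quadruples, because the amount of unintegrated code on each side grows with every day that passes without a commit.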
Merge conflicts in particular are more dangerous than a lot of people realize. In software, it is not necessarily true that the correct way to combine two changes is to perform a textual merge of the source code. It is dangerous to assume that simply because a textual merge did not detect any obvious conflicts, you are all set!
To perform a correct merge, you need to understand what has been changed and why. Many engineers have a bad habit of being careless on merges: they let down their guard. Merges are just as real as any other change, and we cannot assume that just because two changes worked independently that they will also work together.
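Here is a small, made-up illustration of a merge that is textually clean but logically wrong. All of the function names are hypothetical. Change A switches `shipping_cost` from kilograms to grams and updates the one caller that existed at the time; change B, written against the old code, adds a new caller elsewhere in the file. The two changes touch different lines, so a textual merge reports no conflict.

```python
RATE_CENTS_PER_KG = 200

# Change A (engineer 1): shipping_cost now takes grams instead of kilograms.
def shipping_cost(weight_g):
    return RATE_CENTS_PER_KG * weight_g // 1000

# Existing caller, updated as part of change A.
def order_total(price_cents, weight_kg):
    return price_cents + shipping_cost(weight_kg * 1000)

# Change B (engineer 2): a new caller written against the old,
# kilogram-based function. It merges cleanly but passes the wrong unit.
def gift_wrap_total(price_cents, weight_kg):
    return price_cents + 100 + shipping_cost(weight_kg)

print(order_total(1000, 3))      # 1600, as intended
print(gift_wrap_total(1000, 3))  # 1100, but 1700 was intended
```

Each change worked on its own. Only someone who understands what both changes were trying to do will notice that the combination is broken.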
Of course, if the textual merge does detect a conflict, the risks are far greater. An automated merge won’t get tired or make a typo. A human can and sometimes will. If the conflicts are nontrivial, as in the case of two engineers rewriting the same code, merges can be some of the most dangerous changes of all.
So far I’m not really saying anything new here. It’s pretty standard advice that engineers should commit code no less than once a day, even if only to reduce the risk of losing code by accident. Also, there is a lot of literature on the benefits of “continuous integration” as opposed to “Big Bang integration”, or on releasing your software “early and often.”
At the same time, a lot of supposed proponents of continuous integration seem to talk the talk better than they walk the walk. You will find a lot of these same people advocating such things as:
- development branches, where different groups of engineers working on different features commit to different branches/codelines, rather than sharing a single “main” or “trunk” development branch
- “distributed version control systems”, which are development branches taken to another level (all changes are effectively developed in a development branch, and no “main” branch even exists except by convention)
- branching and releasing each component of your project separately, rather than putting all components under a single “main” or “trunk” and branching them simultaneously
I contend that, by delaying integrations, these practices are steps back in the direction of “Big Bang integration” and that they increase the total cost of integrations.
Consider development branches, where several engineers go off and work on different features in different branches rather than working concurrently in a single main branch. Nearly all the same risks I listed above for committing once a week rather than once a day apply here also: communication overhead, redundant work, merge conflicts, incompatible changes, bugs discovered and fixed later. (On the bright side, losing work by accident should not be an issue here.)
The more development branches you have, the more integrations you will need to do. Someone will need to merge the changes from the main branch into the development branch on a regular basis, and when the development branch is done or at least has reached a “good” point, it needs to be merged back into the main branch. Either way, this typically leads to “mass integrates” in both directions.
As I’ve written before, mass integrates are a “worst practice.” Mass integrates can frequently run into dangerous merge conflicts. Because you are merging two large bodies of changes, the probability of a textual or logical conflict between the two sets of changes can be high. The longer the development branch lives on without being integrated back into the main branch, the greater this risk grows. (If you must, for whatever reason, have a development branch, I recommend integrating in both directions as frequently as possible.)
A development branch can be thought of as an intentional delay in integrating code. This can be tempting: “I get my own sandbox where I can do whatever I want!” But this kind of freedom is dangerous at best. For example, it encourages engineers to break things in the branch expecting that they will be “cleaned up later.” If the feature’s schedule starts to slip, this deferred “cleanup” work may be skipped. All of a sudden the development branch “needs” to be merged back into the main branch “right away” so that it can be in place for the next release branch. (I’ve seen this happen a number of times.)
When you add in the costs of delayed integrations, I recommend against development branches. You are better off doing the development in the main branch. This may require a bit more care on each change (you can’t break stuff like you can off in your own little “sandbox”), but the discipline this requires will pay off later. You won’t waste time integrating changes back and forth between branches, and you will spend a lot less time fiddling around with (textual or logical) merge conflicts.
If the new code isn’t ready to activate right away, you can simply hide it behind an #ifdef or somesuch. Even if the #ifdef approach starts to prove to be difficult, it’s likely to still be easier than dealing with merge conflicts: when someone makes an unrelated change that interacts with your changes, there’s a good chance that they will help you out by updating the inactive #ifdef code. And if someone makes a change that truly conflicts with your changes, you’ll know right away.
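As a rough sketch of the “or somesuch”, here is the same idea expressed as a runtime feature flag in Python rather than a C preprocessor #ifdef. The flag name `NEW_PRICING` and the pricing functions are hypothetical. The unfinished code lives in the main branch alongside everything else but stays switched off by default.

```python
import os

# Hypothetical flag; the C equivalent would be `#ifdef NEW_PRICING`.
NEW_PRICING_ENABLED = os.environ.get("NEW_PRICING", "0") == "1"


def price_v1(quantity):
    """Current behavior, active by default."""
    return 10 * quantity


def price_v2(quantity):
    """Work in progress: committed to the main branch but not yet active."""
    discount = 5 if quantity >= 10 else 0
    return 10 * quantity - discount


def price(quantity, use_new=NEW_PRICING_ENABLED):
    """Dispatch on the flag; tests can pass use_new to exercise both paths."""
    return price_v2(quantity) if use_new else price_v1(quantity)


print(price(10, use_new=False))  # 100
print(price(10, use_new=True))   # 95
```

One difference from a true #ifdef is that the inactive path here is still loaded and can be tested, so a change that breaks it is more likely to be noticed.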
With the possible bankruptcy of the US Big Three automakers in the news, it’s interesting to think about the analogies between making cars and making software. There is no one single reason why the Big Three have been on the decline for many years now, but surely one of the most important reasons is that American consumers decided that Japanese cars are generally higher-quality than American cars.
These days, the Japanese carmakers’ attitude towards quality and process improvement is so mainstream that it’s almost cliched. It can easily be taken too far; for example, it is clearly not the case that increasing quality always saves money. Rather than get involved in the “software quality” debate, I’d rather focus right now on the idea of “process improvement.”
Whether you realize it or not, you have a process for building software. Oftentimes very little conscious thought has been put into this process, and it is frequently ineffective and wasteful.
Here are some basic elements of your software engineering process to think about:
- Who decides whether a feature is going to be added or a bug is going to be fixed, and if so, in which release?
- How many outstanding branches/codelines do you have? When do you create a new one? When do you shut down an existing one? Who does integrations between them, and how often?
- How do you build your software, all the way from the original source code to the final CD/DVD image or downloadable installer, for all of your target platforms/build configurations? How do you detect and/or prevent build breaks?
- How do you find bugs in your software? Customer bug reports? Internal QA? Code reviews/code reading? Static analysis tools and compiler warnings? Asserts? Other?
- What do engineers do before they commit their changes, e.g., what platforms do they build and test on, and what tests do they run?
- What happens after a change is committed? What automated or manual builds and tests are run on it?
- How do you verify that a bug is really fixed?
- If a previously committed change seems to be causing problems, what do you do? How much time do you let the change stay in place while you try to debug the problem? Or do you “back it out first and ask questions later”, putting the responsibility on the original change’s author to figure out what went wrong and reapply a correct version of the change later?
- Top engineers are often 10x more productive than average engineers. Is your process geared towards allowing those top engineers to flourish, at the risk of occasional mistakes slipping through, or is it geared towards preventing the average engineer from making mistakes, at the risk of reducing your top engineers’ productivity?
Whatever your current process, if you want to improve the effectiveness of your software development organization, you should be looking for ways to enhance it. A very simple way to do this, pioneered in manufacturing by the Japanese automakers, is to look for the root cause of each problem and fix the root cause so the problem cannot happen again. (I’ve written previously on this topic.) One simple method Toyota adopted to identify root causes is called the “5 Whys.” The important thing is not the specific method you use, but that you do dig down to understand why problems are happening.
This isn’t just the responsibility of management. Individual engineers should be looking for opportunities for process improvement, too. Any time you find or fix a bug, for example, this gives you an opportunity to ask a bunch of questions:
- When and how was the bug introduced?
- Could we have prevented this bug from being introduced in the first place?
- Could we have detected this bug sooner?
- From the time this bug was reported, could we have fixed it sooner?
- Was the bug prioritized appropriately?
- This could be one of a family of related bugs. Are there other, similar bugs elsewhere we should look for?
We can ask much the same questions any time someone’s build is broken:
- Who broke the build?
- Why wasn’t the build break discovered before commit? How could it have been prevented?
- How quickly was the build break detected? How quickly was it fixed?
- Are other build configurations or other components’ builds broken too?
To give a more concrete example, maybe a build break wasn’t discovered before commit because the developer only did builds on a subset of the target platforms/configurations. Perhaps the debug build passed and the release build failed. Or perhaps the Windows build passed and the Linux build failed. What possible process improvements does this suggest?
- Require people to test-build all configurations before committing. I would probably not recommend this; the cost can easily exceed the benefit. Also, engineers are likely to “forget” to follow such a requirement, either intentionally or unintentionally, or are likely to make “one last change” after doing all their tests and not go back and fully test everything again.
- Reduce the number of supported build configurations. Debug/release is pretty typical, but suppose you’re still supporting some ancient operating system that no one cares about any more; perhaps you can finally retire your old DOS or Win9x or MacOS9 build, for example? Or perhaps you can have a single binary for all versions of Linux rather than a separate binary for each supported Linux distro?
- Disable “warnings as errors.” This one is a double-edged sword. On one hand it prevents warnings from creeping in. On the other hand it makes your builds more brittle. It’s up to you to make the right choice.
- Set up a system like Cascade that will reject the commit of any change that breaks a build.
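To make the last option concrete, here is a minimal sketch of a commit gate that builds a change in every configuration and rejects it if any build fails. It is not a description of how Cascade works; `gate_commit` and the stand-in builders are hypothetical, and in practice each builder would invoke your real build for that platform.

```python
def gate_commit(change, builders):
    """Build `change` in every configuration; accept only if all succeed.

    `builders` maps a configuration name to a callable returning True when
    the build passes. Both are hypothetical stand-ins for real builds.
    """
    failures = [name for name, build in builders.items() if not build(change)]
    return len(failures) == 0, failures


# Stand-in builders: pretend the change includes a Windows-only header.
change = {"uses_windows_only_header": True}
builders = {
    "windows-debug": lambda c: True,
    "windows-release": lambda c: True,
    "linux-release": lambda c: not c["uses_windows_only_header"],
}

accepted, failures = gate_commit(change, builders)
print(accepted, failures)  # False ['linux-release']
```

The point of automating the gate is that it runs on exactly the change being committed, so “one last change” after testing cannot slip through.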
We can never achieve process perfection, but over time we can improve our process so that we don’t make preventable mistakes. We should be able to avoid making the same mistake twice, for example. We also need to watch that our process doesn’t get overly bureaucratic and burdensome. Every so often it may be useful to “deregulate” your process: toss out some of the rules, especially the ones that you think might have a poor cost/benefit ratio, and see what happens.
Cascade 188.8.131.523 has been released! This release adds a number of new features. The full list of what’s changed is spelled out in the release notes. Try it now…
What’s more, Cascade is now completely free for personal use, even in a corporate environment. Multi-user installations still require you to purchase a license. The details are spelled out on the Pricing page and in the EULA at install time, but the bottom line is that you can use Cascade for as long as you want on as many computers as you want, so long as it’s just you using Cascade and accessing your Cascade servers.
I wrote previously on the topic of small commits. So when and why would I advise bunching small changes together into bigger ones, aside from the obvious case of changes that must be done atomically to avoid breaking something?
One example is a change that causes a compatibility break. Suppose you have an API, network protocol, file format, database schema, etc. you want to change. If you’re going to make one change to it already, this would be a great opportunity to make other, simultaneous, desirable changes. If people are going to have to upgrade their clients, servers, file parsers, file writers, databases, and/or database queries already, you might as well batch up these changes to reduce the total number of compatibility breaks and the total pain they will cause. The worst case would be if the intermediate API, protocol, database, etc. is released outside your organization. You might then have to support another version for the rest of time.
That’s not to say that you should make unnecessary or gratuitous changes at that time, but if you know you’re going to have to add 2 columns to a table in your database, you might as well add them both at once, rather than doing 2 separate changes to add one column at a time.
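A minimal sketch of the two-column case, using Python’s built-in sqlite3 module and an in-memory database. The table name, column names, and the use of `user_version` as a schema version counter are assumptions made for illustration. Both columns arrive in one migration, so clients see a single schema bump rather than two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")


def migrate_to_v2(conn):
    """One compatibility break: add both new columns, bump the version once."""
    conn.execute("ALTER TABLE customers ADD COLUMN email TEXT")
    conn.execute("ALTER TABLE customers ADD COLUMN phone TEXT")
    conn.execute("PRAGMA user_version = 2")


migrate_to_v2(conn)

columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
version = conn.execute("PRAGMA user_version").fetchone()[0]
print(columns, version)  # ['id', 'name', 'email', 'phone'] 2
```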
If a bug requires making an identical change to a bunch of different places in your source code, I’d likewise advise doing only a single change. If the same code has been copied and pasted to a bunch of different locations, for example, and each one has the same bug, I’d advise fixing them all at once. The last thing you want, certainly, is to be in the middle of fixing the bug and for someone to check in a new change adding another copy of the same buggy code, simply because you didn’t commit all of your fixes at once. This also makes it clear from the revision history that the changes are connected to one another.
However, if you find yourself in such a situation, where a “simple” bug fix requires changing a lot of similar logic all over the place, I might also suggest that you should look at your design more carefully and refactor your code to reduce the replication of logic. Any time you are copying and pasting code around, you are usually doing something wrong.
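As a small, hypothetical example of that kind of refactoring: suppose port-number parsing had been copied into both a server and a client config loader, and each copy forgot to strip whitespace. Pulling the logic into one helper means the bug is fixed in exactly one place, and any future copy of the loader inherits the fix.

```python
def parse_port(text):
    """Single shared implementation; a bug fix here reaches every caller."""
    port = int(text.strip())
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def load_server_config(raw):
    return {"host": raw["host"], "port": parse_port(raw["port"])}


def load_client_config(raw):
    return {"server_port": parse_port(raw["server_port"])}


print(load_server_config({"host": "example.test", "port": " 8080 "}))
print(load_client_config({"server_port": "443"}))
```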
One of the often-claimed benefits of large commits is that there is fixed per-commit overhead. A typical example of this overhead would be a mandatory code review: if every change must be emailed out and you must wait for a reply from another engineer approving your change, this might take a while.
Fixed per-commit overhead, which is very real in many organizations, makes it very tempting to batch up your changes. I’d advise against this. If you are finding that fixed per-commit overhead is forcing you to batch up unrelated changes into a single atomic commit, I would contend that you have a process issue that you need to address.
Sometimes fixed per-commit overhead is simply unnecessary bureaucracy: paranoid management enforcing commit policies that have no logical connection to the actual risk of a change. My view is that a manager needs to be able to trust his employees’ judgment. If you don’t trust your employees to make good decisions and to ask around for help when they don’t know the right answer, I’d suggest that you have a much bigger problem in your organization and that your commit policies are just a band-aid.
These policies tend to drag down the productivity of your best engineers. If your best engineers are often 5-10x more productive than your average engineer, then you can ill afford to have them waste time on every commit, just to prevent your worst engineers from checking in bad code. The real solution is to get rid of the bad engineers or to mentor them so that they don’t need extensive babysitting.
I’ve worked in several organizations with these kinds of overkill commit policies, and my general approach as an engineer was simply to ignore the policies, which were rarely enforced, and use my own best judgment instead. (No… it really isn’t necessary to run a long, comprehensive test suite if all you’ve done is change a comment in the source code.)
In other cases, while the commit policy itself was basically reasonable, the time it took to run through the builds and tests was excessive. In this case the answer is to optimize your processes. If it takes several hours to build and test a change before committing it, forget the question of big vs. small commits — you’re killing your engineers’ productivity across the board.
For example, if your software needs to run on Windows, Linux, and Macintosh, it’s perfectly reasonable to expect that everyone’s changes should compile and pass a simple test on all three platforms before they are committed. But building and testing your changes on all three platforms can take a while, and done manually, it’s error-prone (are you sure you copied the exact same files back and forth between your 3 source trees? are you sure the final change you committed is the same one you tested?). This is where better tools like Cascade can help: instead of doing these builds and tests manually, you can simply “checkpoint” your changes and Cascade will take care of running them all.
If you’ve exhausted all the possible process improvements and commits are still taking a while, one final approach is to pipeline your work. Once you’ve kicked off builds and tests for a change, you shouldn’t need to just go off and browse the web while you wait for them to complete. You ought to be able to start working on another, unrelated change in another tree. Again, Cascade can help. Traditionally, having more trees has been expensive: you have to check out and update the extra trees, and then you still have to build each tree independently (even though the build results should be the same). With Cascade, cloning a new tree takes just seconds, and each tree comes prepopulated with the results of all your builds and tests.
Unless there’s a specific reason why you can’t, I recommend that you commit code to your repository in the smallest possible atomic chunks.
Look, it’s great that modern source control systems allow you to commit an atomic change to more than one file at a time. This is an essential feature and I can’t imagine living without it. But just because we can, that doesn’t mean that we should.
Probably the biggest reason to keep changes small is to make it easier to track down which change caused a particular bug. If two changes are commingled into a single commit, you may have to manually disentangle them to figure out which one was responsible.
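To see why commit size matters here, consider a binary search over the history, similar in spirit to `git bisect`. The sketch below is illustrative only: the commit names and the `is_good` test are hypothetical, and it assumes the last commit is known to be bad and that every commit after the culprit is also bad.

```python
def first_bad_commit(commits, is_good):
    """Binary-search an ordered history for the first commit that fails.

    Assumes commits[-1] is bad and that once a commit is bad, every later
    commit is bad too.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid + 1   # culprit is somewhere after mid
        else:
            hi = mid       # mid is bad; culprit is mid or earlier
    return commits[lo]


history = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"]
culprit = first_bad_commit(history, lambda c: history.index(c) < 5)
print(culprit)  # c6
```

The search can only narrow the problem down to a single commit. If that commit contains one small change, you are done; if it bundles five unrelated changes, the manual disentangling starts where the search stops.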
Large changes also make it harder to go back and dig through history. If you’re trying to understand why someone made a particular change to a particular file and are looking through the old revisions, you might be confused if you find that someone made several other changes to the same file at the same time. Are the changes interrelated? Hopefully the change’s description will explain, but old change descriptions are often less than fully illuminating in practice.
If a change is demonstrated to cause a bug, you might want to back it out. If other changes have been lumped together with it, you might unintentionally back out other, unrelated changes that did not cause the bug and might be desirable to keep in the tree.
Consider also the impact on other engineers who have changes in development. These engineers will need to merge their changes with yours. The larger and more invasive a change is, the harder it can be to merge with other changes.
One specific thing you should not do is combine cosmetic and functional changes in a single change. For example, while making a change, if you notice that a source file has tabs instead of spaces, and your coding policy calls for spaces, don’t reformat the entire file at the same time that you are making your other changes. The same goes for moving curly braces, making the text fit within a certain number of columns, using // vs. /* comments, etc. It’s fine to make these changes to clean up code to meet coding policies… just don’t mix them with substantive, functional changes to the code.
One common way people end up committing large changes is the dreaded “mass integrate”. That is, you have two branches, and you want to catch up the one branch with all the changes made to the other branch. In a mass integrate, rather than integrating each individual change over by itself, you integrate all of the changes together in one big commit. Mass integrates may touch hundreds or thousands of files.
Because they lump many changes together, they may introduce and fix large numbers of bugs all in a single commit, and it may be difficult to track down what caused what. They obscure file history, especially if the descriptions of the individual changes being integrated are not all copied and pasted into the mass integrate’s description. If the mass integrate proves to be unwise, you may not realistically be able to back it out without creating an even bigger mess.
Mass integrates into a long-lived branch, e.g., your trunk or a release branch, are a “worst practice” in software development. Mass integrates into a development branch are not such a problem; the problem arises when merging a development branch back into the main branch. Sometimes you may have no choice but to integrate a bunch of changes together (each change individually breaks things, and you need all of the changes or none for the tree to stay in a consistent, working state), but it can be massively disruptive for a large pile of changes to be thrown into a branch all at once.