Archive for December, 2008
I wrote previously on the cost of integration. I’d like to follow up by discussing a related topic, the cost of branching.
Doesn’t integration imply branching? Not exactly: branching requires integration, but integration necessarily takes place in any multi-developer project whether you use branching or not. Integration simply means that two developers go off and work in parallel in separate trees, and when each one finishes their work, that work has to be combined.
Even if the two developers are editing different groups of files and never edit the same file at the same time, the mere act of combining their changes is still an “integration” and has at least some nonzero risk. For example, on a platform with limited memory, e.g., a game console, each change independently might work, but combining the two might make the system run out of memory. This isn’t either developer’s “fault”, necessarily, just a reflection of the fact that modern software systems are complex and their behavior isn’t always easily predictable.
Back to branching. One cost of branching is that it generally increases the total number of integrations that have to be done. When you commit, you’re immediately integrating with the other people inside your branch, but then later on your change will have to be integrated over into other branches. The more branches, the more times it needs to be integrated.
An example is a team that has overlapping release branches. If release 1 can’t be end-of-lifed by the time release 2 is branched, engineers will need to integrate from release 1 to release 2 to main on each change. This can get many times worse if a large number of branches are outstanding. Each integration represents another source of risk. Consider that each integration could be botched, or even forgotten. (Systems with merge tracking reduce the chance someone will forget about an integration, but merge tracking systems are definitely not foolproof.)
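To make the overlapping-release case concrete, here is a minimal sketch that lists the integrations a single change needs as it is carried forward. The branch names and the function `integrations_for_change` are hypothetical, and the sketch assumes changes always flow in one direction along a simple chain of branches, which real projects may not follow.

```python
def integrations_for_change(branch_chain, origin):
    """List the (source, target) integrations needed to carry a change
    committed in `origin` forward to the end of `branch_chain`.

    Assumes a linear chain where changes only flow forward (hypothetical model).
    """
    start = branch_chain.index(origin)
    return list(zip(branch_chain[start:-1], branch_chain[start + 1:]))


# Hypothetical branch layout: release 1 is still alive when release 2 is branched.
chain = ["release-1", "release-2", "main"]

for source, target in integrations_for_change(chain, "release-1"):
    print(f"integrate {source} -> {target}")
```

Under these assumptions a fix made in release 1 needs two further integrations, while the same fix made directly in main needs none. Every extra live branch in the chain adds one more step that can be botched or forgotten.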
Another cost of branching is that it delays integrations. This is especially the case with development branches, where a body of code is developed off in a separate branch before being integrated back into the main branch. The longer you put off an integration, the more costly it gets.
Another cost of branching is that, in some branching models, the developer of the change is not the same person who will be integrating the change. When I commit and need to merge with others’ changes, I, as the committer, am taking responsibility for my own merges. But later on, if a different engineer has to integrate that change into another branch, or has to merge that change with other changes as part of an integration into the same branch, this person may not fully understand the intricacies of the change. This makes the integration more costly, and it increases the probability of an error. Put another way: if developer A has to integrate his own change A with developer B’s change B, he at least understands his own code. But with branching, developer C may be stuck merging changes A and B from developers A and B.
Another cost of branching is the impact on the revision control system. While marketing materials often love to talk about how cheap branching is in some revision control system, the reality sometimes differs from the marketing:
- Consider the cost of creating a branch. Is it an O(1) (constant time) operation, regardless of how many files are being branched? If the cost of creating a branch increases with the number of files being branched, this can become expensive as trees grow larger. How much does this slow down your server and grow your database?
- Consider all of the extra commits required to integrate between branches. Think of regular commits as “work” and integrations as “overhead.” As the number of branches increases, the percentage of commits that are overhead increases, reducing your server’s performance and causing your database to grow.
- Consider the cost of checking out the branched trees onto client systems, in terms of disk space and network bandwidth. If I need to work in 3 branches, I need to download about 3 times as many files. (Note that a system like Cascade can reduce this cost, by caching files as they are accessed instead of downloading them all up front.)
- If the system does merge tracking, tracking this information and performing queries on what files/changes need to be integrated may impose an additional cost on the server. I’ve observed cases where these queries can bog down a server for many minutes.
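The “work” versus “overhead” point in the list above can be illustrated with a rough model. This is only a sketch: the function `overhead_fraction` is hypothetical, and it assumes every work commit is integrated individually into each additional branch, one integration commit per hop. Batching several changes into one integration would lower the numbers.

```python
def overhead_fraction(work_commits, hops_per_commit):
    """Fraction of all commits that are integrations rather than new work.

    Assumes one integration commit per work commit per hop (hypothetical model).
    """
    integration_commits = work_commits * hops_per_commit
    total = work_commits + integration_commits
    return integration_commits / total if total else 0.0


# How the overhead share changes as each commit must reach more branches.
for hops in range(4):
    share = overhead_fraction(100, hops)
    print(f"{hops} extra branch(es): {share:.0%} of commits are overhead")
```

In this model the overhead share depends only on the number of hops, not on how many work commits there are: with one extra branch half of all commits are integrations, and with three extra branches three quarters are.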
Another cost of branching is the impact on your build, release, and QA flows. Consider, for example, build and test automation. Does each active branch need to have automated builds and tests set up on it? If not, developers in that branch are effectively “flying blind.” Yet setting up and maintaining all of these builds and tests can be burdensome, especially if build and release engineers are a shared resource between developers on many teams. Every time someone creates a branch and requests that builds and tests be set up on the branch, more work is being created for this team. Every time these builds and tests need to be reconfigured, again, more work is being created. In addition, running all of these extra builds and tests may require you to buy more computers to run them on.
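As a rough illustration of how the automation burden scales, the sketch below enumerates the jobs a build team would maintain if every active branch got the full set of builds and tests. The branch names, job names, and the function `ci_jobs` are all hypothetical, and real setups often run a reduced set of jobs on some branches.

```python
from itertools import product


def ci_jobs(branches, configurations):
    """Enumerate one automated job per (branch, configuration) pair."""
    return [f"{branch}/{config}" for branch, config in product(branches, configurations)]


# Hypothetical branches and per-branch job types.
branches = ["main", "release-1", "release-2", "dev-feature"]
configurations = ["debug-build", "release-build", "unit-tests"]

jobs = ci_jobs(branches, configurations)
print(f"{len(jobs)} jobs to set up and maintain")
```

With these assumed inputs, each new branch adds three more jobs to configure, run, and keep working.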
Overall: branching is a powerful tool, but when it’s overused, it can impose a lot of costs on your team. The most important rule to remember is that integration is overhead. The greater the percentage of their time your developers spend integrating, the less time they can spend developing. Branches create more integration work. My preference is always to err on the side of creating branches as late as possible, and ideally to not create them at all if there’s any way to get away with it.
Cascade 22.214.171.1241 has been released! This is primarily a bug fix release with a few minor new features. The exact changes are spelled out in the release notes. Try it now…