Archive for the ‘Software Engineering’ Category
I wrote previously on the cost of integration. I’d like to follow up by discussing a related topic, the cost of branching.
Doesn’t integration imply branching? Not exactly: branching requires integration, but integration necessarily takes place in any multi-developer project whether you use branching or not. Integration simply means that two developers go off and work in parallel in separate trees, and when each one finishes their work, that work has to be combined.
Even if the two developers are editing different groups of files and never edit the same file at the same time, the mere act of combining their changes is still an “integration” and has at least some nonzero risk. For example, on a platform with limited memory, e.g., a game console, each change independently might work, but combining the two might make the system run out of memory. This isn’t either developer’s “fault”, necessarily, just a reflection of the fact that modern software systems are complex and their behavior isn’t always easily predictable.
Back to branching. One cost of branching is that it generally increases the total number of integrations that have to be done. When you commit, you’re immediately integrating with the other people inside your branch, but then later on your change will have to be integrated over into other branches. The more branches, the more times it needs to be integrated.
An example is a team that has overlapping release branches. If release 1 can’t be end-of-lifed by the time release 2 is branched, engineers will need to integrate from release 1 to release 2 to main on each change. This can get many times worse if a large number of branches are outstanding. Each integration represents another source of risk. Consider that each integration could be botched, or even forgotten. (Systems with merge tracking reduce the chance someone will forget about an integration, but merge tracking systems are definitely not foolproof.)
Another cost of branching is that it delays integrations. This is especially the case with development branches, where a body of code is developed off in a separate branch before being integrated back into the main branch. The longer you put off an integration, the more costly it gets.
Another cost of branching is that, in some branching models, the developer of the change is not the same person who will be integrating the change. When I commit and need to merge with others’ changes, I, as the committer, am taking responsibility for my own merges. But later on, if a different engineer has to integrate that change into another branch, or has to merge that change with other changes as part of an integration into the same branch, this person may not fully understand the intricacies of the change. This makes the integration more costly, and it increases the probability of an error. Put another way: if developer A has to integrate his own change A with developer B’s change B, he at least understands his own code. But with branching, developer C may be stuck merging changes A and B from developers A and B.
Another cost of branching is the impact on the revision control system. While marketing materials often love to talk about how cheap branching is in some revision control system, the reality sometimes differs from the marketing:
- Consider the cost of creating a branch. Is it an O(1) (constant time) operation, regardless of how many files are being branched? If the cost of creating a branch increases with the number of files being branched, this can become expensive as trees grow larger. How much does this slow down your server and grow your database?
- Consider all of the extra commits required to integrate between branches. Think of regular commits as “work” and integrations as “overhead.” As the number of branches increases, the percentage of commits that are overhead increases, reducing your server’s performance and causing your database to grow.
- Consider the cost of checking out the branched trees onto client systems, in terms of disk space and network bandwidth. If I need to work in 3 branches, I need to download about 3 times as many files. (Note that a system like Cascade can reduce this cost, by caching files as they are accessed instead of downloading them all up front.)
- If the system does merge tracking, tracking this information and performing queries on what files/changes need to be integrated may impose an additional cost on the server. I’ve observed cases where these queries can bog down a server for many minutes.
Another cost of branching is the impact on your build, release, and QA flows. Consider, for example, build and test automation. Does each active branch need to have automated builds and tests set up on it? If not, developers in that branch are effectively “flying blind.” Yet setting up and maintaining all of these builds and tests can be burdensome, especially if build and release engineers are a shared resource between developers on many teams. Every time someone creates a branch and requests that builds and tests be set up on the branch, more work is being created for this team. Every time these builds and tests need to be reconfigured, again, more work is being created. In addition, running all of these extra builds and tests may require you to buy more computers to run all of these builds and tests on.
Overall: branching is a powerful tool, but when it’s overused, it can impose a lot of costs on your team. The most important rule to remember is that integration is overhead. The greater percentage of their time your developers spend integrating, the less time they can spend developing. Branches create more integration work. My preference is always to err on the side of creating branches as late as possible, and ideally to not create them at all if there’s any way to get away with it.
I’d like to propose a fundamental law of configuration management: the cost of an integration increases over time. This is similar to the well-known software engineering observation that the cost of fixing a bug increases over time.
Let’s start with a simple example: a single project with just 2 engineers, where each engineer commits a single change once per day. Now suppose that both engineers, for some reason, decide to start committing their code in batches of 5 changes once per week instead. I’m not sure why they would do this; I see large benefits to keeping commits small.
Here are the consequences I would foresee:
- A reduction in per-commit overhead by batching up 5 changes into a single larger commit.
- Increased communication overhead: a revision control system is a formalized way for engineers to communicate without having to send emails, etc. In particular, the change descriptions, if well-written, help keep the other team members informed about what is going on. Frequent commits also make conversations like “watch out, I’m working on some big changes to file X” less necessary.
- Increased redundant work: both engineers might fix the same bug in their own local trees rather than picking up the change from the other engineer.
- A larger number of merge conflicts. At the risk of misapplying statistics and making a vast number of simplifying assumptions: if each change touches 5% of the total lines of code, and if changes are randomly distributed in the code, the probability of a merge conflict was about 1.2% weekly before and is about 5.1% weekly now.
- Incompatible changes: both engineers might choose to rewrite the same block of code in two different and inconsistent ways. This will show up as a merge conflict, but it’s worse than a plain old merge conflict. You’re not just doing a textual merge, you’re trying to reconcile two conflicting visions of how the code should work “after the fact” and throwing away a good chunk of the work. Had the first rewrite been committed more promptly, the second rewrite might have been avoided.
- New bugs are discovered and fixed later: if the first engineer’s changes introduce a bug that impacts the second engineer’s work, the bug might be discovered a week later rather than a day later. Standard software engineering literature suggests that bugs cost more to fix over time.
- Increased probability of losing work. Once a change is committed, it’s saved in the repository and won’t be lost in an accidental “rm -rf” or a hardware failure (assuming that the repository itself is being backed up appropriately).
Unless you’re extremely worried about per-commit overhead (in which case I would suggest that you have bigger process problems you need to address), this is definitely not a good thing.
Merge conflicts in particular are more dangerous than a lot of people realize. In software, it is not necessarily true that the correct way to combine two changes is to perform a textual merge of the source code. It is dangerous to assume that simply because a textual merge did not detect any obvious conflicts, you are all set!
To perform a correct merge, you need to understand what has been changed and why. Many engineers have a bad habit of being careless on merges: they let down their guard. Merges are just as real as any other change, and we cannot assume that just because two changes worked independently that they will also work together.
Of course, if the textual merge does detect a conflict, the risks are far greater. An automated merge won’t get tired or make a typo. A human can and sometimes will. If the conflicts are nontrivial, as in the case of two engineers rewriting the same code, merges can be some of the most dangerous changes of all.
So far I’m not really saying anything new here. It’s pretty standard advice that engineers should commit code no less than once a day, even if only to reduce the risk of losing code by accident. Also, there is a lot of literature on the benefits of “continuous integration” as opposed to “Big Bang integration”, or on releasing your software “early and often.”
At the same time, a lot of supposed proponents of continuous integration seem to talk the talk better than they walk the walk. You will find a lot of these same people advocating such things as:
- development branches, where different groups of engineers working on different features commit to different branches/codelines, rather than sharing a single “main” or “trunk” development branch
- “distributed version control systems”, which are development branches taken to another level (all changes are effectively developed in a development branch, and no “main” branch even exists except by convention)
- branching and releasing each component of your project separately, rather than putting all components under a single “main” or “trunk” and branching them simultaneously
I contend that, by delaying integrations, these practices are steps back in the direction of “Big Bang integration” and that they increase the total cost of integrations.
Consider development branches, where several engineers go off and work on different features in different branches rather than working concurrently in a single main branch. Nearly all the same risks I listed above for committing once a week rather than once a day apply here also: communication overhead, redundant work, merge conflicts, incompatible changes, bugs discovered and fixed later. (On the bright side, losing work by accident should not be an issue here.)
The more development branches you have, the more integrations you will need to do. Someone will need to merge the changes from the main branch into the development branch on a regular basis, and when the development branch is done or at least has reached a “good” point, it needs to be merged back into the main branch. Either way, this typically leads to “mass integrates” in both directions.
As I’ve written before, mass integrates are a “worst practice.” Mass integrates can frequently run into dangerous merge conflicts. Because you are merging two large bodies of changes, the probability of a textual or logical conflict between the two sets of changes can be high. The longer the development branch lives on without being integrated back into the main branch, the greater this risk grows. (If you must, for whatever reason, have a development branch, I recommend integrating in both directions as frequently as possible.)
A development branch can be thought of as an intentional delay in integrating code. This can be tempting: “I get my own sandbox where I can do whatever I want!” But this kind of freedom is dangerous at best. For example, it encourages engineers to break things in the branch expecting that they will be “cleaned up later.” If the feature’s schedule starts to slip, this deferred “cleanup” work may be skipped. All of a sudden the development branch “needs” to be merged back into the main branch “right away” so that it can be in place for the next release branch. (I’ve seen this happen a number of times.)
When you add in the costs of delayed integrations, I recommend against development branches. You are better off doing the development in the main branch. This may require a bit more care on each change (you can’t break stuff like you can off in your own little “sandbox”), but the discipline this requires will pay off later. You won’t waste time integrating changes back and forth between branches, and you will spend a lot less time fiddling around with (textual or logical) merge conflicts.
If the new code isn’t ready to activate right away, you can simply hide it behind an #ifdef or somesuch. Even if the #ifdef approach starts to prove to be difficult, it’s likely to still be easier than dealing with merge conflicts: when someone makes an unrelated change that interacts with your changes, there’s a good chance that they will help you out by updating the inactive #ifdef code. And if someone makes a change that truly conflicts with your changes, you’ll know right away.
With the possible bankruptcy of the US Big Three automakers in the news, it’s interesting to think about the analogies between making cars and making software. There is no one single reason why the Big Three have been on the decline for many years now, but surely one of the most important reasons is that American consumers decided that Japanese cars are generally higher-quality than American cars.
These days, the Japanese carmakers’ attitude toward quality and process improvement is so mainstream that it’s almost a cliché. It can easily be taken too far; for example, it is clearly not the case that increasing quality always saves money. Rather than get involved in the “software quality” debate, I’d rather focus right now on the idea of “process improvement.”
Whether you realize it or not, you have a process for building software. Oftentimes very little conscious thought has been put into this process, and it is frequently ineffective and wasteful.
Here are some basic elements of your software engineering process to think about:
- Who decides whether a feature is going to be added or a bug is going to be fixed, and if so, in which release?
- How many outstanding branches/codelines do you have? When do you create a new one? When do you shut down an existing one? Who does integrations between them, and how often?
- How do you build your software, all the way from the original source code to the final CD/DVD image or downloadable installer, for all of your target platforms/build configurations? How do you detect and/or prevent build breaks?
- How do you find bugs in your software? Customer bug reports? Internal QA? Code reviews/code reading? Static analysis tools and compiler warnings? Asserts? Other?
- What do engineers do before they commit their changes, e.g., what platforms do they build and test on, and what tests do they run?
- What happens after a change is committed? What automated or manual builds and tests are run on it?
- How do you verify that a bug is really fixed?
- If a previously committed change seems to be causing problems, what do you do? How much time do you let the change stay in place while you try to debug the problem? Or do you “back it out first and ask questions later”, putting the responsibility on the original change’s author to figure out what went wrong and reapply a correct version of the change later?
- Top engineers are often 10x more productive than average engineers. Is your process geared towards allowing those top engineers to flourish, at the risk of occasional mistakes slipping through, or is it geared towards preventing the average engineer from making mistakes, at the risk of reducing your top engineers’ productivity?
Whatever your current process, if you want to improve the effectiveness of your software development organization, you should be looking for ways to enhance it. A very simple way to do this, pioneered in manufacturing by the Japanese automakers, is to look for the root cause of each problem and fix the root cause so the problem cannot happen again. (I’ve written previously on this topic.) One simple method Toyota adopted to identify root causes is called the “5 Whys.” The important thing is not the specific method you use, but that you do dig down to understand why problems are happening.
This isn’t just the responsibility of management. Individual engineers should be looking for opportunities for process improvement, too. Any time you find or fix a bug, for example, this gives you an opportunity to ask a bunch of questions:
- When and how was the bug introduced?
- Could we have prevented this bug from being introduced in the first place?
- Could we have detected this bug sooner?
- From the time this bug was reported, could we have fixed it sooner?
- Was the bug prioritized appropriately?
- This could be one of a family of related bugs. Are there other, similar bugs elsewhere we should look for?
We can ask much the same questions any time someone’s build is broken:
- Who broke the build?
- Why wasn’t the build break discovered before commit? How could it have been prevented?
- How quickly was the build break detected? How quickly was it fixed?
- Are other build configurations or other components’ builds broken too?
To give a more concrete example, maybe a build break wasn’t discovered before commit because the developer only did builds on a subset of the target platforms/configurations. Perhaps the debug build passed and the release build failed. Or perhaps the Windows build passed and the Linux build failed. What possible process improvements does this suggest?
- Require people to test-build all configurations before committing. I would probably not recommend this; the cost can easily exceed the benefit. Also, engineers are likely to “forget” to follow such a requirement, either intentionally or unintentionally, or are likely to make “one last change” after doing all their tests and not go back and fully test everything again.
- Reduce the number of supported build configurations. Debug/release is pretty typical, but suppose you’re still supporting some ancient operating system that no one cares about any more; perhaps you can finally retire your old DOS or Win9x or MacOS9 build, for example? Or perhaps you can have a single binary for all versions of Linux rather than a separate binary for each supported Linux distro?
- Disable “warnings as errors.” This one is a double-edged sword: it makes your builds less brittle, but it also lets warnings creep back in. It’s up to you to make the right choice.
- Set up a system like Cascade that will reject the commit of any change that breaks a build.
We can never achieve process perfection, but over time we can improve our process so that we don’t make preventable mistakes. We should be able to avoid making the same mistake twice, for example. We also need to watch that our process doesn’t get overly bureaucratic and burdensome. Every so often it may be useful to “deregulate” your process: toss out some of the rules, especially the ones that you think might have a poor cost/benefit ratio, and see what happens.
I wrote previously on the topic of small commits. So when and why would I advise bunching small changes together into bigger ones, aside from the obvious case of changes that must be done atomically to avoid breaking something?
One example is a change that causes a compatibility break. Suppose you have an API, network protocol, file format, database schema, etc. you want to change. If you’re going to make one change to it already, this would be a great opportunity to make other, simultaneous, desirable changes. If people are going to have to upgrade their clients, servers, file parsers, file writers, databases, and/or database queries already, you might as well batch up these changes to reduce the total number of compatibility breaks and the total pain they will cause. The worst case would be if the intermediate API, protocol, database, etc. is released outside your organization. You might then have to support another version for the rest of time.
That’s not to say that you should make unnecessary or gratuitous changes at that time, but if you know you’re going to have to add 2 columns to a table in your database, you might as well add them both at once, rather than doing 2 separate changes to add one column at a time.
If a bug requires making an identical change in a bunch of different places in your source code, I’d likewise advise doing it as a single change. If the same code has been copied and pasted to a bunch of different locations, for example, and each one has the same bug, I’d advise fixing them all at once. The last thing you want, certainly, is to be in the middle of fixing the bug and have someone check in a new change adding another copy of the same buggy code, simply because you didn’t commit all your fixes right away. Committing them together also makes it clear from the revision history that the changes are connected to one another.
However, if you find yourself in such a situation, where a “simple” bug fix requires changing a lot of similar logic all over the place, I might also suggest that you look at your design more carefully and refactor your code to reduce the replication of logic. Any time you are copying and pasting code around, you are usually doing something wrong.
One of the often-claimed benefits of large commits is that there is fixed per-commit overhead. A typical example of this overhead would be a mandatory code review: if every change must be emailed out and you must wait for a reply from another engineer approving your change, this might take a while.
Fixed per-commit overhead, which is very real in many organizations, makes it very tempting to batch up your changes. I’d advise against this. If you are finding that fixed per-commit overhead is forcing you to batch up unrelated changes into a single atomic commit, I would contend that you have a process issue that you need to address.
Sometimes fixed per-commit overhead is simply unnecessary bureaucracy: paranoid management enforcing commit policies that have no logical connection to the actual risk of a change. My view is that a manager needs to be able to trust his employees’ judgment. If you don’t trust your employees to make good decisions and to ask around for help when they don’t know the right answer, I’d suggest that you have a much bigger problem in your organization and that your commit policies are just a band-aid.
These policies tend to drag down the productivity of your best engineers. If your best engineers are often 5-10x more productive than your average engineer, then you can ill afford to have them waste time on every commit, just to prevent your worst engineers from checking in bad code. The real solution is to get rid of the bad engineers or to mentor them so that they don’t need extensive babysitting.
I’ve worked in several organizations with these kinds of overkill commit policies, and my general approach as an engineer was simply to ignore the policies, which were rarely enforced, and use my own best judgment instead. (No… it really isn’t necessary to run a long, comprehensive test suite if all you’ve done is change a comment in the source code.)
In other cases, while the commit policy itself was basically reasonable, the time it took to run through the builds and tests was excessive. In this case the answer is to optimize your processes. If it takes several hours to build and test a change before committing it, forget the question of big vs. small commits — you’re killing your engineers’ productivity across the board.
For example, if your software needs to run on Windows, Linux, and Macintosh, it’s perfectly reasonable to expect that everyone’s changes should compile and pass a simple test on all three platforms before they are committed. But building and testing your changes on all three platforms can take a while, and done manually, it’s error-prone (are you sure you copied the exact same files back and forth between your 3 source trees? are you sure the final change you committed is the same one you tested?). This is where better tools like Cascade can help: instead of doing these builds and tests manually, you can simply “checkpoint” your changes and Cascade will take care of running them all.
If you’ve exhausted all the possible process improvements and commits are still taking a while, one final approach is to pipeline your work. Once you’ve kicked off builds and tests for a change, you shouldn’t have to sit around browsing the web while you wait for them to complete. You ought to be able to start working on another, unrelated change in another tree. Again, Cascade can help. Traditionally, having more trees has been expensive: you have to check out and update the extra trees, and then you still have to build each tree independently (even though the build results should be the same). With Cascade, cloning a new tree takes just seconds, and each tree comes prepopulated with the results of all your builds and tests.
Unless there’s a specific reason why you can’t, I recommend that you commit code to your repository in the smallest possible atomic chunks.
Look, it’s great that modern source control systems allow you to commit an atomic change to more than one file at a time. This is an essential feature and I can’t imagine living without it. But just because we can, that doesn’t mean that we should.
Probably the biggest reason to keep changes small is to make it easier to track down which change caused a particular bug. If two changes are commingled into a single commit, you may have to manually disentangle them to figure out which one was responsible.
Large changes also make it harder to go back and dig through history. If you’re trying to understand why someone made a particular change to a particular file and are looking through the old revisions, you might be confused to find several other changes to the same file made at the same time. Are the changes interrelated? Hopefully the change’s description will explain, but old change descriptions are often less than fully illuminating in practice.
If a change is demonstrated to cause a bug, you might want to back it out. If other changes have been lumped together with it, you might unintentionally back out other, unrelated changes that did not cause the bug and might be desirable to keep in the tree.
Consider also the impact on other engineers who have changes in development. These engineers will need to merge their changes with yours. The larger and more invasive a change is, the harder it can be to merge with other changes.
One specific thing you should not do is combine cosmetic and functional changes in a single change. For example, while making a change, if you notice that a source file has tabs instead of spaces, and your coding policy calls for spaces, don’t reformat the entire file at the same time that you are making your other changes. The same goes for moving curly braces, making the text fit within a certain number of columns, using // vs. /* comments, etc. It’s fine to make these changes to clean up code to meet coding policies… just don’t mix them with substantive, functional changes to the code.
One common way people end up committing large changes is the dreaded “mass integrate”: you have two branches, and you want to catch one branch up on all the changes made to the other. In a mass integrate, rather than integrating each individual change over by itself, you integrate all of the changes together in one big commit. Mass integrates may touch hundreds or thousands of files.
Because they lump many changes together, they may introduce and fix large numbers of bugs all in a single commit, and it may be difficult to track down what caused what. They obscure file history, especially if the descriptions of the individual changes being integrated are not all copy-and-pasted into the mass integrate’s description. If the mass integrate proves to be unwise, you may not realistically be able to back it out without creating an even bigger mess.
Mass integrates into a long-lived branch, e.g., your trunk or a release branch, are a “worst practice” in software development. Mass integrates into a development branch are not such a problem; the problem arises when merging a development branch back into the main branch. Sometimes you may have no choice but to integrate a bunch of changes together (each change individually breaks things, and you need all of the changes or none for the tree to stay in a consistent, working state), but it can be massively disruptive for a large pile of changes to be thrown into a branch all at once.
One of the challenges in implementing a cross-platform file system driver such as Cascade File System is dealing with the many differences, small and large, between how Windows, Linux, and Macintosh file systems work. Some of these differences are well-known and obvious, but there are a lot of other interesting differences underneath the covers, especially when you get down into the file system driver kernel interfaces.
Let’s start with the most obvious one: case sensitivity. Linux has a case sensitive file system. Windows has a case-preserving but case-insensitive file system. Or, at least it looks like Windows does! But in reality Windows supports both. Check out the documentation for the NtCreateFile API, the native NT API that the Win32 API CreateFile maps to. By setting or not setting OBJ_CASE_INSENSITIVE, you can select which type of name lookup you prefer. It’s really up to the individual file system to decide how to interpret all these flags, though. Some Windows file systems, like the original FAT, aren’t even case-preserving.
The Macintosh has been Unix-based since OS X, but its HFS file system has traditionally been case-insensitive and case-preserving, just like Windows. Apple now allows HFS to be formatted either way, case-sensitive or case-insensitive, but the default remains case-insensitive.
The issue of case sensitivity brings up another issue: internationalization. Windows, Linux, and Macintosh all support Unicode in paths; Windows encodes them as UTF-16 in all of its native APIs, whereas Linux and Macintosh use UTF-8. A problem: it’s possible for two non-identical Unicode strings to correspond to the same sequence of characters. That is, certain characters can be legally encoded in more than one way. Macintosh therefore requires all UTF-8 filenames to be stored in a canonicalized form, and so it will prevent you from creating two files in the same directory whose names differ only in encoding. Windows and Linux do not; this can cause interoperability problems when moving data between platforms.
There are several challenges in doing case-insensitive string comparisons in a Unicode-capable file system. NTFS on Windows adopts the following approach: two strings are compared by converting them both to uppercase first, then comparing them for exact equality. The conversion to uppercase is done using a 64K-entry, 128KB table stored on the volume and filled in when the partition is formatted; this ensures that the comparisons do not break (which could cause two files’ names to start colliding) when new characters are added to Unicode and someone upgrades their OS.
Windows uses backslashes as path separators, while Linux and Macintosh use forward slashes. Most of the Win32 APIs allow you to specify forward slashes and will do the conversion for you, but once you get into the NT kernel and the other low-level APIs, backslashes are mandatory.
This in turn means that the set of legal filenames differs between the operating systems. On Linux, for example, you can create a file whose name contains a backslash, while on Windows you cannot. Linux is very permissive about the legal character set, but Windows imposes a number of extra restrictions: a filename cannot end with a space or a period; there are reserved device names like COM1 and NUL; and several characters beyond the path separators, such as <, >, :, ", and |, are forbidden.
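A validity check capturing those extra Windows rules might look like the following sketch. It is illustrative, not exhaustive (the full rules have further corner cases):

```python
# Reserved DOS device names: using one as a filename stem is not allowed.
RESERVED_NAMES = ({"CON", "PRN", "AUX", "NUL"}
                  | {"COM%d" % i for i in range(1, 10)}
                  | {"LPT%d" % i for i in range(1, 10)})
RESERVED_CHARS = set('<>:"/\\|?*')

def is_valid_windows_name(name):
    # Partial sketch of the restrictions described above.
    if not name or name.endswith((' ', '.')):
        return False                       # no trailing space or period
    if any(c in RESERVED_CHARS or ord(c) < 32 for c in name):
        return False                       # reserved/control characters
    stem = name.split('.')[0].upper()
    return stem not in RESERVED_NAMES      # no COM1.txt, NUL.log, etc.

print(is_valid_windows_name("report.txt"))  # True
print(is_valid_windows_name("COM1.txt"))    # False: reserved device name
print(is_valid_windows_name("notes. "))     # False: trailing space
```

All three of the rejected forms above are perfectly legal filenames on Linux, which is exactly the interoperability headache.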
Windows has drive letters, Linux and Macintosh have mount points. Actually, inside the NT kernel and inside kernel drivers, there is really no such thing as a drive letter. A “drive letter” is nothing other than a symbolic link in the NT kernel namespace, e.g., from \DosDevices\X: to \Device\cfs. When you call CreateFile with a path x:\foo.txt, the driver owning the \Device\cfs namespace simply sees a request for \foo.txt. But for practical purposes, this is still important. A Windows path needs to be interpreted differently depending on whether it’s a drive letter path or a UNC path. A Windows file system can be ripped away from applications with files open by removing the symbolic link, whereas a Unix file system cannot be unmounted if files are still open.
The Windows cache manager holds files open. When you close the last handle to a file, from the file system driver’s point of view, the file may still be open. This makes it very difficult to unload a Windows file system driver without a reboot. Unmounting it, i.e., removing the drive letter symbolic link, is easy, but until memory pressure forces the Windows cache manager to flush its cached mappings, those cached mappings may stay around indefinitely.
Permissions are very different. Linux has the standard Unix “UID, GID, mode bits” permissions model, and Macintosh inherits this as well. Both later added ACL-based permissions, but their use is often not considered very mainstream. Windows, on the other hand, is thoroughly ACL-based. Every file in a Windows file system has a security descriptor that includes the ACL. The permissions are far more elaborate than just “read, write, execute”; there are over a dozen types of permissions that you can be granted or denied.
Other file attributes are also different. Windows has a standard list of file attribute bits like “archive”, “hidden”, and “system” that go back to the DOS era. There is no equivalent to these on Unix. All of the systems support a more generic “extended attribute” system, however.
Linux doesn’t have multiple data streams per file. One of the defining properties of Unix, going back to its very beginnings, is that a file is just an array of bytes. Windows, however, allows a file to have multiple “data streams”, while Macintosh supports a similar “resource fork” feature. Apple now discourages the use of resource forks, but multiple data streams continue to be an important feature on Windows in some cases. For example, Internet Explorer attaches an alternate data stream to each file you download to indicate where you downloaded it from. When you later try to run an app that was downloaded from an untrusted zone, you will get a warning asking you whether you really want to do that.
Windows has limited symbolic link support. Windows has “reparse points”, which are like symbolic links for directories only, with some other caveats; but they are supported poorly by many applications. Vista adds something closer to real Unix symbolic links, though again with some limitations.
NtCreateFile() on Windows throws in the kitchen sink. This API has a lot of flexibility that doesn’t exist in the Unix open() system call. For better or worse, just about everything goes through it. For example, there is no equivalent to mkdir() on Windows. Instead, NtCreateFile takes a flag to request that you want to create a directory rather than a file in the event that the path lookup fails. It also supports a number of other random features, like delete-on-close files.
The Windows delete and rename model is different. You wouldn’t know it from the Win32 APIs, but in order to delete or rename a file on Windows, you first have to open it! Only once you’ve opened it can you call NtSetInformationFile with an InformationClass of FileDispositionInformation or FileRenameInformation. Setting FileDispositionInformation doesn’t even delete the file; it merely enables delete-on-close for the file, and the delete-on-close request could very well be cancelled later.
File sharing restrictions and locking are different. Unix generally avoids the idea of restricting what can be done with a file just because someone else is using it. Having a file open doesn’t prevent it from being unlinked, and two people can open the same file for writing. On Windows, all of this is true in theory — you can request whatever sharing mode you want when you open a file — but in practice, most applications use restrictive sharing modes, preventing two apps from using the same file at the same time. Inside a single file, we also have byte range locking. Windows uses mandatory locking: if someone else has the bytes locked, an attempt to modify those bytes with WriteFile() will fail (but this is not enforced for memory-mapped files!). Unix uses only advisory locking and makes no effort to error-check read() or write() calls; it assumes that the application will be responsible and won’t touch data it hasn’t first locked.
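The advisory nature of Unix locking is easy to demonstrate. The sketch below (Unix-only, using the standard fcntl module) takes an exclusive byte-range lock and then overwrites the “locked” bytes anyway:

```python
import fcntl, os, tempfile

# Unix advisory byte-range locking: a lock only coordinates processes that
# explicitly check for it; plain write() calls are never error-checked.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")
with open(path, "w+b") as f:
    f.write(b"0123456789")
    f.flush()
    fcntl.lockf(f, fcntl.LOCK_EX, 4, 0)  # exclusive lock on bytes 0..3

    # A second handle can still overwrite the locked range: the kernel does
    # not consult advisory locks on ordinary writes (and locks never conflict
    # within a single process anyway).
    with open(path, "r+b") as g:
        g.write(b"XXXX")

with open(path, "rb") as f:
    print(f.read())  # b'XXXX456789'
```

On Windows, with mandatory byte-range locking, the equivalent WriteFile() from another handle would fail with a sharing violation instead of silently succeeding.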
This list of differences could go on and on. It’s a challenge to make sure that CFS supports all of the important file system semantics correctly across the platforms, especially because the revision control systems CFS builds on often have different semantics of their own that don’t quite match the standard file systems.
I’ve written earlier about machine-independent builds, but let’s talk about a related issue: build determinism. A build or a build step is deterministic if I can run it multiple times and get the same results (objects, binaries, etc.) every time. That is, the same inputs always result in the same outputs. The assumption of determinism is one of the fundamental reasons why we traditionally don’t check in derived files: we know that, given the sources, we can reconstruct them.
Unfortunately, many builds are not deterministic. Often this is merely annoying, but it can cause some practical problems also.
Why might a build be nondeterministic? The most common reason, I’ve found, is that a tool embeds a timestamp in its output. For example, the Microsoft tools all do this: every PE binary (.dll, .exe, etc.) has a timestamp field indicating when it was built. Further, there is no (documented?) way to tell the linker not to do this!
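If you want to see this timestamp for yourself, it lives at a fixed spot in the PE/COFF header: the DOS header field e_lfanew (offset 0x3C) points at the “PE\0\0” signature, and TimeDateStamp sits 8 bytes past it. A parsing sketch (the header bytes in the demo are synthetic, just to exercise the parser):

```python
import struct

def pe_timestamp(data):
    # e_lfanew at offset 0x3C points to the "PE\0\0" signature; TimeDateStamp
    # follows the signature (4 bytes), Machine (2), and NumberOfSections (2).
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("not a PE image")
    (ts,) = struct.unpack_from("<I", data, e_lfanew + 8)
    return ts

# Synthetic minimal header, purely for illustration:
img = bytearray(0x100)
struct.pack_into("<I", img, 0x3C, 0x80)        # e_lfanew -> 0x80
img[0x80:0x84] = b"PE\0\0"
struct.pack_into("<I", img, 0x88, 1234567890)  # TimeDateStamp
print(pe_timestamp(bytes(img)))  # 1234567890
```

Two builds of identical sources will differ in exactly this field, which is the root of all the problems listed next.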
Since the embedded timestamp doesn’t affect the program’s runtime behavior, why do we care? Here are some reasons:
- If the binary timestamp is compared against the debug info timestamp (Visual Studio does this), the debug info won’t be loaded when they mismatch, even though it may well be accurate debug info built from the same tree at a different point in time. (Do you save off your .pdb files from all your official releases?)
- We can no longer check two .dll’s or .exe’s for exact equality via a standard diff program or MD5/SHA1 hash. We have to know which bytes to ignore in the comparison.
- We can’t uniquely identify a build created by an individual developer by its MD5/SHA1 hash; each developer’s builds will have a different hash. It would be nice if we could identify what software version someone is running simply with a “sha1sum <path>/*” command whose output was fed into a database.
- If you change a comment or some other part of a program that doesn’t have any impact on the compiled code, you may get unnecessary rebuilds of downstream steps. Some smarter “make” replacements will look at the file’s hash rather than its last-modified timestamp. Cascade will do the same if this file is an input file of another task. Do you really want your build system to rebuild your .msi installer after you change a comment in a .c file?
- Cascade implements single-instance storage of output files. That is, if two tasks produce an identical output file, Cascade will only store one copy of the output file. This can save a lot of disk space in some cases. Any difference in an output file, however trivial, will defeat this single-instance storage.
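Several of these points boil down to hashing build outputs for identity. A minimal sketch of why an embedded timestamp defeats that (the fake four-byte timestamp is illustrative):

```python
import hashlib, os, tempfile

def file_sha1(path):
    # Stream the file in chunks so large binaries needn't fit in memory.
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

# Two "builds" identical except for a fake embedded timestamp:
d = tempfile.mkdtemp()
a = os.path.join(d, "build1.dll")
b = os.path.join(d, "build2.dll")
with open(a, "wb") as f:
    f.write(b"identical code" + b"\x00\x00\x00\x01")
with open(b, "wb") as f:
    f.write(b"identical code" + b"\x00\x00\x00\x02")
print(file_sha1(a) == file_sha1(b))  # False: one differing byte changes the hash
```

With deterministic builds, the two hashes would match and a simple database of release hashes could identify any binary in the field.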
Another way you can end up with an embedded timestamp is to use __TIME__ or __DATE__, or to write a script that embeds it in generated source code, although these are unlikely to happen by accident.
Yet another is digital signatures for code signing. Certificates expire, but you still have your private key even after it expired. Yet you can’t have your program stop working or stop being validated as authentic just because the certificate it was originally signed with has now expired. So certificate authorities provide a “timestamping” service where they will attach their own signature to your binary, attesting that the binary existed as of a particular timestamp (at which time the certificate was still valid).
Another major class of nondeterminism has to do with the absolute path to your source code. A tool will typically embed this path so that the debugger can automatically find your program’s debug info or source code. Sometimes compiler command lines get embedded in binaries, too, and these command lines tend to contain absolute paths to headers, libraries, etc. You probably don’t want this path information going into your official releases. If you are working in two separate trees or branches, or two developers have trees at different paths, you can’t copy binaries back and forth between them. It can also be annoying if you share your tree over the network so multiple computers can get to it: if your C: drive on one computer is mapped as another computer’s Z: drive, the embedded C: paths will be all wrong when a debugger or profiler running on the other computer tries to look up the code.
Aside from date-and-time-related and path-related nondeterminism, here are some other sources of nondeterminism to think about (note that the line between “deterministic” and “machine-independent” is somewhat blurry):
- Does other environmental information enter into your build process? Some examples: your username, hostname, IP, processor type or speed, OS version, or environment variables.
- Do you do profile-guided optimization? If the profile data isn’t deterministic for whatever reason, the resulting optimized binaries won’t be deterministic, either.
- Does your build talk to any servers on the network? Once a build requires network access, you’ve greatly expanded the scope of what can go wrong.
Normally, when debugging, I would recommend starting from first principles and trying to understand the problem’s root cause before doing anything else. Without a complete understanding of the bug, your chances of implementing a complete and correct fix are poor. The last thing you want to do is apply a random fix that “seems to work” without understanding why it works; at that point, you’re a member of the Voodoo Magic School of Software Engineering.
So, my first tool in debugging is always, if possible, an interactive debugger. Stepping through the code is the ideal way to understand when and how the program’s actual behavior starts to diverge from its expected behavior. If an interactive debugger is out of the question for some reason, there are always fallbacks like “printf debugging”, but I find this to be much less effective than a real debugger.
Every so often, though, you run into a bug that is truly mysterious, where you can’t seem to track down what is going wrong. It would be great to understand the full behavior of your system in every last detail, but large and complex software systems eventually start to display “emergent” behavior, where the system as a whole starts to display behavior that cannot easily be explained as the sum of the behaviors of the parts.
What then? One approach to shortcut the debugging process is to binary search for the change that introduced a bug. This requires no understanding of the system’s behavior at all. All you need to know is a revision number where the bug exists and an older revision number where the bug didn’t exist. If it worked at revision 100 and fails at revision 200, then you test at revision 150 to narrow that range by a factor of 2, and so on until you identify the exact change responsible — just like any other binary search.
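The search itself is just ordinary binary search over revision numbers; a sketch (the `test_passes` callback, which would build a revision and run the repro case, is hypothetical):

```python
def find_breaking_revision(good, bad, test_passes):
    # Precondition: test passes at `good`, fails at `bad`, and the bug,
    # once introduced, stays present (the search assumes monotonicity).
    while bad - good > 1:
        mid = (good + bad) // 2
        if test_passes(mid):
            good = mid    # bug not yet present; search the upper half
        else:
            bad = mid     # bug already present; search the lower half
    return bad            # first known-bad revision

# Example: suppose the bug was actually introduced in revision 137.
print(find_breaking_revision(100, 200, lambda rev: rev < 137))  # 137
```

Each iteration halves the range, so even a span of hundreds of revisions needs fewer than a dozen build-and-test cycles.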
The first obvious caveat is that this only works for regressions, not for bugs that have existed since the software was first written. Another is that you had better be able to accurately reconstruct old builds of your software; this can be a problem if you don’t have machine-independent builds. Even if you can reconstruct old builds, it can take a while to build them (syncing a tree back to an older revision has a way of flushing out makefile bugs that syncing forward never reveals, so incremental builds are often not trustworthy).
Another tough problem is that sometimes there are parts of the range you want to search where the software is broken: either it doesn’t build, or it builds but there is another, more severe bug that prevents you from running your test case in the first place.
A final problem is that while you may be able to track down the problem to a single commit, some engineers have a bad habit of combining many independent logical changes (which could have each been a separate commit) into one large super-change. Continuous integration is your friend here: you should be committing code on a daily basis, not batching up weeks or months of changes into a single commit. Aside from individual engineers’ bad habits, this can also happen if you merge an entire branch worth of changes (which could easily be hundreds of changes) back into your main codeline in a single commit. As far as that goes, my recommendation is to avoid using “private branches” or “feature branches” and to commit to the mainline as frequently as possible, even if you have to leave your new code in a disabled state via #ifdef or whatnot. “Big bang integration” has a bad reputation for a good reason.
Once you’ve found the change responsible, then what? Unless the problem is extremely urgent and a fix cannot wait a second longer than it needs to, or unless it’s extremely obvious why the change is wrong from inspection alone, I do not recommend that you immediately back out the change. Instead, this is when you fire up your debugger to figure out what is really going on. The change may not have been the real culprit; it may have just perturbed the system a bit, causing a preexisting bug elsewhere to reveal itself in your test case. All that binary searching gives you is a clue. It does not give you a root cause. You still need to find the root cause to apply a correct and complete fix.
One good way to make use of binary searching for a bug is to have your QA department handle it. Since it doesn’t require understanding of the code itself, anyone who can build and run the software can do a binary search. QA can provide the result of the binary search in the bug report, thereby offloading this task from development.
Cascade can help make binary searching more effective:
- Rejecting commits that break builds and regression tests helps ensure that the search won’t break down in a region of changes where the software doesn’t build or run at all, increasing the probability of a successful binary search.
- One of the slowest and most frustrating parts of a binary search is doing all of the builds. Since Cascade builds your software on every single relevant commit and archives those builds off, you can skip this step entirely for any recent regression.
- There’s no guesswork about which changes need to be searched and which ones can be skipped. Cascade’s automatic dependency tracking knows exactly which changes affect your build.
- Cascade helps enable continuous integration, allowing your engineers to commit code directly to the main codeline with less fear that low-quality commits such as build breaks will hurt everyone’s productivity. Smaller, more frequent commits make the results of binary searches more useful.
Traditionally, users of source control systems are told that they should only put source files, not derived files, in their repositories. For example, you would check in your .c and .h files, but not .obj, .lib, .dll, or .exe files built from those .c and .h files. Or, if you generate some of your source code using a Python script, you would check in the .py script file, but not the resulting .c file.
There are two reasons for this:
- Checking in these derived files bloats the repository and slows down the repository server.
- The derived files can become stale — that is, they can fall out of sync with the original source files.
The latter is the more troublesome of the two. Sometimes someone will forget to update the derived files. Sometimes someone might not even know that the derived file has been checked in in the first place. Sometimes a derived file’s dependencies are not obvious; for example, you might not realize that a module pulls in a particular header file through some complex chain of nested #include’s. Perhaps the ugliest case is that you might discover that someone else has broken the build for the derived file — either it no longer builds at all, or it appears to build but produces a file that no longer works — thereby preventing you from updating it, even though your changes clearly require it to be updated.
Many ugly, hard-to-track-down problems can happen when derived files become stale — much the same as if you don’t rebuild a particular .c file when a header file changes. If you’ve ever tracked down such a problem, you probably know how frustrating it can be. The program’s behavior can seem totally illogical until you finally realize (for example) that two .c files are using a different layout for the same structure.
Another difficult problem is merging. Merging derived files is incorrect. (Or, in mathematical terms, it is not always the case that f(a+b) = f(a) + f(b).) The derived file generated from the merge of the changes to the inputs is not always the same as the merge of the derived files generated by each changed input by itself. This is obvious for binary files like libraries, but it’s all too easy to fall into the trap of attempting to merge a derived text file instead of regenerating it from the latest sources.
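A toy illustration of why merging derived files goes wrong. Here the (hypothetical) generator emits a line-count header; textually merging two independently generated outputs would keep a stale count, while regenerating from the merged sources gives the correct one:

```python
def generate(source_lines):
    # Toy "derived file": a count header followed by the sorted source lines.
    return ["count: %d" % len(source_lines)] + sorted(source_lines)

base = ["apple", "cherry"]
change_a = base + ["banana"]   # one developer adds a line
change_b = base + ["date"]     # another developer adds a different line

merged_sources = sorted(set(change_a) | set(change_b))
print(generate(merged_sources)[0])  # 'count: 4' -- correct, from merged sources
print(generate(change_a)[0])        # 'count: 3' -- each side's derived file
print(generate(change_b)[0])        # 'count: 3' -- merging these keeps a stale count
```

A text merge of the two derived files has no conflict on the header line (both sides say “count: 3”), so it silently produces a wrong result; only regeneration is safe.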
This can also be a problem when committing your changes: if someone else commits a change that affects the same derived file, you need to grab their changes and rebuild the derived file again. The source control system won’t necessarily warn you about this, so it’s easy to check in a bad derived file by accident.
Yet, there are also reasons why this traditional advice to not check in derived files can be impractical. Let’s leave out obvious examples such as cutting official releases, where you probably want to permanently archive the binaries you released to customers (you might archive them somewhere other than your source control system, but there’s nothing fundamentally wrong with using your source control system to archive releases). Instead, let’s focus on the development process itself.
A large and complex software system’s build time can grow very long. You may only be working on a small piece of the project, but you have to build the entire thing, possibly including components that you may know little to nothing about. (Hopefully all of these components at least have the same build system, so you can type a single command like “make” from the top of the tree to build. In practice, this is not always the case; I’ve worked on projects where each module had its own incompatible build system.)
This creates a dilemma: either each individual engineer has to build the entire project, duplicating builds done by many other engineers, or we can check some or all of the component build results into source control, allowing an engineer to skip builds for components not affected by their changes. Either way, we’re wasting people’s time. The former makes for slow builds; the latter increases the pain of doing a commit.
Ultimately, both solutions are problematic. For large software projects, we need a better solution that offers the best of both worlds.
Cascade offers the possibility of a better solution. Using Cascade:
- You can easily grab pre-generated derived files, without the need to store those derived files in the repository.
- To keep the required disk space bounded, the derived files aren’t kept around forever. You can purge old derived files.
- The derived files are always kept up-to-date, precisely in sync with the source files they are generated from. Cascade’s automatic dependency tracking knows exactly when they need to be rebuilt. You don’t need to update them yourself as part of your commit.
- If the build of a derived file breaks, you’ll know about it right away, either from the web page or from the email Cascade sends out.
- There are no worries about merging derived files. Cascade will always re-generate the derived file correctly regardless of what changes are made to the source files and in what order.
Our GNU make whitepaper mentions that, if you’re using Visual C with GNU make, you can use the /showIncludes compiler option to help generate .d files, much like you can with the -MD option to gcc. I thought I’d post a Python code snippet to illustrate.
One thing that’s important to remember is that, as the Visual C documentation points out, /showIncludes prints its results to stderr and not stdout. This means that you cannot simply redirect stdout to a file and then parse it. You generally don’t want to redirect stderr to a file, because this will hide other error messages. Also, there’s no sense in creating a bunch of temporary files that you’ll just have to delete later.
Fortunately, the Python subprocess module gives us extensive control over redirection of stderr, so we can simply write a wrapper script around the compiler that takes care of everything. We’ll create a script named cl.py that wraps cl.exe. It will automatically add the /showIncludes option and merge the cl.exe output streams into a single pipe that we can parse for dependencies.
cl.py’s usage is very simple. Normally, you might run the command cl /c foo.c to create foo.obj. Now you run python cl.py /c foo.c to create both foo.obj and foo.d. Without further ado, here’s the script. Enjoy!
# cl.exe wrapper to create .d file from /showIncludes
# Python 2.5 or later required
from __future__ import with_statement
import sys, os, subprocess

# Determine path for .obj and .d files
# This is probably not 100% robust; you may need to tweak it depending on how
# you are invoking cl.exe.
cmdline = sys.argv[1:]
source = None
target = './'  # default is current directory
for arg in cmdline:
    if arg.startswith('/') or arg.startswith('-'):
        # A compiler flag
        if arg[1:3] == 'Fo':
            target = arg[3:].replace('\\', '/')
    else:
        # This must be the source file name (assume there is only one)
        source = arg
if target.endswith('/'):
    # Default object file name is source file with extension changed to .obj
    target = target + source[0:source.rfind('.')] + '.obj'
if target.startswith('./'):
    target = target[2:]  # chop off unnecessary ./ prefix
# Name of .d file is name of object file with extension changed to .d
depend_file = target[0:target.rfind('.')] + '.d'

# Build cl.exe command line with /showIncludes
# Assumption: cl.exe is in the user's PATH
cmdline = ['cl.exe', '/showIncludes'] + cmdline

# Launch the cl.exe process with stdout/stderr merged into a single pipe
p = subprocess.Popen(cmdline, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Parse the cl.exe output to build up a set of dependencies
deps = set()
for line in p.stdout:
    if line.startswith('Note: including file:'):
        dep = line[21:].strip()
        dep = os.path.normpath(dep)
        dep = os.path.normcase(dep)
        dep = dep.replace('\\', '/')   # use forward slashes for path separators
        dep = dep.replace(' ', '\\ ')  # escape spaces in paths with a backslash
        deps.add(dep)
    else:
        sys.stdout.write(line)

# Wait for cl.exe to exit, then return an error if it failed
ret = p.wait()
if ret != 0:
    sys.exit(1)

# Write the .d file, one dependency per line
with open(depend_file, 'wt') as f:
    for dep in sorted(deps):
        print >>f, '%s: %s' % (target, dep)
One warning: spawning an extra Python process on each source file can hurt performance. Ideally, something along the lines of the above would be built directly into GNU make, but I digress.