I’ve written earlier about machine-independent builds, but let’s talk about a related issue: build determinism. A build or a build step is deterministic if I can run it multiple times and get the same results (objects, binaries, etc.) every time. That is, the same inputs always result in the same outputs. The assumption of determinism is one of the fundamental reasons why we traditionally don’t check in derived files: we know that, given the sources, we can reconstruct them.
Unfortunately, many builds are not deterministic. Often this is merely annoying, but it can also cause practical problems.
Why might a build be nondeterministic? The most common reason, I’ve found, is that a tool embeds a timestamp in its output. For example, the Microsoft tools all do this: every PE binary (.dll, .exe, etc.) has a timestamp field indicating when it was built. Further, there is no (documented?) way to tell the linker not to do this!
Since the embedded timestamp doesn’t affect the program’s runtime behavior, why do we care? Here are some reasons:
- If the binary timestamp is compared against the debug info timestamp (Visual Studio does this), the debug info won’t be loaded when they mismatch, even though it may well be accurate debug info built from the same tree at a different point in time. (Do you save off your .pdb files from all your official releases?)
- We can no longer check two .dll’s or .exe’s for exact equality via a standard diff program or MD5/SHA1 hash. We have to know which bytes to ignore in the comparison.
- We can’t uniquely identify a build created by an individual developer by its MD5/SHA1 hash; each developer’s builds will have a different hash. It would be nice if we could identify what software version someone is running simply with a “sha1sum <path>/*” command whose output was fed into a database.
- If you change a comment or some other part of a program that doesn’t have any impact on the compiled code, you may get unnecessary rebuilds of downstream steps. Some smarter “make” replacements will look at the file’s hash rather than its last-modified timestamp. Cascade will do the same if this file is an input file of another task. Do you really want your build system to rebuild your .msi installer after you change a comment in a .c file?
- Cascade implements single-instance storage of output files. That is, if two tasks produce an identical output file, Cascade will only store one copy of the output file. This can save a lot of disk space in some cases. Any difference in an output file, however trivial, will defeat this single-instance storage.
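To make "knowing which bytes to ignore" concrete, here is a minimal sketch in Python that hashes a PE file with the COFF header's TimeDateStamp field zeroed out. The function names are my own, and this only handles that one field. A real image usually has other fields that vary from build to build (the debug directory carries its own timestamp, for example), so treat this as a starting point rather than a complete normalizer.

```python
import hashlib
import struct


def pe_timestamp_offset(data: bytes) -> int:
    """Return the file offset of the COFF header's TimeDateStamp field."""
    if data[:2] != b"MZ":
        raise ValueError("not a PE file: missing MZ signature")
    # e_lfanew (offset of the PE signature) is stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE file: missing PE signature")
    # Signature (4 bytes), Machine (2) and NumberOfSections (2) come first.
    return pe_offset + 8


def normalized_sha1(data: bytes) -> str:
    """SHA-1 of the image with the COFF timestamp replaced by zeros."""
    offset = pe_timestamp_offset(data)
    masked = bytearray(data)
    masked[offset:offset + 4] = b"\0\0\0\0"
    return hashlib.sha1(masked).hexdigest()
```

Two builds that differ only in that header timestamp would then get the same `normalized_sha1` value, even though their plain SHA-1 hashes differ.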
Another way you can end up with an embedded timestamp is to use __TIME__ or __DATE__, or to write a script that embeds the current time in generated source code, although these are unlikely to happen by accident.
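As an illustration of the generated-source case, here is a small sketch of a generator script in Python. The header contents and the `BUILD_TIMESTAMP` macro name are hypothetical. The point is only that reading the clock inside the generator makes its output differ on every run, while passing the time in as an explicit input keeps the step deterministic.

```python
import time


def render_build_info(build_time=None):
    """Return C source text for a hypothetical build-info header.

    build_time is seconds since the epoch. If it is None, the current
    clock is read, so the output changes from run to run.
    """
    if build_time is None:
        build_time = int(time.time())  # nondeterministic input
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(build_time))
    return f'#define BUILD_TIMESTAMP "{stamp}"\n'
```

Calling `render_build_info()` twice a second apart gives two different headers. Calling it with the same pinned value, say the time of the last source commit, gives identical text each time.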
Yet another is digital signatures for code signing. Certificates expire, but you still have your private key even after the certificate has expired. Yet you can’t have your program stop working or stop being validated as authentic just because the certificate it was originally signed with has now expired. So certificate authorities provide a “timestamping” service where they will attach their own signature to your binary, attesting that the binary existed as of a particular timestamp (at which time the certificate was still valid). Since that timestamp records when the binary was signed, signing the same binary twice produces two different files.
Another major class of nondeterminism has to do with the absolute path to your source code. Tools typically embed this path so that the debugger can automatically find your program’s debug info or source code. Sometimes compiler command lines also get embedded in binaries, and these command lines tend to contain absolute paths to headers, libraries, etc. You probably don’t want this path information going into your official releases. If you are working in two separate trees or branches, or two developers have trees at different paths, you can’t copy binaries back and forth between them. It can also be annoying if you share your tree over the network so multiple computers can get to it. If your C: drive on one computer is mapped as another computer’s Z: drive, the embedded C: paths will be all wrong when a debugger or profiler running on the other computer tries to look up the code.
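One way to check whether your tree’s path is leaking into a binary is simply to search the file for it. The sketch below, in Python, looks for the path both as plain ASCII/UTF-8 and as UTF-16LE, since Windows tools store strings either way. The function name is my own, and the search is case-sensitive and exact, so it will miss paths that were stored with different casing or with forward slashes.

```python
def find_embedded_path(data: bytes, tree_root: str) -> list[int]:
    """Return the offsets in data where tree_root appears.

    The path is searched for as UTF-8 (which covers ASCII) and as UTF-16LE.
    """
    offsets = []
    for needle in (tree_root.encode("utf-8"), tree_root.encode("utf-16-le")):
        start = 0
        while True:
            idx = data.find(needle, start)
            if idx < 0:
                break
            offsets.append(idx)
            start = idx + 1
    return sorted(offsets)
```

For example, `find_embedded_path(open(path, "rb").read(), r"C:\src\tree")` returns an empty list if the (hypothetical) tree root `C:\src\tree` does not appear in the file in either encoding.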
Aside from date-and-time-related and path-related nondeterminism, here are some other types of nondeterminism to think about. (Note that the line between “deterministic” and “machine-independent” is somewhat blurry.)
- Does other environmental information enter into your build process? Some examples: your username, hostname, IP, processor type or speed, OS version, or environment variables.
- Do you do profile-guided optimization? If the profile data isn’t deterministic for whatever reason, the resulting optimized binaries won’t be deterministic, either.
- Does your build talk to any servers on the network? Once a build requires network access, you’ve greatly expanded the scope of what can go wrong.
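Whatever the cause, a simple way to find out which outputs are nondeterministic is to build the same sources twice into two separate output directories and compare hashes. Here is a minimal sketch in Python. The function names are my own, and it assumes you have already arranged for the two builds to write to different directories.

```python
import hashlib
from pathlib import Path


def hash_tree(root):
    """Map each file's path (relative to root) to the SHA-1 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha1(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def nondeterministic_outputs(first, second):
    """Return relative paths that differ in content or exist in only one tree."""
    a, b = hash_tree(first), hash_tree(second)
    return sorted(path for path in a.keys() | b.keys() if a.get(path) != b.get(path))
```

An empty result means the two builds were byte-for-byte identical. Anything else is a list of files worth inspecting for timestamps, paths, or other environmental information.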