The Conifer Systems Blog

Build Determinism


I’ve written earlier about machine-independent builds, but let’s talk about a related issue: build determinism.  A build or a build step is deterministic if I can run it multiple times and get the same results (objects, binaries, etc.) every time.  That is, the same inputs always result in the same outputs.  The assumption of determinism is one of the fundamental reasons why we traditionally don’t check in derived files: we know that, given the sources, we can reconstruct them.

Unfortunately, many builds are not deterministic.  Often this is merely annoying, but it can cause some practical problems also.

Why might a build be nondeterministic?  The most common reason, I’ve found, is that a tool embeds a timestamp in its output.  For example, the Microsoft tools all do this: every PE binary (.dll, .exe, etc.) has a timestamp field indicating when it was built.  Further, there is no (documented?) way to tell the linker not to do this!

Since the embedded timestamp doesn’t affect the program’s runtime behavior, why do we care?  Here are some reasons:

  • If the binary timestamp is compared against the debug info timestamp (Visual Studio does this), the debug info won’t be loaded when they mismatch, even though it may well be accurate debug info built from the same tree at a different point in time.  (Do you save off your .pdb files from all your official releases?)
  • We can no longer check two .dll’s or .exe’s for exact equality via a standard diff program or MD5/SHA1 hash.  We have to know which bytes to ignore in the comparison.  (A sketch of such a timestamp-masking comparison appears just after this list.)
  • We can’t uniquely identify a build created by an individual developer by its MD5/SHA1 hash; each developer’s builds will have a different hash.  It would be nice if we could identify what software version someone is running simply with a “sha1sum <path>/*” command whose output was fed into a database.
  • If you change a comment or some other part of a program that doesn’t have any impact on the compiled code, you may get unnecessary rebuilds of downstream steps.  Some smarter “make” replacements will look at the file’s hash rather than its last-modified timestamp.  Cascade will do the same if this file is an input file of another task.  Do you really want your build system to rebuild your .msi installer after you change a comment in a .c file?
  • Cascade implements single-instance storage of output files.  That is, if two tasks produce an identical output file, Cascade will store only one copy of the output file.  This can save a lot of disk space in some cases.  Any difference in an output file, however trivial, will defeat this single-instance storage.
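
To illustrate the comparison problem, here is a small sketch (not part of Cascade) that computes the SHA1 of a PE binary with the linker’s TimeDateStamp field zeroed out, so that two binaries differing only in that field hash identically.  It assumes the COFF header timestamp is the only one embedded; real binaries can carry others (in the debug directory, for instance), so treat it as a starting point rather than a complete solution.

# Sketch: SHA1 of a PE file with the COFF header timestamp masked to zero.
import hashlib, struct, sys

def timestamp_free_sha1(path):
    data = bytearray(open(path, 'rb').read())
    # The offset of the PE header is stored as a 32-bit value at offset 0x3c.
    pe_offset = struct.unpack_from('<I', data, 0x3c)[0]
    # After the 4-byte "PE\0\0" signature come Machine (2 bytes) and
    # NumberOfSections (2 bytes); TimeDateStamp is the next 4 bytes.
    struct.pack_into('<I', data, pe_offset + 8, 0)
    return hashlib.sha1(bytes(data)).hexdigest()

for path in sys.argv[1:]:
    print('%s  %s' % (timestamp_free_sha1(path), path))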

Another way you can end up with an embedded timestamp is to use __TIME__ or __DATE__, or to write a script that embeds it in generated source code, although these are unlikely to happen by accident.

Yet another is digital signatures for code signing.  Certificates expire, but you still have your private key even after the certificate has expired.  Yet you can’t have your program stop working or stop being validated as authentic just because the certificate it was originally signed with has now expired.  So certificate authorities provide a “timestamping” service where they will attach their own signature to your binary, attesting that the binary existed as of a particular timestamp (at which time the certificate was still valid).

Another major class of nondeterminism has to do with the absolute path to your source code.  This is typically used to embed a path to your program’s debug info or source code, so that the debugger can automatically find it.  Or, sometimes compiler command lines get embedded in binaries, and these command lines can tend to contain absolute paths to headers, libraries, etc.  You probably don’t want this path information going into your official releases.  If you are working in two separate trees or branches, or two developers have trees at different paths, you can’t copy binaries back and forth between them.  It can also be annoying if you share your tree over the network so multiple computers can get to it.  If your C: drive on one computer is mapped as another computer’s Z: drive, the embedded C: paths will be all wrong when a debugger or profiler running on the other computer tries to look up the code.
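
If you want to check whether path information has leaked into your build outputs, a brute-force scan for the tree’s absolute path is a reasonable first pass.  The sketch below is not part of Cascade, and the paths in the usage comment are purely hypothetical; it also won’t catch paths stored in less obvious encodings such as UTF-16 strings, so a clean result is suggestive rather than conclusive.

# Sketch: report build outputs that embed the absolute path of the source tree.
import os

def find_embedded_paths(output_dir, tree_root):
    # Look for the path in both slash styles.
    needles = [tree_root, tree_root.replace('/', '\\')]
    hits = []
    for dirpath, dirnames, filenames in os.walk(output_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            data = open(path, 'rb').read()
            if any(needle in data for needle in needles):
                hits.append(path)
    return hits

# Example (hypothetical paths): find_embedded_paths('build', 'c:/work/mytree')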

Aside from date-and-time-related and path-related nondeterminism, here are some other sources of nondeterminism to think about (note that the line between “deterministic” and “machine-independent” is somewhat blurry):

  • Does other environmental information enter into your build process?  Some examples: your username, hostname, IP, processor type or speed, OS version, or environment variables.
  • Do you do profile-guided optimization?  If the profile data isn’t deterministic for whatever reason, the resulting optimized binaries won’t be deterministic, either.
  • Does your build talk to any servers on the network?  Once a build requires network access, you’ve greatly expanded the scope of what can go wrong.

Written by Matt

October 17th, 2008 at 2:20 pm

Cascade 1.0.1.794 released!


Cascade 1.0.1.794 has been released!  This is a bug fix release that addresses a variety of issues, mostly minor, with Cascade 1.0.0.  A list of specific noteworthy fixes is provided in the release notes.  Try it now!


Written by Matt

October 16th, 2008 at 2:24 pm

Binary Searching for Bugs


Normally, when debugging, I would recommend starting from first principles and trying to understand the problem’s root cause before doing anything else.  Without a complete understanding of the bug, your chances of implementing a complete and correct fix are poor.  The last thing you want to do is apply a random fix that “seems to work” without understanding why it works; at that point, you’re a member of the Voodoo Magic School of Software Engineering.

So, my first tool in debugging is always, if possible, an interactive debugger.  Stepping through the code is the ideal way to understand when and how the program’s actual behavior starts to diverge from its expected behavior.  If an interactive debugger is out of the question for some reason, there are always fallbacks like “printf debugging”, but I find this to be much less effective than a real debugger.

Every so often, though, you run into a bug that is truly mysterious, where you can’t seem to track down what is going wrong.  It would be great to understand the full behavior of your system in every last detail, but large and complex software systems eventually start to display “emergent” behavior, where the system as a whole behaves in ways that cannot easily be explained as the sum of the behaviors of the parts.

What then?  One approach to shortcut the debugging process is to binary search for the change that introduced a bug.  This requires no understanding of the system’s behavior at all.  All you need to know is a revision number where the bug exists and an older revision number where the bug didn’t exist.  If it worked at revision 100 and fails at revision 200, then you test at revision 150 to narrow that range by a factor of 2, and so on until you identify the exact change responsible — just like any other binary search.
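
The search loop itself is mechanical.  Here is a minimal sketch, assuming a hypothetical sync_build_and_test() helper that syncs the tree to a given revision, builds it, runs the test case, and returns True if it passes:

# Sketch of a revision bisection.  Assumes the test passes at "good", fails at
# "bad", and that every revision in between builds and runs.
def bisect(good, bad, sync_build_and_test):
    while bad - good > 1:
        mid = (good + bad) // 2
        if sync_build_and_test(mid):
            good = mid   # still works here; the culprit is in the newer half
        else:
            bad = mid    # already broken here; the culprit is in the older half
    return bad           # the first revision at which the test fails

With a range of 100 revisions, this takes about seven build-and-test cycles.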

The first obvious caveat is that this only works for regressions, not for bugs that have existed since the software was first written.  Another is that you had better be able to accurately reconstruct old builds of your software; this can be a problem if you don’t have machine-independent builds.  Even if you can reconstruct old builds, it can take a while to build them (syncing a tree back to an older revision has a way of flushing out makefile bugs that syncing forward to newer revisions never reveals, so incremental builds are often not trustworthy).

Another tough problem is that sometimes there are parts of the range you want to search where the software is broken: either it doesn’t build, or it builds but there is another, more severe bug that prevents you from running your test case in the first place.

A final problem is that while you may be able to track down the problem to a single commit, some engineers have a bad habit of combining many independent logical changes (which could have each been a separate commit) into one large super-change.  Continuous integration is your friend here: you should be committing code on a daily basis, not batching up weeks or months of changes into a single commit.  Aside from individual engineers’ bad habits, this can also happen if you merge an entire branch worth of changes (which could easily be hundreds of changes) back into your main codeline in a single commit.  As far as that goes, my recommendation is to avoid using “private branches” or “feature branches” and to commit to the mainline as frequently as possible, even if you have to leave your new code in a disabled state via #ifdef or whatnot.  “Big bang integration” has a bad reputation for a good reason.

Once you’ve found the change responsible, then what?  Unless the problem is extremely urgent and a fix cannot wait a second longer than it needs to, or unless it’s extremely obvious why the change is wrong from inspection alone, I do not recommend that you immediately back out the change.  Instead, this is when you fire up your debugger to figure out what is really going on.  The change may not have been the real culprit; it may have just perturbed the system a bit, causing a preexisting bug elsewhere to reveal itself in your test case.  All that binary searching gives you is a clue.  It does not give you a root cause.  You still need to find the root cause to apply a correct and complete fix.

One good way to make use of binary searching for a bug is to have your QA department handle it.  Since it doesn’t require understanding of the code itself, anyone who can build and run the software can do a binary search.  QA can provide the result of the binary search in the bug report, thereby offloading this task from development.

Cascade can help make binary searching more effective:

  • Rejecting commits that break builds and regression tests helps ensure that the search won’t break down in a region of changes where the software doesn’t build or run at all, increasing the probability of a successful binary search.
  • One of the slowest and most frustrating parts of a binary search is doing all of the builds.  Since Cascade builds your software on every single relevant commit and archives those builds off, you can skip this step entirely for any recent regression.
  • There’s no guesswork about which changes need to be searched and which ones can be skipped.  Cascade’s automatic dependency tracking knows exactly which changes affect your build.
  • Cascade helps enable continuous integration, allowing your engineers to commit code directly to the main codeline with less fear that low-quality commits such as build breaks will hurt everyone’s productivity.  Smaller, more frequent commits make the results of binary searches more useful.

Written by Matt

October 15th, 2008 at 11:21 am

Derived Files in Repositories


Traditionally, users of source control systems are told that they should only put source files, not derived files, in their repositories.  For example, you would check in your .c and .h files, but not .obj, .lib, .dll, or .exe files built from those .c and .h files.  Or, if you generate some of your source code using a Python script, you would check in the .py script file, but not the resulting .c file.

There are two reasons for this:

  • Checking in these derived files bloats the repository and slows down the repository server.
  • The derived files can become stale — that is, they can fall out of sync with the original source files.

The latter is the more troublesome of the two.  Sometimes someone will forget to update the derived files.  Sometimes someone might not even know that the derived file has been checked in in the first place.  Sometimes a derived file’s dependencies are not obvious; for example, you might not realize that a module pulls in a particular header file through some complex chain of nested #include’s.  Perhaps the ugliest case is that you might discover that someone else has broken the build for the derived file — either it no longer builds at all, or it appears to build but produces a file that no longer works — thereby preventing you from updating it, even though your changes clearly require it to be updated.

Many ugly, hard-to-track-down problems can happen when derived files become stale — much the same as if you don’t rebuild a particular .c file when a header file changes.  If you’ve ever tracked down such a problem, you probably know how frustrating it can be.  The program’s behavior can seem totally illogical until you finally realize (for example) that two .c files are using a different layout for the same structure.

Another difficult problem is merging.  Merging derived files is incorrect.  (Or, in mathematical terms, it is not always the case that f(a+b) = f(a) + f(b).)  The derived file generated from the merge of the changes to the inputs is not always the same as the merge of the derived files generated by each changed input by itself.  This is obvious for binary files like libraries, but it’s all too easy to fall into the trap of attempting to merge a derived text file instead of regenerating it from the latest sources.
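
A toy example makes the point concrete.  Suppose a purely hypothetical generator derives its output from a list of items simply by counting them:

# Toy illustration of why merging derived files is wrong.
def generate(items):
    return str(len(items))       # the "derived file" is just the item count

base     = ['a', 'b']            # generate(base)     == '2'
branch_1 = ['a', 'b', 'c']       # generate(branch_1) == '3'
branch_2 = ['a', 'b', 'd']       # generate(branch_2) == '3'

merged_sources = ['a', 'b', 'c', 'd']
assert generate(merged_sources) == '4'
# A textual 3-way merge of the two derived files (both sides changed '2' to
# '3') quietly produces '3', a stale result.  Only regenerating is correct.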

This can also be a problem when committing your changes: if someone else commits a change that affects the same derived file, you need to grab their changes and rebuild the derived file again.  The source control system won’t necessarily warn you about this, so it’s easy to check in a bad derived file by accident.

Yet, there are also reasons why this traditional advice to not check in derived files can be impractical.  Let’s leave out obvious examples such as cutting official releases, where you probably want to permanently archive the binaries you released to customers (you might archive them somewhere other than your source control system, but there’s nothing fundamentally wrong with using your source control system to archive releases).  Instead, let’s focus on the development process itself.

A large and complex software system’s build time can grow very long.  You may only be working on a small piece of the project, but you have to build the entire thing, possibly including components that you may know little to nothing about.  (Hopefully all of these components at least have the same build system, so you can type a single command like “make” from the top of the tree to build.  In practice, this is not always the case; I’ve worked on projects where each module had its own incompatible build system.)

This creates a dilemma: either each individual engineer has to build the entire project, duplicating builds done by many other engineers, or we can check some or all of the component build results into source control, allowing an engineer to skip builds for components not affected by their changes.  Either way, we’re wasting people’s time.  The former makes for slow builds; the latter increases the pain of doing a commit.

Ultimately, both solutions are problematic.  For large software projects, we need a better solution that offers the best of both worlds.

Cascade offers the possibility of a better solution.  Using Cascade:

  • You can easily grab pre-generated derived files, without the need to store those derived files in the repository.
  • To keep the required disk space bounded, the derived files aren’t kept around forever.  You can purge old derived files.
  • The derived files are always kept up-to-date, precisely in sync with the source files they are generated from.  Cascade’s automatic dependency tracking knows exactly when they need to be rebuilt.  You don’t need to update them yourself as part of your commit.
  • If the build of a derived file breaks, you’ll know about it right away, either from the web page or from the email Cascade sends out.
  • There are no worries about merging derived files.  Cascade will always re-generate the derived file correctly regardless of what changes are made to the source files and in what order.

Written by Matt

October 14th, 2008 at 2:45 pm

How to Get Dependencies from /showIncludes


Our GNU make whitepaper mentions that, if you’re using Visual C with GNU make, you can use the /showIncludes compiler option to help generate .d files, much like you can with the -MD option to gcc.  I thought I’d post a Python code snippet to illustrate.

One thing that’s important to remember is that, as the Visual C documentation points out, /showIncludes prints its results to stderr and not stdout.  This means that you cannot simply redirect stdout to a file and then parse it.  You generally don’t want to redirect stderr to a file, because this will hide other error messages.  Also, there’s no sense in creating a bunch of temporary files that you’ll just have to delete later.

Fortunately, the Python subprocess module gives us extensive control over redirection of stdin, stdout, and stderr, so we can simply write a wrapper script around the compiler that takes care of everything.  We’ll create a script named cl.py that wraps cl.exe.  It will automatically add the /showIncludes option and merge the stdout and stderr from cl.exe into a single pipe that we can parse for /showIncludes information.

cl.py’s usage is very simple.  Normally, you might run the command cl /c foo.c to create foo.obj.  Now you run python cl.py /c foo.c to create both foo.obj and foo.d.

Without further ado, here’s the script. Enjoy!

# cl.exe wrapper to create .d file from /showIncludes
# Python 2.5 or later required
from __future__ import with_statement
import sys, os, subprocess

# Determine path for .obj and .d files
# This is probably not 100% robust; you may need to tweak it depending on how
# you are invoking cl.exe.
cmdline = sys.argv[1:]
source = None
target = './' # default is current directory
for arg in cmdline:
    if arg.startswith('/') or arg.startswith('-'):
        # A compiler flag
        if arg[1:3] == 'Fo':
            target = arg[3:].replace('\\', '/')
    else:
        # This must be the source file name (assume there is only one)
        source = arg
if target.endswith('/'):
    # Default object file name is source file with extension changed to .obj
    target = target + source[0:source.rfind('.')] + '.obj'
if target.startswith('./'):
    target = target[2:] # chop off unnecessary ./ prefix

# Name of .d file is name of object file with extension changed to .d
depend_file = target[0:target.rfind('.')] + '.d'

# Build cl.exe command line with /showIncludes
# Assumption: cl.exe is in the user's PATH
cmdline = ['cl.exe', '/showIncludes'] + cmdline

# Launch the cl.exe process with stdout/stderr merged into a single pipe
p = subprocess.Popen(cmdline, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)

# Parse the cl.exe output to build up a set of dependencies
deps = set()
for line in p.stdout:
    if line.startswith('Note: including file:'):
        dep = line[21:].strip()
        dep = os.path.normpath(dep)
        dep = os.path.normcase(dep)
        dep = dep.replace('\\', '/') # use forward slashes for path separators
        dep = dep.replace(' ', '\ ') # escape spaces in paths with a backslash
        deps.add(dep)
    else:
        sys.stdout.write(line)

# Wait for cl.exe to exit, then return an error if it failed
ret = p.wait()
if ret != 0:
    sys.exit(1)

# Write the .d file, one dependency per line
with open(depend_file, 'wt') as f:
    for dep in sorted(deps):
        print >>f, '%s: %s' % (target, dep)
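
To make the output format concrete: for a compile like python cl.py /c foo.c, cl.exe emits one line per header it pulls in, for example (the path here is hypothetical):

Note: including file: C:\Program Files\Microsoft Visual Studio 9.0\VC\include\stdio.h

and the script turns each such line into a dependency line in foo.d, lowercased, with forward slashes, and with spaces escaped so GNU make can parse it:

foo.obj: c:/program\ files/microsoft\ visual\ studio\ 9.0/vc/include/stdio.h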

One warning: spawning an extra Python process on each source file can hurt performance.  Ideally, something along the lines of the above would be built directly into GNU make, but I digress.


Written by Matt

October 9th, 2008 at 2:29 pm

Cascade Flash Demo Updated


After the recent release of Cascade 1.0.0, our old Flash demo, which showed an earlier version of Cascade, was starting to look a little stale, so it has been replaced with a new demo based on the latest release.  In addition to using a newer version of Cascade and featuring newer, prettier-looking Cascade Manager web pages, the demo is quite a bit shorter now: 15 minutes rather than 25.  The audio quality has also been improved.

Enjoy!


Written by Matt

October 8th, 2008 at 6:59 pm


Cascade 1.0.0.769 released!


Cascade 1.0.0.769 has been released!  This release includes a number of major enhancements:

  • The addition of a user manual.  In addition to being linked from our web site, you can access the manual via the Start Menu, via any right-click context menu, or via a button on the nav bar on any Cascade Manager page.
  • Greatly simplified installation of Cascade on Linux and Macintosh.  Most tasks are now performed by an installer script.
  • Further usability enhancements to Cascade Manager, including color-coding in tables (e.g. failed tasks are highlighted in red) and a new Revisions page with improved status reporting that replaces the old Home and Grid pages.
  • Checkpoints are now simply revisions with a dot, instead of a long unique ID: for example, revision 10.1 is the first checkpoint created so far relative to revision 10.
  • Other minor bug fixes and enhancements.

Try it now!


Written by Matt

October 7th, 2008 at 3:23 pm

Where Does All That Disk Space Go?


Here’s an interesting blog post about why a Windows OS install can be so big.  One reason has to do with the WinSxS folder, which stores various versions of important system DLLs.  Why more than one version?  Well, each time a new patch, service pack, etc. is released, a DLL might be updated, and the new DLL is saved here in addition to the old one.

As the post explains: “your next question is probably to ask why we don’t remove the older versions of the components. The short answer to that is reliability. The component store, along with other information on the system, allows us to determine at any given time what the best version of a component to project is. That means that if you uninstall a security update we can install the next highest version on the system – we no longer have an “out of order uninstall” problem. It also means that if you decide to install an optional feature, we don’t just choose the RTM version of the component, we’ll look to see what the highest available version on the system is. As each component on the system changes state that may in turn trigger changes in other components, and because the relationships between all the components are described on the system we can respond to those requirements in ways that we couldn’t in previous OS versions.”

It’s downright tricky to design a system where you can install and uninstall both various components and various patches to those components in arbitrary orders, and then expect everything to work.

When you install your OS, you fill your computer’s hard drive with all sorts of stuff.  Realistically, you may never use a lot of that stuff.  As you install and uninstall more applications and updates, stuff tends to accumulate, a lot of it rarely used.  This isn’t a Microsoft/Windows thing.  This can happen on any OS.

Well, disks are big and cheap, so who cares?  I would have agreed a few years back, but there’s an important game-changing technology that makes me care again about conserving disk space: flash.  Solid state drives can be many times faster than traditional hard drives, but they also cost a lot more per gigabyte and aren’t available in extremely large sizes.  OS virtualization is another reason I might care: historically I had just one OS install on my hard drive, maybe 2 if I dual booted.  Now it’s common for me to have any number of virtual machines floating around, each one of which is its own independent OS install.

Cascade suggests a better way to solve this problem — a new model for software deployment for those of us whose computers are always connected to the Internet.  Instead of installing a copy of each component on the local disk, the file system should cache components on the local disk as needed.  Using CFS terminology, the installer would just set up a CFS tree/mount point pointing to a public (available on the Internet) Subversion repository containing the released software binaries.  As you used the software, CFS would automatically download the files you actually touched and cache them locally.  Of course, the cached files would stay around after rebooting your system.

The initial installer you would download would be tiny, since it wouldn’t contain any of the files, just a pointer to their location.  You’d never have to worry about whether you should install all of the features or just a subset to save disk space; all of the features would be available on demand at no cost in disk space.

In corporate environments, the downloads would go through Cascade Proxy, so you wouldn’t be downloading the same files over and over again.

To update the software to pick up a patch, you would simply point CFS at a newer revision of that repository.  To “uninstall” the patch, you could always roll back to an old revision.

Most importantly, your CFS cache can easily fit on an SSD, so you could get the performance benefits of an SSD without worrying about running out of disk space as you install more applications.  For virtual machines, you could use a small CFS cache; cache misses can be serviced quickly out of a larger Cascade Proxy cache running on your host OS.


Written by Matt

September 30th, 2008 at 12:14 pm

Support for More Repository Types


Currently, as of version 0.2.1, Cascade supports Perforce, Subversion, and Alienbrain repositories.  If you’re using some other software to manage your repository, unfortunately, we don’t have a way for you to use Cascade just yet.  However, we’ve been careful to design Cascade so that we can easily add support for more repository types in the future, so you’re not completely out of luck (please don’t hesitate to contact us if you’re interested).

Let’s step back and look at the architecture of Cascade File System and Cascade Proxy a little bit.  Both are services that run in the background on a PC.  Both receive requests for data from some external source — CFS gets them from the file system layer in the kernel, Cascade Proxy gets them from a network connection.  Both use caching to satisfy those requests more efficiently: once a file is downloaded for the first time, it’s stored in the cache, so we don’t have to download it again on a second request for the same file.  In fact, CFS and Cascade Proxy both share the same “cache manager” implementation, although they make use of it in a somewhat different way.

The Cascade cache manager thinks of files in terms of their URL — for instance, svn-http://svn.collab.net/repos/svn.  This URL encodes all of the information necessary to find the file: what type of repository it lives in, the hostname of the server to connect to, and the path of that file on the server.

When a cache miss happens, the cache manager needs to download the file.  If it is configured with a proxy server, it will forward the request on to the proxy, without regard to what type of repository we are dealing with; the proxy server will take care of everything for us.  If we don’t have a proxy server, it will obtain the file directly from the repository.  To do this, it looks at that first part of the URL and passes the request to one of several “repository query” backends:

  • Requests for URLs beginning with “p4:” will be forwarded to the query_p4 library.
  • Requests for URLs beginning with “svn-http:” or “svn-https:” will be forwarded to the query_svn library. 
  • Requests for URLs beginning with “ab:” will be forwarded to the query_ab library.

Each of these query libraries implements a common API.  To add a new repository type, all we need to do is implement that API — map the standard set of queries the Cascade cache manager uses to the queries that we can make to the repository.  Then, we simply assign it a unique URL scheme as above and add it to a table, and we’re done!  It’s pretty straightforward.
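
Cascade itself is not written in Python, but the dispatch idea is easy to sketch.  Everything below is hypothetical and heavily simplified; the real query libraries implement a richer API than a single fetch() call:

# Toy sketch of scheme-based dispatch to repository query backends.
class QuerySvn(object):
    def fetch(self, url):
        raise NotImplementedError   # would talk to the Subversion server here

class QueryP4(object):
    def fetch(self, url):
        raise NotImplementedError   # would talk to the Perforce server here

QUERY_BACKENDS = {
    'p4':        QueryP4(),
    'svn-http':  QuerySvn(),
    'svn-https': QuerySvn(),
    # 'ab': QueryAb(),  # Alienbrain backend omitted for brevity
}

def fetch(url):
    # Supporting a new repository type means writing one more backend class
    # and registering its URL scheme in the table above.
    scheme = url.split(':', 1)[0]
    if scheme not in QUERY_BACKENDS:
        raise ValueError('unsupported repository type: ' + scheme)
    return QUERY_BACKENDS[scheme].fetch(url)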

What about third-party addons to support new repository types?  We’ve considered the possibility and may offer some way to do this in the future, but for now we believe providing a third-party API for this would probably cause more problems than it would solve.


Written by Matt

September 19th, 2008 at 2:34 pm


On Daily Builds


For many years it’s been a standard “best practice” of software development to have a “daily build” — or, if you prefer, a “nightly build” (example, with some impressive pictures).  But why daily?  Why not twice daily, or hourly?  If the build is broken, wouldn’t we want to know about it sooner rather than later?  The faster it gets fixed, the less the productivity cost to your team.

Granted, there are some software projects that take more than an hour to build, so a single computer couldn’t do an hourly build.  But those are the same huge projects where you can easily afford to buy more than one computer to do your builds.

And of course if an hour goes by and no new changes have been committed, there’s not much sense in doing another hourly build.  Which immediately leads to the insight: why not simply build on every single change that’s committed to the repository?

Sure, it may cost you a bit to buy a few computers to do those builds.  But let’s not forget how cheap computer power has gotten; if we’re talking about machines that can run headless and don’t need a monitor or video card, we can easily build an extremely powerful build farm workhorse computer (Core 2 Quad, 4GB RAM, 750GB hard drive, etc.) for under $500.

One problem you may run into is that many changes to your repository don’t affect your builds, or only affect a subset of your builds.  There’s no sense in rebuilding when someone changes a documentation file that doesn’t ship externally.  If you built every component of every project on every change, the system wouldn’t scale very well as the volume of changes and the number of components grow over time.  You could write up a list of “rules” to tell the system when to build and when not to build, but this could be a lot of work, and your rules could have bugs in them.  You’d constantly have to keep the rules up to date.  Well, there’s a solution to this, too: a system like Cascade can, by hooking the file system, automatically determine which changes affect which builds.

Once you have a system like this set up, what do you gain?

  • Quick notification of build breaks.  There’s still some latency (if the build takes 5 minutes, you might not know about a break until up to 5 minutes after it’s committed), but if this is a problem, you can mitigate it by breaking up a large build into several smaller builds.
  • The system will tell you exactly which change broke the build, without any guessing on your part.  If a particular engineer is being careless and breaking the build frequently, you can remind them to test more carefully in the future.  Or, you can set up some kind of minor penalty for breaking the build (whoever breaks the build buys the team donuts, or something like that).
  • No need to explicitly “cut a build” for QA or release at a particular change.  Since the system builds every change, chances are it’s already been built!
  • Ability to binary search for regressions down to a single change.  If the system builds every single change and saves off the results somewhere, you can determine which change introduced a bug by doing a binary search using the resulting binaries.  (For example, if a bug was introduced yesterday, and there were 30 changes on that day, you could find the responsible change by running just 5 tests.)  Since this doesn’t require any debugging or understanding of the code itself, it’s a good way to offload work from your developers to your QA team and speed up bug triage — when a bug is filed, QA can tell you if it’s a regression, and if so, which change caused it.

It’s time to update this best practice.  A daily build is good, but a build on every committed change is better.


Written by Matt

September 18th, 2008 at 3:31 pm