The Conifer Systems Blog

Derived Files in Repositories


Traditionally, users of source control systems are told that they should only put source files, not derived files, in their repositories.  For example, you would check in your .c and .h files, but not .obj, .lib, .dll, or .exe files built from those .c and .h files.  Or, if you generate some of your source code using a Python script, you would check in the .py script file, but not the resulting .c file.

There are two reasons for this:

  • Checking in these derived files bloats the repository and slows down the repository server.
  • The derived files can become stale — that is, they can fall out of sync with the original source files.

The latter is the more troublesome of the two.  Sometimes someone will forget to update the derived files.  Sometimes someone might not even know that the derived file was checked in to begin with.  Sometimes a derived file’s dependencies are not obvious; for example, you might not realize that a module pulls in a particular header file through some complex chain of nested #includes.  Perhaps the ugliest case is discovering that someone else has broken the build for the derived file (either it no longer builds at all, or it appears to build but produces a file that no longer works), which prevents you from updating it even though your changes clearly require an update.

Many ugly, hard-to-track-down problems can happen when derived files become stale — much the same as if you don’t rebuild a particular .c file when a header file changes.  If you’ve ever tracked down such a problem, you probably know how frustrating it can be.  The program’s behavior can seem totally illogical until you finally realize (for example) that two .c files are using a different layout for the same structure.
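To make the structure-layout case concrete, here is a minimal sketch in Python.  It uses the standard struct module to stand in for two C modules, one rebuilt after a header change and one left stale.  The field names and layouts are made up for illustration.

```python
import struct

# Two views of the "same" structure.  The header was changed to add an
# id field in front, but only one module was rebuilt afterwards.
OLD_POINT = struct.Struct("<ii")    # stale module still believes: x, y
NEW_POINT = struct.Struct("<iii")   # rebuilt module uses: id, x, y


def writer_rebuilt(point_id, x, y):
    """Stands in for the module that was rebuilt against the new header."""
    return NEW_POINT.pack(point_id, x, y)


def reader_stale(buffer):
    """Stands in for the module that was not rebuilt."""
    x, y = OLD_POINT.unpack_from(buffer)
    return x, y


buffer = writer_rebuilt(7, 10, 20)
stale_view = reader_stale(buffer)
print(stale_view)  # (7, 10): the id is read as x, and x is read as y
```

Nothing crashes and nothing warns; the stale reader simply sees the wrong numbers, which is why this kind of bug can look so illogical from the outside.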

Another difficult problem is merging.  Merging derived files is, in general, incorrect.  In mathematical terms, if f is the build step and a and b are two changes to the inputs, it is not always the case that f(a+b) = f(a) + f(b): the derived file generated from the merged inputs is not always the same as the merge of the derived files generated from each changed input by itself.  This is obvious for binary files like libraries, but it’s all too easy to fall into the trap of attempting to merge a derived text file instead of regenerating it from the latest sources.
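Here is a small sketch of how f(a+b) and f(a) + f(b) can disagree for a derived text file.  The generator and the merge function are both hypothetical: generate() numbers a sorted list of names, and naive_merge() is a deliberately simplified line-based three-way merge, not the algorithm any particular source control system uses.

```python
def generate(names):
    """Hypothetical generator: one numbered #define per name, in sorted order."""
    return [f"#define {name.upper()} {i}" for i, name in enumerate(sorted(names))]


def naive_merge(base, ours, theirs):
    """Simplified three-way merge on lines: keep base lines that neither
    side removed, then add every line that either side introduced."""
    merged = [line for line in base if line in ours and line in theirs]
    for line in ours + theirs:
        if line not in base and line not in merged:
            merged.append(line)
    return sorted(merged)


base_src = ["apple", "cherry"]
ours_src = base_src + ["banana"]    # change a
theirs_src = base_src + ["date"]    # change b

# f(a) + f(b): merge the two derived files.
merged_derived = naive_merge(generate(base_src), generate(ours_src), generate(theirs_src))

# f(a+b): merge the sources, then regenerate.
regenerated = generate(sorted(set(ours_src) | set(theirs_src)))

print(merged_derived)  # ... '#define CHERRY 2', '#define DATE 2'
print(regenerated)     # ... '#define CHERRY 2', '#define DATE 3'
```

The merged derived file looks plausible and has no textual conflict, yet CHERRY and DATE now share the value 2.  Only regenerating from the merged sources gives the right answer.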

This can also be a problem when committing your changes: if someone else commits a change that affects the same derived file, you need to grab their changes and rebuild the derived file again.  The source control system won’t necessarily warn you about this, so it’s easy to check in a bad derived file by accident.

Yet, there are also reasons why this traditional advice to not check in derived files can be impractical.  Let’s leave out obvious examples such as cutting official releases, where you probably want to permanently archive the binaries you released to customers (you might archive them somewhere other than your source control system, but there’s nothing fundamentally wrong with using your source control system to archive releases).  Instead, let’s focus on the development process itself.

A large and complex software system’s build time can grow very long.  You may only be working on a small piece of the project, but you have to build the entire thing, possibly including components that you may know little to nothing about.  (Hopefully all of these components at least have the same build system, so you can type a single command like “make” from the top of the tree to build.  In practice, this is not always the case; I’ve worked on projects where each module had its own incompatible build system.)

This creates a dilemma: either each individual engineer builds the entire project, duplicating builds done by many other engineers, or we check some or all of the component build results into source control, allowing an engineer to skip builds for components not affected by their changes.  Either way, we’re wasting people’s time.  The former makes for slow builds; the latter increases the pain of doing a commit.

Ultimately, both solutions are problematic.  For large software projects, we need a better solution that offers the best of both worlds.

Cascade offers the possibility of a better solution.  Using Cascade:

  • You can easily grab pre-generated derived files, without the need to store those derived files in the repository.
  • To keep the required disk space bounded, the derived files aren’t kept around forever.  You can purge old derived files.
  • The derived files are always kept up-to-date, precisely in sync with the source files they are generated from.  Cascade’s automatic dependency tracking knows exactly when they need to be rebuilt.  You don’t need to update them yourself as part of your commit.
  • If the build of a derived file breaks, you’ll know about it right away, either from the web page or from the email Cascade sends out.
  • There are no worries about merging derived files.  Cascade will always re-generate the derived file correctly regardless of what changes are made to the source files and in what order.
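To illustrate the general idea behind the list above (this is not Cascade's actual implementation or API, just a minimal sketch under my own assumptions), imagine a store of derived files kept outside the repository, keyed by a fingerprint of the exact source contents, with old entries purged to bound disk space.  The class name, the fake_build function, and the purge policy are all hypothetical.

```python
import hashlib
from collections import OrderedDict


class DerivedFileCache:
    """Hypothetical bounded store mapping source fingerprints to build outputs."""

    def __init__(self, build, max_entries=2):
        self._build = build
        self._max_entries = max_entries
        self._entries = OrderedDict()
        self.builds_run = 0

    @staticmethod
    def _fingerprint(sources):
        # Hash every input name and its contents in a stable order, so any
        # change to any source file yields a different key.
        digest = hashlib.sha256()
        for name in sorted(sources):
            digest.update(name.encode() + b"\0" + sources[name].encode() + b"\0")
        return digest.hexdigest()

    def get(self, sources):
        key = self._fingerprint(sources)
        if key in self._entries:
            self._entries.move_to_end(key)    # mark as recently used
            return self._entries[key]
        output = self._build(sources)         # nobody has built this state yet
        self.builds_run += 1
        self._entries[key] = output
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # purge the oldest entry
        return output


def fake_build(sources):
    """Stand-in for a real compiler or generator."""
    return "\n".join(f"{name}: {len(text)} bytes" for name, text in sorted(sources.items()))


cache = DerivedFileCache(fake_build)
v1 = {"main.c": "int main(void) { return 0; }"}
v2 = {"main.c": "int main(void) { return 1; }"}
cache.get(v1)   # built
cache.get(v1)   # reused, sources unchanged
cache.get(v2)   # rebuilt, sources changed
print(cache.builds_run)  # 2
```

Because the key is derived from the sources themselves, a cached result can never be out of sync with them, and there is nothing to merge: any new combination of source changes just maps to a new key and a fresh build.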

Written by Matt

October 14th, 2008 at 2:45 pm
