Archive for October, 2008
Cascade 188.8.131.520 has been released! This release fixes a variety of bugs, some minor and some major, in Cascade 1.0.1. It also provides several important performance enhancements, especially to checkpoint and commit operations. A list of specific noteworthy fixes is provided in the release notes. Try it now!
One of the challenges in implementing a cross-platform file system driver such as Cascade File System is dealing with the many differences, small and large, between how Windows, Linux, and Macintosh file systems work. Some of these differences are well-known and obvious, but there are a lot of other interesting differences underneath the covers, especially when you get down into the file system driver kernel interfaces.
Let’s start with the most obvious one: case sensitivity. Linux has a case sensitive file system. Windows has a case-preserving but case-insensitive file system. Or, at least it looks like Windows does! But in reality Windows supports both. Check out the documentation for the NtCreateFile API, the native NT API that the Win32 API CreateFile maps to. By setting or not setting OBJ_CASE_INSENSITIVE, you can select which type of name lookup you prefer. It’s really up to the individual file system to decide how to interpret all these flags, though. Some Windows file systems, like the original FAT, aren’t even case-preserving.
The Macintosh is now Unix-based as of OSX, but its HFS file system has traditionally been case-insensitive and case-preserving, just like Windows. More recently, Apple now allows HFS to be formatted either way, as either case-sensitive or case-insensitive, but the default remains case-insensitive.
The issue of case sensitivity brings up another issue: internationalization. Windows, Linux, and Macintosh all support Unicode in paths; Windows encodes them as UTF-16 in all of its native APIs, whereas Linux and Macintosh use UTF-8. A problem: it’s possible for two non-identical Unicode strings to correspond to the same sequence of characters. That is, certain characters can be legally encoded in more than one way. Macintosh therefore requires all UTF-8 filenames to be stored in a canonicalized format and therefore will prevent you from creating two files in the same directory with the same name but different character encodings. Windows and Linux do not; this can cause interoperability problems moving data back and forth between the two.
There are several challenges in doing case-insensitive string comparisons in a Unicode-capable file system. NTFS on Windows adopts the following approach: two strings are compared by converting them both to uppercase first, then comparing them for exact equality. The conversion to uppercase is done using a 64K-entry, 128KB table stored on the volume and filled in when the partition is formatted; this ensures that the comparisons do not break (which could cause two files’ names to start colliding) when new characters are added to Unicode and someone upgrades their OS.
Windows uses backslashes as path separators, while Linux and Macintosh use forward slashes. Most of the Win32 APIs allow you to specify forward slashes and will do the conversion for you, but once you get into the NT kernel and the other low-level APIs, backslashes are mandatory.
This in turn means that the set of legal filenames differs between the operating systems. On Linux, for example, you can create a file whose name contains a backslash, while on Windows you cannot. Linux is very permissive about the legal character set, but Windows has a lot of extra restrictions. A filename cannot end with a space or a period; there are a number of reserved names like COM1 and NUL; and several other non-path-separator characters like <, >, :, “, and |, are reserved.
Windows has drive letters, Linux and Macintosh have mount points. Actually, inside the NT kernel and inside kernel drivers, there is really no such thing as a drive letter. A “drive letter” is nothing other than a symbolic link in the NT kernel namespace, e.g., from \DosDevices\X: to \Device\cfs. When you call CreateFile with a path x:\foo.txt, the driver owning the \Device\cfs namespace simply sees a request for \foo.txt. But for practical purposes, this is still important. A Windows path needs to be interpreted differently depending on whether it’s a drive letter path or a UNC path. A Windows file system can be ripped away from applications with files open by removing the symbolic link, whereas a Unix file system cannot be unmounted if files are still open.
The Windows cache manager holds files open. When you close the last handle to a file, from the file system driver’s point of view, the file may still be open. This makes it very difficult to unload a Windows file system driver without a reboot. Unmounting it, i.e., removing the drive letter symbolic link, is easy, but until memory pressure forces the Windows cache manager to flush its cached mappings, those cached mappings may stay around indefinitely.
Permissions are very different. Linux has the standard Unix “UID, GID, mode bits” permissions model, and Macintosh inherits this also. Both have added ACL-based permissions later, but their use is often not considered very mainstream. Windows, on the other hand, is thoroughly ACL-based. Every file in a Windows file system has a security descriptor that includes the ACL. The permissions are far more elaborate than just “read, write, execute”; there are over a dozen types of permissions that you can be granted or denied.
Other file attributes are also different. Windows has a standard list of file attribute bits like “archive”, “hidden”, and “system” that go back to the DOS era. There is no equivalent to these on Unix. All of the systems support a more generic “extended attribute” system, however.
Linux doesn’t have multiple data streams per file. One of the defining properties of Unix, going back to its very beginnings, is that a file is just an array of bytes. Windows, however, allows a file to have multiple “data streams”, while Macintosh supports a similar “resource fork” feature. Apple now discourages the use of resource forks, but multiple data streams continue to be an important feature on Windows in some cases. For example, Internet Explorer attaches an alternate data stream to each file you download to indicate where you downloaded it from. When you later try to run an app that was downloaded from an untrusted zone, you will get a warning asking you whether you really want to do that.
Windows has limited symbolic link support. Windows has “reparse points”, which are like symbolic links for directories only, with some other caveats; but they are supported poorly by many applications. Vista adds something closer to real Unix symbolic links, though again with some limitations.
NtCreateFile() on Windows throws in the kitchen sink. This API has a lot of flexibility that doesn’t exist in the Unix open() system call. For better or worse, just about everything goes through it. For example, there is no equivalent to mkdir() on Windows. Instead, NtCreateFile takes a flag to request that you want to create a directory rather than a file in the event that the path lookup fails. It also supports a number of other random features, like delete-on-close files.
The Windows delete and rename model is different. You wouldn’t know this from the Win32 APIs, but in order to delete or rename a file in Windows, you first have to open it! Once you’ve opened it can you call NtSetInformationFile with InformationClass of FileDispositionInformation or FileRenameInformation. Setting FileDispositionInformation doesn’t even delete the file; it merely enables delete-on-close for the file, and the delete-on-close request could very well be cancelled later.
File sharing restrictions and locking are different. Unix generally avoids the idea of restricting what can be done with a file just because someone else is using it. Having a file open doesn’t prevent it from being unlinked, and two people can open the same file for writing. On Windows, all of this is true in theory — you can request whatever sharing mode you want when you open a file — but in practice, most applications use restrictive sharing modes, preventing two apps from using the same file at the same time. Inside a single file, we also have byte range locking. Windows uses mandatory locking: if someone else has the bytes locked, an attempt to modify those bytes with WriteFile() will fail (but this is not enforced for memory-mapped files!). Unix uses only advisory locking and makes no effort to error-check read() or write() calls; it assumes that the application will be responsible and won’t touch data it hasn’t first locked.
This list of differences could go on and on. It’s a challenge to make sure that CFS supports all of the important file system semantics correctly across the platforms, especially because the revision control systems CFS builds on often have different semantics of their own that don’t quite match the standard file systems.
I’ve written earlier about machine-independent builds, but let’s talk about a related issue: build determinism. A build or a build step is deterministic if I can run it multiple times and get the same results (objects, binaries, etc.) every time. That is, the same inputs always result in the same outputs. The assumption of determinism is one of the fundamental reasons why we traditionally don’t check in derived files: we know that, given the sources, we can reconstruct them.
Unfortunately, many builds are not deterministic. Often this is merely annoying, but it can cause some practical problems also.
Why might a build be nondeterministic? The most common reason, I’ve found, is that a tool embeds a timestamp in its output. For example, the Microsoft tools all do this: every PE binary (.dll, .exe, etc.) has a timestamp field indicating when it was built. Further, there is no (documented?) way to tell the linker not to do this!
Since the embedded timestamp doesn’t affect the program’s runtime behavior, why do we care? Here are some reasons:
- If the binary timestamp is compared against the debug info timestamp (Visual Studio does this), the debug info won’t be loaded when they mismatch, even though it may well be accurate debug info built from the same tree at a different point in time. (Do you save off your .pdb files from all your official releases?)
- We can no longer check two .dll’s or .exe’s for exact equality via a standard diff program or MD5/SHA1 hash. We have to know which bytes to ignore in the comparison.
- We can’t uniquely identify a build created by an individual developer by its MD5/SHA1 hash; each developer’s builds will have a different hash. It would be nice if we could identify what software version someone is running simply with a “sha1sum <path>/*” command whose output was fed into a database.
- If you change a comment or some other part of a program that doesn’t have any impact on the compiled code, you may get unnecessary rebuilds of downstream steps. Some smarter “make” replacements will look at the file’s hash rather than its last-modified timestamp. Cascade will do the same if this file is an input file of another task. Do you really want your build system to rebuild your .msi installer after you change a comment in a .c file?
- Cascade implements single-instance storage of output files. That is, if two tasks produce an identical output file, Cascade will only store copy of the output file. This can save a lot of disk space in some cases. Any difference in an output file, however trivial, will defeat this single-instance storage.
Another way you can end up with an embedded timestamp is to use __TIME__ or __DATE__, or to write a script that embeds it in generated source code, although these are unlikely to happen by accident.
Yet another is digital signatures for code signing. Certificates expire, but you still have your private key even after it expired. Yet you can’t have your program stop working or stop being validated as authentic just because the certificate it was originally signed with has now expired. So certificate authorities provide a “timestamping” service where they will attach their own signature to your binary, attesting that the binary existed as of a particular timestamp (at which time the certificate was still valid).
Another major class of nondeterminism has to do with the absolute path to your source code. This is typically used to embed a path to your program’s debug info or source code, so that the debugger can automatically find it. Or, sometimes compiler command lines get embedded in binaries, and these command lines can tend to contain absolute paths to headers, libraries, etc. You probably don’t want this path information going into your official releases. If you are working in two separate trees or branches, or two developers have trees at different paths, you can’t copy binaries back and forth between them. It can also be annoying if you share your tree over the network so multiple computers can get to it. If your C: drive on one computer is mapped as another computer’s Z: drive, the embedded C: paths will be all wrong when a debugger or profiler running on the other computer tries to look up the code.
Aside from date-and-time-related and path-related nondeterminism, some other types of determinism to think about: (note that the line between “deterministic” and “machine-independent” is somewhat blurry)
- Does other environmental information enter into your build process? Some examples: your username, hostname, IP, processor type or speed, OS version, or environment variables.
- Do you do profile-guided optimization? If the profile data isn’t deterministic for whatever reason, the resulting optimized binaries won’t be deterministic, either.
- Does your build talk to any servers on the network? Once a build requires network access, you’ve greatly expanded the scope of what can go wrong.
Cascade 184.108.40.2064 has been released! This is a bug fix release that addresses a variety of issues, mostly minor, with Cascade 1.0.0. A list of specific noteworthy fixes is provided in the release notes. Try it now!
Normally, when debugging, I would recommend starting from first principles and trying to understand the problem’s root cause before doing anything else. Without a complete understanding of the bug, your chances of implementing a complete and correct fix are poor. The last thing you want to do is apply a random fix that “seems to work” without understanding why it works; at that point, you’re a member of the Voodoo Magic School of Software Engineering.
So, my first tool in debugging is always, if possible, an interactive debugger. Stepping through the code is the ideal way to understand when and how the program’s actual behavior starts to diverge from its expected behavior. If an interactive debugger is out of the question for some reason, there are always fallbacks like “printf debugging”, but I find this to be much less effective than a real debugger.
Every so often, though, you run into a bug that is truly mysterious, where you can’t seem to track down what is going wrong. It would be great to understand the full behavior of your system in every last detail, but large and complex software systems eventually start to display “emergent” behavior, where the system as a whole starts to display behavior that cannot easily be explained as the sum of the behaviors of the parts.
What then? One approach to shortcut the debugging process is to binary search for the change that introduced a bug. This requires no understanding of the system’s behavior at all. All you need to know is a revision number where the bug exists and an older revision number where the bug didn’t exist. If it worked at revision 100 and fails at revision 200, then you test at revision 150 to narrow that range by a factor of 2, and so on until you identify the exact change responsible — just like any other binary search.
The first obvious caveat is that this only works for regressions, not for bugs that have existed since the software was first written. Another is that you had better be able to accurately reconstruct old builds of your software; this can be a problem if you don’t have machine-independent builds. Even if you can reconstruct old builds, it can take a while to build them (syncing a tree back to an older revision has a way of flushing out makefile bugs that only syncing to newer revisions doesn’t reveal, so incremental builds are often not trustworthy).
Another tough problem is that sometimes there are parts of the range you want to search where the software is broken: either it doesn’t build, or it builds but there is another, more severe bug that prevents you from running your test case in the first place.
A final problem is that while you may be able to track down the problem to a single commit, some engineers have a bad habit of combining many independent logical changes (which could have each been a separate commit) into one large super-change. Continuous integration is your friend here: you should be committing code on a daily basis, not batching up weeks or months of changes into a single commit. Aside from individual engineers’ bad habits, this can also happen if you merge an entire branch worth of changes (which could easily be hundreds of changes) back into your main codeline in a single commit. As far as that goes, my recommendation is to avoid using “private branches” or “feature branches” and to commit to the mainline as frequently as possible, even if you have to leave your new code in a disabled state via #ifdef or whatnot. “Big bang integration” has a bad reputation for a good reason.
Once you’ve found the change responsible, then what? Unless the problem is extremely urgent and a fix cannot wait a second longer than it needs to, or unless it’s extremely obvious why the change is wrong from inspection alone, I do not recommend that you immediately back out the change. Instead, this is when you fire up your debugger to figure out what is really going on. The change may not have been the real culprit; it may have just perturbed the system a bit, causing a preexisting bug elsewhere to reveal itself in your test case. All that binary searching gives you is a clue. It does not give you a root cause. You still need to find the root cause to apply a correct and complete fix.
One good way to make use of binary searching for a bug is to have your QA department handle it. Since it doesn’t require understanding of the code itself, anyone who can build and run the software can do a binary search. QA can provide the result of the binary search in the bug report, thereby offloading this task from development.
Cascade can help make binary searching more effective:
- Rejecting commits that break builds and regression tests helps ensure that the search won’t break down in a region of changes where the software doesn’t build or run at all, increasing the probability of a successful binary search.
- One of the slowest and most frustrating parts of a binary search is doing all of the builds. Since Cascade builds your software on every single relevant commit and archives those builds off, you can skip this step entirely for any recent regression.
- There’s no guesswork about which changes need to be searched and which ones can be skipped. Cascade’s automatic dependency tracking knows exactly which changes affect your build.
- Cascade helps enable continuous integration, allowing your engineers to commit code directly to the main codeline with less fear that low-quality commits such as build breaks will hurt everyone’s productivity. Smaller, more frequent commits make the results of binary searches more useful.
Traditionally, users of source control systems are told that they should only put source files, not derived files, in their repositories. For example, you would check in your .c and .h files, but not .obj, .lib, .dll, or .exe files built from those .c and .h files. Or, if you generate some of your source code using a Python script, you would check in the .py script file, but not the resulting .c file.
There are two reasons for this:
- Checking in these derived files bloats the repository and slows down the repository server.
- The derived files can become stale — that is, they can fall out of sync with the original source files.
The latter is the more troublesome of the two. Sometimes someone will forget to update the derived files. Sometimes someone might not even know that the derived file has been checked in in the first place. Sometimes a derived file’s dependencies are not obvious; for example, you might not realize that a module pulls in a particular header file through some complex chain of nested #include’s. Perhaps the ugliest case is that you might discover that someone else has broken the build for the derived file — either it no longer builds at all, or it appears to build but produces a file that no longer works — thereby preventing you from updating it, even though your changes clearly require it to be updated.
Many ugly, hard-to-track-down problems can happen when derived files become stale — much the same as if you don’t rebuild a particular .c file when a header file changes. If you’ve ever tracked down such a problem, you probably know how frustrating it can be. The program’s behavior can seem totally illogical until you finally realize (for example) that two .c files are using a different layout for the same structure.
Another difficult problem is merging. Merging derived files is incorrect. (Or, in mathematical terms, it is not always the case that f(a+b) = f(a) + f(b).) The derived file generated from the merge of the changes to the inputs is not always the same as the merge of the derived files generated by each changed input by itself. This is obvious for binary files like libraries, but it’s all too easy to fall into the trap of attempting to merge a derived text file instead of regenerating it from the latest sources.
This can also be a problem when committing your changes: if someone else commits a change that affects the same derived file, you need to grab their changes and rebuild the derived file again. The source control system won’t necessarily warn you about this, so it’s easy to check in a bad derived file by accident.
Yet, there are also reasons why this traditional advice to not check in derived files can be impractical. Let’s leave out obvious examples such as cutting official releases, where you probably want to permanently archive the binaries you released to customers (you might archive them somewhere other than your source control system, but there’s nothing fundamentally wrong with using your source control system to archive releases). Instead, let’s focus on the development process itself.
A large and complex software system’s build time can grow very long. You may only be working on a small piece of the project, but you have to build the entire thing, possibly including components that you may know little to nothing about. (Hopefully all of these components at least have the same build system, so you can type a single command like “make” from the top of the tree to build. In practice, this is not always the case; I’ve worked on projects where each module had its own incompatible build system.)
This creates a dilemma: either each individual engineer has to build the entire project, duplicating builds done by many other engineers, or we can check some or all of the component build results into source control, allowing an engineer to skip builds for components not affected by their changes. Either way, we’re wasting people’s time. The former makes for slow builds; the latter increases the pain of doing a commit.
Ultimately, both solutions are problematic. For large software projects, we need a better solution that offers the best of both worlds.
Cascade offers the possibility of a better solution. Using Cascade:
- You can easily grab pre-generated derived files, without the need to store those derived files in the repository.
- To keep the required disk space bounded, the derived files aren’t kept around forever. You can purge old derived files.
- The derived files are always kept up-to-date, precisely in sync with the source files they are generated from. Cascade’s automatic dependency tracking knows exactly when they need to be rebuilt. You don’t need to update them yourself as part of your commit.
- If the build of a derived file breaks, you’ll know about it right away, either from the web page or from the email Cascade sends out.
- There are no worries about merging derived files. Cascade will always re-generate the derived file correctly regardless of what changes are made to the source files and in what order.
Our GNU make whitepaper mentions that, if you’re using Visual C with GNU make, you can use the
/showIncludes compiler option to help generate
.d files, much like you can with the
-MD option to gcc. I thought I’d post a Python code snippet to illustrate.
One thing that’s important to remember is that, as the Visual C documentation points out,
/showIncludes prints its results to
stderr and not
stdout. This means that you cannot simply redirect
stdout to a file and then parse it. You generally don’t want to redirect
stderr to a file, because this will hide other error messages. Also, there’s no sense in creating a bunch of temporary files that you’ll just have to delete later.
Fortunately, the Python
subprocess module gives us extensive control over redirection of
stderr, so we can simply write a wrapper script around the compiler that takes care of everything. We’ll create a script named
cl.py that wraps
cl.exe. It will automatically add the
/showIncludes option and merge the
cl.exe into a single pipe that we can parse for
cl.py‘s usage is very simple. Normally, you might run the command
cl /c foo.c to create
foo.obj. Now you run
python cl.py /c foo.c to create both
Without further ado, here’s the script. Enjoy!
# cl.exe wrapper to create .d file from /showIncludes # Python 2.5 or later required from __future__ import with_statement import sys, os, subprocess # Determine path for .obj and .d files # This is probably not 100% robust; you may need to tweak it depending on how # you are invoking cl.exe. cmdline = sys.argv[1:] source = None target = './' # default is current directory for arg in cmdline: if arg.startswith('/') or arg.startswith('-'): # A compiler flag if arg[1:3] == 'Fo': target = arg[3:].replace('\\', '/') else: # This must be the source file name (assume there is only one) source = arg if target.endswith('/'): # Default object file name is source file with extension changed to .obj target = target + source[0:source.rfind('.')] + '.obj' if target.startswith('./'): target = target[2:] # chop off unnecessary ./ prefix # Name of .d file is name of object file with extension changed to .d depend_file = target[0:target.rfind('.')] + '.d' # Build cl.exe command line with /showIncludes # Assumption: cl.exe is in the user's PATH cmdline = ['cl.exe', '/showIncludes'] + cmdline # Launch the cl.exe process with stdout/stderr merged into a single pipe p = subprocess.Popen(cmdline, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) # Parse the cl.exe output to build up a set of dependencies deps = set() for line in p.stdout: if line.startswith('Note: including file:'): dep = line[21:].strip() dep = os.path.normpath(dep) dep = os.path.normcase(dep) dep = dep.replace('\\', '/') # use forward slashes for path separators dep = dep.replace(' ', '\ ') # escape spaces in paths with a backslash deps.add(dep) else: sys.stdout.write(line) # Wait for cl.exe to exit, then return an error if it failed ret = p.wait() if ret != 0: sys.exit(1) # Write the .d file, one dependency per line with open(depend_file, 'wt') as f: for dep in sorted(deps): print >>f, '%s: %s' % (target, dep)
One warning: spawning an extra Python process on each source file can hurt performance. Ideally, something along the lines of the above would be built directly into GNU make, but I digress.
After the recent release of Cascade 1.0.0, our old Flash demo of an earlier version of Cascade was starting to look a little stale and has been cleaned up and replaced with a newer demo based on the latest release. In addition to using a newer version of Cascade and featuring newer, prettier-looking Cascade Manager web pages, the demo is quite a bit shorter now: 15 minutes rather than 25. The audio quality has also been cleaned up.
Cascade 220.127.116.119 has been released! This release includes a number of major enhancements:
- The addition of a user manual. In addition to being linked from our web site, you can access the manual via the Start Menu, via any right-click context menu, or via a button on the nav bar on any Cascade Manager page.
- Greatly simplified installation of Cascade on Linux and Macintosh. Most tasks are now performed by an installer script.
- Further usability enhancements to Cascade Manager, including color-coding in tables (e.g. failed tasks are highlighted in red) and a new Revisions page with improved status reporting that replaces the old Home and Grid pages.
- Checkpoints are now simply revisions with a dot, instead of a long unique ID: for example, revision 10.1 is the first checkpoint created so far relative to revision 10.
- Other minor bug fixes and enhancements.