Archive for September, 2008
Here’s an interesting blog post about why a Windows OS install can be so big. One reason has to do with the WinSxS folder, which stores various versions of important system DLLs. Why more than one version? Well, each time a new patch, service pack, etc. is released, a DLL might be updated, and the new DLL is saved here in addition to the old one.
your next question is probably to ask why we don’t remove the older versions of the components. The short answer to that is reliability. The component store, along with other information on the system, allows us to determine at any given time what the best version of a component to project is. That means that if you uninstall a security update we can install the next highest version on the system – we no longer have an “out of order uninstall” problem. It also means that if you decide to install an optional feature, we don’t just choose the RTM version of the component, we’ll look to see what the highest available version on the system is. As each component on the system changes state that may in turn trigger changes in other components, and because the relationships between all the components are described on the system we can respond to those requirements in ways that we couldn’t in previous OS versions.
It’s downright tricky to design a system where you can install and uninstall both various components and various patches to those components in arbitrary orders, and then expect everything to work.
When you install your OS, you fill your computer’s hard drive with all sorts of stuff. Realistically, you may never use a lot of that stuff. As you install and uninstall more applications and updates, stuff tends to accumulate, a lot of it rarely used. This isn’t a Microsoft/Windows thing. This can happen on any OS.
Well, disks are big and cheap, so who cares? I would have agreed a few years back, but there’s an important game-changing technology that makes me care again about conserving disk space: flash. Solid state drives can be many times faster than traditional hard drives, but they also cost a lot more per gigabyte and aren’t available in extremely large sizes. OS virtualization is another reason I might care: historically I had just one OS install on my hard drive, maybe 2 if I dual booted. Now it’s common for me to have any number of virtual machines floating around, each one of which is its own independent OS install.
Cascade suggests a better way to solve this problem — a new model for software deployment for those of us whose computers are always connected to the Internet. Instead of installing a copy of each component on the local disk, the file system should cache components on the local disk as needed. Using CFS terminology, the installer would just set up a CFS tree/mount point pointing to a public (available on the Internet) Subversion repository containing the released software binaries. As you used the software, CFS would automatically download the files you actually touched and cache them locally. Of course, the cached files would stay around after rebooting your system.
The initial installer you would download would be tiny, since it wouldn’t contain any of the files, just a pointer to their location. You’d never have to worry about whether you should install all of the features or just a subset to save disk space; all of the features would be available on demand at no cost in disk space.
In corporate environments, the downloads would go through Cascade Proxy, so you wouldn’t be downloading the same files over and over again.
To update the software to pick up a patch, you would simply point CFS at a newer revision of that repository. To “uninstall” the patch, you could always roll back to an old revision.
Most importantly, your CFS cache can easily fit on a SSD, so you could get the performance benefits of an SSD without worrying about running out of disk space as you install more applications. For virtual machines, you could use a small CFS cache; cache misses can be serviced quickly out of a larger Cascade Proxy cache running on your host OS.
Currently, as of version 0.2.1, Cascade supports Perforce, Subversion, and Alienbrain repositories. If you’re using some other software to manage your repository, unfortunately, we don’t have a way for you to use Cascade just yet. However, we’ve been careful to design Cascade so that we can easily add support for more repository types in the future, so you’re not completely out of luck (please don’t hesitate to contact us if you’re interested).
Let’s step back and look at the architecture of Cascade File System and Cascade Proxy a little bit. Both are services that run in the background on a PC. Both receive requests for data from some external source — CFS gets them from the file system layer in the kernel, Cascade Proxy gets them from a network connection. Both use caching to satisfy those requests more efficiently: once a file is downloaded for the first time, it’s stored in the cache, so we don’t have to download it again on a second request for the same file. In fact, CFS and Cascade Proxy both share the same “cache manager” implementation, although they make use of it in a somewhat different way.
The Cascade cache manager thinks of files in terms of their URL — for instance, svn-http://svn.collab.net/repos/svn. This URL encodes all of the information necessary to find the file: what type of repository it lives in, the hostname of the server to connect to, and the path of that file on the server.
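To make the three parts of such a URL concrete, here is a minimal sketch that splits the example above using Python's standard `urlsplit`. This is only an illustration: the function name `split_cache_url` is made up for this post, and Cascade's own URL parsing is not shown here and may work differently.

```python
from urllib.parse import urlsplit


def split_cache_url(url):
    """Split a cache URL into (repository type, server host, path on server).

    Illustrative only: uses the standard library parser, not Cascade's own code.
    """
    parts = urlsplit(url)
    return parts.scheme, parts.netloc, parts.path


print(split_cache_url("svn-http://svn.collab.net/repos/svn"))
# -> ('svn-http', 'svn.collab.net', '/repos/svn')
```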
When a cache miss happens, the cache manager needs to download the file. If it is configured with a proxy server, it will forward the request on to the proxy, without regard to what type of repository we are dealing with; the proxy server will take care of everything for us. If we don’t have a proxy server, it will obtain the file directly from the repository. To do this, it looks at that first part of the URL and passes the request to one of several “repository query” backends:
- Requests for URLs beginning with “p4:” will be forwarded to the query_p4 library.
- Requests for URLs beginning with “svn-http:” or “svn-https:” will be forwarded to the query_svn library.
- Requests for URLs beginning with “ab:” will be forwarded to the query_ab library.
Each of these query libraries implements a common API. To add a new repository type, all we need to do is implement that API: map the standard set of queries the Cascade cache manager uses to the queries that we can make to the repository. Then, we simply assign it a unique URL scheme as above and add it to a table, and we’re done! It’s pretty straightforward.
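The miss-handling logic described above can be sketched roughly as follows. This is a simplified illustration, not the actual Cascade code: the backend functions are stand-ins that share the names of the real libraries but return dummy bytes, the "common API" is reduced to a single fetch call, and the `CacheManager` class, its in-memory dictionary, and the `proxy` callable are all assumptions made for the example.

```python
from urllib.parse import urlsplit


# Stand-ins for the real query libraries; each one just returns dummy bytes.
def query_p4(url):
    return b"p4:" + url.encode()


def query_svn(url):
    return b"svn:" + url.encode()


def query_ab(url):
    return b"ab:" + url.encode()


# The table mapping URL schemes to repository query backends.
BACKENDS = {
    "p4": query_p4,
    "svn-http": query_svn,
    "svn-https": query_svn,
    "ab": query_ab,
}


class CacheManager:
    """Toy cache manager: an in-memory cache in front of a proxy or a backend."""

    def __init__(self, proxy=None):
        self.proxy = proxy      # hypothetical callable taking a URL, or None
        self.cache = {}         # URL -> file contents
        self.downloads = 0      # number of cache misses serviced

    def get(self, url):
        if url in self.cache:   # cache hit: no download needed
            return self.cache[url]
        self.downloads += 1
        if self.proxy is not None:
            # With a proxy configured, forward the request regardless of type.
            data = self.proxy(url)
        else:
            # Otherwise pick a backend based on the first part of the URL.
            scheme = urlsplit(url).scheme
            if scheme not in BACKENDS:
                raise ValueError(f"unsupported repository type: {scheme!r}")
            data = BACKENDS[scheme](url)
        self.cache[url] = data
        return data


manager = CacheManager()
manager.get("svn-http://svn.collab.net/repos/svn")
manager.get("svn-http://svn.collab.net/repos/svn")  # served from the cache
print(manager.downloads)  # -> 1
```

In this sketch, supporting a new repository type amounts to writing one more function and adding one more entry to `BACKENDS`.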
What about third-party addons to support new repository types? We’ve considered the possibility and may offer some way to do this in the future, but for now we believe providing a third-party API for this would probably cause more problems than it would solve.
For many years it’s been a standard “best practice” of software development to have a “daily build” — or, if you prefer, a “nightly build” (example, with some impressive pictures). But why daily? Why not twice daily, or hourly? If the build is broken, wouldn’t we want to know about it sooner rather than later? The faster it gets fixed, the less the productivity cost to your team.
Granted, there are some software projects that take more than an hour to build, so a single computer couldn’t do an hourly build. But those are the same huge projects where you can easily afford to buy more than one computer to do your builds.
And of course if an hour goes by and no new changes have been committed, there’s not much sense in doing another hourly build. Which immediately leads to the insight: why not simply build on every single change that’s committed to the repository?
Sure, it may cost you a bit to buy a few computers to do those builds. But let’s not forget how cheap computer power has gotten; if we’re talking about machines that can run headless and don’t need a monitor or video card, we can easily build an extremely powerful build farm workhorse computer (Core 2 Quad, 4GB RAM, 750GB hard drive, etc.) for under $500.
One problem you may run into is that many changes to your repository don’t affect your builds, or only affect a subset of your builds. No sense in rebuilding when someone changes a documentation file that doesn’t ship externally. If you build every component of every project on every change, the system wouldn’t scale very well as the volume of changes and number of components increases over time. You could write up a list of “rules” to tell the system when to build and when not to build, but this could be a lot of work, and your rules could have bugs in them. You’d constantly have to keep the rules up to date. Well, there’s a solution to this also: a system like Cascade can, by hooking the file system, automatically determine which changes affect which builds.
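As a rough sketch of the idea, suppose the file system layer has recorded which files each build read the last time it ran. Deciding which builds a change affects is then a set intersection. The build names, file paths, and the `affected_builds` function are hypothetical and only illustrate the principle. A real system also has to handle cases this sketch ignores, such as a newly added file that a build would start reading.

```python
def affected_builds(changed_files, files_read_by_build):
    """Return the sorted names of builds that read at least one changed file."""
    changed = set(changed_files)
    return sorted(
        name
        for name, files_read in files_read_by_build.items()
        if changed & set(files_read)
    )


# Hypothetical record of the files each build touched on its last run.
files_read_by_build = {
    "app": {"src/main.c", "src/util.c", "include/util.h"},
    "tools": {"tools/gen.c", "include/util.h"},
}

print(affected_builds(["docs/internal.txt"], files_read_by_build))  # -> []
print(affected_builds(["tools/gen.c"], files_read_by_build))        # -> ['tools']
print(affected_builds(["include/util.h"], files_read_by_build))     # -> ['app', 'tools']
```

No hand-written rules are involved: the documentation change triggers nothing simply because no build ever read that file.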
Once you have a system like this set up, what do you gain?
- Quick notification of build breaks. There’s still some latency (if the build takes 5 minutes, you might not know about a break until up to 5 minutes after it’s committed), but if this is a problem, you can mitigate it by breaking up a large build into several smaller builds.
- The system will tell you exactly which change broke the build, without any guessing on your part. If a particular engineer is being careless and breaking the build frequently, you can remind them to test more carefully in the future. Or, you can set up some kind of minor penalty for breaking the build (whoever breaks the build buys the team donuts, or something like that).
- No need to explicitly “cut a build” for QA or release at a particular change. Since the system builds every change, chances are it’s already been built!
- Ability to binary search for regressions down to a single change. If the system builds every single change and saves off the results somewhere, you can determine which change introduced a bug by doing a binary search using the resulting binaries. (For example, if a bug was introduced yesterday, and there were 30 changes on that day, you could find the responsible change by running just 5 tests.) Since this doesn’t require any debugging or understanding of the code itself, it’s a good way to offload work from your developers to your QA team and speed up bug triage — when a bug is filed, QA can tell you if it’s a regression, and if so, which change caused it.
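The binary search in the last point can be sketched as below. It assumes that the build just before the first change is known to be good, that the last change is known to be bad, and that once the bug appears it stays in every later build. The `is_bad` callback is a placeholder for "QA runs the saved binaries from this change and checks for the bug".

```python
def find_first_bad(num_changes, is_bad):
    """Return (index of the first bad change, number of tests run).

    Assumes changes are numbered 0..num_changes-1, the last one is bad,
    and every change after the first bad one is also bad.
    """
    lo, hi = 0, num_changes - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(mid):
            hi = mid        # the bug is already present at mid
        else:
            lo = mid + 1    # the bug was introduced after mid
    return lo, tests


# 30 changes were committed yesterday; pretend change 17 introduced the bug.
print(find_first_bad(30, lambda change: change >= 17))  # -> (17, 5)
```

With 30 candidate changes the loop never needs more than 5 tests, which matches the figure above.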
It’s time to update this best practice. A daily build is good, but a build on every committed change is better.
Cascade 0.2.1.647 has been released! This release, aside from fixing various minor bugs, includes a redesign of the Cascade Manager web pages. The screenshots on our web pages have been updated accordingly. (The Flash demo hasn’t been updated yet.)
A common worry with Cascade File System is that building through a file system layer like CFS will be much slower than building off a regular local disk using a file system like NTFS. After all, building off a network file system like SMB/CIFS or NFS is typically much slower than building off a local disk.
Rather than just speculating, let’s take a look at some real numbers. I measured wall-clock time for the same build in three scenarios: building off a local NTFS drive, building off CFS, and building over SMB off a Samba server on the same LAN with a ping of about 300 microseconds. (These are actually times from the second of two clean builds in a row; the first build primes the caches in the system, eliminating sources of variability in the second run.)
As these numbers make clear, using a network file system as an analogy for CFS performance isn’t quite right. If the files and metadata you need are already in your CFS cache, CFS will not generate any network traffic. Further, CFS cache entries don’t “time out” or “expire”, and the main part of your CFS cache (the file data) will persist even after an OS reboot.
Does CFS have overhead? Sure, of course it does. There’s plenty of performance tuning that can still be done on CFS. At the same time, CFS also has at least one big performance advantage over modern disk-based file systems like NTFS and ext3fs: it’s not journaled. A CFS tree is just a workspace; the real data that needs absolute protection is the data in your repository and in your Cascade Manager checkpoints. If your OS crashes or your computer loses power, no big deal — you can just clone from your last checkpoint (checkpoints are cheap, so you can create them as often as you’d like). Journaled file systems, on the other hand, go to great lengths to ensure that once certain types of data have been written to disk, they cannot be lost even in an OS crash or power loss. Flushing data out to a hard drive is expensive: you have to wait for the hard drive to spin and seek to that spot on the drive, which can take milliseconds. CFS can skip all of this extra work.
Now, if we compare to a network file system – the details differ from file system to file system, but many network file systems don’t make much of an effort to cache, since someone else might change the files on the server at any time. Some will do limited amounts of caching but will “time out” cache entries, say, after 30 seconds, and ask the server again for the information. (Of course, this leaves a window of 30 seconds where you could get the wrong answer to a query!) Some will send a request to the server each time you open a file, asking whether the file has been changed since they last cached it, but this still requires a network round trip. Some support “oplock” functionality where they request that the server notify them when their cached data fall out of date, but not all servers support this, those that do might limit the number of outstanding oplocks, and the server can arbitrarily refuse or break oplocks at any time. The cache data is in memory and is lost either on reboot or even when the OS’s VM system needs to free up some pages to make room for other data.
There are also typically many other inefficiencies in network file system stacks — for example, packing and unpacking requests and replies, the TCP/IP stack, breaking up large requests into smaller ones to satisfy limits in the protocol, limited numbers of requests in flight (a serious problem if combined with network latency), overly “chatty” protocols that require many round trips to do simple operations, and sometimes just poorly-optimized client or server software.
Bottom line: once the files you need are in your CFS cache, CFS’s performance is similar to that of a local disk-based file system such as NTFS. CFS is much faster than a network file system like SMB.
One of the most important reasons to use revision control is to ensure that all developers are working from the same source base. If I reproduce a bug in my tree, I want to be sure that someone else can reproduce the same bug on their system by reconstructing the exact same source tree. Reconstructing a tree is as simple as knowing the tree’s revision number.
Unfortunately, it is all too easy to partially undercut this benefit of revision control if your build system does not generate the same binaries when you run it from two different computers. Suppose one person is building using Visual Studio 2005 and another using Visual Studio 2008. Or, they’re both using VS2008 but one has SP1 installed and the other doesn’t. The resulting binaries will be subtly different.
No big deal, right? I mean, how often do you really run into a compiler bug?
Well, not so fast. In large, complex software projects, changing even the seemingly most insignificant variables can change the behavior of the system in unexpected ways. I’ll never forget a bug I spent well over a week tracking down, where, depending on the length of the program’s command line, subsequent memory allocations’ addresses shifted around. Someone was accessing an uninitialized structure element on the stack, and the garbage contents of this stack element (which were actually quite deterministic) would change as the program’s command line length changed. The bug would appear and go away as people checked in random other changes to the source code (which in turn shifted the addresses around again), but with any given exact set of binaries, the bug would either consistently happen or consistently not happen.
Likewise, my experience on large projects (say, any project larger than 100K lines of code) has been that switching/upgrading compilers always causes at least one unexpected gotcha. This is especially true for C++ (as opposed to C) projects. Again, I could dredge up any number of obscure problems from past experience. Upgrading compilers is not a no-brainer; it’s a decision with pros and cons that has to be made carefully, and if you upgrade, all of the members of your team should upgrade at the same time.
If we’re talking about Visual Studio, I’d also be remiss to not talk about the C runtime library issues. If you build with 2005, you need the 2005 runtime installed on your computer; if you build with 2008, you need the 2008 runtime installed. The runtimes also have service pack levels, so a 2005 SP1 app requires the 2005 SP1 runtime. Again, it really does matter which exact compiler you are using, including service pack level.
Let’s return to the topic of revision control. What, exactly, should we put in the revision control system? We all know that our source code goes there, as do our build scripts, including any scripts required to create our installers. Test cases and test scripts often go there too. But this is still incomplete. In my view, your compiler, linker, and related tools also belong in revision control. So do all of the system header files you include and all the system libraries you link against. So do the headers and libraries for any third-party SDKs you rely on.
I believe it is a serious mistake to pick up whichever compiler, linker, headers, and libraries happen to be installed on the computer doing the build or happen to be pointed at by the user’s environment variables. This is a good way to get machine-dependent builds: two people build the same source code and get different results. “Different results” could mean one person gets a compile error and the other doesn’t. Or, it could mean that their builds both succeed, but they produce different binaries that behave differently.
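One way to avoid picking up the user's environment is for the build driver to construct its own. The sketch below is only an illustration under assumptions: the in-tree directory names and the `hermetic_env` helper are made up, and the list of variables worth passing through will differ from project to project.

```python
import os


def hermetic_env(tree_root,
                 tool_dirs=("tools/bin",),
                 passthrough=("SystemRoot", "TEMP", "TMP")):
    """Build an environment whose PATH lists only tool directories in the tree.

    Variables such as INCLUDE or LIB from the user's shell are deliberately
    dropped; only the names in `passthrough` are copied, if they are set.
    """
    env = {name: os.environ[name] for name in passthrough if name in os.environ}
    env["PATH"] = os.pathsep.join(os.path.join(tree_root, d) for d in tool_dirs)
    return env


# Hypothetical layout: compiler and helper tools are checked into the tree.
env = hermetic_env("/work/tree", tool_dirs=("tools/bin", "tools/compiler/bin"))
print(env["PATH"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching the build, so every tool lookup resolves inside the checked-out tree rather than to whatever happens to be installed on the machine.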
Machine-dependent builds may be tolerable in small, informal projects, but if the software project is key to your company’s livelihood, do you really want to take that chance? In science we speak of “reproducibility of results”: if one person cannot independently validate another’s research, we are inclined to distrust that research. Thus, it’s important to keep a full record of all the methodology you use in your research, so that someone else can set up the exact same experiment you did. It’s no different with computers and software engineering.
There are other advantages to machine-independent builds, beyond reproducibility of results.
- You don’t have to “install” any software on your OS to get a build up and running — you just check out a tree and you’re good to go.
- You don’t have to roll out patches to the team and make sure that everyone installs them; you just commit a change that updates the tool and everyone gets it for free.
- It’s much more likely that you will be able to accurately reconstruct old builds. For example, supposing Version 1.0 was built using VS2008 and Version 1.1 was built using VS2008 SP1, if you want to go back to track down a Version 1.0 issue, you really should switch back to using the original compiler you used on that release, without the service pack installed. If the tools are in revision control, there’s no chance that you will forget to do this.
One good way to get your machine-independent builds up and running is to use a virtual machine. (There are a number of free virtualization products out there that will do the trick.) Set up a clean OS install in a virtual machine. Without locally installing any software on your virtual machine, check out a tree and build it; if it complains about something, chances are you’re still relying on some local tools by accident rather than pulling everything from source control. (You can do this without a virtual machine, but it’s so easy to forget about some obscure step you did long ago when you first set up your computer. With a virtual machine you can force yourself to start with a clean slate.)
Admittedly, there are some components where you simply cannot avoid a local install. The two main ones I’ve run into are the Visual C runtime library and the .NET Framework. In these cases, you simply have to document that users must install these on their build machines before doing a build. Again, using a virtual machine is valuable: it’s one of the best ways to discover missing steps in your build documentation.
One final note: this post has been fairly Windows-centric. Machine-independent builds are possible on Linux and Mac also, although each OS has its own gotchas. Probably a topic for a future post…
According to our contact info, our company is a “provider of software configuration management tools.” So what is “configuration management,” anyway?
Let’s start by saying that I have never been a big fan of the term “configuration management.” Everywhere that I read about “configuration management,” I feel like I should get out a Buzzword Bingo card with boxes like “agile” and “scrum.” It seems to be a standard term in the industry, though, so I guess we have to go with it. Rather than quoting you buzzword-laden slogans from other folks, let me instead offer my own definition: configuration management means “building better processes for building software.”
A good process is one that enhances the productivity of your team, so that you can deliver your users the functionality they need and want, deliver it sooner, and deliver it at a higher quality. A bad process is one that impedes the productivity of your team, usually either through excessive bureaucracy or what I call thrashing.
We’re all familiar with excessive bureaucracy (“Before you are allowed to commit your changes, please submit a witnessed, notarized, signed copy of Form 7C and we will evaluate your change request at the next biweekly meeting”), but what’s thrashing?
Thrashing means everyone is working hard, but you still aren’t making much progress towards your destination. Maybe you’re building functionality that your customers don’t need or want, or low priority features rather than essential ones. Maybe you’re taking so many shortcuts writing the software that you’re writing low quality code full of bugs, costing you more time down the road. Maybe your bug fixes are introducing new bugs, so the quality of your software isn’t improving even though everyone is scrambling to fix bugs as fast as they are reported.
Maybe management is telling engineers to work on project A in the morning and then changing their mind in the afternoon — now project B is more important — and again the next day — sorry, project A is more important after all. Maybe people are frequently pulled off their tasks to help out with a tradeshow or other customer demo. Maybe two engineers are both rewriting the same code in their local trees, not realizing that their changes are about to collide with one another. Maybe your build or regression test is always broken, and by the time a fix is checked in, new build or regression test breaks have been checked in on top of the existing break. Maybe your build process is not automated and poorly documented, so that when any two different people build their trees, they end up with different binaries, and people can’t reproduce each other’s bugs.
That’s thrashing. Under these circumstances, you can work as hard as you want, and you won’t accomplish much. It’s always tempting to say that everyone just needs to work even harder (more overtime!). Management sees that the schedule is slipping and sends out an email of “encouragement” saying “we promised to ship on date X — look, we can’t break our promises to customers, so get cracking, people!” This rarely helps. (A particularly dishonest but common management tactic is to tell one “aggressive” release date to the engineers and another “conservative” release date to the customers.)
We can’t really help you with excessive bureaucracy. If your developers spend 5 hours a day sitting in meetings and reading and sending email, sorry, there’s not much we can do about that. But when it comes to thrashing, better tools can definitely help.
Think about how you or your developers spend a typical day. I’ll give an example from personal experience. On several projects I formerly worked on, it would be common for me to spend the first 1-2 hours of each day:
- Downloading the latest source code from the revision control system
- Building the software, and addressing any build breaks I hit along the way
- Running some simple tests on the software, to make sure some basic things work, and fixing or reporting any bugs I hit along the way
Of course, I might get stuck. If I hit a bug that I couldn’t solve myself, I might have to wait for someone else to fix it. I’m a big believer in jumping on problems aggressively to unblock the rest of the team as quickly as possible (if you can fix a build break, go ahead and commit immediately — don’t wait for someone else to fix it), but you have to be careful: it’s irresponsible to make changes to code you don’t understand. In that case, if the other person doesn’t fix the problem promptly, you can spend days waiting. (Hopefully you have other projects you can work on so you don’t just have to sit there browsing the web all day.) Again, this has happened to me on real projects many times.
Wouldn’t it have been better if I could have shortcut those first 1-2 hours of each day and been able to set up a pre-built, pre-tested tree in seconds with Cascade?
Wouldn’t it have been better if those changes that broke builds or regression tests had been rejected — not allowed to be committed into the repository in the first place?
That’s what good configuration management is all about to me. It’s about building processes that aren’t bureaucratic — they allow people to spend their time doing work rather than jumping through unnecessary hoops and sitting in boring meetings — and increase everyone’s productivity by reducing thrashing.
You’ll never hear me talk about “agile CM” or whatever the latest buzzwords are. Configuration management is too important to be reduced to marketing buzzwords and slogans. There’s real substance behind it, and the processes good configuration management engineers build can be the difference between shipping low quality software late and shipping high quality software on time.
Computers are deterministic.
Well, OK, not always. A single-threaded, sequential program that doesn’t interface with the outside world is deterministic, but once user input, hardware devices, and networking get involved, these create asynchronous events that can be nondeterministic in character. The same is true if two programs, each independently deterministic, write to the same files in the file system at the same time. Preemptive multithreading adds more nondeterminism, and computers with multiple CPU cores add still more.
At the same time, we have also built a whole family of software abstractions that take these initially nondeterministic constructs and push them back into the realm of determinism. Yes, user input is asynchronous, but if the window system puts it in a queue instead of giving us a callback, and we only have a single thread pulling messages out of that queue, we’re back to deterministic behavior. The arrival of network packets is nondeterministic, but if we use blocking TCP sockets, the operating system takes care of most of the dirty work for us and all we have to do is send and receive bytes. A 3D graphics processor works in parallel with the CPU, but when you call glReadPixels, the OpenGL driver first waits for the previous rendering to finish so it doesn’t give you random incompletely rendered garbage.
Multithreading remains a hard case, but you aren’t forced to use it, and with careful use of mutexes, semaphores, etc., multithreaded programs can be perfectly deterministic. We can also hope that future programming environments will be able to automatically extract more parallelism from our software, so that we won’t have to write multithreaded code to get speedups on multi-core CPUs.
Even if your program has nondeterministic behavior, though, one simple fact remains: everything that happens on a computer happens for a reason. There is a logical explanation for why it happened. (That explanation may be “hardware failure” or “cosmic ray”, but these are rare, and even so, they are still explanations.) It may be hard to determine that reason, but with enough digging, it is usually possible.
So, it may be tempting to take a sledgehammer to your computer when it misbehaves, but when we have our “professional software engineer” hats on, we ought to be able to do better.
Let’s take GNU make as an example. If you’ve read our whitepaper, or even if you’re just an occasional user of GNU make, you probably know that occasionally a build fails “for some weird inexplicable reason.” You type “make”, and instead of a correct binary popping out, you get a strange error message. Or, worse, it looks like it succeeded, but you get a binary that doesn’t run. So you type “make clean” and “make”, and then hope it will work this next time around.
So say that second try fails… what now? Well, you can send an email to other people and see if they’re hitting the same problem. You can look at the status of your automated build system and see whether it’s hitting the same failure. But suppose the problem really does seem to be unique to your computer. It’s possible your source control tree might be corrupted in some strange way. (This has happened to me a good number of times.) You might type “p4 sync -f”, but even that only fixes files that exist in the depot; it doesn’t delete files that exist but shouldn’t. So maybe you “rm -rf” your entire tree and get a brand new one from scratch. Perhaps that will fix it! If it doesn’t, well, then you’re really in trouble…
At this point it’s tempting to throw up your hands in despair and admit defeat. But if we really want to keep going… well, does it happen if I do the build on another computer? I could reinstall my OS — maybe that will fix it?
This is, in a nutshell, what I call the “voodoo magic school of software engineering.” We run into a problem — a build fails. Rather than understanding the problem, determining its root cause, and addressing the root cause so it won’t happen again, we flail about, trying random stuff in the hope that something will magically fix it. When we get to the point of “let’s reinstall the OS — maybe that will help!”, we might as well just be sticking pins in a voodoo doll.
We may well “fix” the problem this way, but we won’t ever determine the root cause. If we don’t determine the root cause, how can we know that the problem won’t happen again to someone else or even to ourselves? How do we know that our “fix” really fixed the problem and didn’t just paper it over, ready to reemerge at any moment?
We don’t need to track down the root cause of everything that goes wrong on a computer. When I load some random web page and it comes up garbled, I usually just hit “refresh” and often it magically works on the second try. Or, sometimes I come back to the site later and hope that the site owner has fixed their bug. Either way, I don’t waste time trying to figure it out or even report the problem, unless the web site is really important to me.
That’s fine and well when we are talking about someone else’s web page. Someone else’s web page — someone else’s problem. But when we’re talking about our own software, our own web page, our own build flow — anything that affects our day-to-day workflow, or anything that we ship to customers and must take responsibility for — we owe it to ourselves as professional software engineers to drill down into problems and address their root causes so they don’t happen again.
When your build fails, it doesn’t fail for an “inexplicable reason.” Sometimes your tree really is corrupted and “rm -rf” is the correct answer, but even if so, how did your tree get corrupted in the first place? For example, did some previous build step delete some files by accident? Did some runaway script corrupt it? Sometimes the root cause may just turn out to be user error, but even so, it’s helpful to understand our own mistakes so we don’t repeat them.
But more often, when these things happen, I find that there is a real problem to be fixed. A common example is a makefile with incomplete dependencies. If you’re missing a dependency, things may not be rebuilt correctly, and there’s no telling what might go wrong after that. Tracking down the missing dependency isn’t a waste of time. It’s an investment in the future.
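To make the missing-dependency case concrete, here is a rough sketch of the kind of check I have in mind, written in Python. It is only an illustration under simplifying assumptions: the function names (parse_rules, missing_header_deps) are made up for this post, the makefile parsing handles only plain “target: prerequisites” lines, and it looks only at direct #include "..." lines rather than following headers that include other headers. A real project would more likely let the compiler emit the dependency list (GCC and Clang can do this with -MMD).

```python
import re

# Matches lines such as:  #include "util.h"   (quoted includes only)
INCLUDE_RE = re.compile(r'^\s*#\s*include\s+"([^"]+)"', re.MULTILINE)


def parse_rules(makefile_text):
    """Map each target to the set of prerequisites listed on its rule line.

    Simplification: only plain 'target: prereqs' lines are understood.
    Recipe lines, comments, blank lines, and variable assignments are skipped.
    """
    rules = {}
    for line in makefile_text.splitlines():
        stripped = line.strip()
        if not stripped or line.startswith("\t") or stripped.startswith("#"):
            continue
        if "=" in line or ":" not in line:
            continue
        target, _, prereqs = line.partition(":")
        rules.setdefault(target.strip(), set()).update(prereqs.split())
    return rules


def missing_header_deps(makefile_text, sources):
    """Return {target: [headers its sources include but the rule omits]}.

    `sources` maps a file name to its text, so the sketch needs no disk access.
    """
    missing = {}
    for target, prereqs in parse_rules(makefile_text).items():
        included = set()
        for name in prereqs:
            if name in sources:
                included.update(INCLUDE_RE.findall(sources[name]))
        undeclared = sorted(included - prereqs)
        if undeclared:
            missing[target] = undeclared
    return missing


# Hypothetical example: main.c includes config.h, but the rule does not list it,
# so editing config.h would not trigger a rebuild of main.o.
MAKEFILE = "main.o: main.c util.h\n\tcc -c main.c\n"
SOURCES = {
    "main.c": '#include "util.h"\n#include "config.h"\nint main(void) { return 0; }\n',
}

print(missing_header_deps(MAKEFILE, SOURCES))
```

A check like this is exactly the sort of thing that turns a one-off “my build is broken for no reason” into a fix that sticks for everyone.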
We’ve published a new whitepaper on Replication vs. Caching. This whitepaper discusses the tradeoffs between two different ways of accelerating multi-site development. At the risk of giving away the punch line, here’s the key point to take away:
For accelerating offsite development, caching, especially when combined with intelligent prefetching, provides most of the advantages of replication without its many disadvantages. Setting up caches is cheap and easy. Replication is best suited for offsite backups, not for accelerating offsite development.
Mark Lucovsky’s description of the software engineering processes used in Windows NT development is old, but a good primer for anyone who hasn’t worked on projects this large. Software development projects become much more difficult as they grow, especially as you start to exceed 100 developers — and Windows 2000 had 1400 developers (plus another 1700 testers) working on 29 million lines of code.
It would be interesting to get an update on how things have changed since this was written, to get some insight on the challenges Windows Vista and Windows 7 development have faced and how the processes built for Windows 2000 have scaled over the last decade. But some things are safe bets: software projects tend to get bigger and harder over time. Software complexity has been increasing faster than our ability to deal with that complexity. Many of the basic tools people use in day-to-day software development are remarkably primitive.
Many of the problems that Mark describes are problems that Cascade attacks head-on:
- The source tree was very large (50GB — recall, disks were not as big at the time). It took a very long time (1 week) to download a brand new tree, and 2 hours to get today’s updates. The performance was improved greatly by moving to a new source control system, but even so, who wants to wait 3 hours to set up a new tree? With a file system-based approach like Cascade, a tree consumes negligible disk space. Setting up a new tree or updating an existing tree takes just seconds rather than minutes or hours.
- Slow builds. Some things never change: the full OS build took 5 hours on NT 3.1, and while hardware got faster, the tree got larger, so it took 8 hours on Windows 2000, even on a very high-end machine (4 CPUs). I’ve had the same experience on my own projects: computers do get faster every year, but build times always seem to get worse, not better.
- Frequent regressions (build breaks, boot failures, etc.) that shut down the whole team. Even the smartest engineers make mistakes. Automated build and test labs of the sort Mark describes certainly help, but typically such systems can only detect breaks after the fact. Wouldn’t it be better if developers could know what their changes might break before they commit, or for the system to actively prevent breaks from being committed?
While it’s a little short on technical details, a Wall Street Journal article from a few years back seems to confirm that these problems haven’t gone away, and that they may have even contributed to Vista’s delays:
In making large software programs engineers regularly bring together all the new unfinished features into a single “build,” a sort of prototype used to test how the features work together. Ideally, engineers make a fresh build every night, fix any bugs and go back to refining their features the next day. But with 4,000 engineers writing code each day, testing the build became a Sisyphean task. When a bug popped up, trouble-shooters would often have to manually search through thousands of lines of code to find the problem.