The Conifer Systems Blog

Cascade 1.2.0.1069 released!

no comments

Cascade 1.2.0.1069 has been released!  This release adds a bunch of new features, such as exclusive locking, “Details” columns in the shell extension, and named checkpoints, as well as performance enhancements and bug fixes.  For a detailed list, check out the release notes.  Try it now


Written by Matt

February 18th, 2009 at 1:09 am

The Cost of Branching

no comments

I wrote previously on the cost of integration.  I’d like to follow up by discussing a related topic, the cost of branching.

Doesn’t integration imply branching?  Not exactly: branching requires integration, but integration necessarily takes place in any multi-developer project whether you use branching or not.  Integration simply means that two developers go off and work in parallel in separate trees, and when each one finishes their work, that work has to be combined.

Even if the two developers are editing different groups of files and never edit the same file at the same time, the mere act of combining their changes is still an “integration” and has at least some nonzero risk.  For example, on a platform with limited memory, e.g., a game console, each change independently might work, but combining the two might make the system run out of memory.  This isn’t either developer’s “fault”, necessarily, just a reflection of the fact that modern software systems are complex and their behavior isn’t always easily predictable.

Back to branching.  One cost of branching is that it generally increases the total number of integrations that have to be done.  When you commit, you’re immediately integrating with the other people inside your branch, but then later on your change will have to be integrated over into other branches.  The more branches, the more times it needs to be integrated.

An example is a team that has overlapping release branches.  If release 1 can’t be end-of-lifed by the time release 2 is branched, engineers will need to integrate from release 1 to release 2 to main on each change.  This can get many times worse if a large number of branches are outstanding.  Each integration represents another source of risk.  Consider that each integration could be botched, or even forgotten.  (Systems with merge tracking reduce the chance someone will forget about an integration, but merge tracking systems are definitely not foolproof.)

Another cost of branching is that it delays integrations.  This is especially the case with development branches, where a body of code is developed off in a separate branch before being integrated back into the main branch.  The longer you put off an integration, the more costly it gets.

Another cost of branching is that, in some branching models, the developer of the change is not the same person who will be integrating the change.  When I commit and need to merge with others’ changes, I, as the committer, am taking responsibility for my own merges.  But later on, if a different engineer has to integrate that change into another branch, or has to merge that change with other changes as part of an integrate into the same branch, this person may not fully understand the intricacies of the change.  This makes the integration more costly, and it increases the probability of an error.  Put another way: if developer A has to integrate his own change A with developer B’s change B, he at least understands his own code.  But with branching, developer C may be stuck merging changes A and B from developers A and B.

Another cost of branching is the impact on the revision control system.  While marketing materials often love to talk about how cheap branching is in some revision control system, the reality sometimes differs from the marketing:

  • Consider the cost of creating a branch.  Is it an O(1) (constant time) operation, regardless of how many files are being branched?  If the cost of creating a branch increases with the number of files being branched, this can become expensive as trees grow larger.  How much does this slow down your server and grow your database?
  • Consider all of the extra commits required to integrate between branches.  Think of regular commits as “work” and integrations as “overhead.”  As the number of branches increases, the percentage of commits that are overhead increases, reducing your server’s performance and causing your database to grow.
  • Consider the cost of checking out the branched trees onto client systems, in terms of disk space and network bandwidth.  If I need to work in 3 branches, I need to download about 3 times as many files.  (Note that a system like Cascade can reduce this cost, by caching files as they are accessed instead of downloading them all up front.)
  • If the system does merge tracking, tracking this information and performing queries on what files/changes need to be integrated may impose an additional cost on the server.  I’ve observed cases where these queries can bog down a server for many minutes.
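To make the O(1) question above concrete, here is a minimal sketch of two ways a branch could be created.  Both classes are hypothetical toys invented for illustration; they do not describe how Cascade or any other real revision control system stores branches.

```python
from types import MappingProxyType


class CopyingRepo:
    """Branching copies one table entry per file, so cost grows with tree size."""

    def __init__(self, files):
        self.branches = {"main": dict(files)}

    def create_branch(self, name, source="main"):
        self.branches[name] = dict(self.branches[source])  # O(number of files)


class SnapshotRepo:
    """Branching records a reference to a read-only snapshot, so cost is constant."""

    def __init__(self, files):
        self.snapshots = [MappingProxyType(dict(files))]
        self.branches = {"main": 0}  # branch name -> index into self.snapshots

    def create_branch(self, name, source="main"):
        self.branches[name] = self.branches[source]  # O(1), no per-file work

    def files(self, branch):
        return self.snapshots[self.branches[branch]]


# Hypothetical 1000-file tree, branched once in each model.
tree = {f"src/file{i}.c": f"rev{i}" for i in range(1000)}
copying, snapshot = CopyingRepo(tree), SnapshotRepo(tree)
copying.create_branch("release1")
snapshot.create_branch("release1")
print(len(copying.branches["release1"]))                     # 1000 entries were copied
print(snapshot.files("release1") is snapshot.files("main"))  # True: shared snapshot
```

In the first model every new branch adds a full set of per-file records, which is the kind of growth worth asking your vendor about.  In the second, the cost is deferred until files in the branch actually diverge.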

Another cost of branching is the impact on your build, release, and QA flows.  Consider, for example, build and test automation.  Does each active branch need to have automated builds and tests set up on it?  If not, developers in that branch are effectively “flying blind.”  Yet setting up and maintaining all of these builds and tests can be burdensome, especially if build and release engineers are a shared resource between developers on many teams.  Every time someone creates a branch and requests that builds and tests be set up on the branch, more work is being created for this team.  Every time these builds and tests need to be reconfigured, again, more work is being created.  In addition, running all of these extra builds and tests may require you to buy more computers.

Overall: branching is a powerful tool, but when it’s overused, it can impose a lot of costs on your team.  The most important rule to remember is that integration is overhead.  The greater percentage of their time your developers spend integrating, the less time they can spend developing.  Branches create more integration work.  My preference is always to err on the side of creating branches as late as possible, and ideally to not create them at all if there’s any way to get away with it.


Written by Matt

December 8th, 2008 at 12:33 pm

Cascade 1.1.1.991 released!

no comments

Cascade 1.1.1.991 has been released!  This is primarily a bug fix release with a few minor new features.  The exact changes are spelled out in the release notes.  Try it now


Written by Matt

December 5th, 2008 at 9:37 pm

The Cost of Integration

3 comments

I’d like to propose a fundamental law of configuration management: the cost of an integration increases over time.  This is similar to the well-known software engineering observation that the cost of fixing a bug increases over time.

Let’s start with a simple example: a single project with just 2 engineers, where each engineer commits a single change once per day.  Now suppose that both engineers, for some reason, decide to start committing their code in batches of 5 changes once per week instead.  I’m not sure why they would do this; I see large benefits to keeping commits small.

Here are the consequences I would foresee:

  • A reduction in per-commit overhead by batching up 5 changes into a single larger commit.
  • Increased communication overhead: a revision control system is a formalized way for engineers to communicate without having to send emails, etc.  In particular, the change descriptions, if well-written, help keep the other team members informed about what is going on.  Frequent commits also make conversations like “watch out, I’m working on some big changes to file X” less necessary.
  • Increased redundant work: both engineers might fix the same bug in their own local trees rather than picking up the change from the other engineer.
  • A larger number of merge conflicts.  At the risk of misapplying statistics and making a vast number of simplifying assumptions: if each change touches 5% of the total lines of code, and if changes are randomly distributed in the code, the probability of a merge conflict was about 1.2% weekly before and is about 5.1% weekly now.
  • Incompatible changes: both engineers might choose to rewrite the same block of code in two different and inconsistent ways.  This will show up as a merge conflict, but it’s worse than a plain old merge conflict.  You’re not just doing a textual merge, you’re trying to reconcile two conflicting visions of how the code should work “after the fact” and throwing away a good chunk of the work.  Had the first rewrite been committed more promptly, an additional rewrite might have been avoided.
  • New bugs are discovered and fixed later: if the first engineer’s changes introduce a bug that impacts the second engineer’s work, the bug might be discovered a week later rather than a day later.  Standard software engineering literature suggests that bugs cost more to fix over time.
  • Increased probability of losing work.  Once a change is committed, it’s saved in the repository and won’t be lost in an accidental “rm -rf” or a hardware failure (assuming that the repository itself is being backed up appropriately).
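The merge-conflict percentages in the list above can be reproduced with a short calculation.  The post does not spell out its model, so the one below is an assumption: every change independently touches any given line with probability 5%, and a conflict means both engineers touched that line within the same merge window.

```python
# Assumed model (a reconstruction; the post does not state it explicitly).
TOUCH = 0.05          # fraction of the code a single change touches
CHANGES_PER_WEEK = 5  # changes per engineer per week


def weekly_conflict_daily_commits(touch=TOUCH, days=CHANGES_PER_WEEK):
    """Each day one change from each engineer is merged; any day may conflict."""
    per_day = touch * touch           # both single changes hit the same line
    return 1 - (1 - per_day) ** days  # at least one conflicting day in the week


def weekly_conflict_weekly_batches(touch=TOUCH, changes=CHANGES_PER_WEEK):
    """Once a week, one batch of five changes from each engineer is merged."""
    batch = 1 - (1 - touch) ** changes  # a batch of five changes hits the line
    return batch * batch                # both batches hit the same line


print(f"{weekly_conflict_daily_commits():.1%}")   # 1.2%
print(f"{weekly_conflict_weekly_batches():.1%}")  # 5.1%
```

Under these assumptions, batching roughly quadruples the weekly chance of a conflict, because each batch covers about 23% of the code instead of 5%.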

Unless you’re extremely worried about per-commit overhead (in which case I would suggest that you have bigger process problems you need to address), this is definitely not a good thing.

Merge conflicts in particular are more dangerous than a lot of people realize.  In software, it is not necessarily true that the correct way to combine two changes is to perform a textual merge of the source code.  It is dangerous to assume that simply because a textual merge did not detect any obvious conflicts, you are all set!

To perform a correct merge, you need to understand what has been changed and why.  Many engineers have a bad habit of being careless on merges: they let down their guard.  Merges are just as real as any other change, and we cannot assume that just because two changes worked independently that they will also work together.

Of course, if the textual merge does detect a conflict, the risks are far greater.  An automated merge won’t get tired or make a typo.  A human can and sometimes will.  If the conflicts are nontrivial, as in the case of two engineers rewriting the same code, merges can be some of the most dangerous changes of all.

So far I’m not really saying anything new here.  It’s pretty standard advice that engineers should commit code no less than once a day, even if only to reduce the risk of losing code by accident.  Also, there is a lot of literature on the benefits of “continuous integration” as opposed to “Big Bang integration”, or on releasing your software “early and often.”

At the same time, a lot of supposed proponents of continuous integration seem to talk the talk better than they walk the walk.  You will find a lot of these same people advocating such things as:

  • development branches, where different groups of engineers working on different features commit to different branches/codelines, rather than sharing a single “main” or “trunk” development branch
  • “distributed version control systems”, which are development branches taken to another level (all changes are effectively developed in a development branch, and no “main” branch even exists except by convention)
  • branching and releasing each component of your project separately, rather than putting all components under a single “main” or “trunk” and branching them simultaneously

I contend that, by delaying integrations, these practices are steps back in the direction of “Big Bang integration” and that they increase the total cost of integrations.

Consider development branches, where several engineers go off and work on different features in different branches rather than working concurrently in a single main branch.  Nearly all the same risks I listed above for committing once a week rather than once a day apply here also: communication overhead, redundant work, merge conflicts, incompatible changes, bugs discovered and fixed later.  (On the bright side, losing work by accident should not be an issue here.)

The more development branches you have, the more integrations you will need to do.  Someone will need to merge the changes from the main branch into the development branch on a regular basis, and when the development branch is done or at least has reached a “good” point, it needs to be merged back into the main branch.  Either way, this typically leads to “mass integrates” in both directions.

As I’ve written before, mass integrates are a “worst practice.”  Mass integrates can frequently run into dangerous merge conflicts.  Because you are merging two large bodies of changes, the probability of a textual or logical conflict between the two sets of changes can be high.  The longer the development branch lives on without being integrated back into the main branch, the greater this risk grows.  (If you must, for whatever reason, have a development branch, I recommend integrating in both directions as frequently as possible.)

A development branch can be thought of as an intentional delay in integrating code.  This can be tempting: “I get my own sandbox where I can do whatever I want!”  But this kind of freedom is dangerous at best.  For example, it encourages engineers to break things in the branch expecting that they will be “cleaned up later.”  If the feature’s schedule starts to slip, this deferred “cleanup” work may be skipped.  All of a sudden the development branch “needs” to be merged back into the main branch “right away” so that it can be in place for the next release branch.  (I’ve seen this happen a number of times.)

When you add in the costs of delayed integrations, I recommend against development branches.  You are better off doing the development in the main branch.  This may require a bit more care on each change (you can’t break stuff like you can off in your own little “sandbox”), but the discipline this requires will pay off later.  You won’t waste time integrating changes back and forth between branches, and you will spend a lot less time fiddling around with (textual or logical) merge conflicts.

If the new code isn’t ready to activate right away, you can simply hide it behind an #ifdef or somesuch.  Even if the #ifdef approach starts to prove to be difficult, it’s likely to still be easier than dealing with merge conflicts: when someone makes an unrelated change that interacts with your changes, there’s a good chance that they will help you out by updating the inactive #ifdef code.  And if someone makes a change that truly conflicts with your changes, you’ll know right away.
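As a rough illustration of the “#ifdef or somesuch” idea, here is a sketch in Python, which has no preprocessor, so a runtime flag stands in for the compile-time switch.  The flag name `USE_NEW_SORT` and all three functions are hypothetical, invented only for this example.

```python
import os

# Hypothetical switch; it plays the role the post gives to #ifdef.
USE_NEW_SORT = os.environ.get("USE_NEW_SORT", "0") == "1"


def sort_names_current(names):
    """The code path everyone runs today."""
    return sorted(names)


def sort_names_new(names):
    """Work in progress: committed to the main branch, inactive by default."""
    return sorted(names, key=str.casefold)


def sort_names(names, use_new=USE_NEW_SORT):
    """Callers use this entry point and never need to know about the flag."""
    return sort_names_new(names) if use_new else sort_names_current(names)


print(sort_names(["a", "B"], use_new=False))  # ['B', 'a']
print(sort_names(["a", "B"], use_new=True))   # ['a', 'B']
```

Because the inactive path lives in the same tree as everything else, anyone who renames or changes something it depends on sees it in the same file and can update it in the same commit.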


Written by Matt

November 29th, 2008 at 10:40 pm

Continuous Process Improvement

no comments

With the possible bankruptcy of the US Big Three automakers in the news, it’s interesting to think about the analogies between making cars and making software.  There is no one single reason why the Big Three have been on the decline for many years now, but surely one of the most important reasons is that American consumers decided that Japanese cars are generally higher-quality than American cars.

These days, the Japanese carmakers’ attitude towards quality and process improvement is so mainstream that it’s almost cliched.  It can easily be taken too far; for example, it is clearly not the case that increasing quality always saves money.  Rather than get involved in the “software quality” debate, I’d rather focus right now on the idea of “process improvement.”

Whether you realize it or not, you have a process for building software.  Oftentimes very little conscious thought has been put into this process, and it is frequently ineffective and wasteful.

Here are some basic elements of your software engineering process to think about:

  • Who decides whether a feature is going to be added or a bug is going to be fixed, and if so, in which release?
  • How many outstanding branches/codelines do you have?  When do you create a new one?  When do you shut down an existing one?  Who does integrations between them, and how often?
  • How do you build your software, all the way from the original source code to the final CD/DVD image or downloadable installer, for all of your target platforms/build configurations?  How do you detect and/or prevent build breaks?
  • How do you find bugs in your software?  Customer bug reports?  Internal QA?  Code reviews/code reading?  Static analysis tools and compiler warnings?  Asserts?  Other?
  • What do engineers do before they commit their changes, e.g., what platforms do they build and test on, and what tests do they run?
  • What happens after a change is committed?  What automated or manual builds and tests are run on it?
  • How do you verify that a bug is really fixed?
  • If a previously committed change seems to be causing problems, what do you do?  How much time do you let the change stay in place while you try to debug the problem?  Or do you “back it out first and ask questions later”, putting the responsibility on the original change’s author to figure out what went wrong and reapply a correct version of the change later?
  • Top engineers are often 10x more productive than average engineers.  Is your process geared towards allowing those top engineers to flourish, at the risk of occasional mistakes slipping through, or is it geared towards preventing the average engineer from making mistakes, at the risk of reducing your top engineers’ productivity?

Whatever your current process, if you want to improve the effectiveness of your software development organization, you should be looking for ways to enhance it.  A very simple way to do this, pioneered in manufacturing by the Japanese automakers, is to look for the root cause of each problem and fix the root cause so the problem cannot happen again.  (I’ve written previously on this topic.)  One simple method Toyota adopted to identify root causes is called the “5 Whys.”  The important thing is not the specific method you use, but that you do dig down to understand why problems are happening.

This isn’t just the responsibility of management.  Individual engineers should be looking for opportunities for process improvement, too.  Any time you find or fix a bug, for example, this gives you an opportunity to ask a bunch of questions:

  • When and how was the bug introduced?
  • Could we have prevented this bug from being introduced in the first place?
  • Could we have detected this bug sooner?
  • From the time this bug was reported, could we have fixed it sooner?
  • Was the bug prioritized appropriately?
  • This could be one of a family of related bugs.  Are there other, similar bugs elsewhere we should look for?

We can ask much the same questions any time someone’s build is broken:

  • Who broke the build?
  • Why wasn’t the build break discovered before commit?  How could it have been prevented?
  • How quickly was the build break detected?  How quickly was it fixed?
  • Are other build configurations or other components’ builds broken too?

To give a more concrete example, maybe a build break wasn’t discovered before commit because the developer only did builds on a subset of the target platforms/configurations.  Perhaps the debug build passed and the release build failed.  Or perhaps the Windows build passed and the Linux build failed.  What possible process improvements does this suggest?

  • Require people to test-build all configurations before committing.  I would probably not recommend this; the cost can easily exceed the benefit.  Also, engineers are likely to “forget” to follow such a requirement, either intentionally or unintentionally, or are likely to make “one last change” after doing all their tests and not go back and fully test everything again.
  • Reduce the number of supported build configurations.  Debug/release is pretty typical, but suppose you’re still supporting some ancient operating system that no one cares about any more; perhaps you can finally retire your old DOS or Win9x or MacOS9 build, for example?  Or perhaps you can have a single binary for all versions of Linux rather than a separate binary for each supported Linux distro?
  • Disable “warnings as errors.”  This one is a double-edged sword.  On one hand it prevents warnings from creeping in.  On the other hand it makes your builds more brittle.  It’s up to you to make the right choice.
  • Set up a system like Cascade that will reject the commit of any change that breaks a build.

We can never achieve process perfection, but over time we can improve our process so that we don’t make preventable mistakes.  We should be able to avoid making the same mistake twice, for example.  We also need to watch that our process doesn’t get overly bureaucratic and burdensome.  Every so often it may be useful to “deregulate” your process: toss out some of the rules, especially the ones that you think might have a poor cost/benefit ratio, and see what happens.


Written by Matt

November 25th, 2008 at 4:04 pm

Cascade 1.1.0.943 released!

no comments

Cascade 1.1.0.943 has been released!  This release adds a number of new features.  The full list of what’s changed is spelled out in the release notes.  Try it now…

What’s more, Cascade is now completely free for personal use, even in a corporate environment.  Multi-user installations still require you to purchase a license.  The details are spelled out on the Pricing page and in the EULA at install time, but the bottom line is that you can use Cascade for as long as you want on as many computers as you want, so long as it’s just you using Cascade and accessing your Cascade servers.


Written by Matt

November 14th, 2008 at 12:42 am

When Are Small Commits Bad?

1 comment

I wrote previously on the topic of small commits.  So when and why would I advise bunching small changes together into bigger ones, aside from the obvious case of changes that must be done atomically to avoid breaking something?

One example is a change that causes a compatibility break.  Suppose you have an API, network protocol, file format, database schema, etc. you want to change.  If you’re going to make one change to it already, this would be a great opportunity to make other, simultaneous, desirable changes.  If people are going to have to upgrade their clients, servers, file parsers, file writers, databases, and/or database queries already, you might as well batch up these changes to reduce the total number of compatibility breaks and the total pain they will cause.  The worst case would be if the intermediate API, protocol, database, etc. is released outside your organization.  You might then have to support another version for the rest of time.

That’s not to say that you should make unnecessary or gratuitous changes at that time, but if you know you’re going to have to add 2 columns to a table in your database, you might as well add them both at once, rather than doing 2 separate changes to add one column at a time.

If a bug requires making an identical change to a bunch of different places in your source code, I’d likewise advise doing only a single change.  If the same code has been copied and pasted to a bunch of different locations, for example, and each one has the same bug, I’d advise fixing them all at once.  The last thing you want, certainly, is to be in the middle of fixing the bug and for someone to check in a new change adding another copy of the same buggy code — simply because you didn’t commit your changes all right away.  This also makes it clear from the revision history that the changes are connected to one another.

However, if you find yourself in such a situation, where a “simple” bug fix requires changing a lot of similar logic all over the place, I might also suggest that you should look at your design more carefully and refactor your code to reduce the replication of logic.  Any time you are copying and pasting code around, you are usually doing something wrong.

One of the often-claimed benefits of large commits is that there is fixed per-commit overhead.  A typical example of this overhead would be a mandatory code review: if every change must be emailed out and you must wait for a reply from another engineer approving your change, this might take a while.

Fixed per-commit overhead, which is very real in many organizations, makes it very tempting to batch up your changes.  I’d advise against this.  If you are finding that fixed per-commit overhead is forcing you to batch up unrelated changes into a single atomic commit, I would contend that you have a process issue that you need to address.

Sometimes fixed per-commit overhead is simply unnecessary bureaucracy: paranoid management enforcing commit policies that have no logical connection to the actual risk of a change.  My view is that a manager needs to be able to trust his employees’ judgment.  If you don’t trust your employees to make good decisions and to ask around for help when they don’t know the right answer, I’d suggest that you have a much bigger problem in your organization and that your commit policies are just a band-aid.

These policies tend to drag down the productivity of your best engineers.  If your best engineers are often 5-10x more productive than your average engineer, then you can ill afford to have them waste time on every commit, just to prevent your worst engineers from checking in bad code.  The real solution is to get rid of the bad engineers or to mentor them so that they don’t need extensive babysitting.

I’ve worked in several organizations with these kinds of overkill commit policies, and my general approach as an engineer was simply to ignore the policies, which were rarely enforced, and use my own best judgment instead.  (No… it really isn’t necessary to run a long, comprehensive test suite if all you’ve done is change a comment in the source code.)

In other cases, while the commit policy itself was basically reasonable, the time it took to run through the builds and tests was excessive.  In this case the answer is to optimize your processes.  If it takes several hours to build and test a change before committing it, forget the question of big vs. small commits — you’re killing your engineers’ productivity across the board.

For example, if your software needs to run on Windows, Linux, and Macintosh, it’s perfectly reasonable to expect that everyone’s changes should compile and pass a simple test on all three platforms before they are committed.  But building and testing your changes on all three platforms can take a while, and done manually, it’s error-prone (are you sure you copied the exact same files back and forth between your 3 source trees? are you sure the final change you committed is the same one you tested?).  This is where better tools like Cascade can help: instead of doing these builds and tests manually, you can simply “checkpoint” your changes and Cascade will take care of running them all.

If you’ve exhausted all the possible process improvements and commits are still taking a while, one final approach is to pipeline your work.  Once you’ve kicked off builds and tests for a change, you shouldn’t need to just go off and browse the web waiting for them to complete.  You ought to be able to start working on another, unrelated change in another tree.  Again, Cascade can help.  Traditionally, having more trees has been expensive: you have to check out and update the extra trees, and then you still have to build each tree independently (even though the build results should be the same).  With Cascade, cloning a new tree takes just seconds, and each tree comes prepopulated with the results of all your builds and tests.


Written by Matt

November 12th, 2008 at 6:39 pm

The Benefits of Small Commits

4 comments

Unless there’s a specific reason why you can’t, I recommend that you commit code to your repository in the smallest possible atomic chunks.

Look, it’s great that modern source control systems allow you to commit an atomic change to more than one file at a time.  This is an essential feature and I can’t imagine living without it.  But just because we can, that doesn’t mean that we should.

Probably the biggest reason to keep changes small is to make it easier to track down which change caused a particular bug.  If two changes are commingled into a single commit, you may have to manually disentangle them to figure out which one was responsible.

Large changes also make it complex to go back and dig through history.  If you’re trying to understand why someone made a particular change to a particular file and are looking through the old revisions, you might be confused if you find that someone made several other changes to the same file at the same time.  Are the changes interrelated?  Hopefully the change’s description will explain, but old change descriptions are often less than fully illuminating in practice.

If a change is demonstrated to cause a bug, you might want to back it out.  If other changes have been lumped together with it, you might unintentionally back out other, unrelated changes that did not cause the bug and might be desirable to keep in the tree.

Consider also the impact on other engineers who have changes in development.  These engineers will need to merge their changes with yours.  The larger and more invasive a change is, the harder it can be to merge with other changes.

One specific thing you should not do is combine cosmetic and functional changes in a single change.  For example, while making a change, if you notice that a source file has tabs instead of spaces, and your coding policy calls for spaces, don’t reformat the entire file at the same time that you are making your other changes.  The same goes for moving curly braces, making the text fit within a certain number of columns, using // vs. /* comments, etc.  It’s fine to make these changes to clean up code to meet coding policies… just don’t mix them with substantive, functional changes to the code.

One common way people end up committing large changes is the dreaded “mass integrate”.  That is, you have two branches, and you want to catch up the one branch with all the changes made to the other branch.  In a mass integrate, rather than integrating each individual change over by itself, you integrate all of the changes together in one big commit.  Mass integrates may touch hundreds or thousands of files.

Because they lump many changes together, they may introduce and fix large numbers of bugs all in a single commit, and it may be difficult to track down what caused what.  They obscure file history, especially if the descriptions of the individual changes being integrated are not all copy-and-pasted into the mass integrate’s description.  If the mass integrate proves to be unwise, you may not realistically be able to back it out without creating an even bigger mess.

Mass integrates into a long-lived branch, e.g., your trunk or a release branch, are a “worst practice” in software development.  Mass integrates into a development branch are not such a problem; the problem arises when merging a development branch back into the main branch.  Sometimes you may have no choice but to integrate a bunch of changes together (each change individually breaks things, and you need all of the changes or none for the tree to stay in a consistent, working state), but it can be massively disruptive for a large pile of changes to be thrown into a branch all at once.


Written by Matt

November 5th, 2008 at 6:03 pm

Cascade 1.0.2.860 released!

1 comment

Cascade 1.0.2.860 has been released!  This release fixes a variety of bugs, some minor and some major, in Cascade 1.0.1.  It also provides several important performance enhancements, especially to checkpoint and commit operations.  A list of specific noteworthy fixes is provided in the release notes.  Try it now!


Written by Matt

October 29th, 2008 at 3:16 am

Windows vs. Unix File System Semantics

4 comments

One of the challenges in implementing a cross-platform file system driver such as Cascade File System is dealing with the many differences, small and large, between how Windows, Linux, and Macintosh file systems work.  Some of these differences are well-known and obvious, but there are a lot of other interesting differences underneath the covers, especially when you get down into the file system driver kernel interfaces.

Let’s start with the most obvious one: case sensitivity.  Linux has a case sensitive file system.  Windows has a case-preserving but case-insensitive file system.  Or, at least it looks like Windows does!  But in reality Windows supports both.  Check out the documentation for the NtCreateFile API, the native NT API that the Win32 API CreateFile maps to.  By setting or not setting OBJ_CASE_INSENSITIVE, you can select which type of name lookup you prefer.  It’s really up to the individual file system to decide how to interpret all these flags, though.  Some Windows file systems, like the original FAT, aren’t even case-preserving.

The Macintosh has been Unix-based since OS X, but its HFS file system has traditionally been case-insensitive and case-preserving, just like Windows.  More recently, Apple has allowed HFS to be formatted either way, as either case-sensitive or case-insensitive, but the default remains case-insensitive.

The issue of case sensitivity brings up another issue: internationalization.  Windows, Linux, and Macintosh all support Unicode in paths; Windows encodes them as UTF-16 in all of its native APIs, whereas Linux and Macintosh use UTF-8.  A problem: it’s possible for two non-identical Unicode strings to correspond to the same sequence of characters.  That is, certain characters can be legally encoded in more than one way.  Macintosh therefore requires all UTF-8 filenames to be stored in a canonicalized format, which prevents you from creating two files in the same directory with the same name but different character encodings.  Windows and Linux do not; this can cause interoperability problems when moving data back and forth between platforms.
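To make the "same characters, different encodings" problem concrete, here is a minimal Python sketch.  The function name same_name is made up for illustration, and NFD is chosen only as an example canonical form; which form a particular file system actually stores is an assumption here.

```python
import unicodedata

# Two spellings of the same visible name.
composed = "caf\u00e9"      # 'é' as one precomposed code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)


def same_name(a: str, b: str, form: str = "NFD") -> bool:
    """Compare two names after normalizing both to one canonical form.

    The choice of form is an illustrative assumption, not a statement
    about what any specific file system does.
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # The raw strings (and their UTF-8 bytes) differ...
    print(composed == decomposed)
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
    # ...but they compare equal once canonicalized.
    print(same_name(composed, decomposed))
```

A file system that compares raw bytes would happily store both names side by side; one that canonicalizes first would treat the second create as a collision.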

There are several challenges in doing case-insensitive string comparisons in a Unicode-capable file system.  NTFS on Windows adopts the following approach: two strings are compared by converting them both to uppercase first, then comparing them for exact equality.  The conversion to uppercase is done using a 64K-entry, 128KB table stored on the volume and filled in when the partition is formatted; this ensures that the comparisons do not break (which could cause two files’ names to start colliding) when new characters are added to Unicode and someone upgrades their OS.
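The table-driven comparison can be sketched in a few lines of Python.  This is only an illustration of the idea: the table below is filled in from Python's own str.upper() as a stand-in, which is not the actual table NTFS writes to a volume, and the helper names are hypothetical.

```python
from array import array
from typing import List


def build_upcase_table() -> array:
    """Build a 64K-entry table mapping each 16-bit code unit to uppercase.

    Stand-in contents: Python's str.upper(), kept only when the result is
    a single 16-bit value.  Everything else maps to itself.
    """
    table = array("H", range(0x10000))  # identity mapping to start
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code units are left unchanged
        up = chr(cp).upper()
        if len(up) == 1 and ord(up) < 0x10000:
            table[cp] = ord(up)
    return table


def utf16_units(name: str) -> List[int]:
    """Split a name into its UTF-16 code units."""
    data = name.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def names_equal(a: str, b: str, table: array) -> bool:
    """Uppercase both names through the table, then compare exactly."""
    return [table[u] for u in utf16_units(a)] == [table[u] for u in utf16_units(b)]


if __name__ == "__main__":
    table = build_upcase_table()
    print(names_equal("ReadMe.TXT", "readme.txt", table))
    print(names_equal("a.txt", "b.txt", table))
```

The important design point is that the table is data, frozen at format time, rather than whatever the running OS currently thinks "uppercase" means.  Two names that did not collide yesterday cannot start colliding after an upgrade.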

Windows uses backslashes as path separators, while Linux and Macintosh use forward slashes.  Most of the Win32 APIs allow you to specify forward slashes and will do the conversion for you, but once you get into the NT kernel and the other low-level APIs, backslashes are mandatory.

This in turn means that the set of legal filenames differs between the operating systems.  On Linux, for example, you can create a file whose name contains a backslash, while on Windows you cannot.  Linux is very permissive about the legal character set, but Windows has a lot of extra restrictions.  A filename cannot end with a space or a period; there are a number of reserved names like COM1 and NUL; and several other non-path-separator characters like <, >, :, ", and | are reserved.
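A cross-platform tool can check names against the Windows rules before it ever tries to create the file.  The following Python sketch is a rough checker, not an authoritative one: the function name is hypothetical, the character and device-name lists follow Microsoft's published Win32 naming conventions as I understand them, and individual file systems or APIs may be more or less strict.

```python
# Characters Win32 reserves in file names (includes both path separators).
RESERVED_CHARS = set('<>:"/\\|?*')

# Legacy DOS device names; they stay reserved even with an extension added.
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)


def windows_name_problems(name: str) -> list:
    """Return a list of reasons a single path component is not portable to Windows."""
    problems = []
    if not name:
        problems.append("empty name")
        return problems
    if any(ch in RESERVED_CHARS or ord(ch) < 32 for ch in name):
        problems.append("contains a reserved or control character")
    if name.endswith((" ", ".")):
        problems.append("ends with a space or a period")
    stem = name.split(".")[0].rstrip(" ").upper()
    if stem in RESERVED_NAMES:
        problems.append("uses a reserved device name")
    return problems


if __name__ == "__main__":
    for candidate in ["notes.txt", "a\\b", "report.", "NUL.txt", "what?.c"]:
        print(repr(candidate), windows_name_problems(candidate))
```

All of the rejected names above are perfectly legal on Linux, which is exactly why a repository created there can fail to check out on Windows.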

Windows has drive letters, Linux and Macintosh have mount points.  Actually, inside the NT kernel and inside kernel drivers, there is really no such thing as a drive letter.  A “drive letter” is nothing other than a symbolic link in the NT kernel namespace, e.g., from \DosDevices\X: to \Device\cfs.  When you call CreateFile with a path x:\foo.txt, the driver owning the \Device\cfs namespace simply sees a request for \foo.txt.  But for practical purposes, this is still important.  A Windows path needs to be interpreted differently depending on whether it’s a drive letter path or a UNC path.  A Windows file system can be ripped away from applications with files open by removing the symbolic link, whereas a Unix file system cannot be unmounted if files are still open.
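As a small user-level illustration of the drive letter vs. UNC distinction, Python's standard ntpath module (usable on any platform) can split the two forms apart.  This is plain string parsing, not what the NT object manager does with its symbolic links, and classify_windows_path is a hypothetical helper name.

```python
import ntpath


def classify_windows_path(path: str) -> str:
    """Classify a Windows-style path as 'unc', 'drive-letter', or 'other'."""
    # Win32 accepts forward slashes, so normalize them first.
    path = path.replace("/", "\\")
    drive, _rest = ntpath.splitdrive(path)
    if drive.startswith("\\\\"):
        return "unc"            # \\server\share\...
    if len(drive) == 2 and drive[1] == ":":
        return "drive-letter"   # x:\...
    return "other"              # relative, or rooted on the current drive


if __name__ == "__main__":
    for p in [r"x:\foo.txt", r"\\server\share\foo.txt", r"\foo.txt", "x:/foo.txt"]:
        drive, rest = ntpath.splitdrive(p.replace("/", "\\"))
        print(classify_windows_path(p), repr(drive), repr(rest))
```

Note that for x:\foo.txt the remainder after the drive is \foo.txt, which matches the path the driver sees in the example above.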

The Windows cache manager holds files open.  When you close the last handle to a file, from the file system driver’s point of view, the file may still be open.  This makes it very difficult to unload a Windows file system driver without a reboot.  Unmounting it, i.e., removing the drive letter symbolic link, is easy, but until memory pressure forces the Windows cache manager to flush its cached mappings, those cached mappings may stay around indefinitely.

Permissions are very different.  Linux has the standard Unix “UID, GID, mode bits” permissions model, and Macintosh inherits this also.  Both have added ACL-based permissions later, but their use is often not considered very mainstream.  Windows, on the other hand, is thoroughly ACL-based.  Every file in a Windows file system has a security descriptor that includes the ACL.  The permissions are far more elaborate than just “read, write, execute”; there are over a dozen types of permissions that you can be granted or denied.

Other file attributes are also different.  Windows has a standard list of file attribute bits like “archive”, “hidden”, and “system” that go back to the DOS era.  There is no equivalent to these on Unix.  All of the systems support a more generic “extended attribute” system, however.

Linux doesn’t have multiple data streams per file.  One of the defining properties of Unix, going back to its very beginnings, is that a file is just an array of bytes.  Windows, however, allows a file to have multiple “data streams”, while Macintosh supports a similar “resource fork” feature.  Apple now discourages the use of resource forks, but multiple data streams continue to be an important feature on Windows in some cases.  For example, Internet Explorer attaches an alternate data stream to each file you download to indicate where you downloaded it from.  When you later try to run an app that was downloaded from an untrusted zone, you will get a warning asking you whether you really want to do that.

Windows has limited symbolic link support.  Windows has “reparse points”, which are like symbolic links for directories only, with some other caveats; but they are supported poorly by many applications.  Vista adds something closer to real Unix symbolic links, though again with some limitations.

NtCreateFile() on Windows throws in the kitchen sink.  This API has a lot of flexibility that doesn’t exist in the Unix open() system call.  For better or worse, just about everything goes through it.  For example, there is no equivalent to mkdir() on Windows.  Instead, NtCreateFile takes a flag requesting that a directory, rather than a file, be created in the event that the path lookup fails.  It also supports a number of other random features, like delete-on-close files.

The Windows delete and rename model is different.  You wouldn’t know this from the Win32 APIs, but in order to delete or rename a file in Windows, you first have to open it!  Once you’ve opened it, you can call NtSetInformationFile with an InformationClass of FileDispositionInformation or FileRenameInformation.  Setting FileDispositionInformation doesn’t even delete the file; it merely enables delete-on-close for the file, and the delete-on-close request could very well be cancelled later.

File sharing restrictions and locking are different.  Unix generally avoids the idea of restricting what can be done with a file just because someone else is using it.  Having a file open doesn’t prevent it from being unlinked, and two people can open the same file for writing.  On Windows, all of this is true in theory — you can request whatever sharing mode you want when you open a file — but in practice, most applications use restrictive sharing modes, preventing two apps from using the same file at the same time.  Inside a single file, we also have byte range locking.  Windows uses mandatory locking: if someone else has the bytes locked, an attempt to modify those bytes with WriteFile() will fail (but this is not enforced for memory-mapped files!).  Unix uses only advisory locking and makes no effort to error-check read() or write() calls; it assumes that the application will be responsible and won’t touch data it hasn’t first locked.
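The Unix half of this is easy to see for yourself.  The Python sketch below assumes a Unix-like system such as Linux or Macintosh; on Windows the unlink call would be expected to fail while the file is still open with a typical sharing mode.  The function name is made up for the example.

```python
import os
import tempfile


def read_after_unlink(payload: bytes):
    """Write to a temp file, unlink it while it is still open, then read it back.

    Assumes Unix semantics: removing the name does not disturb the open file.
    Returns (name_is_gone, data_read_back).
    """
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+b") as f:
        f.write(payload)
        f.flush()
        os.unlink(path)                      # the directory entry disappears...
        name_is_gone = not os.path.exists(path)
        f.seek(0)
        data = f.read()                      # ...but the open file is still readable
    return name_is_gone, data


if __name__ == "__main__":
    print(read_after_unlink(b"still here"))
```

The file's storage is only reclaimed once the last open descriptor is closed, which is the behavior a cross-platform file system has to reproduce on one side and prevent on the other.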

This list of differences could go on and on.  It’s a challenge to make sure that CFS supports all of the important file system semantics correctly across the platforms, especially because the revision control systems CFS builds on often have different semantics of their own that don’t quite match the standard file systems.


Written by Matt

October 21st, 2008 at 5:24 pm