Replication vs. Caching

Both Cascade File System and Cascade Proxy use caching to speed up access to slow repositories.  The first time someone accesses a file, it is downloaded from the appropriate server.  Subsequent accesses pull the file from the cache instead.  Eventually the cache will fill up, and as new files are downloaded, old files are automatically evicted from the cache.
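
The behavior described above can be sketched in a few lines of Python. This is only an illustration, not the actual Cascade implementation: the least-recently-used eviction policy, the FileCache class, and the fetch callback are all assumptions made for the example.

```python
from collections import OrderedDict


class FileCache:
    """Toy read-through cache with least-recently-used eviction (an assumed policy)."""

    def __init__(self, capacity_bytes, fetch):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch            # hypothetical callable: path -> bytes from the slow server
        self.entries = OrderedDict()  # path -> bytes, least recently used first
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0

    def read(self, path):
        if path in self.entries:
            # Subsequent access: serve from the cache and mark as recently used.
            self.hits += 1
            self.entries.move_to_end(path)
            return self.entries[path]
        # First access: download from the server and remember the result.
        self.misses += 1
        data = self.fetch(path)
        self.entries[path] = data
        self.used_bytes += len(data)
        # Cache is full: evict the least recently used files until it fits again.
        while self.used_bytes > self.capacity_bytes and len(self.entries) > 1:
            _, evicted = self.entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        return data
```

With a 25-byte cache and 10-byte files, reading a, b, a, c produces three downloads and one cache hit, and b (the least recently used file) is evicted to make room for c.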

Caching is not the only way to speed up access to a slow repository.  Another option is replication: mirroring the contents of a repository on multiple servers.  You might set up a “master” repository at your main office and create replicas at each remote site.  As changes are committed to the master repository, they are mirrored over to the remote replicas.

To put it another way, caching is a “pull” model — data is pulled as it is requested — whereas replication is a “push” model — data is pushed as it becomes available, regardless of whether it has been requested.

Replication’s Advantages

Replication has one big performance advantage over caching: it accelerates the first access to a file, not just subsequent accesses.  Replication has a number of disadvantages, however, and even this advantage is not as clear-cut as it may seem, because combining caching with prefetching has much the same effect.

For example, if your developers start to come in to the office at 8 AM, you might kick off a prefetch at 7 AM of all the files they typically use, and those files will be locally cached before anyone arrives.  Or, if you work from home, you could start a prefetch each day in the afternoon, and by the time you get home most of the files you need will already be cached.  You don’t need to prefetch everything, just the most important files, and this sort of prefetching can be automated using tools like “cron”.
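
A prefetch job of this kind might look like the following sketch. It assumes that simply reading a file through the cached file system is enough to pull it into the cache, as with any first access. The script name and the directories passed to it are hypothetical.

```python
import sys
from pathlib import Path


def prefetch(paths, chunk_size=1 << 20):
    """Read each file once so that a read-through cache downloads it.

    Returns (files_read, bytes_read). Missing or unreadable files are skipped.
    """
    files_read = 0
    bytes_read = 0
    for path in paths:
        try:
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(chunk_size)
                    if not chunk:
                        break
                    bytes_read += len(chunk)
            files_read += 1
        except OSError:
            continue  # skip files that were deleted or cannot be read
    return files_read, bytes_read


def files_under(root):
    """Yield every regular file below a directory."""
    return (p for p in Path(root).rglob("*") if p.is_file())


if __name__ == "__main__":
    # Each command-line argument is a directory whose files should be warmed.
    total_files = total_bytes = 0
    for root in sys.argv[1:]:
        n, b = prefetch(files_under(root))
        total_files += n
        total_bytes += b
    print(f"prefetched {total_files} files, {total_bytes} bytes")
```

A crontab entry such as "0 7 * * 1-5 python3 /path/to/prefetch.py /path/to/mounted/project/src" (both paths are placeholders) would run it at 7 AM on weekdays.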

Another important feature of replication is that it doubles as an efficient way to do backups.  If you have an entire copy of your repository offsite, the chances that you will lose all your data are slim.  A cache may allow you to recover some files after a loss of data, but it is not a replacement for a real backup system.

Replication is also ideal for disconnected operation.  If you lose all network connectivity, having a full replica means you still have access to all the data.  In practice, however, disconnected operation is becoming less and less important, with Internet and wireless connectivity nearly ubiquitous.

Replication’s Disadvantages

On the flip side, replication has quite a few disadvantages that — if you are not using it to perform backups — usually outweigh its advantages, especially for large projects.

Building a New Replica

Starting from scratch, it can take a very long time to build a new replica.  In effect, you must replay each commit to the repository starting from the beginning.  For sufficiently large projects, it may not even be realistic to build an offsite replica purely over a network — you may be forced to build a replica at your main site, then physically ship the disks to the remote site.
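
To get a feel for the numbers, here is a rough back-of-the-envelope estimate. The repository size, link speed, and usable fraction are illustrative assumptions, not measurements. The estimate covers raw transfer time only, so it is a lower bound: replaying commits and protocol overhead would add to it.

```python
def transfer_days(repo_bytes, link_bits_per_second, usable_fraction=1.0):
    """Days needed to move repo_bytes over a link, ignoring protocol overhead."""
    seconds = repo_bytes * 8 / (link_bits_per_second * usable_fraction)
    return seconds / 86400


# Assumed example: a 2 TB repository over a 10 Mbit/s WAN link,
# of which only half is available for building the replica.
print(round(transfer_days(2e12, 10e6, usable_fraction=0.5), 1))  # 37.0
```

At more than a month of continuous transfer under these assumptions, shipping disks starts to look reasonable.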

Caching, on the other hand, has no such upfront costs.  The cache can be populated gradually over time, and the speedup from using the cache will grow as more files are populated into it.

Disk Space Cost

Each replica consumes the same amount of disk space as the main repository.  The more replicas you need, and the larger your repository, the more you will need to spend on disks to store the replicas.  This is usually acceptable for small projects, but once a repository grows large enough that it cannot typically fit on a single commodity, off-the-shelf hard drive, this starts to become troublesome.  (Among other things, it becomes impractical for developers who work at home to mirror the repository.)

Caching has no disk space requirement proportional to the size of the repository.  Larger caches can store more files, but even a modestly-sized cache can have large performance benefits.  It is practical to set up caches not just at a site-wide level, but also on an individual LAN.

Network Bandwidth Cost

Replication mirrors every change, whether it is needed or not.  As such, a replica is constantly consuming network bandwidth.  This can overload a remote office’s WAN link.  In the limit, it is even possible for replication to break down altogether if changes are being committed faster than the data can be mirrored.  Also, the mirroring places extra load on the master repository’s server.

Caching, on the other hand, will almost always decrease, not increase, WAN bandwidth usage.  A file is not downloaded unless it is really needed.
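
The difference can be illustrated with a deliberately simplified model. It assumes that whole files are transferred rather than deltas, that the cache never evicts anything, and that all reads happen after all commits. The function name and the sample workload are made up for the example.

```python
def wan_bytes(commits, reads):
    """Compare WAN traffic for push replication and pull caching.

    commits: list of (path, size_in_bytes) in commit order.
    reads:   list of paths read at the remote site after those commits.
    Returns (replication_bytes, caching_bytes).
    """
    # Replication pushes every committed change, requested or not.
    replication_bytes = sum(size for _, size in commits)

    # Caching pulls only the latest version of each file that is actually read,
    # and only the first time it is read.
    latest_size = {path: size for path, size in commits}
    cached = set()
    caching_bytes = 0
    for path in reads:
        if path in latest_size and path not in cached:
            caching_bytes += latest_size[path]
            cached.add(path)
    return replication_bytes, caching_bytes


commits = [("a", 100), ("b", 200), ("a", 150), ("c", 1000)]
reads = ["a", "a", "b"]
print(wan_bytes(commits, reads))  # (1450, 350)
```

In this workload the large file c is never read at the remote site, so the cache never downloads it, while the replica receives it regardless.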

Replication Lag

Replication is not immediate.  It takes time for a change to propagate from the master repository to the replicas.  Sometimes the lag may be small, but it may spike if several large changes are committed in a short period of time.  When you are working off a replica, you may think you are using the very latest “top of tree” source code, when in fact you may be any number of changes behind.  Depending on how it’s set up, the replica server might claim that the missing changes don’t even exist — if you ask it to check out a revision number that hasn’t replicated yet, it may give you an error message rather than waiting until the replication catches up to that revision number.
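
The way lag builds up can be sketched with a simple queue model. The tick length, byte counts, and function name are assumptions chosen for illustration. Each tick, newly committed bytes join the backlog and the link drains as much as it can.

```python
def replication_backlog(commit_bytes_per_tick, link_bytes_per_tick):
    """Return the unreplicated backlog (in bytes) at the end of each tick."""
    backlog = 0
    history = []
    for committed in commit_bytes_per_tick:
        backlog += committed
        backlog -= min(backlog, link_bytes_per_tick)  # the link drains what it can
        history.append(backlog)
    return history


# One large commit causes a spike that takes several ticks to clear.
print(replication_backlog([10, 10, 500, 10, 10], 100))  # [0, 0, 400, 310, 220]

# Commits that consistently outpace the link: the replica never catches up.
print(replication_backlog([150, 150, 150, 150], 100))   # [50, 100, 150, 200]
```

The second case is the breakdown scenario: when the sustained commit rate exceeds the link capacity, the backlog grows without bound.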

Replication and Commits

So far we’ve only discussed operations that read from the repository.  So what happens when we want to commit when we are using replication?  (Note that with caching, commits work exactly the same way they do without caching.)

Up to this point, our discussion of replication has assumed “read-only” replication, where there is a single master repository and a number of read-only replicas.  Commits go only to the master repository, and the changes are then mirrored to the read-only replicas.

For purposes of commits, read-only replication should be just like caching: it shouldn’t affect them.  Unfortunately, it isn’t always quite that simple.  Often, the read-only replica server has a different hostname and IP address than the master, and it may not be straightforward to convince the client to send the commit to the master rather than to the replica.  If you attempt to commit to the replica, your change will be rejected.  This is a solvable problem, but it affects some real replication-based systems.

We haven’t discussed another possible type of replication: “read-write” or “multi-master” replication, where commits can go to more than one repository, and replication goes in more than one direction.

There are a few major problems with read-write replication.  The first is that the replicas are not identical — the order in which changes are applied differs from replica to replica.  Suppose someone on the US team says that they could reproduce a particular bug at revision 10000.  The Europe team’s revision 10000, however, might not be the same as the US team’s revision 10000.  The change that was applied at revision 10000 on the US replica might be applied at revision 10005 on the Europe replica.  But if the Europe team uses revision 10005, now they’ve picked up 5 other changes.
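
The revision 10000 example can be reproduced with a small sketch. It assumes that each replica numbers changes sequentially in the order it applies them. The change names are hypothetical.

```python
def assign_revisions(first_revision, changes_in_arrival_order):
    """Number changes in the order one replica applies them."""
    return {change: first_revision + i
            for i, change in enumerate(changes_in_arrival_order)}


# Hypothetical changes: "us-fix" is committed in the US, the "eu-*" changes in Europe.
# Each replica applies its local commits before the ones that arrive by replication.
us_order = ["us-fix", "eu-1", "eu-2", "eu-3", "eu-4", "eu-5"]
eu_order = ["eu-1", "eu-2", "eu-3", "eu-4", "eu-5", "us-fix"]

us_revs = assign_revisions(10000, us_order)
eu_revs = assign_revisions(10000, eu_order)

print(us_revs["us-fix"])  # 10000
print(eu_revs["us-fix"])  # 10005

# Changes the Europe team picks up by syncing to its own revision 10005
# that are not present in the US tree at revision 10000.
us_tree = {c for c, r in us_revs.items() if r <= us_revs["us-fix"]}
eu_tree = {c for c, r in eu_revs.items() if r <= eu_revs["us-fix"]}
print(sorted(eu_tree - us_tree))  # ['eu-1', 'eu-2', 'eu-3', 'eu-4', 'eu-5']
```

The same change has two different revision numbers, and no single revision on the Europe replica corresponds exactly to the tree the US team tested.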

This is harmless in some cases, but software is complex.  It is often the case that two changes must be applied in a particular order.  If the order is reversed, the software may not work.  Even ignoring order dependencies, this tends to make it more difficult for the teams to collaborate.  They are not working on the same codebase — they are working on slightly different codebases that diverge and are merged back together.

The “merge” is where the second major problem comes into play.  Suppose two people edit the same file on two different replicas.  At replication time, someone will need to merge these changes with one another.  In many cases the changes can be merged automatically by a computer program, but again, software is complex; even if change A and B don’t edit the same lines of code, it’s quite possible for change A by itself to work, change B by itself to work, and the combined change A+B to contain a bug.  Without a human looking at the changes, there is always an element of risk.
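
Here is a contrived illustration of a clean merge that still produces a bug. The merge function is a toy that handles only line-for-line replacements, not a real merge tool. The source file being merged is invented for the example.

```python
def merge3(base, a, b):
    """Toy line-by-line three-way merge for edits that only replace lines."""
    merged = []
    for base_line, a_line, b_line in zip(base, a, b):
        if a_line != base_line and b_line != base_line and a_line != b_line:
            raise ValueError("conflict: both changes edit the same line")
        merged.append(a_line if a_line != base_line else b_line)
    return merged


def retry_ms(lines):
    """Run the given source lines and return what their retry_ms() computes."""
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["retry_ms"]()


base = [
    "TIMEOUT = 30",                      # seconds
    "def wait_ms():",
    "    return TIMEOUT * 1000",
    "def retry_ms():",
    "    return 1000",
]

# Change A: store the timeout in milliseconds and update its only user.
change_a = list(base)
change_a[0] = "TIMEOUT = 30000"
change_a[2] = "    return TIMEOUT"

# Change B: retry after half the timeout, written against the seconds-based constant.
change_b = list(base)
change_b[4] = "    return TIMEOUT * 1000 // 2"

merged = merge3(base, change_a, change_b)  # merges cleanly, no overlapping lines

print(retry_ms(change_a))  # 1000
print(retry_ms(change_b))  # 15000
print(retry_ms(merged))    # 15000000
```

Each change is correct on its own, and the merge reports no conflict, yet the combined code retries after 15,000 seconds instead of the intended 15.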

Worse, of course, are the cases where the changes clearly do touch the same lines of code, or where two people have rewritten the same code in two different and incompatible ways.  In extreme cases, changes may simply not be mergeable in any straightforward fashion, and someone’s work may have to be thrown away.

Either way, when humans have to look at the merges, replication lag goes from being an annoying problem to a much more severe problem.  Sometimes it is safe to replicate changes out of order — if one change is particularly hard, you can skip replicating it and keep going on subsequent changes, and let someone go back and tackle the hard one later.  Again, though, there may be order dependencies between changes; skipping a change tends to make merge conflicts more difficult to resolve; and the longer a merge is put off, the harder it gets, because that gives the trees an opportunity to diverge further.

Replication lag also becomes more harmful with read-write replicas.  Instead of just lagging behind top-of-tree, people are doing commits relative to outdated code.  The greater the lag, the greater the probability that these merges will be problematic.

Given the problems with read-write replication, we believe that if you do choose to implement replication, you are probably better off sticking with read-only replication.

Conclusion

For accelerating offsite development, caching, especially when combined with intelligent prefetching, provides most of the advantages of replication without its many disadvantages.  Setting up caches is cheap and easy.  Replication is best suited for offsite backups, not for accelerating offsite development.
