The Conifer Systems Blog

The Voodoo Magic School of Software Engineering


Computers are deterministic.

Well, OK, not always. A single-threaded, sequential program that doesn’t interface with the outside world is deterministic, but once user input, hardware devices, and networking get involved, they introduce asynchronous events whose timing and ordering can be nondeterministic. The same is true if two programs, each independently deterministic, write to the same files in the file system at the same time. Preemptive multithreading adds more nondeterminism, and computers with multiple CPU cores add still more.

At the same time, we have also built a whole family of software abstractions that take these initially nondeterministic constructs and push them back into the realm of determinism. Yes, user input is asynchronous, but if the window system puts it in a queue instead of giving us a callback, and we only have a single thread pulling messages out of that queue, we’re back to deterministic behavior. The arrival of network packets is nondeterministic, but if we use blocking TCP sockets, the operating system takes care of most of the dirty work for us and all we have to do is send and receive bytes. A 3D graphics processor works in parallel with the CPU, but when you call glReadPixels, the OpenGL driver first waits for the previous rendering to finish so it doesn’t give you random incompletely rendered garbage.
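To make the blocking-socket case concrete, here’s a minimal C sketch, assuming POSIX sockets (the read_exactly() helper is a name I made up for illustration, not anything from the post).  However the packets happen to arrive on the wire, a loop around blocking recv() hands the caller a plain, in-order byte stream:

    /* Read exactly `len` bytes from a connected socket, blocking as
     * needed.  Returns 0 on success, -1 on error or early EOF. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stddef.h>

    int read_exactly(int fd, char *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(fd, buf + got, len - got, 0);
            if (n <= 0)
                return -1;    /* connection closed, or an error */
            got += (size_t)n; /* packets may arrive in odd-sized chunks... */
        }
        return 0;             /* ...but the caller always sees len bytes, in order */
    }

The nondeterminism hasn’t vanished; the kernel has simply absorbed it, which is exactly the kind of abstraction described above.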

Multithreading remains a hard case, but you aren’t forced to use it, and with careful use of mutexes, semaphores, etc., multithreaded programs can be perfectly deterministic.  We can also hope that future programming environments will be able to automatically extract more parallelism from our software, so that we won’t have to write multithreaded code to get speedups on multi-core CPUs.
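As a small illustration of the mutex point, here’s a sketch using POSIX threads (my example, not the author’s): the two threads interleave differently on every run, yet the mutex guarantees the same final answer.

    /* Compile with -pthread.  The interleaving of the two workers is
     * nondeterministic, but the mutex makes each increment atomic, so
     * the program always prints exactly 2000000. */
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS 1000000

    static long count = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);
            count++;                  /* protected: no lost updates */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("%ld\n", count);
        return 0;
    }

Delete the lock and unlock calls and increments start getting lost: the printed total then varies from run to run, which is exactly the kind of nondeterminism at issue.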

Even if your program has nondeterministic behavior, though, one simple fact remains: everything that happens on a computer happens for a reason.  There is a logical explanation for why it happened.  (That explanation may be “hardware failure” or “cosmic ray”, but these are rare, and even so, they are still explanations.)  It may be hard to determine that reason, but with enough digging, it is usually possible.

So, it may be tempting to take a sledgehammer to your computer when it misbehaves, but when we have our “professional software engineer” hats on, we ought to be able to do better.

Let’s take GNU make as an example.  If you’ve read our whitepaper, or even if you’re just an occasional user of GNU make, you probably know that occasionally a build fails “for some weird inexplicable reason.”  You type “make”, and instead of a correct binary popping out, you get a strange error message.  Or, worse, it looks like it succeeded, but you get a binary that doesn’t run.  So you type “make clean” and “make”, and hope it will work the next time around.

So say that second try fails… what now?  Well, you can send an email to other people and see if they’re hitting the same problem.  You can look at the status of your automated build system and see whether it’s hitting the same failure.  But suppose the problem really does seem to be unique to your computer.  Your source control tree might be corrupted in some strange way.  (This has happened to me a good number of times.)  You might type “p4 sync -f”, but even that only fixes files that exist in the depot — it doesn’t delete local files that shouldn’t exist at all.  So maybe you “rm -rf” your entire tree and get a brand new one from scratch.  Perhaps that will fix it!  If it doesn’t, well, then you’re really in trouble…

At this point it’s tempting to throw up your hands in despair and admit defeat.  But if we really want to keep going… well, does it happen if I do the build on another computer?  I could reinstall my OS — maybe that will fix it?

This is, in a nutshell, what I call the “voodoo magic school of software engineering.”  We run into a problem — a build fails.  Rather than understanding the problem, determining its root cause, and addressing the root cause so it won’t happen again, we flail about, trying random stuff in the hope that something will magically fix it.  When we get to the point of “let’s reinstall the OS — maybe that will help!”, we might as well just be sticking pins in a voodoo doll.

We may well “fix” the problem this way, but we won’t ever determine the root cause.  If we don’t determine the root cause, how can we know that the problem won’t happen again to someone else or even to ourselves?  How do we know that our “fix” really fixed the problem and didn’t just paper it over, ready to reemerge at any moment?

We don’t need to track down the root cause of everything that goes wrong on a computer.  When I load some random web page and it comes up garbled, I usually just hit “refresh” and often it magically works on the second try.  Or, sometimes I come back to the site later and hope that the site owner has fixed their bug.  Either way, I don’t waste time trying to figure it out or even report the problem, unless the web site is really important to me.

That’s all well and good when we’re talking about someone else’s web page.  Someone else’s web page — someone else’s problem.  But when we’re talking about our own software, our own web page, our own build flow — anything that affects our day-to-day workflow, or anything that we ship to customers and must take responsibility for — we owe it to ourselves as professional software engineers to drill down into problems and address their root causes so they don’t happen again.

When your build fails, it doesn’t fail for an “inexplicable reason.”  Sometimes your tree really is corrupted and “rm -rf” is the correct answer, but even if so, how did your tree get corrupted in the first place?  For example, did some previous build step delete some files by accident?  Did some runaway script corrupt it?  Sometimes the root cause may just turn out to be user error, but even so, it’s helpful to understand our own mistakes so we don’t repeat them.

But more often, when these things happen, I find that there is a real problem to be fixed.  A common example is a makefile with incomplete dependencies.  If you’re missing a dependency, things may not be rebuilt correctly, and there’s no telling what might go wrong after that.  Tracking down the missing dependency isn’t a waste of time.  It’s an investment in the future.
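To see what an incomplete dependency looks like, here’s a contrived GNU make fragment (foo.c and config.h are hypothetical file names; recipe lines must be indented with a literal tab).  Suppose foo.c includes config.h:

    # Broken: the rule only lists foo.c.  Edit config.h, type "make",
    # and foo.o is silently left stale:
    #
    #     foo.o: foo.c
    #             cc -c foo.c -o foo.o
    #
    # Fixed: list the header too, and make rebuilds foo.o when it changes.
    foo.o: foo.c config.h
            cc -c foo.c -o foo.o

In practice, compilers can generate these header dependencies for you (GCC’s -MMD flag, for instance), which is far more reliable than maintaining them by hand.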


Written by Matt

September 11th, 2008 at 2:14 pm

2 Responses to 'The Voodoo Magic School of Software Engineering'


  1. […] Normally, when debugging, I would recommend starting from first principles and trying to understand the problem’s root cause before doing anything else.  Without a complete understanding of the bug, your chances of implementing a complete and correct fix are poor.  The last thing you want to do is apply a random fix that “seems to work” without understanding why it works; at that point, you’re a member of the Voodoo Magic School of Software Engineering. […]

  2. […] for the root cause of each problem and fix the root cause so the problem cannot happen again.  (I’ve written previously on this topic.)  One simple method Toyota adopted to identify root causes is called the “5 Whys.”  […]
