Thursday 23 February 2012

Shotgun Initialization

Today I had a debate with a colleague about initializing local variables in C.  I have had this debate many times over the last 20 or more years, so I thought I should spell out what is the best thing to do - or, at least, my opinion of the best thing.

The Problem

Uninitialized variables are a perennial problem in C.  Writing through uninitialized (but not necessarily NULL) pointers is particularly nasty, as it can overwrite memory anywhere (notwithstanding hardware memory protection).  Of course, any variable can be uninitialized, so we want to ensure that all variables are set to some (correct) value before they are actually used.
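
As a contrived sketch of how this goes wrong (the function h() here is purely illustrative, not from any real code):

void h(void)
{
  char *p;      /* uninitialized: holds whatever value was left on the stack */
  *p = 'x';     /* writes through that garbage address - it might crash
                   immediately, or it might silently corrupt unrelated data */
}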

Years of bad experiences with this problem means that experienced C programmers often use a form of defensive programming where steps are taken to avoid the possibility of using memory that has not been set to some value.  For example, I have seen malloc() #defined so that the real malloc() is bypassed in favour of calloc().  The standard library function calloc() is useful because it initializes the allocated memory with zero, but it is not always appropriate.
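
As a rough sketch of the kind of macro I mean (the wrapper name zeroing_malloc() is made up for illustration):

#include <stdlib.h>

/* Divert every malloc() call so that the memory comes back zero-filled. */
static void *zeroing_malloc(size_t size)
{
  return calloc(1, size);    /* calloc() zero-fills the block it allocates */
}

#define malloc(size) zeroing_malloc(size)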

Defensive Programming

Defensive programming is part of the culture of C.  For example, it is common to see code like this:

  for (i = 0; i < 10; ++i)   // defensive
    a[i] = f(i);

rather than:

  for (i = 0; i != 10; ++i)  // non-defensive
    a[i] = f(i);

Now these should do the same thing unless by some strange action the variable 'i' somehow skips having the value 10, in which case the non-defensive code will go into an infinite loop, corrupting memory until the program terminates due to a memory exception or something else nasty happens.  How could 'i' possibly do this?  Well, stranger things have happened.  It may be due to errant behavior of the function f().  (Note that I use the literal 10 for simplicity.  In real code you would not use a magic number but something like sizeof(a)/sizeof(a[0]).)

So which is better?  Well, at least when debugging, the non-defensive code is definitely better as the defensive code will mask the symptoms of the bug in f() and you may never be aware of it.

For a release build the defensive code is arguably better.  But surely it is better to find and fix the bug (a nasty one probably lurking in f()) than try to recover (perhaps unsuccessfully) from it?

Don't get me wrong.  I think defensive programming is useful.  It is preferable to use software that might behave a little strangely once in a while, as long as it doesn't crash and lose all your data.  However, as we saw, defensive programming can hide bugs.  So you might find that you were using a word processor which did some odd things but seemed to recover and allowed you to save your document - but you might not discover till much later that the document you saved was actually corrupted.

The best strategy is to be defensive and (at least in debug builds) also try to detect the bugs you are defending against.  I use code like this:

  for (i = 0; i < 10; ++i)
    a[i] = f(i);
  assert(i == 10);
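
(Remember that assert() comes from <assert.h> and compiles away to nothing when NDEBUG is defined, so this extra check costs nothing in a release build.)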

Uninitialized Variables

To get back to the point of this article, consider a function like this:

void g()
{
  int i;
  ...
  i = f();
  ...
  use(i);
  ...
}
The worry is that the initialization of 'i' is bypassed somehow and then 'i' is used with an invalid value.  Personally, I think, as far as possible, variables should be initialized with their proper value when they are declared, like this:

  int i = f();
  ...
  use(i);
  ...

but this is not always possible in conventional C (at least in the more commonly used C89), which requires you to declare variables at the start of a compound statement.  Luckily, this is not required in C++, where declaring a variable before you know its proper value is even actively discouraged.  (You can still have the problem in C++ if the initial value of all members is not known at the time of construction.)  A greater part of the problem is that convention (and many C coding standards) require all local variables to be declared at the start of the function.
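
Even in C89 you can get part of the way there by opening a new block so the declaration sits next to the code that gives the variable its value.  A rough sketch, reusing the f() and use() from above:

void g()
{
  ...
  {                  /* a new compound statement, so 'i' can be declared */
    int i = f();     /* right where it gets its real value */
    use(i);
  }
  ...
}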

Debuggability

The result is that many C programmers take the shotgun approach:

void g()
{
  int i = 0; // shotgun initialization
  ...
  i = f();
  ...
  use(i);
  ...
}
The trouble is that zero is not the correct initial value for 'i'.  The only good to come from this is that it makes the value of 'i' deterministic, since otherwise it could contain any random value that happened to be left over in the memory that is used for the local variables.

Making the code deterministic makes it more debuggable (see my previous blog post on software quality attributes), so it can save time in tracking down a bug, because the bug is easy to reproduce.  However, it can also make the code less verifiable, since it can hide the symptoms of the bug.  It may be better to initialize the variable to some strange value so that it is obvious when it has not been initialized properly.

This is the reason that Microsoft VC++ debug builds initialize all local variables with the hex value 0xCC (uninitialized heap memory is filled with 0xCD).  This obviates the need for shotgun initialization in that environment, since the run-time behavior cannot randomly change.
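
A contrived sketch of what that looks like (a real compiler will warn about this code, and the VC++ run-time checks may well stop on the printf(), but it shows the idea):

#include <stdio.h>

void g(void)
{
  int i;                          /* deliberately never given a value */
  printf("%08X\n", (unsigned)i);  /* in a VC++ debug build the stack slot has
                                     been filled with 0xCC bytes, so this
                                     typically prints CCCCCCCC rather than a
                                     different random value on each run */
}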

Disadvantages

We saw that shotgun initialization has the advantage of increasing code debuggability, but it also has many disadvantages.

First, it can make the bug harder to detect (the software is less verifiable).  If you are debugging and 'i' has the (incorrect) value of zero, you may think that it has been set correctly.  But if it contained 0xCCCCCCCC your interest should be immediately aroused, and it is more likely to trigger an assertion or a run-time error.

Another problem is that when you initialize the variable to some arbitrary value it becomes harder for the compiler to tell you that something is wrong.  Many compilers will warn if you use a variable that may not have been initialized, but the explicit initialization silences that warning.
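
For example (f() and use() are the same hypothetical functions as before, and gcc with -Wall is just one compiler that can do this):

int g(int cond)
{
  int i;           /* no shotgun initialization */
  if (cond)
    i = f();
  return use(i);   /* gcc -Wall (with optimization enabled) typically warns
                      something like "'i' may be used uninitialized" here;
                      write "int i = 0;" above and the warning goes away,
                      but the bug on the !cond path is still there */
}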

Further, if the code that uses 'i' is removed at a later date, the initialization will prevent many compilers from warning of an unused local variable.  Because 'i' is initialized, it is not flagged as being unused.

However, my biggest concern is that the code becomes confusing.  The initialization of 'i' to zero serves no purpose and may confuse someone reading the code.  One of the biggest code smells is finding redundant code.  I have often found that such redundancy inhibits changes to the code and can itself lead to bugs.

Last (and least), it is less efficient to write the same memory twice - once with zero and then again with the real initial value.  One of the few things I dislike about C# is that you cannot avoid having an array initialized to zeros, even if you are going to set the elements to some other value.  For a large array this can take an appreciable amount of time.
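
To make the cost concrete (N and f() are just placeholders), the zero-then-overwrite pattern touches every element twice:

  int *a = calloc(N, sizeof *a);   /* first pass: every element set to zero */
  for (i = 0; i < N; ++i)
    a[i] = f(i);                   /* second pass: every element set again */

rather than:

  int *a = malloc(N * sizeof *a);  /* no zero-fill */
  for (i = 0; i < N; ++i)
    a[i] = f(i);                   /* each element written exactly once */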

The Bug is The Problem

The bottom line is that rather than trying to handle bugs gracefully, the first priority is not to have bugs.  Shotgun initialization can even hide bugs and makes the code harder to understand (and hence modify).  There are few, if any, advantages to using it and these are far outweighed by the disadvantages.
