Friday, 19 April 2013

Alignment and #pragma pack()

A nasty problem that can occur in C/C++ is bad alignment when passing structs as parameters. For example, a provider of a module may provide a header file with structs having built their library or DLL using the default alignment setting. If the consumer of the module uses the header file with a different alignment setting then all sorts of nasty problems can occur since the caller and the callee have a different idea of the layout of the structure in memory. Alignment is generally controlled by using a command line switch (eg, /Zp in MSC) or with the #pragma pack() compiler directive.

How should this sort of problem be avoided? There are arguments on both sides.

Consumer: The provider should protect the use of their header (using #pragma pack) to make it independent of the current alignment setting.

Provider: The client should make sure the default alignment setting is active before #including header files.  If it needs to be changed before a struct declaration then it should be set back to the default immediately afterwards.

So who is correct? Actually, both sides have good points. But if either had followed my recommendations (see below) the problem would have been avoided.

What is padding for?

The C standard allows a compiler to add padding bytes to structs (and unions) after any field to allow the following field to be aligned according to any specific requirements of the compiler (or the user of the compiler). The standard does not specify, but typically a compiler will provide a command line argument to specify the (default) alignment. Good compilers also invariably support the de facto standard of #pragma pack, including push and pop options.

Padding bytes provide improved performance by reducing the amount of memory accesses required by suitably aligned data types. For example, on a 32-bit processor (more specifically a system which uses memory with 32 data lines) accessing a 32-bit integer will require two memory accesses when reading and writing the value rather than just one if it is stored on a 4-byte boundary (ie, unless the bottom two bits of the address of the integer are zero).

Some processors have alignment requirements which means that badly aligned data will actually terminate the program. For example, on the M68000 integers must be stored at even addresses.

However, the compiler should not allow the use of #pragma pack() to cause this problem.
The performance advantage of aligning data properly can be significant. In the past I have found that inner loop speeds can be more than doubled by fixing alignment issues. Nowadays I suspect that CPU caches mean that mis-alignment does not affect performance so much.

How does it work?

I won't explain in detail how padding is done as there are probably many other sources of this information. The rules are fairly simple but the consequences may be hard to comprehend, so you may want to read further and ponder different scenarios.

In brief, the values allowed for #pragma pack() are invariably powers of two. For example, Microsoft C has always supported 1, 2, 4 and 8 (and probably now supports larger values). Each field of a struct has alignment requirements equal to its size.  (An array has an alignment requirement equal to that of its base type; a nested struct has a requirement equal to the largest of its fields.)

Padding bytes are added before a field (except the first) to force alignment to the smaller of the alignment requirements of the field and the current alignment setting. Also padding bytes may be added at the end of the struct suitable for the field with the largest alignment requirement.  (This is mainly so that all elements of an array of the struct will be suitably aligned.)

For example, consider the following struct.


#pragma pack(4)
struct
{
    char a;         // 1 padding byte after a
    short b;
    char c[5];      // 3 padding bytes after c
    double d;
    char e[2];      // 2 padding bytes added at the end
};

Here one padding byte is added before b so it is aligned to a 2-byte boundary, because a short has an alignment requirement of 2 bytes. Three bytes are added before d to take it to a 4-byte boundary, because a double (assuming it's 8 bytes) has an alignment requirement of 8, but the alignment as specified in the #pragma pack() is only 4.  Finally, two padding bytes are added at the end to take the struct size to 24 - this ensures that the elements of an array of structs has the required alignment (in this case 4, due to the presence of d).

Graphically, the bytes are laid out like this:

Problems

Problems occur when structures are used with a specific alignment setting but used elsewhere or at a later time with a different setting. One way this could happen is if structures have been written to a file then the software that uses the structures is rebuilt with a different alignment setting. Old files, written using the original alignment setting, will be unusable due to misalignment of the data bytes.

However, the most common way that this problem occurs is in header files that define an interface to a module of some sort, such as a library or DLL.

For example, I encountered a very nasty problem when I was a young C programmer in the mid-1980's. I had developed a program that wrote data to binary files using different structs. I also used structs to pass data between different modules. At some stage we required the use of a third party library which provided a header file allowing us to interface with the library.

After #including the header file (in two of many modules) many strange and inexplicable things started to happen to the software. Suddenly structures passed between modules were completely different between the caller and the callee. Also reading data from old files resulted in seemingly "corrupted" values.

The problem was that the vendor had used #pragma pack(2) in their header file, but neglected to set it back to the default. This affected private structures in my source code. It could also affect structures in other header files, if  #included after the problem one, unless they themselves were protected using #pragma pack().

In those days, the Microsoft compiler I was using did not support the push/pop options of #pragma pack. However, the sensible thing for the vendor to do was to set alignment back to the default after it was changed like this:

#pragma pack(4)      // change alignment to 4
struct { ... };      // struct that requires alignment of 4
#pragma pack()       // reset alignment back to the default

Even better would be to design their structs to avoid the need for ever using #pragma pack() as I will explain now.

Alignment Agnostic

After the above experience my strategy was to always create structs that are what I call alignment agnostic. That is, the struct will have the same layout in memory, irrespective of the active alignment setting. In other words, carefully manage the size and placement of the fields of a struct so the compiler will never add padding bytes automatically.

How do you create alignment agnostic structs?

First, you have to understand how and where padding bytes would be added. If you did not understand my brief explanation above then you might try reading about it elsewhere.

You also need to know the size of your data types, especially if the structures need to be portable.  For example, is an int 16-bits or 32-bits? Of course, if the structures need to be portable then you should define your own types, such as INT16, INT32, etc and use those in the structures.

Then you can use these techniques:

1. Reorder fields in the structure. For example, you might group two short fields together to make a unit of 4 bytes, then follow it by a long (32-bit integer) so all three fields take up 8 bytes.

Not only does this avoid alignment problems but you may find that the struct is in fact smaller. And the decrease in size is not accompanied by a decrease in performance, as everything is still nicely aligned.

2. Increase the size of numeric fields. For example, using a short instead of an unsigned char may avoid the addition of a padding byte if it is on an even (word) boundary and the following field has an alignment requirement of 2.

This also has the advantage that you might avoid an unanticipated overflow with integers or loss of significance with floating points numbers.

3. Increase the length of strings. For example add one or two more characters to the end of the string to allow the following field to align to a 4-byte boundary (for a following 32-bit value) or 8-byte boundary (eg for a double).

This also has the advantage of lessening the chance off buffer overflow problems. For example, someone may forget to allow for a the string terminator and copy an extra byte into a character array.

4. Add dummy padding bytes manually.

For example, see char unused[6] in the example below.

Using the above struct as an example, we could reorganize like this.

struct
{
    char a;
    char c[5];
    short b;
    double d;
    char e[8];
};

Note that the struct is still 24 bytes in length but the positions of the fields (and the size of the struct) will not vary depending on the current alignment setting.  This is because
  1. the short field (b) is on a 2-byte boundary
  2. the double field (d) is on an 8-byte boundary
  3. the size of e is increased so the length is multiple of 8 bytes

Alternatively you can add a padding field at the end instead of increasing the size of e. This has the advantage that you could add extra field(s) in place of the unused bytes.


struct
{
    char a;
    char c[5];
    short b;
    double d;
    char e[2];
    char unused[6];
};

Note that for alignment up to 4, you only need to make unused 2 bytes in size, giving the struct a size of 20. But if alignment of 8 (or greater) is ever used then d needs to be on an 8-byte memory boundary and the struct size needs to be 24 bytes (ie, a multiple of 8).


Bit-fields

At this stage it would be remiss of me not to mention how bit-fields work. Bit-fields probably cause the most confusion when dealing with alignment. (If you don't use bit-fields you can skip this section.)

First according to the C standard all bit-fields whether declared as char, long, unsigned short, etc are treated the same. Since bit-fields are packed together in the same storage unit and since the storage unit is the natural integer size on the processor (usually the size of int), then the following structures would all have the same size. This would typically be 4 bytes on a 32-bit processor (probably 2 bytes on a 16-bit processor).

struct unsigned char a: 1; };
struct unsigned short b: 1; };
struct unsigned int c: 1; };

However, most compilers allow an extension to the standard where the bit-field storage unit is taken from the the base field type. In many compilers the above structures would have sizes of 1, 2 and 4 bytes respectively.

Note that consecutive bit-fields are placed by the compiler into the same storage unit, until its overflows and a new storage unit is begun. However, a bit-field of a different base type may also trigger a new storage unit. So, in the following struct the total size would be 4, which may surprise you.  The first storage unit (containing a and b) has size 1, then there is a padding byte, followed by another storage unit (containing c) of size 2.

struct
{
   unsigned char a: 1;
   unsigned char b: 2;
   unsigned short c: 1;
};

Graphically:

[Actually, if alignment is 1, then there is no padding byte and the struct size is 3.]

I guess the point is that when creating alignment agnostic structures with bit-fields you have to know how the compiler(s) you use allocate bit-field storage units. When bit-field storage units are used they have padding bytes added just as if they were integers of the same size.

Recommendations

For creators of header files:
  • make structures alignment agnostic
  • test that sizeof(your_struct) is invariant with different alignments (say 1, 8)
  • if you don't want to or know how to do that then use #pragma pack push/pop
  • add #pragma pack(push, n) in the header immediately before your struct(s)
  • add #pragma pack(pop) immediately after your last struct
  • never change the alignment before #including another header file
  • if push/pop is not supported at least use #pragma() to restore the default
For users of header files:
  • never change alignment before #including a header (in case it has problems)
  • protect structures in your own code with #pragma pack push/pop
  • OR better make your structures alignment agnostic

Summary

Luckily, this sort of problem is not common any more. C software does not often write structures to disk or pass stuff around using structs - all sorts of other things are available such as XML serialisation. When structs are used vendors have learnt to protect their header files.

The problem highlights another reason why I like C#.  .Net does not require the use of header files to define the interface for a module; instead an assembly's metadata is used by the compiler to find out about functions, their parameters, and anything else that is exposed publicly. This system avoids this and other problems that affect C and C++ programs.

Monday, 25 February 2013

Ignore Divide By Zero

Over 30 years of reading and writing C code I have seen (and written) a lot of code which tries to detect and recover from division by zero. Generally, this is a mistake.  In almost all cases it is better for the run-time to terminate the program (perhaps first generating an assertion). I have tried to explain this to colleagues (even experienced ones) and they think I am crazy, but please hear me out.

Note that I am not saying that you should be oblivious to the possibility of division by zero. I am simply saying that trying to recover from it at the point of the error is usually a mistake.

Conventional Wisdom

When I first began learning about programming (at Sydney University in the late 1970's) the first thing that was drummed into us was that it is not enough to get your program working. You also need to cater for error conditions. (I wholeheartedly agree with this - lack of proper error handling is by far the biggest blight on C software.)

The first example my lecturer gave of error-handling was division by zero.
Floating Point Division By Zero

I am talking about integer divide by zero, not floating point, unless otherwise stated. However, the same arguments generally apply to floating point calculations.

The major difference is that floating point division by zero usually will not terminate the program but generate an infinite result. (Most implementations nowadays provide floating point numbers that include positive and negative infinity).
The guidance was simply to make sure it cannot occur. A few years later I became an aficionado of defensive programming and my policy became that whenever I used division I needed to add extra code to check for the possibility of the divisor being zero and somehow recover from the situation - usually simply by setting the result of the operation to zero.

Most experienced C programmers take this approach, but I have found that this is almost always the wrong approach. Let's first look at the possible situations where division by zero may occur then consider each one in more detail.
  • a bug causes the divisor to take a zero value when it should never be zero
  • a zero divisor resulting from user input
  • incorrect data from an external source
  • in rare cases division by zero may be mathematically valid and handled specially
Bugs

Most of the time the problem occurs due to bugs in other parts of the code that have slipped through. It is often argued that the problem should be detected and handled something like this:

  if (numRequests > 0)
    aveTime = totalTime / numRequests;
  else
    aveTime = 0;        // Bad idea

The problem with this is that the bug is now silently hidden. Perhaps this is good for the final release of the software but is certainly not good when debugging and testing. It's better to find and fix the bug than to cover it up. This is the problem of defensive programming which I talked about in a previous post.

Personally, I would just leave the test out altogether and let the run-time system terminate the program. This is the fail fast approach. But I would also add an assertion, especially if using floating point values, since some implementations may not terminate but instead generate infinity, which is probably not what was desired.

  assert(numRequests > 0);
  aveTime = totalTime / numRequests;

This should be adequate with good design, thorough testing and software that is adequately verifiable (see my post on verifiability), but with badly written software you may not be certain that a bug has not slipped through, so the only alternative is to try to recover. But often continuing with a strange value may cause subsequent problems or even data-corruption.  In the above case it may be that aveTime should always be greater than zero.

Having studied a great deal of these situations I found that it can be very difficult to decide on a value that makes sense for the continued safe and sensible operation of the software. For example, it may make most sense to set aveTime to some very large value (since mathematically speaking, dividing by zero produces an infinite result); but if both totalTime and numRequests are both zero then aveTime probably should also be zero.

  if (numRequests > 0)
    aveTime = totalTime / numRequests;
  else
  {
    assert(0);          // Don't hide this bug in debug/test

    // Try to recover sensibly in case a bug was never found
    if (totalTime == 0)
      aveTime = 0;
    else
      aveTime = INT_MAX;
  }

The other alternative in C++ is to throw a software exception and let the software recover at a higher level.

User Input

Sometimes a calculation can be the result of user input. It is important to validate user input when it is entered. Not validating can cause inconsistencies in that data which later leads to problems like division by zero.

  for (;;)
  {
    int numberOfItems = GetNumberOfItemsFromUser();
    if (numberOfItems > 0)
      break;
    DisplayErrorMessage("You must have at least one item");
  }
  .
  .
  aveCost = totalCost / numberOfItems;  // No divide by zero here

Of course, a lot of things could happen between the input validation and the use of the value. If it is at all possible that the value could be corrupted or input bypassed then the previous section (Bugs) again applies.

Bad Data

A lot of programs carefully validate user input but assume that data from other sources is valid. Unless you are sure that the data is valid, for example by using a CRC then you should validate data when you receive it.

Data can be corrupted due to many things like hardware or software failure or human error.  In software with security implications, deliberate tampering may be an issue and a CRC is not sufficient - use a cryptographic checksum like SHA1.

Expected

In very rare cases division by zero may not actually be an error condition, in which case you may need to handle it especially. This is the reason that IEEE floating point numbers allow for infinite numbers. Generally, this sort of code would be for a specialized scientific or mathematical purpose and would use floating point numbers anyway.

Otherwise, it could be achieved like this:

  if (elapsedTime == 0)
  {
    isInfiniteSpeed = true;
    speed = -1;
  }
  else
  {
    speed = distance / elapsedTime;
    isInfinite = false;
  }

Of course, later code that used speed would need to also check isInfiniteSpeed before using speed.

Conclusion

Generally, it is a mistake to detect and try to recover from a divide by zero error.  Except for the very unusual situation where it is not an error-condition (see Expected section above) then it indicates there was a problem earlier such as a bug, corrupt data, or user input that was not validated.

With verifiable and thoroughly tested software the problem should not happen. Trying to recover is actually detrimental as it may hide a bug which would normally be found in testing.

For poorly written software (not well-written and not easily verifiable) it might be worthwhile trying to recover from the problem. It would also be necessary for fail safe systems where software termination could have dire consequences.

The problem with trying to recover is that there may not be a reasonable value to use in the circumstance. It is conventional to use a value of zero but often this is the worst possible value to use.

The most important point is that if you do recover from a divide by zero error that you do not hide the fact that there is a defect in the code. During debugging and testing the software should generate an exception or make it immediately obvious that there is a problem with the code. For released software the problem should be detected and reported - for example, to a monitored error log file.

Thursday, 31 January 2013

Why Scrum Fails


The general consensus is that Scrum implementations will fail to achieve their goals in at least 70% of organizations (some say 90%). (For other views, or more details, just search for for something like "Why Scrum fails" or "Does Scrum work?" on Google.) What is worse is that, even in organizations which believe that Scrum is working, it is not working or not working anywhere near as well as it could.

The real question is why this happens. Is it because people just don't "get it" or because it will not work in some (most) environments, or is it generally flawed?

Adapting Scrum

Every place I have worked that used Scrum it was claimed that "pure Scrum" cannot work in that particular environment. There is some truth in that but often it is simply reluctance to change. A major part of the reluctance is adapting a new management style where the team self-manages rather than the conventional team leader management approach.

Management

One of the guiding principles of Scrum is that the team is self-organizing and self-managing. The team decides how much work they can do and how they will do it and reports to the PO not to manager(s).

However, there is a reluctance of managers to let go of their control. Sometimes I believe managers are at fault and will even go as far as threatening team members to retain their control. However, most of the time I think developers are not willing to take responsibility and the managers are (perhaps only unconsciously) aware of that and feel they have to fill the gap.

Product Owner

The other problem is the PO. It is essential for Scrum to work that the PO be as close as possible to the real customer, by which I mean the actual end-user of the software. In most companies I have worked, the developers never even talk to a user of the software, and generally are at least 3 layers removed from them. Managers have talked for decades about having a "customer focus" (which I wholeheartedly endorse) and the best way to be customer-focused is to talk to and be familiar with the people that use your product.

Unfortunately, everywhere I have worked the PO is generally an analyst or even a former developer and far removed from the real users of the software. Real users (or at least someone who talks to real users) are invited to the sprint review but often they are not particularly interested or don't turn up at all, because they don't feel involved.

Senior Management

For most organizations Scrum is a cultural change and it is well-recognized that the culture of an organization starts at the top. If senior management are not committed then this is where the problems start. You don't get a proper PO and everyone realizes that Scrum is just a passing fad that the CEO heard about, but was not willing to invest any effort in making it work.

Basically, inertia is the problem. People are stuck in their ways. They may be aware that current processes are not good but they have gotten by with them for many years and there is no motivation to change, especially without management support.

Sunday, 16 December 2012

Asymmetric Bounds

Experienced programmers are aware of the advantages of expressing ranges using asymmetric bounds.  If you haven't heard of this term, I should explain that it is simply a way to express a range (typically of integers) using a (zero-based) start value and a value just past the end. Some people consider use of asymmetric bounds to be just a convention and I have even heard Pascal/Delphi programmers insist that it is a bad convention. It is not only a good convention, it is more than that, since it has a basis in mathematics as I explain below.

Further, though developers agree that using asymmetric bounds is a good idea in code, they try to hide the idea from users. Sometimes it is a good idea to translate internal representations for the benefit of the user (zero-based indexes being an obvious example), but the elegance and simplicity of asymmetric bounds is very useful in user interface design. I explore this more in the User Interface Examples section below.

Also note that the principle of asymmetric bounds is closely related to the idea of not treating zero as a special value in code. This is something I discussed about a year ago (see blog post on Zero). [I finally got around to writing this follow-up post which I promised at that time.]  The main problem with using an inclusive range is that there is no way to specify a range with zero elements.

There is little written on this subject, however, while researching this post I discovered that Andrew Koenig recently wrote some articles on this for DDJ (see Asymmetric Bounds, Part 1). However, I feel these articles are incomplete as they:
  • lack discussion of problems with using asymmetric bounds
  • do not talk about its importance in user interfaces at all
History

Specifying ranges has been problematic for humans for a very long time. These problems often result in the careful addition of either of the words inclusive or exclusive to ranges such as a range of dates. Unfortunately, there is no simple way to express asymmetric bounds for ranges other than to say inclusive start, exclusive end.

Mathematicians have been aware of the advantages of asymmetric bounds for hundreds of years. For example, a piece-wise function may have different formulas for different values of x, such as:

f(x) = 1,  for -infinity < x < -1
f(x) = -x, for -1 <= x < 0
f(x) = x,  for 0 <= x < 1
f(x) = 1,  for 1 <= x < infinity

The interval for each domain is specified using asymmetric bounds. So "0 <= x < 1" means x can take any value from zero (inclusive) up to one (exclusive). An abbreviated notation for this range is [0,1) where square brackets [] indicate that the ends are included and round brackets () indicate they are excluded. Using asymmetric bounds like this is important since a function can only return a single value for any particular value of x.

In the history of computers, the first high-level languages (like Fortran) used inclusive bounds for specifying loops and array sizes etc. On the other hand assembly-language programmers soon discovered the advantages of asymmetric bounds. The insight of Dennis Ritchie meant that asymmetric bounds became an intrinsic part of the C language not only in arrays and for-loops but also in other areas like file-handling. For example, eof() (end of file) does not become true (in C) until you read past the last byte of a file, whereas in Pascal eof becomes true at the last byte.

Fence-post Errors

The main advantage of using asymmetric bounds is that it reduces the likelihood of fence-post errors.

The fence-post error gets its name from the idea of counting the number of posts in a fence. For example, say you have a fence that's eight metres long and the posts are one metre apart. When asked how many posts there are in the fence it is easy to just think eight, whereas I am sure you know it is really nine.


In code, it is easy to make similar mistakes. For example, I have encountered programmers who when asked how many times the loop below executes would say, without hesitation, 12, since 22 - 10 = 12.

Fortran:
  DO I = 10, 22  ...

Pascal:
  for i := 10 to 22 do ...

C/C++/C#:
  for (i = 10; i <= 22; ++i) ...

These are obvious mistakes, but there are situations where this sort of off by one error is not so obvious. Once, you understand the idea, using asymmetric bounds avoids all manner of fence-post errors.

BTW, the convention in C (and derived languages) is to use asymmetric bounds in a for loop, so that the obvious 23 - 10 gives the right answer:

  for (i = 10; i < 23; ++i) ...

Empty Ranges

The use of asymmetric bounds is related to other conventions that are commonly used in C such as zero-based array indices, and even the use of modular arithmetic, such as integer division (/) and remainder (%). Looking at fence posts again it makes sense to number the posts starting at zero since the "first" post is the post zero meters from the start of the fence.


One thing that C does better than prededing languages is not treating zero as a special case. In some other languages you need special code to handle zero related "boundary" conditions. For example, in Fortran a FOR loop always executes at least once, which has caused innumerable bugs and crashes for things like degenerate conditions and empty input. You can read more about this in my blog from last November (see Zero).

One reason that C handles zero correctly is due to the use of asymmetric ranges. An empty range simply occurs when the start and end of the range are equal. (Compare this with Pascal where ranges are inclusive -- the smallest range you can specify has one element.) Hence it is simple to test for an empty range in C by checking that the start and end are equal.

Another advantage is that it can sometimes be useful to distinguish between different empty ranges. If you just use a special value (like a NULL pointer) to represent an empty range then you lose this ability. (See the section on malloc madness in the above-mentioned blog post from November last year).

The Special Value

Another thing that fits well with other conventions in C is the special value that indicates the end of the range. Since it does not indicate a valid value within the range the just past end value is useful to indicate an error, an unknown value or other special condition.

It is a common convention in C to use a value, that does not normally occur, to indicate an error - for example, a NULL (pointer) value or -1 (integer) value. The just past end value is also useful, especially where the type is indeterminate. For example, in C++ STL containers, iterators have an indeterminate type (so that the same code can work with different containers without modification). All containers support the end() function that returns an iterator with the special value. This value usually represents the "just past end" value -- that is, when iterating through the objects of the container it is used to determine that you have iterated past the end -- however, it can also be used to indicate other things such as that a search of the container returned no value.

These ideas are entrenched in the C standard. Examples can be seen not just in the run-time library (such as the behavior of eof() as mentioned above), but also in the language itself. For example, the C standard says that it is perfectly legal to take the address of an element one past the end of an array. Such a pointer can be compared to a pointer to any other element of the array (but cannot be dereferenced).

This idea is used in many libraries, operating systems, databases and other APIs, which present a "list" of items that can be manipulated. Typically, a function or method is be provided which inserts an item into the list by specifying which existing item the new item will be inserted before. Specifying the special value the insert() function will append to the list. For example, many Windows controls (such as a list box) take an index of -1 to mean "one past end" -- inserting before this index appends to the end of the list. Similarly many STL containers provide an insert() method which take an iterator as the insertion location or append to the container if passed the special value end().

Example

A common algorithm is a binary search, which is used to find an element in a sorted list (usually an array). I've found that binary searches takes two forms depending on whether you are searching for a specific value or just want to know in which range a value occurs. Note that in either case you need to make use of the special value. When searching for a particular value you would return the special value to indicate that the sought value was not found. When searching to find a range you also need to make use of the special value since the number of possible ranges is one more than the number of elements in the sorted list.

    function bin_search(int tofind)
    {
        int bot = 0;
        int top = values.size();          // top starts one past the end of the vector "values"
        while (bot < top)
       {
               int curr = (bot+top)/2;    // split the range into two (almost) equal sub-ranges
               if (tofind < values[curr])
                     top = curr;              // now check bottom half
               else
                     bot = curr + 1;        // or check top half
       }
       return bot;
    }

Using asymmetric ranges make this code simple and elegant. For comparison, I tried rewriting the above using symmetric bounds and 1-based array indexing (as a Pascal programmer would do) but I gave up after the third try since I kept getting the wrong value or an infinite loop. [If someone gets a working version using symmetric ranges please post it below.]

There are countless other example of where asymmetric bounds are useful. I am sure you will encounter them if you have not already.

Circular Buffer Problem

Despite all the benefits there are a few problems to watch out for.

One example is a circular buffer which is commonly used to implement a queue. A queue is a FIFO (first in - first out) data structure which is easily implemented with a circular buffer where the current output/input locations are simply stored as an index into the buffer. (The buffer is typically just an array). Adding to (or removing from) the queue is simply a matter of writing (or reading) the current item and incrementing the input (or output) location. Of course, these locations must wrap around when they reach the end of the buffer (which is why it is called a circular buffer). The wrap around is easily accomplished using modular arithmetic (ie, the % operator in C).

A circular buffer is actually a good example of using asymmetric bounds since the end of the current range of valid items is the same as the current input location. It is also a simple matter to check whether the buffer is empty since this occurs when the input and output locations are the same. However, there is a problem - can you see it?

The problem is that when the buffer becomes full then the input and output locations are also the same. So without further logic there is no way to distinguish between a buffer completely full and a buffer empty condition. This is a trap that I have fallen into as probably have many others.

Binary Heap

Speaking of queues reminded me of another problem. A binary heap (not to be confused with the C run-time memory heap) is normally a very good way to implement a priority queue (not to be confused with the FIFO queue we saw above). The algorithms for a binary heap are a rare example where zero-based array indexes are not a good idea.

I first wrote code that implements a binary heap in Pascal and it was simple and elegant. A binary heap is essentially a binary tree structure stored in an array. Since arrays in Pascal are 1-based the root elements is at index 1 (the start of the array). Given the index of a node it is simple to find the index of its two children - the first child has an index which is twice that of it's parent, and the other child has an index one more than that. (This is easily accomplished using bit-manipulation by shift left by one bit and filling the new bit with either 0 or 1.)

Similarly, to find the parent of a node simply divide by two (using integer division which throws away any remainder), or perform a right-shift.

At a later time I translated this code to C. Of course, C arrays are zero-based so I started with the root of the tree in index zero. However, this really made a mess of the code to find the children/parent of a node. The solution was simply to use one-based indexing and ignore the array element with an index of zero.

    void heap_add(int n)
    {
         // heap full check needed here

        int hole = ++current_size;                     // start at end
        while (hole > 1 && n < values[hole>>1]) // stop at root or at a higher priority item
        {
            values[hole] = values[hole>>1];         // move lower priority item up
              hole = hole>>1;                             // go to parent
        }
        values[hole] = n;
    }

This is the only algorithm I can recall where using one-based array indexing worked better than zero-based. Also, this does not use asymmetric bounds since current_size indicates the valid top element in the values array, rather than one past the end.

User-Interface Examples

There are countless places where user-interfaces can make use of asymmetric bounds - for example, almost anywhere that a user can manipulate an array or list of items.

An example, with which you are undoubtedly familiar, is a text control (or, similarly, a line in a text editor or word processor). If the control contains, for example, the string "Hello" then there are six places for the cursor - at each of the characters and also after the last character. Similarly, selecting some of the characters (using the mouse or keyboard) uses asymmetric bounds since the end of the selection is marked by the cursor position of the character one past the end of the selection. Further, when there is no selection the current cursor position (sometimes called the caret) can be seen to indicate both the start and end of an empty selection.

Similarly, any sort of editable list such as a list-box should have an empty position one past the last valid entry. When you "insert" a new item at this point it is appended to the end of the list.

Visual Studio Bookmarks

I use bookmarks in Visual Studio a great deal. I think it was Visual Studio 2005 that introduced a modeless Bookmarks window (similar to the one I added HexEdit a few years earlier) containing a list of all your bookmarks.

I really appreciate the VS bookmarks list apart from one peculiarity. You can reorder the bookmarks to your liking by dragging them in the list but when you drag a bookmark to a new location it places it after the item you dragged onto. This means to drag to the end of the list you drag onto the last item not past it. Further, you can't move an item to the top of the list (with a single drag) - see the screenshots in which I tried to drag "InitInstance" to the top of the list but it ended up underneath the top item.

To me this design is astonishing. Even if the developer(s) who designed it knew nothing about user-interface design you would think that they would be familiar with C, C++, or C# and have a rudimentary appreciation of the elegance and simplicity of using asymmetric bounds. This behavior of Visual Studio bookmarks confuses and annoys me every time I use it.

Conclusion

Use of asymmetric bounds and zero-based indexes greatly simplifies all sorts of problems. Most algorithms that involve any sort of list or range of values are more simply implemented, apart from a few special cases mentioned above.

Moreover, any user interface design that manipulates a list or array of objects should also use asymmetric bounds. When there is a current location (ie, cursor or caret) that the user uses to interact with the list then adding an item at the current location should insert before the current item; adding at the special location past the end should append to the list.

Friday, 23 November 2012

Verifiability

The verifiability (sometimes called testability) of software is how easy it is to check your code for bugs.  I often talk about the importance of making software maintainable but verifiability is another attribute that does not get enough attention.  (See some of my previous posts for more on quality attributes such the Importance of Developer Quality Attributes).

There are many ways to create software that makes it difficult to verify.  If software is hard to test then bugs will inevitably occur.  I talk about how to make code more verifiable below.

But before I go on, I should mention TDD (test-driven development).  I will probably talk more about TDD in a later post, but I believe one of the main (but never mentioned) advantages of TDD is that it makes code more verifiable.

Testing

Verifiability plays an important part in testing.  After all, you can't test something properly if it is difficult to verify that it is correct or, conversely, show that it has no bugs.  I give an example of code below that shows how code can be hard to verify and makes bugs far more likely.

In extreme cases, software changes can be impossible to test.  For example, I was recently asked to fix up some dialogs that had some spelling mistakes.  The problem was that the software was old and nobody even knew how to get the dialogs to display or even if the code that displayed them was reachable.  (Some of the dialogs were not even used anywhere in the code as far as I could tell.)  It makes no sense to fix something if you cannot verify that it has been fixed!


Debugging

Verifiability is also important in the debugging process.  Removing bugs has three phases:
  1. detecting the presence of a bug
  2. tracking down the root cause of the bug
  3. fixing the bug
The last stage is associated with the maintainability (or modifiability) of the code.  The first two stages are usually associated with verifiability, though I prefer to use an extra quality attribute, which I call debuggability to describe how easy it is to track down a bug, and use verifiability to strictly describe how hard it is to find bugs.

How Do We Make Software More Verifiable?

Verifiability can be increased in three ways that I can think of:
  • external testing tools
  • extra facilities built into the software such as assertions and unit tests
  • how the application code is actually designed and written

Testing Tools

There are many software tools that can be used to assist software testing. Automated testing tools allow testers to easily run regression tests.  This is very useful to ensure that software remains correct after modifications.

One aspect of code that is not often well-tested is error-handling.  Test tools, mock objects, debuggers and simulators allow simulation of error conditions that may be difficult to reproduce otherwise.  For example, a debug heap library is invaluable in detecting memory allocation errors in C programs.

Debuggers obviously make software more debuggable, but they can also be useful for verifiability.  Stepping through code in a debugger is extremely useful for understanding code and finding problems that may not have otherwise been seen.

Assertions

Assertions are an old but crucial way for software to self-diagnose when it has problems.  Without assertions bugs can go unnoticed and the software may silently continue in a damaged state (see Defensive Programming).

In my experience, assertions are even more important for debuggability.  I have seen problems in C (almost always rogue pointers) that have taken days if not weeks to track down.  Careful use of assertions can alert the developer to the problem at its source.

Unit Tests

Unit tests (and TDD) are important for verifiability.  If you have proper and complete units tests then you can get a straightforward and obvious indicator whether the code is correct.

In my experience most bugs that creep into released code are due to modifications to the original design.  Poor changes are made because the modifier does not fully understand how the code works -- even if the modifier was the original designer, they can still forget!  This is further compounded by the fact that the modified code is often less thoroughly tested than the original version.

Mock objects are useful by themselves, for example for testing error-handling code.  (When not used with unit tests they may be called simulators, debug libraries, automated test harnesses. etc)  Mock objects are particularly useful with unit tests as they allow testing of a module to be isolated from other parts of a system.

I will talk more about verifiability and unit tests in a future post, hopefully soon.

Software Design

Many aspects of good design that improve the maintainability of code are also good for verifiability.  For example, decoupling of modules makes the code easier to understand but it also makes the modules much easier to test individually.  Testing several simple modules is always simpler than testing a single large module with the same features, even if only because the permutations to be tested increase dramatically as the number of inputs increases.  (See SRP and other design principles in Software Design).

Code

Perhaps the simplest way to increase verifiability is simply to write good code.  The same principles that a software architect or designer uses also apply at the coding level.

One of the worst programming practices is to use global variables to communicate between modules or functions.  This results in poorly understood coupling between different parts of the system. For example, if the behaviour of a function depends on the value of a global variable, it makes it very difficult to verify that the function works correctly, since you can't ensure that some outside influence will change that variable.

Another problem is large functions that do too much (remember SRP).  Again the more parameters (or other inputs) to a function, the harder it is to check that all combinations are handled correctly.

Similarly, code with many branches (ie, if statements) is hard to verify since there may be a large number of different paths through the code.  Even ensuring that all code is tested (with a code coverage tool) is insufficient. You may need to test every combination of code paths to be sure the code has no bugs.

Code Example

Finally, I will give an example of some C code that is virtually impossible to properly verify due to the way it was written.

Some time ago I worked on some C code for an embedded system.  A lot of the code made use of time (and date) values returned from the system.  Time was returned as the number of milliseconds since midnight.

One particular function was used to determine if a process had timed out.  Like the rest of the code it used this system "time" value. The code was very simple to implement except for the special case where the start time and the end time straddled midnight.  (Normally the code below would not be executing in the middle of the night but under unusual circumstances it could.)  Here is a simplified version of the code:


/* return 1 if we have already passed end time */
int timed_out(long start, long end)
{
     long now = get_time();
    
     if (start == end && end != now)
          return 1;

     if (end > start)
     {
          if (now > end || now < start)
               return 1;
     }
     else if (end < start)
     {
          if (now < start && now > end)
               return 1;
     }
     return 0;
}

Do you think the above code always works correctly?  It is difficult just looking at it to tell if it is correct.  It is virtually impossible for testers to verify that the above code works correctly, even if they were aware that midnight was a special case in the code. Setting up a unit test would be difficult since the software was not permitted to change the system time (though a mock object could have been created to emulate the system time).

You might also notice that the  code suffers from poor understandability (and hence poor maintainability) as well as poor verifiability.

A much better way to handle such timing events is to use a type like time_t or clock_t which does not reset at the end of day.


/* return 1 if we have already passed end time */
int timed_out(clock_t end)
{
     clock_t now = clock();
     assert(now != (clock_t)-1);
     return now >= end;
}

Note that the above code is only for low-resolution timing since the C standard says nothing about the resolution of clock_t – ie, the value of CLOCKS_PER_SEC is implementation-defined. However, in this example 1/10 second was sufficient and CLOCKS_PER_SEC == 1000 for the compiler run-time being used.
This code is more verifiable. First, it is fairly simple manually verify since anyone reading the code can understand it fairly easily.  It is also easy to step through the code in the debugger to make sure it works.  Moreover, it does not need special testing for midnight.

Summary


It is important to create software that is easy to test. We talked about using tools and code (such as unit tests) to enhance verifiability, but more important is creating good code. Now that I think about it, verifiability is probably affected more by the data structures that are used (such as the time-of-day format used in the above example). The data structures define the code; get these structures right and the software is likely to be much more verifiable, and consequently less likely to have bugs.

Furthermore, in my experience writing software with testing in mind is more likely to produce better code. This is one of the (many) advantages of unit tests (and even more so with TDD). If you create units tests as you create the code (or even before you create the code in the case of TDD) then you will create code that is not only more verifiable but also better designed in general.

Tuesday, 13 November 2012

Reusability Futility

The software industry regularly gets carried away with the some new fad or other. Sometimes these are good ideas but because they are over-hyped and taken to extremes can actually be counter-productive. I have seen many such fads in my career. A big one that took off in the late 1980's, was the emphasis on making code reusable. It's even said that this lead to the development, or at least popularization, of object-oriented languages.

Now, I think reusability of code is great. It has had enormous benefits on productivity almost from the birth of the industry. However, many people tried to take it too far. There are more important attributes of software, and when reusability is applied it can be to the detriment of other attributes, such as understandability and efficiency.

It is also often attempted on modules of too large a size. My rule of thumb is that the larger a module the less likely it is to be usefully reusable. Of course, I expound on this further below, but first will explain about software quality attributes.

Quality Attributes

You may recall that in February I mentioned the difference between two types of quality attributes of software which I gave the names user and developer to emphasize who directly deals with their effects (see Importance of Developer Quality Attributes). User quality attributes are things like correctness, reliability, usability, efficiency, security, compatibility, etc.

Developer quality attributes are things like understandability, maintainability, portability, verifiability, etc. Typically the users of the software (and other stakeholders) are not troubled by, or even aware of, these developer quality attributes but that does not mean they are not important to them. In fact, my contention has always been that, for most large-scale projects, maintainability (and the closely related understandability) are by far the most important attributes. (Yes, even more than correctness since if software is incorrect you can fix it, but if it is unmaintainable then, unless you never have to change it, it is worthless.)

Another attribute of concern to developers is reusability. It is important to understand that maintainability and reusability are very different, though, there is generally some correlation between the two. I mention this because I worked with a colleague (AK - an OO "expert") in the late 1990's who could not even understand the distinction so I will explain:
  • Maintainability (or sometimes modifiability) is how hard it is to fix or enhance the software. It depends somewhat on other attributes, most importantly, understandability.
  • Reusability is how hard it is to make use of the same code in more than one place. It is important for the DRY principle (see Software Design)

The Rise of Reusability

In the 1970's a great debate began in the software industry about the software crisis. Rapid advances in computer hardware meant that more and more ambitious software projects were being attempted. Lack of experience with large software projects meant that most software was late, over-budget and of poor quality. It was thought that this situation would become much worse in the future due to the anticipated hardware advances (which, of course, did eventuate). In brief, the crisis was that there would be a huge shortfall in the number of experienced programmers required to develop the software that would make good use of the hardware.

At the same time, it was seen that there was a large amount of redundancy, as the same or similar software was being written over and over. To stop developers continually reinventing the wheel ways were sought to increase the reusability of code. There was already a lot of reuse, using low-level libraries (such as the C run-time library) but some people thought that high-level (ie generally large) components could be created which could then simply be slotted together to create useful software with little or no programming involved.

This emphasis on reuse reached such ridiculous levels by 1995 that I remember seeing this full page ad from Microsoft in many magazines I read at the time. This featured a (probably fictitious) recycling fanatic named Edwin Hoogerbeets, who lived in a house made from beer bottles.

An idea that a lot of people had was to try to emulate the advances in hardware, by using similar techniques to do the same in software. For example, a major hardware advance was the IC (integrated circuit, popularly known as the silicon chip), which lead to a push for similar software components. Many people even used the term "Software IC" to describe these components.

Also around this time (late 1980's) object-oriented languages started to gain popularity. To a large extent this was fuelled by the obsession with reusability as it was believed that OO designs enhanced reusability. Indeed many OO evangelists heavily emphasized this.

Many of these ideas were adopted by, or even originated in, educational institutions. The effect was a whole generation of software designers who were indoctrinated into the importance of making code reusable. This is the root cause of many problems with software created in the last 20 years (and probably still being created today).

YAGNI

When I first read about Extreme Programming (late last century :) I liked most of what I read, but I was not certain about the idea of YAGNI (You Ain't Gonna Need It). This principle says to only do enough to get things working according to current requirements. The idea is that developers should refrain from adding extra features for increased flexibility, perceived future needs, or just because they are cool.

My first thought was that it is stupid to design software without regard to any future requirements. On the other hand I could see where the idea was coming from since I had worked with many software components (such as the Windows API) that were tedious and time-consuming to use since they made you consider all sorts of irrelevant details. This problem was caused by a design that allowed you to accomplish almost any conceivable task (including many pointless ones) without regard to how easy it was to accomplish simple ones.

I don't want to go into a detailed discussion of YAGNI (maybe later). My main point here is that I believe YAGNI was a reaction to the behaviour of many software developers and architects. Instead of designing software components for the particular task at hand there was a push to make them as flexible and general purpose as possible in the name of reusability. In all companies I have worked for the software "architects" would often design the equivalent of a "Swiss Army Knife" when a simpler, more specific, tool was required.

This extra complexity in the design greatly detracts from understandability and hence maintainability (which I have already said is the most important "ility").

An Analogy

I am loathe to analogize software design with the design of physical machines as I believe the two are very different. However, I think there is some value in the analogy below.

But first I would like to point out that the history of invention is littered with failed designs that emphasized reusability. I read recently about 19th century American invention of a system of firearms made from interchangeable parts, which could be assembled to create various weapons like guns, rifles, etc. It was an ingenious system but was never successful because machines designed for a specific purpose are easier to use and maintain than ones designed for flexibility and reusability.

The analogy I have chosen is in aircraft design. If the reusability obsession of the software industry was applied to the aeronautical industry a manufacturer of a plane would design a single wing "module" that they would attempt to fit to all their aircraft from a tiny twin-engine to a super-jumbo. Of course, the wing would have to be flexible enough to work with all size of aircraft. Perhaps for large aircraft you would use multiple wings (4, 8 or even 16). For small aircraft perhaps you could get away with one wing.

We all know that planes are not designed like that. As far as I know (which is very little) every plane has its own custom wing design. Perhaps the same wing design could be used on two different models as long as they had similar size and performance characteristics, but what aircraft manufacturer is going to create two nearly identical models that will just compete with each other?

Of course, even in aircraft manufacture there is a lot of reuse. It's not inconceivable that all models from one manufacturer use the same rivets for the airframe, electrical wiring, seats, cables, pneumatic pumps, etc. This has also always been done in the software industry where the same low-level routines are used for a myriad of things.

The electronics industry is lucky in that fairly high-level components can be created that can be used in different products. Other industries, like the aircraft and software industries, are not so lucky. In software design, as in many other areas of design, creating high-level reusable components results in cumbersome products that just do not work.

Summary

Emphasizing reusability was one of the big mistakes of the software industry in the last few decades. I remember almost 20 years ago telling my colleagues "forget reusability, concentrate on maintainability". The trouble is software designers have been taught to make even high-level components flexible and reusable. This flexibility comes at the cost of increased complexity. Reusable components are usually harder to understand (and often also less efficient) which leads to software that is hard to build and maintain.

Fortunately, many in the industry are now saying the same thing. I believe that overly complex software lead to the idea of YAGNI. More recently criticism of reusability has become more open -- for example, Kevlin Henney's recent book, 97 Things Every Software Architect Should Know gives the rule Simplicity Before Generality, Use Before Reuse. I couldn't agree more!

Tuesday, 30 October 2012

Scrum Standup

One of the best ideas to come from Scrum is the daily standup.  This is where each member of the team tells the rest of the team what they did in the previous (working) day, what they intend to do in the next day, and any problems they anticipate that will stop them doing their job.

[Note that this follows on somewhat from my previous Scrum post at: http://devmethodologies.blogspot.com.au/2012/02/scrum-problems.html]

I've recently read a lot about the Scrum Daily Standup, but what is never mentioned (or perhaps just hinted at) is its major advantage: it quickly becomes obvious when someone is not pulling their weight.  Moreover, the standup, when done properly, motivates team members to focus on the task and give their all.  I discuss this a bit more below but first we need to understand a few things.

What It's NOT

First I will dispel a few misconceptions about the standup.

First, it is not about general communication, only about communicating to the team what each team member has and will be doing.  Any form of communication between two individuals or a subset of the team should be taken offline.

Second, it is not about updating the customer or management on progress.  You can tell when this sort of thing is happening when everybody talks to just one person (usually the Scrum Master or an observing manager).

Lastly, it is not about problem solving.  This usually happens when an ill-prepared team member instead of reporting their status starts asking questions.  Generally these problems should be taken off-line (and in my experience usually could have been resolved at an earlier time).

Why It's GOOD

Why do I like the standup so much?  There are three reasons: visibility, visibility and visibility.

First, let's be honest.  The average developer works at 50% efficiency or less, sometimes much less.  This is for many reasons, some of which are not entirely or not at all their fault.  This can be that they have been misinformed or misunderstood what they were to do.  Often programmers become side-tracked on less important or completely unimportant issues, or they take an inefficient approach for selfish reasons (such as trying a new technique or technology).  Finally, there are quite a lot who are simply unmotivated.  In summary, developers have a tendency to lose focus or focus on the wrong thing.

The visibility of the daily standup makes all these things obvious, if not immediately, then over several days or at most a few weeks.  If everyone knows what everyone else is doing then it soon becomes obvious when someone is slacking off, have become side-tracked or have lost focus.  It is up to the team to work out what is happening and provide the assistance, guidance or punishment as required.

Luckily punishment is rarely needed, due to the positive effects of the standup.  When everyone understands how everyone else is contributing to the sprint's goal(s) then there is much better motivation and focus on achieving those goals.  The team members make a commitment to the rest of the team and generally do their utmost to honor their commitments.  In summary, the standup helps to get the team working together.

What Goes WRONG

Things go wrong when there is reduced visibility.  This can be for a number of reasons.

One reason I have recently encountered is due to a Scrum team made up of many disparate members, working in different areas.  When different team members don't, or can't, understand what others in the team are doing then all the enormous benefits of the daily standup are lost.

Another problem is team size.  When a team becomes larger than 7 or 8 then all sorts of problems appear.  (This is the reason for the cardinal rule in Scrum of a team no larger than 9.)  First with too many people it becomes too hard to follow what everyone else is doing.  It is also much easier for a team member to disappear into the background. Further a smaller team brings a sense of camaraderie and sense of purpose.

Guidance

For Scrum to work well you need a small team where each team member is involved with, or at least understands, what ever other member is doing.  Ideally, everyone makes an important contribution to accomplishing the team goals; a great motivator is knowing that others in the team are depending on you.

It's important to ensure that the standup is working.  If people arrive late or fail to turn up at all then this indicates that the standup is uninspiring and needs to be fixed.  Here are some ideas on how to make sure the standup goes well:

  • when it is their turn each team member must talk for at least 1 minute but not more than 2 minutes.
  • they also cannot ask questions of others but must stick to reporting status and roadblocks
  • they should stick to reporting what they accomplished not how they did it or any problems they had
  • when a team member is reporting, other teams members cannot say anything except to asks questions
  • if someone misses the standup for some reason they must get someone else to report for them
  • if someone is away they still report on what they did the previous day they worked
  • people must not talk directly to any individual (like the Scrum Master) but to the whole team