Sunday 16 December 2012

Asymmetric Bounds

Experienced programmers are aware of the advantages of expressing ranges using asymmetric bounds.  If you haven't heard of this term, I should explain that it is simply a way to express a range (typically of integers) using a (zero-based) start value and a value just past the end. Some people consider use of asymmetric bounds to be just a convention and I have even heard Pascal/Delphi programmers insist that it is a bad convention. It is not only a good convention, it is more than that, since it has a basis in mathematics as I explain below.

Further, though developers agree that using asymmetric bounds is a good idea in code, they try to hide the idea from users. Sometimes it is a good idea to translate internal representations for the benefit of the user (zero-based indexes being an obvious example), but the elegance and simplicity of asymmetric bounds is very useful in user interface design. I explore this more in the User Interface Examples section below.

Also note that the principle of asymmetric bounds is closely related to the idea of not treating zero as a special value in code. This is something I discussed about a year ago (see blog post on Zero). [I finally got around to writing this follow-up post which I promised at that time.]  The main problem with using an inclusive range is that there is no way to specify a range with zero elements.

There is little written on this subject, however, while researching this post I discovered that Andrew Koenig recently wrote some articles on this for DDJ (see Asymmetric Bounds, Part 1). However, I feel these articles are incomplete as they:
  • lack discussion of problems with using asymmetric bounds
  • do not talk about its importance in user interfaces at all
History

Specifying ranges has been problematic for humans for a very long time. These problems often result in the careful addition of either of the words inclusive or exclusive to ranges such as a range of dates. Unfortunately, there is no simple way to express asymmetric bounds for ranges other than to say inclusive start, exclusive end.

Mathematicians have been aware of the advantages of asymmetric bounds for hundreds of years. For example, a piece-wise function may have different formulas for different values of x, such as:

f(x) = 1,  for -infinity < x < -1
f(x) = -x, for -1 <= x < 0
f(x) = x,  for 0 <= x < 1
f(x) = 1,  for 1 <= x < infinity

The interval for each domain is specified using asymmetric bounds. So "0 <= x < 1" means x can take any value from zero (inclusive) up to one (exclusive). An abbreviated notation for this range is [0,1) where square brackets [] indicate that the ends are included and round brackets () indicate they are excluded. Using asymmetric bounds like this is important since a function can only return a single value for any particular value of x.

In the history of computers, the first high-level languages (like Fortran) used inclusive bounds for specifying loops and array sizes etc. On the other hand assembly-language programmers soon discovered the advantages of asymmetric bounds. The insight of Dennis Ritchie meant that asymmetric bounds became an intrinsic part of the C language not only in arrays and for-loops but also in other areas like file-handling. For example, eof() (end of file) does not become true (in C) until you read past the last byte of a file, whereas in Pascal eof becomes true at the last byte.

Fence-post Errors

The main advantage of using asymmetric bounds is that it reduces the likelihood of fence-post errors.

The fence-post error gets its name from the idea of counting the number of posts in a fence. For example, say you have a fence that's eight metres long and the posts are one metre apart. When asked how many posts there are in the fence it is easy to just think eight, whereas I am sure you know it is really nine.


In code, it is easy to make similar mistakes. For example, I have encountered programmers who when asked how many times the loop below executes would say, without hesitation, 12, since 22 - 10 = 12.

Fortran:
  DO I = 10, 22  ...

Pascal:
  for i := 10 to 22 do ...

C/C++/C#:
  for (i = 10; i <= 22; ++i) ...

These are obvious mistakes, but there are situations where this sort of off by one error is not so obvious. Once, you understand the idea, using asymmetric bounds avoids all manner of fence-post errors.

BTW, the convention in C (and derived languages) is to use asymmetric bounds in a for loop, so that the obvious 23 - 10 gives the right answer:

  for (i = 10; i < 23; ++i) ...

Empty Ranges

The use of asymmetric bounds is related to other conventions that are commonly used in C such as zero-based array indices, and even the use of modular arithmetic, such as integer division (/) and remainder (%). Looking at fence posts again it makes sense to number the posts starting at zero since the "first" post is the post zero meters from the start of the fence.


One thing that C does better than preceding languages is not treating zero as a special case. In some other languages you need special code to handle zero related "boundary" conditions. For example, in Fortran a FOR loop always executes at least once, which has caused innumerable bugs and crashes for things like degenerate conditions and empty input. You can read more about this in my blog from last November (see Zero).

One reason that C handles zero correctly is due to the use of asymmetric ranges. An empty range simply occurs when the start and end of the range are equal. (Compare this with Pascal where ranges are inclusive -- the smallest range you can specify has one element.) Hence it is simple to test for an empty range in C by checking that the start and end are equal.

Another advantage is that it can sometimes be useful to distinguish between different empty ranges. If you just use a special value (like a NULL pointer) to represent an empty range then you lose this ability. (See the section on malloc madness in the above-mentioned blog post from November last year).

The Special Value

Another thing that fits well with other conventions in C is the special value that indicates the end of the range. Since it does not indicate a valid value within the range the just past end value is useful to indicate an error, an unknown value or other special condition.

It is a common convention in C to use a value, that does not normally occur, to indicate an error - for example, a NULL (pointer) value or -1 (integer) value. The just past end value is also useful, especially where the type is indeterminate. For example, in C++ STL containers, iterators have an indeterminate type (so that the same code can work with different containers without modification). All containers support the end() function that returns an iterator with the special value. This value usually represents the "just past end" value -- that is, when iterating through the objects of the container it is used to determine that you have iterated past the end -- however, it can also be used to indicate other things such as that a search of the container returned no value.

These ideas are entrenched in the C standard. Examples can be seen not just in the run-time library (such as the behavior of eof() as mentioned above), but also in the language itself. For example, the C standard says that it is perfectly legal to take the address of an element one past the end of an array. Such a pointer can be compared to a pointer to any other element of the array (but cannot be dereferenced).

This idea is used in many libraries, operating systems, databases and other APIs, which present a "list" of items that can be manipulated. Typically, a function or method is provided which inserts an item into the list by specifying which existing item the new item will be inserted before. Specifying the special value the insert() function will append to the list. For example, many Windows controls (such as a list box) take an index of -1 to mean "one past end" -- inserting before this index appends to the end of the list. Similarly many STL containers provide an insert() method which take an iterator as the insertion location or append to the container if passed the special value end().

Example

A common algorithm is a binary search, which is used to find an element in a sorted list (usually an array). I've found that binary searches takes two forms depending on whether you are searching for a specific value or just want to know in which range a value occurs. Note that in either case you need to make use of the special value. When searching for a particular value you would return the special value to indicate that the sought value was not found. When searching to find a range you also need to make use of the special value since the number of possible ranges is one more than the number of elements in the sorted list.

    function bin_search(int tofind)
    {
        int bot = 0;
        int top = values.size();          // top starts one past the end of the vector "values"
        while (bot < top)
       {
               int curr = (bot+top)/2;    // split the range into two (almost) equal sub-ranges
               if (tofind < values[curr])
                     top = curr;              // now check bottom half
               else
                     bot = curr + 1;        // or check top half
       }
       return bot;
    }

Using asymmetric ranges make this code simple and elegant. For comparison, I tried rewriting the above using symmetric bounds and 1-based array indexing (as a Pascal programmer would do) but I gave up after the third try since I kept getting the wrong value or an infinite loop. [If someone gets a working version using symmetric ranges please post it below.]

There are countless other example of where asymmetric bounds are useful. I am sure you will encounter them if you have not already.

Circular Buffer Problem

Despite all the benefits there are a few problems to watch out for.

One example is a circular buffer which is commonly used to implement a queue. A queue is a FIFO (first in - first out) data structure which is easily implemented with a circular buffer where the current output/input locations are simply stored as an index into the buffer. (The buffer is typically just an array). Adding to (or removing from) the queue is simply a matter of writing (or reading) the current item and incrementing the input (or output) location. Of course, these locations must wrap around when they reach the end of the buffer (which is why it is called a circular buffer). The wrap around is easily accomplished using modular arithmetic (ie, the % operator in C).

A circular buffer is actually a good example of using asymmetric bounds since the end of the current range of valid items is the same as the current input location. It is also a simple matter to check whether the buffer is empty since this occurs when the input and output locations are the same. However, there is a problem - can you see it?

The problem is that when the buffer becomes full then the input and output locations are also the same. So without further logic there is no way to distinguish between a buffer completely full and a buffer empty condition. This is a trap that I have fallen into as probably have many others.

Binary Heap

Speaking of queues reminded me of another problem. A binary heap (not to be confused with the C run-time memory heap) is normally a very good way to implement a priority queue (not to be confused with the FIFO queue we saw above). The algorithms for a binary heap are a rare example where zero-based array indexes are not a good idea.

I first wrote code that implements a binary heap in Pascal and it was simple and elegant. A binary heap is essentially a binary tree structure stored in an array. Since arrays in Pascal are 1-based the root element is at index 1 (the start of the array). Given the index of a node it is simple to find the index of its two children - the first child has an index which is twice that of it's parent, and the other child has an index one more than that. (This is efficiently accomplished using bit-manipulation by shift left by one bit and filling the new bit with either 0 or 1.)

Similarly, to find the parent of a node simply divide by two (using integer division which throws away any remainder), or perform a right-shift.

At a later time I translated this code to C. Of course, C arrays are zero-based so I started with the root of the tree in index zero. However, this really made a mess of the code to find the children/parent of a node. The solution was simply to use one-based indexing and ignore the array element with an index of zero.

    void heap_add(int n)
    {
         // heap full check needed here

        int hole = ++current_size;                     // start at end
        while (hole > 1 && n < values[hole>>1]) // stop at root or at a higher priority item
        {
            values[hole] = values[hole>>1];         // move lower priority item up
              hole = hole>>1;                             // go to parent
        }
        values[hole] = n;
    }

This is the only algorithm I can recall where using one-based array indexing worked better than zero-based. Also note that this does not use asymmetric bounds since current_size indicates the valid top element in the values array, rather than one past the end.

User-Interface Examples

There are countless places where user-interfaces can make use of asymmetric bounds - for example, almost anywhere that a user can manipulate an array or list of items.

An example, with which you are undoubtedly familiar, is a text control (or, similarly, a line in a text editor or word processor). If the control contains, for example, the string "Hello" then there are six places for the cursor - at each of the characters and also after the last character. Similarly, selecting some of the characters (using the mouse or keyboard) uses asymmetric bounds since the end of the selection is marked by the cursor position of the character one past the end of the selection. Further, when there is no selection the current cursor position (sometimes called the caret) can be seen to indicate both the start and end of an empty selection.

Similarly, any sort of editable list such as a list-box should have an empty position one past the last valid entry. When you "insert" a new item at this point it is appended to the end of the list.

Visual Studio Bookmarks

I use bookmarks in Visual Studio a great deal. I think it was Visual Studio 2005 that introduced a modeless Bookmarks window (similar to the one I added HexEdit a few years earlier) containing a list of all your bookmarks.

I really appreciate the VS bookmarks list apart from one peculiarity. You can reorder the bookmarks to your liking by dragging them in the list but when you drag a bookmark to a new location it places it after the item you dragged onto. This means to drag to the end of the list you drag onto the last item not past it. Further, you can't move an item to the top of the list (with a single drag) - see the screenshots in which I tried to drag "InitInstance" to the top of the list but it ended up underneath the top item.

To me this design is astonishing. Even if the developer(s) who designed it knew nothing about user-interface design you would think that they would be familiar with C, C++, or C# and have a rudimentary appreciation of the elegance and simplicity of using asymmetric bounds. This behavior of Visual Studio bookmarks confuses and annoys me every time I use it.

Conclusion

Use of asymmetric bounds and zero-based indexes greatly simplifies all sorts of problems. Most algorithms that involve any sort of list or range of values are more simply implemented, apart from a few special cases mentioned above.

Moreover, any user interface design that manipulates a list or array of objects should also use asymmetric bounds. When there is a current location (ie, cursor or caret) that the user uses to interact with the list then adding an item at the current location should insert before the current item; adding at the special location past the end should append to the list.

No comments:

Post a Comment