Friday 18 November 2011

Zero

Zero may be nothing to you (pun intended) but it has caused many problems throughout history.  In computer programming it is particularly important, since handling it incorrectly can cause minor bugs or worse.  In fact, in one well-documented disaster a rocket was lost due to a software defect related to the incorrect use of zero in a Fortran program.

So at the risk of raising much ado about nothing (I promise: that is the last stupid pun :) I will give some examples of the sort of things that can go wrong if you don't handle zero properly.  But first a brief history.

History of Zero

First, we should distinguish between zero as a digit and zero as an integer.  The symbol for zero can be used just as a digit, ie, a placeholder in a positional numbering system (eg, the zero in 307 indicates that there are no "tens").  A symbol for an empty position was first used by the Babylonians.  But the idea of zero as an integer in its own right was not invented until later, by the Mayans and, independently, in India.

The Sumerians were probably the first to use a positional number system, but not having a symbol for zero created problems.  If you wanted to send a message to the market that you had 30 goats to sell, there was no way to distinguish between 30 and 3.  A farmer could avoid this problem by instead selling 29 or 31 goats, but as numbers became more important the lack of a zero became a serious limitation.

The Babylonians started with a similar system to the Sumerians, but eventually used a small mark to signify an empty position in a number.  Greek astronomers were the first to use a circle for zero as a placeholder, but it was not until centuries later that Indian mathematicians treated zero as a full integer like one, two, etc.  One reason mathematicians probably avoided zero was the problem of what it means to divide a number by zero, or the even weirder idea of 0/0 (zero divided by zero).

Zero in Algorithms

Sometimes having empty input makes no sense.  For example, how do you take the average of zero numbers?

  int average(int[] array)
  {
      int sum = 0;
      for (int i = 0; i < array.length; ++i)
          sum += array[i];
      return sum / array.length;
  }

The above code will have a divide by zero error at run-time when presented with an empty array.  We can easily fix this:

  int average(int[] array)
  {
      if (array.length > 0)
      {
          int sum = 0;
          for (int i = 0; i < array.length; ++i)
              sum += array[i];
          return sum / array.length;
      }
      else
          return 0;  // May be more appropriate to return -1 to indicate an error, or throw an exception
  }

Now suppose we want to find the sum of an array of numbers.  Having just coded the above average() function we might be tempted to write:

  int sum(int[] array)
  {
      if (array.length > 0)
      {
          int sum = 0;
          for (int i = 0; i < array.length; ++i)
              sum += array[i];
          return sum;
      }
      else
          return 0;
  }

However, in this case, taking the sum of zero numbers does make sense.  The code should simply be written as:

  int sum(int[] array)
  {
      int sum = 0;
      for (int i = 0; i < array.length; ++i)
          sum += array[i];
      return sum;
  }

[Actually, doing it the first way would be better in Fortran as we will see below.]

I will now look at a few ways some programming languages do not handle zero well.

Fortran

Fortran was the first programming language I learnt, so you would expect me to have a certain fondness for it.  Unfortunately, I always found it very peculiar.  One of the stupidest things in Fortran (at least in the older versions I used) is that a DO loop is always executed at least once, even when the loop count is zero.  This has resulted in countless bugs where boundary conditions were not handled properly.  To avoid such bugs you usually have to wrap the loop in an IF statement that guards against the empty case.

Pascal

Soon after learning Fortran I learnt Pascal, and it was an enormous relief.  It is much simpler to understand, more elegant and, of course, it handles FOR loops correctly.  However, I later came to appreciate that it has some deficiencies compared to C-like languages.  Ranges in Pascal are specified using inclusive bounds, so, for example, the range "1..10" means the integers from 1 to 10 inclusive.  You can express a range with one element, such as "1..1", but how do you express a range with zero elements?

A common mathematical notation is to express ranges with square brackets when the ends are included and round brackets when the ends are excluded.  Many things are simplified when an inclusive lower bound and an exclusive upper bound are used.  Some people refer to this as using "asymmetric bounds".  For example, the set of numbers from zero (inclusive) to one (exclusive) is shown as [0, 1).

Now you might wonder why you would want to express a range with no elements.  Actually, it is often very important to be able to do this sort of thing.  Many algorithms work on data sets that may have zero elements as a degenerate case.  If these cases are not handled then the code will fail when presented with empty input.

It is better for many reasons to use asymmetric bounds for ranges (as is conventional in C), and I will discuss these reasons in a later post.  But one of the main advantages is that you can easily specify an empty (ie, zero-length) range.
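
For example, here is a minimal C sketch (the function name sum_range is invented for illustration) that sums the half-open range [begin, end) of an array.  When begin equals end the range is empty and the function naturally returns zero:

  #include <stddef.h>

  /* Sum the elements in the half-open range [begin, end).
     When begin == end the range is empty, the loop body executes
     zero times, and no special case is needed. */
  int sum_range(const int *array, size_t begin, size_t end)
  {
      int total = 0;
      for (size_t i = begin; i < end; ++i)
          total += array[i];
      return total;
  }

With inclusive bounds, on the other hand, the only way to ask for "no elements" is something awkward like a lower bound greater than the upper bound.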

Brief

In my first job I used to edit my C code in a word processor (WordStar), and this was a very painful experience.  Soon after its release (1985?) I started using a programmer's editor called Brief, which was simply wonderful in comparison.  (Perhaps the best thing was the unlimited undo, which was not restricted to changes in the text but also covered things like changes to the selection and layout - in fact the Undo in my hex editor, HexEdit, is modelled closely on it.)

Brief allowed you to "mark" text and copy or append it to its internal "clipboard".  There were several ways of marking text including "non-inclusive" mode.  Non-inclusive mode was very useful as it used the asymmetric bounds pattern mentioned above.  This allowed you to mark the start and one past the end of a block of text for further processing.

I found non-inclusive mode very useful for creating macros in Brief's built-in programming language.  I created several macros that built up text in the clipboard by appending to the clipboard in non-inclusive mode.  But there was one major problem - copying or appending to the clipboard generated an error when the start and end of the marked range were at the same place.  That is, you could not copy a zero-length selection.

I even wrote to the creators of Brief (who were obviously very clever guys for creating such a brilliant tool).  The reply I got was basically that there could be no use for a selection of zero length.  Of course, they were wrong, as many of my macros failed whenever they attempted to append a zero-length selection to the clipboard.

The moral is that even very clever people may not appreciate the importance of zero.

C

One of the good things about the C language is that in almost all cases zero is handled correctly.  Experienced C programmers usually reach the point where they need only give cursory consideration to boundary conditions.  Generally, if you write the code in the simplest, most obvious way it will just work for all inputs, with no special code for boundary conditions.

Even the syntax of the language adopts the "zero is a normal number" approach.  For example, in C the semi-colon is used as a statement terminator (rather than a statement separator as in Pascal).  A compound statement which contains N statements will have N semi-colons - so a compound statement with no nested statements is not a special case.

Consider compound statements in Pascal and C:

Pascal:
  begin statement1; statement2 end     (* 2 statements, 1 semi-colon *)
  begin statement1 end                 (* 1 statement,  0 semi-colons *)
  begin end                            (* 0 statements, can't have -1 semi-colons! *)
C:
  { statement1; statement2; }          /* 2 statements, 2 semi-colons */
  { statement1; }                      /* 1 statement,  1 semi-colon */
  { }                                  /* 0 statements, 0 semi-colons */

How can this really be that important?  One situation where it matters is if you have software that generates C code.  In some circumstances you may need to generate a compound statement containing no nested statements.  The syntax of C means you don't need to handle this boundary condition specially.
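
As a rough illustration (a hypothetical generator; emit_block() and its parameters are invented for this sketch), emitting a compound statement for n generated statements needs no special code when n is zero:

  #include <stdio.h>

  /* Emit a C compound statement containing n generated statements.
     Because '}' simply follows zero or more terminated statements,
     n == 0 produces a valid empty block with no special case. */
  void emit_block(const char *stmts[], int n)
  {
      printf("{\n");
      for (int i = 0; i < n; ++i)        /* executes zero times when n == 0 */
          printf("    %s;\n", stmts[i]);
      printf("}\n");
  }

A generator emitting Pascal, by contrast, would have to treat the first (or last) statement specially to avoid emitting a spurious separator.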

Also if you have ever edited much Pascal code you will know how painful it is to add (or move) statements at the end of a compound statement.  You have to go to the end of the previous line and add a semicolon.  Worse, if you forget to do so, as usually happens, you get syntax errors on the next build.

Note that C does not take this to extremes though.  Commas in function parameter lists are separators not terminators.

  func(param1, param2);   // 2 parameters, 1 comma
  func(param1);           // 1 parameter,  0 commas
  func();                 // 0 parameters, can't have -1 commas!

But you can use commas as terminators in array initialisation, though the last comma is optional.  This makes it easy to edit arrays, such as appending and reordering the elements.

  float array[] = {
    1.0,
    2.449489742,
    3.141592653,
  };

Malloc Madness

Given C's very good behaviour when it comes to zero, it was very surprising to find a certain proposal in the draft C standard in the late 1980s.  Some on the C standard committee were keen to have malloc(0) return NULL.

All C compilers until then had based their behaviour on the de facto standard, ie, the behaviour of the original UNIX C compiler created by Dennis Ritchie (often referred to as K&R C).  That behaviour was for malloc(0) to return a valid pointer into the heap.  Of course, you could not do anything with the pointer (except compare it to other pointers) since it points to a zero-sized object.

In my opinion, anyone on the committee who was in favour of malloc(0) returning NULL showed such a lack of understanding of the C language, and of the importance of correctly handling zero, that they should have immediately been asked to leave.  At the time I scanned a fair bit of source code I had already written and found that more than half of the calls to malloc() (about 100 from memory) could be passed a value of zero under boundary conditions.

In other words there were calls to malloc() like this:

  data_t * p = malloc(nelts * sizeof(data_t));

where the algorithm relied on nelts possibly having the value zero and the return pointer p not being NULL in this case.
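
To make this concrete, here is a hedged sketch (copy_array() is an invented name; data_t is as in the call above) of the kind of routine that relies on the traditional behaviour:

  #include <stdlib.h>
  #include <string.h>

  typedef int data_t;

  /* Copy an array that may legitimately be empty (nelts == 0).
     Under the traditional (K&R) behaviour the returned pointer is
     NULL only on a genuine allocation failure. */
  data_t *copy_array(const data_t *src, size_t nelts)
  {
      data_t *p = malloc(nelts * sizeof(data_t));
      if (p == NULL)
          return NULL;                             /* out of memory */
      memcpy(p, src, nelts * sizeof(data_t));      /* copies nothing when nelts == 0 */
      return p;
  }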

The main problem with the proposal is that malloc() can also return NULL when memory is exhausted.  Lots of existing code would have to be changed to distinguish the two cases where NULL could be returned.  So code like this:

  if ((ptr = malloc(n)) == NULL)
      // handle error

would have to be changed to:

  if ((ptr = malloc(n)) == NULL && n > 0)
      // handle error

However, a further problem with the proposal was that it would then be impossible to distinguish between different zero-sized objects.  A simple workaround would be:

  #define good_malloc(s) malloc((s) == 0 ? 1 : (s))

Of course, there is no point in creating a macro to fix a broken malloc() when malloc() should behave correctly.

There is an even more subtle problem.  Can you see it?

Luckily, sense (partially) prevailed in that malloc(0) is not required to return NULL.  There was some sort of compromise and the behaviour is "implementation-defined".  Fortunately, I have never heard of a C implementation stupid enough to return NULL from malloc(0).

C#

I have always liked the C# language and the .Net CLR, ever since I first read about them in the C/C++ Users Journal.  For example, see my article from 2004 at http://www.codeproject.com/KB/cs/overflow_checking.aspx.  However, even in C# there are problems with the handling of zero.  Actually, the problem below is more to do with the .Net CLR than C# per se.

Consider System.Object, the base class of all data types.  This is effectively an "empty" object since it has no data fields.  However, the CLR's default implementation of System.Object.Equals does a silly thing: it just compares references.  This means that two objects will not compare equal unless they are the very same instance.

Side Note: It could be argued that returning false makes sense if the objects have different types (see System.Object.GetType); but it could also be argued that having zero oranges is the same as having zero apples. In any case the derived class version of Equals() should already have handled this situation.

Nothing should always be considered equal to nothing, hence System.Object.Equals() should always return true.

Now you might argue that you can't create a System.Object by itself and that a derived type (which we will call Derived in the example below) should properly implement its own version of Derived.Equals() that compares its members appropriately.  The problem with this is that if all the members compare equal then the base class version of Equals() should be called like this:

  // In class Derived; "value" is a hypothetical member used for illustration
  public override bool Equals(object obj)
  {
      Derived other = obj as Derived;
      if (other == null || other.value != this.value)
          return false;                // different type or members differ
      return base.Equals(other);       // all members equal, so defer to the base class
  }

However, if base.Equals() is System.Object.Equals() then base.Equals() will return false (unless this and other are the same instance), which is wrong.  OK, why don't we forget calling the base class and return true?

But there is still a maintenance problem.  What if Derived is later changed to derive from another class?  Derived.Equals() should now call base.Equals(), but it would be easy for this necessary change to be missed.

Things would be so much simpler if System.Object.Equals() simply returned true.

BTW my rules of thumb when implementing Equals() are as follows:

1. Always override Equals() (and GetHashCode()) for your class.
2. Never call base.Equals() if you don't derive from another class (implicit System.Object derivation).
3. Remember to call base.Equals() if your class derives from another class.
4. Be careful with Equals() when inserting a new derivation between a derived class and System.Object.

Saturday 15 October 2011

Handling Software Design Complexity

I decided to start my first real post with the subject I consider to be at the very core of software development, but which is rarely discussed.  To get started, I ask you this question:

What is the main problem facing software development?

Ask a programmer, a software project manager, a systems analyst, a system architect, an academic or almost anyone who knows anything about software development and you will get various answers along the lines of:
  •  insufficient analysis and poorly understood requirements
  •  poorly documented requirements
  •  poor communication, especially of requirements
  •  poor estimates
  •  unrealistic deadlines (often due to poor estimates)
  •  inadequate testing (often due to unrealistic deadlines)
  •  changing requirements
  •  unmaintainable software (often due to frequently changing requirements)
(Please post any more below, if you think I have missed something important.)

If you look at all these answers you can see that the root cause is simply complexity.  The complexity of the original problem makes it hard to understand and communicate to others, which has the knock-on effect of poor estimates, etc.  Coping with change just magnifies the complexity.

The significant problem of unmaintainable software is, I believe, also due to complexity.  This can be, and often is, due to large amounts of duplicate and even unnecessary code.  However, the complexity may just be due to the design of the software, or simply to the design not being well understood - after all, complexity is just a matter of understanding.

Coping with Complexity

So how do we cope with complexity?  In brief, and in general, we use a fundamental principle that has been used for thousands of years - divide and conquer.  An example of its importance is the division of labour, without which the rise of human civilisation would not even have been possible.

How can we use the divide and conquer principle in software development?  You have probably already guessed that the answer is we already do so, in a myriad of ways.  Thousands, if not millions, of people have been involved in creating all the things that go towards someone creating a piece of software - from the people who created the operating system, compiler, run-time libraries, etc, through to the hardware itself, all the way to your mum, who made the breakfast that keeps your neurons firing.

The principle of divide and conquer is also at the core of the actual design of software.  In fact just about everything you have read, or will ever read, about creating software is concerned with the best way to break down a problem into manageable sub-problems.  This goes by names like information hiding, modules, black boxes, separation of concerns, encapsulation, decoupling, cohesion, modularity and abstraction.  It's also the primary concern of modular programming, orthogonality, abstract data types, component-based software engineering, and object-oriented programming.  In a way it is even the basis of the most important paradigm in the history of programming, that of structured programming.

Of course, there are other ways to deal with complexity, but divide and conquer is by far the most important.  Before I look at some others though, I need to back-track a bit to consider the nature of complexity.

Understanding

The human brain is very powerful but it is limited in that it can only process a small amount of information at a time.  Luckily, due to its ability to learn and memorize, it can accomplish many tasks, just not all at the same time.  (Notwithstanding individual differences, which may have a slight sexual bias, I think the male and female brain are essentially the same in this regard.)

My definition of the complexity of a problem is how hard it is to understand.  Given a complex problem we can often divide it into sub-problems, so we don't need to understand the whole problem at once.  That is, given simple enough sub-problems the brain need only understand each individually, as well as how they are combined to form the main problem.

So even if a problem is too complex for an individual to solve all at once, if it can be broken down into simple sub-problems, each of which can be solved one at a time, then the whole problem can eventually be solved.  In fact, different sub-problems may even be solved by different individuals.  Again, this is the principle of divide and conquer.

I mentioned above that there are other ways to deal with complexity.  Though I am not an expert on this I can think of a few:

* simplification, or removing information superfluous to the problem
* analogy, or finding a similar problem that is easier to conceptualize
* symbols, since the brain is highly visually-oriented, things like diagrams and notations may assist understanding
* commonality, or recognizing sub-problems that have the same solution

Conclusion

In brief, by far the main problem faced in creating software is complexity.  The main way we deal with complexity is using the principle of divide and conquer.  The main approach to creating software is (using divide and conquer) to create small well-defined components that perform one simple task and that expose a simple well-understood interface.

This is much easier said than done, since the best way to split a problem into sub-problems is not always obvious.  To some extent you have to obtain an understanding of the overall problem to be able to recognize the sub-problems and how to minimize the coupling between the different parts.  Moreover, some problems seem to be non-decomposable, though I believe that there are few, if any, problems that are inherently so.  An understanding of techniques and previous experience with analyzing similar problems are important for anyone designing software.

The study of the best way to create small, well-defined components (ie the best way to do the dividing) is the principal concern of software designers, from system architects down to lowly programmers.  It is the raison d'être for almost all the various programming techniques, concepts, processes, paradigms, etc that are continually being invented.

Friday 14 October 2011

Introduction

In 1993 the UTS (University of Technology, Sydney) started offering a part-time, post-graduate course in SQA (software quality assurance).  I had been working as a programmer for ten years but doing this course really opened my eyes to a whole lot of things I was not even aware of.

One of those things was software development methodologies, since one-sixth of the course was devoted to that very subject.  Of course, I had encountered some of the highly detailed waterfall-based methodologies in the past, but the reality was that nobody really used them, and all the development I had done (in both small and large organizations) had been a little bit structured (waterfall) but mainly ad hoc.

This was why I was very interested in the so-called iterative methodologies that were appearing at that time.  Being the team leader for a small project at the time, I actually tried very hard to use Barry Boehm's "spiral" methodology, which I believe became the basis (or one of the bases) for IBM's Rational Unified Process (RUP).

Unfortunately, my experiment with methodologies turned out to be disastrous and the project I was working on was cancelled, for which I take full responsibility.  The problem was probably more my inexperience than anything, but at the time my feeling was that Fred Brooks might have been right when he said there is "no silver bullet".

The Blog

Anyway, a lot has happened since then (in particular, agile methodologies), so I have decided to create this blog on different ideas and experiences I have since had about software development.  (I already have a blog on the development of my freeware/shareware product HexEdit, but there was a lot of stuff that I could not squeeze into that context.)

So, I think this blog will mainly be about development methodologies.  It will probably deviate into other areas of software development like architecture and other aspects of software quality assurance.