Saturday 13 July 2013

Nesting Functions Using Lambdas

A major problem with a lot of C code has always been that functions tend to become too long. It is sometimes difficult to move code into sub-routines if there are a lot of local variables you need to access. This is one advantage Pascal has over C, because it supports nested functions. But using lambda functions (currently supported in C++ and C#) you can achieve the same effect.

My rule of thumb is that functions should be 10 or less lines of code, but occasionally it makes sense for them to extend to 20 or even 30 lines.
Due to the ability to capture local variables lambda functions can be used to create nested functions. Lambda functions are provided in C++ and C# but not in C. (They are also rumored to appear in Java in the future.)

Lambda Functions

Last month I talked about the new lambda functions in C++11. It is great to be able to create simple unnamed functions especially when calling STL algorithms. If you create a lambda function like this:

   [] (int x, int y) { return x < y; }

the compiler will automatically create a little function and insert a call to the function at the point of its use. Note that the function does not need a name as it is defined and used right where it is required. The compiler infers that it has a boolean return value from the type of the return statement. (It also takes two integer parameters.)

But lambdas have more tricks than that. They can also capture variables from the containing scope where the lambda function is declared and used. For example:

   int value = something();
   ...
   [=] (int x) { return x < value; }

In this case the compiler has to create a temporary object that saves value in a private member and implements operator() in order for the function to be called. It is this ability to capture variables that allows lambda functions to be used as nested functions.

Nested Functions

If you haven't ever used a language that supports nested functions (like Pascal) then you probably don't realize what you are missing. If you find a function is becoming too big, it is a simple matter to split some of the code off into a nested function. Since a nested function has access to all local variables of its parent (and any ancestor functions) this is trivial to do. (If you have read much of this blog then you may know that I don't like Pascal much but this is one area where Pascal is better than C.)

In C, or any language that does not have nested functions, creating a sub-routine is a lot more difficult if it needs access to many local variables of the "parent" function. This is probably the greatest contributor to the fact that C functions are generally too long. When a function becomes too big it should be split up. (However, see Layer Anti-Pattern which explains that the obvious way to do this, using layers, is not always the best way.)

There are several ways you can give access to local variables in a sub-routine, such as:
  • pass a long list of local variables as parameters to the sub-routine
  • put all required local variables in a struct and pass a pointer to it
  • move the variables outside the function giving them file scope
  • make the variables into private member variables (in C++) to give all class methods access to them
None of these are good solutions in general, but let's consider each of these in a bit more detail to see why.

1. Pass as Arguments

Passing a long list of arguments is a bad idea. The more parameters a function has the harder it is to read the code. It is also tedious and error-prone to put the parameters at the right position in the function call. It's generally accepted that functions should have at most two (perhaps three) parameters.

Further, if any of the variables need to be modified, then they need to be passed by reference (ie, pointer or C++ reference). These sort of output parameters are also error-prone. Output from a function is conventionally performed through its return value. Having output also occur somewhere in a long list of parameters is confusing.

Finally, it can be inefficient, especially if some parameters are large objects (and not passed by const pointer/reference).

2. Pass in a Struct

The local variables to be shared could all be placed in a local struct. A pointer (or reference in C++) can be passed to the called function. The pointer can be const if none of the values are to be modified.

This is probably the best solution. It is fairly straightforward and efficient. However, it is tedious to set up and it sometimes makes little sense to put unrelated variables together in the same structure.

3. Use File Scope

The most common approach is to move shared variables outside the function to make them global. For example, if you have a function like this:

void g()
{
   int local = 1;

   for (...
      local *= something_complicated() + something_else();
   ...
}

You could move some of the code from g() into a new function, do_something() and give access to local as in the code below. (Note that this code has several problems, and at least one bug.)

int local = 1;

void do_something()
{
   local *= something_complicated() + something_else();
}

void g()
{
   for (...
      do_something();
   ...
}

The first problem with this is that the variable local now has global lifetime and is now only initialized once. Unless it is re-initialized every time g() is called then this change will probably cause a bug.

The other problem is that local and do_something() should be declared static. This is important so that the interface to the module is better defined. If this is not done then local can be modified and do_something() called from anywhere else in the program. This could be due to a deliberate attempt to bypass the official interface to the module, or it could be accidental due to name conflicts (ie, other global variables with the same name).

This is better:

static int local;

static void do_something()
{
   local *= something_complicated() + something_else();
}

void g()
{
   local = 1;

   for (...
      do_something();
   ...
}

That's better, but there is still a potential problem with g() -- it is not re-entrant. A re-entrant function is one that is called again while an earlier call is still active.

The most obvious form of a re-entrant function is simple recursion. However, functions can be re-entered in other ways, such as if the same function is called from different threads.

A common problem in Windows software is where a function is unwittingly re-entrant. This occurs when a function (directly or indirectly) makes a Windows system call that causes a message to be sent back to the caller and the message handler ends up calling the same function again. I have encountered several nasty problems of this nature in the past.

For the above reasons you should avoid file scope variables (and global/static variables in general). There is something comforting about having all you variables declared locally and safely tucked away on the stack.

4. Use Member variables

In C++ (and C#) there is another solution - use private member variables. This is an enormously popular approach and possibly one of the main reasons that people prefer C++ to C. However, I do not think it is always appropriate either.

For example, I have seen large classes, with hundreds of methods (member functions) that tuck away numerous private member variables where each is just used in two or a few methods. In fact, in my early C++ days, I created such classes (see the source code for my hex editor, which you can download from hexedit.com). A member variable that is used by only a small subset of all member functions is indicative of a bad design.

My rule of thumb is that a member variable should be used in at least half of all member functions.
In essence, using member variables like this is no that much better than that most heinous of crimes - using global variables. See my previous post if you don't know why global variables are evil.

Nested Functions Using Lambdas

The above discussion should have convinced you that there is no simple and elegant solution to nesting functions in C and even C++/C#. The advent of lambda functions provides that solution.

As I mentioned above a lambda function can refer to local variables of the scope in which it is declared. The compiler (in both C++ and C#) creates a special object to allow you to use any referenced variables from within the lambda function. In C# a reference to all the variables is stored in the object, but in C++ you can access them by value, reference or even const reference.

But enough of my confusing babbling. An example is the best way to explain what I mean. The example below shows how a long function can be made more readable by using lambda functions to create nested functions.

Example

It was difficult to find an example with which to demonstrate this problem and the solution. (The function must be necessarily long to demonstrate the problem we are trying to solve, but not too long as to be completely unwieldy.) After quite a bit of searching through old C code I found the function below, which, given an RGB color, generates the corresponding HLS values (hue, luminance and saturation). Note that this code was originally written in C a long time ago, but I have converted it to C# (as I don't have access to a C++ compiler that supports lambda functions at the moment).

First, here is the long version in C# (without lambdas). It is a bit too long at over 40 lines. Using nested functions could make the design more modular and the logic a bit clearer.

static void GetHLS(Color col, out int hue, 
                   out int luminance, out int saturation)
{
   const int hlsmax = 100; // Determines range of return value
   const int rgbmax = byte.MaxValue; // max of R, G, or B = 255

   int cmin = Math .Min(col.R, Math.Min(col.G, col.B));
   int cmax = Math .Max(col.R, Math.Max(col.G, col.B));
   int cdiff = cmax - cmin;

   luminance = ((cmax + cmin) * hlsmax + rgbmax) / (2 * rgbmax);

   if (cmax == cmin)
   {
      hue = -1;
      saturation = 0;
   }
   else
   {
      // Work out saturation
      if (luminance <= hlsmax / 2)
         saturation = (cdiff*hlsmax + (cmax + cmin)/2) / 
                                           (cmax + cmin);
      else
         saturation = (cdiff*hlsmax + (2*rgbmax-cmax-cmin)/2) / 
                              (2*rgbmax - cmax - cmin);

      // Work out hue
      int Rdelta = ((cmax - col.R)*(hlsmax/6) + cdiff/2)/cdiff;
      int Gdelta = ((cmax - col.G)*(hlsmax/6) + cdiff/2)/cdiff;
      int Bdelta = ((cmax - col.B)*(hlsmax/6) + cdiff/2)/cdiff;
      if (col.R == cmax)
         hue = Bdelta - Gdelta;
      else if (col.G == cmax)
         hue = (hlsmax / 3) + Rdelta - Bdelta;
      else
         hue = ((2 * hlsmax) / 3) + Gdelta - Rdelta;

      if (hue < 0)
         hue += hlsmax;
      if (hue > hlsmax)
         hue -= hlsmax;
   }
}

In the code below I have used lambdas to create six nested functions.

static void GetHLS(Color col, out int hue,
                   out int luminance, out int saturation)
{
   const int hlsmax = 100; // Determines range of return values
   const int rgbmax = byte.MaxValue;

   int cmin = Math.Min(col.R, Math.Min(col.G, col.B));
   int cmax = Math.Max(col.R, Math.Max(col.G, col.B));
   int cdiff = cmax - cmin;

   // Work out luminance first - required for others calcs
   int lum = ((cmax + cmin) * hlsmax + rgbmax) / (2 * rgbmax);

   // Nested functions
    Func<bool> IsGray = () => cmax == cmin;
    Func<int> Rdelta = () => ((cmax - col.R)*(hlsmax/6) + cdiff/2) / cdiff;
    Func<int> Gdelta = () => ((cmax - col.G)*(hlsmax/6) + cdiff/2) / cdiff;
    Func<int> Bdelta = () => ((cmax - col.B)*(hlsmax/6) + cdiff/2) / cdiff;
    Func<int> GetSaturation = () =>
    {
      if (IsGray())
         return 0;
      else if (lum <= hlsmax / 2)
         return (cdiff*hlsmax + (cmax + cmin) / 2) / 
                                      (cmax + cmin);
      else
         return (cdiff*hlsmax + (2*rgbmax - cmax - cmin)/2) /
                                    (2*rgbmax - cmax - cmin);
    };
    Func<int> GetHue = () =>
    {
      if (IsGray())
         return -1;

      int h;  // Calculated hue value
      if (col.R == cmax)
         h = Bdelta() - Gdelta();
      else if (col.G == cmax)
         h = (hlsmax / 3) + Rdelta() - Bdelta();
      else
         h = ((2 * hlsmax) / 3) + Gdelta() - Rdelta();

      if (h < 0)
         return h + hlsmax;
      else if (h > hlsmax)
         return h - hlsmax;
      else
         return h;
    };
   // End of nested functions

   luminance = lum;
   saturation = GetSaturation();
   hue = GetHue();
}

Note that, in this version, I have changed the logic a little. First, I create the IsGray() lambda function, which in this version is called twice (whereas the equivalent code was only called once in the original code above). This is somewhat compensated for by the fact that only two of the three delta functions (Rdelta, Gdelta, Bdelta) functions are called.

Note that creating the lambda functions will probably reduce the performance of GetHLS() to a small extent, due to the indirection of the function calls.

Conclusion

Lambda functions have another use beyond their obvious one of being able to create simple in-line, unnamed pieces of code. You can use them to create nested functions.

This is useful in C++ (if C++11 features are available to you) and C#. It is a shame you still can't do this easily in C - though you could compile C code as C++ just to use lambdas.

Saturday 6 July 2013

Best Practice in C for Modules

Today I discuss best practice in designing and implementing modules in C (and C++). The most important aspect is that each module should have a simple, well-defined interface given by its header file.

Modules and Interfaces in C

Last year I talked about how crucial the principle of divide and conquer is to software design (see Handling Software Complexity). To divide a problem, software is split into pieces that can, to some extent, be isolated from each other. These pieces can take many forms and have different names but I try to stick to the generic term module. Modules connect to each other by means of interfaces. The idea that you only need to understand the interface of a module, rather than how it is implemented internally, isolates the complexity. This makes designing complex software not just easier but possible.

The idea of using modules to create software is almost as old as software itself (despite what OO adherents would have you believe). There are many benefits, but the main one is information hiding; a simple well-defined interface hides the complexity of how a module is actually implemented. (There are other benefits, such as facilitating polymorphism, which I will not go into now - maybe in a future post.)

The advantages of information hiding are large, and well-documented, which makes it all the more surprising that I continue to encounter C code that ignores best practice in this area which was established 25 or 30 years ago. Let's start by looking at how interfaces are defined, using header files, then look at poor practices and the resulting problems.

Header Files

C (and C++) have always used header (.H) files to provide the interface for modules. Small modules are created using a single source (.C) file. Each module (.C file) should have its own header file which exposes its interface, or what can be used from the outside.

A larger module may combine several smaller modules and generate a (link-time or run-time) library. A library is typically implemented as a separate project with its own header file that exposes its public interface. For large pieces of software you may need even larger modules composed of multiple libraries, where you could use a single header file that #includes the header files for all the parts that are exposed publicly. An example is MS Windows which is composed of numerous DLLs (run-time libraries) the interfaces to which are exposed through the header file Windows.h.

The crucial thing is to have good header files that expose well-defined interfaces. Next I will look at some cases where header files are used poorly (or circumvented entirely).

Function Declarations

When I first used C, calling other modules was error-prone. Functions were declared in a header file so that when the compiler built other modules it knew the type of value it returned but you could not declare what parameters a function took. In other words the C language did not allow you to precisely define the interface to a module. (The later development of function prototypes during the development of the first C standard in the late 1980's addressed this problem.)

Here is an example of the sort of nasty bug that I and others encountered before the addition of function prototypes to C:

/* module.c - implementation of module */
int func(long value)
{
   ...

/* module.h - interface to module */
int func(); /* declares return type but NOT parameters */
   ...

/* caller.c - user of module */
   ...
   int small_value = 8;
   ...
   func(small_value);

The above code would almost always work but very occasionally would crash or behave very strangely due to the fact that func() was getting large values. Because the problem was not reproducible it was difficult to track down.

In brief, this was caused by several things:
  1. In this environment int was 16 bits and long was 32 bits. The caller was pushing 2 bytes on the stack (see diagram) but the callee was expecting 4 bytes.
  2. C calling conventions are such that the caller pops parameters off the stack. The callee doesn't care how many bytes were pushed.
  3. Little-endian byte order means that an integer value looks exactly like the same long value if it happens to be followed by two zero bytes.
  4. The bytes following the integer on the stack were almost always zero, but very occasionally were not.


The net effect was that func() would almost always get the correct value passed to it since the bytes following the 2 bytes were zero making small_value look like the equivalent 32-bit value. Only when the bytes following it on the stack were not zero would something strange happen.

This sort of bug is exactly the reason that function prototypes were invented. They allow a better module interface:

/* module.h - interface to module */
int func(long);

Now the compiler can ensure that parameters of the correct type are being provided, and even provide implicit conversions if available.

Note that, even after the advent of function prototypes, this problem can still occur when using functions that take a variable number of parameters. Function prototypes only allow a fixed number and type of parameters to be specified. The compiler is unable to check that anything is wrong with this:

   printf("%s", 42);


  • declare function prototypes for all functions (the compiler should enforce this anyway)
  • be very careful calling functions that take a variable number of parameters


Duplicate Declarations

A closely related problem is when a function is declared in multiple places. I recently encountered a problem because a function that should have been private to a module was being called from outside the module like this:

/* module.h - interface to module */
int f(int);

/* module.c - implementation of module */
void private_func(int);
...

int f(int val)
{
   ...
   private_func(1);
   ...

void private_func(int i);
{
   ...

/* other.c - implements an other module */
   ...
   void private_func(); // duplicate declaration!!!
   private_func();
   ...

Obviously, the private function had taken no parameters at some stage, whence someone had decided they needed to bypass the interface and have direct access to the function. Later it had been changed to take an integer parameter but it was still being called without parameters in other.c.

This is a good example of the advantages of the DRY (don't repeat yourself) principle. Functions should only ever be declared in one place. For public functions of a module they should only be declared in the header file for the module.

Even without the nasty bug caused by inconsistent function declarations the idea of circumventing the interface to the module (by accessing private functions from outside the module) is a bad idea. It makes the interface to the module less clearly defined which makes maintaining the software a nightmare. Moreover, calling the private function might cause the module to be left in an inconsistent state.

A way to avoid this type of thing from happening, intentionally or accidentally, is to declare all private functions as static (see Unintentional Global Variables below). This makes them completely inaccessible from outside.
  • only declare function prototypes in one place (the header file for the module)
  • include the header file in the module source file so the compiler can check that the declaration matches the actual function definition
  • include the header file in all other modules that use the modules so the compiler can ensure that all parameters are passed correctly
  • declare private functions and (file scope) variables static

Struct Alignment

Another nasty problem that can occur is related to how pad bytes are added to structs in C header files (and classes in C++).

It is common for an interface to a C module to use structs (or usually pointers to structs) as parameters to its functions. Hence these structs are part of the interface and must be added to the header file which defines the interface to the module. For example, the Windows header files are full of such structs (often typdedef'ed) such as struct _SYSTEMTIME (typedef'ed as SYSTEMTIME).

One problem to watch for is that the caller's and the callee's interpretation of the memory layout may differ if they have different alignment settings at the point where they #include the header. This is usually controlled by means of a command line option and/or #pragma pack. See my blog entry on Alignment for a complete description of this problem and how to create struct's that do not have this problem.

In C++ the same thing can happen with classes (or references and pointers to classes). In fact, it is even more common to pass these around. However, it is unusual for C++ programs to change alignment settings, so this is not normally a problem.
  • protect structs in header file from alignment problems by using #pragma pack or (better) make the struct alignment agnostic

Unintentional Global Variables

A lot of people, like me, learnt C from K&R. There are also many other C (and C++) books that have a similar coding style for their examples. This can be a problem, not because the example code isn't of the highest standard, but because the code is written for brevity. This style is not recommended for production code, but is nevertheless emulated. For example, K&R often uses very short variable names and declares function prototypes within the code instead of in a separate header file (see "Duplicate Declarations" above for why this is a bad idea), etc.

Another thing, in the K&R code, is that functions are never declared static, even when they are obviously "private". However, in large programs it is important to declare all private functions (and file-scope variables) static (in the same way that private members of a class in C++ are declared private). A simple code example:

int privateVar;
int privateFunc() { ... }

  ...
  privateVar = ...

If the above are only used within the one module (ie only in the current source file) then they should be made invisible from outside:

static int privateVar;
static int privateFunc() { ... }

This more clearly delineates what is and is not part of the interface and has the following advantages:
  1. it helps someone reading the code to quickly see what is private to the module
  2. it avoids accidental name conflicts that cause build errors or even subtle run-time errors
  3. it prevents other modules using private parts of a module, possibly undermining its consistency
  4. it speeds up the link process as the linker has to deal with less names

C++ Classes

Typical C++ code is better than C code in creating well-defined modules. This is because the language (and the culture of its use) encourages modularity by the use of classes. For example, there is usually a clear distinction between what is public and what is private, which makes for better, well-defined interfaces.

However, there are still some poor practices in C++. A common one is the putting many classes into the same source file. As a general rule its better to put each class into its own .CPP (or .CC, .CXX, etc) file with a corresponding .H (or .HPP) header file. Of course, sometimes a module is better defined by more than one class. In such case these class would be closely related (using inheritance or at least the friend mechanism), such as a container class and its iterator.

Classes should also be as small as possible, doing one simple thing. I know this means that even small programs may have lots of source files, which some people see as a problem. This should not be a disincentive as you can use the file-system to organize the files into directories and most IDEs (like Visual Studio) support grouping different source files into folders.

In summary, a low-level module in C++ is typically composed of a single class (or a few closely related classes). The class methods should be implemented in a single source (.CPP) file. There should be a corresponding header file (same name but with .H extension) which declares the class and any related types etc. A large class could have several source (.CPP) files but should still have only a single header (.H) file - though, a class of that size may be indicative of a poor design.

One problem with C++ is that private parts of a class appear in the header file (since they are part of the class declaration) even though they are not part of the public interface. Users of the module should, of course, ignore these private methods and members. (They only appear in the header so the compiler can work out the size of objects of the class.)

C++ Name Mangling

One very good thing about C++ when compared to C is the use of what is commonly called name mangling. C++ is safer because all global names (functions, variables, member functions, etc) have a "name" that not only encodes the identifier but also its type (including number and type of parameters for functions). This will detect many of the problems mentioned above.
Note that name-mangling can be disabled in C++ using extern "C". This allows C and C++ object modules to be linked together.

For example, if you try to call function func() from one module passing a short but func() is defined (in another module) to take a parameter of type long then these two functions will have different "mangled" names. That is, the compiler will use different names in the object files (eg, .OBJ files on Windows) for the caller and callee and the linker will generate an error such as "unresolved external".

Interfaces

While we are discussing C++ we may as well talk about interfaces. Some languages
What is an interface?

The interface keyword in Java and C# is used to declare an abstract type that other classes may derive from. The interface does not provide any implementation; the deriving class implements the interface's methods.
(like C# and Java) have explicit support for interfaces. In C you create an "interface" by declaring a group of related functions in a header file. In C++ you can create an interface using an abstract base class. This is a class which defines all the functions you can call but does not actually contain any code (or data). That is all functions are declared pure virtual with no actual implementation. (Internally this is simply an array of function pointers called a VTable.) Obviously, such a class is not much use by itself, but it allows another class to derive from in order to implement the "interface".


However, because C++ does not directly support interfaces I often see abstract classes which are trying to be interfaces but are not quite right. Here is an example. How many problems can you see?

class interface
{
private:
   void internal();
   int count;
public:
   interface() {}
   ~interface();
   virtual void process() =0;
   virtual void other() =0;
};

First, an interface class does not implement anything so it should not have any data members (like count) or private methods (like internal). An interface is never instantiated so it also should not have a constructor. (Also note that the above constructor is pointless in any case as the compiler will generate a default constructor like this if necessary.)

Finally, an interface class destructor should always be declared virtual. This ensures that if the object which implements the interface is destroyed using a pointer to the interface then the correct destructor for the object is called.

class interface
{
public:
   virtual ~interface();
   virtual void process() =0;
   virtual void other() =0;
};

In summary an interface (abstract base class) class should:
  • have no private parts
  • have no member variables
  • have no constructors
  • declare all functions pure virtual

Summary

Having simple, well-defined interfaces allows the design of software to be more easily understood and consequently built and maintained.

In C, this is done by using one source file per low-level module. Private parts of the module (functions and file-scope variables) are declared static so they are not visible from the outside. The public parts (ie, the interface to the module) are declared in the header file for the module. Functions should be declared in one place only (DRY) to prevent nasty bugs caused by inconsistent declarations.

These ideas are similar in C++ except that a module is typically implemented by a class.

DO

  • create a separate source (implementation) and header (interface) file for each module
  • clearly differentiate between public and private parts of the module
  • declare function prototypes for all public module functions in the module header file
  • make all private module functions (and file scope variables, if any) static

DON'T

  • declare functions in more than one place
  • declare the interface for more than one module in the same header file
  • allow the padding of structures in header files to depend on the current alignment setting