3 Compiling

3.1 The process of compiling

The first step of compiling a source file to an executable file is converting the code from the high level, human understandable language to assembly code. We know from previous chapters that assembly code works directly with the instructions and registers provided by the processor.

The compiler is the most complex step of the process for a number of reasons. Firstly, humans are very unpredictable and write their source code in many different forms. The compiler is only interested in the actual code; however, humans need things like comments and whitespace (spaces, tabs, indents, etc.) to understand code. The process the compiler uses to convert the human-written source code into its internal representation is called parsing.

3.1.1 C code

With C code, there is actually a step before parsing the source code, called the pre-processor. The pre-processor is at its core a text replacement program. For example, any identifier defined with #define identifier text will have identifier replaced with text wherever it appears in the code. This pre-processed code is then passed into the compiler.
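As a small illustrative sketch (the macro names here are hypothetical), consider the following fragment:

#define PI      3.14159
#define AREA(r) (PI * (r) * (r))

double area = AREA(2.0);

Running only the pre-processor (for example with gcc -E) shows the pure text replacement that happens before the compiler proper ever sees the code; the last line becomes

double area = (3.14159 * (2.0) * (2.0));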

3.2 Syntax

Any computing language has a particular syntax that describes the rules of the language. Both you and the compiler know the syntax rules, and all going well you will understand each other. Humans, being as they are, often forget the rules or break them, leaving the compiler unable to understand your intentions. For example, if you were to leave the closing bracket off an if condition, the compiler would not know where the actual conditional ends.

Syntax is most often described in Backus-Naur Form (BNF), which is a language with which you can describe languages!

3.3 Assembly Generation

The job of the compiler is to translate the higher level language into assembly code suitable for the target being compiled for. Obviously each different architecture has a different instruction set, different numbers of registers and different rules for correct operation.

3.3.1 Alignment

CPUs can generally only load values into registers from memory on specific alignments. Unaligned loads lead to, at best, performance degradation.
Figure 3.3.1.1 Alignment

Alignment of variables in memory is an important consideration for the compiler. Systems programmers need to be aware of alignment constraints to help the compiler create the most efficient code it can.

CPUs can generally not load a value into a register from an arbitrary memory location; they require that variables be aligned on certain boundaries. In the example above, we can see how a 32 bit (4 byte) value is loaded into a register on a machine that requires 4 byte alignment of variables.

The first variable can be directly loaded into a register, as it falls between 4 byte boundaries. The second variable, however, spans the 4 byte boundary. This means that at minimum two loads will be required to get the variable into a single register; firstly the lower half and then the upper half.

Some architectures, such as x86, can handle unaligned loads in hardware, and the only symptom will be lower performance as the hardware does the extra work to get the value into the register. Other architectures can not have alignment rules violated and will raise an exception, which is generally caught by the operating system, which then has to manually load the register in parts, causing even more overhead.

3.3.1.1 Structure Padding

Programmers need to consider alignment especially when creating structs. Whilst the compiler knows the alignment rules for the architecture it is building for, at times programmers can cause sub-optimal behaviour.

The C99 standard only says that the members of a structure will be laid out in memory in the same order as they are specified in the declaration, and that in an array of structures all elements will be the same size.

$ cat struct.c
#include <stdio.h>

struct a_struct {
        char char_one;
        char char_two;
        int int_one;
};

int main(void)
{

        struct a_struct s;

        printf("%p : s.char_one\n" \
               "%p : s.char_two\n" \
               "%p : s.int_one\n", &s.char_one,
               &s.char_two, &s.int_one);

        return 0;

}

$ gcc -o struct struct.c

$ gcc -fpack-struct -o struct-packed struct.c

$ ./struct
0x7fdf6798 : s.char_one
0x7fdf6799 : s.char_two
0x7fdf679c : s.int_one

$ ./struct-packed
0x7fcd2778 : s.char_one
0x7fcd2779 : s.char_two
0x7fcd277a : s.int_one
Example 3.3.1.1.1 Struct padding example

In the example above, we contrive a structure that has two single-byte chars followed by a 4 byte integer. The compiler pads the structure as below.

The compiler pads the structure to align the integer on a 4 byte boundary.
Figure 3.3.1.1.1 Alignment

In the other example we direct the compiler not to pad structures (with -fpack-struct) and correspondingly we can see that the integer starts directly after the two chars.
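One way to see the effect directly is to print the size of the structure. As a hedged sketch, adding the following line to the example above would typically report 8 for the padded build and 6 for the packed build on a machine with 4 byte ints:

        printf("%zu : sizeof(struct a_struct)\n", sizeof(struct a_struct));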

3.3.1.2 Cache line alignment

We talked previously about aliasing in the cache, and how several addresses may map to the same cache line. Programmers need to be sure that when they write their programs they do not cause bouncing of cache lines.

This situation occurs when a program constantly accesses two areas of memory that map to the same cache line. This effectively wastes the cache line, as it gets loaded in, used for a short time and then kicked out so that the other data can be loaded into the same place in the cache.

Obviously if this situation repeats the performance will be significantly reduced. The situation would be relieved if the conflicting data was organised in slightly different ways to avoid the cache line conflict.
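As a contrived sketch of the idea (the array names and sizes are hypothetical, and the exact behaviour depends entirely on the cache geometry of the machine), imagine two arrays whose distance apart happens to be a multiple of the cache size on a simple direct-mapped cache:

#include <stdio.h>

/*
 * On a hypothetical direct-mapped cache, if the distance between a and b
 * is a multiple of the cache size, a[i] and b[i] map to the same cache
 * line and evict each other on every iteration of the loop below.
 * Inserting a small amount of padding between the arrays, or otherwise
 * rearranging the data, is one way to break the conflict.
 */
static double a[8192];
static double b[8192];

int main(void)
{
        double sum = 0.0;
        int i;

        for (i = 0; i < 8192; i++)
                sum += a[i] * b[i];

        printf("%f\n", sum);

        return 0;
}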

One possible way to detect this sort of situation is profiling. When you profile your code you "watch" it to analyse what code paths are taken and how long they take to execute. With profile guided optimisation (PGO) the compiler puts special extra bits of code into the first binary it builds; this binary is run and records which branches are taken, and so on. You can then recompile the binary with this extra information to possibly create a better performing binary. Otherwise the programmer can look at the output of the profile and possibly detect situations such as cache line bouncing.
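As a rough sketch of what this workflow looks like with gcc (the file name prog.c is hypothetical, and the exact flags vary between compilers and versions):

$ gcc -fprofile-generate -o prog prog.c
$ ./prog
$ gcc -fprofile-use -o prog prog.c

The first build is instrumented; running it with a representative workload records which branches were taken, and the second build uses that recorded information.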

3.3.1.3 Space - Speed Trade off

What the compiler has done above is trade off some extra memory to gain a speed improvement in running our code. The compiler knows the rules of the architecture and can make decisions about the best way to align data, possibly trading small amounts of wasted memory for increased (or perhaps even just correct) performance.

Consequently as a programmer you should never make assumptions about the way variables and data will be laid out by the compiler. To do so is not portable, as a different architecture may have different rules and the compiler may make different decisions based on explicit commands or optimisation levels.

3.3.1.4 Making Assumptions

Thus, as a C programmer you need to be familiar with what you can assume about what the compiler will do and what may be variable. What exactly you can assume and can not assume is detailed in the C99 standard; if you are programming in C it is certainly worth the investment in becoming familiar with the rules to avoid writing non-portable or buggy code.

$ cat stack.c
#include <stdio.h>

struct a_struct {
        int a;
        int b;
};

int main(void)
{
        int i;
        struct a_struct s;
        printf("%p\n%p\ndiff %ld\n", &i, &s, (unsigned long)&s - (unsigned long)&i);
        return 0;
}
$ gcc-3.3 -Wall -o stack-3.3 ./stack.c
$ gcc-4.0 -o stack-4.0 stack.c

$ ./stack-3.3
0x60000fffffc2b510
0x60000fffffc2b520
diff 16

$ ./stack-4.0
0x60000fffff89b520
0x60000fffff89b524
diff 4
Example 3.3.1.4.1 Stack alignment example

In the example above, taken from an Itanium machine, we can see that the padding and alignment of the stack have changed considerably between gcc versions. This sort of change is to be expected and must be considered by the programmer.

Generally you should ensure that you do not make assumptions about the size of types or alignment rules.

3.3.1.5 C Idioms with alignment

There are a few common sequences of code that deal with alignment; most programs will consider it in some way. You may see these "code idioms" in many places outside the kernel, in any program that handles chunks of data in some form or another, so it is worth investigating.

We can take some examples from the Linux kernel, which often has to deal with alignment of pages of memory within the system.

[ include/asm-ia64/page.h ]

/*
 * PAGE_SHIFT determines the actual kernel page size.
 */
#if defined(CONFIG_IA64_PAGE_SIZE_4KB)
# define PAGE_SHIFT     12
#elif defined(CONFIG_IA64_PAGE_SIZE_8KB)
# define PAGE_SHIFT     13
#elif defined(CONFIG_IA64_PAGE_SIZE_16KB)
# define PAGE_SHIFT     14
#elif defined(CONFIG_IA64_PAGE_SIZE_64KB)
# define PAGE_SHIFT     16
#else
# error Unsupported page size!
#endif

#define PAGE_SIZE               (__IA64_UL_CONST(1) << PAGE_SHIFT)
#define PAGE_MASK               (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(addr)        (((addr) + PAGE_SIZE - 1) & PAGE_MASK)
Example 3.3.1.5.1 Page alignment manipulations

Above we can see that there are a number of different options for page sizes within the kernel, ranging from 4KB through 64KB.

The PAGE_SIZE macro is fairly self explanatory, giving the current page size selected within the system by shifting a value of 1 by the shift number given (remember, this is the equivalent of saying 2^n, where n is PAGE_SHIFT).

Next we have a definition for PAGE_MASK. Since PAGE_SIZE is a power of two, PAGE_SIZE - 1 has only the low order bits set; PAGE_MASK is the complement of this, so it has every bit set except those that address bytes within a page. ANDing an address with PAGE_MASK therefore clears the offset bits and gives the address of the start of the page containing it (conversely, ANDing with ~PAGE_MASK gives just the offset of the address within its page).

Finally, PAGE_ALIGN(addr) rounds an address up to the next page boundary: adding PAGE_SIZE - 1 pushes any address that is not already page aligned past the boundary, and the mask then clears the offset bits. An address that is already page aligned is left unchanged.
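To make the arithmetic concrete, here is a minimal sketch assuming a 4KB page size (a PAGE_SHIFT of 12); the address used is just an arbitrary example:

#include <stdio.h>

#define PAGE_SHIFT 12                   /* assume 4KB pages for this sketch */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(addr) (((addr) + PAGE_SIZE - 1) & PAGE_MASK)

int main(void)
{
        unsigned long addr = 0x12345;

        printf("page base  : 0x%lx\n", addr & PAGE_MASK);   /* 0x12000 */
        printf("page offset: 0x%lx\n", addr & ~PAGE_MASK);  /* 0x345   */
        printf("aligned up : 0x%lx\n", PAGE_ALIGN(addr));   /* 0x13000 */

        return 0;
}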

3.4 Optimisation

Once the compiler has an internal representation of the code, the really interesting part of the compiler starts. The compiler wants to find the most optimised assembly language output for the given input code. This is a large and varied problem, requiring everything from knowledge of efficient algorithms from computer science to deep knowledge of the particular processor the code is to run on.

There are some common optimisations the compiler can look at when generating output. There are many, many more strategies for generating the best code, and it is always an active research area.

3.4.1 General Optimising

The compiler can often see that a particular piece of code can not be used, and so leave it out, or it can optimise a particular language construct into something smaller with the same outcome.
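As a contrived sketch of the sort of thing the compiler can see (the names here are hypothetical):

#include <stdio.h>

static const int debug = 0;

static int calculate(void)
{
        int result = 2 * 1024; /* constant folding: the compiler can emit 2048 directly */

        if (debug) {           /* always false here, so the compiler can drop this branch entirely */
                printf("debugging: result is %d\n", result);
        }

        return result;
}

int main(void)
{
        printf("%d\n", calculate());

        return 0;
}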

3.4.2 Unrolling loops

If code contains a loop, such as a for or while loop, and the compiler has some idea of how many times it will execute, it may be more efficient to unroll the loop so that it executes sequentially. This means that instead of doing the inside of the loop and then branching back to the start to repeat the process, the inner loop code is duplicated to be executed again.

Whilst this increases the size of the code, it may allow the processor to work through the instructions more efficiently as branches can cause inefficiencies in the pipeline of instructions coming into the processor.
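A rough sketch of the idea (the array and its length are hypothetical; a real compiler applies its own heuristics about when unrolling is worthwhile):

#include <stdio.h>

int main(void)
{
        int array[4] = { 1, 2, 3, 4 };
        int sum = 0;
        int i;

        /* the loop as the programmer wrote it */
        for (i = 0; i < 4; i++)
                sum += array[i];

        /*
         * Roughly what the compiler may generate after unrolling: the loop
         * body is duplicated and the branch back to the top disappears.
         *
         *   sum += array[0];
         *   sum += array[1];
         *   sum += array[2];
         *   sum += array[3];
         */

        printf("%d\n", sum);

        return 0;
}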

3.4.3 Inlining functions

Similar to unrolling loops, it is possible to embed a called function within the caller. The programmer can ask the compiler to try to do this by specifying the function as inline in the function definition. Once again, you may trade code size for sequential code by doing this.
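A minimal sketch of the inline hint (square is a hypothetical function, and the compiler is free to ignore the request):

#include <stdio.h>

/* the inline keyword is only a hint; the compiler may still decide not to inline */
static inline int square(int x)
{
        return x * x;
}

int main(void)
{
        int y = square(5);     /* may be compiled as if we had written y = 5 * 5 */

        printf("%d\n", y);

        return 0;
}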

3.4.4 Branch Prediction

Any time the computer comes across an if statement there are two possible outcomes: true or false. The processor wants to keep its incoming pipeline as full as possible, so it can not wait for the outcome of the test before putting code into the pipeline.

Thus the compiler can make a prediction about which way the test is likely to go. There are some simple rules the compiler can use to guess things like this; for example, if (val == -1) is unlikely to be true, since -1 usually indicates an error code and hopefully that will not be triggered too often.
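With gcc the programmer can also give the compiler this sort of hint explicitly via the __builtin_expect built-in; the Linux kernel wraps it in its likely() and unlikely() macros. A minimal sketch (the process function and its logic are hypothetical):

#include <stdio.h>

/* gcc-specific hint; the Linux kernel defines these wrappers around __builtin_expect */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int process(int val)
{
        if (unlikely(val == -1)) {
                /* error path: the compiler lays out the code assuming this branch is rarely taken */
                return -1;
        }

        return val * 2;
}

int main(void)
{
        printf("%d\n", process(21));

        return 0;
}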

Some compilers can actually compile the program, have the user run it and take note of which way the branches go under real conditions. The compiler can then re-compile the program based on what it has seen.