Loop unrolling optimization, how does this work

assemblyc++

Consider this C-code:

int sum=0;
for(int i=0;i<5;i++)
    sum+=i;

This could be translated in (pseudo-) assembly this way (without loop unrolling):

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
BNE $R11, #5 LOOP

So my first question is how is this code translated using loop unrolling, between these two ways:

1)

ADDI $R10, #0
ADDI $R10, #0
ADDI $R10, #1
ADDI $R10, #2
ADDI $R10, #3
ADDI $R10, #4

2)

   ADD $R10, #10

Is the compiler able to optimize the code and directly know that it has to add 10 without performing all sums?

Also, is there a possibility to block the pipeline with a branch instruction? Do I have to write it this way:

% pseudo-code assembly
ADDI $R10, #0   % sum
ADDI $R11, #0   % i
LOOP:
ADD $R10, $R11
ADDI $R11, #1
NOP   % is this necessary to avoid the pipeline blocking?
NOP
NOP
NOP
BNE $R11, #5 LOOP

To avoid that the fetch-decode-exe-mem-write back cycle is interrupted by the branch?

Best Solution

This is more for demonstration of what a compiler is capable of, rather than what every compiler would do. The source:

#include <stdio.h>

int main(void)
{
    int i, sum = 0;

    for(i=0; i<5; i++) {
        sum+=i;
    }

    printf("%d\n", sum);
    return 0;
}

Note the printf I have added. If the variable is not used, the compiler will optimize out the entire loop.

Compiling with -O0 (No optimization)

gcc -Wall -O0 -S -c lala.c:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3

The loop happens in a 'dumb' way, with -8(%rbp) being the variable i.

Compiling with -O1 (Optimization level 1)

gcc -Wall -O1 -S -c lala.c:

movl    $10, %edx

The loop has been completely removed and replaced with the equivalent value.


In unrolling, the compiler looks to see how many iterations would happen and tries to unroll by performing less iterations. For example, the loop body might be duplicated twice which would result in the number of branches to be halved. Such a case in C:

int i = 0, sum = 0;

sum += i;
i++;

for(; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
}

Notice that one iteration had to be extracted out of the loop. This is because 5 is an odd number and so the work can not simply be halved by duplicating the contents. In this case the loop will only be entered twice. The assembly code produced by -O0:

    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L2
.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)

Completely unrolling in C:

for(i=0; i<5;i++) {
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
    i++;
    sum+=i;
}

This time the loop is actually entered only once. The assembly produced with -O0:

.L3:
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    movl    -8(%rbp), %eax
    addl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
.L2:
    cmpl    $4, -8(%rbp)
    jle .L3
Related Question