Branching vs modulo

The question is: when cyclically iterating through a fixed range, is it better to use the modulo operator on the index variable, or use a conditional branch once the index exceeds the range limit? That is to say, which of the following is better?

for(int i=a; i!=b; i=(i+1)%N)
for(int i=a; i!=b; i=(++i==N)?0:i)

The first option looks much cleaner and is easier to read. In non performance-critical systems, or if your programming project is not at the end of its development stage, you should always use modulo. Never prematurely optimize. Even if you were to optimize your program, you should probably start with a profiler instead of fiddling with little things like the modulo operator in loops.

That said, I am currently writing a solver for the Travelling Salesman Problem where it is nice to have more speed. So I decided to find out which one is faster.

Branching is generally bad in loops, but with branch prediction it really only needs to evaluate the condition once every loop. On the other hand, the expensive modulo operator would be evaluated up to N times per loop.

1 Test setup

To test the performance, I made two simple C++ programs:

test_modulo.cpp

#include <iostream>
int main() {
    int N = 19739; long long x=0;
    for(int i=0; i<99999; i++) {
        for(int j=3; j!=2; j=(j+1)%N) x+=j;
    }
    std::cout<<x;
}

test_branch.cpp

#include <iostream>
int main() {
    int N = 19739; long long x=0;
    for(int i=0; i<99999; i++) {
        for(int j=3; j!=2; j=(++j==N)?0:j) x+=j;
    }
    std::cout<<x;
}

Here’s a script to test the performance of these programs:

testingscript.sh

echo "Modulo:"
g++ -o test test_modulo.cpp && time ./test
echo "Branch:"
g++ -o test test_branch.cpp && time ./test
echo "Modulo -O:"
g++ -O -o test test_modulo.cpp && time ./test
echo "Branch -O:"
g++ -O -o test test_branch.cpp && time ./test
echo "Modulo -O2:"
g++ -O2 -o test test_modulo.cpp && time ./test
echo "Branch -O2:"
g++ -O2 -o test test_branch.cpp && time ./test
echo "Modulo -O3:"
g++ -O3 -o test test_modulo.cpp && time ./test
echo "Branch -O3:"
g++ -O3 -o test test_branch.cpp && time ./test
rm test

2 Testing results

$sh testingscript.sh
Modulo:
19480224095811
real    0m14.962s
user    0m14.947s
sys 0m0.000s
Branch:
19480224095811
real    0m7.567s
user    0m7.557s
sys 0m0.000s
Modulo -O:
19480224095811
real    0m8.557s
user    0m8.547s
sys 0m0.000s
Branch -O:
19480224095811
real    0m1.425s
user    0m1.423s
sys 0m0.000s
Modulo -O2:
19480224095811
real    0m8.293s
user    0m8.283s
sys 0m0.000s
Branch -O2:
19480224095811
real    0m1.302s
user    0m1.300s
sys 0m0.000s
Modulo -O3:
19480224095811
real    0m8.277s
user    0m8.267s
sys 0m0.000s
Branch -O3:
19480224095811
real    0m1.308s
user    0m1.307s
sys 0m0.000s

So there, branching utterly trounces the modulo operator. What if we made N smaller? Surely it would need to branch more? Changing N to 7 and the maximum for i to 999999999, we get:

$sh testingscript.sh
Modulo:
18999999981
real    0m32.710s
user    0m32.653s
sys 0m0.020s
Branch:
18999999981
real    0m12.896s
user    0m12.883s
sys 0m0.000s
Modulo -O:
18999999981
real    0m16.496s
user    0m16.480s
sys 0m0.000s
Branch -O:
18999999981
real    0m4.541s
user    0m4.537s
sys 0m0.000s
Modulo -O2:
18999999981
real    0m16.669s
user    0m16.650s
sys 0m0.000s
Branch -O2:
18999999981
real    0m4.282s
user    0m4.280s
sys 0m0.000s
Modulo -O3:
18999999981
real    0m0.001s
user    0m0.000s
sys 0m0.000s
Branch -O3:
18999999981
real    0m4.318s
user    0m4.313s
sys 0m0.000s

Branch still wins most of the time with the notable exception of when it is compiled under -O3. In -O3, gcc does loop unrolling.