Branching vs modulo
The question is: when cyclically iterating through a fixed range, is it better to use the modulo operator on the index variable, or use a conditional branch once the index exceeds the range limit? That is to say, which of the following is better?
for(int i=a; i!=b; i=(i+1)%N)
for(int i=a; i!=b; i=(++i==N)?0:i)
The first option looks much cleaner and is easier to read. In non performance-critical systems, or if your programming project is not at the end of its development stage, you should always use modulo. Never prematurely optimize. Even if you were to optimize your program, you should probably start with a profiler instead of fiddling with little things like the modulo operator in loops.
That said, I am currently writing a solver for the Travelling Salesman Problem where it is nice to have more speed. So I decided to find out which one is faster.
Branching is generally bad in loops, but with branch prediction it really only needs to evaluate the condition once every loop. On the other hand, the expensive modulo operator would be evaluated up to N times per loop.
1 Test setup
To test the performance, I made two simple C++ programs:
test_modulo.cpp
#include <iostream>
int main() {
int N = 19739; long long x=0;
for(int i=0; i<99999; i++) {
for(int j=3; j!=2; j=(j+1)%N) x+=j;
}
std::cout<<x;
}
test_branch.cpp
#include <iostream>
int main() {
int N = 19739; long long x=0;
for(int i=0; i<99999; i++) {
for(int j=3; j!=2; j=(++j==N)?0:j) x+=j;
}
std::cout<<x;
}
Here’s a script to test the performance of these programs:
testingscript.sh
echo "Modulo:"
g++ -o test test_modulo.cpp && time ./test
echo "Branch:"
g++ -o test test_branch.cpp && time ./test
echo "Modulo -O:"
g++ -O -o test test_modulo.cpp && time ./test
echo "Branch -O:"
g++ -O -o test test_branch.cpp && time ./test
echo "Modulo -O2:"
g++ -O2 -o test test_modulo.cpp && time ./test
echo "Branch -O2:"
g++ -O2 -o test test_branch.cpp && time ./test
echo "Modulo -O3:"
g++ -O3 -o test test_modulo.cpp && time ./test
echo "Branch -O3:"
g++ -O3 -o test test_branch.cpp && time ./test
rm test
2 Testing results
$sh testingscript.sh Modulo: 19480224095811 real 0m14.962s user 0m14.947s sys 0m0.000s Branch: 19480224095811 real 0m7.567s user 0m7.557s sys 0m0.000s Modulo -O: 19480224095811 real 0m8.557s user 0m8.547s sys 0m0.000s Branch -O: 19480224095811 real 0m1.425s user 0m1.423s sys 0m0.000s Modulo -O2: 19480224095811 real 0m8.293s user 0m8.283s sys 0m0.000s Branch -O2: 19480224095811 real 0m1.302s user 0m1.300s sys 0m0.000s Modulo -O3: 19480224095811 real 0m8.277s user 0m8.267s sys 0m0.000s Branch -O3: 19480224095811 real 0m1.308s user 0m1.307s sys 0m0.000s
So there, branching utterly trounces the modulo operator. What if we made N smaller? Surely it would need to branch more? Changing N to 7 and the maximum for i to 999999999, we get:
$sh testingscript.sh Modulo: 18999999981 real 0m32.710s user 0m32.653s sys 0m0.020s Branch: 18999999981 real 0m12.896s user 0m12.883s sys 0m0.000s Modulo -O: 18999999981 real 0m16.496s user 0m16.480s sys 0m0.000s Branch -O: 18999999981 real 0m4.541s user 0m4.537s sys 0m0.000s Modulo -O2: 18999999981 real 0m16.669s user 0m16.650s sys 0m0.000s Branch -O2: 18999999981 real 0m4.282s user 0m4.280s sys 0m0.000s Modulo -O3: 18999999981 real 0m0.001s user 0m0.000s sys 0m0.000s Branch -O3: 18999999981 real 0m4.318s user 0m4.313s sys 0m0.000s
Branch still wins most of the time with the notable exception of when it is compiled under -O3. In -O3, gcc does loop unrolling.