Friday, December 20, 2013

Some Benchmarks of Mathematica vs Compiled Mathematica vs C vs ...

I was looking into some details of compiling Mathematica code, and here are some benchmarks.

Summary

As this post has grown in length, it may be helpful to provide a summary here -- we benchmark Mathematica and compiled Mathematica code against C and a few other languages. The test code is a do-loop with a few trigonometric function computations inside.



Direct benchmark of the following code

(1) The uncompiled Mathematica code; with fFun[10.5, 1*^8]: 2320s
      (This is an inferred value: I actually ran fFun[10.5, 1*^5], which took 2.320s. The extrapolation is for convenience of comparison with the compiled code.)
fFun = Function[ {x, n},
    Module[ {sum, inc}, sum = 1.0; inc = 1.0; 
    Do[inc = inc*Sin[x]/i; sum = sum + Tan[inc], {i, n}]; sum]];

A note is in order: in this example, inc eventually becomes as small as 10^{-462146}. Mathematica is clever enough to catch the underflow of machine precision, so this small number is kept (none of the examples below do this). For a fair comparison, one can set (thanks to Rojo for pointing this out)

SetSystemOptions["CatchMachineUnderflow" -> False]

After that, the above code takes 180s to run fFun[10.5, 1*^8].
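For contrast, here is a quick Python sketch (names are my own) of what plain machine doubles do with the same recurrence: instead of being carried on down to 10^{-462146}, the running product underflows to exactly 0.0 after a couple hundred iterations.

```python
import math

# With IEEE doubles, inc = inc*sin(x)/i shrinks roughly like 0.88^n / n!,
# passes through the subnormal range, and then becomes exactly 0.0.
# Mathematica (with underflow catching on) would instead switch to
# arbitrary precision and keep the tiny value.
x = 10.5
inc = 1.0
i = 1
while inc != 0.0:
    inc = inc * math.sin(x) / i
    i += 1
print(i, inc)  # inc is now exactly 0.0 in machine precision
```

This is also why the uncompiled timing with underflow catching enabled is so much worse: past this point Mathematica is doing arbitrary-precision arithmetic on every iteration.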

(2) The code compiled to Mathematica VM; with mFun[10.5, 1*^8]: 6.80s
mFun = Compile[ {{x, _Real}, {n, _Integer}},
    Module[ {sum, inc}, sum = 1.0; inc = 1.0; 
    Do[inc = inc*Sin[x]/i; sum = sum + Tan[inc], {i, n}]; sum]];

(3) The code compiled to C; with cFun[10.5, 1*^8]: 1.04s
cFun = Compile[ {{x, _Real}, {n, _Integer}},
    Module[ {sum, inc}, sum = 1.0; inc = 1.0; 
    Do[inc = inc*Sin[x]/i; sum = sum + Tan[inc], {i, n}]; sum], 
   CompilationOptions -> {"InlineExternalDefinitions" -> True, 
     "InlineCompiledFunctions" -> True, 
     "ExpressionOptimization" -> True}, 
   RuntimeAttributes -> {Listable}, Parallelization -> True, 
   CompilationTarget -> "C"];

(4) The native C code; with -O0 3.14s; with -O3 0.66s
#include <stdio.h>
#include <math.h>
int main(void) {
  double x = 10.5;
  double inc = 1.0;
  double sum = 1.0;
  for (int i = 1; i <= 100000000; i++) {
    inc = inc * sin(x) / (double) i;
    sum = sum + tan(inc);
  }
  printf("%f, %f\n", inc, sum);
  return 0;
}

Parallel performance of code (3)

Here is the performance as a function of the length of the list (passed in the variable x). Considering the turbo boost feature of the CPU, the code can be considered well parallelized.

The test was done on a 4800MQ CPU, which has 4 physical cores with Hyper-Threading (thus 8 logical cores). It is interesting (and surprising) to see that the evaluation time starts to double after list length 8, instead of 4.


Why is the native C code still faster?

One can have a look at the C code generated by Mathematica (it can be found after setting Compiler`$CCompilerOptions = {"CleanIntermediate" -> False};):

DLLEXPORT int compiledFunction1(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res)
{
mint I0_0;
mint I0_1;
mint I0_3;
mreal R0_0;
mreal R0_2;
mreal R0_3;
mreal R0_4;
mreal R0_5;
mreal R0_6;
R0_0 = MArgument_getReal(Args[0]);
I0_0 = MArgument_getInteger(Args[1]);
R0_2 = R0_1;
R0_4 = R0_1;
I0_1 = I0_0;
I0_3 = I0_2;
goto lab15;
lab6:
R0_3 = sin(R0_0);
R0_5 = (mreal) I0_3;
R0_6 = 1 / R0_5;
R0_3 = R0_3 * R0_6;
R0_6 = R0_4 * R0_3;
R0_4 = R0_6;
R0_6 = tan(R0_4);
R0_3 = R0_2 + R0_6;
R0_2 = R0_3;
lab15:
if( ++I0_3 <= I0_1)
{
goto lab6;
}
MArgument_setReal(Res, R0_2);
return 0;
}

The for loop is written with gotos. I think the compiler is confused and does not know how to optimize the loop (perhaps in branch prediction and register allocation) in this case. This is supported by the fact that, if one makes the calculation inside the loop more complicated (tested by nesting the sin functions a few times), the performance of the compiled Mathematica code approaches that of the C code.
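To see that the goto form really computes the same thing as the structured loop, here is a Python sketch (my own transcription, with hypothetical names) that mimics the generated register-style code and checks it against the plain loop:

```python
import math

def plain_loop(x, n):
    # The straightforward loop, as in the original sources.
    inc, s = 1.0, 1.0
    for i in range(1, n + 1):
        inc = inc * math.sin(x) / i
        s = s + math.tan(inc)
    return s

def register_style(x, n):
    # Mimics the generated code: named temporaries and an
    # increment-test-jump loop (modeled here with a while).
    R0, I0 = x, n      # R0_0, I0_0 in the generated code
    R2 = 1.0           # R0_2: sum
    R4 = 1.0           # R0_4: inc
    I3 = 0             # I0_3: loop counter
    while True:
        I3 += 1        # the "++I0_3 <= I0_1" test at lab15
        if I3 > I0:
            break
        R3 = math.sin(R0)
        R5 = float(I3)
        R6 = 1.0 / R5
        R3 = R3 * R6
        R6 = R4 * R3
        R4 = R6
        R6 = math.tan(R4)
        R3 = R2 + R6
        R2 = R3
    return R2
```

The two versions agree up to floating-point rounding (the generated code multiplies by a reciprocal instead of dividing, so the last bits can differ); the semantics are identical, and the cost difference comes purely from what the C optimizer can do with each form.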

Performance drop for external calls

Once there are external calls to user-defined non-compiled functions, the performance of the compiled code drops a lot.

This takes 25.6 seconds to run 1*^8 loops (note: if one writes g = Identity, there is no performance drop; defining g[x_] := x, 1*^8 loops take 29.8 seconds):
g = # &
cFun = Compile[ {{x, _Real}, {n, _Integer}},
    Module[ {sum, inc}, sum = 1.0; inc = 1.0; 
    Do[inc = g[inc]*Sin[x]/i; sum = sum + Tan[inc], {i, n}]; sum], 
   CompilationOptions -> {"InlineExternalDefinitions" -> False, 
     "InlineCompiledFunctions" -> True, 
     "ExpressionOptimization" -> True}, 
   RuntimeAttributes -> {Listable}, Parallelization -> True, 
   CompilationTarget -> "C"];

For this case, one had better turn on the "InlineExternalDefinitions" option (this works for the pure function g = # &, but not for the rule-based g[x_] := x). Then the performance returns.
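The slowdown is analogous to per-iteration function-call overhead in any interpreted setting. As a rough Python illustration (function names are my own), the same loop run with an inline identity versus a user-defined identity function called every iteration:

```python
import math
import timeit

def g(v):
    # Stand-in for the non-inlined external definition.
    return v

def loop_inline(x, n):
    inc, s = 1.0, 1.0
    for i in range(1, n + 1):
        inc = inc * math.sin(x) / i
        s += math.tan(inc)
    return s

def loop_call(x, n):
    # Identical arithmetic, but with a function call per iteration.
    inc, s = 1.0, 1.0
    for i in range(1, n + 1):
        inc = g(inc) * math.sin(x) / i
        s += math.tan(inc)
    return s

t_inline = timeit.timeit(lambda: loop_inline(10.5, 10**5), number=3)
t_call = timeit.timeit(lambda: loop_call(10.5, 10**5), number=3)
print(t_inline, t_call)  # the per-call overhead shows up in t_call
```

The two functions return bit-identical results; only the dispatch cost differs, which is exactly what inlining the external definition removes.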

If one inserts a Mathematica function which has no counterpart in C or the plain MVM, e.g. BesselJ, 1*^8 runs take 1187s (again an inferred value; the MVM version takes 1193s, and the uncompiled version takes 8975s). For comparison, pure C code with the Bessel function from GSL takes only 71s. But the comparison is not completely fair, considering that Mathematica's BesselJ is more powerful, supporting arbitrary precision and complex values for both arguments.

Nevertheless, in this BesselJ case one had better write C code for performance and link it back to Mathematica.

Another lesson from this section is that, if there are complicated external calls, compiling to C doesn't bring much benefit compared to compiling to the MVM. But compiling is still useful.

Comparison with other languages

As Urjit suggested, here is the corresponding code run in Maxima:
x : 10.5;
inc : 1.0;
sum : 1.0;
for i thru 100000000 do (
  inc : inc * sin(x) / i,
  sum : sum + tan(inc)
);
print(sum);

To my (pleasant) surprise, this code (10^8 loops as above) needs 702 seconds to run. This is about 1/3 of the time for the non-compiled Mathematica code.

Note that both Maxima and Mathematica are symbolic systems, not primarily designed for numerical calculation. Thus the comparison should not be over-interpreted as a benchmark of Mathematica vs Maxima :)

For the same purpose, Python was tested -- I believe the math software Sage should have similar performance. The following takes 59.6 seconds:
import math
import time

x = 10.5
inc = 1.0
sum = 1.0
start = time.clock()
for i in range(1, 100000001):  # 10^8 iterations
    inc = inc * math.sin(x) / float(i)
    sum = sum + math.tan(inc)

print(time.clock() - start)

Just for fun, I did the same test in Emacs Lisp. Interestingly, the performance is much better than I had expected (considering elisp was never designed for numerical calculation like this). Here is the code; it takes 26s to run -- the fastest among the tested interpreted languages:

(let ((x 10.5) (inc 1.0) (sum 1.0) (i 1.0))
  (insert (format-time-string "%s"))
  (insert "\n")
  (while (< i 100000001)                ; 10^8 iterations, as in the other tests
    (setq inc (/ (* inc (sin x)) i))
    (setq sum (+ sum (tan inc)))
    (setq i (1+ i)))
  (insert (format-time-string "%s"))
  sum)

Mathematica packed list vs C code

As another test, we check the performance of Mathematica packed lists. We use a different example because I don't know how to apply the same technique to the example above (tell me if you know how).

Mathematica code: 0.73s

AbsoluteTiming@Total@Sin@Range[1.0, 10^8]

C code (single thread, gcc -O3): 3.36s

#include <stdio.h>
#include <math.h>
int main(){
  int i;
  double sum = 0.0;
  for(i=1; i<=100000000; i++){
    sum += sin(i);
  }
  printf("%f\n", sum);
  return 0;
}
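The same packed-vs-scalar effect can be sketched in Python, assuming NumPy is available as a rough analogue of Mathematica's packed arrays (variable names are my own):

```python
import math
import time
import numpy as np  # assumption: NumPy installed; plays the role of packed arrays

n = 10**6

# Scalar loop: one interpreted sin() call per element.
t0 = time.perf_counter()
s_loop = 0.0
for i in range(1, n + 1):
    s_loop += math.sin(i)
t_loop = time.perf_counter() - t0

# "Packed" version: one vectorized sin over a contiguous double array.
t0 = time.perf_counter()
s_vec = np.sin(np.arange(1.0, n + 1)).sum()
t_vec = time.perf_counter() - t0

print(t_loop, t_vec)  # the vectorized version is typically much faster
```

The vectorized call amortizes all dispatch overhead over the whole array, which is essentially what Total@Sin@Range[...] does on a packed list -- and why it can even beat a naive single-threaded C loop that recomputes everything element by element.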
