I ran these on a p3.2xlarge AWS EC2 instance with the following specs:
- 8 vCPUs
- 61 GB memory
- 1 NVIDIA Tesla V100 GPU
Software stack:
- Ubuntu 22.04
- Python 3.10.8
- CUDA 11.4
- Packages pulled from pip
- Backend versions: aesara==2.8.9, cupy==11.4.0, jax==0.4.1, numba==0.56.4, numpy==1.23.5, taichi==1.3.0, torch==1.13.1, tensorflow==2.11.0
This benchmark evaluates an equation consisting of more than 100 terms, with no data dependencies and only elementary math. It should represent a best-case scenario for vector instructions and GPU performance.
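As a toy illustration (a made-up stand-in, not the actual benchmarked kernel), a purely element-wise expression like the one below vectorizes trivially: every output element depends only on the corresponding input elements, so the work maps directly onto SIMD lanes or GPU threads.

```python
import numpy as np

def toy_equation_of_state(s, t, p):
    """Hypothetical stand-in for the benchmarked kernel: a chain of
    element-wise operations with no dependencies between elements."""
    # Each output element reads only s[i], t[i], p[i].
    return (
        999.8 + 0.07 * t - 0.009 * t**2
        + 0.8 * s - 0.004 * s * t
        + 4.5e-3 * p - 1.0e-5 * p * t
    )

s = np.random.rand(4096)
t = np.random.rand(4096)
p = np.random.rand(4096)
rho = toy_equation_of_state(s, t, p)
```

The real kernel is much longer, but the structure is the same, which is why compiled backends can fuse the whole expression into one pass over memory.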
$ taskset -c 0 python run.py benchmarks/equation_of_state/
benchmarks.equation_of_state
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 pytorch 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 6.560
4,096 jax 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.008 6.515
4,096 numba 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.012 3.760
4,096 taichi 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.008 3.007
4,096 aesara 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.012 2.535
4,096 tensorflow 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.012 2.123
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.009 1.000
16,384 pytorch 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.009 7.007
16,384 jax 10,000 0.001 0.000 0.001 0.001 0.001 0.001 0.009 6.092
16,384 tensorflow 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.013 4.393
16,384 numba 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.009 3.988
16,384 taichi 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.009 3.359
16,384 aesara 10,000 0.003 0.000 0.003 0.003 0.003 0.003 0.013 2.938
16,384 numpy 1,000 0.008 0.001 0.007 0.007 0.008 0.008 0.021 1.000
65,536 pytorch 1,000 0.005 0.000 0.004 0.005 0.005 0.005 0.012 6.202
65,536 jax 1,000 0.005 0.000 0.004 0.005 0.005 0.005 0.012 6.190
65,536 tensorflow 1,000 0.005 0.001 0.005 0.005 0.005 0.005 0.017 5.797
65,536 numba 1,000 0.007 0.001 0.007 0.007 0.007 0.007 0.019 3.849
65,536 taichi 1,000 0.009 0.001 0.009 0.009 0.009 0.009 0.024 3.243
65,536 aesara 1,000 0.010 0.000 0.010 0.010 0.010 0.010 0.022 2.888
65,536 numpy 1,000 0.028 0.003 0.027 0.027 0.028 0.028 0.080 1.000
262,144 pytorch 1,000 0.013 0.002 0.012 0.013 0.013 0.013 0.036 13.655
262,144 tensorflow 1,000 0.015 0.001 0.014 0.014 0.015 0.015 0.035 12.085
262,144 jax 1,000 0.016 0.001 0.014 0.016 0.016 0.016 0.043 11.357
262,144 numba 100 0.027 0.000 0.026 0.026 0.027 0.027 0.027 6.617
262,144 taichi 100 0.031 0.004 0.030 0.030 0.030 0.030 0.068 5.753
262,144 aesara 100 0.035 0.000 0.035 0.035 0.035 0.036 0.036 4.988
262,144 numpy 100 0.176 0.005 0.165 0.174 0.176 0.180 0.196 1.000
1,048,576 pytorch 100 0.056 0.000 0.055 0.056 0.056 0.056 0.060 12.895
1,048,576 jax 100 0.068 0.005 0.065 0.065 0.065 0.071 0.084 10.578
1,048,576 tensorflow 100 0.070 0.005 0.065 0.065 0.068 0.072 0.084 10.407
1,048,576 numba 100 0.111 0.001 0.109 0.111 0.111 0.111 0.114 6.523
1,048,576 taichi 100 0.132 0.000 0.131 0.131 0.131 0.132 0.133 5.500
1,048,576 aesara 100 0.146 0.001 0.144 0.146 0.146 0.146 0.148 4.947
1,048,576 numpy 10 0.723 0.011 0.714 0.719 0.720 0.723 0.754 1.000
4,194,304 pytorch 10 0.231 0.000 0.231 0.231 0.231 0.231 0.231 17.234
4,194,304 tensorflow 10 0.334 0.001 0.331 0.334 0.334 0.334 0.335 11.927
4,194,304 jax 10 0.335 0.001 0.334 0.334 0.335 0.335 0.338 11.870
4,194,304 numba 10 0.444 0.001 0.443 0.444 0.444 0.445 0.446 8.961
4,194,304 taichi 10 0.529 0.002 0.528 0.528 0.529 0.529 0.534 7.525
4,194,304 aesara 10 0.586 0.001 0.585 0.585 0.585 0.586 0.589 6.796
4,194,304 numpy 10 3.981 0.032 3.951 3.957 3.972 3.997 4.058 1.000
(time in wall seconds, less is better; Δ is speedup relative to NumPy)
$ taskset -c 0 python run.py benchmarks/equation_of_state/ -s 16777216
benchmarks.equation_of_state
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 pytorch 10 0.976 0.004 0.971 0.972 0.976 0.978 0.985 16.375
16,777,216 tensorflow 10 1.277 0.003 1.272 1.275 1.276 1.278 1.284 12.518
16,777,216 jax 10 1.310 0.003 1.305 1.308 1.311 1.311 1.315 12.200
16,777,216 numba 10 1.741 0.003 1.739 1.739 1.740 1.745 1.745 9.177
16,777,216 taichi 10 1.985 0.002 1.982 1.983 1.986 1.988 1.988 8.048
16,777,216 aesara 10 2.329 0.005 2.322 2.326 2.328 2.332 2.339 6.861
16,777,216 numpy 10 15.980 0.071 15.908 15.930 15.951 16.024 16.122 1.000
$ for backend in cupy jax pytorch taichi tensorflow; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/equation_of_state/ --gpu -b $backend -b numpy; done
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 10,000 0.002 0.001 0.002 0.002 0.002 0.002 0.016 1.000
4,096 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.018 0.208
16,384 numpy 1,000 0.007 0.001 0.007 0.007 0.007 0.008 0.017 1.000
16,384 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.018 0.901
65,536 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.017 3.311
65,536 numpy 1,000 0.028 0.002 0.026 0.026 0.027 0.027 0.040 1.000
262,144 cupy 1,000 0.008 0.001 0.008 0.008 0.008 0.008 0.022 21.066
262,144 numpy 100 0.176 0.012 0.111 0.176 0.176 0.179 0.185 1.000
1,048,576 cupy 100 0.011 0.001 0.011 0.011 0.011 0.011 0.020 64.168
1,048,576 numpy 10 0.716 0.005 0.710 0.713 0.716 0.718 0.724 1.000
4,194,304 cupy 100 0.040 0.000 0.040 0.040 0.040 0.040 0.040 99.265
4,194,304 numpy 10 3.956 0.041 3.924 3.928 3.932 3.961 4.045 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.012 12.803
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.014 1.000
16,384 jax 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 60.102
16,384 numpy 1,000 0.008 0.001 0.007 0.007 0.007 0.008 0.017 1.000
65,536 jax 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.013 199.058
65,536 numpy 1,000 0.029 0.006 0.026 0.027 0.027 0.028 0.052 1.000
262,144 jax 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.009 810.167
262,144 numpy 100 0.179 0.010 0.110 0.175 0.178 0.180 0.205 1.000
1,048,576 jax 1,000 0.000 0.001 0.000 0.000 0.000 0.000 0.010 1686.376
1,048,576 numpy 10 0.728 0.007 0.713 0.725 0.729 0.732 0.736 1.000
4,194,304 jax 100 0.001 0.000 0.001 0.001 0.001 0.001 0.002 3202.356
4,194,304 numpy 10 3.999 0.063 3.893 3.945 4.017 4.030 4.110 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 pytorch 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 8.881
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.009 1.000
16,384 pytorch 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 35.893
16,384 numpy 1,000 0.008 0.001 0.007 0.007 0.008 0.008 0.016 1.000
65,536 pytorch 10,000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 147.702
65,536 numpy 100 0.032 0.005 0.027 0.027 0.033 0.036 0.047 1.000
262,144 pytorch 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 732.045
262,144 numpy 100 0.197 0.005 0.188 0.192 0.198 0.202 0.210 1.000
1,048,576 pytorch 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.008 1524.343
1,048,576 numpy 10 0.739 0.009 0.728 0.733 0.737 0.739 0.757 1.000
4,194,304 pytorch 100 0.002 0.000 0.002 0.002 0.002 0.002 0.002 2410.544
4,194,304 numpy 10 4.029 0.026 4.000 4.015 4.019 4.035 4.092 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 taichi 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.016 21.539
4,096 numpy 10,000 0.002 0.001 0.002 0.002 0.002 0.002 0.018 1.000
16,384 taichi 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.016 74.277
16,384 numpy 1,000 0.008 0.001 0.007 0.007 0.008 0.008 0.023 1.000
65,536 taichi 10,000 0.000 0.001 0.000 0.000 0.000 0.000 0.015 308.928
65,536 numpy 1,000 0.029 0.006 0.026 0.027 0.027 0.028 0.076 1.000
262,144 taichi 1,000 0.000 0.001 0.000 0.000 0.000 0.000 0.012 1517.780
262,144 numpy 100 0.178 0.007 0.174 0.175 0.176 0.178 0.202 1.000
1,048,576 taichi 1,000 0.000 0.001 0.000 0.000 0.000 0.000 0.012 3093.831
1,048,576 numpy 10 0.719 0.012 0.710 0.711 0.713 0.720 0.742 1.000
4,194,304 taichi 100 0.001 0.000 0.001 0.001 0.001 0.001 0.001 5964.008
4,194,304 numpy 10 3.934 0.011 3.917 3.924 3.934 3.943 3.949 1.000
(time in wall seconds, less is better)
benchmarks.equation_of_state
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 tensorflow 10,000 0.001 0.000 0.000 0.000 0.001 0.001 0.011 3.284
4,096 numpy 10,000 0.002 0.000 0.002 0.002 0.002 0.002 0.012 1.000
16,384 tensorflow 10,000 0.001 0.000 0.000 0.000 0.001 0.001 0.010 14.329
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.008 0.009 1.000
65,536 tensorflow 1,000 0.001 0.000 0.000 0.000 0.001 0.001 0.009 50.879
65,536 numpy 100 0.026 0.001 0.026 0.026 0.026 0.027 0.029 1.000
262,144 tensorflow 1,000 0.001 0.000 0.000 0.000 0.001 0.001 0.001 233.719
262,144 numpy 100 0.119 0.009 0.112 0.115 0.116 0.120 0.147 1.000
1,048,576 tensorflow 1,000 0.001 0.000 0.001 0.001 0.001 0.001 0.001 1150.812
1,048,576 numpy 10 0.674 0.010 0.667 0.667 0.669 0.673 0.699 1.000
4,194,304 tensorflow 100 0.001 0.000 0.001 0.001 0.001 0.001 0.001 5013.081
4,194,304 numpy 10 3.929 0.040 3.884 3.888 3.930 3.963 3.983 1.000
(time in wall seconds, less is better)
This benchmark is a more balanced routine with many data dependencies (stencil operations) and tensor shapes of up to 5 dimensions. It is the most expensive part of Veros, so in a way it is the benchmark that interests me the most.
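As a rough sketch (a generic 5-point kernel, not the actual Veros code), a stencil operation couples each output element to its neighbours. The neighbour reads are the "data dependencies" referred to above: they force extra memory traffic and limit how freely the computation can be reordered or fused compared to a purely element-wise kernel.

```python
import numpy as np

def toy_stencil(a):
    """Hypothetical 5-point stencil: each interior point depends on its
    four horizontal/vertical neighbours, creating cross-element data
    dependencies that element-wise kernels don't have."""
    out = np.zeros_like(a)
    out[1:-1, 1:-1] = (
        a[1:-1, 1:-1]
        + 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1]
                  + a[1:-1, :-2] + a[1:-1, 2:])
    )
    return out

a = np.random.rand(64, 64)
b = toy_stencil(a)
```

In NumPy each shifted slice above materializes a temporary array, while JIT backends like Numba or Taichi can compute the whole stencil in one fused loop nest, which is consistent with their lead in the table below.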
$ taskset -c 0 python run.py benchmarks/isoneutral_mixing/
benchmarks.isoneutral_mixing
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numba 10,000 0.001 0.001 0.001 0.001 0.001 0.001 0.029 3.252
4,096 taichi 1,000 0.001 0.001 0.001 0.001 0.001 0.001 0.027 3.213
4,096 jax 1,000 0.001 0.001 0.001 0.001 0.001 0.001 0.026 3.197
4,096 aesara 1,000 0.003 0.001 0.003 0.003 0.003 0.003 0.027 1.459
4,096 numpy 1,000 0.004 0.002 0.004 0.004 0.004 0.004 0.032 1.000
4,096 pytorch 1,000 0.004 0.001 0.004 0.004 0.004 0.005 0.029 0.990
16,384 taichi 1,000 0.006 0.001 0.006 0.006 0.006 0.006 0.029 2.665
16,384 jax 1,000 0.006 0.001 0.006 0.006 0.006 0.006 0.036 2.407
16,384 numba 1,000 0.006 0.001 0.006 0.006 0.006 0.006 0.030 2.366
16,384 aesara 1,000 0.010 0.001 0.010 0.010 0.010 0.011 0.038 1.492
16,384 pytorch 1,000 0.011 0.001 0.011 0.011 0.011 0.011 0.035 1.394
16,384 numpy 1,000 0.015 0.001 0.015 0.015 0.015 0.015 0.040 1.000
65,536 taichi 100 0.025 0.000 0.024 0.025 0.025 0.025 0.025 2.388
65,536 jax 100 0.027 0.000 0.027 0.027 0.027 0.028 0.028 2.146
65,536 numba 100 0.028 0.000 0.028 0.028 0.028 0.028 0.028 2.081
65,536 pytorch 100 0.039 0.000 0.038 0.038 0.038 0.039 0.039 1.527
65,536 aesara 100 0.040 0.003 0.039 0.039 0.040 0.040 0.067 1.471
65,536 numpy 100 0.059 0.000 0.058 0.059 0.059 0.059 0.060 1.000
262,144 taichi 100 0.105 0.002 0.104 0.105 0.105 0.105 0.129 1.970
262,144 jax 100 0.108 0.001 0.107 0.107 0.107 0.108 0.109 1.923
262,144 numba 100 0.109 0.001 0.108 0.109 0.109 0.109 0.111 1.895
262,144 pytorch 100 0.139 0.002 0.137 0.138 0.138 0.140 0.149 1.486
262,144 aesara 100 0.147 0.001 0.145 0.146 0.147 0.147 0.153 1.407
262,144 numpy 10 0.207 0.003 0.202 0.205 0.206 0.208 0.215 1.000
1,048,576 taichi 10 0.411 0.000 0.411 0.411 0.411 0.411 0.413 2.325
1,048,576 numba 10 0.468 0.001 0.468 0.468 0.468 0.468 0.470 2.041
1,048,576 jax 10 0.623 0.003 0.621 0.621 0.622 0.624 0.630 1.534
1,048,576 pytorch 10 0.698 0.007 0.690 0.695 0.696 0.699 0.712 1.370
1,048,576 aesara 10 0.706 0.007 0.693 0.703 0.705 0.707 0.718 1.355
1,048,576 numpy 10 0.956 0.010 0.950 0.951 0.952 0.957 0.984 1.000
4,194,304 taichi 10 1.658 0.002 1.656 1.657 1.657 1.658 1.661 3.031
4,194,304 numba 10 2.374 0.003 2.369 2.373 2.374 2.375 2.380 2.116
4,194,304 jax 10 2.974 0.006 2.968 2.970 2.971 2.977 2.988 1.689
4,194,304 aesara 10 3.662 0.013 3.650 3.653 3.656 3.663 3.694 1.372
4,194,304 pytorch 10 4.187 0.103 3.974 4.133 4.193 4.227 4.363 1.200
4,194,304 numpy 10 5.024 0.034 4.993 5.002 5.009 5.025 5.091 1.000
(time in wall seconds, less is better)
$ taskset -c 0 python run.py benchmarks/isoneutral_mixing/ -s 16777216
benchmarks.isoneutral_mixing
============================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 numba 10 9.473 0.040 9.392 9.476 9.485 9.496 9.514 2.711
16,777,216 taichi 10 9.845 0.038 9.818 9.822 9.826 9.840 9.923 2.609
16,777,216 jax 10 12.243 0.044 12.155 12.235 12.247 12.281 12.292 2.098
16,777,216 aesara 10 15.542 0.067 15.460 15.476 15.547 15.566 15.662 1.652
16,777,216 numpy 10 25.681 0.157 25.448 25.566 25.692 25.776 25.957 1.000
16,777,216 pytorch 10 28.955 0.054 28.869 28.937 28.948 28.958 29.098 0.887
(time in wall seconds, less is better)
$ for backend in cupy jax pytorch; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/isoneutral_mixing/ --gpu -b $backend -b numpy; done
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.027 1.000
4,096 cupy 1,000 0.014 0.001 0.014 0.014 0.014 0.014 0.037 0.292
16,384 cupy 1,000 0.014 0.001 0.014 0.014 0.014 0.014 0.037 1.054
16,384 numpy 1,000 0.015 0.001 0.015 0.015 0.015 0.015 0.040 1.000
65,536 cupy 100 0.014 0.002 0.014 0.014 0.014 0.014 0.037 3.836
65,536 numpy 100 0.055 0.002 0.054 0.054 0.055 0.055 0.066 1.000
262,144 cupy 100 0.015 0.001 0.014 0.014 0.014 0.014 0.023 16.322
262,144 numpy 10 0.237 0.007 0.228 0.229 0.238 0.244 0.247 1.000
1,048,576 cupy 10 0.015 0.001 0.014 0.015 0.015 0.015 0.018 72.092
1,048,576 numpy 10 1.070 0.002 1.067 1.068 1.070 1.072 1.073 1.000
4,194,304 cupy 10 0.051 0.000 0.050 0.050 0.051 0.051 0.051 98.410
4,194,304 numpy 10 4.974 0.029 4.945 4.954 4.960 4.981 5.037 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 1,000 0.000 0.002 0.000 0.000 0.000 0.000 0.023 8.668
4,096 numpy 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.027 1.000
16,384 jax 1,000 0.000 0.001 0.000 0.000 0.000 0.000 0.023 33.054
16,384 numpy 1,000 0.015 0.001 0.015 0.015 0.015 0.015 0.043 1.000
65,536 jax 100 0.001 0.000 0.000 0.000 0.001 0.001 0.004 100.245
65,536 numpy 100 0.055 0.001 0.054 0.054 0.055 0.055 0.062 1.000
262,144 jax 100 0.002 0.001 0.002 0.002 0.002 0.002 0.011 118.359
262,144 numpy 100 0.231 0.006 0.213 0.227 0.230 0.234 0.251 1.000
1,048,576 jax 10 0.009 0.001 0.008 0.009 0.009 0.010 0.010 114.054
1,048,576 numpy 10 1.062 0.011 1.051 1.056 1.058 1.067 1.086 1.000
4,194,304 jax 10 0.025 0.000 0.024 0.025 0.025 0.025 0.025 199.321
4,194,304 numpy 10 4.954 0.054 4.914 4.924 4.935 4.946 5.104 1.000
(time in wall seconds, less is better)
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.004 0.001 0.004 0.004 0.004 0.004 0.027 1.000
4,096 pytorch 1,000 0.006 0.001 0.005 0.005 0.005 0.006 0.029 0.775
16,384 pytorch 1,000 0.006 0.002 0.005 0.005 0.005 0.006 0.029 2.709
16,384 numpy 1,000 0.015 0.000 0.015 0.015 0.015 0.015 0.024 1.000
65,536 pytorch 100 0.006 0.000 0.005 0.006 0.006 0.006 0.006 9.853
65,536 numpy 100 0.055 0.001 0.055 0.055 0.055 0.056 0.066 1.000
262,144 pytorch 100 0.006 0.000 0.006 0.006 0.006 0.006 0.008 38.100
262,144 numpy 10 0.227 0.009 0.202 0.226 0.230 0.233 0.236 1.000
1,048,576 pytorch 10 0.008 0.000 0.008 0.008 0.008 0.008 0.008 134.397
1,048,576 numpy 10 1.086 0.011 1.074 1.076 1.084 1.096 1.103 1.000
4,194,304 pytorch 10 0.022 0.000 0.022 0.022 0.022 0.022 0.023 223.333
4,194,304 numpy 10 5.021 0.027 4.988 5.000 5.016 5.044 5.068 1.000
(time in wall seconds, less is better)
$ CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/isoneutral_mixing/ --gpu -b taichi -b numpy -s 1_048_576
benchmarks.isoneutral_mixing
============================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
1,048,576 taichi 10 0.101 0.004 0.096 0.096 0.103 0.104 0.104 10.831
1,048,576 numpy 10 1.089 0.010 1.073 1.085 1.089 1.092 1.110 1.000
(time in wall seconds, less is better)
This routine consists of some stencil operations and some linear algebra (a tridiagonal matrix solver), which is inherently sequential and cannot be vectorized.
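The sequential part can be illustrated with the Thomas algorithm, the standard tridiagonal solve (shown here as a generic sketch, not the benchmark's actual implementation). The forward sweep carries a dependency from row i-1 to row i, so it cannot be vectorized along the solve dimension; only independent systems can be batched.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d (Thomas algorithm).
    Both loops are inherently sequential: row i needs row i-1 (or i+1)."""
    n = len(d)
    cp = np.zeros(n)
    dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward sweep
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# quick check against a dense solve
n = 8
a = np.r_[0.0, np.random.rand(n - 1)]   # a[0] unused
b = np.random.rand(n) + 2.0             # diagonally dominant
c = np.r_[np.random.rand(n - 1), 0.0]   # c[-1] unused
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
d = np.random.rand(n)
x = thomas_solve(a, b, c, d)
assert np.allclose(A @ x, d)
```

Because of this sequential core, the speedups in this benchmark are more modest than in the purely element-wise equation-of-state case.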
$ taskset -c 0 python run.py benchmarks/turbulent_kinetic_energy/
benchmarks.turbulent_kinetic_energy
===================================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.010 6.161
4,096 numba 1,000 0.001 0.000 0.001 0.001 0.001 0.001 0.010 2.180
4,096 pytorch 1,000 0.002 0.001 0.002 0.002 0.002 0.002 0.016 1.062
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.003 1.000
16,384 jax 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.011 3.961
16,384 numba 1,000 0.004 0.000 0.004 0.004 0.004 0.004 0.013 2.000
16,384 pytorch 1,000 0.005 0.001 0.004 0.004 0.004 0.005 0.017 1.635
16,384 numpy 1,000 0.007 0.001 0.007 0.007 0.007 0.008 0.017 1.000
65,536 jax 100 0.008 0.001 0.008 0.008 0.008 0.008 0.018 3.114
65,536 numba 100 0.012 0.000 0.012 0.012 0.012 0.012 0.013 2.091
65,536 pytorch 100 0.015 0.000 0.015 0.015 0.015 0.016 0.016 1.661
65,536 numpy 100 0.026 0.000 0.024 0.025 0.026 0.026 0.027 1.000
262,144 jax 100 0.030 0.000 0.030 0.030 0.030 0.031 0.032 2.976
262,144 numba 100 0.040 0.000 0.040 0.040 0.040 0.040 0.042 2.258
262,144 pytorch 100 0.051 0.001 0.050 0.051 0.051 0.051 0.054 1.777
262,144 numpy 100 0.091 0.001 0.089 0.090 0.090 0.091 0.097 1.000
1,048,576 numba 10 0.163 0.002 0.160 0.161 0.163 0.164 0.165 2.694
1,048,576 jax 10 0.167 0.004 0.158 0.166 0.167 0.169 0.172 2.625
1,048,576 pytorch 10 0.254 0.004 0.250 0.252 0.253 0.254 0.262 1.723
1,048,576 numpy 10 0.438 0.004 0.435 0.435 0.436 0.440 0.446 1.000
4,194,304 numba 10 0.827 0.014 0.800 0.821 0.826 0.836 0.849 2.578
4,194,304 jax 10 1.073 0.008 1.063 1.068 1.072 1.078 1.088 1.985
4,194,304 pytorch 10 1.691 0.054 1.630 1.643 1.689 1.709 1.810 1.260
4,194,304 numpy 10 2.131 0.018 2.119 2.120 2.124 2.135 2.181 1.000
(time in wall seconds, less is better)
$ taskset -c 0 python run.py benchmarks/turbulent_kinetic_energy/ -s 16777216
benchmarks.turbulent_kinetic_energy
===================================
Running on CPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
16,777,216 numba 10 3.708 0.015 3.686 3.693 3.715 3.721 3.728 3.117
16,777,216 jax 10 4.734 0.008 4.724 4.728 4.732 4.740 4.749 2.441
16,777,216 pytorch 10 9.745 0.040 9.696 9.707 9.735 9.790 9.801 1.186
16,777,216 numpy 10 11.558 0.041 11.515 11.532 11.549 11.568 11.667 1.000
(time in wall seconds, less is better)
$ for backend in jax pytorch; do CUDA_VISIBLE_DEVICES="0" python run.py benchmarks/turbulent_kinetic_energy/ --gpu -b $backend -b numpy; done
benchmarks.turbulent_kinetic_energy
===================================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 jax 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 6.956
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.003 1.000
16,384 jax 1,000 0.000 0.000 0.000 0.000 0.000 0.000 0.001 18.489
16,384 numpy 1,000 0.007 0.000 0.007 0.007 0.007 0.008 0.008 1.000
65,536 jax 100 0.001 0.000 0.000 0.001 0.001 0.001 0.001 44.388
65,536 numpy 100 0.026 0.000 0.025 0.025 0.026 0.026 0.027 1.000
262,144 jax 100 0.001 0.000 0.001 0.001 0.001 0.001 0.002 64.117
262,144 numpy 100 0.091 0.002 0.089 0.089 0.090 0.092 0.095 1.000
1,048,576 jax 10 0.005 0.000 0.005 0.005 0.005 0.005 0.005 93.975
1,048,576 numpy 10 0.493 0.007 0.488 0.489 0.489 0.500 0.506 1.000
4,194,304 jax 10 0.020 0.000 0.019 0.020 0.020 0.020 0.020 109.825
4,194,304 numpy 10 2.159 0.036 2.115 2.128 2.154 2.179 2.230 1.000
(time in wall seconds, less is better)
benchmarks.turbulent_kinetic_energy
===================================
Running on GPU
size backend calls mean stdev min 25% median 75% max Δ
------------------------------------------------------------------------------------------------------------------
4,096 numpy 1,000 0.002 0.000 0.002 0.002 0.002 0.002 0.006 1.000
4,096 pytorch 1,000 0.005 0.001 0.005 0.005 0.005 0.005 0.010 0.498
16,384 pytorch 1,000 0.005 0.001 0.005 0.005 0.005 0.005 0.009 1.432
16,384 numpy 1,000 0.008 0.001 0.007 0.008 0.008 0.008 0.011 1.000
65,536 pytorch 100 0.006 0.000 0.006 0.006 0.006 0.006 0.009 4.611
65,536 numpy 100 0.028 0.003 0.025 0.026 0.026 0.032 0.033 1.000
262,144 pytorch 100 0.007 0.001 0.007 0.007 0.007 0.007 0.010 16.117
262,144 numpy 100 0.117 0.003 0.100 0.117 0.117 0.118 0.123 1.000
1,048,576 pytorch 10 0.009 0.000 0.009 0.009 0.009 0.009 0.009 55.791
1,048,576 numpy 10 0.516 0.010 0.507 0.509 0.512 0.519 0.541 1.000
4,194,304 pytorch 10 0.023 0.001 0.023 0.023 0.023 0.023 0.025 94.396
4,194,304 numpy 10 2.174 0.010 2.150 2.171 2.173 2.178 2.189 1.000
(time in wall seconds, less is better)