深海游弋的鱼 – 默默的点滴

This is the sequel to the single-precision SSE-optimized sin, cos, log and exp that I wrote some time ago, adapted to the NEON FPU of my pandaboard. Precision and range are exactly the same as in the SSE version, so I won't repeat them.

The code

The functions below are licensed under the zlib license, so you can do basically what you want with them.

neon_mathfun.h: source code for sin_ps, cos_ps, sincos_ps, exp_ps and log_ps, as straight C.
neon_mathfun_test.c: validation and benchmark program for those functions. Do not forget to run it once.

Performance

Results on a pandaboard with a 1GHz dual-core ARM Cortex A9 (OMAP4), using gcc 4.6.1

command line: gcc -O3 -mfloat-abi=softfp -mfpu=neon -march=armv7-a -mtune=cortex-a9 -Wall -W neon_mathfun_test.c -lm

exp([        -1000,          -100,           100,          1000]) = [            0,             0, 2.4061436e+38, 2.4061436e+38]
exp([         -nan,           inf,          -inf,           nan]) = [          nan, 2.4061436e+38,             0,           nan]
log([            0,           -10,         1e+30, 1.0005271e-42]) = [         -nan,          -nan,     69.077553,          -nan]
log([         -nan,           inf,          -inf,           nan]) = [    89.128304,     88.722839,          -nan,     89.128304]
sin([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,          -nan,           nan]
cos([         -nan,           inf,          -inf,           nan]) = [          nan,           nan,           nan,           nan]
sin([       -1e+30,       -100000,         1e+30,        100000]) = [          inf,  -0.035749275,          -inf,   0.035749275]
cos([       -1e+30,       -100000,         1e+30,        100000]) = [          nan,    -0.9993608,           nan,    -0.9993608]
benching                 sinf .. ->    2.0 millions of vector evaluations/second -> 121 cycles/value on a 1000MHz computer
benching                 cosf .. ->    1.8 millions of vector evaluations/second -> 132 cycles/value on a 1000MHz computer
benching                 expf .. ->    1.1 millions of vector evaluations/second -> 221 cycles/value on a 1000MHz computer
benching                 logf .. ->    1.7 millions of vector evaluations/second -> 141 cycles/value on a 1000MHz computer
benching          cephes_sinf .. ->    2.4 millions of vector evaluations/second -> 103 cycles/value on a 1000MHz computer
benching          cephes_cosf .. ->    2.0 millions of vector evaluations/second -> 123 cycles/value on a 1000MHz computer
benching          cephes_expf .. ->    1.6 millions of vector evaluations/second -> 153 cycles/value on a 1000MHz computer
benching          cephes_logf .. ->    1.5 millions of vector evaluations/second -> 156 cycles/value on a 1000MHz computer
benching               sin_ps .. ->    5.8 millions of vector evaluations/second ->  43 cycles/value on a 1000MHz computer
benching               cos_ps .. ->    5.9 millions of vector evaluations/second ->  42 cycles/value on a 1000MHz computer
benching            sincos_ps .. ->    6.0 millions of vector evaluations/second ->  41 cycles/value on a 1000MHz computer
benching               exp_ps .. ->    5.6 millions of vector evaluations/second ->  44 cycles/value on a 1000MHz computer
benching               log_ps .. ->    5.3 millions of vector evaluations/second ->  47 cycles/value on a 1000MHz computer
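
The cycles/value column can be cross-checked against the throughput column. The following is a minimal sketch, assuming each vector evaluation produces four float results (one 128-bit NEON register) and that the clock really runs at the nominal rate printed in the log. The helper name cycles_per_value is hypothetical and is not part of neon_mathfun_test.c.

```c
#include <assert.h>

/* Hypothetical helper: cycles spent per scalar result.
   clock_mhz     nominal clock in MHz (1000 for the pandaboard run above)
   mvec_per_sec  millions of vector evaluations per second, as printed
   lanes         results per vector evaluation (assumed to be 4 floats) */
static double cycles_per_value(double clock_mhz, double mvec_per_sec, int lanes)
{
    assert(mvec_per_sec > 0.0 && lanes > 0);
    /* MHz divided by millions/second: the factor of 1e6 cancels out. */
    return clock_mhz / (mvec_per_sec * lanes);
}
```

For example, cycles_per_value(1000.0, 5.8, 4) gives about 43.1, in line with the 43 cycles/value reported for sin_ps. The printed throughput is rounded to one decimal, so for the slower functions the recomputed figure is only approximate: 2.0 gives 125, against the 121 reported for sinf.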

So performance is not stellar. I recommend using gcc 4.6.1 or newer, as it generates much better code than previous (gcc 4.5) versions: almost 20% faster here. I believe rewriting these functions in assembly would improve performance by 30%, and it should not be very hard, as ARM and NEON assembly is quite nice and easy to write. Maybe I'll do it. Computing two SIMD vectors at once would also improve performance a lot: NEON has enough registers for it, and it would reduce the dependencies between NEON instructions.
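
To illustrate the "two vectors at once" idea, here is a portable sketch in plain C, not the code from neon_mathfun.h. It evaluates one polynomial with Horner's scheme on two independent 4-float vectors in the same loop. The function name poly_eval_x2 and the coefficient layout are assumptions made for this example. In real NEON code each inner loop would presumably become a single multiply-accumulate on a float32x4_t (for instance with vmlaq_f32), and the two chains would live in separate registers.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* one 128-bit NEON register holds four floats */

/* Hypothetical sketch: Horner evaluation of the same polynomial on two
   independent vectors xa and xb. coef[0] is the highest-order coefficient.
   Chain A and chain B never read each other's results, so the CPU can
   overlap their multiply-adds instead of waiting on a single dependency
   chain. */
static void poly_eval_x2(const float *coef, size_t ncoef,
                         const float xa[LANES], const float xb[LANES],
                         float ya[LANES], float yb[LANES])
{
    assert(ncoef > 0);
    for (int l = 0; l < LANES; ++l) {
        ya[l] = coef[0];
        yb[l] = coef[0];
    }
    for (size_t i = 1; i < ncoef; ++i) {
        for (int l = 0; l < LANES; ++l) {
            ya[l] = ya[l] * xa[l] + coef[i];  /* chain A */
            yb[l] = yb[l] * xb[l] + coef[i];  /* chain B, independent of A */
        }
    }
}
```

With coefficients {1, 2, 3} the function computes x*x + 2*x + 3 for all eight inputs in one pass. Whether this actually helps on a given core depends on instruction latencies and register pressure, so it should be measured rather than assumed.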

Note also that I have no idea what the performance is on a Cortex A8. It may be extremely bad.

Comparison with an Intel Atom

For comparison purposes, here is the performance of the SSE version on a single-core Intel Atom N270 running at 1.66GHz:

command line: cl.exe /arch:SSE /O2 /TP /MD sse_mathfun_test.c (this is msvc 2010)

benching                 sinf .. ->    1.3 millions of vector evaluations/second -> 303 cycles/value on a 1600MHz computer
benching                 cosf .. ->    1.3 millions of vector evaluations/second -> 305 cycles/value on a 1600MHz computer
benching         sincos (x87) .. ->    1.2 millions of vector evaluations/second -> 314 cycles/value on a 1600MHz computer
benching                 expf .. ->    1.6 millions of vector evaluations/second -> 244 cycles/value on a 1600MHz computer
benching                 logf .. ->    1.4 millions of vector evaluations/second -> 276 cycles/value on a 1600MHz computer
benching          cephes_sinf .. ->    1.4 millions of vector evaluations/second -> 280 cycles/value on a 1600MHz computer
benching          cephes_cosf .. ->    1.5 millions of vector evaluations/second -> 265 cycles/value on a 1600MHz computer
benching          cephes_expf .. ->    0.7 millions of vector evaluations/second -> 548 cycles/value on a 1600MHz computer
benching          cephes_logf .. ->    0.8 millions of vector evaluations/second -> 489 cycles/value on a 1600MHz computer
benching               sin_ps .. ->    9.2 millions of vector evaluations/second ->  43 cycles/value on a 1600MHz computer
benching               cos_ps .. ->    9.5 millions of vector evaluations/second ->  42 cycles/value on a 1600MHz computer
benching            sincos_ps .. ->    8.8 millions of vector evaluations/second ->  45 cycles/value on a 1600MHz computer
benching               exp_ps .. ->    9.8 millions of vector evaluations/second ->  41 cycles/value on a 1600MHz computer
benching               log_ps .. ->    8.6 millions of vector evaluations/second ->  46 cycles/value on a 1600MHz computer

The cycle counts are quite similar, but the Atom has a higher clock.
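
To see what the clock difference means in wall-clock terms, a cycle count can be converted to time per value. This is a small sketch using the clock rates printed in the two benchmark logs (1000 MHz and 1600 MHz). The helper name ns_per_value is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: nanoseconds per scalar result for a given cycle
   count and clock rate in MHz. One cycle at f MHz lasts 1000/f ns. */
static double ns_per_value(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}
```

For sin_ps, 43 cycles/value works out to 43 ns on the Cortex A9 at 1000 MHz and to roughly 26.9 ns on the Atom at 1600 MHz, which matches the higher throughput in the Atom log (9.2 against 5.8 million vector evaluations/second).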

Last modified: 2011/05/29

Reference link

Simple ARM NEON optimized sin, cos, log and exp
