c - Why does vectorizing the loop not have a performance improvement?
I am investigating the effect of vectorization on the performance of a program. In that regard, I have written the following code:
    #include <stdio.h>
    #include <sys/time.h>
    #include <stdlib.h>

    #define len 10000000

    int main(){
        struct timeval sttime, endtime;

        double* a = (double*)malloc(len*sizeof(*a));
        double* b = (double*)malloc(len*sizeof(*b));
        double* c = (double*)malloc(len*sizeof(*c));

        int k;
        for(k = 0; k < len; k++){
            a[k] = rand();
            b[k] = rand();
        }

        gettimeofday(&sttime, NULL);

        for(k = 0; k < len; k++)
            c[k] = a[k] * b[k];

        gettimeofday(&endtime, NULL);

        FILE* fh = fopen("dump", "w");
        for(k = 0; k < len; k++)
            fprintf(fh, "c[%d] = %f\t", k, c[k]);
        fclose(fh);

        double timee = (double)(endtime.tv_usec + endtime.tv_sec*1000000
                              - sttime.tv_usec - sttime.tv_sec*1000000);
        printf("time elapsed: %f\n", timee);

        return 0;
    }
In this code, I am initializing and multiplying two vectors. The results are saved in vector c. I am interested in the effect of vectorizing the following loop:
    for(k = 0; k < len; k++)
        c[k] = a[k] * b[k];
I compile the code using the following two commands:
    1) icc -O2 testsmid.c -o testsmid -no-vec -no-simd
    2) icc -O2 testsmid.c -o testsmid -vec-report2
I expect to see a performance improvement, since the second command vectorizes the loop. However, my measurements show that there is no improvement when the loop is vectorized.
I may have missed something here, since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.
Thanks in advance for your help.
PS: I am using Mac OS X, so there is no need to align the data; all the allocated memory is 16-byte aligned.
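(Side note, not part of the original code: on platforms where malloc does not guarantee 16-byte alignment, the alignment can be requested explicitly. A minimal sketch using posix_memalign, assuming a POSIX system; the helper name is illustrative:)

    #include <stdlib.h>

    /* Allocate n doubles on a 16-byte boundary; returns NULL on failure. */
    static double* alloc_aligned_doubles(size_t n){
        void* p = NULL;
        if(posix_memalign(&p, 16, n * sizeof(double)) != 0)
            return NULL;
        return (double*)p;
    }

The returned pointer can be released with the usual free().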
EDIT: First of all, I would like to thank you all for your comments and answers. I thought about the answer proposed by @Mysticial, and there are some further points that should be mentioned here. Firstly, as @Vinska mentioned, c[k] = a[k] * b[k] does not take only one cycle. In addition to the loop index increment and the comparison made to ensure that k is smaller than len, there are other things to be done to perform the operation. Having a look at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like:
    l_b1.9:                       # preds l_b1.8
            movq      %r13, %rax                    #25.5
            andq      $15, %rax                     #25.5
            testl     %eax, %eax                    #25.5
            je        l_b1.12     # prob 50%        #25.5
                                  # loe rbx r12 r13 r14 r15 eax
    l_b1.10:                      # preds l_b1.9
            testb     $7, %al                       #25.5
            jne       l_b1.32     # prob 10%        #25.5
                                  # loe rbx r12 r13 r14 r15
    l_b1.11:                      # preds l_b1.10
            movsd     (%r14), %xmm0                 #26.16
            movl      $1, %eax                      #25.5
            mulsd     (%r15), %xmm0                 #26.23
            movsd     %xmm0, (%r13)                 #26.9
                                  # loe rbx r12 r13 r14 r15 eax
    l_b1.12:                      # preds l_b1.11 l_b1.9
            movl      %eax, %edx                    #25.5
            movl      %eax, %eax                    #26.23
            negl      %edx                          #25.5
            andl      $1, %edx                      #25.5
            negl      %edx                          #25.5
            addl      $10000000, %edx               #25.5
            lea       (%r15,%rax,8), %rcx           #26.23
            testq     $15, %rcx                     #25.5
            je        l_b1.16     # prob 60%        #25.5
                                  # loe rdx rbx r12 r13 r14 r15 eax
    l_b1.13:                      # preds l_b1.12
            movl      %eax, %eax                    #25.5
                                  # loe rax rdx rbx r12 r13 r14 r15
    l_b1.14:                      # preds l_b1.14 l_b1.13
            movups    (%r15,%rax,8), %xmm0          #26.23
            movsd     (%r14,%rax,8), %xmm1          #26.16
            movhpd    8(%r14,%rax,8), %xmm1         #26.16
            mulpd     %xmm0, %xmm1                  #26.23
            movntpd   %xmm1, (%r13,%rax,8)          #26.9
            addq      $2, %rax                      #25.5
            cmpq      %rdx, %rax                    #25.5
            jb        l_b1.14     # prob 99%        #25.5
            jmp       l_b1.20     # prob 100%       #25.5
                                  # loe rax rdx rbx r12 r13 r14 r15
    l_b1.16:                      # preds l_b1.12
            movl      %eax, %eax                    #25.5
                                  # loe rax rdx rbx r12 r13 r14 r15
    l_b1.17:                      # preds l_b1.17 l_b1.16
            movsd     (%r14,%rax,8), %xmm0          #26.16
            movhpd    8(%r14,%rax,8), %xmm0         #26.16
            mulpd     (%r15,%rax,8), %xmm0          #26.23
            movntpd   %xmm0, (%r13,%rax,8)          #26.9
            addq      $2, %rax                      #25.5
            cmpq      %rdx, %rax                    #25.5
            jb        l_b1.17     # prob 99%        #25.5
                                  # loe rax rdx rbx r12 r13 r14 r15
    l_b1.18:                      # preds l_b1.17
            mfence                                  #25.5
                                  # loe rdx rbx r12 r13 r14 r15
    l_b1.19:                      # preds l_b1.18
            mfence                                  #25.5
                                  # loe rdx rbx r12 r13 r14 r15
    l_b1.20:                      # preds l_b1.14 l_b1.19 l_b1.32
            cmpq      $10000000, %rdx               #25.5
            jae       l_b1.24     # prob 0%         #25.5
                                  # loe rdx rbx r12 r13 r14 r15
    l_b1.22:                      # preds l_b1.20 l_b1.22
            movsd     (%r14,%rdx,8), %xmm0          #26.16
            mulsd     (%r15,%rdx,8), %xmm0          #26.23
            movsd     %xmm0, (%r13,%rdx,8)          #26.9
            incq      %rdx                          #25.5
            cmpq      $10000000, %rdx               #25.5
            jb        l_b1.22     # prob 99%        #25.5
                                  # loe rdx rbx r12 r13 r14 r15
    l_b1.24:                      # preds l_b1.22 l_b1.20
And the non-vectorized version is:
    l_b1.9:                       # preds l_b1.8
            xorl      %eax, %eax                    #25.5
                                  # loe rbx r12 r13 r14 r15 eax
    l_b1.10:                      # preds l_b1.10 l_b1.9
            lea       (%rax,%rax), %edx             #26.9
            incl      %eax                          #25.5
            cmpl      $5000000, %eax                #25.5
            movsd     (%r15,%rdx,8), %xmm0          #26.16
            movsd     8(%r15,%rdx,8), %xmm1         #26.16
            mulsd     (%r13,%rdx,8), %xmm0          #26.23
            mulsd     8(%r13,%rdx,8), %xmm1         #26.23
            movsd     %xmm0, (%rbx,%rdx,8)          #26.9
            movsd     %xmm1, 8(%rbx,%rdx,8)         #26.9
            jb        l_b1.10     # prob 99%        #25.5
                                  # loe rbx r12 r13 r14 r15 eax
Besides this, the processor does not load only 24 bytes per iteration. In each access to memory, a full cache line (64 bytes) is loaded. More importantly, since the memory required for a, b, and c is contiguous, the prefetcher helps a lot and loads the next blocks in advance. Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.
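(The following sketch is my own illustration, not part of the original post. It tries to make the cache-line point measurable: it times a full pass over a large array against a strided pass that writes only one double per 64-byte line. If memory traffic were per-element, the strided pass would be about 8x faster; because whole lines are transferred, the two passes take a similar amount of time. The buffer size and names are arbitrary assumptions:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define n_elems (1 << 26)   /* 64M doubles = 512 MB, much larger than any cache */

    int main(void){
        double* a = (double*)malloc((size_t)n_elems * sizeof(*a));
        long k;

        for(k = 0; k < n_elems; k++) a[k] = 0.0;   /* touch the pages first */

        clock_t t0 = clock();
        for(k = 0; k < n_elems; k++)       /* full pass: write every double */
            a[k] = 1.0;
        clock_t t1 = clock();

        clock_t t2 = clock();
        for(k = 0; k < n_elems; k += 8)    /* strided pass: one double per 64-byte line */
            a[k] = 2.0;
        clock_t t3 = clock();

        printf("full pass:    %f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("strided pass: %f s\n", (double)(t3 - t2) / CLOCKS_PER_SEC);

        free(a);
        return 0;
    }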
Moreover, using SIMD improves the performance of a program even for a simple operation like addition, as mentioned in the Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this simple loop.
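(For comparison, here is a hand-vectorized version of the same loop written with explicit SSE2 intrinsics instead of relying on the auto-vectorizer. This is an illustration of mine, not code from the original post; it assumes an x86-64 target and an even element count, and it uses unaligned loads/stores so no alignment is required:)

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* c[k] = a[k] * b[k], processing two doubles per iteration. */
    static void mul_sse2(const double* a, const double* b, double* c, int n){
        int k;
        for(k = 0; k < n; k += 2){
            __m128d va = _mm_loadu_pd(&a[k]);
            __m128d vb = _mm_loadu_pd(&b[k]);
            _mm_storeu_pd(&c[k], _mm_mul_pd(va, vb));
        }
    }

Calling mul_sse2(a, b, c, len) does the same work as the scalar loop, two elements at a time.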
EDIT2: Thanks again for your comments. Also, thanks to @Mysticial's sample code, I finally saw the effect of SIMD on performance improvement. The problem, as Mysticial mentioned, was the memory bandwidth. By choosing a small size for a, b, and c which fits into the L1 cache, it can be seen that SIMD can improve the performance significantly. Here are the results I got:
    icc -O2 -o testsmidnovec -no-vec testsmid2.c:              17.34 sec
    icc -O2 -o testsmidvecnounroll -vec-report2 testsmid2.c:    9.33 sec
And unrolling the loop improves the performance further:
    icc -O2 -o testsmidvecunroll -vec-report2 testsmid2.c -unroll=8:  8.6 sec
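(Side note of mine, not from the original post: instead of the command-line switch, icc also accepts a per-loop unroll pragma, which requests the same unrolling for just this loop:)

    #pragma unroll(8)
    for(k = 0; k < len; k++)
        c[k] = a[k] * b[k];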
Also, I should mention that it takes only about one cycle for my processor to complete an iteration when compiled with -O2.
PS: My computer is a MacBook Pro with a Core i5 @ 2.5 GHz (dual core).
This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and this answer are somewhat out-of-date.
See the end of this answer for the 2017 update.
Original answer (2013):
It's because you're bottlenecked by memory bandwidth.
While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.
In your example:
    for(k = 0; k < len; k++)
        c[k] = a[k] * b[k];
You are making a single pass over all the memory while doing very little work. You are maxing out your memory bandwidth.
So regardless of how it's optimized (vectorized, unrolled, etc...), it isn't going to get much faster.
A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*.
Your loop touches 24 bytes/iteration.
Without vectorization, a modern x64 processor can probably do about 1 iteration per cycle*.
Suppose you're running at 4 GHz:
(4 * 10^9 iterations/s) * (24 bytes/iteration) = 96 GB/s
That's nearly 10x your memory bandwidth, and that's without vectorization.
*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well, those were off the top of my head from experience. So here are some benchmarks to prove it.
The loop iteration can run as fast as 1 cycle/iteration:
We can get rid of the memory bottleneck if we reduce len so that everything fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)
    #include <iostream>
    #include <stdlib.h>
    #include <time.h>

    using std::cout;
    using std::endl;

    int main(){
        const int len = 256;

        double *a = (double*)malloc(len*sizeof(*a));
        double *b = (double*)malloc(len*sizeof(*a));
        double *c = (double*)malloc(len*sizeof(*a));

        int k;
        for(k = 0; k < len; k++){
            a[k] = rand();
            b[k] = rand();
        }

        clock_t time0 = clock();

        for(int i = 0; i < 100000000; i++){
            for(k = 0; k < len; k++)
                c[k] = a[k] * b[k];
        }

        clock_t time1 = clock();
        cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    }
- Processor: Intel Core i7 2600K @ 4.2 GHz
- Compiler: Visual Studio 2012
- Time: 6.55 seconds
In this test, I ran 25,600,000,000 iterations (256 elements x 100,000,000 passes) in only 6.55 seconds.
6.55 * 4.2 GHz = 27,510,000,000 cycles
27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration
Now if you're wondering how it's possible to do:
- 2 loads
- 1 store
- 1 multiply
- increment counter
- compare + branch
all in 1 cycle...
It's because modern processors and compilers are awesome.
While each of these operations has latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine has a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput only when using 256b AVX loads/stores; otherwise only two total memory ops per cycle, at most one of them a store.)
Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.
Memory bandwidth is on the order of 10 GB/s:
The easiest way to test this is via memset():
    #include <iostream>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    using std::cout;
    using std::endl;

    int main(){
        const int len = 1 << 30;    // 1GB

        char *a = (char*)calloc(len, 1);

        clock_t time0 = clock();

        for(int i = 0; i < 100; i++){
            memset(a, 0xff, len);
        }

        clock_t time1 = clock();
        cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    }
- Processor: Intel Core i7 2600K @ 4.2 GHz
- Compiler: Visual Studio 2012
- Time: 5.811 seconds
So it takes my machine 5.811 seconds to write to 100 GB of memory. That's about 17.2 GB/s.
And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.
Update March 2017:
As of 2017, things have gotten a bit more complicated.
Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved, and there are more of them.
To put it mathematically:
- Each core has a bandwidth limit x.
- Main memory has a bandwidth limit of y.
- On older systems, x > y.
- On current high-end systems, x < y. But x * (# of cores) > y.
Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz
- No vectorization (8-byte loads/stores): x = 32 GB/s, y = ~17 GB/s
- Vectorized SSE* (16-byte loads/stores): x = 64 GB/s, y = ~17 GB/s
Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz
- No vectorization (8-byte loads/stores): x = 32 GB/s, y = ~70 GB/s
- Vectorized AVX* (32-byte loads/stores): x = 64 GB/s, y = ~70 GB/s
(For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)
So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that per-core limit of x. But you will still hit the main memory bandwidth limit of y with two or more threads.
But one thing hasn't changed and probably won't change for a long time: you will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.
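(As a rough way to see that saturation in practice, the memset benchmark above can be run on several threads at once. The sketch below is an illustrative addition of mine, not from the original answer; the buffer size, repetition count, and use of OpenMP are assumptions, and it must be compiled with OpenMP support, e.g. -fopenmp or icc -qopenmp:)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <omp.h>

    int main(void){
        const size_t len = (size_t)1 << 28;   /* 256 MB per thread */
        const int reps = 10;
        int threads = omp_get_max_threads();

        double t0 = omp_get_wtime();

        #pragma omp parallel
        {
            /* each thread hammers its own private buffer */
            char* buf = (char*)calloc(len, 1);
            for(int i = 0; i < reps; i++)
                memset(buf, 0xff, len);
            free(buf);
        }

        double t1 = omp_get_wtime();

        double total_gb = (double)threads * reps * len / 1e9;
        printf("%.1f GB written in %.3f s = %.1f GB/s aggregate (%d threads)\n",
               total_gb, t1 - t0, total_gb / (t1 - t0), threads);
        return 0;
    }

As the thread count grows, the aggregate GB/s figure levels off near the machine's total memory bandwidth rather than scaling linearly.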