c - Why does vectorizing the loop not improve performance?


I am investigating the effect of vectorization on the performance of a program. In this regard, I have written the following code:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define len 10000000

int main(){

    struct timeval sttime, endtime;

    double* a = (double*)malloc(len*sizeof(*a));
    double* b = (double*)malloc(len*sizeof(*b));
    double* c = (double*)malloc(len*sizeof(*c));

    int k;

    /* initialize the input vectors */
    for(k = 0; k < len; k++){
        a[k] = rand();
        b[k] = rand();
    }

    gettimeofday(&sttime, NULL);

    /* the loop whose vectorization is being studied */
    for(k = 0; k < len; k++)
        c[k] = a[k] * b[k];

    gettimeofday(&endtime, NULL);

    /* dump the results so the loop cannot be optimized away */
    FILE* fh = fopen("dump", "w");
    for(k = 0; k < len; k++)
        fprintf(fh, "c[%d] = %f\t", k, c[k]);
    fclose(fh);

    double timee = (double)(endtime.tv_usec + endtime.tv_sec*1000000
                          - sttime.tv_usec - sttime.tv_sec*1000000);

    printf("Time elapsed: %f\n", timee);

    return 0;
}

In this code, I am initializing and multiplying two vectors. The results are saved in the vector c. I am interested in the effect of vectorizing the following loop:

for(k = 0; k < len; k++)
    c[k] = a[k] * b[k];
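(For reference, the transformation the auto-vectorizer performs is essentially equivalent to the following hand-written SSE2 sketch. This is only an illustration, not code from my program; it reuses a, b, c, k and len as declared above and assumes len is even:)

#include <emmintrin.h>   /* SSE2 intrinsics */

/* multiply two doubles per iteration instead of one */
for (k = 0; k < len; k += 2) {
    __m128d va = _mm_loadu_pd(&a[k]);            /* load a[k], a[k+1] */
    __m128d vb = _mm_loadu_pd(&b[k]);            /* load b[k], b[k+1] */
    _mm_storeu_pd(&c[k], _mm_mul_pd(va, vb));    /* c[k..k+1] = a*b   */
}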

I compile the code using the following two commands:

1) icc -O2 testsmid.c -o testsmid -no-vec -no-simd
2) icc -O2 testsmid.c -o testsmid -vec-report2

I expect to see a performance improvement, since the second command vectorizes the loop. However, my measurements show that there is no performance improvement when the loop is vectorized.

I may have missed something here, since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.

Thanks in advance for your help.

PS: Since I am using Mac OS X, there is no need to align the data; the allocated memory is already 16-byte aligned.
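(If I were on a platform where malloc does not guarantee suitable alignment, something like the following sketch - hypothetical, not part of my program - could be used to request 16-byte-aligned memory explicitly:)

#include <stdlib.h>

/* posix_memalign returns 0 on success; the alignment must be a
   power of two and a multiple of sizeof(void*) */
double *a = NULL;
if (posix_memalign((void**)&a, 16, len * sizeof(*a)) != 0) {
    /* allocation failed; handle the error */
}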

EDIT: First of all, thank you all for your comments and answers. I thought about the answer proposed by @Mysticial, and there are some further points that should be mentioned here. Firstly, as @Vinska mentioned, c[k] = a[k] * b[k] does not take only one cycle. In addition to the loop index increment and the comparison made to ensure that k is smaller than len, there are other things that have to be done to perform the operation. Looking at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like this:

L_B1.9:                         # Preds L_B1.8
        movq      %r13, %rax                                    #25.5
        andq      $15, %rax                                     #25.5
        testl     %eax, %eax                                    #25.5
        je        L_B1.12       # Prob 50%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.9
        testb     $7, %al                                       #25.5
        jne       L_B1.32       # Prob 10%                      #25.5
                                # LOE rbx r12 r13 r14 r15
L_B1.11:                        # Preds L_B1.10
        movsd     (%r14), %xmm0                                 #26.16
        movl      $1, %eax                                      #25.5
        mulsd     (%r15), %xmm0                                 #26.23
        movsd     %xmm0, (%r13)                                 #26.9
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.12:                        # Preds L_B1.11 L_B1.9
        movl      %eax, %edx                                    #25.5
        movl      %eax, %eax                                    #26.23
        negl      %edx                                          #25.5
        andl      $1, %edx                                      #25.5
        negl      %edx                                          #25.5
        addl      $10000000, %edx                               #25.5
        lea       (%r15,%rax,8), %rcx                           #26.23
        testq     $15, %rcx                                     #25.5
        je        L_B1.16       # Prob 60%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15 eax
L_B1.13:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.14:                        # Preds L_B1.14 L_B1.13
        movups    (%r15,%rax,8), %xmm0                          #26.23
        movsd     (%r14,%rax,8), %xmm1                          #26.16
        movhpd    8(%r14,%rax,8), %xmm1                         #26.16
        mulpd     %xmm0, %xmm1                                  #26.23
        movntpd   %xmm1, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.14       # Prob 99%                      #25.5
        jmp       L_B1.20       # Prob 100%                     #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.16:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.17:                        # Preds L_B1.17 L_B1.16
        movsd     (%r14,%rax,8), %xmm0                          #26.16
        movhpd    8(%r14,%rax,8), %xmm0                         #26.16
        mulpd     (%r15,%rax,8), %xmm0                          #26.23
        movntpd   %xmm0, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.17       # Prob 99%                      #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.18:                        # Preds L_B1.17
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.19:                        # Preds L_B1.18
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.20:                        # Preds L_B1.14 L_B1.19 L_B1.32
        cmpq      $10000000, %rdx                               #25.5
        jae       L_B1.24       # Prob 0%                       #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.22:                        # Preds L_B1.20 L_B1.22
        movsd     (%r14,%rdx,8), %xmm0                          #26.16
        mulsd     (%r15,%rdx,8), %xmm0                          #26.23
        movsd     %xmm0, (%r13,%rdx,8)                          #26.9
        incq      %rdx                                          #25.5
        cmpq      $10000000, %rdx                               #25.5
        jb        L_B1.22       # Prob 99%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.24:                        # Preds L_B1.22 L_B1.20

And the non-vectorized version is:

L_B1.9:                         # Preds L_B1.8
        xorl      %eax, %eax                                    #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.10 L_B1.9
        lea       (%rax,%rax), %edx                             #26.9
        incl      %eax                                          #25.5
        cmpl      $5000000, %eax                                #25.5
        movsd     (%r15,%rdx,8), %xmm0                          #26.16
        movsd     8(%r15,%rdx,8), %xmm1                         #26.16
        mulsd     (%r13,%rdx,8), %xmm0                          #26.23
        mulsd     8(%r13,%rdx,8), %xmm1                         #26.23
        movsd     %xmm0, (%rbx,%rdx,8)                          #26.9
        movsd     %xmm1, 8(%rbx,%rdx,8)                         #26.9
        jb        L_B1.10       # Prob 99%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax

Beside this, the processor does not load only 24 bytes. In each access to memory, a full line (64 bytes) is loaded. More importantly, since the memory required for a, b and c is contiguous, the prefetcher would help a lot and load the next blocks in advance. Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.

Moreover, using SIMD improves the performance of a program even for something as simple as addition, as mentioned in the Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this simple loop.

EDIT2: Thanks again for your comments. Also, thanks to @Mysticial's sample code, I finally saw the effect of SIMD on performance. The problem, as Mysticial mentioned, was the memory bandwidth. By choosing a small size for a, b and c which fits into the L1 cache, it can be seen that SIMD can improve the performance significantly. Here are the results I got:

icc -O2 -o testsmidnovec -no-vec testsmid2.c: 17.34 sec

icc -O2 -o testsmidvecnounroll -vec-report2 testsmid2.c: 9.33 sec

And unrolling the loop improves the performance further:

icc -O2 -o testsmidvecunroll -vec-report2 testsmid2.c -unroll=8: 8.6 sec

Also, I should mention that it takes only about one cycle for my processor to complete an iteration when compiled with -O2.

PS: My computer is a MacBook Pro with a Core i5 @ 2.5 GHz (dual core).

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date.

See the end of this answer for the 2017 update.


Original answer (2013):

Because you're bottlenecked by memory bandwidth.

While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.

In your example:

for(k = 0; k < len; k++)
    c[k] = a[k] * b[k];

You are making a single pass over all the memory while doing very little work. That is maxing out your memory bandwidth.

So regardless of how it's optimized (vectorized, unrolled, etc...), it isn't going to get much faster.


A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*.
Your loop touches 24 bytes/iteration.

Without vectorization, a modern x64 processor can probably do about 1 iteration per cycle*.

Suppose you're running at 4 GHz:

  • (4 * 10^9) * 24 bytes/iteration = 96 GB/s

That's ~10x of your memory bandwidth - without vectorization.
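As a sanity check against the loop in the question (this is my own back-of-the-envelope arithmetic, taking len = 10,000,000 from the posted code and the ~10 GB/s figure above):

  • 10,000,000 iterations * 24 bytes = 240 MB of memory traffic
  • 240 MB / 10 GB/s ≈ 24 ms just to move the data
  • 10,000,000 iterations / (4 * 10^9 cycles/s) = 2.5 ms of pure compute at 1 iteration/cycle

So the memory traffic alone takes roughly 10x longer than the computation - shrinking the compute part barely changes the total.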


*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. They were off the top of my head from experience. So here are some benchmarks to prove it.

The loop iteration can run as fast as 1 cycle/iteration:

We can get rid of the memory bottleneck if we reduce len so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)

#include <iostream>
#include <cstdlib>
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int len = 256;

    double *a = (double*)malloc(len*sizeof(*a));
    double *b = (double*)malloc(len*sizeof(*a));
    double *c = (double*)malloc(len*sizeof(*a));

    int k;
    for(k = 0; k < len; k++){
        a[k] = rand();
        b[k] = rand();
    }

    clock_t time0 = clock();

    for (int i = 0; i < 100000000; i++){
        for(k = 0; k < len; k++)
            c[k] = a[k] * b[k];
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
  • Processor: Intel Core i7 2600K @ 4.2 GHz
  • Compiler: Visual Studio 2012
  • Time: 6.55 seconds

In this test, it ran 25,600,000,000 iterations in 6.55 seconds.

  • 6.55 * 4.2 GHz = 27,510,000,000 cycles
  • 27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration

Now if you're wondering how it's possible to do:

  • 2 loads
  • 1 store
  • 1 multiply
  • increment counter
  • compare + branch

all in 1 cycle...

It's because modern processors and compilers are awesome.

While each of these operations has latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine has a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput only when using 256b AVX loads/stores, otherwise only two total memory ops per cycle (at most one store).)

Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.


Memory bandwidth is on the order of 10 GB/s:

The easiest way to test this is via memset():

#include <iostream>
#include <cstdlib>
#include <cstring>
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int len = 1 << 30;    //  1GB

    char *a = (char*)calloc(len,1);

    clock_t time0 = clock();

    for (int i = 0; i < 100; i++){
        memset(a,0xff,len);
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
  • Processor: Intel Core i7 2600K @ 4.2 GHz
  • Compiler: Visual Studio 2012
  • Time: 5.811 seconds

So it takes my machine 5.811 seconds to write 100 GB of memory. That's about 17.2 GB/s.

And my processor is on the higher end. The Nehalem and Core 2 generation processors have a lot less memory bandwidth.


Update March 2017:

As of 2017, things have gotten more complicated.

Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved - and there are more of them.

To put it mathematically:

  • Each core has a bandwidth limit X.
  • Main memory has a bandwidth limit of Y.
  • On older systems, X > Y.
  • On current high-end systems, X < Y. But X * (# of cores) > Y.

Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz

  • No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~17 GB/s
  • Vectorized SSE* (16-byte load/stores): X = 64 GB/s and Y = ~17 GB/s

Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz

  • No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~70 GB/s
  • Vectorized AVX* (32-byte load/stores): X = 64 GB/s and Y = ~70 GB/s

(For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)

So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that limit of X. But you will still hit the main memory bandwidth limit of Y with 2 or more threads.
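(As an illustration only - not part of the original answer - here is a minimal OpenMP sketch of how the question's loop could be spread across cores to approach Y. It assumes an OpenMP-capable compiler, e.g. icc -qopenmp or gcc -fopenmp; without that flag the pragma is simply ignored:)

#include <stdlib.h>

#define LEN 10000000

int main(){
    double *a = malloc(LEN * sizeof(*a));
    double *b = malloc(LEN * sizeof(*b));
    double *c = malloc(LEN * sizeof(*c));

    for (int k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    /* split the iteration space across threads; each thread streams
       through its own contiguous chunk, so together they can draw
       more bandwidth from main memory than a single core can */
    #pragma omp parallel for
    for (int k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];

    free(a); free(b); free(c);
    return 0;
}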

But one thing that hasn't changed, and probably won't change for a long time: you will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.

