shared memory - CUDA volatile and __threadfence -


What is the difference between the following two functions?

__device__ inline void comparator_volatile(volatile float &a, volatile float &b, uint dir) {
    float t;
    if ((a > b) == dir) {
        t = a;
        a = b;
        b = t;
    }
}

__device__ inline void comparator(float &a, float &b, uint dir) {
    float t;
    if ((a > b) == dir) {
        t = a;
        a = b;
        b = t;
    }
    __threadfence();
}

Could anyone help me?

I implemented bitonic sort in several versions based on the CUDA SDK version. For the atomic version (bitonicSortAtomic), I tried to use __threadfence() in __syncblocks_atomic to maintain memory consistency. But it doesn't work (the output is incorrect). I have to call comparator_volatile instead of comparator to get the correct result. Any idea? The bitonicSort benchmark:

// (C) Copyright 2013, University of Illinois. All rights reserved.
#include <stdlib.h>
#include <stdio.h>
#include "parboil.h"

#define THREADS 256
#define BLOCKS 32
#define NUM_VALS 2*THREADS*BLOCKS

__device__ volatile int g_mutex = 0;

__device__ inline void __syncblocks_atomic(int goal) {
    __syncthreads();
//  __threadfence();
    int tx = threadIdx.x;
    if (tx == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex != goal) {}
    }
    __syncthreads();
}

__device__ inline void comparator(float &a, float &b, uint dir) {
    float t;
    if ((a > b) == dir) {
        t = a;
        a = b;
        b = t;
    }
}

__device__ inline void comparator_volatile(volatile float &a, volatile float &b, uint dir) {
    float t;
    if ((a > b) == dir) {
        t = a;
        a = b;
        b = t;
    }
}

#ifdef NAIVE
__global__ void bitonicSortNaive(float *src, int stride, int size) {
  unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
  uint dir = (tid & (size / 2)) == 0;
  unsigned int pos = 2*tid - (tid & (stride - 1));
  comparator(src[pos], src[pos+stride], dir);
}
#endif

#ifdef ATOMIC
__global__ void bitonicSortAtomic(float *src, int length) {
  uint numBlocks = gridDim.x * gridDim.y * gridDim.z;
  uint goalVal = 0;
  uint tid = threadIdx.x + blockDim.x * blockIdx.x;
  for (uint size = 2; size <= length; size <<= 1) {
    for (uint stride = size >> 1; stride > 0; stride = stride >> 1) {
      uint dir = (tid & (size / 2)) == 0;
      uint pos = 2*tid - (tid & (stride - 1));
      comparator_volatile(src[pos], src[pos+stride], dir);
      if (stride > THREADS || (stride == 1 && size >= THREADS)) {
        goalVal += numBlocks;
        __syncblocks_atomic(goalVal);
      }
      else
        __syncthreads();
    } // end stride
  } // end size
}
#endif

int main() {
  printf("[BENCH] Bitonic sort %d elements\n", NUM_VALS);
  printf("[BENCH] Xuhao Chen <cxh@illinois.edu>\n");
#ifdef NAIVE
  printf("[BENCH] Naive version\n");
#endif
#ifdef ATOMIC
  printf("[BENCH] Atomic barrier\n");
#endif
  float *values = (float*) malloc(NUM_VALS * sizeof(float));
  array_init(values, NUM_VALS);
  float *dev_values;
  size_t size = NUM_VALS * sizeof(float);
  cudaMalloc((void**) &dev_values, size);
  cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);
  dim3 blocks(BLOCKS, 1);
  dim3 threads(THREADS, 1);
  cudaDeviceSynchronize();

#ifdef NAIVE
  int j, k;
  for (k = 2; k <= NUM_VALS; k <<= 1) {
    for (j = k >> 1; j > 0; j = j >> 1) {
      bitonicSortNaive<<<blocks, threads>>>(dev_values, j, k);
    }
  }
#endif

#ifdef ATOMIC
  bitonicSortAtomic<<<blocks, threads>>>(dev_values, NUM_VALS);
#endif

  cudaDeviceSynchronize();
  cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
  cudaFree(dev_values);
  free(values);
}

The __syncblocks_atomic function implements a global barrier. Since there is inter-block communication, I have to keep the data consistent across blocks.
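For reference, this is the variant I tried, with the fence enabled: a minimal sketch of the classic lock-free global barrier (all names as in the benchmark above). Note this pattern assumes every block of the grid is resident on the GPU at the same time; otherwise the spinning block deadlocks waiting for blocks that were never scheduled.

```cuda
__device__ volatile int g_mutex = 0;

// Sketch of a fenced inter-block barrier: each block announces its
// arrival with an atomic increment, and block-local thread 0 spins
// until all blocks have arrived.
__device__ inline void __syncblocks_atomic(int goal) {
    __syncthreads();
    // Make this block's global-memory writes visible to other blocks
    // before announcing arrival at the barrier.
    __threadfence();
    if (threadIdx.x == 0) {
        atomicAdd((int *)&g_mutex, 1);
        while (g_mutex != goal) {}  // spin; g_mutex is volatile
    }
    __syncthreads();
}
```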

The CUDA Programming Guide states:

If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

This means that the memory is flushed as soon as you assign a value to the variable, and fetched directly from memory (with no cache) whenever you try to read its value.
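A minimal sketch of what this buys you (kernel and flag name are hypothetical, not from the benchmark): without volatile, the compiler may hoist the load out of the loop and spin on a stale register value forever.

```cuda
// With 'volatile', every iteration compiles to a real global-memory
// load, so the loop observes a write made by another thread.
__global__ void spin_until_set(volatile int *flag) {
    while (*flag == 0) {
        // re-reads *flag from memory each time around
    }
}
```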

In your first code sample, since both a and b are volatile, 6 actual memory instructions are generated: one read or write each time you use either a or b. The good point is that other threads will be able to see the modifications earlier, while they are being made. The downside is that the execution will be slower, because the caches are disabled.

In your second code sample, on the other side, the GPU is authorized to use caches to accelerate its execution, until the end of the function, when it is forced to issue a memory write. If both a and b are cached, only 2 memory writes are issued. The downside is that other threads might only be able to see the changed values after the fence.
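The count of 6 can be made concrete by annotating the volatile version; this is only a sketch of what the compiler is obliged to emit (the exact codegen is compiler- and architecture-dependent, but with volatile every use of a or b must become a real load or store):

```cuda
__device__ inline void comparator_volatile(volatile float &a,
                                           volatile float &b, uint dir) {
    float t;
    if ((a > b) == dir) {  // load a, load b     (2 loads)
        t = a;             // load a again       (1 load; t is a register)
        a = b;             // load b, store a    (1 load, 1 store)
        b = t;             // store b            (1 store)
    }                      // total: 4 loads + 2 stores = 6 memory instructions
}
```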

Another thing you should consider is that these operations are not atomic. If other threads try to access a and b while your function is executing, they might see a partial execution of the function, in both cases. In the second code sample, this is a bit less likely to happen, because the thread will probably use its cached values and flush the final values at once (anyway, you should not rely on this).

Also, volatile works as a faster version of __threadfence() among threads in the same warp (because threads in a warp act synchronously).
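The textbook instance of this warp-synchronous use of volatile is the final stage of the CUDA SDK reduction sample; a sketch (assuming a 64+-element shared array and pre-Volta lockstep warp execution; on Volta and later, independent thread scheduling makes this pattern unsafe without __syncwarp()):

```cuda
// Last 6 reduction steps within a single warp: 'volatile' forces each
// partial sum back to shared memory, so no __syncthreads() is needed
// between steps while the warp runs in lockstep.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
```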

