UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Visitor skotti
Visitor
474 Views
Registered: ‎06-18-2018

Long read requests within tiled matrix multiplication inside a kernel

Hello!

 

I have this piece of code. I tried to come up with tiled matrix multiplication for the kernel, as it is claimed to be efficient. However this example and also one in the sdaccel documentation related to tiling cause read requests for me.

 

 

__kernel void kernel0(__global double *A, __global double *B, __global double *tmp, double alpha)
{

    int b0 = get_group_id(0), b1 = get_group_id(1);
    int t0 = get_local_id(0), t1 = get_local_id(1);

    local double local_A[BLOCK_SIZE*BLOCK_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    local double local_B[BLOCK_SIZE*BLOCK_SIZE] __attribute__((xcl_array_partition(complete, 1)));
    local double local_C[BLOCK_SIZE*BLOCK_SIZE] __attribute__((xcl_array_partition(complete, 1)));

    for (int i0 = 0; i0 <=179; i0+=BLOCK_SIZE) {
    	for (int j0 = 0; j0 <= 189; j0+=BLOCK_SIZE) {
    		for (int k0 = 0; k0 <= 209; k0+=BLOCK_SIZE) {

    			__attribute__((opencl_unroll_hint))
    			for (int i1 = i0, i2 = 0; i2 < BLOCK_SIZE; i1++, i2++) {
    				for (int k1 = k0, k2 = 0; k2 < BLOCK_SIZE; k1++, k2++) {
    					local_A[i2 * BLOCK_SIZE + k2] = (i1 > 180  || k1 > 210) ? 0 : A[i1 * 210 + k1];
    				}
    			}

    			__attribute__((opencl_unroll_hint))
    			for (int k1 = k0, k2 = 0; k2 < BLOCK_SIZE; k1++, k2++) {
    				for (int j1 = j0, j2 = 0; j2 < BLOCK_SIZE; j1++, j2++) {
    					local_B[k2 * BLOCK_SIZE + j2] =  (k1 > 210  || j1 > 190) ? 0 : B[k1 * 190 + j1];
    				}
    			}


			  /*__attribute__((opencl_unroll_hint))
			  for (int k1 = 0; k1 < BLOCK_SIZE; k1++) {
				  for (int i1 = 0; i1 < BLOCK_SIZE; i1++) {
					  for (int j1 = 0; j1 < BLOCK_SIZE; j1++) {
						  int last = (k==0) ? 0 : localC[i1*BLOCK_SIZE+j];

						  // Update current sum
						  // Handle boundary conditions
						  int a_val = (i < a_row && k < a_col)? localA[i1*BLOCK_SIZE+k1] : 0;
						  int b_val = (k < b_row && j < b_col)? localB[k1*BLOCK_SIZE+j1] : 0;
						  int result = last + a_val*b_val;

						  // Write back results
						  localC[i1*BLOCK_SIZE+j1] = result;
					  }
				  }
			  }*/
    			__attribute__((opencl_unroll_hint))
    			for (int i1 = 0; i1 < BLOCK_SIZE; i1++) {
    				for (int j1 = 0; j1 < BLOCK_SIZE; j1++) {
    					double result = 0.0;
    					for (int k1 = 0; k1 < BLOCK_SIZE; k1++) {
    						result += ((alpha * local_A[i1 * BLOCK_SIZE + k1]) * local_B[k1 * BLOCK_SIZE + j1]);
    					}
    					local_C[i1*BLOCK_SIZE+j1] = result;
    					barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
    				}
    			}

    			__attribute__((opencl_unroll_hint))
    			for (int i1 = i0, i2 = 0; i2 < BLOCK_SIZE; i1++, i2++) {
    				for (int j1 = j0, j2 = 0; j2 < BLOCK_SIZE; j1++, j2++) {
    					if (i1 < 180  && j1 < 190) tmp[i1 * 190 + j1] = local_C[i2 * BLOCK_SIZE + j2];
    				}
    			}
    		}
    	}
    }

  /*  for (int c0 = 0; c0 <= 179; c0 += 1){
      for (int c1 = 0; c1 <= 189; c1 += 1) {
        for (int c2 = 0; c2 <= 209; c2 += 1) {
            result += ((alpha * local_A[c0 * 210 + c2]) * local_B[c2 * 190 + c1]);
        }
        local_C[c0*190+c1] = result;
        barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
      }
    }*/
}

 

How this code may be improved? I am not sure why these two cycled of copying information locally to local arrays cause read requests which increase the latency dramatically. How can I get rid of this read requests, maybe with help of separate functions? But if there are any dependencies functions will probably not solve this.

0 Kudos
2 Replies
Xilinx Employee
Xilinx Employee
461 Views
Registered: ‎01-12-2017

Re: Long read requests within tiled matrix multiplication inside a kernel

0 Kudos
Visitor skotti
Visitor
459 Views
Registered: ‎06-18-2018

Re: Long read requests within tiled matrix multiplication inside a kernel

@kalib, I have already looked at this one.

But it does not help me with my problem.

 

Actually I am not so bothered with the computation cycle how with copying data into local arrays.

 

In provided by you example it is done for a very small array but seems there will be no read requests for data as I also have no read requests when I for example performs a copy of global array into local memory at the beginning of the function.

 

But when I copy inside a cycle inside small blocks , then I actually have these read requests. However I don't understand why it happens.

 

Do you suggest to isolate the body of the innermost loop into completely separate function?

Or maybe copying should be performed in another way? I need to copy with these if statements as size of an array may be not divisible by BLOCK_SIZE.

0 Kudos