sejin
Visitor
Registered: ‎02-24-2021

Inefficient pipelining when local memory is used

Hi, I am trying to optimize my kernel code. 

I tried to minimize global memory access by using local memory.

However, when I change the part that accesses global memory so that it accesses local memory instead, the pipelined loop ends up with an initiation interval (II) of 6 instead of 1.

With global memory access the II was 1, and switching that access to local memory is the only thing I changed.

Below is my original kernel code, which accesses the array out_r in global memory. You can focus on the innermost loop, VECTOR_LOOP.

extern "C" {
void K_VADD(const float *in1, // Read-Only Vector 1
	  float *out_r,     // Output Result
	  const unsigned int *emb_l,
	  const int *lS_o,
	  const int *lS_i,
	  const int num_sparse_features,
	  const int arch_sparse_feature_size,
	  const int sparse_offset_group_batch_size,
	  const int sparse_index_group_batch_size
) 
    {
#pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out_r offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=emb_l offset=slave bundle=gmem2
#pragma HLS INTERFACE m_axi port=lS_o offset=slave bundle=gmem3
#pragma HLS INTERFACE m_axi port=lS_i offset=slave bundle=gmem4
#pragma HLS INTERFACE s_axilite port=in1 bundle=control
#pragma HLS INTERFACE s_axilite port=out_r bundle=control
#pragma HLS INTERFACE s_axilite port=emb_l bundle=control
#pragma HLS INTERFACE s_axilite port=lS_o bundle=control
#pragma HLS INTERFACE s_axilite port=lS_i bundle=control
#pragma HLS INTERFACE s_axilite port=num_sparse_features bundle=control
#pragma HLS INTERFACE s_axilite port=arch_sparse_feature_size bundle=control
#pragma HLS INTERFACE s_axilite port=sparse_offset_group_batch_size bundle=control
#pragma HLS INTERFACE s_axilite port=sparse_index_group_batch_size bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle = control

        int *sparse_index_group_batch;
	    int *sparse_offset_group_batch;
	    unsigned int offset;
	    int begin, end;
	    int i, j, k, l;
	    int vector_num;
	    int out_r_size = sparse_offset_group_batch_size * arch_sparse_feature_size;
        unsigned int table_addr;
        int out_vec_addr;

	    // Initialize out_r to 0.0
        OUT_R_INIT: for (i = 0; i < out_r_size; i++)
        {
            #pragma HLS PIPELINE II=1
            out_r[i] = 0.0;
        }

        TABLE_LOOP: for (k = 0; k < num_sparse_features; k++)
	    {
            sparse_index_group_batch = (int *)lS_i + k * sparse_index_group_batch_size;
	        sparse_offset_group_batch = (int *)lS_o + k * sparse_offset_group_batch_size;
	        begin = sparse_offset_group_batch[0];
            table_addr = emb_l[k];

	        // Embedding lookup as much as batch size
            BATCH_LOOP: for (i = 0; i < sparse_offset_group_batch_size; i++)
            {
                if (i == sparse_offset_group_batch_size - 1)
		            end = sparse_index_group_batch_size;
		        else
		            end = sparse_offset_group_batch[i + 1];
                out_vec_addr = i * arch_sparse_feature_size;

                REQUEST_LOOP: for (j = begin; j < end; j++)
                {
                    //emb_lookup_cpp
                    vector_num = sparse_index_group_batch[j];
		            offset = table_addr + (unsigned int)vector_num * (unsigned int)arch_sparse_feature_size; 
		            //embedding.emb_read + sum
                    VECTOR_LOOP: for (l = 0; l < arch_sparse_feature_size; l++)
                    {
                        #pragma HLS PIPELINE II=1
                        out_r[out_vec_addr + l] += in1[offset + (unsigned int)l];
                    }
		        }
                begin = end;
	        }					
	    }
    }
}

Below is my new code, which differs only slightly from the code above. Some loops have changed, but they all achieve II=1 except for the innermost loop, VECTOR_LOOP. Note that it now accesses out_r_local instead of out_r. This is the only change to that loop, yet it results in II=6.

extern "C" {
void K_VADD(const float *in1, // Read-Only Vector 1
	  float *out_r,     // Output Result
	  const unsigned int *emb_l,
	  const int *lS_o,
	  const int *lS_i,
	  const int num_sparse_features,
	  const int arch_sparse_feature_size,
	  const int sparse_offset_group_batch_size,
	  const int sparse_index_group_batch_size
) 
    {
#pragma HLS INTERFACE m_axi port=in1 offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=out_r offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=emb_l offset=slave bundle=gmem2
#pragma HLS INTERFACE m_axi port=lS_o offset=slave bundle=gmem3
#pragma HLS INTERFACE m_axi port=lS_i offset=slave bundle=gmem4
#pragma HLS INTERFACE s_axilite port=in1 bundle=control
#pragma HLS INTERFACE s_axilite port=out_r bundle=control
#pragma HLS INTERFACE s_axilite port=emb_l bundle=control
#pragma HLS INTERFACE s_axilite port=lS_o bundle=control
#pragma HLS INTERFACE s_axilite port=lS_i bundle=control
#pragma HLS INTERFACE s_axilite port=num_sparse_features bundle=control
#pragma HLS INTERFACE s_axilite port=arch_sparse_feature_size bundle=control
#pragma HLS INTERFACE s_axilite port=sparse_offset_group_batch_size bundle=control
#pragma HLS INTERFACE s_axilite port=sparse_index_group_batch_size bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle = control

        int *sparse_index_group_batch;
	    int *sparse_offset_group_batch;
	    unsigned int offset;
	    int begin, end;
	    int i, j, k, l;
	    int vector_num;
	    int out_r_size = sparse_offset_group_batch_size * arch_sparse_feature_size;
        unsigned int table_addr;
        int out_vec_addr;
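        float out_r_local[MAX_SIZE]; // local buffer; MAX_SIZE must be a compile-time constant >= out_r_size (its definition is not shown in this post)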

	    // Initialize out_r to 0.0
        OUT_R_INIT: for (i = 0; i < out_r_size; i++)
        {
            #pragma HLS PIPELINE II=1
            out_r_local[i] = 0.0;
        }

        TABLE_LOOP: for (k = 0; k < num_sparse_features; k++)
	    {
            sparse_index_group_batch = (int *)lS_i + k * sparse_index_group_batch_size;
	        sparse_offset_group_batch = (int *)lS_o + k * sparse_offset_group_batch_size;
	        begin = sparse_offset_group_batch[0];
            table_addr = emb_l[k];

	        // Embedding lookup as much as batch size
            BATCH_LOOP: for (i = 0; i < sparse_offset_group_batch_size; i++)
            {
                if (i == sparse_offset_group_batch_size - 1)
		            end = sparse_index_group_batch_size;
		        else
		            end = sparse_offset_group_batch[i + 1];
                out_vec_addr = i * arch_sparse_feature_size;

                REQUEST_LOOP: for (j = begin; j < end; j++)
                {
                    //emb_lookup_cpp
                    vector_num = sparse_index_group_batch[j];
		            offset = table_addr + (unsigned int)vector_num * (unsigned int)arch_sparse_feature_size; 
		            //embedding.emb_read + sum
                    VECTOR_LOOP: for (l = 0; l < arch_sparse_feature_size; l++)
                    {
                        #pragma HLS PIPELINE II=1
                        out_r_local[out_vec_addr + l] += in1[offset + (unsigned int)l];
                    }
		        }
                begin = end;
	        }					
	    }
        WRITE_LOOP: for (i = 0; i < out_r_size; i++)
        {
            #pragma HLS PIPELINE II=1
            out_r[i] = out_r_local[i];
        }
    }
}

 

Could someone please tell me why accessing local memory in this case results in inefficient pipelining?

I would also like to get advice on how to improve the efficiency of the pipelining or the code itself.

2 Replies
yangc
Xilinx Employee
Registered: ‎02-27-2019

To be honest, I don't understand your modification. in1 comes from global memory, so I think using out_r_local only adds delay. Whether or not you use "local memory", it cannot reduce the global memory accesses.
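
One possible source of that delay, offered as an assumption and not confirmed anywhere in this thread: the statement in VECTOR_LOOP is a floating-point read-modify-write on an on-chip array, and the scheduler may conservatively assume that one iteration reads what the previous iteration wrote. If so, the II would be bounded by the latency of the read, the float add, and the write. The sketch below isolates the loop in a hypothetical helper, vector_accumulate, and adds a DEPENDENCE pragma stating that there is no inter-iteration dependence on out_r_local. That is only safe because each iteration of this one loop touches a different element, out_vec_addr + l. The exact pragma syntax differs between tool versions (newer releases also accept type=inter dependent=false), so check it against the documentation for your version.

```cpp
#include <cassert>

// Hypothetical helper that isolates VECTOR_LOOP from the kernel in the question.
// Assumption: within one call, iteration l only touches out_r_local[out_vec_addr + l],
// so no iteration of THIS loop reads a value written by another iteration of it.
void vector_accumulate(float *out_r_local, const float *in1,
                       int out_vec_addr, unsigned int offset,
                       int arch_sparse_feature_size)
{
    VECTOR_LOOP: for (int l = 0; l < arch_sparse_feature_size; l++)
    {
#pragma HLS PIPELINE II=1
// Tells the scheduler not to assume a loop-carried dependence on out_r_local.
// If the assumption above is wrong, this pragma produces incorrect hardware.
#pragma HLS DEPENDENCE variable=out_r_local inter false
        out_r_local[out_vec_addr + l] += in1[offset + (unsigned int)l];
    }
}
```

A plain C++ compiler ignores the HLS pragmas, so the helper can be checked functionally in software before synthesis. Whether the II actually drops to 1 has to be verified in the HLS schedule report.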

sejin
Visitor
Registered: ‎02-24-2021

in1 is in global memory, and out_r is also in global memory. As you mentioned, the number of accesses to in1 stays the same, but I wanted to reduce the number of accesses to out_r.

To give more information: in the first version, each element of the array out_r is accessed 26 times. In the second version I changed this so that out_r_local is accessed 26 times instead, and out_r is accessed only once per element, in WRITE_LOOP.
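
The access-count argument can be checked with a small software model. This is a minimal sketch, not the kernel itself: AccessCount, accumulate_in_global, and accumulate_in_local are hypothetical names, std::vector stands in for the global arrays, and each entry of rows stands in for one vector read from in1. It counts how often the "global" output is touched in each version.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical tally of how often the "global" output array is touched.
struct AccessCount {
    std::size_t reads = 0;
    std::size_t writes = 0;
};

// First version: every accumulation is a read-modify-write on the global array.
AccessCount accumulate_in_global(std::vector<float> &out_r,
                                 const std::vector<std::vector<float>> &rows)
{
    AccessCount count;
    for (float &v : out_r) {            // corresponds to OUT_R_INIT
        v = 0.0f;
        count.writes++;
    }
    for (const std::vector<float> &row : rows) {
        for (std::size_t l = 0; l < out_r.size(); l++) {
            out_r[l] += row[l];         // one global read and one global write
            count.reads++;
            count.writes++;
        }
    }
    return count;
}

// Second version: accumulate in a local buffer, write each element back once.
AccessCount accumulate_in_local(std::vector<float> &out_r,
                                const std::vector<std::vector<float>> &rows)
{
    AccessCount count;
    std::vector<float> out_r_local(out_r.size(), 0.0f);
    for (const std::vector<float> &row : rows) {
        for (std::size_t l = 0; l < out_r_local.size(); l++) {
            out_r_local[l] += row[l];   // local only, no global access
        }
    }
    for (std::size_t l = 0; l < out_r.size(); l++) {  // corresponds to WRITE_LOOP
        out_r[l] = out_r_local[l];
        count.writes++;
    }
    return count;
}
```

With 26 rows, the first function performs 26 read-modify-writes per output element plus the initial write, while the second performs a single write per element, and both leave the same values in out_r. The model says nothing about the II of the pipelined loop; it only illustrates the reduction in global accesses described above.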
