Explorer
Registered: ‎05-23-2017

Why does a long stall happen during data read?

Hi,

In my kernel there is a case where, when needed, the kernel reads 960*8-bit data from the DRAM attached to the FPGA into the kernel.

Below is the function that does the data read.

Port "feature_or" is the kernel port connected to DDR bank 2. It is 64*8 bits (512 bits) wide.

Port "feature_or_buffer" is the array that stores the data inside the kernel. Its length is 960*8 bits.

D_OR = 960

When a 960*8-bit data read happens, loop "read_feature_or0" reads 960/64 = 15 times through the "feature_or" port.

The data is first stored in "feature_or_buffer_temp".

When the 15 iterations are done, the data is loaded into "feature_or_buffer".

 

void read_feature_or(const D_point_512_64 *feature_or,Dtype_uint data_size, Dtype_uint index, Dtype_s *feature_or_buffer){
    if(index<data_size){
        D_point_512_64 feature_or_temp;
    #pragma HLS ARRAY_PARTITION variable = feature_or_temp.x complete dim=0
        Dtype_s_8 feature_or_buffer_temp[D_OR];
    #pragma HLS ARRAY_PARTITION variable = feature_or_buffer_temp complete dim=0
        ap_uint<10> loop_num = D_OR/64;// each 512-bit word holds 64 8-bit elements
read_feature_or0:for(ap_uint<10> j=0; j<loop_num; j++){
        #pragma HLS PIPELINE
            feature_or_temp = feature_or[index*loop_num+j];
            ap_uint<10> inter = j<<6;//j*64
read_feature_or1:for(ap_uint<10> i=0; i<64; i++){
            #pragma HLS unroll
                feature_or_buffer_temp[inter+i] = feature_or_temp.x[i];
            }
        }
read_feature_or2:for(ap_uint<10> i=0; i<D_OR; i++){
            feature_or_buffer[i] = feature_or_buffer_temp[i];
        }
    }
}

After running hardware emulation I got the waveform of the result.

From system_estimate_pcaf_fpga.hw.xilinx_vcu1525_dynamic_5_1.xtxt I can see this function's II is 15 cycles and the average case is 23 cycles.

(attached screenshot: 15.JPG)

1. How can I reduce the average case from 23 to 15?

2. In the waveform I can see a very long stall during loop "read_feature_or0".

I can see 15 read operations happen to fetch the 960*8-bit data through loop "read_feature_or0", but the 15 iterations are not contiguous.

If I run the kernel on the fabric in a system run, will this issue also happen?

How can I remove the stall?

(attached screenshots: 2.JPG, stallissue.JPG)

 

Xilinx Employee
Registered: ‎06-17-2008

The stall is caused by the DDR read overhead between burst operations. Since you chop the 960*8-bit read into 15 separate burst reads, there will be a gap between each of them. The HW run may give somewhat different results compared to HW emulation. We recommend using a single burst operation if possible to reduce this overhead, and there is plenty of methodology guidance in the SDAccel optimization UG. For II optimization, you can launch the HLS project and use the Analysis view to check the II violation. I think it may be related to the nested for-loop latency and the load/store operations on feature_or_temp.
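
For illustration, here is a minimal sketch of what such a single-burst read could look like, reusing the type and constant names from the code above (the helper name "read_feature_or_single_burst" is hypothetical and this has not been run against the actual design):

#include <string.h>

// Sketch only: D_point_512_64, D_OR and Dtype_uint are assumed to be defined
// as in the poster's kernel.
void read_feature_or_single_burst(const D_point_512_64 *feature_or,
                                  Dtype_uint index,
                                  D_point_512_64 feature_or_local[D_OR/64]){
    const int loop_num = D_OR/64; // 960/64 = 15 words of 512 bits
    // One memcpy over 15 consecutive 512-bit words lets the tool issue a
    // single burst request instead of 15 reads that each pay the DDR
    // transaction overhead.
    memcpy(feature_or_local, &feature_or[index*loop_num],
           loop_num*sizeof(D_point_512_64));
}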

Explorer
Registered: ‎05-23-2017

@yunl

Thanks very much!

I also tried separating the load and store of "feature_or_temp", i.e. reading the 960*8-bit data into "feature_or_temp" first and then assigning it to "feature_or_buffer", but I got the same result.

 

void read_feature_or(const D_point_512_64 *feature_or,Dtype_uint data_size, Dtype_uint index, Dtype_s *feature_or_buffer){
    if(index<data_size){
        ap_uint<10> loop_num = D_OR/64;// each 512-bit word holds 64 8-bit elements

        D_point_512_64 feature_or_temp[D_OR/64];
    #pragma HLS ARRAY_PARTITION variable = feature_or_temp complete dim=0
        Dtype_s feature_or_buffer_temp[D_OR];
    #pragma HLS ARRAY_PARTITION variable = feature_or_buffer_temp complete dim=0

read_feature_or0:for(ap_uint<10> j=0; j<loop_num; j++){
            #pragma HLS PIPELINE
            feature_or_temp[j] = feature_or[index*loop_num+j];
            }

            for(ap_uint<10> j=0; j<loop_num; j++){
            #pragma HLS PIPELINE
read_feature_or1:for(ap_uint<10> i=0; i<64; i++){
                #pragma HLS unroll
                    ap_uint<10> inter = j<<6;
                    //feature_or_buffer_temp[inter+i] = feature_or_temp[j].x[i];
                    feature_or_buffer[inter+i] = feature_or_temp[j].x[i];
                 }
                }
    }
}

 

Since a 960*8-bit read happens at random positions, the case where the first 960*8-bit read is immediately followed by the second one, and so on, will not occur.

So I care more about latency, i.e. after the read request is sent, how long it takes to get the 960*8-bit data back for use.

"Since you chop the 960*8-bit read into 15 separate burst reads, there will be a gap between each of them. The HW run may give somewhat different results compared to HW emulation. We recommend using a single burst operation if possible to reduce this overhead."

How do I achieve a single burst operation in my case?

 

 

 

Xilinx Employee
Registered: ‎06-17-2008

A simple for-loop or memcpy makes burst detection simpler. You may refer to the two examples below:

https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/kernel_to_gmem/wide_mem_rw_c (for-loop style)

https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/kernel_to_gmem/burst_rw_c (memcpy style)
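
As a reference, the core of the for-loop style from the first example boils down to something like the sketch below, adapted to the 512-bit struct used in this thread (the function and buffer names are hypothetical, and the snippet is only illustrative): the copy loop contains nothing but the wide read, which is the shape burst inference looks for.

// Sketch of the for-loop style from wide_mem_rw_c, using this thread's
// 512-bit struct; "read_block" and "local_buf" are hypothetical names.
void read_block(const D_point_512_64 *feature_or, int base,
                D_point_512_64 local_buf[D_OR/64]){
read_burst: for(int j = 0; j < D_OR/64; j++){
        #pragma HLS PIPELINE II=1
            // Only the wide read lives in this loop body; unpacking into
            // 8-bit elements happens in a separate loop afterwards.
            local_buf[j] = feature_or[base + j];
        }
}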

 

Explorer
Registered: ‎05-23-2017

@yunl

I think I used the same strategy as the for-loop example.

1.  Create a local vector array.

2.  Continuously read data into the vector array.

The only difference is that the port is a structure using "data_pack" for 512-bit access.

But I don't think this will affect the burst read, right?

Any suggestion is very welcome!

#pragma HLS INTERFACE m_axi port=feature_or_0  offset=slave bundle=gmem1 
#pragma HLS data_pack variable=feature_or_0
void read_feature_or(const D_point_512_64 *feature_or, Dtype_uint data_size, Dtype_uint index, Dtype_s *feature_or_buffer){
    if(index<data_size){
        ap_uint<10> loop_num = D_OR/64;// each 512-bit word holds 64 8-bit elements

        D_point_512_64 feature_or_temp[16];//Local memory to store vector
    #pragma HLS ARRAY_PARTITION variable = feature_or_temp complete dim=0

read_feature_or0:for(ap_uint<10> j=0; j<16; j++){ //continuously read
    #pragma HLS PIPELINE
            feature_or_temp[j] = feature_or[index*loop_num+j];
        }

        for(ap_uint<10> j=0; j<loop_num; j++){
    #pragma HLS PIPELINE
read_feature_or1:for(ap_uint<10> i=0; i<64; i++){
        #pragma HLS unroll
                ap_uint<10> inter = j<<6;
                feature_or_buffer[inter+i] = feature_or_temp[j].x[i];
            }
        }
    }
}
Explorer
Registered: ‎05-23-2017

Any help on this?

Xilinx Employee
Registered: ‎06-17-2008

Hi @mathmaxsean,

I think it may be due to the loop not being simple enough.

Below is the data read loop; the index into 'feature_or' depends on 'index', 'loop_num' and 'j'. 'loop_num' is a constant and 'j' is the loop index, so the tool can analyze them. However, 'index' is an input to the top function, so the tool may not know its pattern and therefore may not be able to infer a burst operation.

read_feature_or0:for(ap_uint<10> j=0; j<16; j++){ //continuously read
#pragma HLS PIPELINE
feature_or_temp[j] = feature_or[index*loop_num+j];
}
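
One thing that sometimes helps (shown only as a sketch, not verified on this design) is to fold the runtime offset into a pointer once, outside the copy loop, so that the address expression seen inside the loop is simply the base pointer plus the loop counter. This assumes the surrounding declarations from the earlier snippets:

        // Hypothetical rework of read_feature_or0: the variable part of the
        // address is computed once, outside the loop.
        ap_uint<10> loop_num = D_OR/64;
        const D_point_512_64 *src = feature_or + index*loop_num;

read_feature_or0:for(ap_uint<10> j=0; j<loop_num; j++){
        #pragma HLS PIPELINE
            feature_or_temp[j] = src[j];
        }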

Explorer
Registered: ‎05-23-2017

@yunl

Thanks very much for the feedback.

Yes. The "index" value comes from the previous function.

It's a random value between 0 and 1,000,000.

Is there a way to bypass this?

Thanks.

Xilinx Employee
Registered: ‎06-17-2008

Hi @mathmaxsean,

 

Can you try modifying the pragma as below and give it a try?

#pragma HLS INTERFACE m_axi port=feature_or_0  offset=slave bundle=gmem1 max_read_burst_length=256 max_write_burst_length=256
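
For completeness, a sketch of how that could sit with the other interface pragmas, using the port and bundle names from the earlier post (the s_axilite lines are the usual companions for an SDAccel C kernel and are shown here as an assumption, not taken from this thread):

// Widened burst setting alongside the existing m_axi/data_pack pragmas.
#pragma HLS INTERFACE m_axi port=feature_or_0 offset=slave bundle=gmem1 max_read_burst_length=256 max_write_burst_length=256
#pragma HLS data_pack variable=feature_or_0
// Assumed companion control-interface pragmas (typical for SDAccel C kernels).
#pragma HLS INTERFACE s_axilite port=feature_or_0 bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control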
