Registered: ‎10-11-2019

AXI burst transfer with variable start_address

Hello to everyone,

I am trying to accelerate an application and I am targeting the Alveo U200 card.
I am using version 2019.2 of the v++ compiler.

I have a buffer migrated to the global memory and I have a kernel that reads from this buffer. The kernel is invoked many times and the buffer stays the same between kernel invocations. 

I want to perform burst reads from global memory into the FPGA.
To use the full width of the bus, I read ap_uint<512> values.

The relevant parts of my code are as follows:

void kernel_load(const ap_uint<512> *array, const int _start, ... , 
                 const int _iterations, hls::stream<ap_uint<512>> &stream)
{
    int start = _start;
    int end = start + _iterations;
    for (int i = start; i < end; i++) {
       #pragma HLS PIPELINE II=1
        stream << array[i];
    }
}

void kernel(const ap_uint<512> *array, const int _start, const int _iterations, ...)
{
   #pragma HLS INTERFACE m_axi port = array offset = slave bundle = gmem \
                         max_read_burst_length = 64
   ...
   #pragma HLS INTERFACE  s_axilite port = array bundle = control
   ...
   #pragma HLS INTERFACE  s_axilite port = return bundle = control
   ...
      
    static hls::stream<ap_uint<512>> stream;
   #pragma HLS STREAM variable=stream depth=64
   /* I know that this depth can be further optimized, 
    * but I put it like that to make sure at the beginning
    */

   #pragma HLS DATAFLOW
    kernel_load(array, _start, ..., _iterations, stream);
    ...
}

 

The HLS report produced when compiling the kernel for hardware emulation did not show any issues, and the pipeline achieved II=1. When I ran the emulation, however, the timing was off. Checking the read channel of the gmem interface, I saw that every transaction had the anticipated size of 64 bytes but a burst length of 1, so in essence no burst transactions were happening, only single transfers. The stalls between transfers were quite long.
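To make the size/length distinction concrete, here is a minimal sketch of the arithmetic, assuming the viewer reports size in bytes per beat and length in beats. In the AXI read address channel these are encoded as ARSIZE (bytes per beat = 2^ARSIZE) and ARLEN (beats = ARLEN + 1). The helper names are hypothetical and only illustrate the calculation.

```cpp
#include <cstdint>

// Bytes carried by one beat: AXI encodes this as 2^AxSIZE.
constexpr std::uint32_t bytes_per_beat(std::uint32_t axsize) {
    return 1u << axsize;
}

// Number of beats in one burst: AXI encodes this as AxLEN + 1.
constexpr std::uint32_t beats_per_burst(std::uint32_t axlen) {
    return axlen + 1u;
}

// Total payload of one burst in bytes.
constexpr std::uint32_t bytes_per_burst(std::uint32_t axsize, std::uint32_t axlen) {
    return bytes_per_beat(axsize) * beats_per_burst(axlen);
}
```

With a 512-bit data bus (AxSIZE = 6), a length-1 transaction moves 64 bytes per address request, whereas a 64-beat burst would move 4096 bytes per request.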

After trying different things and changing bits of the code here and there, I noticed that the burst transactions worked as expected if the start variable was hard-coded to 0 in the source before compiling.

    ...
    int start = 0;
    int end = _iterations;
    for (int i = start; i < end; i++) {
       #pragma HLS PIPELINE II=1
        stream << array[i];
    }
    ...

 

I say "before compiling" because passing 0 at run time through the _start scalar argument still did not produce burst transfers.

From what I understood reading the AXI specification, my original code should not be a problem. I expected to be able to supply the start address at run time and still get the array values in bursts.
The specification states that no burst transaction may cross a 4 KB address boundary, so what I expected was perhaps an initial burst shorter than the 64-beat maximum (because the start address is not aligned to 4 KB), after which everything would be in order.
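The expectation above can be sketched in plain C++ (a software model, not HLS code and not what the tool actually generates). It assumes 64-byte beats, a 64-beat maximum burst, and a start address that is a multiple of the beat size; the `Burst` struct and `split_bursts` function are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One modelled AXI read burst: starting byte address and number of beats.
struct Burst {
    std::uint64_t addr;
    std::uint64_t beats;
};

// Split a read of `num_beats` beats into bursts that never cross a 4 KB
// boundary and never exceed `max_beats` beats.
// Assumption: `start_addr` is a multiple of `beat_bytes`, and `beat_bytes`
// divides 4096, so at least one beat always fits before the next boundary.
std::vector<Burst> split_bursts(std::uint64_t start_addr,
                                std::uint64_t num_beats,
                                std::uint64_t beat_bytes = 64,
                                std::uint64_t max_beats = 64) {
    const std::uint64_t boundary = 4096;
    std::vector<Burst> bursts;
    std::uint64_t addr = start_addr;
    while (num_beats > 0) {
        // Beats that still fit before the next 4 KB boundary.
        const std::uint64_t to_boundary = (boundary - addr % boundary) / beat_bytes;
        const std::uint64_t beats =
            std::min(num_beats, std::min(max_beats, to_boundary));
        bursts.push_back({addr, beats});
        addr += beats * beat_bytes;
        num_beats -= beats;
    }
    return bursts;
}
```

For example, reading 200 words starting at word index 10 (byte address 640) would give bursts of 54, 64, 64 and 18 beats under this model: only the first burst is shortened by the alignment, and the last one by the remaining count.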

What troubled me is that it still would not work even when the _start argument was 0 at run time (which should behave like the second code snippet, although of course the generated design is different in the two cases).

Currently, I have a working solution in which I read every value from the beginning of the buffer and discard all the values before start.

    ...  
    for (int i = 0; i < end; i++) {
       #pragma HLS PIPELINE II=1
        ap_uint<512> buf = array[i];
        if (i >= start) {
            stream << buf;
        }
    }
    ...

 

I am happy that I at least have something working, and I know my host code will rarely pass a _start argument other than 0 (so in practice I won't read many useless values), but I feel that this can be fixed properly.

I hope someone can help.

Greetings, 
Charalampos
