
How to optimize axi master procedure?

Visitor
Posts: 7
Registered: ‎06-19-2017


Hello,

 

I want to stream data from a FIFO to DDR3 memory. I am using a ZC702 board.

 

Here is my code.

 

 

 

#include "Img2MM.h"

#define C_BURST_N 4

void Img2MM(volatile unsigned int *oAXI, volatile unsigned int *iData,
                        uint6 iC, uint8 iR, uint9 iLineOffset) {
#pragma HLS DATAFLOW
#pragma HLS INTERFACE m_axi port=oAXI offset=direct
#pragma HLS INTERFACE ap_fifo port=iData
#pragma HLS INTERFACE ap_none port=iC
#pragma HLS INTERFACE ap_none port=iR
#pragma HLS INTERFACE ap_none port=iLineOffset


uint8 g;
uint6 h;
uint3 i;
unsigned int buff[C_BURST_N];

 

    L2: for (g=0; g <= (iR - 1); g++){

        L1: for (h=0; h <= (iC - 1); h++){
#pragma HLS PIPELINE enable_flush rewind

 

            L0_0: for (i=0; i <= (C_BURST_N - 1); i++){
#pragma HLS UNROLL skip_exit_check
                buff[i] = iData[(g * (iC * C_BURST_N)) + (h * C_BURST_N) + i];
            }

 

            F0_0: {
                memcpy((unsigned int *)(oAXI + (g * ((C_BURST_N * iC) + iLineOffset)) + (h * C_BURST_N)),

                                buff, (C_BURST_N * sizeof(unsigned int)) );
            }

 

        }
    }
}

 

This code works, but there is a gap of around 20 clock cycles between the end of one burst and the start of the next.

A burst itself usually takes 5 clock cycles, so 20 cycles seems far too long to wait before the next burst.

Because of this, the FIFO overflows.

How can I improve my code above so that it streams data from the FIFO IP to DDR3 memory fast enough?

 

Thank you.

 

 

 

Contributor
Posts: 48
Registered: ‎06-02-2015

Re: How to optimize axi master procedure?


DATAFLOW-ing the function Img2MM doesn't do much if this is all of your real code, since you only have one big nested loop...

Can't you increase the FIFO depth and do longer bursts? Or use a bus wider than 32 bits? (In SDAccel we use a 512-bit bus...)

And I wonder what II (initiation interval) you got for your L1, since you have a memcpy call inside it, which itself behaves like a for-loop with II=1...
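
As a rough back-of-the-envelope check on why longer bursts help, the sketch below models the average transfer rate from the numbers reported in the question (a 4-word burst taking about 5 clocks, followed by a gap of about 20 clocks). The helper name `words_per_cycle` is hypothetical, and treating the gap as a fixed per-burst cost is an assumption, not something measured in this thread.

```cpp
#include <cassert>

// Hypothetical helper: average words moved per clock when each burst carries
// `burst_words` words, occupies the bus for `burst_cycles` clocks, and is
// followed by `gap_cycles` idle clocks before the next burst starts.
// Assumption: the gap is a fixed cost that does not grow with burst length.
double words_per_cycle(int burst_words, int burst_cycles, int gap_cycles) {
    assert(burst_words > 0 && burst_cycles > 0 && gap_cycles >= 0);
    return static_cast<double>(burst_words) / (burst_cycles + gap_cycles);
}
```

With the reported numbers, `words_per_cycle(4, 5, 20)` is 4/25 = 0.16 words per clock, so a FIFO filled faster than that will eventually overflow. If a 16-word burst took 17 clocks (an extrapolation from the 5 clocks seen for 4 words) with the same gap, the rate would be 16/37, roughly 0.43 words per clock.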

Visitor
Posts: 7
Registered: ‎06-19-2017

Re: How to optimize axi master procedure?

Yes, this is the real code.
I was thinking about a deeper FIFO, but the maximum iC and iR are 35 and 128, so the FIFO depth would have to be much larger to solve this problem.
Longer bursts? I would like to, but I cannot use them in this application.
Using a wider bus might be a possible option now.
I just found that it takes 11 clocks to get BVALID after WVALID.
BVALID is high for 1 clock, and the next WVALID comes 3 clocks after that.
Any ideas? I don't understand why those 11 clocks and 3 clocks are needed...
Contributor
Posts: 48
Registered: ‎06-02-2015

Re: How to optimize axi master procedure?

Again, what II value did you get for your L1?
If it's a high value, I think what you can do is:
1) pipeline L0_0 with II=1 instead of unrolling it,
2) use an explicit for loop instead of memcpy, then you can
3) use DATAFLOW inside L1, which allows your FIFO read and bus burst write to happen concurrently. Currently, in your code, the FIFO read won't start until the previous bus write completes...
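
The three steps above could be sketched roughly as follows. The function names `read_burst`, `write_burst` and `Img2MM_sketch` are hypothetical, plain `unsigned` stands in for the `uint6`/`uint8`/`uint9` port types, and the code has only been reasoned about as C++, not synthesized, so whether a given HLS version accepts DATAFLOW in this loop body (and pointer offsets on an `ap_fifo` port) is an assumption to verify in the log.

```cpp
#define C_BURST_N 4

// Stage 1: pull one burst worth of words out of the FIFO (explicit loop, II=1).
static void read_burst(const unsigned int *iData, unsigned int buff[C_BURST_N]) {
    for (int i = 0; i < C_BURST_N; i++) {
#pragma HLS PIPELINE II=1
        buff[i] = iData[i];
    }
}

// Stage 2: write the buffered words to memory (explicit loop instead of memcpy).
static void write_burst(unsigned int *oAXI, const unsigned int buff[C_BURST_N]) {
    for (int i = 0; i < C_BURST_N; i++) {
#pragma HLS PIPELINE II=1
        oAXI[i] = buff[i];
    }
}

void Img2MM_sketch(unsigned int *oAXI, const unsigned int *iData,
                   unsigned iC, unsigned iR, unsigned iLineOffset) {
#pragma HLS INTERFACE m_axi port=oAXI offset=direct
#pragma HLS INTERFACE ap_fifo port=iData
    for (unsigned g = 0; g < iR; g++) {          // L2 in the original code
        for (unsigned h = 0; h < iC; h++) {      // L1 in the original code
#pragma HLS DATAFLOW
            unsigned int buff[C_BURST_N];
            // Source is read in order; destination skips iLineOffset words per row.
            read_burst(iData + (g * iC + h) * C_BURST_N, buff);
            write_burst(oAXI + g * (C_BURST_N * iC + iLineOffset) + h * C_BURST_N, buff);
        }
    }
}
```

Splitting the read and the write into separate functions gives DATAFLOW two tasks it can overlap, with `buff` as the channel between them.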
Visitor
Posts: 7
Registered: ‎06-19-2017

Re: How to optimize axi master procedure?

Thanks sammhho,
But how can I find out the II (initiation interval) of L1?
Visitor
Posts: 7
Registered: ‎06-19-2017

Re: How to optimize axi master procedure?

And is it possible to add DATAFLOW inside L1?
I cannot find a DATAFLOW option in the Directive option list.
Contributor
Posts: 48
Registered: ‎06-02-2015

Re: How to optimize axi master procedure?

In the synthesis report, look at the "Loop" table. Under "Initiation Interval" there are two sub-columns, "achieved" and "target". For loops with a target of II=1 that HLS can't schedule, the achieved number is shown in red.

I don't know from which version HLS has supported loop-level DATAFLOW, but at least 2016.1 and later do.
If you can't find it in the menu, you can try just putting the pragma line inside L1. Look at the log and you'll see whether HLS actually tried to apply DATAFLOW to it.
Scholar
Posts: 1,941
Registered: ‎04-26-2015

Re: How to optimize axi master procedure?


Some code:

 

 

#include <stdio.h>

#include "ap_int.h"

typedef ap_uint<6> uint6;
typedef ap_uint<3> uint3;
typedef ap_uint<8> uint8;
typedef ap_uint<9> uint9;

#define C_BURST_N 4

void Img2MM(unsigned int *oAXI, unsigned int *iData,
                        uint6 iC, uint8 iR, uint9 iLineOffset) {

#pragma HLS INTERFACE m_axi port=oAXI offset=direct
#pragma HLS INTERFACE ap_fifo port=iData
#pragma HLS INTERFACE ap_none port=iC
#pragma HLS INTERFACE ap_none port=iR
#pragma HLS INTERFACE ap_none port=iLineOffset


uint8 g;
uint6 h;
uint3 i;
    L2: for (g=0; g <= (iR - 1); g++){

        L1: for (h=0; h <= (iC - 1); h++){

            L0_0: for (i=0; i <= (C_BURST_N - 1); i++){
#pragma HLS PIPELINE II=1
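                // Destination index: each row is followed by iLineOffset skipped words.
                int writeOffset = g*((C_BURST_N * iC) + iLineOffset) + h*C_BURST_N;
                // Source index: the FIFO is read strictly in order, with no line offset.
                int readOffset = g*(C_BURST_N * iC) + h*C_BURST_N;
                unsigned int buff = iData[readOffset + i];
                oAXI[writeOffset + i] = buff;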
            }
        }
    }
}

Note that I've converted it to be C++ friendly for HLS, just by including ap_int.h and typedef'ing the required ap_uint types. I would recommend using C++ for HLS designs, since that gives you access to all the nifty fixed-point types and math operations. As shown above, the conversion process is not exactly difficult.

 

 

 

With this approach, HLS is able to flatten the loop nest into a single big loop, and then do a variable-length burst. This will have far lower overhead than the original 4-element burst, and it means you don't have to buffer four elements too.
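
To illustrate what flattening means here, this is a hand-flattened equivalent of the loop nest: one counter walks the FIFO words in order while a second index tracks the destination and jumps by the line offset at the end of each row. HLS would normally do this transformation itself; the name `Img2MM_flat` and the use of plain `unsigned` for the port types are assumptions made for this sketch, which has not been synthesized.

```cpp
#define C_BURST_N 4

// Hypothetical hand-flattened version of the L2/L1/L0_0 loop nest.
void Img2MM_flat(unsigned int *oAXI, const unsigned int *iData,
                 unsigned iC, unsigned iR, unsigned iLineOffset) {
#pragma HLS INTERFACE m_axi port=oAXI offset=direct
#pragma HLS INTERFACE ap_fifo port=iData
    const unsigned rowWords = C_BURST_N * iC;   // words written per image row
    unsigned col = 0;                           // word position inside the current row
    unsigned dst = 0;                           // destination word index in memory
    for (unsigned n = 0; n < rowWords * iR; n++) {
#pragma HLS PIPELINE II=1
        oAXI[dst] = iData[n];                   // FIFO words are consumed in order
        if (col == rowWords - 1) {              // end of row: skip the line offset
            col = 0;
            dst += iLineOffset + 1;
        } else {
            col++;
            dst++;
        }
    }
}
```

When `iLineOffset` is zero the destination addresses are fully contiguous, which is the case where a single long burst is possible.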

 

This should work as long as the FIFO does have some idle time. Doesn't have to be much, but if the FIFO is receiving one new element on every single clock cycle and this block reads slightly less than one element per cycle on average then eventually you'll encounter an overflow. If that is likely to be an issue, then you need to make the interfaces wider. If you can safely assume that iR*iC is a multiple of two, then you can convert the whole block to 64-bit. You'll have to convert the FIFO interface to 64-bit too. This would allow it to process data at two elements per cycle, so even if the FIFO is getting one new element every cycle the HLS block will have no trouble keeping up.
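
A minimal sketch of the 64-bit idea, assuming the FIFO delivers two 32-bit elements per entry with the first element in the low half, that `iLineOffset` is even, and that the destination base address is 8-byte aligned. None of these assumptions come from the thread; `Img2MM_64` and `pack2` are hypothetical names, and `std::uint64_t` stands in for whatever 64-bit type the design would actually use.

```cpp
#include <cassert>
#include <cstdint>

#define C_BURST_N 4

// Hypothetical helper: pack two 32-bit elements into one 64-bit word,
// first element in the low half (assumed ordering).
static inline std::uint64_t pack2(std::uint32_t lo, std::uint32_t hi) {
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}

// Hypothetical 64-bit variant: every FIFO entry and every AXI beat carries
// two 32-bit elements, so one loop iteration moves two elements.
void Img2MM_64(std::uint64_t *oAXI, const std::uint64_t *iData,
               unsigned iC, unsigned iR, unsigned iLineOffset) {
#pragma HLS INTERFACE m_axi port=oAXI offset=direct
#pragma HLS INTERFACE ap_fifo port=iData
    assert(iLineOffset % 2 == 0);                        // offset must be a whole number of 64-bit words
    const unsigned rowWords64 = (C_BURST_N * iC) / 2;    // 64-bit words per row
    const unsigned rowStride64 = rowWords64 + iLineOffset / 2;
    for (unsigned g = 0; g < iR; g++) {
        for (unsigned k = 0; k < rowWords64; k++) {
#pragma HLS PIPELINE II=1
            oAXI[g * rowStride64 + k] = iData[g * rowWords64 + k];
        }
    }
}
```

Because a row always holds `C_BURST_N * iC` elements, the row length is even here; the line offset is the part that has to be checked.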

Visitor
Posts: 7
Registered: ‎06-19-2017

Re: How to optimize axi master procedure?

Thanks u4223374 for the detailed explanation.
Let me test these one by one.
Visitor
Posts: 7
Registered: ‎06-19-2017

Re: How to optimize axi master procedure?

Thanks sammhho,
I can see the loop table now.
Yes, the target is 1 and the achieved value is 4, shown in red.
I use 2015.2, not 2016.x, but let me try what you suggested.
Thanks