Participant gustavsvj
Registered: 02-25-2019

AXI Stream to Memory

Hello,

I'm fairly new to HLS development, so hopefully this is a fairly simple question.

I have a 64-bit-wide input AXI stream and a 512-bit-wide DDR4 interface, both running at 300 MHz. The goal is to store the data arriving on the stream in the DDR4 memory.
This functionality works, but I'm unhappy with the achieved transfer speed. The current code is structured as the following three functions:

  1. Buffer around 1KB of data
  2. Calculate memory address
  3. Write to memory using memcpy

These currently execute sequentially, according to both the analysis view and the co-simulation results. This means that the input stream is blocked while the data is written to memory, which I would really like to avoid.
In RTL I would implement this using two buffers that are written to and read from alternately. How can I achieve something similar (or better) in HLS?
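
For readers unfamiliar with the two-buffer (ping-pong) scheme mentioned here, the sketch below models it in plain C++. It is a sequential software model only, not HLS code: the names `PingPong`, `fill`, `swap`, `drain`, `run_ping_pong` and the bank size are hypothetical, and in hardware the fill and drain steps would overlap in time rather than alternate. As far as I know, Vivado HLS normally implements an array passed between two functions in a DATAFLOW region as this kind of ping-pong buffer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sequential model of a ping-pong (double) buffer. In hardware the producer
// fills one bank while the consumer drains the other; here the two steps are
// interleaved only to show the bank-swapping logic.
constexpr std::size_t kBankWords = 4;  // hypothetical bank size

struct PingPong {
    std::array<std::array<std::uint64_t, kBankWords>, 2> bank{};
    int writeBank = 0;  // bank currently owned by the producer

    // Producer side: fill the current write bank.
    void fill(const std::uint64_t *src) {
        for (std::size_t k = 0; k < kBankWords; ++k) bank[writeBank][k] = src[k];
    }
    // Hand the filled bank to the consumer and switch to the other one.
    int swap() {
        const int filled = writeBank;
        writeBank ^= 1;
        return filled;
    }
    // Consumer side: copy a filled bank out to "memory".
    void drain(int b, std::vector<std::uint64_t> &mem) const {
        for (std::size_t k = 0; k < kBankWords; ++k) mem.push_back(bank[b][k]);
    }
};

// Push the input (length assumed to be a multiple of kBankWords) through
// the two banks and return what arrived in "memory".
std::vector<std::uint64_t> run_ping_pong(const std::vector<std::uint64_t> &in) {
    PingPong pp;
    std::vector<std::uint64_t> mem;
    for (std::size_t off = 0; off + kBankWords <= in.size(); off += kBankWords) {
        pp.fill(&in[off]);
        const int filled = pp.swap();
        pp.drain(filled, mem);
    }
    return mem;
}
```

The point of the model is that the producer never has to wait for the consumer as long as a bank is drained faster than the other one is filled.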

Kind regards,
Gustav

 

Xilinx Employee
Registered: 09-04-2017

Re: AXI Stream to Memory

Did you try applying DATAFLOW?

Thanks,

Nithin

Participant gustavsvj
Registered: 02-25-2019

Re: AXI Stream to Memory

Dear Nithin,

I did try that. Unfortunately it doesn't have an effect. I'm guessing it's because the data buffer is used both when reading and writing data.

Kind regards,
Gustav

Xilinx Employee
Registered: 09-04-2017

Re: AXI Stream to Memory

Hi Gustav,

Can you share your code?

Thanks,

Nithin

Participant gustavsvj
Registered: 02-25-2019

Re: AXI Stream to Memory

memory_writer.cpp:

#include "memory_writer.hpp"
#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <string.h>

	void performWrite(ap_uint<512> *ddr, ap_uint<512> *data, ap_uint<32> address, ap_uint<16> size){
		memcpy(ddr + address, data, size);
	}

	void readData(AXI_STREAM &s_axis_data, ap_uint<512> *data, ap_uint<16> *size){

		for (ap_uint<8> j = 0; j < 24; j++){
			ap_uint<512> buff;
			AXI_T dataIn;

			for (ap_uint<4> i = 0; i < 8; i++){

	#pragma HLS PIPELINE II=1

				dataIn = s_axis_data.read();

				switch(i % 8){
				case 0:
					buff(63,0) = dataIn.data;
					break;

				case 1:
					buff(127,64) = dataIn.data;
					break;

				case 2:
					buff(191,128) = dataIn.data;
					break;

				case 3:
					buff(255,192) = dataIn.data;
					break;

				case 4:
					buff(319,256) = dataIn.data;
					break;

				case 5:
					buff(383,320) = dataIn.data;
					break;

				case 6:
					buff(447,384) = dataIn.data;
					break;

				case 7:
					buff(511,448) = dataIn.data;
					break;
				}

				if (dataIn.last){
					*size = (i + 1) * sizeof(ap_uint<64>) + j * 64;
					break;
				}

			}

			data[j] = buff;
			if (dataIn.last){
				break;
			}

		}

	}

	void addrCounter(ap_uint<32> *addr, ap_uint<16> size){
		static ap_uint<32> addrBuff = 0;
		static ap_uint<16> eventCount = 0;

		eventCount++;

		if (eventCount > 1024){
			*addr = 0;
			addrBuff = 0;
			eventCount = 0;
		}
		else{
			*addr = addrBuff;

			if (size % 64 != 0){
				addrBuff += (size / 64) + 1;
			}
			else{
				addrBuff += (size / 64);
			}
		}
	}





void memory_writer(AXI_STREAM &s_axis_data, ap_uint<512> *ddr){
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE m_axi port=ddr depth=1000 offset=none
#pragma HLS INTERFACE axis port=s_axis_data

#pragma HLS dataflow

	ap_uint<512> dataBuff[24];
	ap_uint<16> dataSize;
	ap_uint<32> addr;


	readData(s_axis_data, dataBuff, &dataSize);

	addrCounter(&addr, dataSize);

	performWrite(ddr, dataBuff, addr, dataSize);

}

memory_writer.hpp:

#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_int.h>


#define MAX_PKG_SIZE 1400
#define MEM_DEPTH ((1400 * 8 * 1024) / 512)


typedef ap_axiu <64,0,0,0> AXI_T;
typedef hls::stream<AXI_T> AXI_STREAM;


void memory_writer(AXI_STREAM &s_axis_data, ap_uint<512> *ddr);
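
A side note on `addrCounter` above: the `if (size % 64 != 0)` branch rounds the byte count up to whole 64-byte (512-bit) words. A minimal sketch of the same rounding as a single expression, in plain C++ rather than HLS types; the helper name `wordsFor` is hypothetical and not part of the posted code.

```cpp
#include <cstdint>

// Number of 64-byte (512-bit) words needed to hold `sizeBytes` bytes.
// Gives the same result as:
//   size % 64 != 0 ? size / 64 + 1 : size / 64
constexpr std::uint32_t wordsFor(std::uint32_t sizeBytes) {
    return (sizeBytes + 63u) / 64u;
}
```

For example, a full `MAX_PKG_SIZE` packet of 1400 bytes needs `wordsFor(1400)` words, which fits in the 24-word `dataBuff`.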
Xilinx Employee
Registered: 09-04-2017

Re: AXI Stream to Memory

Hi Gustav,

I see that DATAFLOW does have an effect.

If you remove DATAFLOW and rerun, you can see that the total latency is the sum of the latencies of the individual functions, while with DATAFLOW it is much lower.

Thanks,

Nithin

[Attachment: latency.jpg]

Participant gustavsvj
Registered: 02-25-2019

Re: AXI Stream to Memory

Dear Nithin,

You're of course right. What I meant was that it doesn't have enough of an effect.
With DATAFLOW enabled I get around 65% utilization of the input stream when testing the code on the KCU105. Judging by the data widths, it should be possible to reach 100% utilization of the input stream, as the output has around 8 times the bandwidth.

So my question is: how should I structure my HLS code so that the buffering operation can run in parallel with the write operation?
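
The "around 8 times the bandwidth" figure follows directly from the bus widths at the shared 300 MHz clock. A small worked sketch in plain C++, assuming peak interface rates only (it ignores DDR4 refresh and AXI protocol overhead); the function and constant names are hypothetical.

```cpp
#include <cstdint>

// Peak bandwidth in bytes per second of a bus that moves `widthBits`
// bits every cycle at `clockHz`. Ignores protocol and DDR4 overheads.
constexpr std::uint64_t peakBytesPerSec(std::uint64_t widthBits, std::uint64_t clockHz) {
    return widthBits / 8 * clockHz;
}

constexpr std::uint64_t kClockHz   = 300000000ULL;                    // 300 MHz, from the first post
constexpr std::uint64_t kStreamBps = peakBytesPerSec(64, kClockHz);   // 2.4 GB/s on the input stream
constexpr std::uint64_t kDdrBps    = peakBytesPerSec(512, kClockHz);  // 19.2 GB/s on the memory port
constexpr std::uint64_t kRatio     = kDdrBps / kStreamBps;            // 8
```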

Kind regards,
Gustav

Xilinx Employee
Registered: 08-17-2011

Re: AXI Stream to Memory

Hello @gustavsvj 

 

I'm guessing the 65% utilization you quote comes from:

1- the top function II=294, versus

2- the trip count of your read data function: 24 outer iterations and 8 inner iterations => the II should be close to 24*8=192

so 192/294 = 65% utilization.
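
The same arithmetic as a small plain C++ sketch, using only the two numbers quoted above; the constant and function names are hypothetical.

```cpp
// Ideal interval: one stream word per cycle for 24 * 8 words.
constexpr int kIdealII    = 24 * 8;  // 192 cycles
constexpr int kReportedII = 294;     // top-function II quoted above

// Fraction of cycles in which the input stream can accept a word.
constexpr double utilization(int idealII, int actualII) {
    return static_cast<double>(idealII) / actualII;
}
```

With these values `utilization(kIdealII, kReportedII)` is about 0.65, and it approaches 1.0 as the reported II approaches 192.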

 

I'm sure you see the answer now: you need to get the II of the read data function closer to the ideal value than what you have now. One way or another, depending on your coding style preference, you need to hoist the pipeline to the outer loop or manually merge the two loops.

 

Something like this... please check with your C TB that it is still correct.

 

void readData(AXI_STREAM &s_axis_data, ap_uint<512> *data, ap_uint<16> *size){

    ap_uint<512> buff;
    AXI_T dataIn;
readloop:
    for (int xx = 0; xx < 24*8; xx++) {
        #pragma HLS PIPELINE II=1
        dataIn = s_axis_data.read();
        int j = xx / 8;
        ap_uint<3> i = xx;

        switch(i){
            // on first iteration, you store and clear all of buff
            case 0: buff = dataIn.data; break;
            case 1: buff(127,64) = dataIn.data; break;
            case 2: buff(191,128) = dataIn.data; break;
            case 3: buff(255,192) = dataIn.data; break;
            case 4: buff(319,256) = dataIn.data; break;
            case 5: buff(383,320) = dataIn.data; break;
            case 6: buff(447,384) = dataIn.data; break;
            case 7: buff(511,448) = dataIn.data; break;
        }

        data[j] = buff;
        if (dataIn.last){
//            *size = (i + 1) * sizeof(ap_uint<64>) + j * 64;
//            don't do sizeof on classes!!
//            would have been better to have constants in the header
            *size = (i + 1) * 8 + (j) * 64;
            break;
        }
    } // readloop
}
- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
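
The index arithmetic of the merged loop in the reply above can be checked in plain C++ without the HLS headers. The sketch below is a software model under stated assumptions, not the posted HLS code: `Word512` stands in for `ap_uint<512>`, `Beat` for `AXI_T`, and `packBeats` for `readData`; all three names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software stand-in for ap_uint<512>: eight 64-bit lanes, lane 0 = bits 63..0.
using Word512 = std::array<std::uint64_t, 8>;

// One stream beat: 64 bits of data plus the TLAST flag.
struct Beat {
    std::uint64_t data;
    bool last;
};

constexpr std::size_t kMaxBeats = 24 * 8;  // same bound as the merged loop

// Model of the merged readloop: a single flat loop that derives the word
// index j and the lane index i from the flat counter xx.
// Returns the packet size in bytes; `data` must have room for 24 words.
std::uint16_t packBeats(const std::vector<Beat> &in, Word512 *data) {
    Word512 buff{};
    std::uint16_t size = 0;
    for (std::size_t xx = 0; xx < kMaxBeats && xx < in.size(); ++xx) {
        const std::size_t j = xx / 8;  // which 512-bit word
        const std::size_t i = xx % 8;  // which 64-bit lane (low 3 bits of xx)
        if (i == 0) buff = Word512{};  // first lane: clear the whole word
        buff[i] = in[xx].data;
        data[j] = buff;                // written on every beat, as in the HLS loop
        if (in[xx].last) {
            size = static_cast<std::uint16_t>((i + 1) * 8 + j * 64);
            break;
        }
    }
    return size;
}
```

A C testbench along these lines can feed a short packet and compare both the returned size and the packed words against expected values before running synthesis.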
Participant gustavsvj
Registered: 02-25-2019

Re: AXI Stream to Memory

Dear Hervé,

You're a hero! Your observation is totally accurate.
With your improvements the code achieves very close to 100% of the bandwidth.

Thank you very much!

Kind regards,
Gustav
