Participant

AXI Stream to Memory


Hello,

I'm fairly new to HLS development, so hopefully this is a simple question.

I have a 64-bit-wide input AXI stream and a 512-bit-wide DDR4 interface, both running at 300 MHz. The goal is to store the data arriving on the stream to the DDR4 memory.
This functionality is actually working, but I'm unhappy with the achieved transfer speed. The current code is structured into the following three functions:

  1. Buffer around 1KB of data
  2. Calculate memory address
  3. Write to memory using memcpy

These currently execute sequentially, according to both the analysis view and the co-simulation results. This means that the input stream is blocked while the data is written to memory, which I would really like to avoid.
In RTL I would implement this with two buffers that are written to and read from alternately. How can I achieve something similar (or better) in HLS?

Kind regards,
Gustav

 

8 Replies
Xilinx Employee

Did you try applying DATAFLOW?

Thanks,

Nithin

Participant

Dear Nithin,

I did try that. Unfortunately it doesn't have an effect. I'm guessing it's because the data buffer is used both when reading and writing data.

Kind regards,
Gustav

Xilinx Employee

Hi Gustav,

Can you share your code?

Thanks,

Nithin

Participant

memory_writer.cpp:

#include "memory_writer.hpp"
#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_int.h>
#include <string.h>

	void performWrite(ap_uint<512> *ddr, ap_uint<512> *data, ap_uint<32> address, ap_uint<16> size){
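		// copy `size` bytes from the on-chip buffer to DDR4, starting at 512-bit word offset `address`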
		memcpy(ddr + address, data, size);
	}

	void readData(AXI_STREAM &s_axis_data, ap_uint<512> *data, ap_uint<16> *size){
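		// pack eight 64-bit stream beats into each 512-bit word until TLAST, and record the packet size for the memcpy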

		for (ap_uint<8> j = 0; j < 24; j++){
			ap_uint<512> buff;
			AXI_T dataIn;

			for (ap_uint<4> i = 0; i < 8; i++){

	#pragma HLS PIPELINE II=1

				dataIn = s_axis_data.read();

				switch(i % 8){
				case 0:
					buff(63,0) = dataIn.data;
					break;

				case 1:
					buff(127,64) = dataIn.data;
					break;

				case 2:
					buff(191,128) = dataIn.data;
					break;

				case 3:
					buff(255,192) = dataIn.data;
					break;

				case 4:
					buff(319,256) = dataIn.data;
					break;

				case 5:
					buff(383,320) = dataIn.data;
					break;

				case 6:
					buff(447,384) = dataIn.data;
					break;

				case 7:
					buff(511,448) = dataIn.data;
					break;
				}

				if (dataIn.last){
					*size = (i + 1) * sizeof(ap_uint<64>) + j * 64;
					break;
				}

			}

			data[j] = buff;
			if (dataIn.last){
				break;
			}

		}

	}

	void addrCounter(ap_uint<32> *addr, ap_uint<16> size){
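		// advance the DDR write address by the packet size rounded up to whole 512-bit words; wrap back to 0 after 1024 packets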
		static ap_uint<32> addrBuff = 0;
		static ap_uint<16> eventCount = 0;

		eventCount++;

		if (eventCount > 1024){
			*addr = 0;
			addrBuff = 0;
			eventCount = 0;
		}
		else{
			*addr = addrBuff;

			if (size % 64 != 0){
				addrBuff += (size / 64) + 1;
			}
			else{
				addrBuff += (size / 64);
			}
		}
	}





void memory_writer(AXI_STREAM &s_axis_data, ap_uint<512> *ddr){
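// top level: read stream -> compute address -> write to DDR4; DATAFLOW is intended to let these stages overlap between packets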
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE m_axi port=ddr depth=1000 offset=none
#pragma HLS INTERFACE axis port=s_axis_data

#pragma HLS dataflow

	ap_uint<512> dataBuff[24];
	ap_uint<16> dataSize;
	ap_uint<32> addr;


	readData(s_axis_data, dataBuff, &dataSize);

	addrCounter(&addr, dataSize);

	performWrite(ddr, dataBuff, addr, dataSize);

}

memory_writer.hpp:

#include <hls_stream.h>
#include <ap_axi_sdata.h>
#include <ap_int.h>


#define MAX_PKG_SIZE 1400
#define MEM_DEPTH (1400 * 8 * 1024) / 512
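// = 22400 512-bit words: room for 1024 packets of MAX_PKG_SIZE (1400) bytes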


typedef ap_axiu <64,0,0,0> AXI_T;
typedef hls::stream<AXI_T> AXI_STREAM;


void memory_writer(AXI_STREAM &s_axis_data, ap_uint<512> *ddr);
Xilinx Employee

Hi Gustav,

  I see that DATAFLOW does have an effect. 

If you remove DATAFLOW and re-run, you can see that the total latency is the sum of the latencies of the individual functions, while with DATAFLOW it is much lower, since the functions can overlap and the interval between packets approaches that of the slowest function.

Thanks,

Nithin

(Attachment: latency.jpg)
Participant

Dear Nithin,

You're of course right. What I meant was that it doesn't have enough of an effect. 
With DATAFLOW enabled I get around 65% utilization of the input stream when testing the code on the KCU105. Looking at the data widths, it should be possible to reach 100% utilization of the input stream, as the output has around 8 times the bandwidth.
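(64 bits × 300 MHz ≈ 19.2 Gbit/s on the stream, versus 512 bits × 300 MHz ≈ 153.6 Gbit/s towards the DDR4 interface, so roughly an 8:1 ratio.)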

So my question is: how should I structure my HLS code so that the buffering operation can run in parallel with the write operation?

Kind regards,
Gustav

Xilinx Employee

Hello @gustavsvj 

 

The 65% utilization that you quote is, I'm guessing, coming from:

1- the top function II = 294, versus

2- the trip count of your readData function: 24 outer iterations and 8 inner iterations => the II should be close to 24*8 = 192.

So 192/294 ≈ 65% utilization.

I'm sure you see the answer now: you need to get the II of the readData function closer to that ideal value. One way or another, depending on your coding-style preference, you either hoist the PIPELINE pragma into the outer loop or manually merge the two loops.

Something like this... please check with your C testbench that it is still correct:

 

void readData(AXI_STREAM &s_axis_data, ap_uint<512> *data, ap_uint<16> *size){

    ap_uint<512> buff;
    AXI_T dataIn;
readloop:
    for (int xx = 0; xx < 24*8; xx++) {
        #pragma HLS PIPELINE II=1
        dataIn = s_axis_data.read();
        int j = xx / 8;
        ap_uint<3> i = xx;
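        // j is the 512-bit word index; i wraps 0..7 (effectively xx % 8) and selects the 64-bit lane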

        switch(i){
            // on first iteration, you store and clear all of buff
            case 0: buff = dataIn.data; break;
            case 1: buff(127,64) = dataIn.data; break;
            case 2: buff(191,128) = dataIn.data; break;
            case 3: buff(255,192) = dataIn.data; break;
            case 4: buff(319,256) = dataIn.data; break;
            case 5: buff(383,320) = dataIn.data; break;
            case 6: buff(447,384) = dataIn.data; break;
            case 7: buff(511,448) = dataIn.data; break;
        }

        data[j] = buff;
        if (dataIn.last){
            // *size = (i + 1) * sizeof(ap_uint<64>) + j * 64;
            // don't do sizeof on classes!!
            // would have been better to have constants in the header
            *size = (i + 1) * 8 + j * 64;
            break;
        }
    } // readloop
}
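As a sketch (names are just placeholders), those byte counts could be defined once in the header rather than appearing as magic numbers:

    // hypothetical constants for memory_writer.hpp
    static const int BYTES_PER_BEAT = 8;   // one 64-bit stream beat
    static const int BYTES_PER_WORD = 64;  // one 512-bit memory word

    // inside readloop, the size update would then read:
    // *size = (i + 1) * BYTES_PER_BEAT + j * BYTES_PER_WORD;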
- Hervé


Participant

Dear Hervé,

You're a hero! Your observation is totally accurate.
With your improvements the code now delivers very close to 100% of the bandwidth.

Thank you very much!

Kind regards,
Gustav
