Participant jpan127
Registered: 05-28-2017

How would I implement an output buffer that signals whenever it is ready?


Hi everyone, I've gained a little more experience since my last post, but I'm still somewhat confused about the techniques used in HLS.  I am trying to implement an LZW decompressor and add it to the base overlay of the PYNQ board.

 

My current function has an input char array[150000] and an output char array[250000].  The size of these arrays, plus a large hashmap inside the function, has led to extremely high BRAM usage.  Therefore, I was hoping to use a strategy another user, muzaffer, came up with in my last post (however, my problem is a little different this time).

 

I would like to have an output array of, let's say, [10000] and fill it up one char at a time.  When the array is full, set a flag (a bool) to true and wait for the output array to be read by Python in Jupyter (where I will be calling the function from).  Once the output array has been extracted, Python sets another flag, WE, that tells my function to resume running.

 

void decode (char in[ENC_SIZE], char out[DEC_SIZE], int &in_size, int &out_size, bool &OE, bool &done, bool WE, bool &error)
{
	// in_size: 	size of input array, can be smaller than ENC_SIZE
	// out_size: 	size of output array that is filled, can be filled partially
	// OE: 			output flag, output enable to say the output array is ready
	// done: 		output flag to say the output array is the last output
	// WE: 			input flag to tell the decompressor to keep running
	// error: 		output flag if there is an issue and needs to abort and the output is incorrect

	int in_index = 0;
	out_size = 0;

	OE 		= false;
	done 	= false;
	error 	= false;

	char c, d;

	for (int i=0; i<150000; i++)
	{
		/* get input, can be between 1-5 times */
		c = in[in_index++];

		/* do some calculations, search hashmap, get d */

		/* send output, can be between 1-20 times */
		out[out_size++] = d;

		if (out_size == DEC_SIZE) {
			OE = true;
			while (WE == false);
			WE = false;
			OE = false;
			out_size = 0;
		}
	}
}

In each main-loop iteration, the number of reads and writes is variable: sometimes multiple reads, sometimes one; sometimes multiple writes, sometimes one.

 

Main question: Is this feasible?  Am I using the right approach?  Should I instead be using some AXI stream interface (which I am unfamiliar with)?

 

I can roughly picture this working from Python: the function busy-waits in the while loop; from Python I set the memory address of bool WE to 1; the while loop breaks; and the function resets WE to 0.  However, I am not sure how to test this with a testbench, because in the testbench I call the function once, passing variables to it.

 

Thanks!

33 Replies
Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


A few ideas:

 

(1) Put the buffers into main RAM instead of block RAM. This is really simple to set up: you just set the interface mode (with #pragma HLS INTERFACE) to m_axi, and connect the newly-created AXI Master to the Zynq PS HP AXI Slave ports. The buffer can now be hundreds of MB without causing any issues - you just have to make sure the CPU doesn't try to use that space at the same time.

 

(2) Skip the OE/done/WE flags, because they're virtually impossible to simulate (they can only be simulated in a multithreaded system, and HLS doesn't do multithreading). Instead, when the buffer is full, store the internal state (eg. what number it was up to, maybe an internal accumulator, etc) and return. This will trigger an interrupt (which can get the CPU to stop whatever it's doing and have a look at the results) and also set the block's ap_done bit. When you want it to start again, just set the block's ap_start bit. The block should then restore its state from the stored values and continue where it left off. This (a) avoids partially duplicating the functions of the standard HLS control signals, and (b) means that you can actually simulate it just by calling the function several times.

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?

Wow, you are fast; thank you so much.

(1) Is there an article I can read on this? I find information about VHLS difficult to search for. I have looked at UG902, but I may have missed a few things.

I have created one overlay before, following the directions of a Xilinx-written lab assignment, which said to give my function and all my inputs/outputs [HLS INTERFACE s_axilite port=<portname> bundle=<samebundlename>], so I have been doing that. Is that still necessary? If it is, do I need to remove it to use the m_axi INTERFACE? Also, how would I connect this AXI Master to the Zynq ports? Is this done in VHLS or in Vivado?

(2) This sounds interesting. However, my function is more complex than the version I gave above. If the buffer is full at one point and the function returns from that point (ex: line 150), and I set the ap_start bit, will it return to line 151? There is a while loop inside the main loop that empties a stack onto the output buffer. The buffer can fill halfway through emptying the stack.

The reason I'm not sure I can call the function multiple times in a C simulation is that it populates a hashmap as it runs through the code. Recalling the function and starting at a different point in the input array would lose the hash entries. Sorry if I misunderstood you; I am still very new to all this.
Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


(1) You change the interface in HLS. The exact pragma would be:

 

#pragma HLS INTERFACE m_axi port=<portname> offset=slave

You'd have to remove the AXI-Lite pragma.

 

Once that's done, in Vivado you'll see an extra AXI Master port on the block. That port connects to an AXI Interconnect, and the AXI Interconnect connects to one of the HP Slave ports on the PS. I'll try to get some screenshots of this tomorrow.

 

(2) It can't return to that point automatically. You'd have to keep a static variable that tracks what point in the code it was up to, so that when you start it again it can find the same point. Similarly, you'd have to store the hashmap so that it could be restored when the function is restarted.

 

 

It occurs to me that if the inputs/outputs are fully sequential, just using an AXI4 Stream interface might work better. That includes automatic flow control - if the input has no more data available, or the output has no more space for data, the block gets paused until they're ready to continue.

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?

(1) Thank you, some screenshots would be helpful.

(2a) How is it possible to return to a line number in both C-Sim and the overlay? Is there a way to store the program counter? I would assume that is abstracted away and not accessible.

(2b) That sounds promising too. For the input, how would it know not to block when the input is done? For the output, how would it know to continue if the output is full (is there some way to empty it)? Do you have any sample AXI4 code? Most of the examples I see are too simple, with single parallel reads/writes, and I'm not sure the same approach can be applied to my case.

(3) If you don't mind me asking, do you work in this field? What kind of job do you have, that you know so much (seeing you have a lot of posts and answer a lot of questions on this board)?
Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


(1) Here's a quick screenshot of how you'd connect a block with a single AXI Master to a Zynq. If the block had two AXI Masters, you'd just set the AXI Interconnect to have two slave ports and connect both AXI Masters to that.

 

Vivado_BD.jpg

 

(2a) You can't return a line number as such - but I assume that there are only a few places in your code where it writes to the buffer. If it only stops when the buffer is full, then those few places are the only ones where it'll have to restart from. Depending on which one causes it to stop, you can apply a different value to a static variable - and then when restarting the block use that to figure out where you were. Here's a short example that XORs the input data with a rotating 64-bit value, stops when the buffers are full/empty, and continues from where it finished last time:

 

#include <ap_int.h>

#define INITIAL_VALUE 0x123456789ABCDEFLL

typedef ap_uint<16> data_type;
void processBlockAXIMaster(data_type * dataIn, data_type * dataOut, ap_uint<32> bufferLength, ap_uint<32> dataLength) {
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=bufferLength
#pragma HLS INTERFACE s_axilite port=dataLength
#pragma HLS INTERFACE m_axi port=dataIn offset=slave
#pragma HLS INTERFACE m_axi port=dataOut offset=slave


    static bool continuing = false;
    static ap_uint<32> dataLengthRemaining;
    static ap_uint<64> currentXOR;

    if (!continuing) {
        // Load initial settings.
        dataLengthRemaining = dataLength;
        currentXOR = INITIAL_VALUE;
    }

    ap_uint<32> index = 0;
    while (1) {
        if ((index < bufferLength) && (dataLengthRemaining > 0)) {
            // Still under the maximum buffer size and still within the data range.
            data_type data = dataIn[index];
            data ^= currentXOR(15,0);
            currentXOR = ( currentXOR(62,0) , currentXOR(63));
            dataOut[index] = data;
            dataLengthRemaining--;
            index++;
        } else {
            if (index >= bufferLength) {
                // We won't be able to read any more from the buffer, so stop here.
                // Set "continuing" to indicate that we'll be continuing with the same state.
                continuing = true;
            } else {
                // No more data, mark that we'll be resetting the block next time.
                continuing = false;
            }
            break;
        }
    }
}

In this case, if the block runs out of space in the buffer, it returns - but the state (data length remaining and the current XOR value) is saved, along with a flag to remind it whether it stopped in the middle of processing the data or not. If it did, it continues where it left off; otherwise it resets to initial conditions.

 

(2b) Normally with an AXI Stream, there's a side-channel in addition to the main data. In that channel there's a bit called TLAST, which is used to indicate the final element of the data. Normally your block would detect TLAST and stop processing at that point. Here's a very simple bit of code to demonstrate TLAST usage:

 

#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

#define INITIAL_VALUE 0x123456789ABCDEFLL

typedef ap_axiu<16,1,1,1> data_element;
typedef hls::stream<data_element> data_stream;

void processBlockStreaming(data_stream & streamIn, data_stream & streamOut) {
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE axis port=streamIn
#pragma HLS INTERFACE axis port=streamOut
    ap_uint<64> currentXOR = INITIAL_VALUE;

    while (1) {
        data_element data = streamIn.read(); // Read some data from the stream; if it's not available then wait until it becomes available.
        data.data ^= currentXOR(15,0);
        currentXOR = ( currentXOR(62,0) , currentXOR(63));
        streamOut.write(data); // Write some data to the stream; if it's full then wait until there's space.
        if (data.last) {
            break;
        }
    }
}

Generally the input comes from an AXI DMA block (which already knows about TLAST and will set it correctly) and the output goes to another DMA block (which will understand the TLAST coming from this block). Both take that data to/from RAM via a MIG or via a Zynq PS. As soon as the output DMA moves some more data to RAM, space will open up in the stream and the block will continue.

 

(3) I'm an engineer who tends to spend a lot of time working with (or fighting with) HLS. I guess that makes me reasonably well-placed to deal with questions.

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?


Hi @u4223374,

 

Sorry for the late reply.  The Xilinx forums only sent me an email today saying I had a new reply...

 

Thanks again for the replies, and for the detailed and thought out answers.

 

(1) I will save that image for reference, thank you.

(2) I'm not sure if you came up with that or not, but it is clever; I may use it for this project or my next.

(3) What do you design with HLS? I'm wondering whether HLS is efficient enough to design products from start to finish.

 

My previous designs were simple and I did not have to manually wire anything, so when I recently had to, I was very confused.  Since my last reply 8 days ago, I have tried to implement hls::stream per your suggestions; it dropped my BRAM utilization by a good amount, and I was very happy.  I also used streams without side channels, because I wasn't sure if I needed them.

 

typedef hls::stream<unsigned char> STREAM;

void decompressor(STREAM &in, STREAM &out, int in_size, int &out_size)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INLINE
#pragma HLS stream depth=10 variable=in
#pragma HLS stream depth=10 variable=out
...
...
}
#include "main.hpp"

int main()
{
	int in[ENC_SIZE] = {0};
	unsigned char out[500000] = {0};
	
	int in_size  = 0;
	int out_size = 0;

	fstream in_file(ENCODED,  fstream::in);
	fstream out_file(DECODED, fstream::out | fstream::trunc);

	/* EXTRACT INPUT FILE */

	int ch;
	while ((ch = in_file.get()) != EOF)
	{
	    in[in_size++] = ch;
	}

	/* CONVERT INT ARRAY TO UNSIGNED CHAR STREAMS */

	STREAM i_stream("IN_STREAM");
	STREAM o_stream("OUT_STREAM");

	for (int i=0; i<in_size; i++) i_stream << (unsigned char)in[i];

	decompressor(i_stream, o_stream, in_size, out_size);

	int out_index = 0;

	while (!o_stream.empty()) {
	    o_stream >> out[out_index++];
	}

	/*OUTPUT TO FILE*/

	for (int i=0; i < out_size; i++) {
	    out_file << out[i];
	}

	in_file.close();
	out_file.close();
}

Does anything look off?  I went mostly by different examples and trial and error, so I'm not sure if I'm using it properly.  Also, I read somewhere in UG902 that passing hls::stream as parameters to the top function is unsynthesizable, but it synthesized anyway, and multiple examples use it.

 

block.PNG

(4) (Decode2 is the same as decompressor, just its previous name.)  All of these ports appeared and I wasn't sure how to connect them.  I think

#pragma HLS INTERFACE s_axilite port=return

generated some of them.

 

(5) Lastly, I am wondering how to feed inputs to it through the overlay.  I am used to writing in all of the values, starting the overlay, waiting for it to finish, then reading the output values.  However, with hls::stream the memory map is different: the streams do not have a memory address, and out_size is split into out_size_i and out_size_o:

 

mmap.PNG

Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


(2) It's a pretty standard use for static variables in C, and they behave the same way in HLS.

 

(3) Depends what the aims are. HLS is a good tool if you need to get a complex block done quickly, especially where the cost of the FPGA is a small part of the total product cost. For example, if the final product contains $200,000 worth of parts, then if HLS means that you have to buy a bigger FPGA (eg. $1500 instead of $700) then that's not really a problem - especially if it means your product is on the market six months earlier. On the other hand, if you're planning to make millions of products and they have to be as cheap as possible, spending extra time to do the blocks efficiently in HDL code might be worthwhile. Unfortunately I can't talk much about what I actually use HLS for...

 

There's no particular need for side-channels in an AXI Stream, but I think the Xilinx DMA blocks do prefer to have at least TLAST.

 

(4) I don't think you actually need the #pragma HLS stream directives there; HLS doesn't need to know the depth of its input and output streams. I suspect that removing them will stop HLS implementing separate ports for your streams, and instead it'll bundle data/tvalid/tready into a single AXIS interface. The AXI Lite port is generated by the port=return directive, and can be connected to any AXI Master - normally on a Zynq it connects to one of the GP AXI Masters through an interconnect.

 

(5) This is what the Xilinx DMA blocks are used for. The DMA reads data from an address in RAM and converts it to an AXI Stream - or the opposite, depending on how it's configured.

 

 

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?


(2) Good to know, thanks, I still have a lot more to learn.  (Undergraduate student right now)

 

(3) Can I ask what job title/career description you have?  I am interested in writing HDL and in writing software, so ideally I hope to find a job where I can do both after graduating.  Or maybe I will go further into embedded.

 

(4) I removed the stream directives.  The extra ports are indeed packaged together now; however, I am still confused about where I should route them.  Since they are supposed to connect to the RAM, I'm guessing I need one of those "AXI Direct Memory Access" IP blocks.

 

The PYNQ base overlay was a huge system with a few dozen blocks, which made it difficult for me to understand what was going on.  So I restarted the design with just the ZYNQ processing system and my custom block, then added an AXI Direct Memory Access block and let connection automation handle the rest.

Here is my current design:

 

block.PNG

 

My block has these AXI stream ports: "in_V" and "out_V".  Connection automation did not route these for me, so I took a guess and routed them to the S_AXIS_S2MM and M_AXI_S2MM ports (I highlighted them; ignore the axis_clock_converter, which I was thinking might be the solution to my problem below).  I ended up with some clock errors:

 

error2.PNG

 

(6) The clock frequencies on the ZYNQ were set previously in the PYNQ base overlay design, so they worked with that design.  However, some of the ports seem unnecessary or incorrect for my design.  I poked around and noticed that FCLK_CLK3 is 166MHz and drives s_axi_lite_aclk and the ap_clk on my custom block, while FCLK_CLK0 is 100MHz and drives all of the other clocks on the AXI DMA.  Beyond figuring out why I'm getting these errors, I'm unsure how to correct them.

 

Should I make everything 100MHz, and driven from the same clock?

Am I even wiring in_V and out_V to the correct ports?

 

I am so close to getting this to work, but there is so much to learn that I am still getting lost at every step.  Sorry that every answer you give produces more questions.

Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


Right, now that the Xilinx forums are letting me log in again...

(2) No worries.

(3) Will reply via PM, keep this thread for the HLS-specific stuff.

(4, 5) Those connections to the DMA look just about perfect. Some changes:

- I would just run everything on a 100MHz clock for now - it'll greatly simplify the design, and it'll give the place & route tool an easier job. Later, when you're after absolute maximum performance, you can go back and run some blocks at higher speeds. See next point...

- Double-click the ZYNQ7 IP, click "Clock Configuration", then go to the PL Fabric clocks. Here you'll want to disable everything except for FCLK_CLK0, and set that to 100MHz.

- Next go to the "PS-PL Configuration" section, expand the "HP Slave AXI Interface section", and then enable "S AXI HP0 Interface". This gives you a high-performance AXI Slave port on the PS, which the HLS block can use to access RAM. Close that window.

- Double-click on the DMA block. For now, disable the Scatter Gather engine and the Control / Status Stream, which will get rid of two extraneous ports. Close that window.

- Double-click on the axi_mem_interconnect and reduce the number of AXI Master ports to 1. Then connect this port to the newly-generated S_AXI_HP0 port on the ZYNQ7 block.

- Grab the interrupt output from the HLS block and connect it to the IRQ_F2P port on the ZYNQ7. You probably won't be using it, but the tools get unhappy if it's disconnected.

- Now delete the 166MHz and 142MHz Processor System Reset blocks, and connect pretty much all the spare clock inputs (eg. S_AXI_HP0_ACLK that just turned up on the ZYNQ7 block) to the FCLK_CLK0 output on the ZYNQ7.

- In the Block Design, click the Address Editor tab and see if there are any unmapped slaves. There probably will be; just right-click anywhere and then select "Auto Assign Address", and Vivado will go through all the AXI buses and figure out what's connected to what.

Now you should have a system that's running at 100MHz all the way through (apart from the CPU cores, which are irrelevant), with the DMA able to feed data to that block from RAM and send it back to RAM.

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?


Awesome!

 

I was able to generate a bitstream, with only a message saying I should tie off my TLAST signal.  I guess I can fix that by using side channels for the streams, but since it still synthesized I will fix that problem later.

 

I had to delete the AXI Interconnects, re-add them, and run connection automation to get the clocks to sync correctly; otherwise they stayed stuck on auto at 166MHz and 142MHz.  (For anyone else reading this in the future.)

 

(7) I noticed the in_V and out_V ports are connected to the same AXI DMA.  Then the AXI DMA is connected to one AXI Interconnect, which is then connected to one port on the ZYNQ, S_AXI_HP0.

 

However, I noticed that an example I found online uses 2 DMA blocks to separate the streams.  I am attempting to write a Python script to run the overlay and allocate the stream buffers, but I found that I only have one DMA block to access.  Should I be using 2 DMA blocks?

 

Just to reiterate the purpose of my overlay: I want to open a local file in Python, load it into the DMA buffer, start the overlay, and read a completely different data stream back from the DMA.

 

Update to my block design:

block.PNG

 

The example I was referring to:

https://github.com/cathalmccabe/pynq_tutorial/blob/master/pynq_tutorial/notebooks/pynq_tutorial_dma_example.ipynb

 

My Python code and the DMA address:

lzw_notebook.PNG

Scholar u4223374
Registered: 04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?


(7) You can do it either way. If the examples you're using show two DMA blocks, it's probably going to be easier to implement it like that. Just add a second DMA, turn off the read channel on one and the write channel on the other, and connect them both to the same AXI interconnect.

Participant jpan127
Registered: 05-28-2017

Re: How would I implement an output buffer that signals whenever it is ready?


Yeah, I ended up using 2 DMA blocks, and I think that may be necessary for the pynq API to access them.

 

Sigh, another problem arose.  Hopefully you can shed some light on this; I saw some posts of yours where you talked about something similar.

 

After my HLS code was working and C-Sim passed, I had a problem in Co-Sim: it passed, but the output was incorrect.  An integer, out_size, is supposed to increment every time the code writes to the output stream.  However, at the end of Co-Sim the integer was 0 and the output stream was empty.  I ignored it because Co-Sim passed, and I thought it was one of those circumstances (I've seen people mention them before) where Co-Sim just won't work but the design works fine on hardware.  However, now that my overlay is working, the block design is error-free, and my Jupyter notebook is all set to go, I am getting 0s in my overlay output.

 

I thought my Python script was not working correctly, but then I remembered Co-Sim was producing 0 for out_size, so I went back to HLS.

 

What I have done:

1) Remove pipeline pragmas

2) Remove inline pragma

3) Remove interface s_axilite port=return

4) Added a variable to determine which part of the code the simulation exits from

5) Saw that non-blocking access to streams is bad, so I removed a section that used stream.empty()

 

Right now, after doing all five of the above, my code somehow skips my main loop.  The only guess I have left is that opening a file and feeding it by stream into the module is not working/simulating correctly, and nothing is getting fed in.

 

#include "decode.hpp"

void lzw_decompressor(STREAM &in, STREAM &out, int in_size, int &out_size, int &status, int &itr)
{
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
//#pragma HLS INTERFACE s_axilite port=return
//#pragma HLS INLINE

	status		 = 1;	// start
	int in_index     = 0;
	out_size 	 = 0;

	status = 100;
	int maxbits     = getBits(in, in_size, BITS_TO_SEND_MAXBITS);
	int window	= getBits(in, in_size, BITS_TO_SEND_WINDOW);
	int escape	= getBits(in, in_size, BITS_TO_SEND_ESCAPE);
	status = 1;

	cout << "maxbits: " << maxbits << "\twindow: " << window << "\tescape: " << escape << endl;

	int code, newcode;
	int oldcode	= EMPTY;
	int timer	= 1;
	int finalkar 	= 0;
	int justpruned	= 0;
	int nbits 	= 9;

	struct elem 	e;
	HashArray 	st;
	stack 		k;
	itr = 0;
	unsigned char 	c;

    /****************************************** MAIN LOOP *****************************************/


MAIN_LOOP:

	while(code != EOF)
  	{
	    status = 10;											// inside main loop

//#pragma HLS PIPELINE II=1

	    status = 100;	// inside getBits
	    code = newcode = getBits(in, in_size, nbits);			// should only be reading once per loop
	    status = 10;	// outside getBits

	    if (code == EOF) {
	    	status = 2;
	    	break;
	    }
	    else if (code == INCR_NBITS) {							// handle nbits incrementing code
			nbits++;
			continue;
	    }
	    e = st.CodeLookup(code);
	    if (e.null == true) {								// if unknown code, assume KwKwK
	        k.push(finalkar);
	        e = st.CodeLookup(oldcode);
	    }
	    if (e.null == true) {
	        status = -1;									// failure
	        cout << "code: " << code << endl;
	        fprintf(stderr, "%i: Error: input file corrupted\n", itr);
	        exit(EXIT_FAILURE);
	    }
	    while (e.prefix != EMPTY) {							// add all chars in the code to a stack
	        k.push(e.kar);
	        code = e.prefix;
	        e = st.CodeLookup(code);
	        if (code == -1) status = -3;
	    }
	    finalkar = e.kar;									// save the first char of this code, and print it
	    c = finalkar;
	    AXI_VALUE temp1;
	    temp1.data = c;
	    out << temp1;										// <<<<<<<<<<<<<<<<<<<<<<< OUTPUT HERE
	    out_size++;
	    while (!k.empty()) {								// print all the chars in the stack
	        c = (unsigned char)k.pop();
	        AXI_VALUE temp2;
	        temp2.data = c;
	        out << temp2;									// <<<<<<<<<<<<<<<<<<<<<<< OUTPUT HERE
	        out_size++;
	    }
	    if (oldcode != EMPTY) {
	        if (st.ReturnFreeSpots() != 0 && justpruned == 0) {
	            st.Insert(finalkar, oldcode);
	        }
	    }
	    st.UpdateSentTime(newcode, timer++);
	    oldcode = newcode;
	    itr++;
	}

	if (itr < 10000) status = -2;
}

Here is my top function, I'm not sure how to simplify it but I can try tomorrow.  getBits is a function that grabs from the stream (variable number of times) and performs some bitwise operations.  Most of the other functions are hashtable operations.  Spent all day on this...

Participant jpan127
Ok, for all of my ports that are not AXI Stream, I inserted an INTERFACE s_axilite directive. I noticed accidentally that without these directives it passes Co-Sim perfectly. However, I believe I need these directives to be able to write/read them from the overlay's PYNQ API.

Not sure if there is another way though.
Scholar u4223374

The immediate problem I notice is that "code" is used before it is assigned.

 

You declare it with "int code, newcode;", which doesn't assign a value - the variables just get whatever happens to be in RAM at that address. Then you immediately check whether it's equal to EOF. For repeated runs of the simulation it might well be equal to EOF - because it was presumably EOF when the last run finished, and there's a good chance it'll land in the same memory location. Try explicitly setting code = 0 (or anything not equal to EOF) at the start.

 

Incrementing an output repeatedly has caused issues for me in the past with HLS. My preferred approach now is to increment an internal variable and only write to the output once (at the end) - this seems to keep HLS much happier. You could give that a try.
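A sketch of that pattern (the function and port names here are hypothetical, not from the thread's code): accumulate into a local variable and write the output argument exactly once at the end.

```cpp
#include <cstddef>

// Hypothetical example: count non-zero bytes, writing the output argument
// (which would be an AXI-Lite port in HLS) only once instead of
// incrementing it on every iteration.
void count_nonzero(const unsigned char *in, int n, int &out_size) {
    int local_size = 0;                 // internal register: cheap to update
    for (int i = 0; i < n; i++) {
        if (in[i] != 0) local_size++;
    }
    out_size = local_size;              // single write to the output port
}
```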

 

With regards to empty() and its use in streams - it's generally not a good idea because the stream can easily be "empty" just because the DMA block feeding it was a bit slow at fetching data. If you wait just a few cycles (which HLS will do automatically with the usual blocking read/write calls) then the stream will no longer be empty. In simulation and cosimulation, where you have to load all the data into the stream before starting, the stream will only report empty when it's really, truly out of data - so you will tend to get completely different results to hardware.

Participant jpan127

@u4223374

 

:) You were spot on.  I actually saw this tip from one of your older posts and already synthesized last night.  

 

To clarify what I did, for other people reading: for all output ports except the streams I made local variables which I modified throughout the code, then copied the local variables to the output ports only once.

 

I made small changes like the one you mentioned for assigning variables to 0.  I changed the while loop into a for loop, not sure if that makes HLS happier or not.

 

Also I saw your post on using an AXI master to debug.  I was wondering why does it need to be an AXI master?  How do they behave differently?  I used a struct with a bunch of variables inside for debugging and one variable was line which set itself to the line number at different spots in my code.  (Like what I did with status in my above/old code)  And I think that helped a lot to understand where it was exiting.  It was actually mismatching incredibly weird.  The line number said it was outside the main while loop, while some variables were set, and they could only have been set inside the main while loop.  And it was exiting at random spots that should not be exiting.  It never even went to the end of the function.

 

Anyways I think it is working now, co-sim was great, block design great, will report back soon, and accept a solution.

Thank you so much! 

Scholar u4223374

Wow, someone actually went back and read through the old posts? That might be a first! The Xilinx forums don't generally make finding old posts very easy.

 

Regarding the AXI Master debugging: you've basically got three options, AXI Lite, AXI Stream, or AXI Master.

 

AXI Lite works fine for small amounts of data, but for some of my blocks I was generating 10MB+ worth of logs (I wanted to print out all the intermediate values, loop indices, etc). Since AXI Lite has to store everything in block RAM, it can't handle large logs.

 

An AXI Stream would have worked, but it adds another DMA block - and I was feeling lazy at the time (easier to modify the HLS code with an AXI Master than to modify both the HLS code and the Petalinux app with a DMA driver). On a more practical note, if the block is actually locking-up half way through then it won't send the TLAST bit to the DMA, which may cause confusion there (I haven't actually used the DMA blocks much, only the VDMAs, and those really do like to get TLAST at the right time).

 

AXI Masters are horribly inefficient in resource usage (compared to AXI Streams), and also very slow when you're not doing burst writes, but if the block stops you can be pretty sure that the data in RAM indicates precisely where it got up to before stopping. This makes it pretty well-suited to the debugging task.

 

One thing that I've been meaning to do (and will try next time I need to debug a block) is to use an AXI Stream, but connect that to a Stream FIFO instead of to a DMA. This would mean that when the FIFO fills up, the block gets paused until the Zynq PS has time to clear the data out. Ideally the PS would be reading data in real-time and printing it to the console, which would behave almost like the console output in C...

Participant jpan127

Hahaha well your old posts + muzaffer's are a wealth of knowledge.  Sifting through a hundred old posts turns up some useful information.  And you're right, it is kind of hard to search through these forums.  I use google search with site:xilinx.com, it's a little better.

 

Well there is something wrong with my DMA.  I am reading (from the pynq notebook) the status signals of the DMA streams and they appear to show the DMA never "halted", or is still running.

MM2S Status Register 0x04 = 00

S2MM Status Register 0x34 = 00

From the AXI DMA Manual^

 

I am not sure at this point if my DMA is not working or my IP is not working.

So I probably won't use a stream to debug these faulty streams.  I want to use AXI Lite to store in BRAM, but since I am only writing my debug signals at the very end of the code, if it is stuck anywhere, the debug information is never written...

 

1. I am going to create an IP that only passes values through, from the input stream to the output stream, and see if that works.

2. If that works, I will try looking into using AXI Master or the Integrated Logic Analyzer.

 

If you have any more information/resource on how to use the AXI Master that would be great.

 

I thought I would be learning these tools in small increments, but it seems like I need an understanding of almost everything!

Scholar u4223374

Regarding the questions:

 

(1) First thing to do is probably to just connect one DMA to the other DMA (ie just connect the stream output on one to the stream input on the other). That way there's no HLS in the loop at all. If this doesn't work, it's just a DMA configuration issue - which you can sort out at your leisure. If it does work, add a HLS block and see what happens.

 

(2) I'd probably start with the ILA for now - it'll tend to catch lower-level issues that the AXI Master debugging won't. I can try to do a tutorial on the AXI Masters next week (I try not to spend too much time on the forums over the weekends).

 

 

The other option is to try to go to AXI Masters for everything (which eliminates the DMAs) - but this can cause interesting (ie really annoying) issues when you've got two on one block that are working simultaneously. It also means that you need to put all the input data in RAM at once, whereas with the DMAs you can put in (for example) 1MB, run the block on that, then put in 1MB more (to the same location as before) and run the block on that, etc. I would continue with the streaming system for now; it's a much nicer system and it should actually be easier to get working.

Participant jpan127

(1) Ah, I really should have done that.  I did create the HLS block that passes data through.  However from the PS, I created 2 DMA buffers (4MB) and tried to transfer data through them.  I noticed that when transferring over exactly 16KB, the overlay never started/finished.  Under 16KB, the overlay passed data from the input to the output completely fine.

 

Then I went back to my original design and noticed that the same issue occurred when transferring over maybe ~1000 bytes.  For my original overlay, the issue occurred sometimes over 1000, sometimes over 1500.  I'm pretty lost at this point.  I made sure my memory segment allocation was plenty: at least 4MB for the data segments, and 256MB for the Data_MM2S and Data_S2MM segments.  (I'm not sure what these are for.  I am guessing the Data segments are for buffers and the Data_MM2S/Data_S2MM segments are for the entire allocation.)

 

Capture.PNG

 

I will definitely try connecting the DMA to each other.

 

(2) I spent some time looking over the ILA.  I connected it using the setup wizard and generated a bitstream.  The flow for loading a bitstream is different for the PYNQ, so I had to look into what the flow is for the Zynq.  I had an error that said something like "debug core was not found in user scan chain".  I saw a post saying that I need to initialize the PS first.  I think this means launching SDK and programming the FPGA?  Then I got pretty lost on how to use SDK.  Need some time to learn a new tool.  Also the code seems unfamiliar; I'm guessing there are libraries for interfacing with the PL from the PS, from SDK.  So basically the C++/Zynq version of the PYNQ's Python/Jupyter interface?

After successfully programming the FPGA from SDK (instead of getting the error from Vivado), I still wasn't sure how to use the ILA.

 

Xilinx also released a video today about the Verification IP for Zynq-7000, which looks promising, but I think it only comes with v2017.

 

Yea, please, a tutorial would be great.  I'm hoping to make some tutorials too once I get a better grasp on this.  Especially for pynq since the information is so sparse.

 

 

Participant jpan127

I found out, accidentally again, why 16KB was a particular size that caused issues.

 

Solution, for anyone reading:

 

dma.PNG

 

The highlighted parameter controls the depth of the DMA buffer.  It is defaulted to 14 bits or 16KB.  I increased it to the maximum and the issue disappeared (kind of).

 

I tested the DMA blocks connected to each other, after fixing the issue above, and they work now.

I then tested with an HLS block that passes data through sequentially, one write, one read, per loop iteration, and that works too.

 

Then I went back to my original design and tried to apply the same idea.  No luck...

 

I went back to HLS and added 2 AXI Lite arrays for the input/output: in[1000] and out[1000], just to store the first 1000 inputs read and 1000 outputs written.

 

Then I made an overlay and found out when transferring >128 bytes my transfer never completes.  I added a dma.wait() function right after the transfer and it always times out, even when I put 100 seconds.  So I guess there is another parameter that is limiting something to 7 bits?

When transferring <=128 bytes my transfer to the input buffer completes, and my in[1000] and out[1000] arrays reflect accurate input/output data.  However the output buffer is still empty, and the DMA status shows it never halted.  I also added a dma.wait() function right after the transfer, and it also always times out.

 

Since the output is correct, reflected by the AXI Lite arrays, and the DMA works for my simple loopback block, I am guessing the only possible problem is my HLS code.  Maybe something is causing it to hang?  I'm not sure why, it passed in Co-Sim though.

 

Scholar u4223374

Ah, I should have caught that one.

 

 

Unfortunately at this point there are two things that could be causing the problem:

 

(1) An issue in the HLS code.

 

(2) An issue in HLS itself.

 

If it works in cosimulation, then failures afterwards tend to point towards an issue in HLS itself, which will be tricky to debug. I'll work on an AXI Master debugging tutorial when I've got a minute. Is there a specific bit of code you'd like me to use with the tutorial?

Participant jpan127

Not in particular.  Just a general usage of it would be great!  Thanks.  And hooking it up in the block diagram.  Is the master also memory mapped, and accessible the same way as AXI Lite?

 

If it is an issue with HLS itself, is it an issue with co-simulation's false positive? (I don't see how that would be possible)

Or is it synthesizing incorrectly?

 

Yea I guess I am going to wait for your tutorial and try the master as my next step.

I really feel like it's just a setting or a parameter that is incorrect.  It fails consistently over exactly 128 bytes.  I also set the stream depth to 128KB (max) just in case.

Scholar u4223374

Mismatches between cosimulation and hardware are possible; I've had blocks that apparently passed cosimulation but couldn't possibly work in hardware (because HLS had tied TREADY to zero in one of the AXI Stream inputs). I've also seen problems where HLS blocks have managed to completely lock-up the AXI bus infrastructure - but I think that's only possible when using AXI Masters.

 

The AXI Master writes data directly to RAM, just like a DMA would (except without the stream in the middle). You could have used one with this block instead of the streams, but getting decent performance from an AXI Master requires some planning. Luckily for debug we don't really care at all about performance; I've had blocks that take over a second to run (compared to under 5ms for normal operation) because of all the debugging code - but if it means I can find the problem then that's worthwhile.

 

Debugging code

Here's some code, derived from the streaming XOR code I posted earlier. This contains a deliberately-introduced synthesis bug (using the __SYNTHESIS__ macro). While I do know of a few reproducible synthesis bugs, they're somewhat more work to implement and/or version-specific.

 

Testbench:

 

 

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include "../src/types.h"

#define WRITE_GOLD_DATA 0

void processBlockStreaming(data_stream & streamIn, data_stream & streamOut, ap_uint<32> * debug, ap_uint<32> & debugLength);

int main(void) {

	data_stream streamIn;
	data_stream streamOut;

	// I just picked 1048576 out of the air. Use whatever size you want.
	ap_uint<32> * debugData = (ap_uint<32> *) calloc(sizeof(ap_uint<32>),1048576);
	ap_uint<32> debugLength;

	// Load up the input stream with enough data to get two complete rotations of the XOR.
	// For debugging we'll use a straight counting sequence.

	for (int i = 0; i < 128; i++) {
		data_element dataIn;
		dataIn.data = i;
		dataIn.dest = 0;
		dataIn.id = 0;
		dataIn.keep = 3;
		dataIn.strb = 0;
		dataIn.user = 0;
		if (i == 127) {
			dataIn.last = 1;
		} else {
			dataIn.last = 0;
		}
		streamIn.write(dataIn);
	}


	processBlockStreaming(streamIn,streamOut,debugData,debugLength);

	// Verify that the input data stream is empty.
	uint32_t remainingInputData = 0;
	while (!streamIn.empty()) {
		streamIn.read();
		remainingInputData++;
	}
	printf("Found %d elements remaining in input\n",remainingInputData);

	// Now have a look at the output data.
#if WRITE_GOLD_DATA
	FILE * goldDataFile = fopen("E:/Xilinx/Projects/HLS/Resources/gold.dat","wb");
	while (!streamOut.empty()) {
		data_element dataIn = streamOut.read();
		fwrite(&dataIn.data,2,1,goldDataFile);
	}
	fclose(goldDataFile);
#else
	FILE * goldDataFile = fopen("E:/Xilinx/Projects/HLS/Resources/gold.dat","rb");
	bool foundTLAST = false;
	uint32_t count = 0;
	while (!streamOut.empty()) {
		data_element dataIn = streamOut.read();
		if (foundTLAST) {
			printf("Warning (%d): found data after TLAST bit\n",count);
		}
		uint16_t goldData;
		int elements = fread(&goldData,2,1,goldDataFile); // Read a single element from the gold data file.
		if (elements < 1) {
			printf("Warning (%d): ran out of data elements in gold file while still reading output\n",count);
		}
		if (goldData != dataIn.data) {
			printf("Warning (%d): data mismatch, expected %04X got %04X\n",count,goldData,(uint16_t) dataIn.data);
		}
		count++;
	}
	printf("Found %d elements in output\n",count);
	fclose(goldDataFile);
#endif

	// And dump debugging data.
	FILE * debugFile = fopen("E:/Xilinx/Projects/HLS/Resources/debug.txt","wb");
	for (int i = 0; i < debugLength; i++) {
		ap_uint<32> tmp = debugData[i];
		if (tmp.range(31,24) == 0xE1) {
			// Insert an extra line break for neatness.
			fprintf(debugFile,"\n%02X %02X %04X\n",(int) tmp.range(31,24), (int) tmp.range(23,16), (int) tmp.range(15,0));
		} else if (tmp.range(31,28) == 0x0E) {
			fprintf(debugFile,"%02X %02X %04X\n",(int) tmp.range(31,24), (int) tmp.range(23,16), (int) tmp.range(15,0));
		} else {
			fprintf(debugFile,"%08X\n",(int) tmp);
		}
	}
	fclose(debugFile);

	return 0;
}

 

Header:

 

 

#include <hls_stream.h>
#include <ap_int.h>
#include <ap_axi_sdata.h>

typedef ap_axiu<16,1,1,1> data_element;
typedef hls::stream<data_element> data_stream;

 

Main code:

 

 

#include "types.h"

#ifdef __SYNTHESIS__
#define INITIAL_VALUE 0x12356789ABCDEF
#else
#define INITIAL_VALUE 0x123456789ABCDEF
#endif


void processBlockStreaming(data_stream & streamIn, data_stream & streamOut, ap_uint<32> * debug, ap_uint<32> & debugLength) {
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE axis port=streamIn
#pragma HLS INTERFACE axis port=streamOut
#pragma HLS INTERFACE s_axilite port=debugLength
#pragma HLS INTERFACE m_axi port=debug offset=slave depth=1048576

    ap_uint<32> elementNum = 0;
    ap_uint<32> dbgLength = 0;
    ap_uint<64> currentXOR = INITIAL_VALUE;

    debug[dbgLength] = 0x12345678; // Start.
    dbgLength++;

    while (1) {
        data_element data = streamIn.read(); // Read some data from the stream; if it's not available then wait until it becomes available.

        debug[dbgLength] = 0xE1000000 | (elementNum << 16) | data.data;
        dbgLength++;
        debug[dbgLength] = 0xE2000000 | (elementNum << 16) | currentXOR(63,48);
        dbgLength++;
        debug[dbgLength] = 0xE3000000 | (elementNum << 16) | currentXOR(47,32);
        dbgLength++;
        debug[dbgLength] = 0xE4000000 | (elementNum << 16) | currentXOR(31,16);
        dbgLength++;
        debug[dbgLength] = 0xE5000000 | (elementNum << 16) | currentXOR(15,0);
        dbgLength++;

        data.data ^= currentXOR(15,0);
        currentXOR = ( currentXOR(62,0) , currentXOR(63,63));

        debug[dbgLength] = 0xE6000000 | (elementNum << 16) | data.data;
        dbgLength++;
        elementNum++;

        streamOut.write(data); // Write some data to the stream; if it's full then wait until there's space.
        if (data.last) {
            break;
        }
    }

    debug[dbgLength] = 0x87654321; // End.
    dbgLength++;
    debugLength = dbgLength;
}

I've added an AXI Master for debugging, and an AXI Lite variable for the length. If the block doesn't finish then the length won't get written, but that's not really an issue - if you clear the memory space used by the AXI Master before running the block then you can see how many elements were written simply because it'll never write all-zeros to a 4-byte region.

 

 

Each element here is 32-bit, which consists of an 8-bit ID (0xE followed by a number so I can identify which step in the process it is), an 8-bit element count (it'll wrap around eventually but it's adequate) and 16-bit data from inside the block. For every loop iteration I'm writing out the input data, the current XOR shift register, and the 16-bit data output. There are also start and end writes, just in case it's either getting stuck before starting the loop (suggests variable initialization problems) or getting stuck right at the end (not sure what could cause that, but I'm sure HLS can manage it).

 

The testbench checks that all the input data was used and compares the output data against known-good ("gold") data. It also prints out a text file containing all the debug information plus a little bit of formatting. The important thing is that when you put it into hardware, you can retrieve the same debugging information from RAM and save that to a file which is ideally identical to the one you got from simulation.

 

Here's what the first few lines of debugging output look like from a C simulation:

 

 

12345678

E1 00 0000
E2 00 0123
E3 00 4567
E4 00 89AB
E5 00 CDEF
E6 00 CDEF

E1 01 0001
E2 01 0246
E3 01 8ACF
E4 01 1357
E5 01 9BDE
E6 01 9BDF

E1 02 0002
E2 02 048D
E3 02 159E
E4 02 26AF
E5 02 37BC
E6 02 37BE

E1 03 0003
E2 03 091A
E3 03 2B3C
E4 03 4D5E
E5 03 6F78
E6 03 6F7B

 

 

And from cosimulation or hardware:

 

 

12345678

E1 00 0000
E2 00 0012
E3 00 3567
E4 00 89AB
E5 00 CDEF
E6 00 CDEF

E1 01 0001
E2 01 0024
E3 01 6ACF
E4 01 1357
E5 01 9BDE
E6 01 9BDF

E1 02 0002
E2 02 0048
E3 02 D59E
E4 02 26AF
E5 02 37BC
E6 02 37BE

E1 03 0003
E2 03 0091
E3 03 AB3C
E4 03 4D5E
E5 03 6F78
E6 03 6F7B

 

Even though the outputs are fine at this stage, you can already see where the problem is. E2:E5 contain the 64-bit shift register used for XORing; even on the very first iteration it's obvious that they're not the same in cosimulation/hardware.

 

Connecting it

 

Connecting the AXI Master is really easy; it connects exactly the same way as the DMA blocks would.

 

m_AXI_connection.jpg

Make sure you map its address space in the Address Editor (just let Vivado auto-assign this).

 

Getting the data

This is the part that I can't help much with. In Petalinux I'd just mmap a non-cached section of RAM, point the AXI Master to that (there's a register on the AXI Lite interface that sets the AXI Master's starting address in RAM), run the block, and then pull data out. Presumably there's a similar system available for Python since you must be able to effectively map RAM for the DMAs.

 

Once you've got an array pointing to the data, then you can use code very similar to that in the testbench to write it out to a file.

Participant jpan127

Great tutorial, and very complete, thank you.  I hope other people can find this post in the future.

 

I have a few questions:

 

1. I noticed that in your block diagram you have all the DMA connected to a single AXI Interconnect that connects to a single High Performance port.  I followed someone else's design to have two High Performance ports: HP0 for DMA read and HP1 for DMA write.

But you have three different ports merging into the same HP0 port.

Is there any difference?

 

2. Regarding mapping the AXI Master, I believe there is a python library Xlnk that can do that.  It assigns a contiguous block of memory.

 

3. It seems to me that the AXI Master, in HLS, behaves identically to the AXI Stream?

Set the interface, write to it like an array, then read from it like an array.

If anything it just seems like a Stream that is much slower? What makes them behave in a way that lets me "be pretty sure that the data in RAM indicates precisely where it got up to before stopping" (your message 17 of this thread)?

I also noticed you are using a pointer instead of pass-by-reference.

 

4. Random question about HLS code: I am under the impression that while loops bounded by a variable are inefficient for optimization.  So what I did was change them to for loops with either 1) an iteration maximum slightly above what is needed or 2) equal to a variable instead of checking if my stack is empty.

 

1) Old
while (a < b)
{
    ...
    ...
}
1) New
for (int i=0; i<max_possible_iter; i++)
{
    if (a >= b) break;
    ...
    ...
}

2) Old
while (!stack.empty())
{
    ...
    ...
}
2) New
int size = stack.size();
for (int i=0; i<size; i++)
{
    ...
    ...
}

I thought this would allow my design to be pipelineable, but I had some synthesis warnings and I looked into it.  Apparently my loops are still not perfect or semi-perfect loops.  The main loop still has "non-trivial" logic before and between my inner loops.  Usually the design examples are straightforward:

1)

for (...) {
    for (...) {
        ...
        ...
    }
}

2)

for (...) {
    ...
}
for (...) {
    ....
}

Is my design just inherently non-pipelineable? Also, my LUT usage jumped up by a lot.

Scholar u4223374

(1) It doesn't matter too much. You can get more bandwidth by using multiple ports (each of the HP AXI Slaves can only handle 1GB/s, whereas the main RAM can handle over 4GB/s depending on the board); the only reason I went with one port was that I already had a single AXI Master connected for another block.

 

(2) Excellent, that sounds good.

 

(3) An AXI Master can do random read/write - you can start at any location, read or write as much as you want, then jump to any other location and continue (including reading the same element multiple times). With an AXI Master you can have read and write from the same port, whereas a stream is inherently uni-directional. In this case none of that matters because the access is sequential and uni-directional. In terms of performance, an AXI Master can be just as fast as an AXI Stream and a DMA block - but to achieve that you have to set up burst access properly. More trouble than it's worth in this case. Streams also have the advantage that you can use them to join blocks together, whereas AXI Masters are almost exclusively used for transferring data to and from RAM.

 

The only thing that does matter about the AXI Master here is that there's no buffering going on - when you tell it to write something, it puts that into RAM straight away. This is where message #17 comes into play. If the block writes the "0xE2" message but not the "0xE3" one then you can be pretty sure that something between those two is causing the problem (although nothing does occur between those two). With Streams, there tends to be automatic buffering. If you find only "0xE2" in RAM, that might be because it only got up to that point in the code ... or it might have gotten further and all the later elements ("0xE3" etc) are stuck in a buffer somewhere. With the VDMA blocks (not too sure about the basic DMA ones) the block will wait until it's got a whole line of data in its internal buffers before writing to RAM.

 

The use of a pointer here is to indicate "array of unknown length" rather than "passing a variable by address".

 

(4) I'm honestly not sure about that. I rarely use the "while (a < b)" format, but "while (1)" is very common (with a break statement half way down).

 

The important thing for HLS is to set up the loops so that the test condition can be evaluated early. Basic "for" loops make this easy - as soon as each iteration starts HLS can determine whether the next one will run (complex "for" loops that include intermediate break statements are not so easy). "while (a < b)" is also fine if HLS can calculate "a" and "b" quickly; on the other hand if "a" and "b" are only available right at the end of each iteration then it's not going to be able to pipeline the loop very well.

 

In your examples, I suspect that (1) would be OK either way; for the second approach I'd probably just do an infinite loop with "while (1)" (or "while (true)") so that HLS isn't having to keep track of "i" as well. For (2) the first option should be fine as long as you do exactly one read from the stack in each iteration. In this case HLS can move the read to the start of the loop, and then immediately test whether the array is empty.

 

For most AXI Stream loops, you read from the stream at the start of the loop, and stop when TLAST is set. This makes pipelining really easy.

 

 

With regards to perfect loop nesting: sometimes it just works, sometimes it doesn't. As long as the inner loop is reasonably long it doesn't matter (eg. if the inner loop is 256 cycles long, burning one cycle to switch between inner and outer loops is irrelevant). If the inner loop is short, you can do it manually - I've had to rewrite nested loops as a single longer loop a few times in order to get better performance. Can you post the exact code that you're trying to get working?

Participant jpan127

(1) Ah I see, that makes sense.

 

(3) Is setting up burst access properly a difficult task?

Thanks again, you have given an overview of all these AXI interfaces better than any other resource I've read.

 

(4) Interesting. Ok, I will change it back to while loops; the for loops did generate much higher LUT usage, I think from 10% to 30%.

 

(5) Sure, I should have posted earlier but did not want to bother you with my messy code.  I have attached a zip.

 

(6) Currently, as you can see from the top function decode.cpp, I have 3 AXI Master ports, each with 350,000 indices.  This design is currently failing timing, I'm guessing too many masters and/or too many writes? I reduced my previous clock from 150MHz to 100MHz and my Total Negative Slack dropped from -32,000ns to -9,000ns, still failing timing.

I used 3 masters<32> because I have 3 numbers (status, section, iteration) in the first, the index of the input stream in the second, and the index of the output stream in the third.  The stream indices go up to ~260,000, so I couldn't pack them both into one 32-bit word.

Maybe I will try combining the last two into a single master<64>.
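One way that combination could look (sketched with uint64_t standing in for HLS's ap_uint<64>; the field layout here is an assumption for illustration, not something from the thread):

```cpp
#include <cassert>
#include <stdint.h>

// Both 32-bit indices share one 64-bit word: in_index in the low half,
// out_index in the high half, so one master write replaces two.
uint64_t pack_indices(uint32_t in_index, uint32_t out_index) {
    return ((uint64_t)out_index << 32) | in_index;
}

uint32_t unpack_in(uint64_t w)  { return (uint32_t)(w & 0xFFFFFFFFu); }
uint32_t unpack_out(uint64_t w) { return (uint32_t)(w >> 32); }
```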

(Section is basically what part of the code it is in, e.g. 1 = E1, 2 = E2)

(Please check decode.hpp for more information)

 

I am not too familiar with timing in general, but I guess the failure is because the master writes are too slow?  My HLS co-sim latency went up from ~1,500,000 to ~7,500,000 cycles.  I am also unsure of the TNS concept.  I am thinking negative slack is how much an operation exceeds the timing constraint by?  And by failing timing, the data is not arriving at the registers fast enough to be valid for the next transfer?  I guessed that by slowing down the clock, the design would have an easier time meeting timing; however, I have not seen people go under 100MHz and am not sure if I should.  I will test anyway.  I don't need the design to be fast at the moment.

Participant jpan127

Re: How would I implement an output buffer that signals whenever it is ready?

Wow...even my master doesn't work properly...

 

I must either be doing something very stupid, or something in HLS is going wrong.

 

I attached a log of my HLS output and my hardware output.  The hardware output has the first line correct, so I may be retrieving it correctly, but then it skips a few times, and starts showing a weird pattern.

 

The iterations should increment by one, but it skips suspiciously: 0, 1, 3, 5, 7, 15, 31, 63, 127, 255, 511, etc (a power of 2 - 1?)

The section stays at 15.

The status changes from 1 to 3.

 

The master ports for my in_index and out_index, which should increment as the block reads/writes, are completely 0.

 

If anything, I am about to just take apart my code and build it piece by piece...at least I know a loopback works properly.

Scholar u4223374
Registered: ‎04-26-2015

Re: How would I implement an output buffer that signals whenever it is ready?

(3) Depends on the function. For something like image convolution, it's trivial to set up a burst - images almost always give you a constant-length, sequential access pattern. The only challenge here is ensuring that it doesn't manage to deadlock itself (eg. initiates the burst to write data, then finds that it can't read data because the bus is locked by the writing side, and because it can't read data it doesn't have anything to write).

 

On the other hand, for a function with more random behaviour (eg. sequential writes that don't happen on every cycle) then HLS won't be able to infer a burst. You can add buffering to make a burst practical (eg. wait until it's got 256 elements before starting a burst) but this is often easier said than done within HLS. In this case, it generally makes much more sense to use an AXI Stream and add an external AXI Stream FIFO (or use the buffers built into various AXI blocks).
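A sketch of that buffering idea (burst length and names are illustrative; in HLS a memcpy over a master port is the usual way to hint a burst, but whether one is actually inferred depends on the design):

```cpp
#include <cassert>
#include <string.h>

const int BURST_LEN = 256;

// Collect sporadic writes in a local buffer and flush it with one memcpy
// -- which HLS can turn into a single long AXI burst -- instead of issuing
// many single-beat writes to external memory.
void buffered_write(const int *src, int n, int *ddr) {
    int buf[BURST_LEN];
    int fill = 0, written = 0;
    for (int i = 0; i < n; i++) {
        buf[fill++] = src[i];
        if (fill == BURST_LEN) {   // buffer full: flush as one burst
            memcpy(ddr + written, buf, BURST_LEN * sizeof(int));
            written += fill;
            fill = 0;
        }
    }
    if (fill)                      // flush the partial tail
        memcpy(ddr + written, buf, fill * sizeof(int));
}
```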

 

(5) OK, will have a look at that when I have a chance.

 

(6) Wow, that's a major timing failure. What sort of total resource utilization have you got? Unless you're at something like 99% LUT utilization, it feels like something must be set up badly wrong. Timing shouldn't fail by that much unless Vivado has had to route signals right across the chip repeatedly.

 

What's the worst-case negative slack (WNS)? TNS tends to tell you about congestion of the design while WNS tells you about specific paths that are causing issues. If you're seeing a few paths that have huge timing failures (eg. WNS = 1000ns so potentially only about 10 paths failing) then that indicates a need to investigate a specific part of the design. If you're seeing a lot of paths with minor timing failures (eg. WNS = 1ns, so at least 9000 failing paths) then Vivado has messed something up.
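That relationship can be stated as a small worked example (my framing, not a Vivado formula): since no single endpoint can fail by more than |WNS|, the ratio |TNS|/|WNS| is a lower bound on the number of failing endpoints.

```cpp
#include <cassert>
#include <cmath>

// Lower bound on failing endpoints implied by a TNS/WNS pair (both
// negative, in ns): every failing path contributes at most |WNS| to |TNS|.
int min_failing_paths(double tns_ns, double wns_ns) {
    return (int)std::ceil(-tns_ns / -wns_ns);
}
```

With TNS = -9,000ns, a WNS of -1000ns implies at least 9 failing paths, while a WNS of -1ns implies at least 9,000.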

 

You can definitely try reducing the clock speed; 50MHz is fine just to get a block working.

 

 

 

Regarding your most recent post: there's something very, very odd happening here (as you've seen). If it works in C simulation, there are only three things that can be causing it:

 

(1) HLS is broken. It happens sometimes, and we find work-arounds until Xilinx figures out what they're doing wrong.

 

(2) HLS is not happy with some aspect of your coding. Technically this is the same as (1) (in that HLS should accept anything that the C compiler can handle) but it's generally easy to fix - the biggest challenge is finding out which part HLS is confused by.

 

(3) Something is different between the C simulation and the hardware. This might indicate an issue in the block design, or it might just be something that happens to work on a PC most of the time (eg. failing to initialize a variable) but consistently fails on an FPGA.

 

I'll have a look through the code and see what I can find.

 

Participant jpan127

Re: How would I implement an output buffer that signals whenever it is ready?

(3) Ah I see, that makes sense.  I did find your old post about preventing a burst transaction.

 

(5) Thank you.

 

(6) It was around -9,000ns.  I forgot to mention in my 2nd post that I did set it to 50MHz and it did completely pass timing with positive slack.  My utilization was 40% LUT and <20% everything else.  I think I had some negative WNS/TNS in my previous versions, but adding the Masters did cripple the timing.  (Previously 150MHz, now 50MHz)

 

Thanks again.  I am really hoping you find something in my code that is incredibly stupid and obviously wrong.  Until then, I will probably start dissecting it.
