jhorswill
Observer
549 Views
Registered: ‎10-08-2020

Instantaneous asynchronous back pressure on block - achievable through AXI4 stream protocol?

Hi there.

I am struggling with creating a mechanism that provides instantaneous back-pressure on my block, once an incoming tready signal goes low.

Please let me know if I am missing any relevant information; this is only my second post on these forums. I am also still very new to this software.


I am redesigning two existing VHDL blocks using VHLS, writing the functions in C++. This is for the part xczu9eg-ffvb1156-2-e, at a clock frequency of 250 MHz.

These are connected to each other and the surrounding (hit finder) block chain using an AXI4 stream protocol, each processing waves of packets containing 64 ADC samples and adjusting them to be sent to the next block.

Since I have little experience with interface directives, I implemented these AXI4 stream ports as function arguments, as follows (from the header files):

The first block: a pedestal subtraction algorithm (latency = 1, II = 1, pipelined)

void pedsub_HLS(word_t tdata_i, word_t* tdata_o, word_t accum_i,
                word_t* accum_o, word_t ped_i,
                word_t* ped_o, bool tvalid_i, bool* tvalid_o,
                bool tuser_i, bool* tuser_o,
                bool tkeep0_i, bool* tkeep0_o, bool tkeep1_i,
                bool* tkeep1_o, bool tready_i, bool treset_i,
                bool* treset_o, bool tlast_i, bool* tlast_o);

The second block: an FIR filter (latency = 1, II = 1, pipelined)

void fir_HLS_simplified(short tdata_i, short* tdata_o,
                        bool tvalid_i, bool* tvalid_o, bool tuser_i,
                        bool* tuser_o, bool tkeep0_i, bool* tkeep0_o,
                        bool tkeep1_i, bool* tkeep1_o, bool tready_i,
                        bool treset_i, bool* treset_o, bool tlast_i,
                        bool* tlast_o);

 

As I am sure is the case with most AXI4 stream systems, a high tvalid indicates valid data to be processed and saved. High tuser and tlast indicate the final sample in the packet. A high tkeep is similar to tvalid in preventing the discarding of invalid data. A low tready indicates backpressure from the subsequent blocks, and high tready indicates normal operation. A high treset indicates that all block variables and arrays will be reset to default values, and the data input will begin again. And of course tdata contains the ADC sample to be processed.

This system was fine until I ran into difficulty with implementing an instantaneous halt of block operations in response to a low tready_i signal. The behaviour I am looking for in this instance is that all block operations cease besides the input, and most importantly, the output, which outputs the variables that the block processed the clock cycle before tready_i went low. This output continues for every clock cycle until tready_i goes high again, where normal operations resume.

This is obviously not normally possible when the block takes a minimum of 1 clock cycle to respond to an input, meaning the response is not instantaneous and would have to wait until the next clock cycle to take effect.

I managed to circumvent this issue with pedsub_HLS by outputting the previous function call's outputs as soon as a low tready was read from the input. However, this approach did not work for fir_HLS_simplified for unknown reasons. The function outputs still seem to change on the falling edge of tready_i when I look at them in Questasim.

Here is a simplified version of the algorithm (sorry for the formatting):

// N_TAP, CLK_REC and output_assign() are declared in the header file.

// FIR tap array (N_TAP = 32)
short tap_array[N_TAP];
// Tap coefficients
short fir_coeffs[N_TAP] = {0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 6, 7, 9, 11, 12, 13,
                           13, 12, 11, 9, 7, 6, 4, 2, 0, 0, 0, 0, 0, 0, 0, 0};

void fir_HLS_simplified(short tdata_i, short* tdata_o,
                        bool tvalid_i, bool* tvalid_o, bool tuser_i,
                        bool* tuser_o, bool tkeep0_i, bool* tkeep0_o,
                        bool tkeep1_i, bool* tkeep1_o, bool tready_i,
                        bool treset_i, bool* treset_o, bool tlast_i,
                        bool* tlast_o) {
    // Arrays containing previous values for the tready mechanism
    // (default CLK_REC = 1, but can change depending on circumstances).
    static short tdata_previous[CLK_REC];
    static bool tvalid_previous[CLK_REC], tuser_previous[CLK_REC],
                tkeep_previous[CLK_REC], tlast_previous[CLK_REC];

    if (tready_i) {
        short sum = 0;
        short mult;
        // FIR filter mechanism
        if (tvalid_i && tkeep0_i && tkeep1_i) {
            register_loop: for (short i = N_TAP - 1; i >= 0; i--) {
                // Introduce new input to first register
                if (i == 0) {
                    tap_array[i] = tdata_i;
                }
                // Move all other values up one register
                else {
                    tap_array[i] = tap_array[i - 1];
                }

                mult = tap_array[i] * fir_coeffs[i];
                sum += mult;
            }
        }
        // End of FIR mechanism

        // Drive outputs
        output_assign(sum, tdata_o,
                      tvalid_i, tvalid_o, tuser_i,
                      tuser_o, tkeep0_i, tkeep0_o,
                      tkeep1_i, tkeep1_o, treset_i,
                      treset_o, tlast_i, tlast_o);

        // Save previous values for when tready_i = false
        previous_assign: for (short i = CLK_REC - 1; i >= 0; i--) {
            if (i == 0) {
                tdata_previous[i] = sum;
                tvalid_previous[i] = tvalid_i;
                tuser_previous[i] = tuser_i;
                tkeep_previous[i] = tkeep0_i;
                tlast_previous[i] = tlast_i;
            }
            else {
                tdata_previous[i] = tdata_previous[i - 1];
                tvalid_previous[i] = tvalid_previous[i - 1];
                tuser_previous[i] = tuser_previous[i - 1];
                tkeep_previous[i] = tkeep_previous[i - 1];
                tlast_previous[i] = tlast_previous[i - 1];
            }
        }
    }
    // If tready is low, output the previous clock's outputs.
    else {
        output_assign(tdata_previous[CLK_REC - 1], tdata_o,
                      tvalid_previous[CLK_REC - 1], tvalid_o,
                      tuser_previous[CLK_REC - 1], tuser_o,
                      tkeep_previous[CLK_REC - 1], tkeep0_o,
                      tkeep_previous[CLK_REC - 1], tkeep1_o,
                      treset_i, treset_o,
                      tlast_previous[CLK_REC - 1], tlast_o);
    }
}

This approach would also not work for any block with a latency N > 1, as tready would go low while N - 1 pipelined function calls are still in flight. This would lead to N - 1 new outputs over the N - 1 clock cycles after tready goes low, which contradicts the desired behaviour of holding the previous clock's output for as long as tready is low.

This led me to believe I need to find a new approach. I have seen AXI4 stream protocol interface directives being discussed on a number of forum posts and in ug902. I believe this may be a way to implement the instantaneous/asynchronous back-pressure behaviour I require. How would I go about doing this?

 

Thank you for your time!

7 Replies
necare81
Explorer
526 Views
Registered: ‎03-31-2016

I recommend getting the AXI stream spec and reading it.

Your understanding of tkeep is wrong. Only tvalid and tready control whether you should be doing something on this clock.

Tkeep and tstrb are different ways to indicate how the respective bytes should be treated. Tstrb indicates if the content of the byte needs to be maintained and tkeep indicates if the position needs to be maintained. The spec has rules on how to behave if only using one of those signals.
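
To make the tkeep/tstrb distinction concrete, here is a minimal C++ sketch of the per-byte classification as the AXI4-Stream spec describes it. The names ByteType and classify_byte are made up for this illustration, and the 8-lane limit is an assumption of the sketch, not something from the spec.

```cpp
#include <cassert>
#include <cstdint>

// Byte types the AXI4-Stream spec assigns to each (TKEEP, TSTRB) bit pair.
enum class ByteType { Data, Position, Null, Reserved };

// Classify byte lane `lane` of one beat from its TKEEP and TSTRB bit vectors.
// (Hypothetical helper; assumes a bus of at most 8 byte lanes.)
ByteType classify_byte(std::uint8_t tkeep, std::uint8_t tstrb, unsigned lane) {
    assert(lane < 8);
    const bool keep = ((tkeep >> lane) & 1u) != 0;
    const bool strb = ((tstrb >> lane) & 1u) != 0;
    if (keep && strb)   return ByteType::Data;      // content and position matter
    if (keep && !strb)  return ByteType::Position;  // only the position matters
    if (!keep && !strb) return ByteType::Null;      // byte may be dropped
    return ByteType::Reserved;                      // keep = 0, strb = 1 must not be used
}
```

Note that neither signal says anything about whether a transfer happens on a given clock; that is decided by tvalid and tready alone.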

 

Getting back to the back pressure, it really depends on how you define "previous clock". Assume tready is true to start. If on a specific rising edge, X, the master sets tvalid to true and tdata to 1, tready stays at 1. At the next edge, X+1, the master still drives tvalid as true but tdata is now 2, and the slave drives tready to false.

Now what happens at edge X+2 and what do you want to happen?

Do you want it to send tdata as 1, because that is going back to the "previous" cycle, between X and X+1, or did you mean simply not change and maintain the value present from cycle X+1 to X+2?

The spec requires it to be the second. If that doesn't work you need to add logic inside the module to handle it.

When talking about changes at a clock edge, you need to be careful about whether you mean the values just before or just after the edge.
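
The X / X+1 / X+2 scenario above can be sketched as a small cycle-by-cycle C++ model. This is only an illustration of the hold rule, not HLS code: Beat and drive_master are hypothetical names, and the model assumes the master always has the next sample available.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// What the master drives during one clock cycle.
struct Beat { bool tvalid; int tdata; };

// Cycle-level model of a spec-compliant master: it keeps tvalid and tdata
// unchanged until a cycle in which the slave's tready is also high.
std::vector<Beat> drive_master(const std::vector<int>& samples,
                               const std::vector<bool>& tready) {
    std::vector<Beat> trace;
    std::size_t next = 0;  // index of the sample currently on the bus
    for (bool ready : tready) {
        const bool valid = next < samples.size();
        trace.push_back(Beat{valid, valid ? samples[next] : 0});
        if (valid && ready) ++next;  // handshake completes, move to next sample
        assert(next <= samples.size());
    }
    return trace;
}
```

With samples {1, 2, 3} and tready = {1, 0, 0, 1, 1}, the tdata trace is 1, 2, 2, 2, 3: the value 2 stays on the bus for as long as tready is low and is only replaced after the cycle in which it is accepted.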

drjohnsmith
Teacher
511 Views
Registered: ‎07-09-2009

It's a bit old, but

https://www.xilinx.com/support/documentation/ip_documentation/ug761_axi_reference_guide.pdf

https://vhdlwhiz.com/axi-fifo/

 

richardhead
Scholar
502 Views
Registered: ‎08-01-2012

All AXI is a simple handshaking mechanism. A data transfer occurs when tvalid and tready are high on the same clock edge. This is true for all of AXIS and each channel in AXI4, AXI4L and AXI3.
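
As a rough illustration of that rule from the receiving side, the following C++ sketch picks out the beats that actually transfer from a recorded trace. accepted_beats is a hypothetical helper, and the three vectors are assumed to hold one entry per clock cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the tdata values that actually transfer: one for every cycle in
// which tvalid and tready are both high at the clock edge.
std::vector<int> accepted_beats(const std::vector<bool>& tvalid,
                                const std::vector<bool>& tready,
                                const std::vector<int>& tdata) {
    assert(tvalid.size() == tready.size() && tready.size() == tdata.size());
    std::vector<int> accepted;
    for (std::size_t c = 0; c < tdata.size(); ++c) {
        if (tvalid[c] && tready[c]) accepted.push_back(tdata[c]);
    }
    return accepted;
}
```

A cycle with tvalid high and tready low transfers nothing, which is why a value held on the bus during back pressure is only counted once.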

"data" covers the rest of the signals. As @necare81 points out, it does sound like you may have a misunderstanding of the behaviour of the rest of the signals. There is also no such signal as treset (reset in AXI is aresetn).

I'm also confused why you are writing an AXI interface in HLS, when this is the interface HLS uses when you design at a higher level of abstraction, without the user having to define the behaviour. Why not just write in HDL?

jhorswill
Observer
498 Views
Registered: ‎10-08-2020

Thanks for your response.

In terms of the stream spec, are you referring to UG761? If so, I will have a look at this. Thank you for educating me on how tkeep works.

 

After reading your question on the tready mechanism, let me explain the desired behaviour more clearly.

 

When tready goes low, I would like the output to be maintained as the value generated between cycle X and X + 1. This means that the output does not change after the edge that tready is driven to false (X+1). In your example, tdata would remain at the value it is registered at the X+1 edge (tdata = 2); no new values would be output until the slave drives tready high again. This is what I meant by 'instantaneous'.

This is assuming that the result tdata = 2 at X+1 is from the function operation between X and X+1. That is the desired behaviour. Values of tdata or any of the AXI4 stream variables from any operation after tready is driven low should not reach the output. This means that the result of the operation between X+1 and X+2 should not be sent to output at X+2.

Therefore at X+2, X+3, etc., tdata would remain at 2 until tready is driven high again.

The reason I mentioned the previous clock cycle output is because in my block, the effect of the low tready would take place 1 clock cycle after it was driven low (driven at X+1 -> effect at X+2), and I needed to anticipate this and drive the outputs from X+1 at X+2 to 'simulate' an instantaneous response. Does this make sense? This was the workaround. However this only works in the specific case of latency = 1 and only for the pedsub_HLS algorithm. Hence why I am looking for a different solution.

 

 

richardhead
Scholar
496 Views
Registered: ‎08-01-2012

The AXI Streaming spec was defined by ARM, and can be read here: https://developer.arm.com/documentation/ihi0051/a/Introduction/About-the-AXI4-Stream-protocol

jhorswill
Observer
491 Views
Registered: ‎10-08-2020

Hi @richardhead 

You are right, I probably do misunderstand the general purpose of the rest of the signals. I have had limited exposure to interface protocols. I will definitely read UG761.

The reason I am redesigning existing VHDL algorithms in HLS is to gauge whether it is a viable design method in the context of my team's project. I am in a field where a lot more people know C++ and how to use a simple GUI than write HDL!

This behaviour has already been implemented in VHDL scripts; I am trying to recreate it in HLS.

I am aiming to employ the axis and/or s_axilite interface directives to achieve my goal, instead of manually including the handshake signals as function arguments.

Would these directives, or anything similar, have the functionality to achieve what I need?

Thanks

richardhead
Scholar
417 Views
Registered: ‎08-01-2012

With HLS you are supposed to write it more C-like. You don't worry about interfaces; the High Level part of HLS does that for you. You don't write HLS simply by translating VHDL. You write it more like C and the compiler does the interfacing for you, so you don't need to know how it works on a cycle-by-cycle basis like you would in SystemC.

It sounds like you probably need to do some HLS tutorials.
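
As a rough sketch of what "more C-like" could look like for the FIR block in this thread: the function below only reads a sample from an input stream and writes a result to an output stream, with no tvalid/tready arguments at all. To keep it compilable with an ordinary C++ compiler, Stream is a hypothetical stand-in for hls::stream, and Sample and fir_stream are names invented here. In the HLS tools the ports would normally be hls::stream objects with an axis interface directive (for example "#pragma HLS INTERFACE axis port=in"), and the tool then generates the handshake and the stalling on back pressure. The wider int accumulator is also an assumption of this sketch; the original code accumulates in a short.

```cpp
#include <cassert>
#include <queue>

constexpr int N_TAP = 32;

// Hypothetical stand-in for a blocking stream such as hls::stream<T>, so the
// sketch builds without the HLS headers.
template <typename T>
class Stream {
public:
    T read() { assert(!q_.empty()); T v = q_.front(); q_.pop(); return v; }
    void write(const T& v) { q_.push(v); }
private:
    std::queue<T> q_;
};

// One stream element: an ADC sample plus the end-of-packet flag (tlast).
struct Sample { short data; bool last; };

// Processes exactly one sample per call; no handshake signals in the code.
void fir_stream(Stream<Sample>& in, Stream<Sample>& out) {
    static const short coeffs[N_TAP] = {0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 6, 7,
                                        9, 11, 12, 13, 13, 12, 11, 9, 7, 6,
                                        4, 2, 0, 0, 0, 0, 0, 0, 0, 0};
    static short taps[N_TAP] = {0};

    const Sample s = in.read();
    int acc = 0;  // assumption: wider accumulator than the original short
    for (int i = N_TAP - 1; i > 0; --i) {
        taps[i] = taps[i - 1];        // shift the delay line
        acc += taps[i] * coeffs[i];
    }
    taps[0] = s.data;                 // newest sample enters tap 0
    acc += taps[0] * coeffs[0];
    out.write(Sample{static_cast<short>(acc), s.last});
}
```

Feeding a single 1 followed by zeros should reproduce the coefficient list on the output, one coefficient per call.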