Pipeline with feedback?


Part of my HDL code is failing timing analysis because I attempt too many operations in one clock cycle. So, I have tried restructuring the code as a pipeline. However, I find that results formed late in the pipeline influence the new value being fed into the pipeline (i.e., the pipeline has feedback).

 

Is pipelining with feedback possible?

Xilinx Employee
Check these guides:

https://www.xilinx.com/support/documentation/sw_manuals/xilinx13_4/ug612.pdf

https://www.xilinx.com/support/answers/40838.html
Thanks and Regards
Balkrishan
Moderator

markg@prosensing.com

 

How many logic levels do you have in the critical path? Did you try synth_design -retiming in 2016.2/2016.3?

 

>>Is pipelining with feedback possible?

Can you show us the schematic of the above logic?

 

Thanks,
Anusheel
Xilinx Employee

Hi markg@prosensing.com,

>>However, I find that results formed late in the pipeline influence the new value being fed into pipeline
Please explain this observation of yours in detail, with simulation snapshots if possible.

Simply looking at the schematic of the pipelined logic should also give you a clear picture of what was synthesized after your RTL modification for pipelining.

Thanks
Bharath

To All: 
Thanks for your comments. 

 

Please consider my questions to be general concept questions.

 

To clarify, I am trying to do something that has inherent feedback. That is, I have sets of inputs upon which I am doing calculations. Depending on the calculation results for a particular set of inputs, I (sometimes) need to adjust the *very next* set of inputs. When I try to pipeline this, the adjustment (i.e., the feedback) does not get applied to the very next set of inputs, but is instead delayed by the latency of the pipeline. Perhaps traditional pipelining is the wrong approach for helping my "calculations with feedback" pass timing analysis.

 

Do you have other recommendations for helping "calculations with feedback" pass timing?

 

The "synthesis retiming" suggested by anusheel sounds interesting!  My quick reading on the topic shows that it achieves improved performance by moving registers around - but does not change circuit latency!   It sounds like an automatic "pipeline improver" - but not a "pipeline creator"?  Can someone say a little more about how synthesis retiming works?

Guide (Accepted Solution)

Fundamentally, this is the inherent limitation to performance of an algorithm when implemented in hardware.

If your system has feedback, this creates a loop. Because of the feedback, one iteration of the loop must be complete before the next iteration can begin. As a result, you have a "certain amount of work" to do in each iteration. Let's say that the amount of "work" in the loop (conceptually) takes time T.

You have a number of ways of implementing this. You can implement the loop in one clock cycle, in which case your maximum frequency is 1/T.

If you pipeline this, and place a set of flip-flops in the middle of the combinatorial logic, then you can increase your frequency to 2/T. However - now it takes 2 clock cycles for the loop to get its feedback information. As a result, you can only perform one iteration of the loop every 2 clock cycles (for a total of 1/T iterations per second).
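
To make this concrete, here is a minimal Verilog sketch (my own illustration, not from the original post - the multiply-accumulate is just a stand-in for whatever "work" sits inside your loop):

// Version 1: the whole loop body evaluates in one clock cycle.
// If the logic needs time T, Fmax is about 1/T and you get one iteration per cycle.
module loop_single_cycle (
    input  wire        clk,
    input  wire        rst,
    input  wire [15:0] din,
    output reg  [31:0] acc
);
    always @(posedge clk)
        if (rst) acc <= 32'd0;
        else     acc <= acc + (acc[15:0] * din);   // f(acc, din): needs last cycle's result
endmodule

// Version 2: a pipeline register splits the logic, so Fmax roughly doubles,
// but the loop's own result is now 2 cycles away. To implement the same
// recurrence you can only launch a new iteration every 2 cycles, so the
// iteration rate stays about 1/T.
module loop_pipelined (
    input  wire        clk,
    input  wire        rst,
    input  wire [15:0] din,
    output reg  [31:0] acc
);
    reg [31:0] product;
    always @(posedge clk)
        if (rst) begin
            product <= 32'd0;
            acc     <= 32'd0;
        end else begin
            product <= acc[15:0] * din;   // stage 1
            acc     <= acc + product;     // stage 2: 'product' reflects acc from 2 cycles ago
        end
endmodule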

The "loop" that has the most work to do in an architecture is the "critical loop" - it cannot be accelerated by pipelining or increasing the frequency, and hence ultimately limits the performance of your system.

So, you need to reduce the complexity of the critical loop. There is no "simple" way of doing this. You may be able to improve it by separating the parts of the loop that are truly critical from other parts that are not, and moving the non-critical stuff into the cycle before (i.e., pre-calculate as much of the inputs to the loop as possible), or postponing it to post-processing. If you leave just the critical portion of the loop in the clock cycle of the loop, then the synthesis tool may be able to implement it faster.

However, again, there is a limit above which you just cannot go - if you have isolated the critical loop to just the critical portion, and you must run the loop each iteration, then that defines the performance limit of your architecture. If that's not fast enough, then you have to find another architecture (which may or may not be possible).

Avrum

Scholar

 

To add on to Avrum's response: although this is a hard problem to fix in general, I've often been surprised at what sorts of algorithms (even ones with feedback) one can mathematically break down and still pipeline in some way to achieve a higher clock speed.

 

A recent example for me - calculating CRCs. CRCs often push the critical path in FPGA designs. A CRC has feedback: the current remainder calculation relies on the result of the last remainder. The long XOR trees tend not to lend themselves to fixing purely by combinational optimizations.

 

But CRCs can still be pipelined. You can tease the math apart to calculate feedforward-only "partial" CRCs (in a multi-clock-cycle fashion) and combine all the results at the end.

 

So - you need to look at each calculation specifically. (In the back of my mind I hear some math professor and/or synthesis expert gritting his/her teeth at my description. I'm sure there's a more mathematically rigorous definition of what I'm trying to state here...)

 

And regarding the "synthesis retiming" - and I always leave that switch on - I've not found the feature powerful enough to actually EVER solve a timing problem.  i.e. I've got a large combinational path failing timing.  I add N pipeline registers (usually to the output of the calculation).  And then re-run the tools, hoping "synthesis retiming" will push those pipeline registers into the calculation. It'd be great if this worked, and the tools could do this "grunt work" for us.  But it's never really worked for me. "Worked" meaning my circuit now passes timing.  It moves a register or two around, but not really enough to make a significant difference.  I've found that I've always had to go into the RTL and intelligently add the pipeline myself. 
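
For what it's worth, a minimal sketch of that workflow (my illustration - hypothetical module and signal names, and the "big" function is just a placeholder):

// A wide combinational function that fails timing, followed by N = 2 extra
// registers at the output. The hope is that retiming will push these registers
// back into the logic; in practice (as described above) it usually only nudges them.
module calc_with_output_regs (
    input  wire        clk,
    input  wire [63:0] a,
    input  wire [63:0] b,
    output reg  [63:0] q
);
    wire [63:0] big_result = (a * b[31:0]) ^ {b[31:0], a[31:0]};  // placeholder "too much work"
    reg  [63:0] p1;
    always @(posedge clk) begin
        p1 <= big_result;   // pipeline register #1 (candidate for retiming)
        q  <= p1;           // pipeline register #2
    end
endmodule

You would then re-run synthesis with retiming enabled (the synth_design -retiming switch Anusheel mentioned) and check whether the registers actually moved into the multiplier/XOR logic.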

 

In defense of the switch - I do think it's helping me "behind the scenes" for paths that are very close to meeting timing. Hence I leave those optimizations always on. But when you really need an extra pipeline stage - I've found you need to add it yourself.

 

Regards,

 

Mark


Avrum:  A very clear (as always) answer to the question I shoulda asked.  Thanks!   As you suggest, I will try pre-computing things, lookup tables, etc.  Sometimes reality really bites.

 

--

 

Mark: Many thanks for sharing your experience from the trenches. Currently, my code is failing timing by just a little. So, I was hoping that "synthesis retiming" might be the easy solution. I will not waste time trying to understand it further if it does not work immediately.

Guide

>>I'm sure there's a more mathematically rigorous definition of what I'm trying to state here...

Specifically for a CRC, the function

CRC[i+1] = f (CRC[i], data_in[i])

where CRC[i] is the accumulated CRC at cycle i (its width is the width of the CRC, often 32 bits) and data_in[i] are the new bits to be accumulated at cycle i (their width is the width of the datapath, let's say 64 bits).

The CRC is defined by a polynomial, but in its primitive form it assumes new data bits are added one at a time. To add the 64 bits to the CRC, we "unroll" the CRC polynomial 64 times. When you do this, what you essentially get is that each bit of the new CRC is an XOR of some combination of the 32 bits of CRC[i] and the 64 bits of data_in[i]. In general, the terms tend to be XORs of roughly half the bits coming into the system (so 16 of the old CRC bits and 32 of the new data_in bits - a different combination for each of the 32 bits of CRC[i+1]).

The function can therefore be broken apart into

CRC[i+1] = f'(CRC[i]) ^ g'(data_in[i])

The f' function will be smaller - just the (approximately) 16 terms from CRC[i]. The g' function doesn't depend on the feedback at all - it can be calculated in advance - thus taking the (roughly) 32 bits of the data_in XOR chain off the critical path, leaving the roughly 16 terms from f' and the one extra XOR that combines f' with g'.
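
A structural Verilog sketch of that split (my illustration - the per-bit XOR equations here are placeholders; the real ones come from unrolling your polynomial over the data width, e.g. with a CRC generator tool):

// CRC[i+1] = f'(CRC[i]) ^ g'(data_in[i])
// g' depends only on the incoming data word, so it can be registered one cycle
// early; only f' and the final XOR remain inside the feedback loop.
// (The data path picks up one cycle of latency, but the resulting CRC is unchanged.)
module crc_split #(
    parameter CRC_W  = 32,
    parameter DATA_W = 64
)(
    input  wire              clk,
    input  wire              rst,
    input  wire [DATA_W-1:0] data_in,
    output reg  [CRC_W-1:0]  crc
);
    // Placeholder per-bit XOR networks - stand-ins for the ~16-term and ~32-term
    // equations described above.
    function [CRC_W-1:0] f_prime(input [CRC_W-1:0] c);
        f_prime = {c[CRC_W-2:0], c[CRC_W-1]} ^ c;
    endfunction
    function [CRC_W-1:0] g_prime(input [DATA_W-1:0] d);
        g_prime = d[CRC_W-1:0] ^ d[DATA_W-1 -: CRC_W];
    endfunction

    reg [CRC_W-1:0] g_reg;
    always @(posedge clk)
        if (rst) begin
            g_reg <= {CRC_W{1'b0}};
            crc   <= {CRC_W{1'b0}};
        end else begin
            g_reg <= g_prime(data_in);       // pre-computed: no feedback involved
            crc   <= f_prime(crc) ^ g_reg;   // critical loop: f' plus one final XOR
        end
endmodule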

This is exactly what I was referring to about taking the "non-critical" part of the critical loop out and pre-calculating it...

Avrum

Explorer

One strategy I've used (whose success is very application dependent) is to have a prediction circuit. You predict where your output will be in 2 loop iterations. Accounting for the extra latency, your predicted output then appears with the correct timing and you can use it as your normal feedback.

 

This is all very dependent on how accurate your prediction model will be and how much error tolerance you can accept.
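
A heavily simplified Verilog sketch of the idea (my illustration - the linear extrapolation is a hypothetical predictor; a real design would use a model suited to the application and its error tolerance):

// The pipelined loop body takes 2 cycles, so the true result is not ready in time
// to feed the next iteration. Instead, feed back a prediction of where the output
// will be, here a simple linear extrapolation from the last two results.
module predictive_feedback (
    input  wire               clk,
    input  wire               rst,
    input  wire signed [15:0] x,      // external input to the loop
    output reg  signed [31:0] y       // loop output (true value, 2-cycle latency)
);
    reg  signed [31:0] y_prev;
    wire signed [31:0] y_pred = y + ((y - y_prev) <<< 1);   // y + 2*(y - y_prev): 2 steps ahead

    reg signed [31:0] stage1;
    always @(posedge clk)
        if (rst) begin
            y_prev <= 32'sd0;
            y      <= 32'sd0;
            stage1 <= 32'sd0;
        end else begin
            stage1 <= y_pred + x;   // stage 1 of the (placeholder) loop work
            y      <= stage1;       // stage 2 of the loop work
            y_prev <= y;            // history for the predictor
        end
endmodule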

 

 


 

I was just reading on Wikipedia that the predictive pipelining you mention is being used to speed up CPUs - but is difficult and (as you say) sometimes not appropriate/successful. Thanks for the idea! I will consider using it to solve my problem.


Avrum,

 

Pre-calculating things was the solution to my timing problem.  Thanks for your additional contribution to this post that clarifies and emphasizes the importance of pre-calculation.

 

 
