Observer uncle_tim
Registered: ‎12-04-2015

What's the cost of excessive pipelining

In my application throughput is everything and latency is nothing. Therefore I naturally use a lot of pipeline stages.

But sometimes I am not sure how fine-grained my pipelines should be.

Should I use a flip-flop after every LUT?

 

To my knowledge routing would become simpler, because timing is more relaxed. The router can focus on the hard parts when everything else is relaxed.

Is that true?

 

What is the downside?

* Of course the flip-flops can't be used for something else, but I don't care, I have enough.

* I guess the power consumption would be higher, but I don't care, I have a good power supply.

* Is there anything else? For example, does a high load on the clock network increase clock uncertainty and therefore negatively affect timing? Or is there something subtle I have missed?

 

 

I am trying to clock a Kintex-7 at 500 MHz.

Accepted Solutions
Guide avrumw
Registered: ‎01-23-2009

Re: What's the cost of excessive pipelining

In the Xilinx FPGAs, there are two flip-flops for each LUT (one for the O6 output and one for the O5 output). This generally means that there are "more than enough" flip-flops for pipelining - you can afford to have one flip-flop after every LUT.

 

Generally, this will give you the fastest possible clock speed and hence the fastest throughput IF it is architecturally possible.

 

Arbitrary pipelining is only possible if you are doing a pure datapath with absolutely no feedback (or data driven conditions). If that is the case with your design, then, yes, pipeline it all the way.
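To illustrate why feedback limits pipelining, here is a deliberately simplified Python model. It assumes a single data stream in which each new input depends on the previous result (for example a running sum), and that the feedback loop contains `loop_depth` registers. The function names and the fixed 500 MHz clock are illustrative assumptions, not anything from a Xilinx tool.

```python
def results_per_cycle(loop_depth, has_feedback):
    """Steady-state results per clock cycle for one data stream.

    Simplified model: a feed-forward pipeline accepts a new input every
    cycle regardless of depth; a loop with `loop_depth` registers must
    wait `loop_depth` cycles before the next dependent input can enter.
    """
    if loop_depth < 1:
        raise ValueError("loop_depth must be at least 1")
    return 1.0 / loop_depth if has_feedback else 1.0


def throughput_mresults_per_s(fclk_mhz, loop_depth, has_feedback):
    """Throughput in millions of results per second at a fixed clock."""
    return fclk_mhz * results_per_cycle(loop_depth, has_feedback)


# Same clock and same depth, with and without a dependency loop.
print(throughput_mresults_per_s(500, 4, has_feedback=False))  # 500.0
print(throughput_mresults_per_s(500, 4, has_feedback=True))   # 125.0
```

In this model, adding stages to a pure datapath costs only latency, while adding stages inside a feedback loop divides the throughput of that stream by the loop depth.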

 

The ratio of FFs to LUTs in the FPGA is designed specifically for this reason: to reach the maximum clock frequencies, you need to pipeline this finely. I think Xilinx actually bases the "maximum frequency" metric on a design that has no more than one LUT and some number of carry chain elements (each doing 4 bits) between pipeline stages. For the Kintex-7, 500 MHz is pretty close to this "maximum frequency" limit (depending on speed grade).
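As a rough sketch of why fewer LUT levels between registers allows a faster clock, the register-to-register period can be modelled as clock-to-out delay, plus one LUT delay and one routing delay per logic level, plus setup time. All delay values below are hypothetical placeholders chosen only to show the trend. They are not Kintex-7 datasheet numbers; real values should come from the timing report for your device and speed grade.

```python
def fmax_mhz(lut_levels, t_cko_ns=0.4, t_lut_ns=0.1, t_net_ns=0.5,
             t_setup_ns=0.1):
    """Estimate the maximum clock frequency of a register-to-register path.

    `lut_levels` is the number of LUTs between two flip-flops. Every
    delay default is a made-up placeholder, not a datasheet value.
    """
    if lut_levels < 1:
        raise ValueError("lut_levels must be at least 1")
    period_ns = t_cko_ns + lut_levels * (t_lut_ns + t_net_ns) + t_setup_ns
    return 1000.0 / period_ns


# The estimate falls as more LUT levels are placed between registers.
for levels in (1, 2, 3, 4):
    print(f"{levels} LUT level(s): about {fmax_mhz(levels):.0f} MHz")
```

The fixed clock-to-out and setup overhead is paid once per stage, so the gain from each removed logic level shrinks as the stage gets shorter, but one LUT per stage is still the fastest point in this model.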

 

You will also need to be very careful getting to and from the big blocks like the BRAMs and DSP48s - at these frequencies you need to budget multiple pipeline stages for the routing to and from these blocks.

 

Yes, all of this will use tons of FFs, and yes it will burn power, but this is the way to get maximum throughput. The load in terms of flip-flops does not affect clock uncertainty (the clock trees are fully buffered and are already designed on the die).

 

Of course, you are going to need a KILLER heat sink (or heat pipe) on this thing to keep it under the maximum temperature... If you are planning on using a significant percentage of the resources you will need to do careful power analysis to determine the cooling requirements.

 

Avrum

Observer uncle_tim
Registered: ‎12-04-2015

Re: What's the cost of excessive pipelining

I just ran a quick test: the additional FFs don't even consume much power. Dynamic power increases by just 5% because of "over-pipelining".

But one thing I noticed: the design can consume more LUTs, because XST is more restricted in its optimizations.

 

 

Guide avrumw
Registered: ‎01-23-2009

Re: What's the cost of excessive pipelining

"But one thing I noticed: the design can consume more LUTs, because XST is more restricted in its optimizations."

 

When you infer a flip-flop (with register retiming turned off), the flip-flop can't be moved. That means the tool cannot merge logic from one pipeline stage with the logic in the stage before or after it.

 

So depending on what you code as the combinatorial function between pipeline stages, you may only partially use the capabilities of the LUT6. For example, if one pipeline stage used only 3 inputs (hence a LUT3) and the next pipeline stage also used only 3 (or 4 if you include the flopped output of the first stage), it would take 2 LUTs. If both were in the same pipeline stage, the combined function would have 6 inputs and would take only one LUT.

 

So, yes, your LUT count will increase as you pipeline, unless you pipeline "perfectly" and implement logic that fully utilizes every LUT6 in each pipeline stage.
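The two-stage example above can be written out as a small counting sketch in Python. It assumes every stage is a single-output function that fits in one LUT6, and the signal names (`a` to `f`, `s1`) are made up for illustration. It only counts LUTs; it does not model what XST actually does.

```python
LUT_INPUTS = 6  # a LUT6 can implement any function of up to six inputs


def luts_when_pipelined(stages):
    """Count LUTs when a register follows each stage, so stages cannot merge.

    `stages` is a list of sets of input signal names, one set per
    single-output stage. Each stage must fit in one LUT6.
    """
    for inputs in stages:
        if len(inputs) > LUT_INPUTS:
            raise ValueError("stage does not fit in a single LUT6")
    return len(stages)


def luts_when_merged(stages, registered_signals):
    """Count LUTs if the intermediate registers were removed.

    Signals in `registered_signals` become internal to the merged
    function, so they no longer count as LUT inputs.
    """
    merged_inputs = set().union(*stages) - set(registered_signals)
    if len(merged_inputs) > LUT_INPUTS:
        raise ValueError("merged function does not fit in a single LUT6")
    return 1


stage1 = {"a", "b", "c"}        # first stage: 3 inputs
stage2 = {"s1", "d", "e", "f"}  # second stage: 3 inputs plus flopped s1

print(luts_when_pipelined([stage1, stage2]))       # 2
print(luts_when_merged([stage1, stage2], {"s1"}))  # 1
```

The merged count is 1 only because the six distinct inputs happen to fit one LUT6; with a seventh input the merged function would no longer fit and this simple model raises an error instead of guessing.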

 

Avrum
