Adventurer

re-optimize after modifying netlist

I have a post-route design which doesn't meet timing.  I'm considering promoting some of the nets to BUFG.

Let's say I use a command in the TCL Console such as:

set_property CLOCK_BUFFER_TYPE BUFG [get_nets {path-to-net}]

What's the correct process, after making a change on the post-route netlist, to cause my change to take effect? Do I then re-run placement?

Historian

Re: re-optimize after modifying netlist

That won't do anything. The CLOCK_BUFFER_TYPE property changes the type of buffer inserted on a net, assuming the tool needs to insert a buffer in the first place. Buffer insertion is done for structural reasons, not timing reasons. In other words, if the tool sees a primary input that fans out to the clock pins of flip-flops, it knows that the net is a clock and hence needs a clock buffer. The type of clock buffer is determined by CLOCK_BUFFER_TYPE, with the default being BUFG.
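
For illustration, a hedged sketch of where the property does apply - on a net the tool will buffer anyway, such as a clock input (the net names here are hypothetical):

  # XDC/Tcl sketch (hypothetical net names): the property only matters where
  # the tool is inserting a clock buffer itself
  set_property CLOCK_BUFFER_TYPE BUFG [get_nets sys_clk]    ;# the default
  set_property CLOCK_BUFFER_TYPE NONE [get_nets bypass_clk] ;# suppress insertion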

So you can't do it this way.

Which doesn't mean that it can't be done - it can - you would use the "ECO" commands (disconnect_net, create_cell, connect_net...). Once you have made all the changes, all you would have to do is route the affected nets (with route_design).
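
A minimal sketch of that ECO flow, assuming a hypothetical high-fanout net named hf_en (all cell and net names below are made up for illustration):

  # ECO sketch (hypothetical names): drive the loads of a high-fanout net
  # through a freshly created global buffer
  set hf_net [get_nets hf_en]
  set loads  [get_pins -of_objects $hf_net -leaf -filter {DIRECTION == IN}]

  create_cell -reference BUFG eco_bufg         ;# new global buffer cell
  create_net  eco_bufg_out                     ;# net for its output

  disconnect_net -net $hf_net -objects $loads  ;# detach the existing loads
  connect_net -net $hf_net      -objects [get_pins eco_bufg/I]
  connect_net -net eco_bufg_out -objects [concat [get_pins eco_bufg/O] $loads]

  place_design                                 ;# places only the new, unplaced cell
  route_design -nets [get_nets {hf_en eco_bufg_out}]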

But I am pretty sure it won't accomplish anything. The tool is able to use unused global buffers for high fanout nets if it determines that this will help timing. The only reasons it wouldn't have done so are that either there are no BUFGs available, or it decided that it wouldn't improve timing. Forcing it to do so (especially as a "post-processing" step) is almost certainly going to make timing worse.

You don't tell us what family you are using (7 series or UltraScale/UltraScale+), but particularly in the 7 series the global clock net is slow - it is routed to be able to reach all clocked cells in the design with low skew - this means (by definition) high insertion delay. The delay through the BUFG and global clock network can easily be 2-3ns (depending on the part and speed grade). This means that using the BUFG for high fanout nets becomes useless at a certain frequency - if your design is running at 300MHz, then the 3.33ns period cannot accommodate a 3ns delay on the BUFG and global clock net. So using the BUFG for high fanout nets is only viable at lower frequencies...
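
If you do try it, one way to see the insertion delay you are actually paying (the buffer instance name is hypothetical):

  # hedged sketch: report the worst paths through the inserted buffer
  report_timing -through [get_pins eco_bufg/O] -max_paths 3 -file bufg_paths.rpt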

Avrum

Adventurer

Re: re-optimize after modifying netlist

Hi Avrum,

Interesting - and I get your point about the insertion delay of using a BUFG.

I'll add additional background in case it prompts any other ideas.  I have a highly congested design (level 7) on a US+ device with around a 4.5ns clock period.  I actually have two versions of the design, one which has a reasonable utilization (~32% LUT and FF) and the other much higher.  It's a highly interconnected design with very large arrays, which I'm sure is the reason for the congestion.  There are a couple of enable signals which fan out to many thousands of nodes.  I tried pipelining and duplicating the enables, which helped, but also probably feeds into the congestion.  So, my attempt at BUFG was to reduce the effect of the fanout in another way.

Xilinx recommends not messing with max fanout, although I've found the only way I can place and route the design is if I am careful with my selection of max fanout in the synthesis stage.  Otherwise, apparently, the congestion is too much for implementation to handle, and duplicating in the later stages cannot reach a legal route without overlapping nodes.

I've tried various synthesis and implementation "strategies" and found a couple that work better than others for the design, and I've only found one synthesis strategy so far (PerfOptimized_high) which can ultimately lead the implementation to a successful route.

All that to say, any wisdom is welcome.  Perhaps BUFG isn't the way to go, but I thought that, given the very high fanout, access to the clock network could satisfy many timing paths without the duplication that adds to the already bad congestion.

Historian

Re: re-optimize after modifying netlist

So, first, managing fanout at the synthesis stage is not recommended. Fanout management should be done in the implementation and routing stage - the BUFG insertion is done as part of place_design in the Post-Placement Optimizations (unless -no_bufg_opt is set), and other fanout improvements are done during post-placement phys_opt_design - make sure you are using strategies that do phys_opt_design both before and after routing.
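
As a sketch, this is the shape of a flow that leaves fanout management to implementation (a non-project example; the top name, part, and directives below are placeholders, not recommendations):

  # hedged non-project flow sketch (placeholder part and directives)
  synth_design -top my_top -part xcvu9p-flga2104-2-i  ;# no synthesis fanout limit
  opt_design
  place_design                                        ;# includes BUFG insertion
  phys_opt_design -directive AggressiveExplore        ;# post-placement opts
  route_design
  phys_opt_design -directive AggressiveExplore        ;# post-route opts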

Large arrays (presumably large BRAM structures) can be a big problem for placement/routing. When working with block RAMs you need to budget for lots of pipelining. Depending on the size of the structure and how many such structures there are in a given module, I will consider:

  • Have all data_in, address and control signals come from flip-flops
  • Have an additional pipeline stage to get these signals to the RAM (optional, but good if you can)
  • Allow 2 clock cycles for a RAM read (to use the DO_REG)
  • Pipeline the read data on the way back
  • Capture the read data directly in a flip-flop

This has an effective read latency of 5 clocks - which is huge, but if you can afford it, then it will ease some of the congestion around large RAMs. You can also play with the RAM cascading options in UltraScale, but I have had mixed results with letting the tool do this for me...
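
One quick sanity check on the DO_REG point - a hedged sketch that flags block RAMs whose output registers are disabled (assuming UltraScale RAMB36 primitives and their DOA_REG/DOB_REG cell properties):

  # hedged sketch: find RAMB36 cells not using their output registers
  foreach c [get_cells -hier -filter {REF_NAME =~ RAMB36*}] {
      if {[get_property DOA_REG $c] == 0 || [get_property DOB_REG $c] == 0} {
          puts "output register disabled on $c"
      }
  }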

Good luck!

Avrum

Adventurer

Re: re-optimize after modifying netlist

Avrum,

Thank you for that information.  It is very useful.

In our case, the large arrays (60K+ bits) are FF based and not BRAM.  The reason is that the algorithm needs to be able to randomly access certain indexes across the whole array within the same cycle.  I tried BRAMs (though I certainly hadn't tried all the pipelining you mentioned), but representing what I needed with the same wide access required several shallow BRAMs in parallel, and overall it didn't reduce congestion.

I suspect that with the wide arrays and a lot of interconnection with some processing, it must create some kind of mesh of routing that, no matter what I try to do, stays highly congested.  In fact, most of the design stays on one of the two dice of the US+ we are using, because the arrays and interconnection must be too much for the number of SLLs available.

Any thoughts on reducing congestion if not BRAMs?  I figured pipelining and spreading.

Dan

Historian

Re: re-optimize after modifying netlist

Is the access truly random - can any combination of the data words be accessed on any cycle (together with any others)?

The memory hierarchy within the FPGA has three levels:

  • Flip-flops (1)
    • Essentially infinite read bandwidth since all FFs can be "read" on any cycle
    • Very inefficient in terms of area
    • Entirely distributed within the die (so the cell can be placed near the logic)
  • Block RAMs (3)
    • MUCH more efficient in terms of area
    • Significantly smaller bandwidth - two words (up to 36 or 72 bits) per RAM instance
    • Geographically dispersed around the die (so the cell may end up far from the logic that needs it, hence pipelining is needed)
  • SelectRAM, a.k.a. distributed RAM (2)
    • 64x (or 32x) the density of flip-flops
    • 1 (or 2) bit read per 64 bits - so substantially more bandwidth than BRAM
    • Also distributed within the die (so the cell can be near the logic that needs it)

So, the next obvious question is: you tried FFs (which cause huge utilization and congestion) and block RAM (which causes other problems - probably not enough bandwidth) - can your application use distributed RAMs?
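
If it can, the mapping can be requested per array with the RAM_STYLE attribute, either in RTL or in the synthesis XDC - a hedged sketch (the hierarchical cell name is a placeholder):

  # hedged XDC sketch: ask synthesis to map a specific array to LUTRAM
  set_property RAM_STYLE distributed [get_cells u_core/state_array_reg]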

Congestion can be one of the most difficult issues to deal with - it is sort of a 2nd order side effect of the FPGA architecture and how the tools implement the design in that FPGA architecture. Since the tools play a role in it, there are options (and strategies) to each of the processes that can change the behaviour of the tool in order to "affect" the congestion, but there is generally not a "simple" solution. These can be tried, and there are even some "brute force" approaches (like trying to introduce perturbations into the placement process), but these may not be sufficient.

Pipelining can be a double-edged sword in congestion issues. You are helping the timing paths meet timing, but increasing the number of nets, which increases congestion. The spreading directives are specifically there to help alleviate congestion issues - they have been the "go-to" solution for congestion since this started being a problem in the Virtex-6 days. While they can help, they don't usually help much...
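
For reference, a hedged sketch of how the spreading directives are invoked (the directive choice is a starting point, not a recommendation):

  # hedged sketch: placer directives aimed at spreading logic
  place_design -directive AltSpreadLogic_high   ;# also _medium / _low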

There may be some other things that can help: changing the clocking (maybe doing some of the work on a faster clock over more ticks), or reducing or eliminating resets (which contribute to congestion).
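
A quick way to see which control nets are the worst offenders (a hedged sketch):

  # hedged sketch: resets/enables usually top the high-fanout list
  report_high_fanout_nets -max_nets 20 -file hfn.rpt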

Better solutions usually involve looking at the problem to be solved and finding a better implementation - one that matches the architecture of the FPGA device better. Choosing the right memory is certainly one area to explore, but there are other things to look at, like finding ways of partitioning the problem into smaller sub-problems, maybe even recursively (some kind of tree architecture, which could reduce the "global" interconnectivity of the design). Without knowing the details of the application/algorithm being implemented, it's not really possible to explore these...

Avrum

Adventurer

Re: re-optimize after modifying netlist

Thank you for the well-worded response.

I added distributed RAM attributes to target several large arrays which were in the areas of congestion.  While it is not immediately obvious that synthesis honored those attributes (the arrays don't show up in the synthesis log's list of final distributed RAMs), in the end it lowered congestion and improved timing to some extent - though only after the "rip-up and reroute" phase, and only with a particular combination of synthesis and implementation strategies!
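
One way to confirm the mapping without relying on the log (a hedged, project-mode sketch):

  # hedged sketch: check "LUT as Memory" usage per hierarchy after synthesis
  open_run synth_1
  report_utilization -hierarchical -file hier_util.rpt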

You are right, dealing with congestion is a delicate process with trade-offs, and it seems to require "experiments" to see which potential changes will actually help a specific design.  In the end the hard question is: for a challenging design, how much can the tool be expected to do before the right answer is to rearchitect the design (at least with certain features or limitations of the FPGA in mind)?

Xilinx Employee

Re: re-optimize after modifying netlist

Hi @daniel.cogan ,

What version of Vivado are you using?

Have you tried running report_qor_suggestions?

report_design_analysis will also show what type of congestion we are dealing with.

Does it fail setup timing after the placer stage?  (ignore the hold violations after placer unless they are large)
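
For example, from a post-placement checkpoint (the checkpoint name below is a placeholder):

  # hedged sketch: analyze the post-place state
  open_checkpoint post_place.dcp
  report_qor_suggestions -file rqs.rpt
  report_design_analysis -congestion -complexity -file rda.rpt
  report_timing_summary -delay_type max -file setup_timing.rpt  ;# setup only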

Xilinx Employee

Re: re-optimize after modifying netlist

Hi @daniel.cogan 

For congestion analysis, I would suggest page 6 of the UltraFast Quick Reference Guide (UG1292). For the community to help, please provide more details on the failure, along with the reports below.

  • implementation log
  • report_design_analysis -congestion -complexity -file RDA.rpt
  • report_qor_suggestions -file RQS.rpt
  • report_utilization -file utilization.rpt
  • report_timing_summary -file timing_summary.rpt

https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug1292-ultrafast-timing-closure-quick-reference.pdf#page=6
