TomH_BAH
Observer
Registered: ‎06-30-2020

Clock Network not passing placement


[Image: clock_network.png]

I'm trying to implement the clock network shown above in a Zynq UltraScale+ device using Vivado 2018.2, and I have been unable to get through place_design during implementation. I've been adjusting the constraints, but placement invariably fails with one of the following:

- sub-optimal placement for an MMCM-BUFG component pair, or
- sub-optimal placement for MMCM-BUFG-MMCM cascade.

 

My understanding of the clock resources is based on UG572 and UG949. I'd think the placer would understand to put both MMCMs in neighboring clock regions, yet that isn't the case when I look at the information in the implementation log.

Originally I tried to drive the twelve pairs of BUFGCEs directly from the MMCM, but I discovered that this overtaxed the clock distribution in that CMT. I hoped that inserting a pair of buffers after the MMCM would make it easier to fan out to the array of buffer pairs. The buffer array does not need matched skew, because a clock-crossing FIFO (Independent Clocks DRAM, built with FIFO Generator 13.2) is inserted at the edges of the logic that each buffer pair clocks.

 

Using LOC constraints to force the MMCMs into neighboring clock regions was unsuccessful. Setting CLOCK_DEDICATED_ROUTE constraints did not bypass the issue either. I haven't attempted to use both together because I'm trying to leave the tools some flexibility. At the moment I'm unsure how to proceed without reworking the design.

1 Solution

Accepted Solutions
avrumw
Guide
Registered: ‎01-23-2009

I'm pretty sure the problem is the number of BUFGCEs you are using. Each CMT has one MMCM, and this MMCM can reach the global buffers (of any type - BUFGCTRL, BUFGCE, BUFGCE_DIV) in the same clock region as the MMCM. However, there are only 24 in each clock region. The box on the right of your drawing is, therefore, all the clock resources in one clock region. And this doesn't include the two that you added (not sure why) before the 24, and probably one more for the clock feedback of the MMCM. This is just too many buffers.

This is forcing the tools to try and use a BUFG to route a clock from one clock region to the BUFGCEs in another, which is a sub-optimal placement.

I am not sure about the cascading of one MMCM to another. I don't know the rules for this in UltraScale/UltraScale+.

My guess is that trying to use all 24 is impossible (or at least very difficult) - try scaling it back to 20 (or maybe even 22) and see if this works - this will allow for the clock feedback on the MMCM and still fit within one clock region. And drive them directly from the MMCM - not via two extra BUFGs. Also, if you can, merge the two MMCMs together - have the first MMCM generate all your clocks - take in the 125MHz from the pin, generate 62.5MHz and 250MHz in the first MMCM and try and make your 22 clock gated outputs.
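The buffer counting above can be written out as a quick check. This is only a sketch of the arithmetic in this post, not anything Vivado runs: the function names are made up for illustration, and the single feedback buffer is the "probably one more" assumption from above.

```python
# Per-clock-region buffer budget, using only the figures quoted in this thread.
BUFFERS_PER_CLOCK_REGION = 24  # global buffers per clock region, as stated above


def buffers_needed(gated_outputs, intermediate_bufgs=0, feedback_bufgs=1):
    """Count the global buffers an MMCM would need in its own clock region.

    feedback_bufgs=1 is an assumption taken from "probably one more for the
    clock feedback"; set it to 0 if the feedback path uses no global buffer.
    """
    return gated_outputs + intermediate_bufgs + feedback_bufgs


def fits_in_one_region(gated_outputs, intermediate_bufgs=0, feedback_bufgs=1):
    """True if the buffer count stays within a single clock region."""
    total = buffers_needed(gated_outputs, intermediate_bufgs, feedback_bufgs)
    return total <= BUFFERS_PER_CLOCK_REGION


if __name__ == "__main__":
    cases = [
        ("original: 24 gated + 2 intermediate", 24, 2),
        ("scaled back to 22, driven directly", 22, 0),
        ("scaled back to 20, driven directly", 20, 0),
    ]
    for label, gated, extra in cases:
        total = buffers_needed(gated, extra)
        verdict = "fits" if fits_in_one_region(gated, extra) else "does not fit"
        print(f"{label}: {total} buffers, {verdict}")
```

With these assumptions the original arrangement needs 27 buffers against a budget of 24, while 22 or 20 gated outputs driven directly need 23 or 21. Whether the placer actually succeeds also depends on routing, which this count does not model.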

But this is not really how FPGA power control should be done. As you can see, there is limited clock gating capability in the FPGA (at least from one source). But there are other power optimizations that the tools can do - they have clock gates at various points in the clock distribution tree - notably the BUFCE_LEAF. This clock buffer cannot be manually instantiated, but the tool will use them during power optimization when it sees large groups of flip-flops using the same CE. So if you code this as two clocks with 12 CEs, where each CE goes directly to the CE input of all flip-flops in one "power domain" for each of the two frequencies, the tools should be able to do much of this gating in the BUFCE_LEAF. It won't be perfect, and it won't save as much power as turning off the entire tree (since power is consumed by the tree itself, not just the loads), and it may create timing problems on the CEs (which may need to be replicated), but it will still reduce power and won't cause the kinds of problems you are seeing.

Ironically, the 7 series FPGAs were probably better at this. Each clock region in the 7 series can be driven by up to 12 BUFHCEs, where each of the 12 BUFHCEs can be driven by any one of the BUFGs - so you could instantiate just two BUFGs (one for 62.5 and one for 250) and then have up to 12*N gated clocks driven by these two clocks, where N is the number of clock regions in your device.
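To put a number on the 12*N figure, here is a small sketch of the same arithmetic. The function name and the example region count are hypothetical, chosen only for illustration; check the actual number of clock regions for your part.

```python
# 7 series gated-clock capacity, using the per-region figure quoted above.
BUFHCE_PER_CLOCK_REGION = 12  # BUFHCEs available in each 7 series clock region


def max_gated_clocks(num_clock_regions):
    """Upper bound on BUFHCE-gated clocks across the device (12 * N)."""
    if num_clock_regions < 0:
        raise ValueError("number of clock regions cannot be negative")
    return BUFHCE_PER_CLOCK_REGION * num_clock_regions


if __name__ == "__main__":
    example_regions = 8  # hypothetical device size, not taken from this thread
    print(f"{example_regions} clock regions -> "
          f"up to {max_gated_clocks(example_regions)} gated clocks")
```

Keep in mind this is only an upper bound on the count: each BUFHCE gates the clock for its own clock region, so the logic in a given "power domain" has to sit in the region whose BUFHCE gates it.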

Avrum


5 Replies
dpaul24
Scholar
Registered: ‎08-07-2014

@TomH_BAH ,

I am trying to understand why a simpler clocking mechanism such as the one below is not implemented.

MMCM -- 62.5M -- BUFGCEs

             -- 250M -- BUFGCEs

Which is the problem area? The 1st MMCM with the BUFGCE, or the 2nd MMCM and BUFGCEs?

It is better to let the tools insert clock buffers (especially after MMCMs/PLLs). I rarely hand-instantiate them in the RTL unless I have a good reason to insert them manually.

TomH_BAH
Observer
Registered: ‎06-30-2020

The first MMCM is part of an example design my team uses as a foundation for projects. The second MMCM is there to help make the project-specific portions more portable to other example designs. The other hope is to use the array of BUFGCEs (controlled via an AXI register) to disable portions of logic to better control power use. I'm not trying to optimize for this project so much as build a proof of concept for this use and arrangement of clock resources. There may come a day when my team has to cascade MMCMs in such a way, so figuring these things out now will help future projects as well.

The problem seems to center on the 2nd MMCM and the adjacent buffers. A run with no intervening constraints gives an error of Place 30-718: Sub-optimal placement for MMCM-BUFGCE-MMCM cascade. Using CLOCK_DEDICATED_ROUTE constraints gives similar errors. Using LOC constraints on the MMCMs also produces similar errors. Any combination thereof isn't producing a result that is noticeably different.

I'd like to avoid hand placing every instance because that seems counter to aiming for portability, and also feels like it would be a clumsy solution. I don't think I'm doing something that would be unachievable, but that's where my understanding of the clock resources appears to fail.


TomH_BAH
Observer
Registered: ‎06-30-2020
avrumw wrote: "I'm pretty sure the problem is the number of BUFGCEs you are using. Each CMT has one MMCM, and this MMCM can reach the global buffers (of any type - BUFGCTRL, BUFGCE, BUFGCE_DIV) in the same clock region as the MMCM. However, there are only 24 in each clock region. The box on the right of your drawing is, therefore, all the clock resources in one clock region. And this doesn't include the two that you added (not sure why) before the 24, and probably one more for the clock feedback of the MMCM. This is just too many buffers."

My understanding lines up with this, and it is spelled out several times in the user guides I referenced in the original post. My first attempt was to drive the 24 buffers with the second MMCM, but that was clearly taking up more routing resources than were available. There is no requirement for all 24 to occupy the same clock region, so by inserting the two buffers in between, I was hoping to let the tools place the final buffers where routing resources are more abundant and to reduce congestion around the MMCM.

 

avrumw wrote: "Also, if you can, merge the two MMCMs together - have the first MMCM generate all your clocks - take in the 125MHz from the pin, generate 62.5MHz and 250MHz in the first MMCM..."

This is an option, and actually is the case for where I started. However, like I said, the goal is to try and make this cascade work, even if it is less resource efficient. As far as I'm aware it's perfectly legal and realistic to cascade MMCMs, so I'd like to stay focused on what is necessary to do that, or fill the gap in my knowledge about why it isn't possible. I believe the sub-optimal placement is also acceptable because I'm putting clock crossing circuits around the stamped out logic receiving the 24 buffers. I did consider the CE on the leaf buffers, but my concern was the high fanout and that I can't guarantee the tools are using them.

 

I'll attempt scaling down to see if that alleviates the problem.

 

 

TomH_BAH
Observer
Registered: ‎06-30-2020

After scaling things down to using 8 total buffers at the edge of the clock network, it still would not pass placement. On a sneaking suspicion I turned off incremental compile, and was finally able to pass placement, but I've been in routing for the last three hours. Unless some other information comes up, I'm going to conclude that the original goal isn't realistically achievable, and I'll transition to trying to infer the CE on the leaf buffers.
