servdo
Visitor

Spartan 6 Block RAM timing closure


Hello,

I inherited a FIR design with 96% BRAM utilization (mainly due to 24 Core Generator 12x8k buffers) and 25% DSP utilization. I need to speed the main processing clock up from 150 MHz to 200 MHz, which caused endless ISE timing violations: I fix them in one place (adding registers, pipelining, SmartXplorer, etc.) and they pop up in another... Now I am stuck with negative-slack errors on the BRAMs, caused by routing delays, as the elements of a buffer are spread over a large chip area (confirmed with FPGA Editor). I wonder what would be the right way to constrain a BRAM buffer, so that its elements stay close to each other?

The design right now has no area constraints.

Thank you in advance,

Serguei

 

 

avrumw
Guide

 I wonder what would be the right way to constrain a BRAM buffer, so that its elements stay close to each other?

So, you can't.

BRAM cells are big cells that are scattered in different columns across the device. As a result it can be difficult to build larger buffers. In your case, each of your buffers requires 8 BRAMs, each being 1k x 16 (I presume your buffers are 8k x 12: each word is 12 bits and there are 8k of them). Where possible, these will be mapped to the same column, but there are a finite number of RAMs in a column; with 96% usage, unless you get really lucky, one or more of the groups of 8 will have to span multiple columns. The columns are pretty far apart, and any logic that drives/receives signals to/from the entire bank will end up getting pulled between these columns.

So the only way to fix this is "pipeline, pipeline, pipeline". Make sure your address, control and write data come directly from flip-flops, and, if you can tolerate the latency, have another set of pipeline flip-flops between the address/control flip-flops and the RAMs. Similarly the read outputs will need to be MUXed together; use the DOA/DOB registers in the RAMs (which increases the read latency to two clocks), maybe even have another set of flip-flops, and then do the MUX and bring them to a third set of flip-flops. Of course this assumes you can tolerate the latency - the entire read latency for this system will end up being like 5 clock cycles. If you can't afford the latency, it is possible that this is simply unimplementable at this speed.
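To illustrate the idea, here is a minimal Verilog sketch of that input/output pipelining around one inferred 8k x 12 buffer. It is purely illustrative - the module name, signal names and port widths are assumptions based on the description above, not your actual code:

module pipelined_buffer (
    input  wire        clk,
    input  wire        we,
    input  wire [12:0] addr,   // 8k deep -> 13 address bits (assumed)
    input  wire [11:0] din,    // 12-bit words, as in the 12x8k buffers
    output reg  [11:0] dout
);
    // Stage 1: address, control and write data come directly from flip-flops
    reg        we_r;
    reg [12:0] addr_r;
    reg [11:0] din_r;

    // Stage 2: optional extra pipeline stage between those flip-flops and
    // the RAMs (costs one more clock of latency, buys routing slack)
    reg        we_rr;
    reg [12:0] addr_rr;
    reg [11:0] din_rr;

    always @(posedge clk) begin
        we_r  <= we;    addr_r  <= addr;    din_r  <= din;
        we_rr <= we_r;  addr_rr <= addr_r;  din_rr <= din_r;
    end

    // The buffer itself; synthesis maps this to block RAM
    reg [11:0] ram [0:8191];
    reg [11:0] ram_q;                       // synchronous-read register

    always @(posedge clk) begin
        if (we_rr)
            ram[addr_rr] <= din_rr;
        ram_q <= ram[addr_rr];
    end

    // Final output register; the tool can pack this into the BRAM's
    // optional DOA/DOB output register or leave it in fabric
    always @(posedge clk)
        dout <= ram_q;
endmodule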

Even with all this pipelining, it may not be enough. Likely each of the 8 RAMs in a group shares the same address bus, so the flip-flops that drive the address need to reach all 8 RAMs. If these are in different columns, the tools likely can't find a placement for those flip-flops that reaches both sets of RAMs in time. As a result, you may need multiple redundant sets of these address flip-flops - one for each group of 4 or even 2 RAMs (this is best done in the extra pipeline stage: have your one set of address FFs drive two or four sets of pipeline registers in parallel, and then have each of these drive four or two RAMs). However, the tools won't like this: they will remove the redundant registers during synthesis, so you will have to force the tools to keep them with a DONT_TOUCH attribute and/or place everything in multiple levels of hierarchy and ensure that flatten_hierarchy is set to NONE.
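As a hedged sketch of that replication idea (again, all names are assumptions, and the exact attribute needed to stop the tools from merging the copies depends on your synthesis settings), the 8k buffer could be split into eight explicit 1k subgroups, each with its own copy of the pipelined address/control registers, and the read data brought back through a registered MUX:

module split_buffer (
    input  wire        clk,
    input  wire        we,
    input  wire [12:0] addr,       // addr[12:10] selects one of 8 subgroups
    input  wire [11:0] din,
    output reg  [11:0] dout
);
    // One common set of input flip-flops
    reg        we_r;
    reg [12:0] addr_r;
    reg [11:0] din_r;
    always @(posedge clk) begin
        we_r   <= we;
        addr_r <= addr;
        din_r  <= din;
    end

    wire [8*12-1:0] sub_q;         // registered read data from each subgroup
    reg  [2:0]      sel_r, sel_rr; // subgroup select, delayed to match latency

    genvar g;
    generate
        for (g = 0; g < 8; g = g + 1) begin : SUBGRP
            // Replicated per-subgroup pipeline registers; KEEP asks the
            // synthesizer not to merge the "redundant" copies
            (* KEEP = "TRUE" *) reg        we_p;
            (* KEEP = "TRUE" *) reg [12:0] addr_p;
            (* KEEP = "TRUE" *) reg [11:0] din_p;
            always @(posedge clk) begin
                we_p   <= we_r;
                addr_p <= addr_r;
                din_p  <= din_r;
            end

            // 1k x 12 slice of the buffer, mapped to its own BRAM
            reg [11:0] ram [0:1023];
            reg [11:0] q;
            always @(posedge clk) begin
                if (we_p && addr_p[12:10] == g)
                    ram[addr_p[9:0]] <= din_p;
                q <= ram[addr_p[9:0]];
            end
            assign sub_q[g*12 +: 12] = q;
        end
    endgenerate

    // Delay the subgroup select to line up with the read latency, then MUX
    // the eight read buses into a final output register
    always @(posedge clk) begin
        sel_r  <= addr_r[12:10];
        sel_rr <= sel_r;
        dout   <= sub_q[sel_rr*12 +: 12];
    end
endmodule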

Driving 96% of your BRAMs at 200 MHz in a Spartan-6 is going to be challenging. You will have to fight with it using techniques like the ones above. Pure placement constraints are not likely to help you - the tools already know they need to keep the RAMs near the logic, and they are trying to, but when you are using 96% of the RAMs they can't come up with a solution that manages this for all the groups of RAMs.

Avrum

servdo
Visitor

Thank you for your help, Avrum!

I added some MAXDELAY constraints to the UCF, targeting those BRAM net errors, e.g.

NET "*ram_doutb*" MAXDELAY = 3 ns;

and it seems to be helping quite a bit: I'm getting a timing score < 500. What is your opinion on this approach?

regards,

Serguei

avrumw
Guide
Accepted Solution

I added some MAXDELAY constraints to the UCF, targeting those BRAM net errors, e.g.

I generally don't recommend this. What this does is over-constrain the paths (since the normal required delay at 200 MHz would be 5 ns). In general this just squeezes certain paths at the expense of other paths. So while you may have reduced the worst violation, it is likely that you have created more violations elsewhere - what is the overall timing score?
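To make the arithmetic concrete in UCF terms (the clock net name below is illustrative, not from your actual constraints file): the 200 MHz clock constraint already limits these paths to 5 ns, and the MAXDELAY only tightens those particular nets further, at the expense of routing resources for everything else nearby:

# 200 MHz period constraint -> 5 ns available on these paths already
NET "clk_200" TNM_NET = "clk_200";
TIMESPEC TS_clk_200 = PERIOD "clk_200" 5 ns HIGH 50%;

# Layering a tighter MAXDELAY on top squeezes these nets down to 3 ns
NET "*ram_doutb*" MAXDELAY = 3 ns;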

In any case, this is not likely to get you to timing closure - I think you need to experiment with what I suggested. I have had this exact same problem in a Spartan-6 - high RAM utilisation with larger buffers (I was trying to create a single buffer that used 85% of the RAMs), and solved it using the mechanism I described - I broke the RAMs into smaller subgroups, and replicated the flip-flops for each subgroup. Once they were broken into smaller subgroups with adequate pipelining, the tools had no problem with them. 

Avrum

 

servdo
Visitor
The overall timing score dropped drastically after the MAXDELAYs: from 100k to below 500...