07-29-2015 10:32 AM
Synthesis is using the 'Block RAM Cascade chain' on a very fast clock, which consequently fails timing in P&R:
INFO: [Synth 8-5555] Implemented Block Ram Cascade chain of height 8 and width 8 for RAM U0/bufmem_0_reg
Is there any flag to prevent Synthesis from doing this?
07-29-2015 10:35 AM
How big is the RAM that you are trying to implement? If the RAM is deep and narrow, and you are not allowing for pipelining of the output, then the cascade path is the best way to implement the RAM. If, however, synthesis is implementing the RAM inefficiently (like using depth expansion rather than width expansion), then there may be a way to change that.
So, this all depends on the RAM - exactly what size RAM (width and depth) are you trying to implement?
07-29-2015 10:40 AM
type Tbufmem is array (0 to 32767) of std_logic_vector(7 downto 0);
signal bufmem_0 : Tbufmem;

attribute ram_style : string;
attribute ram_style of bufmem_0 : signal is "block";
This same code works fine (and meets timing) on Kintex-7, where no such hardware cascade is available. The problem arises only now that we target Kintex UltraScale.
07-29-2015 11:42 AM
So, this needs to be implemented as 8 block RAMs (since each block RAM is 32kbit if you aren't using the "parity" bit).
There are two ways to implement this
Depth expansion: 8 RAMs, each RAM is 4kx8
- this requires multiplexing the data from the 8 RAMs for readback, which in UltraScale is done using the cascade paths
Width expansion: 8 RAMs, each RAM is 32kx1
- each RAM is responsible for one bit of the data for the entire depth, so there is no MUXing/cascading
With UltraScale, it looks like it is choosing depth expansion, which is costing you timing. All other things being equal, this is actually the better implementation in terms of power (only one of the 8 RAMs needs to be active per access, so it consumes roughly 1/8th of the power of width expansion), but it costs in terms of timing.
Now the question is - how do we force Vivado to perform width expansion instead of depth expansion... Unfortunately, the answer is "I don't know..." I would have thought that Vivado would make the right choice based on timing requirements - are you sure your design is properly constrained at synthesis time? If there are no constraints on the output paths during synthesis, it would probably select depth expansion to save power.
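For example, a synthesis-time clock constraint in the XDC used by the synthesis run might look like this (the clock port name and period here are hypothetical - substitute your own):

```tcl
# Hypothetical example: give synthesis the real timing goal so it can
# trade power for speed when decomposing the RAM. Without a clock
# constraint, synthesis has no reason to avoid the slower cascade.
create_clock -name sys_clk -period 2.000 [get_ports clk]
```

Make sure this file is marked as used in synthesis (USED_IN_SYNTHESIS is true by default for XDC files added to the constraints fileset), not just in implementation.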
I don't see any attribute that controls how the larger RAM is built (choosing depth vs. width). There are probably lots of ways of forcing it to use width expansion. You could implement a "bit enable" on writes; since each RAM has only one write enable per byte, this would force it to use the RAMs in parallel (but the tools will try to optimize this out if it is trivially redundant). You could also break your RAM into 8 parallel RAMs in RTL; you could even go so far as to implement a 32kx1 RAM in a submodule and instantiate it 8 times in parallel with a generate statement - if you have "flatten_hierarchy" set to "none" then it probably won't be able to merge the RAMs.
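As a minimal sketch of the "8 parallel RAMs in RTL" idea (entity and signal names here are hypothetical - this is one way to describe the width-expanded structure explicitly, not the only one):

```vhdl
-- Hypothetical sketch: force width expansion by describing eight
-- parallel 32k x 1 RAMs, one per data bit, inside a generate loop.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity bufmem_wide is
  port (
    clk  : in  std_logic;
    we   : in  std_logic;
    addr : in  unsigned(14 downto 0);            -- 32768 deep
    din  : in  std_logic_vector(7 downto 0);
    dout : out std_logic_vector(7 downto 0)
  );
end entity bufmem_wide;

architecture rtl of bufmem_wide is
  type Tslice is array (0 to 32767) of std_logic_vector(0 downto 0);
begin
  gen_bits : for i in 0 to 7 generate
    -- One 32k x 1 memory per bit; each gets its own ram_style hint.
    signal slice : Tslice;
    attribute ram_style : string;
    attribute ram_style of slice : signal is "block";
  begin
    process (clk) is
    begin
      if rising_edge(clk) then
        if we = '1' then
          slice(to_integer(addr)) <= din(i downto i);
        end if;
        -- Synchronous (registered) read, as block RAM requires.
        dout(i downto i) <= slice(to_integer(addr));
      end if;
    end process;
  end generate gen_bits;
end architecture rtl;
```

Note that synthesis may still recognize and merge these back into a depth-expanded structure; as mentioned above, putting the 32kx1 RAM in its own submodule with flatten_hierarchy set to "none" is the more robust variant.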
But you have hit an interesting issue. There probably needs to be some attribute added to synthesis to allow the user to control this...
07-29-2015 12:18 PM
Thanks @avrumw. I can certainly work around it using one of the methods you described; I was hoping for an "easy way out". When I open the synthesized design it certainly seems to know about the clock. I have even increased it to 500 MHz to no avail.
04-13-2018 09:24 AM
Related to this issue, I am interested in being able to force a cascade chain implementation.
In my design, I was consistently getting a cascade implementation for module A, which seemed efficient and appropriate. I locked down placement of the moduleA blockrams for reliable timing closure. Then, in an unrelated part of the design, moduleB, I replaced a distributed RAM with a blockram. Now, for no obvious reason, the cascade implementation of moduleA (cascade heights: 8,4,2) has been replaced by a larger implementation with no cascade. And so my placement constraints were ignored.
I know I can write the constraints so that they are less sensitive to naming, but with 2 additional BRAMs to place, that is much harder to accommodate.
I don't understand how a minor change adding 1 more block RAM to one module would cause such a dramatic change in the implementation of an unrelated module (target device: VU7P). Ideally, I should be able to force these inferred block RAMs to be cascaded - maybe through something like a ram_style setting.
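For what it's worth, later editions of UG901 document a CASCADE_HEIGHT synthesis attribute for exactly this kind of control on UltraScale parts - a sketch, assuming your Vivado version supports it (the signal name here is hypothetical):

```vhdl
-- Hypothetical sketch: request a specific BRAM cascade height on an
-- inferred RAM. CASCADE_HEIGHT is documented in later UG901 editions;
-- check that your Vivado release supports it before relying on it.
attribute cascade_height : integer;
attribute cascade_height of moduleA_ram : signal is 8;
```

If the attribute is honored, this should keep module A's cascade structure stable regardless of what changes elsewhere in the design.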
Thanks for your help,