Doubt about BRAM

Visitor
Posts: 13
Registered: ‎03-12-2018

Doubt about BRAM

When we increase the size of the BRAM we are using, does the time required to access any location increase, or does it stay at 1 clock cycle? I noticed timing failures in my design after increasing the size of my BRAM from reg [63:0] mem[511:0] to reg [63:0] mem[2**14-1:0].

Scholar
Posts: 2,697
Registered: ‎04-26-2015

Re: Doubt about BRAM

It can still be one cycle, but the cycles have to get longer.

 

As you use more BRAM, Vivado has to spread them further and further across the chip. This implies that some of the RAMs will be further from your control/processing logic than previously. Obviously the signals will then take longer to get between the control/processing logic and the RAM, which results in timing failures.

 

Adding flip-flops in the middle means that Vivado can take a few cycles to get the signal across that distance, which then allows for each cycle to be shorter.
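
As a rough illustration of the idea, here is a minimal Verilog sketch of a register chain that could sit on a long route between the control logic and the RAM. The module name pipe_stage and its parameters are made up for this example and are not from the original design. The SHREG_EXTRACT attribute is a Vivado synthesis attribute; it is used here on the assumption that you want real flip-flops the placer can spread out, rather than a shift-register LUT.

```verilog
// Hypothetical helper: delays a bus by STAGES clock cycles (STAGES >= 1).
// Each register gives the placer an intermediate point on the route, so
// no single register-to-register hop has to span the whole distance.
module pipe_stage #(
    parameter WIDTH  = 14,  // e.g. the address width of a 2**14-deep RAM
    parameter STAGES = 1    // number of register stages to insert
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] d,
    output wire [WIDTH-1:0] q
);
    // Ask Vivado not to collapse the chain into an SRL primitive.
    (* shreg_extract = "no" *) reg [WIDTH-1:0] pipe [0:STAGES-1];
    integer i;

    always @(posedge clk) begin
        pipe[0] <= d;
        for (i = 1; i < STAGES; i = i + 1)
            pipe[i] <= pipe[i-1];
    end

    assign q = pipe[STAGES-1];
endmodule
```

Every signal heading to the same RAM port (address, data, enable) needs the same number of stages, otherwise they arrive out of step with each other.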

Visitor
Posts: 13
Registered: ‎03-12-2018

Re: Doubt about BRAM

So assuming I can't change the clock frequency, is using less BRAM the only solution?

Scholar
Posts: 1,133
Registered: ‎09-16-2009

Re: Doubt about BRAM

 

The first thing to do is try it and see.  What clock speeds are you talking about?

 

If you try, and it fails at the clock speed you're targeting (here, "fails" means it doesn't pass static timing analysis), then you may need to get creative.  And "creative" may not be limited to just "making your RAM shallower".

 

Regards,
Mark

Visitor
Posts: 13
Registered: ‎03-12-2018

Re: Doubt about BRAM

The clock is at 250 MHz, and it doesn't seem to work for that RAM size (I have 4 reg [63:0] [2**14-1:0] RAMs to implement a 256-bit write, 64-bit read RAM). I think access times are around 2.5 ns. I've added extra cycles for logic computation where it was failing by 0.3 ns, but it still fails by 0.02 ns on a simple write path where I've just added a {} (concatenation) operator, although report_timing says the logic level is 4. Can you suggest ways I can improve timing? I can add extra cycles, but I'm not sure where to add them.

Scholar
Posts: 1,133
Registered: ‎09-16-2009

Re: Doubt about BRAM

 

250 MHz does push you into "tricky" territory.  I've got similar RAMs in my UltraScale and UltraScale+ designs running at the same speed and the same depth, just a bit narrower (48-54 bits wide).

 

Have you read the UltraFast Design Methodology Guide (UG949)?  20 ps of negative slack isn't much - so you're close.  Getting those last few percent may take time.  UG949 has good guidelines for how to get there.

 

Regards,
Mark

Visitor
Posts: 13
Registered: ‎03-12-2018

Re: Doubt about BRAM

I'm quite new to programming FPGAs. My basic understanding is that my route delays are caused by cascading BRAMs, and that once the cascade grows beyond a certain point, accessing the RAM in one clock cycle becomes impossible. Right now I've barely managed to get it done, by a margin of 20 ps, but I've yet to add the code that computes the correct addresses for the reads and writes. I guess I could use another clock cycle for the read, but I don't know how to do that for the writes; maybe I need to buffer them in a FIFO.

How can I better understand the paths that cause the delays? I'm having a hard time understanding report_timing.

Scholar
Posts: 1,133
Registered: ‎09-16-2009

Re: Doubt about BRAM

 

It's pretty critical to be able to interpret the "report_timing" analysis in order for you to solve these problems.  Don't rely on "gut feels" or things you've heard about FPGAs with respect to route delays vs logic delays, etc...

 

The data is there in the timing report; you're going to need to be able to interpret those results.  If you've got specific questions on the timing report, ask here for clarification and we can help.  But in general, that should be your primary tool.

 

Regards,
Mark

Historian
Posts: 4,420
Registered: ‎01-23-2009

Re: Doubt about BRAM

The size of RAM you are using requires 32 RAMB36. The RAMB36 are relatively large cells and are arranged in columns. A single column of 32 RAMs is large enough so that you can't reach from the "middle" of the column to the ends in one clock cycle at your clock frequency; this means that the tool is having trouble dealing with any shared control/input/output signals.
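
For reference, the count of 32 can be checked with a few lines of arithmetic. The sketch below is a simulation-only Verilog module (the name ramb36_count is made up for this example) and assumes 32 Kb of usable data per RAMB36, i.e. the parity bits are not counted. It covers a single reg [63:0] mem[2**14-1:0] array; if a design holds several arrays of that size, the total scales accordingly.

```verilog
// Back-of-the-envelope RAMB36 count for one 2**14 x 64 memory array.
// Assumption: 32 Kb of data per RAMB36 (parity bits ignored).
module ramb36_count;
    localparam integer DEPTH       = 2**14;          // 16384 words
    localparam integer WIDTH       = 64;             // bits per word
    localparam integer TOTAL_BITS  = DEPTH * WIDTH;  // 1,048,576 bits
    localparam integer RAMB36_BITS = 32 * 1024;      // 32,768 data bits
    // Round up to a whole number of block RAMs: 1,048,576 / 32,768 = 32
    localparam integer NUM_RAMB36  =
        (TOTAL_BITS + RAMB36_BITS - 1) / RAMB36_BITS;

    initial $display("RAMB36 needed: %0d", NUM_RAMB36);
endmodule
```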

 

For example, if you have a single address for all 32 RAMs, the address has to fan out from the FFs that hold this address to the ADDRA/B ports of all 32 RAMB36 cells - this is clearly a problem.

 

In the past, I have manually broken this path and manually replicated these flip-flops. I also relied on hierarchy to help in this process, which requires flatten_hierarchy=none in synthesis. So, for example...

 

I would create a module with 8 block RAMs in it - these would be 8 of the 32. In this module, I would pipeline all the signals going into it with dedicated flops; each block with 8 RAMs would have FFs for write address, read address, write data, and control signals. It would also have any MUXing necessary to MUX the read data (although you could use 8 SDP RAMs in 512x64 and not have any MUXes to generate your 512 bit word), and a set of external flip-flops (not the DOA/DOB flip-flops in the RAM) for the read data.

 

Then instantiate 4 copies of this, each connected to the one set of flip-flops that actually generate your addresses, write data and control. On the output, MUX together the 4 output data and then go to a set of flip-flops.

 

The total latency of this system is now 2 clocks longer than using a single set of BRAMs (one extra pipe stage on the inputs and one extra on the outputs).

 

This allows the tool to only have to worry about distributing signals to groups of 8 RAMs instead of all 32 - this should fix your timing problems (at the cost of latency).
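
The structure described above could look roughly like the Verilog sketch below. This is one possible reading, not the exact code from any real design: the module names ram_bank and big_ram, the port names, and the choice to split the 2**14 depth into four banks of 4096 words (8 RAMB36 each) selected by the top two address bits are all assumptions made for illustration. It covers a single 64-bit-wide array with one write port and one read port. RAM_STYLE is a Vivado synthesis attribute.

```verilog
// One bank: 4096 x 64 = 8 RAMB36, with its own local input and output FFs.
module ram_bank #(
    parameter BANK_ID = 0             // which quarter of the address space
) (
    input  wire        clk,
    input  wire        we,
    input  wire [13:0] waddr,
    input  wire [63:0] wdata,
    input  wire [13:0] raddr,
    output reg  [63:0] rdata          // external read-data FFs
);
    // Dedicated pipeline FFs for every incoming signal
    reg        we_q;
    reg [13:0] waddr_q, raddr_q;
    reg [63:0] wdata_q;

    (* ram_style = "block" *) reg [63:0] mem [0:4095];
    reg [63:0] mem_q;                 // synchronous RAM read

    always @(posedge clk) begin
        we_q    <= we && (waddr[13:12] == BANK_ID);
        waddr_q <= waddr;
        wdata_q <= wdata;
        raddr_q <= raddr;

        if (we_q)
            mem[waddr_q[11:0]] <= wdata_q;
        mem_q <= mem[raddr_q[11:0]];
        rdata <= mem_q;
    end
endmodule

// Top level: inputs are assumed to come from the single shared set of FFs.
module big_ram (
    input  wire        clk,
    input  wire        we,
    input  wire [13:0] waddr,
    input  wire [63:0] wdata,
    input  wire [13:0] raddr,
    output reg  [63:0] rdata
);
    wire [63:0] bank_rdata [0:3];
    // Bank-select bits, delayed to line up with each bank's 3-cycle read path
    reg [1:0] sel_d1, sel_d2, sel_d3;

    genvar g;
    generate
        for (g = 0; g < 4; g = g + 1) begin : bank
            ram_bank #(.BANK_ID(g)) u_bank (
                .clk(clk), .we(we), .waddr(waddr), .wdata(wdata),
                .raddr(raddr), .rdata(bank_rdata[g])
            );
        end
    endgenerate

    always @(posedge clk) begin
        sel_d1 <= raddr[13:12];
        sel_d2 <= sel_d1;
        sel_d3 <= sel_d2;
        rdata  <= bank_rdata[sel_d3];  // output MUX followed by final FFs
    end
endmodule
```

In this sketch the read data appears four clocks after raddr is presented, compared with two clocks for a plain registered-output RAM, which matches the two extra cycles of latency mentioned above. Synthesis may still try to absorb the rdata register into the block RAM's own output register, so check the result in the implemented design.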

 

If you don't use flatten_hierarchy=none, synthesis will try to merge the four redundant copies of the pipeline FFs; with flatten_hierarchy=none, it cannot do this across hierarchical boundaries. If you do need to flatten the hierarchy, then you will have to play with the DONT_TOUCH attribute on these flip-flops to try and get the tools to leave them alone...
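
For what it's worth, DONT_TOUCH can be attached directly to a register declaration in Verilog. The fragment below is a minimal sketch; the module and signal names are invented for the example, and whether the attribute alone is enough to keep the replicated registers will depend on the rest of the flow.

```verilog
// Hypothetical per-bank input register, marked so that synthesis does not
// merge it with the identical registers in the other banks.
module bank_input_regs (
    input  wire        clk,
    input  wire [13:0] raddr,        // from the shared address FFs
    output wire [13:0] raddr_local   // local copy feeding this bank's RAMs
);
    (* dont_touch = "true" *) reg [13:0] raddr_q;

    always @(posedge clk)
        raddr_q <= raddr;

    assign raddr_local = raddr_q;
endmodule
```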

 

Avrum