12-28-2019 09:24 AM - edited 12-28-2019 09:25 AM
I am inferring a block RAM with SystemVerilog code. I want the BRAM's optional output registers to be used. Synthesis gives me exactly that, but during implementation Vivado pulls the registers out of the BRAM into separate flip-flops, which kills timing.
I am adhering to the recommended language template, initializing to 0 and not using a reset. I also read several threads discussing similar problems, but none of them offers a workable solution for an inferred block RAM. How can I force Vivado to keep the output registers in the BRAM?
module wram #(parameter
    adr_width      = 11,
    data_width     = 16,
    pipeline_delay = 2    // 1 = only synchronous ram, 2 = additional output register
)(
    input                      clk,
    input                      we,
    input  [adr_width-1 : 0]   wadr,
    input  [data_width-1 : 0]  di,
    input  [adr_width-1 : 0]   radr1,
    output [data_width-1 : 0]  do1
);

    if ((pipeline_delay != 1) && (pipeline_delay != 2))
        $error("pipeline_delay must be 1 or 2");

    reg [data_width-1 : 0] mem_do;        // ram data output
    reg [data_width-1 : 0] outreg = 0;    // output register (or wire to mem_do)
    logic [data_width-1:0] mem [(2**adr_width)-1:0];

    always_ff @(posedge clk) begin
        if (we)
            mem[wadr] <= di;              // write operation
        mem_do <= mem[radr1];             // read operation (synchronous)
    end

    always_ff @(posedge clk) begin
        outreg <= mem_do;                 // output register
    end

    generate
        if (pipeline_delay == 2)
            assign do1 = outreg;          // use output register
        else
            assign do1 = mem_do;          // bypass output register
    endgenerate

endmodule
12-28-2019 10:58 AM - edited 12-28-2019 11:00 AM
What's the device you're targeting?
Can you check that the output register of the BRAM can be initialised to 0 at start up? I thought only LUT registers could be initialised like this.
Can you check that if you put a few more registers on the output, then you get the expected result? Otherwise it could be the tools sucking the register into the IOB.
12-28-2019 01:05 PM
What's the device you're targeting?
Can you check that the output register of the BRAM can be initialised to 0 at start up?
Yes, in fact AR#64049 states that the output registers SHOULD be initialized with zeros.
Can you check that if you put a few more registers on the output
That is not a workable solution since I cannot afford to make the pipeline longer.
12-28-2019 02:41 PM
during implementation Vivado pulls the registers out of the BRAM into separate flip-flops
Are you sure? I have never seen (or heard of) the tool doing this before. I thought that once the flip-flops are pulled in to the BRAM in synthesis, nothing could pull them back out.
And how is it breaking timing? The tool is timing driven, so even if it could pull the FFs out, it would only do so to improve timing. If this made the BRAM to FF timing fail, then you can only assume that the FF to "other logic" timing path is even worse (or similarly bad). You should check this before investing lots of time diagnosing what's happening here - if both the path to and from the FFs are violating, then you have a bigger problem - the tools can't find a combination of placement of these intermediate pipeline flip-flops that can pass timing.
12-29-2019 12:47 AM
Are you sure?
Yes: in the schematic of the synthesized design, the data out of the RAM is connected directly to the multiplexer in the following stage, i.e. the FF is in the BRAM. However, in the schematic of the implemented design, the RAM is connected to a FF in the fabric and the mux comes only after that FF. So it seems pretty clear the tool is pulling the registers out of the BRAM.
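The same comparison can be made without the schematic by querying the block RAM cells in the Tcl console. This is only a minimal sketch: it assumes the inferred memory maps to RAMB18/RAMB36 primitives and that those cells expose the DOA_REG and DOB_REG properties, as the 7 series and UltraScale block RAM primitives do. Run it once with the synthesized design open and once with the implemented design open.

```tcl
# List every block RAM primitive in the open design and report whether
# its optional output registers are enabled (1) or bypassed (0).
# Assumption: the cells are RAMB* primitives with DOA_REG/DOB_REG properties.
foreach bram [get_cells -hierarchical -filter {REF_NAME =~ RAMB*}] {
    set doa [get_property DOA_REG $bram]
    set dob [get_property DOB_REG $bram]
    puts "$bram: DOA_REG=$doa DOB_REG=$dob"
}
```

If a value reads 1 after synthesis and 0 after implementation, the register was moved into the fabric somewhere in the implementation flow.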
I have never seen (or heard of) the tool doing this before. I thought that once the flip-flops are pulled in to the BRAM in synthesis, nothing could pull them back out.
The tool is doing this, and other people reported the same before, such as here: Vivado Pulls Registers out of BRAM. That thread was never resolved.
And how is it breaking timing? The tool is timing driven, so even if it could pull the FFs out, it would only do so to improve timing. If this made the BRAM to FF timing fail, then you can only assume that the FF to "other logic" timing path is even worse (or similarly bad).
It is breaking timing because the output delay of the RAM plus the route to the FF is 3ns combined when the output register is not used. I am aiming at 2.5ns. In the following pipeline stage I only have one level of logic and then a DSP slice with enabled input registers. I suppose that the path from a BRAM output register through one LUT to a DSP input register would not take 2.5ns as long as the LUT is reasonably placed. I would try it, but I cannot do anything as long as that output register is not being used, so this issue needs to be fixed first.
You are suggesting that the tool makes the best choice (timing-wise) because it is timing-driven. I don't believe it does. In this case, it's doing more harm than good. In the linked thread, several users agree that the placer makes bad choices regarding output registers and one user reports: "Unless one instantiates the BRAMs with a keep_hierarchy property or uses coregen IP, Vivado does this for you if it thinks pulling the register into the fabric is going to be beneficial timing-wise".
I don't mind that the placer sometimes makes wrong choices as long as I have a way to correct them. But how can I possibly tell the tool to keep the registers in the BRAM? I am already using the suggested template and it does not achieve what it's supposed to. Please don't tell me I have to instantiate primitives like Ken Chapman did 15 years ago because the tool is still not smart enough and at the same time too stubborn to accept help from a user trying to guide it :-(
12-29-2019 02:01 AM
12-30-2019 05:34 PM
If you enabled post-place phys_opt_design, it performs the following optimizations by default.
* high-fanout optimization
* placement-based optimization of critical paths
* critical-cell optimization
* DSP register optimization
* BRAM register optimization
* URAM register optimization
* a final fanout optimization
The BRAM register optimization improves critical path delay by moving registers from slices to block RAMs, or from block RAMs to slices.
BRAM and DSP are dedicated resources in the hardware and their placement is not as flexible as that of fabric logic, especially when there is routing congestion. I don't think a LUT in between is a good idea for a design running at up to 400MHz.
The tool looks at the overall timing and tries to improve WNS and TNS.
If you would like to prevent the optimization, you may add DONT_TOUCH property to these registers.
(*DONT_TOUCH="TRUE"*) reg [data_width-1 : 0] outreg = 0; // output register (or wire to mem_do)
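If the register really is being moved by the post-place phys_opt_design step, another option may be to run that step with an explicit list of optimizations that leaves out the BRAM register optimization. The sketch below is untested on this design. It assumes the option names match your Vivado version (check `phys_opt_design -help`), and it relies on phys_opt_design running only the listed optimizations once any are named explicitly.

```tcl
# Run after place_design and before route_design.
# The list mirrors the default optimizations named above, minus
# -bram_register_opt, so registers should not be moved between
# block RAMs and slices in this step.
# Assumption: these option names exist in the Vivado version in use.
phys_opt_design -fanout_opt -placement_opt -critical_cell_opt \
                -dsp_register_opt -uram_register_opt
```

Unlike DONT_TOUCH, this changes the behaviour for every block RAM in the design, so it is worth comparing WNS and TNS against a run with the default phys_opt_design.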