Observer
1,486 Views
Registered: ‎02-27-2017

Double pumped asymmetric SDP BRAM inference

Hi all,

 

I am having trouble with inferring and testing a double pumped asymmetric SDP BRAM spanning several primitives.

The goal is to have a BRAM, where both ports use the same clock, with the following configuration:

- write port data width = 1024 bits

- write port depth = 256 entries (8 address bits)

- read port data width = 128 bits

- read port depth = 2048 entries (11 address bits)

 

However, the closest BRAM primitive configuration is 512 entries with 64 bits data. Therefore I would like to double pump the BRAMs at 400MHz, while the rest of the system runs at 200MHz. By using eight BRAM primitives, a data width of 512 bits is achieved, and if we double pump we obtain 1024 bits. One bit of the address has to be toggled in order to read from two addresses within the same cycle (at 200MHz).
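To sanity-check the arithmetic, here is a small behavioural model of the scheme. It is written in Python rather than HDL and is only a sketch: the class name, the layout (64-bit chunk c of the 1024-bit word goes to primitive c // 2 at address phase c % 2) and the little-endian ordering of the 128-bit read words are assumptions for illustration, not something fixed by the tools.

```python
N_BRAM = 8            # primitives used side by side
WIDTH = 64            # data bits per primitive
DEPTH = 512           # entries per primitive
MASK = (1 << WIDTH) - 1


class DoublePumpedRam:
    """Behavioural model: 256 x 1024 write port, 2048 x 128 read port."""

    def __init__(self):
        self.bram = [[0] * DEPTH for _ in range(N_BRAM)]

    def write(self, addr, data):
        """Write one 1024-bit word as two 512-bit beats (one per fast cycle)."""
        assert 0 <= addr < 256
        for phase in (0, 1):                      # the toggled address LSB
            fast_addr = (addr << 1) | phase
            for i in range(N_BRAM):
                chunk = 2 * i + phase             # assumed layout
                self.bram[i][fast_addr] = (data >> (chunk * WIDTH)) & MASK

    def read(self, addr):
        """Read one 128-bit word as two 64-bit beats from a single primitive."""
        assert 0 <= addr < 2048
        row, sel = addr >> 3, addr & 7            # sel picks 1 of 8 primitives
        lo = self.bram[sel][(row << 1) | 0]       # first fast cycle
        hi = self.bram[sel][(row << 1) | 1]       # second fast cycle
        return (hi << WIDTH) | lo
```

With this layout, read address 8*a + k returns bits [128k, 128k+128) of the word written at address a, and the only selection logic on the read side is the choice of one primitive out of eight.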

I struggle with inferring such a configuration. Is it better to use a behavioural model as described in UG901, or to use the xpm_memory_sdpram template instead?

 

Also, I would like to synthesise this design and check if it passes timing. How would you test a module like this by itself to see if it is able to achieve the target frequency? When I set a clock constraint for a simple BRAM configuration based on the SDP template, timing reports 'inf'.

 

All the best.

 

 

4 Replies
Guide
1,471 Views
Registered: ‎01-23-2009

Re: Double pumped asymmetric SDP BRAM inference

This is not going to be easy...

 

I presume you realize that the tools won't do any of this so-called "double pumping" for you - you are going to need to write a portion of your design that runs at 400MHz.

 

This is going to need to be more than the RAM itself. When you cross from the 200MHz domain to the 400MHz domain, you are going to incur a penalty; assuming the two clocks come from the same MMCM then there is a difference of tSTATPHAOFFSET between the two outputs which is around 120ps. On top of this, this will go through two different BUFGs which will be timed with different on-chip variation.

 

In addition, the RAMs are in their own column, and using 8 of them is going to require some significant routing. As a result:

  - you will almost certainly not be able to go from the 200MHz domain to or from your RAMs directly; you will need to have the signals driving address, control and capturing data all on the 400MHz domain. You can cross between the 200MHz and 400MHz domains in flip-flops in the fabric.

   - the clock to output time of the RAM is quite slow. Getting from the output of the RAMs even to the nearest flip-flops at 400MHz is tough. If you use the output registers (DOB_REG), then you will have a better chance. Even with this, though, you will still need to go directly to another set of FFs at 400MHz in the fabric

 

All told, from the point of view of your 200MHz domain, you are going to incur a fair amount of latency:

  - address and control in FFs at 200MHz

  - to address and control in FFs at 400MHz (1 clock)

  - to RAM at 400MHz (1 clock)

  - Read latency of RAM (2 clocks with DOB_REG)

  - to FFs at 400MHz (1 clock)

  - to FFs at 200MHz (1 clock)

 

That makes a total of six 400MHz clocks, i.e. three 200MHz clocks. If you can't tolerate this latency, then you may not be able to do this. And this is pretty much a minimum - reaching 8 block RAMs may require pipelining and replication of the address flip-flops, and more pipelining as you get the data back from the RAM.
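The latency budget can be tallied with a few lines of Python, which also makes it easy to see what an extra pipeline stage costs. This is only a sketch: the stage names paraphrase the list above, and rounding up to whole 200MHz clocks is an assumption (the data can only be handed back on a slow-clock edge).

```python
# (stage, latency in 400 MHz clocks), following the list above
STAGES = [
    ("200 MHz FFs -> 400 MHz FFs", 1),
    ("400 MHz FFs -> RAM", 1),
    ("RAM read with DOB_REG", 2),
    ("RAM output -> 400 MHz FFs", 1),
    ("400 MHz FFs -> 200 MHz FFs", 1),
]
FAST_PER_SLOW = 2  # 400 MHz / 200 MHz


def total_latency(stages, extra_pipeline=0):
    """Return (fast clocks, slow clocks); slow clocks are rounded up."""
    fast = sum(clocks for _, clocks in stages) + extra_pipeline
    slow = -(-fast // FAST_PER_SLOW)  # ceiling division
    return fast, slow
```

For example, total_latency(STAGES) gives the six fast clocks and three slow clocks counted above, and a single extra pipeline stage on the address or data path pushes the result to four slow clocks.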

 

As for testing the timing, you will have to implement everything:

   - your clocks (including IBUFG, MMCM and two BUFGs)

   - the FFs at 200MHz as the starting and ending points

   - all the FFs at 400MHz in the middle

 

To really test timing, you need to do place and route - the placement and routing are critical here. To do that, you have two options:

  - wrap this whole thing in a set of wrapping flip-flops connected to pins

     - this also won't be easy - you need things connected to pins to make sure the tool doesn't optimize everything out

     - even a big package will not have enough pins for all of these signals

     - so you will have to consider something like an LFSR to drive the inputs and an XOR chain to combine the outputs together

  - do the synthesis, place and route "out of context"

     - this can only be done in non-project mode
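The LFSR-plus-XOR wrapper from the first option can be modelled in software to see how few pins it needs: one seed/enable path in, one parity bit out. The sketch below is Python, not HDL. The 16-bit tap set (16, 14, 13, 11) is a commonly used maximal-length choice, and the function names and the shift-register fan-out to the DUT inputs are assumptions for illustration. For a timing test the actual data values do not matter; the point is only that every input is driven and every output is observed, so nothing is optimized away.

```python
def lfsr16_step(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def stimulus(seed, n_inputs, n_cycles):
    """Yield one n_inputs-wide vector per cycle.

    The LFSR output bit feeds a shift register whose taps drive the inputs.
    """
    state, vec, mask = seed, 0, (1 << n_inputs) - 1
    for _ in range(n_cycles):
        state = lfsr16_step(state)
        vec = ((vec << 1) | (state & 1)) & mask
        yield vec


def xor_reduce(value):
    """Fold a wide output vector into one bit, as an XOR chain would."""
    return bin(value).count("1") & 1
```

In hardware the XOR chain itself would need registering every few levels so that it does not become the critical path of the test.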

 

Avrum

Observer
1,450 Views
Registered: ‎02-27-2017

Re: Double pumped asymmetric SDP BRAM inference

@avrumw Thank you very much for your extensive reply. A very interesting read!

I would like to pursue this approach. I am targeting a KU15P device.

 

I was wondering what your take is on the selection at the end. In essence, the eight BRAM primitives together make up 512 bits of data, 64 bits per primitive. When I write, I would like to write 1024 bits at a time (512 bits per 400MHz cycle), and when I read, I would like to read 128 bits (64 bits per 400MHz cycle). So a MUX is required to choose the output from one of the eight BRAM primitives. Do you think that a multi-cycle MUX in the 400MHz clock domain is possible? Do you have any suggestions for the number of FF levels in such a MUX? Or are there other approaches which would work better?

I was also thinking about interleaving the write data in order to configure each BRAM primitive to have a 64 bit write port and an 8 bit read port. Reading from all primitives will then result in my desired 64 bit output data without the need of a MUX.

 

I would love to hear your thoughts on this :)

Guide
1,429 Views
Registered: ‎01-23-2009

Re: Double pumped asymmetric SDP BRAM inference

So a MUX is required to choose the output from one of the eight BRAM primitives. Do you think that a multi-cycle MUX in the 400MHz clock domain is possible?

 

The MUX itself won't be much of a problem. An 8-1 MUX can be done with one level of logic and the MUXF7 (two bits can be done in each 7 series slice) which can comfortably run at 400MHz. The problem is just getting the data from the RAM to the MUX itself - you will certainly need a full clock of pipeline for the routing from the RAM, but the MUX itself can be done in one pipeline level.
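The structure of that MUX, per output bit, can be written down as a tiny functional model: two 6-input LUTs each act as a 4:1 MUX (four data inputs plus two select bits), and the MUXF7 picks between them with the third select bit. This is a Python sketch of the logic only, with made-up function names; it says nothing about timing.

```python
def lut6_mux4(d, s):
    """4:1 mux: four data inputs and two select bits fit one 6-input LUT."""
    return d[s & 3]


def muxf7(a, b, s):
    """Dedicated 2:1 mux that combines the outputs of two LUTs."""
    return b if s else a


def mux8(d, sel):
    """8:1 mux for one data bit: two LUT6s plus one MUXF7."""
    lo = lut6_mux4(d[0:4], sel & 3)
    hi = lut6_mux4(d[4:8], sel & 3)
    return muxf7(lo, hi, (sel >> 2) & 1)
```

For a 64-bit output this structure is simply repeated 64 times, with the three select bits fanned out to all of them.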

 

I was also thinking about interleaving the write data in order to configure each BRAM primitive to have a 64 bit write port and an 8 bit read port. Reading from all primitives will then result in my desired 64 bit output data without the need of a MUX.

 

That is also possible. However, again, the MUX isn't really the issue - it's all the routing.
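For the interleaved variant, the write-side shuffle amounts to an 8 x 8 byte transpose of each 512-bit beat. The Python sketch below shows one way to arrange it, assuming the narrow read port selects bytes of the 64-bit word in little-endian order and that the primitive supports the 64-bit write / 8-bit read aspect ratio as proposed. The function names are hypothetical.

```python
N = 8                      # primitives, and also bytes per 64-bit word
MASK64 = (1 << 64) - 1


def get_byte(word, idx):
    return (word >> (8 * idx)) & 0xFF


def interleave(word512):
    """Split a 512-bit write beat into eight 64-bit words, one per primitive.

    Byte c of primitive i holds byte i of 64-bit chunk c.
    """
    words = []
    for i in range(N):
        w = 0
        for c in range(N):
            chunk = (word512 >> (64 * c)) & MASK64
            w |= get_byte(chunk, i) << (8 * c)
        words.append(w)
    return words


def read_chunk(words, c):
    """Read byte c from every primitive's 8-bit port and concatenate them."""
    result = 0
    for i in range(N):
        result |= get_byte(words[i], c) << (8 * i)
    return result
```

Reading byte c from all eight primitives then returns 64-bit chunk c of the original beat directly. The shuffle is pure wiring on the write side, so it costs no logic, but the routing concern is unchanged because all eight primitives are still read on every access.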

 

Avrum

Observer
1,407 Views
Registered: ‎02-27-2017

Re: Double pumped asymmetric SDP BRAM inference

@avrumw Thanks again for the help! I have started with a somewhat slimmed-down implementation using only one BRAM primitive, but with your suggested clock domain crossing registers. The clock constraints are 200 and 400MHz respectively.

In simulation, this all works. Basically write data is double-pumped using a 2:1 MUX with a clock follower (see XAPP706) as the select signal. Since the goal is to present a single BRAM primitive as a 256 entry by 128 bit memory, an external read or write address is concatenated with a toggling LSB, using a similar 2:1 MUX with hardcoded '0' and '1' inputs.

 

Currently, the 128 bit write data input and the other control signals are routed from input pins to registers, which end up placed like a vertical string along the BRAM 'spine'. As expected, timing is not met after synthesis (nor after implementation). After synthesis, the WHS is -0.432ns, with 827 failing endpoints. I suspect that the clock signals are too far apart, due to the placement of logic close to all the input pins.

 

In your initial message, you mentioned that I should implement the IBUFG, MMCM and BUFGs. So far I have not configured any of these; I just let the tool run its course. Since I have no experience with implementing these resources, I was wondering what your suggestions are on how to implement the clocking and how to make this initial design pass timing. Or do you think that using an LFSR connected to a single input pin, branching out to all DUT input signals, followed by a set of FFs at 200MHz, will be enough for timing testing purposes?
