vishalbk
Visitor

Using DSP48E1 as memory address generators


In my design, the memory address generators are causing a lot of routing delay, and I read that one way to solve this issue is to use DSPs as memory address generators to prevent timing violations. Can anybody help me with this?

Accepted Solution

avrumw
Guide

I started writing my answer assuming the endpoints of these paths were block RAMs - I then realized they were distributed RAMs. I am going to keep my answer assuming they are block RAMs first, and then address distributed RAMs later.

BlockRAMs (and UltraRAMs)

The problem has nothing to do with how you generate the address. The problem is that you are trying to use this address to access a large number of RAMs (I can't tell how many). RAMs are "big blocks" - they are physically large on the die. They are arranged in columns, with the different columns relatively far apart from each other. So, if you try to drive (say) 12 RAMs with one address bus, then this address bus will need to span a column of RAMBs that is 12 RAMs tall. This is "far" on the FPGA. As you create larger and larger RAMs, these get taller and/or spill over into adjacent columns (which are far away).

The net result is that it becomes simply impossible to have one flip-flop drive more than N RAMs at a given speed due to the routing delay to get to all the endpoints. At 500MHz, the number is probably pretty small.

There is nothing that can be done about this - if you have a single source for the address, there is a limit to how many RAMs you can drive.

The solution is to have more than a single source for your address. So you need to replicate the flip-flops driving the address - you need to have one set of flip-flops driving the RAM inputs for every (say) group of 4 RAMs. If your address is pipelined, then you can use the pipelining to (manually) replicate these address flip-flops - but it isn't easy; synthesis will see these replicated flip-flops as redundant and try to remove them, so you need to protect them with DONT_TOUCH properties or by putting them in levels of hierarchy with flatten_hierarchy=none.
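To make that concrete, here is a minimal VHDL sketch of replicated address registers protected with DONT_TOUCH. The entity name, widths and group count are all assumptions for illustration, not taken from the original design:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch only: NUM_GROUPS registered copies of the same address, one per
    -- group of RAMs.  DONT_TOUCH stops synthesis from merging the copies.
    entity addr_fanout is
      generic (
        ADDR_W     : natural := 10;   -- assumed address width
        NUM_GROUPS : natural := 4     -- assumed number of RAM groups
      );
      port (
        clk       : in  std_logic;
        addr_next : in  unsigned(ADDR_W-1 downto 0);
        -- one ADDR_W-wide slice per RAM group, flattened into a single vector
        addr_rep  : out std_logic_vector(NUM_GROUPS*ADDR_W-1 downto 0)
      );
    end entity addr_fanout;

    architecture rtl of addr_fanout is
      type addr_array_t is array (0 to NUM_GROUPS-1) of unsigned(ADDR_W-1 downto 0);
      signal addr_q : addr_array_t;

      attribute dont_touch : string;
      attribute dont_touch of addr_q : signal is "true";
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          for i in 0 to NUM_GROUPS-1 loop
            addr_q(i) <= addr_next;   -- same value, NUM_GROUPS physically separate registers
          end loop;
        end if;
      end process;

      gen_out : for i in 0 to NUM_GROUPS-1 generate
        addr_rep((i+1)*ADDR_W-1 downto i*ADDR_W) <= std_logic_vector(addr_q(i));
      end generate gen_out;
    end architecture rtl;

Each ADDR_W-wide slice of addr_rep would then drive only its own small group of RAMs, so no single register has to reach across the whole column.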

If this is a simple counter, then you can replicate the counter: have M copies of the counter, start them all together, and they will stay in sync as they count. If this is more complicated than a counter, then you will need to replicate all of the address-generation logic in each of the M copies (and again, protect them from being removed as redundant).
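For the simple-counter case, a sketch along these lines (again, the names and generics are made up for illustration) keeps M identical counters in lock-step, one per group of RAMs:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch only: M identical counters started together, each one driving the
    -- address of its own group of RAMs.  DONT_TOUCH keeps them from being merged.
    entity repl_addr_counter is
      generic (
        ADDR_W : natural := 10;   -- assumed address width
        M      : natural := 4     -- assumed number of counter copies
      );
      port (
        clk  : in  std_logic;
        clr  : in  std_logic;     -- common synchronous clear starts all copies together
        addr : out std_logic_vector(M*ADDR_W-1 downto 0)
      );
    end entity repl_addr_counter;

    architecture rtl of repl_addr_counter is
      type cnt_array_t is array (0 to M-1) of unsigned(ADDR_W-1 downto 0);
      signal cnt : cnt_array_t := (others => (others => '0'));

      attribute dont_touch : string;
      attribute dont_touch of cnt : signal is "true";
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          for i in 0 to M-1 loop
            if clr = '1' then
              cnt(i) <= (others => '0');
            else
              cnt(i) <= cnt(i) + 1;   -- all copies see the same clr, so they stay in sync
            end if;
          end loop;
        end if;
      end process;

      gen_out : for i in 0 to M-1 generate
        addr((i+1)*ADDR_W-1 downto i*ADDR_W) <= std_logic_vector(cnt(i));
      end generate gen_out;
    end architecture rtl;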

This is not an easy problem to solve - particularly at high speeds. You need to understand the physical architecture of the device and architect your solution to do what you want in spite of these physical limitations, and also make sure the tool does what you want (or specifically doesn't undo what you want!).

Distributed RAMs (Select RAMs)

The answer is probably pretty much the same, except:

  • distributed RAMs are not "large blocks"; they are available all throughout the die
  • it is, though, possible that you are using WAY more than it appears here
    • If this RAM is large and you are using distributed RAMs, then realize that it will take two LUTs for each 64 bits of the RAM
      • This quickly results in LOTS and LOTS of these

So while you don't have the "big block" problem, you may have huge fanout. And even though the distributed RAMs are spread throughout, if you use enough of them you get the same problem: a very large quantity of them will take up a lot of space (and routing) in an area of the die, and an address fanning out from one source cannot reach all of these RAMs in one clock. The solution is the same - have replicated versions of the RAM inputs so that each replicated version is only trying to reach a smaller number of RAMs. And, as others have pointed out, 500MHz is fast - at these frequencies large fanout of any signal (to a RAM or otherwise) can quickly become a problem.
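To put rough numbers on that (purely illustrative - the RAM size is assumed, not taken from this design): a 32-bit wide by 4096-deep RAM is 128 kbits; at two LUTs per 64 bits that is about 4096 LUTs of distributed RAM, all wanting the same address.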

But I would ask the question - did you intend for these to be distributed RAMs? Again, distributed RAMs are small (64 bits x 1, implemented in two LUTs for a dual-port RAM), whereas block RAMs are bigger (36 kbits per instance - even for dual port). The main difference between them is the read path - block RAMs are synchronous read: the data is available on the clock after the address is presented. Distributed RAMs are asynchronous read: the data is available on the same clock as the address is presented. If you coded your RTL with asynchronous (combinatorial) read, then the tools have no choice but to infer distributed RAMs. If you code with synchronous read, then the tool can choose to use block RAM, or distributed RAM followed by a flip-flop, depending on the size of the RAM. When inferring RAMs you should look at the RTL templates in the Language Templates, or in the synthesis user guide.
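For reference, a generic synchronous-read template (sizes are arbitrary; this is a sketch in the spirit of the Language Templates, not the poster's actual code) looks roughly like this:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch of a synchronous-read RAM template (sizes are arbitrary examples).
    -- Because the read is registered, synthesis is free to map this to block RAM
    -- (or to distributed RAM followed by a flip-flop, if it is small enough).
    entity sync_read_ram is
      port (
        clk   : in  std_logic;
        we    : in  std_logic;
        waddr : in  unsigned(9 downto 0);
        raddr : in  unsigned(9 downto 0);
        din   : in  std_logic_vector(31 downto 0);
        dout  : out std_logic_vector(31 downto 0)
      );
    end entity sync_read_ram;

    architecture rtl of sync_read_ram is
      type ram_t is array (0 to 1023) of std_logic_vector(31 downto 0);
      signal ram : ram_t := (others => (others => '0'));
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if we = '1' then
            ram(to_integer(waddr)) <= din;
          end if;
          dout <= ram(to_integer(raddr));   -- registered read: data valid one clock after the address
        end if;
      end process;
    end architecture rtl;

An asynchronous read - writing dout <= ram(to_integer(raddr)); as a concurrent statement outside the clocked process - forces distributed RAM, which may be what is happening here.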

Avrum


8 Replies
drjohnsmith
Teacher

The DSP48 is "just" a counter / multiplier block.

  It has two advantages for you:

    it's physically next to the RAM blocks of the FPGA

    it's fast

So if it's a simple address increment, then the DSP48 could be for you,

but

they get their speed by being pipeline delayed,

    i.e. a multiple-clock delay until you get an output

      it's a new value on each clock, but delayed by one or two clocks, depending on how you do it
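As a rough sketch of what that looks like (the widths are arbitrary, and the USE_DSP attribute - USE_DSP48 in older Vivado versions - should be checked against your tool's synthesis guide), a pipelined counter aimed at a DSP slice could be coded like this:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch only: ask synthesis to map a simple incrementing address counter
    -- into a DSP slice.  The extra registered stage (addr_q) reflects the point
    -- above: the DSP is only fast when its internal pipeline registers are used,
    -- so the address appears a clock later than a plain fabric counter would.
    entity dsp_addr_counter is
      generic ( ADDR_W : natural := 16 );   -- assumed width
      port (
        clk  : in  std_logic;
        clr  : in  std_logic;
        en   : in  std_logic;
        addr : out unsigned(ADDR_W-1 downto 0)
      );
    end entity dsp_addr_counter;

    architecture rtl of dsp_addr_counter is
      signal count_q : unsigned(ADDR_W-1 downto 0) := (others => '0');
      signal addr_q  : unsigned(ADDR_W-1 downto 0) := (others => '0');

      attribute use_dsp : string;               -- USE_DSP48 in older Vivado versions
      attribute use_dsp of count_q : signal is "yes";
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if clr = '1' then
            count_q <= (others => '0');
            addr_q  <= (others => '0');
          elsif en = '1' then
            count_q <= count_q + 1;   -- adder plus register intended to land in the DSP
            addr_q  <= count_q;       -- second register stage: one extra clock of latency
          end if;
        end if;
      end process;

      addr <= addr_q;
    end architecture rtl;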

 

What address scheme do you want?

   What timing do you want, and in what device?

 

Also note that the BRAMs are fast, but again only if you enable the output registers;

    if you use them without those registers, they will be a lot slower.
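A minimal sketch of what enabling the output register looks like for an inferred RAM (generic sizes, not the actual design) - the second register stage is what lets the tool pack the read data into the block RAM's dedicated output register:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Sketch of an inferred RAM read path with the optional output register
    -- (sizes are arbitrary).  The second register stage improves clock-to-out
    -- at the cost of one extra cycle of read latency.
    entity bram_reg_out is
      port (
        clk   : in  std_logic;
        we    : in  std_logic;
        waddr : in  unsigned(9 downto 0);
        raddr : in  unsigned(9 downto 0);
        din   : in  std_logic_vector(31 downto 0);
        dout  : out std_logic_vector(31 downto 0)
      );
    end entity bram_reg_out;

    architecture rtl of bram_reg_out is
      type ram_t is array (0 to 1023) of std_logic_vector(31 downto 0);
      signal ram     : ram_t := (others => (others => '0'));
      signal ram_reg : std_logic_vector(31 downto 0) := (others => '0');
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          if we = '1' then
            ram(to_integer(waddr)) <= din;
          end if;
          ram_reg <= ram(to_integer(raddr));   -- the RAM's own read register
          dout    <= ram_reg;                  -- extra stage -> block RAM output register
        end if;
      end process;
    end architecture rtl;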

What does your system block diagram look like? Can you post a picture?

 

vishalbk
Visitor

I have uploaded the critical path and the screenshot of the routing delay; counter_x in the diagram is the address generator for x_memory.

I cannot upload the entire routing path as it is very long.

I am using it as a simple incrementing address counter, and I am aiming to run it at 500 MHz (as fast as possible).

 

 

Attachments: routing_path.jpg, critical_path1.1.jpg
vishalbk
Visitor

This is a zoomed-in view of the routing path.

Attachment: routing_path1.1.jpg
drjohnsmith
Teacher

Enable the input and output registers on the BRAM you have; depending on the FPGA, you might have both or only one of them.

500 MHz is damned fast. What part and speed grade are you looking at?

The clock - you are generating it from an MMCM / PLL in the FPGA, not bringing it straight in?

 

vishalbk
Visitor

Thank you for your suggestion. I used a PLL, but it did not help solve the critical path delay. I want to improve the critical path because of the routing issues. I am still trying to enable the BRAM input and output registers.

Maybe an additional pipeline register will help solve the problem.

drjohnsmith
Teacher

Yep, a PLL / MMCM or whatever is always the way to do clocks -

     just asked, as some people don't.

How are you coding this?

How are you making the RAMs?

Instantiated, inferred, or an IP block?

 

vishalbk
Visitor

RAMs are inferred from my RTL code in VHDL. 
