Adventurer
Registered: 08-10-2017

Timing Analysis - help

I made a hardware accelerator design on a VC707 evaluation board containing a Virtex-7 485T FPGA. I have enabled Physical Optimization both after placement and after routing in Vivado 2017.2. I have attached the timing report here (https://drive.google.com/open?id=10yD9ZG81oGTdTo1uB5-f8d2Hpuz3uDFB).

 

I observed the following:

  1. Some of the paths with high WNS are between the AXI BRAM Controller and the Block Memory Generator. I need 2.5 MB of block RAM, so I used five 512 KB BRAMs, each with its own AXI BRAM Controller. I would like to know if there is any difference between using one 2.5 MB BRAM and five 512 KB BRAMs.
  2. Routing delays range from 3x logic delay to almost 10x logic delay. Floorplanning didn't help; it only made things worse. I'm not sure how to proceed.
  3. Some paths have very small clock skew (1 ps) and some have very large skew (500 ps). What could be the most likely reason for this? (All clocks are generated by the clock generator in the 7 series MIG controller.)
    Reducing this skew would help bring down the number of violating paths considerably.
  4. I'm pipelining some of the paths that have almost 50% logic delay, and I'll attach the report later.
  5. Page 42 of the "UltraFast Design Methodology Guide" says that resets are not always required, that it is not necessary to code a global reset for the sole purpose of initializing the device after power-up, and that sequential primitives almost always default to zero (except FDSE and FDPE). I get the point. The problem is, each one of my registers has a synchronous set/reset, and I am not sure how to remove them safely without compromising my design.

I am currently going through the "UltraFast Design Methodology Guide" and "Design Analysis and Closure Techniques" to see if I've missed anything important.

 

If you could give me hints on how to improve the timing based on my report, it would be really helpful.


Thank you


Jagannath

 

 

EDIT: fixed typos

Guide
Registered: 01-23-2009

Re: Timing Analysis - help

You say you need a RAM of 2.5MB.

 

Using the RAMs on the FPGA, this means that you need 640 RAMB36 to implement this RAM. The device on the VC707 has 1030 RAMB36, so you are using more than 60% of the available RAM.
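
To put numbers on that, here is a quick back-of-the-envelope in Python (a sketch; it assumes each RAMB36 contributes its 32 Kb of data, i.e. 4 KB, with the parity bits unused):

    # Rough BRAM budget for a 2.5 MB memory on the XC7VX485T.
    # Assumption: each RAMB36 stores 32 Kb of data (4 KB); parity unused.
    ram_bytes = 2.5 * 1024 * 1024          # 2.5 MB target memory
    bytes_per_ramb36 = 32 * 1024 // 8      # 4 KB of data per RAMB36

    ramb36_needed = int(ram_bytes / bytes_per_ramb36)
    ramb36_available = 1030                # RAMB36 count on the 485T

    print(f"RAMB36 needed: {ramb36_needed}")                         # 640
    print(f"Utilization:   {ramb36_needed / ramb36_available:.0%}")  # ~62%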

 

The RAMs on the FPGA die are scattered across the die - located in several discrete columns that run from the top of the die to the bottom.

 

If you are trying to access them from a single source, then you are creating a situation where the tools need to fan out the control and other input signals to RAMB36 cells literally across the entire FPGA die. The routing for that is going to be immense. Similarly, the tools will have to construct a 640:1 MUX on the return data, which also comes back from everywhere on the die (depending on your data width it could be less than 640:1, but it is still pretty immense).

 

This can't meet timing.

 

To do this, you need to break the RAM into reasonable groups. You need to control the fanout of all the signals to the RAMs by manually replicating the drivers and pipelining them - probably over 4 or 5 pipeline levels. You need to make sure the tools don't merge what they see as "redundant" logic. And you need to manually construct a MUX tree that brings the read data back to your single point - again, over several pipeline levels...
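
For a rough feel of why 4 or 5 levels come up, here is a small Python sketch (the radix of 4 is an assumption - one registered 4:1 MUX, or one 4-way replicated driver, per level):

    import math

    # Depth of a registered radix-r tree spanning n endpoints.
    # Assumption: radix 4 - one registered 4:1 MUX per level on the
    # return path, or one 4-way replicated/pipelined driver going out.
    def tree_depth(n_endpoints: int, radix: int = 4) -> int:
        return math.ceil(math.log(n_endpoints, radix))

    print(tree_depth(640))        # 5 pipeline levels to span 640 RAMB36
    print(tree_depth(640 // 16))  # 3 levels if first grouped into 40 blocks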

 

The block memory generator will do none of this for you. You will have to do it yourself.

 

It may be possible to do this in IP integrator by breaking the RAM into really small groups. You don't tell us what frequency you are running at, but at reasonable speeds I would not try to put more than 16 RAMB36 in a single group (64 KB each). (If your frequency is more than 200MHz, then you might even consider 8 RAMs per group instead.)
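
A quick sketch of the grouping arithmetic under those limits (again assuming 4 KB of data per RAMB36):

    # Group counts for 640 RAMB36 at 4 KB of data each.
    total_ramb36 = 640

    for per_group in (16, 8):         # 16 at moderate clocks, 8 above ~200 MHz
        groups = total_ramb36 // per_group
        kb_per_group = per_group * 4  # 4 KB per RAMB36
        print(f"{per_group:2d} RAMB36/group -> {groups} groups of {kb_per_group} KB each")
    # 16 RAMB36/group -> 40 groups of 64 KB each
    #  8 RAMB36/group -> 80 groups of 32 KB each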

 

Then you can try to interconnect these 40 groups of RAMs on separate AXI busses. But even then, you will need to pipeline the AXI busses - something like a 2:1 AXI interconnect (registered), each port connected to a separate 4:1 AXI interconnect (registered), each of those connected to a 5:1 AXI interconnect (registered), to get the fanout to your 40 groups of 16 RAMs.
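
Sanity-checking that tree in Python (the 2/4/5 split is just the example above):

    # Registered AXI interconnect tree: one 2:1 level, then 4:1 levels,
    # then 5:1 levels at the leaves (all registered).
    fanouts = [2, 4, 5]

    leaves = 1
    for f in fanouts:
        leaves *= f

    leaf_kb = 16 * 4                                  # 16 RAMB36 x 4 KB each
    print(f"Leaf AXI ports: {leaves}")                # 40, one per RAM group
    print(f"Total memory:   {leaves * leaf_kb} KB")   # 2560 KB = 2.5 MB
    print(f"Register hops:  {len(fanouts)} interconnect levels per path")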

 

Avrum

Adventurer
Registered: 08-10-2017

Re: Timing Analysis - help

Thank you @avrumw

It was just an intuition (though I wasn't sure) that multiple small BRAMs could be routed better than one huge BRAM; that's why I used five 512 KB BRAMs. Your suggestion is much clearer, and I now understand that using multiple cascaded AXI Interconnect/SmartConnect blocks to drive the signals to smaller BRAMs is a much better approach. I am very grateful for your help.

 

My target frequency is 200 MHz.

 

The accelerator results are stored in 2 MB of that 2.5 MB. The accelerator performs regular write transactions to this memory, and the host PC reads from it via PCIe Gen2. If I were to move this 2 MB to the onboard DDR3 memory, do you think it would alleviate some of the other problems I'm facing (at the expense of DDR write latency)?

If routing resources weren't spent on the BRAMs, is it possible that the routing delay in my actual logic could improve?

I am considering this because my LUT utilization is approaching 90% (with Physical Optimization enabled), and adding multiple AXI Interconnects could push it higher, making routing even more difficult.

Once I free up 2 MB worth of BRAM blocks, I could reuse a fraction of them in place of distributed RAM to reduce LUT utilization.

 

One other problem I'm facing is deciding whether I have to floorplan a given design. Most Xilinx videos and forum posts tell me the tool does a better job of placing resources appropriately. Whenever I floorplanned a design with routing delay in the range of 5x to 10x logic delay, I always ended up with worse results (higher WNS, more failing endpoints) than I began with. I'm unable to figure out whether I'm floorplanning the wrong block, or whether I should be floorplanning at all.

 

Guide
Registered: 01-23-2009

Re: Timing Analysis - help

Whenever I floorplanned a design with routing delay in the range of 5x to 10x logic delay, I always ended up with worse results (higher WNS, more failing endpoints) than I began with.

 

This is pretty typical with Vivado. The Vivado placer is WAY better than the ISE placer, and it tends to find some of the best placements. Any manual floorplanning usually just makes things worse.

 

Floorplanning is still useful for really large things (particularly when mapping logic to SLRs on the non-monolithic dice), but generally not for fixing timing problems. 

 

Furthermore, in this case, the problem is not floorplanning - the resources are just too far apart...

 

If routing resources weren't spent on the BRAMs, is it possible that the routing delay in my actual logic could improve?

 

Not due to contention for routing resources, but due to the impact on placement. Since your RAMs are all over the place, the placer is trying to compensate by stretching your user logic to be nearer to the RAMs it needs. Since the RAMs are everywhere, the logic tends to be spread everywhere, which has a (significant) negative impact on timing. So fixing the RAM access can help other logic.

 

Using this much block RAM as a single large RAM is never recommended (for the problems we have already discussed). Moving to DDR-SDRAM would help this problem, but it comes with its own set of problems, mostly that the latency is long and unpredictable - there are definitely many cases where the fundamental differences in bandwidth, latency, and access pattern can make it very difficult to switch from SRAM to DRAM.

 

Avrum

Registered: 01-22-2015

Re: Timing Analysis - help

Hi Avrum,

           Moving to DDR-SDRAM would help this problem, but comes with its own set of problems; mostly that the latency is long and unpredictable…

 

Can you speak a little more about off-FPGA alternatives to block RAM (BRAM) that have the fast read access of BRAM? Also, please describe the unpredictable latency of DRAM that you mentioned.

 

Thanks,

Mark

Guide
Registered: 01-23-2009

Re: Timing Analysis - help

Can you speak a little more about off-FPGA alternatives to block RAM (BRAM) that have the fast read access of BRAM?

 

Any off-the-shelf static RAM can be used. The bandwidth will depend on the bus width (and whether it is SDR or DDR), the clock speed, and the number of ports. For example, you can use conventional asynchronous SRAM (slow) or a variety of synchronous SRAMs (Synchronous Burst SRAM, ZBT SRAM, QDR SRAM).

 

It is worth pointing out that the original post was specifically talking about a VC707 - the only on-board memory is DDR3-SDRAM (and Flash and EEPROM, which are not general purpose memories).

 

It is also worth pointing out that no off-chip memory has the performance of BRAM. Looking at the extremes of off-chip memory (and I am not sure any single RAM has all of these characteristics):

   - 72 bit DDR

   - dual port

   - 1GHz clock operation

 

This leads to 144 Gbps of raw bandwidth per port, or 288 Gbps across both ports.

 

Looking at the device in question (Virtex-7 485T): there are 1030 block RAMs, each with two 72-bit ports (in SDP mode) capable of operating at around 500MHz. This results in a staggering bandwidth of roughly 74 Tbps - over 250x the performance of the SRAM described above...
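
The arithmetic behind those numbers, as a quick Python sketch (the hypothetical SRAM and the ~500MHz BRAM clock are the assumptions already stated above):

    # Aggregate bandwidth: hypothetical off-chip SRAM vs. on-chip BRAM.
    # Assumptions from the text: 72-bit dual-port DDR SRAM at 1 GHz, and
    # 1030 RAMB36 each with two 72-bit ports (SDP mode) at ~500 MHz.
    sram_bps = 72 * 2 * 2 * 1e9        # width x DDR x 2 ports x clock
    bram_bps = 1030 * 2 * 72 * 500e6   # RAMs x ports x width x clock

    print(f"SRAM:  {sram_bps / 1e9:.0f} Gbps")    # 288 Gbps
    print(f"BRAM:  {bram_bps / 1e12:.1f} Tbps")   # 74.2 Tbps
    print(f"Ratio: {bram_bps / sram_bps:.0f}x")   # ~258x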

 

Also, please describe the unpredictable latency of DRAM that you mentioned.

 

From a user application point of view, the read latency of a DRAM depends on the state of the controller and the DRAM. A number of things can affect this (a rough cycle-count sketch follows this list):

  - whether the read is to a row that is already open (page hit), to a bank that is closed, or to a bank that is open on another row (page miss)

  - whether the previous operation was a read or a write

  - whether the controller is doing (or needs to do) a refresh operation
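
Here is that sketch (the timing values are illustrative assumptions for a DDR3-1600-class part, in memory-clock cycles; real values depend on the speed grade and the controller configuration):

    # Rough DDR3 read-latency model, in memory-clock cycles.
    # Assumed illustrative timings (DDR3-1600-class); not from the MIG docs.
    tRP, tRCD, CL = 11, 11, 11     # precharge, activate-to-read, CAS latency

    page_hit    = CL               # row already open: just the CAS latency
    bank_closed = tRCD + CL        # bank idle: activate the row, then read
    page_miss   = tRP + tRCD + CL  # wrong row open: precharge, activate, read

    for name, cycles in [("page hit", page_hit),
                         ("bank closed", bank_closed),
                         ("page miss", page_miss)]:
        print(f"{name:12s}: {cycles} cycles")
    # A refresh in progress can add tRFC (hundreds of ns) on top of these.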

 

This is in addition to any optimizations the controller may perform - the controller can keep a list of outstanding requests and reorder them (within certain limits) to optimize efficiency.

 

Finally, the user application clock is almost never the same as the SDRAM clock, and hence there is uncertainty in latency due to clock crossing.

 

All of these make DDRx-SDRAM difficult to use. But the flip side is that DDRx-SDRAM can provide enormous memory space (much larger than any static RAM technology) at pretty high bandwidths...

 

Avrum
