Visitor eugenesa
Visitor
134 Views
Registered: ‎02-15-2019

Huge (and rather random) timing violations in axi s2mm dma IP


Can someone, please, take a look at the timing results below, and then hopefully either teach me some simple tricks or help me adjust my expectations about what is realistically possible to achieve with Xilinx FPGA fabric and with their IP. Because, honestly at this point I am fairly confused and somewhat disappointed.

Let us put together a simple design: a Zynq-7000 core, an axi_gpio IP on its GP port at 111 MHz, and an axi S2MM dma IP (configured 64x64, S to MM) on its HP port at 133 MHz, with the dma axi-lite interface on the same 111 MHz clock as axi_gpio. Set the target chip to xc7z010 -1 (Artix fabric). The spec for the S2MM dma IP says it should be good to 150 MHz on axi and 120 MHz on axi-lite, so we are in spec with our 111 MHz and 133 MHz. Synthesise and implement with default settings. Find that the design fails timing.

And now it becomes interesting. Note that all failing nets are entirely in the s2mm dma IP (or occasionally also in the smart axi interconnect used for s2mm), which brings my first question:

Q1: How is it even possible for pre-defined Xilinx IP to fail timing in a design with just 15% LUT / 12% FF utilization?

Also notice that it does not just fail timing by a bit. It fails timing by a whopping -17 ns WNS and -5000 ns TNS. And the path behind that -17 ns consists of 2.8 ns of logic delay and 26 ns (!) of net delay, which brings my next question:

Q2: Why does the router even bother to create paths with 26 ns of routing delay?
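
For scale, here is the back-of-the-envelope arithmetic behind those numbers as a small Python sketch. It deliberately ignores clock skew, setup time and clock uncertainty, all of which the timing report does account for, so it is not expected to reproduce the reported WNS; the function names are mine and purely illustrative.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def naive_slack_ns(freq_mhz, logic_ns, net_ns):
    """Period minus data-path delay; skew, setup and uncertainty are ignored."""
    return period_ns(freq_mhz) - (logic_ns + net_ns)


if __name__ == "__main__":
    # 2.8 ns logic + 26 ns net delay, checked against both clocks in the design.
    for mhz in (111.0, 133.0):
        print(f"{mhz:.0f} MHz: period {period_ns(mhz):.2f} ns, "
              f"naive slack {naive_slack_ns(mhz, 2.8, 26.0):.2f} ns")
```

Whichever of the two clocks the path belongs to, the data path alone is more than three times the clock period, so this is not a path that is merely a little too slow.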

Now let us try to play with various settings and notice how the results get more and more strange:

  • Keep the target chip the same, but change the speed grade to -2. It still fails timing, but now with (only) -3.2 ns / -306 ns. And the very same net which was originally routed with 26 ns of net delay has now been routed with only 1.9 ns of net delay, and this net now meets timing with a +2.8 ns margin!

Q3: What is so magical about the -2 speed grade? Why could the router not do the same for the -1 chip?

  • Try the -3 grade. It still fails timing, yet the margins are now worse (!) than with the -2 grade (-6.4 ns / -203 ns).

  • Try a bigger chip (xc7z020 -1). The result is even worse than the base case. We are now at -29 ns WNS (3.2 ns logic + 39 ns net delay!) / -6900 ns TNS.

  • Try Kintex fabric. For the base xc7z030 -1 chip the results are somewhat better: -10 ns / -1800 ns. For the -2 grade we’re almost there: -1.5 ns / -1.5 ns (yes, just one failed net!). And finally, for xc7z030 -3 it meets timing with a +1.15 ns margin.

  • Try various optimization strategies during implementation (the results here are for the -1 Artix chip). Performance_ExploreWithRemap produced -5.8 ns / -2424 ns with only 1 h 40 min of run time (vs. 15 min for the default one), and Congestion_SpreadLogic_High was a close second at -6.7 ns / -600 ns (taking 5 h to run, though).

Also, attached are routing examples for the exact same net under different settings. Notice the huge difference in net delay, while the path is reasonably local in all the cases. EDIT: Turns out the default device view shows just the connections between the elements and not the actual routes. I had to turn the routing view on to see that the routes for the same net do indeed end up very different, which I guess explains the differences in propagation delay. What it does not explain is the difference in overall post-route timing closure between, say, grade -1 and grade -2 chips. How seasoned FPGA folks deal with this in larger designs is beyond me. When the delay changes by 24 ns (!) from run to run just because of going from a speed grade -1 to a -2 chip, isn’t this basically voodoo?

Here’s my humble takeaway from this exercise (and I would much appreciate being proven wrong here):

  1. Not all standard included IP is created equal. My earlier exercises with axi_gpio and the axi bram controller compiled for the same xc7z010 -1 Artix chip, met timing, and then worked fine all the way to almost 160 MHz on bram dual port data.
  2. The listed performance numbers for various fabrics should be taken with a grain of salt. It is likely possible to achieve the stated ~150 MHz / ~200 MHz for common axi components on Artix and Kintex fabrics, but it may require either a lot of luck or getting a PhD in FPGA manual floorplanning techniques and then spending the rest of one’s best years learning router configurations. For the mere mortals among us, a more reasonable expectation would be to settle for 40-50% of these frequencies using default settings and nice green buttons in the UI.

Overall, I am somewhat baffled. I now have this uneasy sense that if I just replace one existing block IP with another, the design may no longer converge, not because I am pushing any of the officially published chip metrics, but simply because the router decides so. Is this really how it works in FPGA land? Is this more or less universal across different FPGA vendors (or maybe FPGA routers are not created equal either)?

Net1-Route-Default.jpg
Net1-Route-Default-Grade2.jpg
Net1-Route-Phys_opt_des.jpg

Accepted Solutions
Visitor eugenesa
Visitor
91 Views
Registered: ‎02-15-2019

SOLVED: Huge (and rather random) timing violations in axi s2mm dma IP (aka CONFIG.PCW_FCLK_CLK1_BUF property)


SOLVED:

Many hours later I have finally traced this issue to the same Vivado bug that has already been described here:

At some point, somehow, Vivado decided to change the value of the notorious CONFIG.PCW_FCLK_CLK1_BUF property on the Zynq PS IP block from its default of TRUE to FALSE. Note that I am 100% blaming Vivado here, as there is no way to manipulate (or even impact?) this property from anywhere in the Zynq block IP customization UI, and the only operations I was ever performing were enabling and disabling CLK1, changing the source PLL for it, and connecting and disconnecting the corresponding Zynq CLK1 pin. I didn’t even know such a property existed for Zynq clocks.

XILINX, with all due respect, this is a very bad bug! It is almost as if one is trying to compile a “Hello world” program in some high-level language, and the compiler decides to also insert a few instructions to mess with, say, CPU cache coherency. How do you expect people to ever be able to diagnose this?? Here the problem manifests itself in the router honestly struggling to do its job – it produces crazy traces, fails timing, complains about congestion, etc., yet this is all because Vivado has silently chosen to drop some low-level clocking primitive from the clock net!

If you cannot preserve the integrity of this important property, then at the very least make it easily discoverable in the block diagram somehow. The consequences of the wrong setting here are simply brutal.
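
Until the block diagram exposes this setting, one cheap guard is to scan an exported block-design Tcl script for FCLK buffer properties that are switched off. The Python below is only a sketch: it assumes the export lists properties as `CONFIG.NAME {VALUE}` pairs, the helper name `unbuffered_fclks` and the sample text are made up for illustration, and an export may omit properties that are still at their defaults, so a clean result proves little.

```python
import re

# Assumed export layout, e.g.:  CONFIG.PCW_FCLK_CLK1_BUF {FALSE} \
BUF_RE = re.compile(r"CONFIG\.(PCW_FCLK_CLK\d+_BUF)\s+\{(\w+)\}")


def unbuffered_fclks(tcl_text):
    """Return the FCLK buffer properties whose value is FALSE."""
    return [name for name, value in BUF_RE.findall(tcl_text)
            if value.upper() == "FALSE"]


# Hypothetical excerpt of an exported block design, for illustration only.
SAMPLE = r"""
set_property -dict [ list \
  CONFIG.PCW_FCLK_CLK0_BUF {TRUE} \
  CONFIG.PCW_FCLK_CLK1_BUF {FALSE} \
] $processing_system7_0
"""

if __name__ == "__main__":
    for name in unbuffered_fclks(SAMPLE):
        print("no clock buffer requested for", name)
```

A FALSE value is not necessarily wrong (the clock may be unused, or buffered elsewhere), so a hit is a prompt to go and look, not a verdict.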

Just for entertainment, here’s the process I had to go through in order to figure it all out:

  • Assume the router is struggling for real. This suggests the design is possibly too complex for the low-end chip. Research the differences in FPGA routing resources between Xilinx and vendor B. Find out that there is indeed a difference.
  • Check whether a vendor B low-end chip can handle the same design. Install vendor B tools, ramp up on using them. Make a similar design with streaming dma and gpio. Observe it compiles in 2 minutes without issues. Start mentally preparing for shelling out 4 grand as the price of admission to the vendor B club. Order a sample board based on a vendor B FPGA…
  • Check if Zynq UltraScale+ can possibly handle this design. (This requires deleting the original Zynq IP block and replacing it with the Zynq UltraScale+ one. This turned out to be the key step!) Observe the design now compiles fine. Suddenly there’s hope…
  • Check if Zynq Kintex (xc7z030) can handle the design. This now requires deleting the Zynq UltraScale+ block and adding back the regular Zynq IP. Surprisingly, the design again compiles fine! (Because the Zynq IP does set the PCW_FCLK_CLK1_BUF property to TRUE, until that unfortunate moment when the property value gets lost.) Now re-implement for the original xc7z010 -1, and the timing is now (suddenly) met!
  • Now export the block designs from the two projects as tcl scripts and do a windiff on them… The rest is history.
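
For anyone repeating that last step without windiff, the comparison can be narrowed to just the property changes with a few lines of Python. This is a rough sketch under assumptions, not a description of any Xilinx tool: it assumes the exported scripts name each cell on a `create_bd_cell` line and then list its settings as `CONFIG.NAME {VALUE}` pairs; the function names and the two sample excerpts are hypothetical.

```python
import re

# Assumed export layout, e.g.:  CONFIG.PCW_FCLK_CLK1_BUF {TRUE} \
PROP_RE = re.compile(r"(CONFIG\.\w+)\s+\{([^{}]*)\}")


def parse_bd_props(tcl_text):
    """Map (cell, property) -> value for an exported block-design script."""
    props = {}
    cell = "<top>"
    for line in tcl_text.splitlines():
        if "create_bd_cell" in line:
            # Assumption: the instance name is the last word before the "]".
            cell = line.replace("]", " ").split()[-1]
        for name, value in PROP_RE.findall(line):
            props[(cell, name)] = value
    return props


def diff_props(old, new):
    """List (cell, property, old_value, new_value) for every difference."""
    changes = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes.append((key[0], key[1], before, after))
    return changes


# Hypothetical excerpts standing in for the two exported scripts.
WORKING = r"""
set processing_system7_0 [ create_bd_cell -type ip processing_system7_0 ]
set_property -dict [ list \
  CONFIG.PCW_FCLK_CLK0_BUF {TRUE} \
  CONFIG.PCW_FCLK_CLK1_BUF {TRUE} \
] $processing_system7_0
"""
FAILING = WORKING.replace("CLK1_BUF {TRUE}", "CLK1_BUF {FALSE}")

if __name__ == "__main__":
    for cell, prop, before, after in diff_props(parse_bd_props(WORKING),
                                                parse_bd_props(FAILING)):
        print(f"{cell}: {prop} {before} -> {after}")
```

Keying on the cell name as well as the property name keeps identically named settings on different IP blocks apart, which a plain text diff leaves to the reader.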

Is this really what a Xilinx FPGA newbie has to go through in order to get admitted to Xilinx club? :) 
