Routing problem in Alveo U50 for simple C kernels


Hi, I was running a simple C kernel for Alveo U50 bandwidth testing. The project is a modified version of one of the Vitis example designs (C based).

I ran into a routing failure with 30 HBM channels at the default 300 MHz, so I went through several rounds of trial and error. Eventually I got a working design with 16 HBM channels and the kernel frequency manually set to 150 MHz.
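
For reference, the frequency change was done through the v++ --kernel_frequency switch; in my makefile it is roughly the following line (the variable name follows the Vitis example makefile and may differ in other setups):

CLFLAGS += --kernel_frequency 150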

But considering that there is nothing complicated about my design (please see the attached code), I find it very odd that I have to simplify it so much.

I hope someone can give me some tips or workarounds on the coding style I should follow for the U50.

Thanks.

----------------------------------------------------

Environment:

Vitis 2019.2

make all TARGET=hw DEVICE=xilinx_u50_xdma_201920_1

----------------------------------------------------

Attached files:

Makefile, utils.mk (please remove .txt extension...)

src/vadd.cpp (the kernel file)

-----------------------------------------------------

Routing error message with 16 channels and the default 300 MHz:

[10:23:05] Phase 11 Post Router Timing
[10:25:11] Phase 12 Physical Synthesis in Router
[10:25:11] Phase 12.1 Physical Synthesis Initialization
[10:27:17] Phase 12.2 Critical Path Optimization
[10:29:22] Finished 5th of 6 tasks (FPGA routing). Elapsed time: 01h 42m 14s

[10:29:22] Starting bitstream generation..
[10:46:06] Run vpl: Step impl: Failed
[10:46:12] Run vpl: FINISHED. Run Status: impl ERROR

===>The following messages were generated while Compiling (bitstream) accelerator binary: vadd Log file: /home/ykchoi/hbm/xilinx/cons_out16_bur32/build_dir.hw.xilinx_u50_xdma_201920_1/link/vivado/vpl/prj/prj.runs/impl_1/runme.log :
ERROR: [VPL-4] design did not meet timing - Design did not meet timing. One or more unscalable system clocks did not meet their required target frequency. Please try specifying a clock frequency lower than 300 MHz using the '--kernel_frequency' switch for the next compilation. For all system clocks, this design is using 0 nanoseconds as the threshold worst negative slack (WNS) value. List of system clocks with timing failure:
system clock: clk_out1_bd_6c68_clkwiz_hbm_aclk_0; slack: -0.035 ns
WARNING: [VPL 60-732] Link warning: No monitor points found for BD automation.
ERROR: [VPL 60-704] Integration error, problem implementing dynamic region, route_design ERROR, please look at the run log file '/home/ykchoi/hbm/xilinx/cons_out16_bur32/build_dir.hw.xilinx_u50_xdma_201920_1/link/vivado/vpl/prj/prj.runs/impl_1/runme.log' for more information
ERROR: [VPL 60-1328] Vpl run 'vpl' failed
ERROR: [VPL 60-806] Failed to finish platform linker
INFO: [v++ 60-1442] [10:46:17] Run run_link: Step vpl: Failed
Time (s): cpu = 00:12:16 ; elapsed = 07:18:58 . Memory (MB): peak = 593.570 ; gain = 0.000 ; free physical = 202851 ; free virtual = 205109
ERROR: [v++ 60-661] v++ link run 'run_link' failed
ERROR: [v++ 60-626] Kernel link failed to complete
ERROR: [v++ 60-703] Failed to finish linking
Makefile:129: recipe for target 'build_dir.hw.xilinx_u50_xdma_201920_1/vadd.xclbin' failed

6 Replies
Moderator

Hi @mice101,

There is a timing issue in the routed design, which causes the ERROR: [VPL-4] during platform linking.

Please add the -R2 option to the linker to export the intermediate DCPs (opt.dcp/routed.dcp) for the whole design.

You can then open routed.dcp in Vivado to debug the timing issue in the design.
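
For example, assuming your makefile forwards these flags to the v++ link step (the Vitis example makefiles use CLFLAGS or LDCLFLAGS for this, so adjust the variable name to your setup):

CLFLAGS += -R2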

Observer

Thanks for your advice. I will try your approach to pinpoint the problem.

When I looked at the congestion image in Vivado earlier, the congestion was mostly centered around the HBM pins (image attached).

But considering that the timing error is in a region that is not part of the user kernel, I suspect this is something not controllable by the user.

Anyhow, I will post again once I understand the critical path.

Attached image: 2020-01-28.png

Xilinx Employee (accepted solution)

Hi @mice101 

Timing problems can manifest themselves in interesting ways when congestion in the FPGA is very high. When trying to place 15 kernels on the single side of the SLR that has the HBM on it, you are likely over-congesting that side of the chip. Eleven kernels is what I've personally had luck with. I try to use both stacks, or indicate that every other kernel should be placed in a different SLR.
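
For example, if the design were split into multiple compute units, something along these lines should push them into different SLRs (legacy v++ syntax matching the --sp switches below; the instance names are only illustrative, not taken from your design):

CLFLAGS += --nk vadd:2:vadd_1.vadd_2
CLFLAGS += --slr vadd_1:SLR0
CLFLAGS += --slr vadd_2:SLR1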

Since the U50 has a power budget of 75 W and 10 of those watts are allocated to the HBM, it is not expected that all 30 ports of the HBM would have a kernel attached and running at full speed.

Also, your makefile is forcing congestion onto a single half of the SLR attached to the HBM block, since you are using HBM ports 0-15.

If you change your makefile to spread the HBM usage over both stacks, you may have better luck closing timing.

CLFLAGS += --sp vadd_1.m_axi_hbm0:HBM[0]
CLFLAGS += --sp vadd_1.m_axi_hbm1:HBM[2]
CLFLAGS += --sp vadd_1.m_axi_hbm2:HBM[4]
CLFLAGS += --sp vadd_1.m_axi_hbm3:HBM[6]
CLFLAGS += --sp vadd_1.m_axi_hbm4:HBM[8]
CLFLAGS += --sp vadd_1.m_axi_hbm5:HBM[10]
CLFLAGS += --sp vadd_1.m_axi_hbm6:HBM[12]
CLFLAGS += --sp vadd_1.m_axi_hbm7:HBM[14]

CLFLAGS += --sp vadd_1.m_axi_hbm8:HBM[16]
CLFLAGS += --sp vadd_1.m_axi_hbm9:HBM[18]
CLFLAGS += --sp vadd_1.m_axi_hbm10:HBM[20]
CLFLAGS += --sp vadd_1.m_axi_hbm11:HBM[22]
CLFLAGS += --sp vadd_1.m_axi_hbm12:HBM[24]
CLFLAGS += --sp vadd_1.m_axi_hbm13:HBM[26]
CLFLAGS += --sp vadd_1.m_axi_hbm14:HBM[28]
CLFLAGS += --sp vadd_1.m_axi_hbm15:HBM[30]

 

Regards,

M

Observer

That makes a lot of sense. I will try some different combinations of HBM ports and post again :)

 

By the way, regarding your comment that it is unlikely that all HBM ports on the U50 can be utilized due to power: in your opinion (or by general consensus), would you say that on the U280 it is feasible to utilize all of the HBM ports (and its conventional DRAMs)?

 

Thanks :)

Xilinx Employee

@mice101 

The U280 can go up to 250 W and allows for the full 15 W power consumption of the HBM.

Can you fit 30 kernels on the HBM in the U280? Well, that depends on your kernels. There is a finite amount of routing and logic resources, and the area around the HBM is especially congested.

I think the design methodology is to determine what you are trying to accelerate, see how much of it you can accelerate right off the bat, then analyze it with the Vitis tools, find the system bottleneck, and accelerate that. Starting by trying to answer the question "Will all the HBM ports get utilized?" might be a more difficult path to accelerating an application.

At least that's how I see it, and I am interested in your thoughts as well. 

Regards,

M

Observer

I checked that putting the loop_flatten off pragma on loops 12, 20, 31, and 41 solves the problem, and the design can now be built at 16 channels + 300 MHz. Routing still fails at 24 channels + 300 MHz.

I am not sure why disabling loop flattening solves the problem, though - these seem to be simple pairs of perfectly nested loops.
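
For reference, the pragma placement looks roughly like this (an illustrative sketch only - the loop labels, bounds, and body are made up and do not correspond to the actual loops 12/20/31/41 in vadd.cpp):

// Hypothetical perfectly nested loop pair; the pragma sits inside the
// nest and tells HLS not to flatten it into a single loop.
extern "C" void vadd(const int *in, int *out, int rows, int cols) {
row_loop:
    for (int i = 0; i < rows; i++) {
col_loop:
        for (int j = 0; j < cols; j++) {
#pragma HLS loop_flatten off
            out[i * cols + j] = in[i * cols + j] + 1;
        }
    }
}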
