kkvasan
Visitor
556 Views
Registered: ‎12-16-2019

Launching the run on U280 reboots the server

Hi All, 

Issues related to this have been raised previously in the following threads, but I couldn't solve the problem with them.

https://forums.xilinx.com/t5/Alveo-Accelerator-Cards/kernel-on-U280-freezes-Linux/td-p/1055662

https://forums.xilinx.com/t5/Alveo-Accelerator-Cards/linux-hangs-when-starting-kernel-on-Alveo-U280/td-p/1048903

Sometimes the reboot happens just after programming the bitstream; in other cases the launched run completes and produces results, but the server reboots a few seconds later. I have had this issue for a long time, and every time I work around it by setting a lower frequency in Vitis:

  • Simple designs that consume few device resources and are implemented at the Vitis default frequency of 300 MHz work fine.
  • Scaling the design by instantiating multiple compute units reduces the implemented design's operating frequency (usually to 250-300 MHz), and such designs are likely to hit this issue.
  • When I further reduce the operating frequency to 200 MHz with the --kernel_frequency option in Vitis, the implemented design works fine, but its performance is severely limited by the lower operating frequency.
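
For readers who want to reproduce the 200 MHz workaround, here is a minimal sketch of how the link command could be assembled. The platform name is taken from the xbutil output in this post; the file names `kernel.xo` and `design.xclbin` and the helper `vpp_link_command` are hypothetical, and the poster's actual build command is not shown in the thread.

```python
import shlex


def vpp_link_command(platform, xo_files, output, kernel_mhz):
    """Build a v++ link command line that requests a kernel clock in MHz.

    The helper and its arguments are illustrative; only the v++ options
    themselves (--link, --target, --platform, --kernel_frequency, --output)
    come from the Vitis compiler.
    """
    args = [
        "v++", "--link", "--target", "hw",
        "--platform", platform,
        "--kernel_frequency", str(kernel_mhz),
        "--output", output,
    ]
    args.extend(xo_files)
    # Quote each argument so the result can be pasted into a shell.
    return " ".join(shlex.quote(a) for a in args)


if __name__ == "__main__":
    # Hypothetical file names; replace with the real kernel objects.
    print(vpp_link_command("xilinx_u280_xdma_201920_3",
                           ["kernel.xo"], "design.xclbin", 200))
```

The requested value is only a target: the tools may still close timing at a lower frequency than the one requested.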

The system and XRT information is as follows.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:        Linux
Release:        4.15.0-132-generic
Version:        #136-Ubuntu SMP Tue Jan 12 14:58:42 UTC 2021
Machine:        x86_64
Model:          PowerEdge T640
CPU cores:      48
Memory:         257637 MB
Glibc:          2.27
Distribution:   Ubuntu 18.04.3 LTS
Now:            Mon Jan 25 21:54:28 2021
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:        2.5.309
Git Hash:       9a03790c11f066a5597b133db737cf4683ad84c8
Git Branch:     2019.2_PU2
Build Date:     2020-02-23 18:52:05
XOCL:           2.5.309,9a03790c11f066a5597b133db737cf4683ad84c8
XCLMGMT:        2.5.309,9a03790c11f066a5597b133db737cf4683ad84c8
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 [0] 0000:3b:00.1 xilinx_u280_xdma_201920_3(ID=0x5e278820) user(inst=128)


I checked for the AXI firewall trip issues mentioned in the threads above using lapc, but xbutil status didn't report any problems.
Any kind of help is appreciated! Thanks in advance.

Regards,
Vasan

5 Replies
dsakjl
Voyager
477 Views
Registered: ‎07-20-2018

kkvasan
Visitor
435 Views
Registered: ‎12-16-2019

Hi @dsakjl,
Thanks a lot for your suggestion.
I tried disabling fatal error reporting on the PCI port that connects the U280.
Now I no longer observe the reboot after FPGA configuration, and the produced results match the golden reference.
But it hangs when I launch the run again; basically, I can run only once after manually rebooting the server.
When I check /var/log/syslog, health_check reports a weird temperature such as (-1) and says a hot reset is required.
All the temperature values in the xbutil query appear as "NA".

Many Thanks,
Vasan
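
The post above does not say how fatal error reporting was disabled. One way this is sometimes done, shown here only as a sketch under that assumption, is to clear the SERR# Enable bit in the PCI Command register and the Fatal Error Reporting Enable bit in the PCIe Device Control register of the upstream port, using `setpci` with a value:mask pair. The helper name `setpci_commands` and the bridge address `0000:3a:00.0` are hypothetical; the real upstream port has to be looked up (for example with `lspci -t`).

```python
# Bit positions defined by the PCI / PCI Express specifications.
SERR_ENABLE = 0x0100              # PCI Command register, bit 8
FATAL_ERR_REPORTING_EN = 0x0004   # PCIe Device Control register, bit 2


def clear_bits(reg_value, mask):
    """Return a 16-bit register value with the bits in `mask` cleared."""
    return reg_value & ~mask & 0xFFFF


def setpci_commands(bridge_bdf):
    """Build setpci invocations that clear the two enable bits.

    `value:mask` tells setpci to change only the masked bits, so the
    rest of each register is left untouched. Running them needs root.
    """
    return [
        f"setpci -s {bridge_bdf} COMMAND=0000:{SERR_ENABLE:04x}",
        f"setpci -s {bridge_bdf} CAP_EXP+8.w=0000:{FATAL_ERR_REPORTING_EN:04x}",
    ]


if __name__ == "__main__":
    # Hypothetical bridge address, not taken from the thread.
    for cmd in setpci_commands("0000:3a:00.0"):
        print(cmd)
```

These settings do not persist across a reboot, and masking the report hides the symptom rather than the underlying PCIe error.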

dsakjl
Voyager
395 Views
Registered: ‎07-20-2018

Hi @kkvasan ,

Could you please list the steps you take to set up your card and to load and launch the bitstream?

Regards.

kkvasan
Visitor
373 Views
Registered: ‎12-16-2019

Hi @dsakjl,
The server was already set up when I got it, but I ran the validation tests and it passes all of them.
I am using the Vitis flow and configuring the FPGA through the host program.

Many Thanks,
Vasan

dsakjl
Voyager
349 Views
Registered: ‎07-20-2018

Hi @kkvasan ,

Typically we need to disable error reporting before flashing a bitstream on the full board and then rebooting.

However, the fact that disabling error reporting helped you is probably the first step toward tracking down the issue.

I suggest you try to catch which error the PCIe bus reports on flashing.

Regards.
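
As a sketch of what catching the error could look like: if the port supports Advanced Error Reporting, its Uncorrectable Error Status register (shown on the `UESta` line of `lspci -vv` when run as root) records which error fired. The decoder below maps a raw register value to the error names defined in the PCIe specification. The function name and the sample value are illustrative and do not come from this thread.

```python
# Bit positions in the AER Uncorrectable Error Status register,
# as defined by the PCI Express Base Specification.
UNCORRECTABLE_ERROR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}


def decode_uncorrectable_status(value):
    """Return the names of all error bits set in a 32-bit status value."""
    return [
        name
        for bit, name in sorted(UNCORRECTABLE_ERROR_BITS.items())
        if value & (1 << bit)
    ]


if __name__ == "__main__":
    # Illustrative value only: bits 14 and 20 set.
    print(decode_uncorrectable_status(0x00104000))
```

Comparing the decoded status before and after programming the bitstream would show whether the reboot is triggered by, for example, a completion timeout or a surprise link-down event.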

 
