puya
Contributor
986 Views
Registered: ‎11-20-2018

Alveo U280 crashes after a while

Hi,

My Alveo U280 card works fine for a while (about 2 days, without loading any new xclbin), but then it crashes and xbutil validate shows the following messages:

INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting Kernel version check: 
WARNING: Kernel verison 5.4.0-47-generic is not officially supported. 5.3.0 is the latest supported version
WARN: == Kernel version check PASSED with warning
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen1x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check: 
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test: 
XRT build version: 2.7.766
Build hash: 19bc791a7d9b54ecc23644649c3ea2c2ea31821c
Build date: 2020-08-17 16:52:05
Git branch: 2020.1_PU1
PID: 7285
UID: 1000
[Tue Oct 13 14:27:48 2020 GMT]
HOST: biest
EXE: /opt/xilinx/xrt/bin/unwrapped/xbutil
[XRT] ERROR: Can't reach out to mgmt for xclbin downloading
[XRT] ERROR: Is xclmgmt driver loaded? Or is MSD/MPD running?
[XRT] ERROR: See dmesg log for details. err=-110
ERROR: Failed to download xclbin: verify.xclbin
ERROR: == verify kernel test FAILED
INFO: Card[0] failed to validate.

and dmesg reports:

[515647.611327] xocl 0000:09:00.1:  ffff907fd6d030b0 xocl_init_mem: ret 0
[515647.611328] xocl 0000:09:00.1:  ffff907fd6d030b0 xocl_read_axlf_helper: Failed to download xclbin, err: -110
[515647.612848] [drm] client exits pid(7285)
[515647.612850] xocl 0000:09:00.1:  ffff907fd6d030b0 xocl_drvinst_close: CLOSE 2
[515647.612851] xocl 0000:09:00.1:  ffff907fd6d030b0 xocl_drvinst_close: NOTIFY 0000000057b2fe5e
[515650.333401] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515650.333406] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515650.333408] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515650.333409] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515650.333419] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515650.333422] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515650.333424] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515650.333427] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515650.333430] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515650.333431] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515655.453286] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515655.453291] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515655.453293] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515655.453294] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515655.453304] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515655.453307] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515655.453309] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515655.453312] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515655.453314] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515655.453315] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515660.573174] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515660.573181] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515660.573184] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515660.573186] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515660.573198] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515660.573202] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515660.573206] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515660.573208] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515660.573212] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515660.573213] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515665.693040] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515665.693046] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515665.693048] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515665.693049] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515665.693060] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515665.693063] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515665.693065] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515665.693068] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515665.693070] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515665.693071] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515670.813075] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515670.813081] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515670.813083] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515670.813085] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515670.813095] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515670.813097] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515670.813100] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515670.813102] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515670.813105] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515670.813106] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515675.932822] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515675.932827] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515675.932829] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515675.932831] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515675.932840] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515675.932843] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515675.932845] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515675.932848] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515675.932850] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515675.932851] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset
[515681.052687] xclmgmt 0000:09:00.0: check_temp_within_range: Warning: A Xilinx acceleration device is reporting a temperature of -1C. There is a card shutdown limit if the device hits 97C. Please keep the device below 88C.
[515681.052693] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515681.052694] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515681.052696] xclmgmt 0000:09:00.0: check_volt_within_range: Voltage outside normal range (500-2500)mV 65535.
[515681.052705] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: AXI Firewall 0 tripped, status: 0xfffefffe, bar offset 0xd0000, resource firewall.m.12582912
[515681.052708] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 0, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xd0000
[515681.052711] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 1, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe0000
[515681.052713] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 2, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xe1000
[515681.052716] xclmgmt 0000:09:00.0: firewall.m.12582912 ffff907fc7ae5410 check_firewall: Firewall 3, ep firewall.m.12582912, status: 0xffffffff, bar offset 0xf0000
[515681.052717] xclmgmt 0000:09:00.0: health_check_cb: Card requires pci hot reset

Any idea what the problem is? The card is actively cooled. What does -1C mean? Is it too cold or too hot? Are the voltages out of range because of the temperature, or is the temperature out of range because of wrong voltages? The card is installed in a PCIe 3.0 x16 slot, but the report shows the wrong value (Gen1x16) after the crash. A restart always solves the problem, but we need a stable system.
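The Gen1x16 warning can be cross-checked independently of xbutil by reading the link attributes that Linux exposes in sysfs. The following is a minimal sketch, not an official Xilinx tool: the helper names are made up for illustration, the device address 0000:09:00.0 is taken from the dmesg output above, and the exact text of the speed strings varies between kernel versions, so the parser only looks at the leading number.

```python
from pathlib import Path

# GT/s per lane for each PCIe generation
GEN_BY_GTS = {2.5: 1, 5.0: 2, 8.0: 3, 16.0: 4, 32.0: 5}

LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def pcie_generation(speed_text):
    """Map a sysfs speed string such as '8 GT/s' to a PCIe generation, or None."""
    try:
        gts = float(speed_text.split()[0])
    except (ValueError, IndexError):
        return None
    return GEN_BY_GTS.get(gts)


def link_report(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the link attributes of one PCI function; missing entries become None."""
    dev = Path(sysfs_root) / bdf
    report = {}
    for attr in LINK_ATTRS:
        try:
            report[attr] = (dev / attr).read_text().strip()
        except OSError:
            report[attr] = None
    return report


if __name__ == "__main__":
    # Address assumed from the dmesg lines in this post (management function).
    report = link_report("0000:09:00.0")
    for attr, value in report.items():
        print(f"{attr}: {value}")
    current = pcie_generation(report["current_link_speed"] or "")
    maximum = pcie_generation(report["max_link_speed"] or "")
    if current and maximum and current < maximum:
        print(f"Link is running at Gen{current}, below its Gen{maximum} maximum")
```

If the card has stopped responding, these attributes may themselves be missing or unreliable, so a Gen1 reading after the crash does not necessarily mean the slot negotiated Gen1 at boot.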

6 Replies
emeryw
Xilinx Employee
935 Views
Registered: ‎12-06-2019

Hi @puya ,

Can you please send some sample outputs of "xbutil query" and "lspci -vd 10ee:"? Has the card always acted like this, or is this a new behavior it has developed? How long have you had the card?

The values for voltage and temperature would seem to indicate max values or overflow, but I also see XRT driver errors in the same output, and the dmesg log shows the firewall being tripped, so we may have multiple issues going on. Is there something that can reproduce this consistently, or does it only occur after the system has been on for a while? Can you attach a whole dmesg log as soon as this issue comes up?
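To illustrate the "max values or overflow" reading: -1, 65535 and 0xffffffff are all the "every bit set" pattern at different widths, which is what register reads commonly return when a PCIe device no longer responds. The sketch below only classifies the values quoted from the logs; the helper name is hypothetical, and treating all-ones as "device unreachable" is a general rule of thumb, not a confirmed diagnosis for this card.

```python
def all_ones_width(value, widths=(8, 16, 32, 64)):
    """Return the smallest width at which `value` is all ones, else None."""
    if value == -1:
        # -1 is all ones in two's complement at every width
        return widths[0]
    for width in widths:
        if value == (1 << width) - 1:
            return width
    return None


# Values copied from the xclmgmt lines in the dmesg output
readings = {
    "temperature (C)": -1,
    "voltage (mV)": 65535,
    "firewall status": 0xFFFFFFFF,
    "firewall 0 tripped status": 0xFFFEFFFE,
}

for name, value in readings.items():
    width = all_ones_width(value)
    verdict = f"all ones at {width} bits" if width else "not all ones"
    print(f"{name}: {value} -> {verdict}")
```

Under that reading, the temperature and voltage values would be invalid reads and carry no information about the actual thermal or power state of the card.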

Do you have the aux power connector plugged into the card? Can you share any details about the host system?

Have you tried reseating the card or moving to another slot or another system?

Best,

-Emery

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------

anatoli
Moderator
837 Views
Registered: ‎06-14-2010

Hello @puya ,

Based on your error (i.e. [XRT] ERROR: Can't reach out to mgmt for xclbin downloading), it appears that for some reason the xclmgmt driver is not loaded correctly. Please use the following commands to reload the drivers:

$ sudo rmmod xocl
$ sudo rmmod xclmgmt
$ sudo modprobe xocl
$ sudo modprobe xclmgmt
$ xbutil validate
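Before running xbutil validate again, it can help to confirm that both modules actually came back. A minimal sketch, assuming a Linux host where /proc/modules is readable; the function names are illustrative and not part of XRT:

```python
import os

REQUIRED = ("xocl", "xclmgmt")


def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[0])  # first column is the module name
    return names


def missing_drivers(proc_modules_text, required=REQUIRED):
    """Return the required module names that are not currently loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__" and os.path.exists("/proc/modules"):
    with open("/proc/modules") as handle:
        missing = missing_drivers(handle.read())
    if missing:
        print("Not loaded:", ", ".join(missing))
    else:
        print("xocl and xclmgmt are both loaded")
```

A loaded module only shows that the driver is present; it does not show that the driver can still reach the card.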

As you've already tested, alternatively, a warm reboot will fix this too.

A similar issue is described here: https://forums.xilinx.com/t5/Alveo-Accelerator-Cards/setup-issue-of-alveo-U250-with-quot-Getting-Started-with-Alveo/td-p/1056868

Hope this helps.

Kind Regards,
Anatoli Curran,
Xilinx Technical Support

puya
Contributor
761 Views
Registered: ‎11-20-2018

Hi @emeryw and @anatoli 

It took a while until the card crashed again and I could provide the details you requested. I should mention that this time, after running several xclbins, one of them hung, and xbutil validate again reported that the card is installed in a Gen1x16 slot, which is not true. Later on, I left the system idle, and after about one week the card became unreachable with the same dmesg report. I attached the output of the commands you asked for, both before and after the crash. I also did what @anatoli suggested, but it made no difference.
Here are the answers to your questions:

"Has the card always acted like this, or is this a new behavior its developed?" - We cannot answer this clearly, as this started to happen after we began to tickle the card with more compute-intensive applications.

"How long have you had the card?" - About half a year, don't know the exact date.

"Do you have the aux power connector plugged into the card?" - Yes.

"Can you share any details about the host system?" - The card is plugged into the correct PCIe slot on the X570 AORUS PRO motherboard. It has 96GB memory and an AMD Ryzen 7 3700X 8-Core processor.

"Have you tried reseating the card or moving to another slot or another system?" - No, we did not try that. The system the card is plugged into has only one compatible PCIe slot. Other systems where we could test the card are currently occupied, so it would be quite a hassle to transfer the card to another system to test things out. Of course, if all else fails, we can try that.

 

emeryw
Xilinx Employee
708 Views
Registered: ‎12-06-2019

Hi @puya ,

Just to confirm, the card got into this weird state after sitting idle for about a week?

It seems that unloading and reloading the drivers doesn't correct the issue. The dmesg output reports a tripped firewall; are you able to do a reset of the card:

xbutil reset

Or does it require a full reboot to bring it back?

xbutil query shows this to be an actively cooled card, and temps at the time of reporting all look good. Can you please check the following:

Does the system have Spread Spectrum enabled in the BIOS?

Does the card still get into this state if the aux power is not connected?

Can you please send along an xbutil query with the card under one of the more intense loads?

Does the system have a power supply capable of supplying adequate power to the system while under heavy load?

 

Thanks for your help.

Best,

-Emery

puya
Contributor
685 Views
Registered: ‎11-20-2018

Hi @emeryw

Yes, the card got into this state after sitting idle for about a week.
Once the card is in this state, it cannot be recovered by unloading and reloading the drivers or by xbutil reset. The best result I can get with these commands is the following:

 

INFO: Found 1 cards

INFO: Validating card[0]:
INFO: == Starting Kernel version check:
INFO: == Kernel version check PASSED
INFO: == Starting AUX power connector check:
AUX power connector not available. Skipping validation
INFO: == AUX power connector check SKIPPED
INFO: == Starting PCIE link check:
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen1x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check:
Failed to open /sys/bus/pci/devices/0000:09:00.1/xmc.u.14680064/bmc_ver for reading: No such file or directory

ERROR: == SC firmware version check FAILED
INFO: Card[0] failed to validate.

ERROR: Some cards failed to validate.

 

And it does require a full cold reset to bring it back again.

At the moment I don't have direct access to the card and workstation, so I cannot answer your questions quickly. My only access is via ssh to the remote server. I will try to get direct access to the hardware and report back with all the details you need.

 

emeryw
Xilinx Employee
648 Views
Registered: ‎12-06-2019

Hi @puya ,

Thanks for the info. One more question: in the first post, xbutil validate warned about the kernel version, but in the latest output the kernel version check passes without a warning. Was the kernel version rolled back?

Did you move to kernel version 5 recently? 

Please keep us posted once you are able to gain access to the machine and can provide the details requested previously. Thanks again.

Best,

-Emery
