Observer
Registered: ‎10-06-2016

U280 validate fails, even after fresh power cycle

After 2-3 months of fairly stable operation of an Alveo U280 board, we now see problems that persist even after a full power cycle of the machine.

We observed two failure modes:
1) validate gets stuck during bandwidth test
2) xbmgmt flash --scan reports ERROR: XMC is not ready: 0x3 and validate gets stuck at verify kernel test

I should mention that, in contrast to the installation guide, XRT is freshly installed after every boot of the machine; it is currently not part of the image we boot from. However, this setup has worked fine so far.

Below are more details for both cases. The outputs were collected with the following script snippet:

   PREFIX="run_${i}"
   dmesg > ${PREFIX}_0_dmesg.out
   sudo lspci -vd 10ee: > ${PREFIX}_1_lspci.out
   /opt/xilinx/xrt/bin/xbutil scan > ${PREFIX}_2_scan.out
   sudo /opt/xilinx/xrt/bin/xbmgmt flash --scan > ${PREFIX}_3_flash_scan.out
   timeout 300s /opt/xilinx/xrt/bin/xbutil validate > ${PREFIX}_4_validate.out
   dmesg > ${PREFIX}_5_dmesg.out
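
Since the snippet above does not record whether validate finished or was killed by timeout, a small wrapper could label each run. This is only a sketch: run_with_limit and summary.txt are hypothetical names, not part of XRT, and it assumes GNU timeout, which exits with status 124 when the limit is reached.

```shell
#!/usr/bin/env bash
# Sketch: run a command under a time limit and print a label for the outcome.
# run_with_limit is a hypothetical helper, not an XRT tool.
run_with_limit() {
    local limit="$1" outfile="$2" status=0
    shift 2
    # Capture stdout and stderr of the command in the given file.
    timeout "$limit" "$@" > "$outfile" 2>&1 || status=$?
    if [ "$status" -eq 124 ]; then
        echo "TIMEOUT"          # GNU timeout reports 124 when the limit is hit
    elif [ "$status" -eq 0 ]; then
        echo "OK"
    else
        echo "FAIL($status)"
    fi
}

# Possible use with the validate call from the snippet:
#   result=$(run_with_limit 300s "${PREFIX}_4_validate.out" /opt/xilinx/xrt/bin/xbutil validate)
#   echo "run ${i}: ${result}" >> summary.txt
```

With a label per run, hangs can be told apart from outright failures without opening each output file.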

Case 1) 

cat powercycle4/run_2_1_lspci.out 
16:00.0 Processing accelerators: Xilinx Corporation Device 500c
	Subsystem: Xilinx Corporation Device 000e
	Flags: bus master, fast devsel, latency 0, NUMA node 0
	Memory at 387ff2000000 (64-bit, prefetchable) [size=32M]
	Memory at 387ff4000000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1c0] #19
	Capabilities: [e00] Access Control Services
	Capabilities: [e10] #15
	Kernel driver in use: xclmgmt
	Kernel modules: xclmgmt

16:00.1 Processing accelerators: Xilinx Corporation Device 500d
	Subsystem: Xilinx Corporation Device 000e
	Flags: bus master, fast devsel, latency 0, IRQ 331, NUMA node 0
	Memory at 387ff0000000 (64-bit, prefetchable) [size=32M]
	Memory at 387ff4020000 (64-bit, prefetchable) [size=64K]
	Memory at 387fe0000000 (64-bit, prefetchable) [size=256M]
	Capabilities: [40] Power Management version 3
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [e00] Access Control Services
	Capabilities: [e10] #15
	Kernel driver in use: xocl
	Kernel modules: xocl
cat powercycle4/run_2_2_scan.out 
INFO: Found total 1 card(s), 1 are usable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:	Linux
Release:	3.10.0-1062.4.1.el7.x86_64
Version:	#1 SMP Fri Oct 18 17:15:30 UTC 2019
Machine:	x86_64
Glibc:		2.17
Distribution:	CentOS Linux 7 (Core)
Now:		Wed Mar  4 17:38:42 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:	2.3.1301
Git Hash:	192e706aea53163a04c574f9b3fe9ed76b6ca471
Git Branch:	2019.2
Build Date:	2019-10-25 03:04:42
XOCL:		2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471
XCLMGMT:	2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 [0] 0000:16:00.1 xilinx_u280_xdma_201920_1(ts=0x5da8da6e) user(inst=128)
cat powercycle4/run_2_3_flash_scan.out 
Card [0000:16:00.0]
    Card type:		u280
    Flash type:		SPI
    Flashable partition running on FPGA:
        xilinx_u280_xdma_201920_1,[ID=0x000000005da8da6e],[SC=4.2.0]
    Flashable partitions installed in system:	
        xilinx_u280_xdma_201920_1,[ID=0x000000005da8da6e],[SC=4.3.4]
cat powercycle4/run_2_4_validate.out   
INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_1
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
INFO: == PCIE link check PASSED
INFO: == Starting verify kernel test: 
INFO: == verify kernel test PASSED
INFO: == Starting DMA test: 
Buffer Size: 256 MB
Host -> PCIe -> FPGA write bandwidth = 11849.1 MB/s
Host <- PCIe <- FPGA read bandwidth = 12105.7 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test: 
..........................................................

The test was killed after 5 minutes here (I have also tried waiting much longer, to no avail). Previous successful runs completed in around 3 minutes.

I'd like to attach the output of dmesg, but the forum currently doesn't let me upload any files, so here is an excerpt of the relevant time frame:

[  656.913991] xocl 0000:16:00.1: xocl_init_mem: Found a new memory region
[  656.913996] xocl 0000:16:00.1: xocl_init_mem: drm_mm_init called
[  656.913997] xocl 0000:16:00.1: xocl_init_mem: Allocating Memory Bank: HBM[31]
[  656.913998] xocl 0000:16:00.1: xocl_init_mem:   base_addr:0x1f0000000, total size:0x10000000
[  656.913999] xocl 0000:16:00.1: xocl_init_mem: Found a new memory region
[  656.914005] xocl 0000:16:00.1: xocl_init_mem: drm_mm_init called
[  656.914006] xocl 0000:16:00.1: xocl_init_mem: Allocating Memory Bank: DDR[0]
[  656.914007] xocl 0000:16:00.1: xocl_init_mem:   base_addr:0x4000000000, total size:0x400000000
[  656.914008] xocl 0000:16:00.1: xocl_init_mem: Found a new memory region
[  656.914011] xocl 0000:16:00.1: xocl_init_mem: drm_mm_init called
[  656.914012] xocl 0000:16:00.1: xocl_init_mem: Allocating Memory Bank: DDR[1]
[  656.914013] xocl 0000:16:00.1: xocl_init_mem:   base_addr:0x8000000000, total size:0x400000000
[  656.914014] xocl 0000:16:00.1: xocl_init_mem: Found a new memory region
[  656.914017] xocl 0000:16:00.1: xocl_init_mem: drm_mm_init called
[  656.914019] xocl 0000:16:00.1: xocl_read_axlf_helper: Loaded xclbin 454483ed-ad89-4e1f-9bb8-a7b58f838784
[  656.914087] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485760: No such file or directory
[  656.914272] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485761: No such file or directory
[  656.914453] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485762: No such file or directory
[  656.914635] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485763: No such file or directory
[  656.914826] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485764: No such file or directory
[  656.915006] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485765: No such file or directory
[  656.915191] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485766: No such file or directory
[  656.915374] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485767: No such file or directory
[  656.915437] icap.u icap.u.15728640: icap_lock_bitstream: bitstream 454483ed-ad89-4e1f-9bb8-a7b58f838784 locked, ref=1
[  656.915440] xocl 0000:16:00.1: exec_reset: exec_reset(0) cfg(0)
[  656.915441] xocl 0000:16:00.1: exec_reset: exec_reset resets
[  656.915442] xocl 0000:16:00.1: exec_reset: exec->xclbin(62fcdc33-e895-48ee-8124-f025f7fbc6bb),xclbin(454483ed-ad89-4e1f-9bb8-a7b58f838784)
[  656.915445] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51854, cu_idx 0xffffffff) = 0, ctx=1
[  656.915456] xocl 0000:16:00.1: exec_cfg_cmd: ert per feature rom = 1
[  656.915457] xocl 0000:16:00.1: exec_cfg_cmd: dsa52 = 1
[  656.915460] xocl 0000:16:00.1: cu_reset: configured cu(0) base@0x1800000 poll@0x          (null) control(0) ctx(0)
[  656.915462] xocl 0000:16:00.1: cu_reset: configured cu(1) base@0x1810000 poll@0x          (null) control(0) ctx(0)
[  656.915463] xocl 0000:16:00.1: exec_cfg_cmd: configuring embedded scheduler mode
[  656.915465] xocl 0000:16:00.1: exec_cfg_cmd: scheduler config ert(1), dataflow(0), slots(16), cudma(1), cuisr(0), cdma(0), cus(2)
[  656.915499] icap.u icap.u.15728640: icap_unlock_bitstream: bitstream 454483ed-ad89-4e1f-9bb8-a7b58f838784 unlocked, ref=0
[  656.915500] xocl 0000:16:00.1: exec_stop: exec_stop(ffff92eae03b5018)
[  656.915506] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX del(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51854, cu_idx 0xffffffff) = 0, ctx=0
[  656.915555] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485768: No such file or directory
[  656.915735] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485769: No such file or directory
[  656.915929] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485770: No such file or directory
[  656.916109] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485771: No such file or directory
[  656.916292] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485772: No such file or directory
[  656.916471] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485773: No such file or directory
[  656.916653] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485774: No such file or directory
[  656.916881] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485775: No such file or directory
[  656.917076] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485776: No such file or directory
[  656.917256] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485777: No such file or directory
[  656.917441] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485778: No such file or directory
[  656.917619] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485779: No such file or directory
[  656.917800] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485780: No such file or directory
[  656.917984] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485781: No such file or directory
[  656.918163] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485782: No such file or directory
[  656.918339] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485783: No such file or directory
[  656.918520] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485784: No such file or directory
[  656.918702] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485785: No such file or directory
[  656.918890] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485786: No such file or directory
[  656.919070] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485787: No such file or directory
[  656.919252] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485788: No such file or directory
[  656.919442] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485789: No such file or directory
[  656.919628] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485790: No such file or directory
[  656.919807] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485791: No such file or directory
[  656.920001] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485792: No such file or directory
[  656.920186] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mig.u.10485793: No such file or directory
[  656.920327] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  656.920373] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  656.920419] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.u.13631488: No such file or directory
[  656.920466] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  656.920589] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  656.935893] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  656.936076] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  656.936386] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  656.936478] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  657.154382] xocl 0000:16:00.1: _xocl_drvinst_open: OPEN 2
[  657.154388] [drm] creating scheduler client for pid(51881), ret: 0
[  657.155240] xmc.u xmc.u.11534336: xmc_read_from_peer: reading from peer
[  657.155249] mailbox.u mailbox.u.13631488: mailbox_request: sending request: 10 via HW
[  657.155348] mailbox.m mailbox.m.13631488: process_request: received request from peer: 10, passed on
[  657.155350] xclmgmt 0000:16:00.0: xclmgmt_read_subdev_req: req kind 0
[  657.155409] mailbox.m mailbox.m.13631488: mailbox_post_response: posting response for: 10 via HW
[  657.155418] systemd-journald[1151]: no db file to read /run/udev/data/+platform:xmc.u.11534336: No such file or directory
[  657.155477] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.u.13631488: No such file or directory
[  657.155527] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  657.155644] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  657.213449] xocl 0000:16:00.1: xocl_read_axlf_helper: xclbin is already downloaded
[  657.213453] xocl 0000:16:00.1: xocl_read_axlf_helper: Loaded xclbin 454483ed-ad89-4e1f-9bb8-a7b58f838784
[  657.213516] icap.u icap.u.15728640: icap_lock_bitstream: bitstream 454483ed-ad89-4e1f-9bb8-a7b58f838784 locked, ref=1
[  657.213518] xocl 0000:16:00.1: exec_reset: exec_reset(0) cfg(1)
[  657.213521] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51881, cu_idx 0xffffffff) = 0, ctx=1
[  657.213533] [drm] command scheduler is already configured for this device
[  657.213543] icap.u icap.u.15728640: icap_unlock_bitstream: bitstream 454483ed-ad89-4e1f-9bb8-a7b58f838784 unlocked, ref=0
[  657.213545] xocl 0000:16:00.1: exec_stop: exec_stop(ffff92eae03b5018)
[  657.213551] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX del(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51881, cu_idx 0xffffffff) = 0, ctx=0
[  657.214412] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  657.214555] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  657.214643] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  657.214774] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  657.216802] icap.u icap.u.15728640: icap_lock_bitstream: bitstream 454483ed-ad89-4e1f-9bb8-a7b58f838784 locked, ref=1
[  657.216804] xocl 0000:16:00.1: exec_reset: exec_reset(0) cfg(1)
[  657.216806] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51881, cu_idx 0xffffffff) = 0, ctx=1
[  657.217144] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  657.217239] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  661.122807] icap.u icap.u.15728640: icap_read_from_peer: reading from peer
[  661.122820] mailbox.u mailbox.u.13631488: mailbox_request: sending request: 10 via HW
[  661.122937] mailbox.m mailbox.m.13631488: process_request: received request from peer: 10, passed on
[  661.122941] xclmgmt 0000:16:00.0: xclmgmt_read_subdev_req: req kind 1
[  661.123030] systemd-journald[1151]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  661.123116] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.u.13631488: No such file or directory
[  661.123168] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  661.125981] mailbox.m mailbox.m.13631488: mailbox_post_response: posting response for: 10 via HW
[  661.127000] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  661.131041] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51881, cu_idx 0x0) = 0, ctx=2
[  661.131958] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  661.136149] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(454483ed-ad89-4e1f-9bb8-a7b58f838784, pid 51881, cu_idx 0x1) = 0, ctx=3
[  661.136976] systemd-journald[1151]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  725.178482] systemd-journald[1151]: Sent WATCHDOG=1 notification.
[  827.569994] systemd-journald[1151]: Sent WATCHDOG=1 notification.
[  898.929841] systemd-journald[1151]: Sent WATCHDOG=1 notification.
[  898.941881] systemd-journald[1151]: Successfully sent stream file descriptor to service manager.
[  946.966457] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  947.465916] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  947.966671] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  948.467390] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  948.967503] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  949.468140] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  949.968962] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  950.468692] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  950.969609] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  951.470432] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  951.971254] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  952.472067] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  952.972899] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  953.473179] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  953.973282] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  954.472309] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  954.973197] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  955.474019] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  955.974842] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  956.475664] xocl 0000:16:00.1: destroy_client: pid(51881) waiting for 1 outstanding execs to finish
[  956.976487] xocl 0000:16:00.1: destroy_client: pid(51881) gives up with 1 outstanding execs.
[  956.976488] xocl 0000:16:00.1: destroy_client: Please reset device with 'xbutil reset'
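
The give-up message at the end of this excerpt is easy to miss among the journald noise. A minimal sketch for pulling it out of a saved dmesg capture follows; hung_exec_pids is a hypothetical helper name, and the pattern is simply taken from the log lines above, so it may not match other XRT versions.

```shell
#!/usr/bin/env bash
# Sketch: list the PIDs for which xocl gave up waiting on outstanding execs.
# hung_exec_pids is a hypothetical helper; the pattern mirrors the
# "destroy_client: pid(N) gives up with ..." lines in the excerpt.
hung_exec_pids() {
    sed -n 's/.*destroy_client: pid(\([0-9][0-9]*\)) gives up with .*/\1/p' "$1" | sort -u
}

# Possible use on a capture from the script snippet:
#   hung_exec_pids "${PREFIX}_5_dmesg.out"
```

An empty result means the capture contains no give-up message; a non-empty one is the point where the driver itself suggests 'xbutil reset'.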

 

Case 2) After a fresh power cycle

cat powercycle5_forum_case2/run_1_1_lspci.out 
16:00.0 Processing accelerators: Xilinx Corporation Device 500c
	Subsystem: Xilinx Corporation Device 000e
	Flags: bus master, fast devsel, latency 0, NUMA node 0
	Memory at 387ff2000000 (64-bit, prefetchable) [size=32M]
	Memory at 387ff4000000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1c0] #19
	Capabilities: [e00] Access Control Services
	Capabilities: [e10] #15
	Kernel driver in use: xclmgmt
	Kernel modules: xclmgmt

16:00.1 Processing accelerators: Xilinx Corporation Device 500d
	Subsystem: Xilinx Corporation Device 000e
	Flags: bus master, fast devsel, latency 0, IRQ 331, NUMA node 0
	Memory at 387ff0000000 (64-bit, prefetchable) [size=32M]
	Memory at 387ff4020000 (64-bit, prefetchable) [size=64K]
	Memory at 387fe0000000 (64-bit, prefetchable) [size=256M]
	Capabilities: [40] Power Management version 3
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
	Capabilities: [70] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [e00] Access Control Services
	Capabilities: [e10] #15
	Kernel driver in use: xocl
	Kernel modules: xocl
cat powercycle5_forum_case2/run_1_2_scan.out  
INFO: Found total 1 card(s), 1 are usable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:	Linux
Release:	3.10.0-1062.4.1.el7.x86_64
Version:	#1 SMP Fri Oct 18 17:15:30 UTC 2019
Machine:	x86_64
Glibc:		2.17
Distribution:	CentOS Linux 7 (Core)
Now:		Wed Mar  4 17:55:08 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:	2.3.1301
Git Hash:	192e706aea53163a04c574f9b3fe9ed76b6ca471
Git Branch:	2019.2
Build Date:	2019-10-25 03:04:42
XOCL:		2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471
XCLMGMT:	2.3.1301,192e706aea53163a04c574f9b3fe9ed76b6ca471
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 [0] 0000:16:00.1 xilinx_u280_xdma_201920_1(ts=0x5da8da6e) user(inst=128)

 Here the first problems show up with xbmgmt flash --scan:

cat powercycle5_forum_case2/run_1_3_flash_scan.out 
ERROR: XMC is not ready: 0x3
ERROR: XMC is not ready: 0x3
ERROR: XMC is not ready: 0x3
Card [0000:16:00.0]
    Card type:		u280
    Flash type:		SPI
    Flashable partition running on FPGA:
        xilinx_u280_xdma_201920_1,[ID=0x000000005da8da6e],[SC=UNKNOWN]
    Flashable partitions installed in system:	
        xilinx_u280_xdma_201920_1,[ID=0x000000005da8da6e],[SC=4.3.4]
cat powercycle5_forum_case2/run_1_4_validate.out   
INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_1
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
INFO: == PCIE link check PASSED
INFO: == Starting verify kernel test: 

Again, the test was killed after 5 minutes, but longer timeouts give the same outcome.
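
Because the flash scan already reports a problem before validate starts in this case, the capture script could check the scan output first. The following is a sketch under that assumption only: flash_scan_ok is a hypothetical helper, and the two strings it looks for are copied from the flash scan output above.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if a saved "xbmgmt flash --scan" output shows neither
# the XMC error nor an unknown SC version.
# flash_scan_ok is a hypothetical helper, not an XRT tool.
flash_scan_ok() {
    ! grep -q -e "XMC is not ready" -e "SC=UNKNOWN" "$1"
}

# Possible use in the capture script:
#   if ! flash_scan_ok "${PREFIX}_3_flash_scan.out"; then
#       echo "run ${i}: XMC not ready, this looks like case 2"
#   fi
```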

Excerpts from the dmesg:

[  466.764198] xocl 0000:16:00.1: xocl_init_mem:   Memory Bank: PLRAM[5]
[  466.764199] xocl 0000:16:00.1: xocl_init_mem:   Base Address:0x201400000
[  466.764200] xocl 0000:16:00.1: xocl_init_mem:   Size:0x20000
[  466.764201] xocl 0000:16:00.1: xocl_init_mem:   Type:2
[  466.764203] xocl 0000:16:00.1: xocl_init_mem:   Used:0
[  466.764205] xocl 0000:16:00.1: xocl_init_mem: Allocating Memory Bank: HBM[0]
[  466.764207] xocl 0000:16:00.1: xocl_init_mem:   base_addr:0x0, total size:0x10000000
[  466.764208] xocl 0000:16:00.1: xocl_init_mem: Found a new memory region
[  466.764213] xocl 0000:16:00.1: xocl_init_mem: drm_mm_init called
[  466.764215] xocl 0000:16:00.1: xocl_read_axlf_helper: Loaded xclbin 62fcdc33-e895-48ee-8124-f025f7fbc6bb
[  466.764392] systemd-journald[1148]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  466.765428] icap.u icap.u.15728640: icap_lock_bitstream: bitstream 62fcdc33-e895-48ee-8124-f025f7fbc6bb locked, ref=1
[  466.765431] xocl 0000:16:00.1: exec_reset: exec_reset(0) cfg(0)
[  466.765432] xocl 0000:16:00.1: exec_reset: exec_reset resets
[  466.765433] xocl 0000:16:00.1: exec_reset: exec->xclbin(00000000-0000-0000-0000-000000000000),xclbin(62fcdc33-e895-48ee-8124-f025f7fbc6bb)
[  466.765436] xocl_mb_sche mb_scheduler.u.4194304: client_ioctl_ctx: CTX add(62fcdc33-e895-48ee-8124-f025f7fbc6bb, pid 47015, cu_idx 0xffffffff) = 0, ctx=1
[  466.765449] xocl 0000:16:00.1: exec_cfg_cmd: ert per feature rom = 1
[  466.765451] xocl 0000:16:00.1: exec_cfg_cmd: dsa52 = 1
[  466.765454] xocl 0000:16:00.1: cu_reset: configured cu(0) base@0x1800000 poll@0x          (null) control(0) ctx(0)
[  466.765455] xocl 0000:16:00.1: exec_cfg_cmd: configuring embedded scheduler mode
[  466.765457] xocl 0000:16:00.1: exec_cfg_cmd: scheduler config ert(1), dataflow(0), slots(16), cudma(1), cuisr(0), cdma(0), cus(1)
[  466.773669] systemd-journald[1148]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  466.773854] systemd-journald[1148]: no db file to read /run/udev/data/+platform:mb_scheduler.u.4194304: No such file or directory
[  547.880615] systemd-journald[1148]: Sent WATCHDOG=1 notification.
[  635.621240] systemd-journald[1148]: Sent WATCHDOG=1 notification.
[  737.151182] systemd-journald[1148]: Sent WATCHDOG=1 notification.
[  763.782428] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  764.282632] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  764.783467] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  765.284289] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  765.785114] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  766.285935] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  766.786756] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  767.286881] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  767.787405] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  768.288226] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  768.789049] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  769.289872] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  769.790694] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  770.291517] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  770.792339] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  771.292931] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  771.792987] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  772.293808] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  772.793582] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  773.294455] xocl 0000:16:00.1: destroy_client: pid(47015) waiting for 1 outstanding execs to finish
[  773.795278] xocl 0000:16:00.1: destroy_client: pid(47015) gives up with 1 outstanding execs.
[  773.795279] xocl 0000:16:00.1: destroy_client: Please reset device with 'xbutil reset'
[  773.795280] [drm] client exits pid(47015)
[  773.795283] icap.u icap.u.15728640: icap_read_from_peer: reading from peer
[  773.795291] mailbox.u mailbox.u.13631488: mailbox_request: sending request: 10 via HW
[  773.795397] mailbox.m mailbox.m.13631488: process_request: received request from peer: 10, passed on
[  773.795401] xclmgmt 0000:16:00.0: xclmgmt_read_subdev_req: req kind 1
[  773.796273] systemd-journald[1148]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  773.796325] systemd-journald[1148]: no db file to read /run/udev/data/+platform:mailbox.u.13631488: No such file or directory
[  773.796374] systemd-journald[1148]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  773.798443] mailbox.m mailbox.m.13631488: mailbox_post_response: posting response for: 10 via HW
[  773.798542] icap.u icap.u.15728640: icap_unlock_bitstream: bitstream 62fcdc33-e895-48ee-8124-f025f7fbc6bb unlocked, ref=0
[  773.798544] xocl 0000:16:00.1: exec_stop: exec_stop(ffff8bd05a56e018)
[  773.798549] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  773.799175] systemd-journald[1148]: no db file to read /run/udev/data/+platform:mailbox.m.13631488: No such file or directory
[  773.799246] systemd-journald[1148]: no db file to read /run/udev/data/+platform:icap.u.15728640: No such file or directory
[  773.899099] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  773.999856] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.100615] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.201381] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.302144] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.402907] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.503669] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.604433] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.705196] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.805949] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  774.906546] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.007486] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.108248] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.209012] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.309778] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.410538] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.511300] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.612063] xocl 0000:16:00.1: exec_stop: Waiting for 1 outstanding commands to finish
[  775.712828] xocl 0000:16:00.1: cmd_update_state: aborting stale exec pid (0) cmd(0)
[  776.713589] xocl 0000:16:00.1: xocl_drvinst_close: CLOSE 2
[  776.713591] xocl 0000:16:00.1: xocl_drvinst_close: NOTIFY ffff8be6dba87410

 

Observer
Registered: ‎10-06-2016

Further testing shows that the failures in "Case 1) validate gets stuck during bandwidth test" are not deterministic. In a test loop, 17 out of 33 runs have succeeded so far:

INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_1
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
INFO: == PCIE link check PASSED
INFO: == Starting verify kernel test: 
INFO: == verify kernel test PASSED
INFO: == Starting DMA test: 
Buffer Size: 256 MB
Host -> PCIe -> FPGA write bandwidth = 11390.9 MB/s
Host <- PCIe <- FPGA read bandwidth = 12126.4 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test: 
............
Maximum throughput: 43690 MB/s
INFO: == device memory bandwidth test PASSED
INFO: == Starting PCIE peer-to-peer test: 
P2P BAR is not enabled. Skipping validation
INFO: == PCIE peer-to-peer test SKIPPED
INFO: == Starting memory-to-memory DMA test: 
M2M is not available. Skipping validation
INFO: == memory-to-memory DMA test SKIPPED
INFO: Card[0] validated successfully.

INFO: All cards validated successfully.

However, self-compiled .xclbin designs that previously worked fine still tend to hang or produce wrong results, even directly after one of the successful validation runs.
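A pass rate like the 17 out of 33 quoted above can be tallied from the saved logs rather than by hand. This is only a sketch: it assumes the logs follow the run_<i>_4_validate.out naming from the capture script in the first post, and it treats the "All cards validated successfully" line as the pass marker. The function name summarize_validate_runs is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count passing validate logs in a directory.
# Assumption: logs are named run_<i>_4_validate.out and a passing run
# contains the line "All cards validated successfully".
summarize_validate_runs() {
    local dir="${1:-.}"
    local pass=0
    local total=0
    local f
    for f in "$dir"/run_*_4_validate.out; do
        [ -e "$f" ] || continue          # glob did not match any file
        total=$((total + 1))
        if grep -q "All cards validated successfully" "$f"; then
            pass=$((pass + 1))
        fi
    done
    echo "$pass of $total runs passed"
}

# Usage (directory name taken from the first post):
#   summarize_validate_runs powercycle4
```

A run that was killed by the 300 s timeout leaves a log without the pass marker, so it is counted as a failure.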

Observer
Registered: ‎10-06-2016

After further repeated runs within "Case 1) validate gets stuck during bandwidth test", another type of error showed up: DMA Test data integrity check failed.

cat run_99_4_validate.out
INFO: Found 1 cards
 
INFO: Validating card[0]: xilinx_u280_xdma_201920_1
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
INFO: == PCIE link check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: == Starting DMA test:
Buffer Size: 256 MB
Host -> PCIe -> FPGA write bandwidth = 11405.2 MB/s
Host <- PCIe <- FPGA read bandwidth = 12139 MB/s
DMA Test data integrity check failed
ERROR: == DMA test FAILED
INFO: Card[0] failed to validate.
 
ERROR: Some cards failed to validate.

 

Moderator
Registered: ‎06-14-2010

Hello @kenter ,

Can you please obtain and use the latest 201920_3 shell / xrt_2.5.309, and see if you see a better outcome?

These can be obtained here: https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#gettingStarted 

Hope this helps.

Kind Regards,
Anatoli Curran,
Xilinx Technical Support
-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Observer
Registered: ‎10-06-2016

Hi @anatoli ,

thanks for the hint. Initial tests indeed look better:

 

cat run_1_4_validate.out 
INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
INFO: == PCIE link check PASSED
INFO: == Starting SC firmware version check: 
SC FIRMWARE MISMATCH, ATTENTION
SC firmware running on board: 4.2.0. Expected SC firmware from installed Shell: 4.3.10
Please use "xbmgmt flash --scan" to check installed Shell.
WARN: == SC firmware version check PASSED with warning
INFO: == Starting verify kernel test: 
INFO: == verify kernel test PASSED
INFO: == Starting DMA test: 
Host -> PCIe -> FPGA write bandwidth = 11932 MB/s
Host <- PCIe <- FPGA read bandwidth = 12123.5 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test: 
...........
Maximum throughput: 43690 MB/s
INFO: == device memory bandwidth test PASSED
INFO: == Starting PCIE peer-to-peer test: 
P2P BAR is not enabled. Skipping validation
INFO: == PCIE peer-to-peer test SKIPPED
INFO: == Starting memory-to-memory DMA test: 
M2M is not available. Skipping validation
INFO: == memory-to-memory DMA test SKIPPED
INFO: Card[0] validated with warnings.

INFO: All cards validated successfully but with warnings.

 

Can you help to make sense of the SC Firmware mismatch? 

 

cat run_1_3_flash_scan.out 
Card [0000:af:00.0]
    Card type:		u280
    Flash type:		SPI
    Flashable partition running on FPGA:
        xilinx_u280_xdma_201920_3,[ID=0x5e278820],[SC=4.2.0]
    Flashable partitions installed in system:	
        xilinx_u280_xdma_201920_3,[ID=0x5e278820],[SC=4.3.10]

 

As we can see, the mismatch also showed up in the earlier output of flash --scan, just with older partition names, and it didn't trigger a warning.

Could it be related to the fact that we boot from an OS image without XRT and the deployment shell installed, and just add them by script after booting? If so, where does the SC=4.2.0 version come from?

Or is there any chance that the SC firmware is not read (or not readable) from the flashable partition and is instead taken from some fallback / golden partition?

Edit: you may notice that the PCIe address has changed (16:00 to af:00), since we put the card into a different slot for testing. However, the improvements are definitely related to the updated shell / XRT.
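The running and installed SC versions can also be compared directly from the saved flash --scan output. The following is a sketch that assumes the output layout shown above (one "running on FPGA" section, one "installed in system" section, each with an [SC=...] tag). The function name check_sc_versions is hypothetical, and if several partitions are installed only the last one listed is reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "xbmgmt flash --scan" output on stdin and
# report the SC version running on the card vs. the one installed.
check_sc_versions() {
    awk '
        /running on FPGA/     { section = "running" }
        /installed in system/ { section = "installed" }
        match($0, /\[SC=[^]]*\]/) {
            # Skip the 4 characters of "[SC=" and drop the closing "]".
            ver[section] = substr($0, RSTART + 4, RLENGTH - 5)
        }
        END {
            status = (ver["running"] == ver["installed"]) ? "match" : "MISMATCH"
            printf "running=%s installed=%s %s\n", ver["running"], ver["installed"], status
        }
    '
}

# Usage (file name from the capture script):
#   check_sc_versions < run_1_3_flash_scan.out
```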

Moderator
Registered: ‎06-14-2010

Hello @kenter ,

Thanks for the info.

As described on page 20 of https://www.xilinx.com/support/documentation/boards_and_kits/accelerator-cards/ug1370-u50-installation.pdf, updating the SC takes one more step: after the cold reboot of the machine (step 10), re-run the same sudo /opt/xilinx/xrt/bin/xbmgmt flash --update --shell <shell_name> command that was used in the previous step. This second run updates the SC on the card to the version installed on the system:

[Attached image: image.png]

Hope this helps.

Kind Regards,
Anatoli Curran,
Xilinx Technical Support

Observer
Registered: ‎10-06-2016

Hi @anatoli ,

thanks, that helps to resolve the SC mismatch warning. The getting started guide that I find for the U280 (https://www.xilinx.com/support/documentation/boards_and_kits/accelerator-cards/1_4/ug1301-getting-started-guide-alveo-accelerator-cards.pdf) still ends after "9. Cold boot your machine to load the new firmware image on the FPGA."

Even prior to that step, a test series of 100 validates ran through without any further problems. We will soon start to test again with custom kernels.

Xilinx Employee
Registered: ‎12-10-2013

@kenter 

To clarify, you only need the cold reboot if you need to update what is running on the FPGA vs. what is installed on the main system.  The loss of power is what triggers that reload.  You shouldn't need a cold reboot after step 11, as the SC FW can be updated / loaded without needing a power cycle.


Observer
Registered: ‎10-06-2016

Further testing with the new firmware shows that unfortunately not all problems are resolved. The PCIe connection may be the core problem.

1) We ran a few of our own test kernels. Some succeeded with the expected performance, some hung (we are not 100% sure that this exact test setup worked earlier), and one ran through with a strange performance anomaly: three of the copy-in calls took around 10 s (10000 ms) instead of the expected ~0.05 ms. The performance numbers below are taken from CL profiling events.

Interestingly, the effect seems to be correlated with the hardware design itself or the bitstream: the same host code with kernels generated from .cl instead of .cpp showed no problems or anomalies.

 

./host ../sys_builds/dot.cpp/build_dir.hw.xilinx_u280_xdma_201920_3/dot.xclbin 4096 768 2
Testing dot...
Found Platform
Platform Name: Xilinx
INFO: Reading ../sys_builds/dot.cpp/build_dir.hw.xilinx_u280_xdma_201920_3/dot.xclbin
Loading: '../sys_builds/dot.cpp/build_dir.hw.xilinx_u280_xdma_201920_3/dot.xclbin'
Trying to program device[0]: xilinx_u280_xdma_201920_3
Device[0]: program successful!
block 0 width: 2048, block 1 width: 2048, Allocating host buffers...
Allocating device buffers...
Per kernel: copy host buffers, set args, copy in, enque kernel
Per kernel copy out, finish, accumulate result
Waiting for read back finish...
Waiting for read back finish...
krnl 0, event 0 (copy in vecA) from 3037.276 to 3037.339 =   0.063 ms, throughput   0.121 GB/s,   0.000 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 3037.368 to 13036.782 = 9999.414 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 13036.837 to 13036.936 =   0.099 ms, throughput   0.155 GB/s,   0.039 GFLOPS/s
krnl 0, event 3 (copy out val) from 13036.978 to 13037.043 =   0.065 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 0 (copy in vecA) from 3037.366 to 3037.388 =   0.022 ms, throughput   0.346 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 3037.394 to 3037.413 =   0.019 ms, throughput   0.403 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 3037.455 to 3037.609 =   0.154 ms, throughput   0.099 GB/s,   0.025 GFLOPS/s
krnl 1, event 3 (copy out val) from 13037.112 to 13037.131 =   0.019 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                        10000.118 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
cblas reference time:                    2.586 ms, throughput   0.006 GB/s,   0.003 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 13039.904 to 13039.954 =   0.050 ms, throughput   0.152 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 13039.978 to 13040.031 =   0.054 ms, throughput   0.284 GB/s,   0.071 GFLOPS/s
krnl 0, event 3 (copy out val) from 13040.079 to 13040.100 =   0.021 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 13039.963 to 13040.007 =   0.044 ms, throughput   0.174 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 13040.045 to 13040.139 =   0.093 ms, throughput   0.164 GB/s,   0.041 GFLOPS/s
krnl 1, event 3 (copy out val) from 13040.166 to 13040.182 =   0.016 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.456 ms, throughput   0.033 GB/s,   0.017 GFLOPS/s
cblas reference time:                    0.003 ms, throughput   5.641 GB/s,   2.820 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 13040.345 to 13040.373 =   0.029 ms, throughput   0.267 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 13040.390 to 13040.428 =   0.038 ms, throughput   0.398 GB/s,   0.099 GFLOPS/s
krnl 0, event 3 (copy out val) from 13040.491 to 13040.510 =   0.020 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 13040.407 to 13040.428 =   0.021 ms, throughput   0.367 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 13040.465 to 13040.516 =   0.051 ms, throughput   0.298 GB/s,   0.074 GFLOPS/s
krnl 1, event 3 (copy out val) from 13040.549 to 13040.571 =   0.022 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.352 ms, throughput   0.043 GB/s,   0.022 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  21.922 GB/s,  10.961 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 13040.705 to 13041.084 =   0.379 ms, throughput   0.020 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 13041.096 to 13041.131 =   0.035 ms, throughput   0.439 GB/s,   0.110 GFLOPS/s
krnl 0, event 3 (copy out val) from 13041.148 to 13041.165 =   0.017 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 13040.753 to 23042.414 = 10001.661 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 23042.462 to 23042.557 =   0.095 ms, throughput   0.161 GB/s,   0.040 GFLOPS/s
krnl 1, event 3 (copy out val) from 23042.602 to 23042.665 =   0.063 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                        10002.124 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
cblas reference time:                    0.005 ms, throughput   2.953 GB/s,   1.477 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 23042.888 to 23043.332 =   0.444 ms, throughput   0.017 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 23043.349 to 23043.392 =   0.043 ms, throughput   0.355 GB/s,   0.089 GFLOPS/s
krnl 0, event 3 (copy out val) from 23043.415 to 23043.440 =   0.025 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 23042.942 to 33042.809 = 9999.867 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33042.856 to 33042.960 =   0.105 ms, throughput   0.146 GB/s,   0.036 GFLOPS/s
krnl 1, event 3 (copy out val) from 33043.007 to 33043.073 =   0.066 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                        10000.371 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
cblas reference time:                    0.006 ms, throughput   2.605 GB/s,   1.303 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 33043.285 to 33043.331 =   0.046 ms, throughput   0.166 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 33043.351 to 33043.396 =   0.045 ms, throughput   0.339 GB/s,   0.085 GFLOPS/s
krnl 0, event 3 (copy out val) from 33043.442 to 33043.462 =   0.020 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 33043.339 to 33043.392 =   0.054 ms, throughput   0.142 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33043.420 to 33043.454 =   0.034 ms, throughput   0.454 GB/s,   0.114 GFLOPS/s
krnl 1, event 3 (copy out val) from 33043.491 to 33043.512 =   0.021 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.396 ms, throughput   0.039 GB/s,   0.019 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  11.875 GB/s,   5.937 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 33043.655 to 33043.681 =   0.026 ms, throughput   0.294 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 33043.697 to 33043.760 =   0.063 ms, throughput   0.243 GB/s,   0.061 GFLOPS/s
krnl 0, event 3 (copy out val) from 33043.791 to 33044.385 =   0.594 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 33043.719 to 33044.386 =   0.667 ms, throughput   0.011 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33044.402 to 33044.437 =   0.036 ms, throughput   0.429 GB/s,   0.107 GFLOPS/s
krnl 1, event 3 (copy out val) from 33044.454 to 33044.470 =   0.017 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.936 ms, throughput   0.016 GB/s,   0.008 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  20.845 GB/s,  10.422 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 33044.604 to 33044.634 =   0.030 ms, throughput   0.256 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 33044.653 to 33044.698 =   0.046 ms, throughput   0.333 GB/s,   0.083 GFLOPS/s
krnl 0, event 3 (copy out val) from 33044.774 to 33044.881 =   0.107 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 33044.638 to 33044.661 =   0.023 ms, throughput   0.334 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33044.691 to 33044.744 =   0.054 ms, throughput   0.284 GB/s,   0.071 GFLOPS/s
krnl 1, event 3 (copy out val) from 33044.909 to 33044.928 =   0.019 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.441 ms, throughput   0.035 GB/s,   0.017 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  20.025 GB/s,  10.012 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 33045.056 to 33045.080 =   0.025 ms, throughput   0.310 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 33045.096 to 33045.134 =   0.038 ms, throughput   0.400 GB/s,   0.100 GFLOPS/s
krnl 0, event 3 (copy out val) from 33045.196 to 33045.216 =   0.019 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 33045.119 to 33045.147 =   0.028 ms, throughput   0.271 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33045.178 to 33045.217 =   0.039 ms, throughput   0.396 GB/s,   0.099 GFLOPS/s
krnl 1, event 3 (copy out val) from 33045.247 to 33045.263 =   0.016 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.324 ms, throughput   0.047 GB/s,   0.024 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  24.731 GB/s,  12.365 GFLOPS/s
krnl 0, event 1 (copy in vecB) from 33045.391 to 33045.415 =   0.024 ms, throughput   0.321 GB/s,   0.000 GFLOPS/s
krnl 0, event 2 (     compute) from 33045.430 to 33045.465 =   0.035 ms, throughput   0.431 GB/s,   0.108 GFLOPS/s
krnl 0, event 3 (copy out val) from 33045.525 to 33045.542 =   0.018 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
krnl 1, event 1 (copy in vecB) from 33045.444 to 33045.465 =   0.021 ms, throughput   0.372 GB/s,   0.000 GFLOPS/s
krnl 1, event 2 (     compute) from 33045.502 to 33045.537 =   0.035 ms, throughput   0.432 GB/s,   0.108 GFLOPS/s
krnl 1, event 3 (copy out val) from 33045.574 to 33045.591 =   0.017 ms, throughput   0.000 GB/s,   0.000 GFLOPS/s
fpga wall time:                          0.312 ms, throughput   0.049 GB/s,   0.024 GFLOPS/s
cblas reference time:                    0.001 ms, throughput  24.259 GB/s,  12.130 GFLOPS/s
ERROR: expected 4096, received 0
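
The ~10 s outliers in the log above are easy to miss among the normal entries. As a rough sketch, they can be filtered out with awk, assuming the "krnl ..., event ... = <duration> ms," line format printed by our host code. The function name flag_slow_events and the 1000 ms default threshold are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read the host program's profiling output on stdin
# and print every per-kernel event whose duration exceeds a threshold.
# Only lines starting with "krnl" are checked, so the "fpga wall time"
# summary lines are ignored.
flag_slow_events() {
    local threshold_ms="${1:-1000}"
    awk -v limit="$threshold_ms" '
        /^krnl/ {
            for (i = 2; i <= NF; i++) {
                # The duration is the field right before the "ms," token.
                if ($i == "ms," && $(i - 1) + 0 > limit) {
                    print
                    slow++
                }
            }
        }
        END { printf "%d slow event(s) over %s ms\n", slow, limit }
    '
}

# Usage (assuming the host output was saved to host.log):
#   flag_slow_events 1000 < host.log
```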

 

2) After some tests with our own kernels, we ran validate again. In the meantime the PCIe link had degraded to Gen2x16, and the error shown in the validate output below appeared. Shortly afterwards, the machine crashed completely during host code compilation.
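
The negotiated link speed can also be watched between runs, independently of validate. The sketch below assumes the LnkSta line that lspci -vv prints for a PCIe device (root is usually needed to see the capability block); 5GT/s per lane corresponds to Gen2 and 8GT/s to Gen3. The function name link_status is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "lspci -vv" output on stdin and print the
# negotiated speed and width from each LnkSta line, e.g. "5GT/s x16".
# Annotations such as "(ok)" or "(downgraded)" after the speed are skipped.
link_status() {
    sed -n 's|.*LnkSta:[[:space:]]*Speed \([0-9.]*GT/s\)[^,]*, Width \(x[0-9]*\).*|\1 \2|p'
}

# Usage (10ee is the Xilinx vendor ID used in the capture script):
#   sudo lspci -vvd 10ee: | link_status
```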

 

 

/opt/xilinx/xrt/bin/xbutil validate    [20-03-10 9:40:11]
INFO: Found 1 cards
 
INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen2x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: == Starting DMA test:
Host -> PCIe -> FPGA write bandwidth = 6187.31 MB/s
Host <- PCIe <- FPGA read bandwidth = 6381.02 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test:
.................*** Error in `/usr/bin/python': free(): invalid pointer: 0x0000000016953ff0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7c619)[0x2aaaaba3c619]
/lib64/libpython2.7.so.1.0(PyInt_ClearFreeList+0x11c)[0x2aaaaad4280c]
/lib64/libpython2.7.so.1.0(+0x114b9f)[0x2aaaaade3b9f]
/lib64/libpython2.7.so.1.0(PyGC_Collect+0x28)[0x2aaaaade3f78]
/lib64/libpython2.7.so.1.0(Py_Finalize+0xf9)[0x2aaaaadd1089]
/lib64/libpython2.7.so.1.0(Py_Exit+0x8)[0x2aaaaadd0988]
/lib64/libpython2.7.so.1.0(+0x101ac7)[0x2aaaaadd0ac7]
/lib64/libpython2.7.so.1.0(PyErr_PrintEx+0x1dd)[0x2aaaaadd0d8d]
/lib64/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x20e)[0x2aaaaadd19ae]
/lib64/libpython2.7.so.1.0(Py_Main+0xc9f)[0x2aaaaade2a3f]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaaab9e1c05]
/usr/bin/python[0x40071e]
======= Memory map: ========
00400000-00401000 r-xp 00000000 00:1d 77317                              /usr/bin/python2.7
00600000-00601000 r--p 00000000 00:1d 77317                              /usr/bin/python2.7
00601000-00602000 rw-p 00001000 00:1d 77317                              /usr/bin/python2.7
00602000-1a06c000 rw-p 00000000 00:00 0                                  [heap]
2aaaaaaab000-2aaaaaacc000 r-xp 00000000 00:1d 108701                     /usr/lib64/ld-2.17.so
2aaaaaacc000-2aaaaaace000 r-xp 00000000 00:00 0                          [vdso]
2aaaaaace000-2aaaaaad0000 rw-p 00000000 00:00 0
2aaaaaad0000-2aaaaaad1000 rw-s 00000000 00:05 5958533                    /dev/dri/renderD128
2aaaaaad1000-2aaaaaad2000 rw-s 00000000 00:05 5958533                    /dev/dri/renderD128
2aaaaaad2000-2aaaaaad3000 rw-p 00000000 00:00 0
2aaaaaae1000-2aaaaab68000 rw-p 00000000 00:00 0
2aaaaab68000-2aaaaab7e000 r-xp 00000000 00:2d 333220013                  /cm/shared/apps/pc2/EB-SW/software/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1.2.11
2aaaaab7e000-2aaaaab7f000 ---p 00016000 00:2d 333220013                  /cm/shared/apps/pc2/EB-SW/software/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1.2.11
2aaaaab7f000-2aaaaab80000 r--p 00016000 00:2d 333220013                  /cm/shared/apps/pc2/EB-SW/software/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1.2.11
2aaaaab80000-2aaaaab81000 rw-p 00017000 00:2d 333220013                  /cm/shared/apps/pc2/EB-SW/software/zlib/1.2.11-GCCcore-8.3.0/lib/libz.so.1.2.11
2aaaaab81000-2aaaaab88000 r--s 00000000 00:1d 109577                     /usr/lib64/gconv/gconv-modules.cache
2aaaaab99000-2aaaaac1b000 rw-p 00000000 00:00 0
2aaaaac1b000-2aaaaac32000 r-xp 00000000 00:2d 11044454883                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libgcc_s.so.1
2aaaaac32000-2aaaaac33000 r--p 00016000 00:2d 11044454883                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libgcc_s.so.1
2aaaaac33000-2aaaaac34000 rw-p 00017000 00:2d 11044454883                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libgcc_s.so.1
2aaaaac5c000-2aaaaac9d000 rw-p 00000000 00:00 0
2aaaaaccc000-2aaaaaccd000 r--p 00021000 00:1d 108701                     /usr/lib64/ld-2.17.so
2aaaaaccd000-2aaaaacce000 rw-p 00022000 00:1d 108701                     /usr/lib64/ld-2.17.so
2aaaaacce000-2aaaaaccf000 rw-p 00000000 00:00 0
2aaaaaccf000-2aaaaae4c000 r-xp 00000000 00:1d 109159                     /usr/lib64/libpython2.7.so.1.0
2aaaaae4c000-2aaaab04c000 ---p 0017d000 00:1d 109159                     /usr/lib64/libpython2.7.so.1.0
2aaaab04c000-2aaaab04e000 r--p 0017d000 00:1d 109159                     /usr/lib64/libpython2.7.so.1.0
2aaaab04e000-2aaaab08c000 rw-p 0017f000 00:1d 109159                     /usr/lib64/libpython2.7.so.1.0
2aaaab08c000-2aaaab09b000 rw-p 00000000 00:00 0
2aaaab09b000-2aaaab0b2000 r-xp 00000000 00:1d 109152                     /usr/lib64/libpthread-2.17.so
2aaaab0b2000-2aaaab2b1000 ---p 00017000 00:1d 109152                     /usr/lib64/libpthread-2.17.so
2aaaab2b1000-2aaaab2b2000 r--p 00016000 00:1d 109152                     /usr/lib64/libpthread-2.17.so
2aaaab2b2000-2aaaab2b3000 rw-p 00017000 00:1d 109152                     /usr/lib64/libpthread-2.17.so
2aaaab2b3000-2aaaab2b7000 rw-p 00000000 00:00 0
2aaaab2b7000-2aaaab2b9000 r-xp 00000000 00:1d 108881                     /usr/lib64/libdl-2.17.so
2aaaab2b9000-2aaaab4b9000 ---p 00002000 00:1d 108881                     /usr/lib64/libdl-2.17.so
2aaaab4b9000-2aaaab4ba000 r--p 00002000 00:1d 108881                     /usr/lib64/libdl-2.17.so
2aaaab4ba000-2aaaab4bb000 rw-p 00003000 00:1d 108881                     /usr/lib64/libdl-2.17.so
2aaaab4bb000-2aaaab4bd000 r-xp 00000000 00:1d 109229                     /usr/lib64/libutil-2.17.so
2aaaab4bd000-2aaaab6bc000 ---p 00002000 00:1d 109229                     /usr/lib64/libutil-2.17.so
2aaaab6bc000-2aaaab6bd000 r--p 00001000 00:1d 109229                     /usr/lib64/libutil-2.17.so
2aaaab6bd000-2aaaab6be000 rw-p 00002000 00:1d 109229                     /usr/lib64/libutil-2.17.so
2aaaab6be000-2aaaab7bf000 r-xp 00000000 00:1d 109032                     /usr/lib64/libm-2.17.so
2aaaab7bf000-2aaaab9be000 ---p 00101000 00:1d 109032                     /usr/lib64/libm-2.17.so
2aaaab9be000-2aaaab9bf000 r--p 00100000 00:1d 109032                     /usr/lib64/libm-2.17.so
2aaaab9bf000-2aaaab9c0000 rw-p 00101000 00:1d 109032                     /usr/lib64/libm-2.17.so
2aaaab9c0000-2aaaabb78000 r-xp 00000000 00:1d 108836                     /usr/lib64/libc-2.17.so
2aaaabb78000-2aaaabd78000 ---p 001b8000 00:1d 108836                     /usr/lib64/libc-2.17.so
2aaaabd78000-2aaaabd7c000 r--p 001b8000 00:1d 108836                     /usr/lib64/libc-2.17.so
2aaaabd7c000-2aaaabd7e000 rw-p 001bc000 00:1d 108836                     /usr/lib64/libc-2.17.so
2aaaabd7e000-2aaaabd83000 rw-p 00000000 00:00 0
2aaaabd83000-2aaab22ac000 r--p 00000000 00:1d 95497                      /usr/lib/locale/locale-archive
2aaab22ac000-2aaab22af000 r-xp 00000000 00:1d 111546                     /usr/lib64/python2.7/lib-dynload/_functoolsmodule.so
2aaab22af000-2aaab24ae000 ---p 00003000 00:1d 111546                     /usr/lib64/python2.7/lib-dynload/_functoolsmodule.so
2aaab24ae000-2aaab24af000 r--p 00002000 00:1d 111546                     /usr/lib64/python2.7/lib-dynload/_functoolsmodule.so
2aaab24af000-2aaab24b0000 rw-p 00003000 00:1d 111546                     /usr/lib64/python2.7/lib-dynload/_functoolsmodule.so
2aaab24b0000-2aaab24ba000 r-xp 00000000 00:1d 111576                     /usr/lib64/python2.7/lib-dynload/itertoolsmodule.so
2aaab24ba000-2aaab26b9000 ---p 0000a000 00:1d 111576                     /usr/lib64/python2.7/lib-dynload/itertoolsmodule.so
2aaab26b9000-2aaab26ba000 r--p 00009000 00:1d 111576                     /usr/lib64/python2.7/lib-dynload/itertoolsmodule.so
2aaab26ba000-2aaab26bf000 rw-p 0000a000 00:1d 111576                     /usr/lib64/python2.7/lib-dynload/itertoolsmodule.so
2aaab26bf000-2aaab26c8000 r-xp 00000000 00:1d 111581                     /usr/lib64/python2.7/lib-dynload/operator.so
2aaab26c8000-2aaab28c7000 ---p 00009000 00:1d 111581                     /usr/lib64/python2.7/lib-dynload/operator.so
2aaab28c7000-2aaab28c8000 r--p 00008000 00:1d 111581                     /usr/lib64/python2.7/lib-dynload/operator.so
2aaab28c8000-2aaab28ca000 rw-p 00009000 00:1d 111581                     /usr/lib64/python2.7/lib-dynload/operator.so
2aaab28ca000-2aaab28ce000 r-xp 00000000 00:1d 111592                     /usr/lib64/python2.7/lib-dynload/timemodule.so
2aaab28ce000-2aaab2acd000 ---p 00004000 00:1d 111592                     /usr/lib64/python2.7/lib-dynload/timemodule.so
2aaab2acd000-2aaab2ace000 r--p 00003000 00:1d 111592                     /usr/lib64/python2.7/lib-dynload/timemodule.so
2aaab2ace000-2aaab2ad0000 rw-p 00004000 00:1d 111592                     /usr/lib64/python2.7/lib-dynload/timemodule.so
2aaab2ad0000-2aaab2ad4000 r-xp 00000000 00:1d 111566                     /usr/lib64/python2.7/lib-dynload/cStringIO.so
2aaab2ad4000-2aaab2cd3000 ---p 00004000 00:1d 111566                     /usr/lib64/python2.7/lib-dynload/cStringIO.so
2aaab2cd3000-2aaab2cd4000 r--p 00003000 00:1d 111566                     /usr/lib64/python2.7/lib-dynload/cStringIO.so
2aaab2cd4000-2aaab2cd6000 rw-p 00004000 00:1d 111566                     /usr/lib64/python2.7/lib-dynload/cStringIO.so
2aaab2cd6000-2aaab2cdc000 r-xp 00000000 00:1d 111539                     /usr/lib64/python2.7/lib-dynload/_collectionsmodule.so
2aaab2cdc000-2aaab2edb000 ---p 00006000 00:1d 111539                     /usr/lib64/python2.7/lib-dynload/_collectionsmodule.so
2aaab2edb000-2aaab2edc000 r--p 00005000 00:1d 111539                     /usr/lib64/python2.7/lib-dynload/_collectionsmodule.so
2aaab2edc000-2aaab2ede000 rw-p 00006000 00:1d 111539                     /usr/lib64/python2.7/lib-dynload/_collectionsmodule.so
2aaab2ede000-2aaab2ee1000 r-xp 00000000 00:1d 111548                     /usr/lib64/python2.7/lib-dynload/_heapq.so
2aaab2ee1000-2aaab30e0000 ---p 00003000 00:1d 111548                     /usr/lib64/python2.7/lib-dynload/_heapq.so
2aaab30e0000-2aaab30e1000 r--p 00002000 00:1d 111548                     /usr/lib64/python2.7/lib-dynload/_heapq.so
2aaab30e1000-2aaab30e3000 rw-p 00003000 00:1d 111548                     /usr/lib64/python2.7/lib-dynload/_heapq.so
2aaab30e3000-2aaab31b1000 r-xp 00000000 00:1d 303214                     /usr/lib64/python2.7/site-packages/pyopencl/_cl.so
2aaab31b1000-2aaab33b0000 ---p 000ce000 00:1d 303214                     /usr/lib64/python2.7/site-packages/pyopencl/_cl.so
2aaab33b0000-2aaab33b5000 rw-p 000cd000 00:1d 303214                     /usr/lib64/python2.7/site-packages/pyopencl/_cl.so
2aaab33b5000-2aaab33b7000 rw-p 00000000 00:00 0
2aaab33b7000-2aaab33c0000 rw-p 0013f000 00:1d 303214                     /usr/lib64/python2.7/site-packages/pyopencl/_cl.so
2aaab33c0000-2aaab33dc000 r-xp 00000000 00:1d 303283                     /usr/lib64/python2.7/site-packages/pyopencl/.libs/libOpenCL-c80442de.so.1.0.0
2aaab33dc000-2aaab35dc000 ---p 0001c000 00:1d 303283                     /usr/lib64/python2.7/site-packages/pyopencl/.libs/libOpenCL-c80442de.so.1.0.0
2aaab35dc000-2aaab35dd000 rw-p 0001c000 00:1d 303283                     /usr/lib64/python2.7/site-packages/pyopencl/.libs/libOpenCL-c80442de.so.1.0.0
2aaab35dd000-2aaab35df000 rw-p 00075000 00:1d 303283                     /usr/lib64/python2.7/site-packages/pyopencl/.libs/libOpenCL-c80442de.so.1.0.0
2aaab35df000-2aaab3768000 r-xp 00000000 00:2d 11044454908                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libstdc++.so.6.0.25
2aaab3768000-2aaab3772000 r--p 00188000 00:2d 11044454908                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libstdc++.so.6.0.25
2aaab3772000-2aaab3776000 rw-p 00192000 00:2d 11044454908                /cm/shared/apps/pc2/EB-SW/software/GCCcore/8.3.0/lib64/libstdc++.so.6.0.25
2aaab3776000-2aaab3779000 rw-p 00000000 00:00 0
2aaab3779000-2aaab3780000 r-xp 00000000 00:1d 111578                     /usr/lib64/python2.7/lib-dynload/math.so
2aaab3780000-2aaab397f000 ---p 00007000 00:1d 111578                     /usr/lib64/python2.7/lib-dynload/math.so
2aaab397f000-2aaab3980000 r--p 00006000 00:1d 111578                     /usr/lib64/python2.7/lib-dynload/math.so
2aaab3980000-2aaab3982000 rw-p 00007000 00:1d 111578                     /usr/lib64/python2.7/lib-dynload/math.so
2aaab3982000-2aaab3b5c000 r-xp 00000000 00:2c 8278608907                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/multiarray.so
2aaab3b5c000-2aaab3d5b000 ---p 001da000 00:2c 8278608907                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/multiarray.so
2aaab3d5b000-2aaab3d74000 rw-p 001d9000 00:2c 8278608907                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/multiarray.so
2aaab3d74000-2aaab3d93000 rw-p 00000000 00:00 0
2aaab3d93000-2aaab3d9a000 rw-p 001f3000 00:2c 8278608907                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/multiarray.so
2aaab3d9a000-2aaab6177000 r-xp 00000000 00:2c 8279386448                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so
2aaab6177000-2aaab6376000 ---p 023dd000 00:2c 8279386448                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so
2aaab6376000-2aaab6396000 rw-p 023dc000 00:2c 8279386448                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so
2aaab6396000-2aaab63f9000 rw-p 00000000 00:00 0
2aaab63f9000-2aaab6498000 rw-p 02501000 00:2c 8279386448                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libopenblasp-r0-8dca6697.3.0.dev.so
2aaab6498000-2aaab6588000 r-xp 00000000 00:2c 8279250611                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libgfortran-ed201abd.so.3.0.0
2aaab6588000-2aaab6787000 ---p 000f0000 00:2c 8279250611                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libgfortran-ed201abd.so.3.0.0
2aaab6787000-2aaab6789000 rw-p 000ef000 00:2c 8279250611                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libgfortran-ed201abd.so.3.0.0
2aaab6789000-2aaab678a000 rw-p 00000000 00:00 0
2aaab678a000-2aaab6792000 rw-p 000f2000 00:2c 8279250611                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/.libs/libgfortran-ed201abd.so.3.0.0
2aaab6792000-2aaab6793000 ---p 00000000 00:00 0
2aaab6793000-2aaab6993000 rw-p 00000000 00:00 0
2aaab6993000-2aaab6994000 ---p 00000000 00:00 0
2aaab6994000-2aaab6b94000 rw-p 00000000 00:00 0
2aaab6b94000-2aaab6b95000 ---p 00000000 00:00 0
2aaab6b95000-2aaab6d95000 rw-p 00000000 00:00 0
2aaab6d95000-2aaabad95000 rw-p 00000000 00:00 0
2aaabad95000-2aaabad96000 ---p 00000000 00:00 0
2aaabad96000-2aaabaf96000 rw-p 00000000 00:00 0
2aaabaf96000-2aaabcf96000 rw-p 00000000 00:00 0
2aaabcf96000-2aaabcf97000 ---p 00000000 00:00 0
2aaabcf97000-2aaabd197000 rw-p 00000000 00:00 0
2aaabd197000-2aaabf197000 rw-p 00000000 00:00 0
2aaabf197000-2aaabf198000 ---p 00000000 00:00 0
2aaabf198000-2aaabf398000 rw-p 00000000 00:00 0
2aaabf398000-2aaac1398000 rw-p 00000000 00:00 0
2aaac1398000-2aaac1399000 ---p 00000000 00:00 0
2aaac1399000-2aaac1599000 rw-p 00000000 00:00 0
2aaac1599000-2aaac3599000 rw-p 00000000 00:00 0
2aaac3599000-2aaac359a000 ---p 00000000 00:00 0
2aaac359a000-2aaac379a000 rw-p 00000000 00:00 0
2aaac379a000-2aaac579a000 rw-p 00000000 00:00 0
2aaac579a000-2aaac579b000 ---p 00000000 00:00 0
2aaac579b000-2aaac599b000 rw-p 00000000 00:00 0
2aaac599b000-2aaac799b000 rw-p 00000000 00:00 0
2aaac799b000-2aaac799c000 ---p 00000000 00:00 0
2aaac799c000-2aaac7b9c000 rw-p 00000000 00:00 0
2aaac7b9c000-2aaac9b9c000 rw-p 00000000 00:00 0
2aaac9b9c000-2aaac9b9d000 ---p 00000000 00:00 0
2aaac9b9d000-2aaac9d9d000 rw-p 00000000 00:00 0
2aaac9d9d000-2aaacbd9d000 rw-p 00000000 00:00 0
2aaacbd9d000-2aaacbd9e000 ---p 00000000 00:00 0
2aaacbd9e000-2aaacbf9e000 rw-p 00000000 00:00 0
2aaacbf9e000-2aaacdf9e000 rw-p 00000000 00:00 0
2aaacdf9e000-2aaacdf9f000 ---p 00000000 00:00 0
2aaacdf9f000-2aaace19f000 rw-p 00000000 00:00 0
2aaace19f000-2aaad019f000 rw-p 00000000 00:00 0
2aaad019f000-2aaad01a0000 ---p 00000000 00:00 0
2aaad01a0000-2aaad03a0000 rw-p 00000000 00:00 0
2aaad03a0000-2aaad23a0000 rw-p 00000000 00:00 0
2aaad23a0000-2aaad23a1000 ---p 00000000 00:00 0
2aaad23a1000-2aaad25a1000 rw-p 00000000 00:00 0
2aaad25a1000-2aaad45a1000 rw-p 00000000 00:00 0
2aaad45a1000-2aaad45a2000 ---p 00000000 00:00 0
2aaad45a2000-2aaad47a2000 rw-p 00000000 00:00 0
2aaad47a2000-2aaad67a2000 rw-p 00000000 00:00 0
2aaad67a2000-2aaad67a3000 ---p 00000000 00:00 0
2aaad67a3000-2aaad69a3000 rw-p 00000000 00:00 0
2aaad69a3000-2aaad89a3000 rw-p 00000000 00:00 0
2aaad89a3000-2aaad89a4000 ---p 00000000 00:00 0
2aaad89a4000-2aaad8ba4000 rw-p 00000000 00:00 0
2aaad8ba4000-2aaadaba4000 rw-p 00000000 00:00 0
2aaadaba4000-2aaadaba5000 ---p 00000000 00:00 0
2aaadaba5000-2aaadada5000 rw-p 00000000 00:00 0
2aaadada5000-2aaadcda5000 rw-p 00000000 00:00 0
2aaadcda5000-2aaadcda6000 ---p 00000000 00:00 0
2aaadcda6000-2aaadcfa6000 rw-p 00000000 00:00 0
2aaadcfa6000-2aaadefa6000 rw-p 00000000 00:00 0
2aaadefa6000-2aaadefa7000 ---p 00000000 00:00 0
2aaadefa7000-2aaadf1a7000 rw-p 00000000 00:00 0
2aaadf1a7000-2aaae11a7000 rw-p 00000000 00:00 0
2aaae11a7000-2aaae11a8000 ---p 00000000 00:00 0
2aaae11a8000-2aaae13a8000 rw-p 00000000 00:00 0
2aaae13a8000-2aaae33a8000 rw-p 00000000 00:00 0
2aaae33a8000-2aaae33a9000 ---p 00000000 00:00 0
2aaae33a9000-2aaae35a9000 rw-p 00000000 00:00 0
2aaae35a9000-2aaae55a9000 rw-p 00000000 00:00 0
2aaae55a9000-2aaae55aa000 ---p 00000000 00:00 0
2aaae55aa000-2aaae57aa000 rw-p 00000000 00:00 0
2aaae57aa000-2aaae77aa000 rw-p 00000000 00:00 0
2aaae77aa000-2aaae77ab000 ---p 00000000 00:00 0
2aaae77ab000-2aaae79ab000 rw-p 00000000 00:00 0
2aaae79ab000-2aaae99ab000 rw-p 00000000 00:00 0
2aaae99ab000-2aaae99ac000 ---p 00000000 00:00 0
2aaae99ac000-2aaae9bac000 rw-p 00000000 00:00 0
2aaae9bac000-2aaaebbac000 rw-p 00000000 00:00 0
2aaaebbac000-2aaaebbad000 ---p 00000000 00:00 0
2aaaebbad000-2aaaebdad000 rw-p 00000000 00:00 0
2aaaebdad000-2aaaeddad000 rw-p 00000000 00:00 0
2aaaeddad000-2aaaeddae000 ---p 00000000 00:00 0
2aaaeddae000-2aaaedfae000 rw-p 00000000 00:00 0
2aaaedfae000-2aaaeffae000 rw-p 00000000 00:00 0
2aaaeffae000-2aaaeffaf000 ---p 00000000 00:00 0
2aaaeffaf000-2aaaf01af000 rw-p 00000000 00:00 0
2aaaf01af000-2aaaf21af000 rw-p 00000000 00:00 0
2aaaf21af000-2aaaf21b0000 ---p 00000000 00:00 0
2aaaf21b0000-2aaaf23b0000 rw-p 00000000 00:00 0
2aaaf23b0000-2aaaf43b0000 rw-p 00000000 00:00 0
2aaaf43b0000-2aaaf43b1000 ---p 00000000 00:00 0
2aaaf43b1000-2aaaf45b1000 rw-p 00000000 00:00 0
2aaaf45b1000-2aaaf65b1000 rw-p 00000000 00:00 0
2aaaf65b1000-2aaaf65b2000 ---p 00000000 00:00 0
2aaaf65b2000-2aaaf67b2000 rw-p 00000000 00:00 0
2aaaf67b2000-2aaaf87b2000 rw-p 00000000 00:00 0
2aaaf87b2000-2aaaf87b3000 ---p 00000000 00:00 0
2aaaf87b3000-2aaaf89b3000 rw-p 00000000 00:00 0
2aaaf89b3000-2aaafa9b3000 rw-p 00000000 00:00 0
2aaafa9b3000-2aaafa9b4000 ---p 00000000 00:00 0
2aaafa9b4000-2aaafabb4000 rw-p 00000000 00:00 0
2aaafabb4000-2aaafcbb4000 rw-p 00000000 00:00 0
2aaafcbb4000-2aaafcbb5000 ---p 00000000 00:00 0
2aaafcbb5000-2aaafcdb5000 rw-p 00000000 00:00 0
2aaafcdb5000-2aaafedb5000 rw-p 00000000 00:00 0
2aaafedb5000-2aaafedb6000 ---p 00000000 00:00 0
2aaafedb6000-2aaafefb6000 rw-p 00000000 00:00 0
2aaafefb6000-2aab00fb6000 rw-p 00000000 00:00 0
2aab00fb6000-2aab00fb7000 ---p 00000000 00:00 0
2aab00fb7000-2aab011b7000 rw-p 00000000 00:00 0
2aab011b7000-2aab031b7000 rw-p 00000000 00:00 0
2aab031b7000-2aab031b8000 ---p 00000000 00:00 0
2aab031b8000-2aab033b8000 rw-p 00000000 00:00 0
2aab033b8000-2aab053b8000 rw-p 00000000 00:00 0
2aab053b8000-2aab053b9000 ---p 00000000 00:00 0
2aab053b9000-2aab055b9000 rw-p 00000000 00:00 0
2aab055b9000-2aab095b9000 rw-p 00000000 00:00 0
2aab095b9000-2aab095ca000 r-xp 00000000 00:1d 111568                     /usr/lib64/python2.7/lib-dynload/datetime.so
2aab095ca000-2aab097c9000 ---p 00011000 00:1d 111568                     /usr/lib64/python2.7/lib-dynload/datetime.so
2aab097c9000-2aab097ca000 r--p 00010000 00:1d 111568                     /usr/lib64/python2.7/lib-dynload/datetime.so
2aab097ca000-2aab097ce000 rw-p 00011000 00:1d 111568                     /usr/lib64/python2.7/lib-dynload/datetime.so
2aab097ce000-2aab09980000 r-xp 00000000 00:2c 8279250639                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/umath.so
2aab09980000-2aab09b7f000 ---p 001b2000 00:2c 8279250639                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/umath.so
2aab09b7f000-2aab09b86000 rw-p 001b1000 00:2c 8279250639                 /upb/departments/pc2/users/m/mariusme/.local/lib/python2.7/site-packages/numpy/core/umath.so
2aab09b86000-2aab09c49000 rw-p 00000000 00:00 0
2aab09c49000-2aab09c63000 r-xp 00000000 00:1d 111542                     /usr/lib64/python2.7/lib-dynload/_ctypes.so
2aab09c63000-2aab09e62000 ---p 0001a000 00:1d 111542                     /usr/lib64/python2.7/lib-dynload/_ctypes.so
2aab09e62000-2aab09e63000 r--p 00019000 00:1d 111542                     /usr/lib64/python2.7/lib-dynload/_ctypes.so
2aab09e63000-2aab09e67000 rw-p 0001a000 00:1d 111542                     /usr/lib64/python2.7/lib-dynload/_ctypes.so
2aab09e67000-2aab09e6e000 r-xp 00000000 00:1d 108901                     /usr/lib64/libffi.so.6.0.1
2aab09e6e000-2aab0a06d000 ---p 00007000 00:1d 108901                     /usr/lib64/libffi.so.6.0.1
2aab0a06d000-2aab0a06e000 r--p 00006000 00:1d 108901                     /usr/lib64/libffi.so.6.0.1
2aab0a06e000-2aab0a06f000 rw-p 00007000 00:1d 108901                     /usr/lib64/libffi.so.6.0.1
2aab0a06f000-2aab0a076000 r-xp 00000000 00:1d 111560                     /usr/lib64/python2.7/lib-dynload/_struct.so
2aab0a076000-2aab0a275000 ---p 00007000 00:1d 111560                     /usr/lib64/python2.7/lib-dynload/_struct.so
2aab0a275000-2aab0a276000 r--p 00006000 00:1d 111560                     /usr/lib64/python2.7/lib-dynload/_struct.so
2aab0a276000-2aab0a278000 rw-p 00007000 00:1d 111560                     /usr/lib64/python2.7/lib-dynload/_struct.so
2aab0a278000-2aab0a28a000 r-xp 00000000 00:1d 111565                     /usr/lib64/python2.7/lib-dynload/cPickle.so
2aab0a28a000-2aab0a48a000 ---p 00012000 00:1d 111565                     /usr/lib64/python2.7/lib-dynload/cPickle.so
2aab0a48a000-2aab0a48b000 r--p 00012000 00:1d 111565                     /usr/lib64/python2.7/lib-dynload/cPickle.so
2aab0a48b000-2aab0a48c000 rw-p 00013000 00:1d 111565                     /usr/lib64/python2.7/lib-dynload/cPickle.so
2aab0a48c000-2aab0a48e000 r-xp 00000000 00:1d 111574                     /usr/lib64/python2.7/lib-dynload/grpmodule.so
2aab0a48e000-2aab0a68d000 ---p 00002000 00:1d 111574                     /usr/lib64/python2.7/lib-dynload/grpmodule.so
2aab0a68d000-2aab0a68e000 r--p 00001000 00:1d 111574                     /usr/lib64/python2.7/lib-dynload/grpmodule.so
2aab0a68e000-2aab0a68f000 rw-p 00002000 00:1d 111574                     /usr/lib64/python2.7/lib-dynload/grpmodule.so
2aab0a68f000-2aab0a6ab000 r-xp 00000000 00:1d 111550                     /usr/lib64/python2.7/lib-dynload/_io.so
2aab0a6ab000-2aab0a8aa000 ---p 0001c000 00:1d 111550                     /usr/lib64/python2.7/lib-dynload/_io.so
2aab0a8aa000-2aab0a8ab000 r--p 0001b000 00:1d 111550                     /usr/lib64/python2.7/lib-dynload/_io.so
Platform Information:
Platform name:       Xilinx
Platform version:    OpenCL 1.0
Platform profile:    EMBEDDED_PROFILE
Platform extensions: cl_khr_icd
Loading xclbin
LOOP PIPELINE 16 beats
Test 0, Throughput: 6553 MB/s
LOOP PIPELINE 64 beats
Test 1, Throughput: 18724 MB/s
LOOP PIPELINE 256 beats
Test 2, Throughput: 32768 MB/s
LOOP PIPELINE 1024 beats
ERROR: Failed to copy entries
/usr/lib/python2.7/site-packages/pkg_resources/py2_warn.py:22: UserWarning: Setuptools will stop working on Python 2
************************************************************
You are running Setuptools on Python 2, which is no longer
supported and
>>> SETUPTOOLS WILL STOP WORKING <<<
in a subsequent release (no sooner than 2020-04-20).
Please ensure you are installing
Setuptools using pip 9.x or later or pin to `setuptools<45`
in your environment.
If you have done those things and are still encountering
this message, please comment in
https://github.com/pypa/setuptools/issues/1458
about the steps that led to this unsupported combination.
************************************************************
  sys.version_info < (3,) and warnings.warn(pre + "*" * 60 + msg + "*" * 60)
XRT build version: 2.5.309
Build hash: 9a03790c11f066a5597b133db737cf4683ad84c8
Build date: 2020-02-24 02:54:37
Git branch: 2019.2_PU2
PID: 260044
UID: 18577
[Tue Mar 10 09:40:30 2020]
HOST: fpga-0010
EXE: /usr/bin/python2.7
[XRT] ERROR: kernel '_source' not found
[XRT] ERROR: kernel '_source' not found
 
ERROR: == device memory bandwidth test FAILED
INFO: Card[0] failed to validate.
 
ERROR: Some cards failed to validate.

 

 

3) After the first power cycle following the crash, the card didn't show up as a PCIe device at all (lspci).

4) After another power cycle, the card showed up, but only with a Gen1x16 PCIe link. During the next validate, the machine crashed at the kernel test:

 

cat run_1_4_validate.out
INFO: Found 1 cards
 
INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen1x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:

 

Now our last resort is to plug the card into another server to rule out problems with the mainboard, but since we already swapped the PCIe slot (and a different card works fine in the respective other PCIe slot), I'm not confident that this will help.

Observer
Registered: 10-06-2016

We tested the card in a different server node; again, the PCIe link is degraded (here to Gen2). The first validation run crashed the machine during the bandwidth test.

INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen2x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check: 
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test: 
INFO: == verify kernel test PASSED
INFO: == Starting DMA test: 
Host -> PCIe -> FPGA write bandwidth = 6705.61 MB/s
Host <- PCIe <- FPGA read bandwidth = 6723.4 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test: 
........

 

Xilinx Employee
Registered: 10-19-2015

Hi @kenter 

I agree this looks like a PCIe link problem. 

Please send me the output of: sudo lspci -vvd 10ee:

What server are you testing in? 

Regards,

M

 

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Observer
Registered: 10-06-2016

Hi @mcertosi ,

the servers are Intel R2000WF systems with an Intel S2600WFT (Wolf Pass) server board and two Skylake Gold 6148 processors: https://www.intel.com/content/www/us/en/server-chassis/server-chassis-r2000wf.html

 

sudo lspci -vvd 10ee: 
16:00.0 Processing accelerators: Xilinx Corporation Device 500c
	Subsystem: Xilinx Corporation Device 000e
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	NUMA node: 0
	Region 0: Memory at 387ff2000000 (64-bit, prefetchable) [size=32M]
	Region 2: Memory at 387ff4000000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=2 offset=00009000
		PBA: BAR=2 offset=00009fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [1c0 v1] #19
	Capabilities: [e00 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [e10 v1] #15
	Kernel driver in use: xclmgmt
	Kernel modules: xclmgmt

16:00.1 Processing accelerators: Xilinx Corporation Device 500d
	Subsystem: Xilinx Corporation Device 000e
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 321
	NUMA node: 0
	Region 0: Memory at 387ff0000000 (64-bit, prefetchable) [size=32M]
	Region 2: Memory at 387ff4020000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at 387fe0000000 (64-bit, prefetchable) [size=256M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=2 offset=00008000
		PBA: BAR=2 offset=00008fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [e00 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [e10 v1] #15
	Kernel driver in use: xocl
	Kernel modules: xocl
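
For anyone comparing such dumps by hand: in the output above, LnkCap advertises 8GT/s x16 while LnkSta reports 5GT/s x16, which is the Gen2 downgrade. A minimal sketch that flags this automatically from `lspci -vv` text follows. The function name `find_degraded_links` and the regular expression are illustrative assumptions, not part of XRT or pciutils, and the sample text is abbreviated from the listing above.

```python
import re

# Matches the speed/width pair on an lspci "LnkCap:" or "LnkSta:" line.
# "LnkCap2:" / "LnkSta2:" lines are intentionally not matched.
LINK_RE = re.compile(r"(LnkCap|LnkSta):\s.*?Speed ([\d.]+)GT/s.*?Width x(\d+)")


def find_degraded_links(lspci_text):
    """Return (device, capability, status) for every PCI function whose
    negotiated link (LnkSta) is slower or narrower than what it
    advertises (LnkCap). Speeds are in GT/s, widths in lanes."""
    degraded = []
    device, cap = None, None
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # Unindented lines start a new PCI function, e.g. "16:00.0 ...".
            device, cap = line.split()[0], None
            continue
        match = LINK_RE.search(line)
        if not match:
            continue
        kind = match.group(1)
        speed, width = float(match.group(2)), int(match.group(3))
        if kind == "LnkCap":
            cap = (speed, width)
        elif cap is not None and (speed < cap[0] or width < cap[1]):
            degraded.append((device, cap, (speed, width)))
    return degraded


# Abbreviated sample taken from the lspci output in this post.
sample = (
    "16:00.0 Processing accelerators: Xilinx Corporation Device 500c\n"
    "\t\tLnkCap:\tPort #0, Speed 8GT/s, Width x16, ASPM not supported\n"
    "\t\tLnkSta:\tSpeed 5GT/s, Width x16, TrErr- Train- SlotClk+\n"
)
for dev, cap, sta in find_degraded_links(sample):
    print("%s: capable of %sGT/s x%d, running at %sGT/s x%d" % (dev, cap[0], cap[1], sta[0], sta[1]))
```

To use it on a live system, the text could come from the saved `run_*_lspci.out` files, provided they were captured with `-vv` so that the LnkCap/LnkSta lines are present.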

 

Observer
Registered: 10-06-2016

For info, after generating the lspci output, I ran validate again to see which PCIe status would be reported there.

It succeeded 3 times with warnings, as before in Gen2 mode:

INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u280_xdma_201920_3
INFO: == Starting AUX power connector check: 
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check: 
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen2x16
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check: 
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test: 
INFO: == verify kernel test PASSED
INFO: == Starting DMA test: 
Host -> PCIe -> FPGA write bandwidth = 6712.99 MB/s
Host <- PCIe <- FPGA read bandwidth = 6726.22 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test: 
............
Maximum throughput: 43690 MB/s
INFO: == device memory bandwidth test PASSED
INFO: == Starting PCIE peer-to-peer test: 
P2P BAR is not enabled. Skipping validation
INFO: == PCIE peer-to-peer test SKIPPED
INFO: == Starting memory-to-memory DMA test: 
M2M is not available. Skipping validation
INFO: == memory-to-memory DMA test SKIPPED
INFO: Card[0] validated with warnings.

INFO: All cards validated successfully but with warnings.
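
Since the runs differ mainly in which test they reach before hanging or crashing, a small summary of each saved `*_validate.out` file can help when comparing many runs. The following is only a sketch that assumes the line format seen in the outputs in this thread ("INFO: == Starting <test>:" followed by "<test> PASSED/FAILED/SKIPPED"); the function name `summarize_validate` is made up for illustration and other XRT versions may print different text.

```python
import re

START_RE = re.compile(r"^INFO: == Starting (.+):$")
RESULT_RE = re.compile(
    r"^(?:INFO|WARN|ERROR): == (.+) (PASSED|FAILED|SKIPPED)( with warning)?$"
)


def summarize_validate(text):
    """Return (results, unfinished).

    results maps each test name to its reported outcome; unfinished is the
    name of the last test that started but never reported an outcome
    (the likely hang or crash point), or None if every test completed."""
    results = {}
    unfinished = None
    for raw in text.splitlines():
        line = raw.strip()
        started = START_RE.match(line)
        if started:
            unfinished = started.group(1)
            continue
        finished = RESULT_RE.match(line)
        if finished:
            name = finished.group(1)
            results[name] = finished.group(2) + (finished.group(3) or "")
            if name == unfinished:
                unfinished = None
    return results, unfinished


# Abbreviated from a run in this thread that stopped in the kernel test.
truncated_run = """INFO: == Starting PCIE link check:
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
"""
results, unfinished = summarize_validate(truncated_run)
print(results)
print("stopped in:", unfinished)
```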

In the fourth iteration during bandwidth test, the machine crashed again, this time producing this output:

Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]: event severity: fatal
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:  Error 0, type: fatal
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   section_type: PCIe error
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   port_type: 4, root port
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   version: 3.0
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   command: 0x0547, status: 0x4010
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   device_id: 0000:15:00.0
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   slot: 0
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   secondary_bus: 0x16
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2030
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   class_code: 000406
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:{1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
 kernel:Kernel panic - not syncing: Fatal hardware error!
 
Message from syslogd@fpga-0011 at Mar 11 13:41:51 ...
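
The `command`, `status`, and `secondary_status` values in the report above are ordinary PCI configuration registers of the reporting root port (0000:15:00.0, whose secondary bus 0x16 is where the card sits). A minimal sketch for decoding them into bit names is given below. The bit names follow the generic PCI Command and Status register layout; the dictionaries and the `decode` helper are illustrative, not taken from any Xilinx or kernel tool. Note that in a bridge's Secondary Status register bit 14 means "Received System Error" rather than "Signaled System Error".

```python
# Bit positions of the standard PCI Command register.
COMMAND_BITS = {
    0: "I/O Space",
    1: "Memory Space",
    2: "Bus Master",
    3: "Special Cycles",
    4: "Memory Write and Invalidate",
    5: "VGA Palette Snoop",
    6: "Parity Error Response",
    8: "SERR# Enable",
    9: "Fast Back-to-Back Enable",
    10: "Interrupt Disable",
}

# Bit positions of the standard PCI Status register. The same layout is
# used for a bridge's Secondary Status, except that bit 14 there means
# "Received System Error".
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    5: "66 MHz Capable",
    7: "Fast Back-to-Back Capable",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode(value, names):
    """List the names of all bits set in a 16-bit register value."""
    return [
        names.get(bit, "bit %d (reserved)" % bit)
        for bit in range(16)
        if value & (1 << bit)
    ]


# Values copied from the hardware error report in this post.
print("command 0x0547:", decode(0x0547, COMMAND_BITS))
print("status 0x4010:", decode(0x4010, STATUS_BITS))
print("secondary_status 0x2000:", decode(0x2000, STATUS_BITS))
```

With these definitions, the secondary status 0x2000 decodes to Received Master Abort, which would be consistent with a request toward the card going unanswered, although on its own it does not prove that the card rather than the slot is at fault.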
Xilinx Employee
Registered: 12-10-2013

Hi @kenter 

Do you have another server you could try the card in - or a different slot?  We are suspecting a potential hardware issue, and it would be good to rule out the slot. 

 

Observer
Registered: 10-06-2016

Hi @bethe ,

we already tried a different slot in the first machine last week and have been testing in a different machine since yesterday, always with errors.

So from my perspective it now boils down to two possibilities: either the card is broken, or this type of server board is incompatible. Then again, the card ran fine for several weeks in the same server.

Observer
Registered: 10-06-2016

Short update for anyone following this now or later:

  • Shortly after the last post, the card no longer showed up as a PCIe device at all
  • We now have a new U280 running in one of those PCIe slots
  • The firmware mismatch I observed earlier in this thread doesn't show up with this card, possibly due to newer firmware in the golden image?
## New card shipping/factory reset status:
/opt/xilinx/xrt/bin/xbmgmt flash --scan --verbose
...
    Flashable partition running on FPGA:
        xilinx_u280_GOLDEN_8,[SC=4.3]

## Old card shipping/factory reset status:
/opt/xilinx/xrt/bin/xbmgmt flash --scan --verbose
...
    Flashable partition running on FPGA:
        xilinx_u280_GOLDEN_8,[SC=4.2]
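
To compare the two cards without reading the scan output by eye, the partition name and SC version can be pulled out of the "name,[SC=x.y]" line. This is a small sketch that assumes exactly the format shown in the two listings above; `parse_partition` is a hypothetical helper, not an XRT function.

```python
import re

# Matches lines such as "xilinx_u280_GOLDEN_8,[SC=4.3]".
PARTITION_RE = re.compile(r"^\s*(\S+),\[SC=(\d+(?:\.\d+)*)\]\s*$")


def parse_partition(line):
    """Return (partition_name, sc_version_tuple) or None if the line
    is not in the 'name,[SC=x.y]' format."""
    match = PARTITION_RE.match(line)
    if not match:
        return None
    version = tuple(int(part) for part in match.group(2).split("."))
    return match.group(1), version


new_card = parse_partition("        xilinx_u280_GOLDEN_8,[SC=4.3]")
old_card = parse_partition("        xilinx_u280_GOLDEN_8,[SC=4.2]")
print("same partition:", new_card[0] == old_card[0])
print("new card has newer SC firmware:", new_card[1] > old_card[1])
```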

 

Moderator
Registered: 06-14-2010

Hello @kenter ,

So, if I understood correctly, you now have two U280 cards, and with one of these you are still seeing issues (on the same server), while the new U280 card is working as expected? Can you confirm this, please? Thanks

Kind Regards,
Anatoli Curran,
Xilinx Technical Support
Observer
Registered: 10-06-2016

Hi @anatoli , yes, that is correct: the new U280 is working as expected, while the old one is not usable. I'm currently waiting for an RMA request to be reviewed (SR# 10487667).