gitbisector
Visitor
Registered: 09-27-2020

Kernel download crashed the U50 board?

My kernel download call fails with err=-110.

I guess that is ETIMEDOUT. The call that fails is:

cl::Program program(context, {device}, bins, NULL, &err);
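To double-check the guess, here is a minimal C++ sketch. It assumes that XRT reports a negated Linux errno here, which the "Timeout" in its log suggests, and that the errno numbering is the Linux x86-64 one, where ETIMEDOUT is 110. The helper names `describe_err` and `is_timeout` are hypothetical and not part of XRT.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn a negative errno-style return code, such as the
// err=-110 printed by XRT, into the C library's description of it.
// Assumes Linux errno numbering.
std::string describe_err(int err) {
    if (err >= 0) return "success";
    return std::strerror(-err);  // glibc prints "Connection timed out" for ETIMEDOUT
}

// Hypothetical helper: true if the code is the negated ETIMEDOUT.
bool is_timeout(int err) { return err == -ETIMEDOUT; }
```

On Linux, `is_timeout(-110)` returns true, so the value in the error message is consistent with a timeout on the host side while waiting for the management driver.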

After this, other kernel downloads (such as hello_world) fail too:

Git branch: 2020.1
PID: 703
UID: 0
[Sat Oct 17 02:59:24 2020]
HOST: JARVICENAE-0A0A185E
EXE: /data/Vitis_Accel_Examples/hello_world/host
[XRT] ERROR: Can't reach out to mgmt for xclbin downloading
[XRT] ERROR: Is xclmgmt driver loaded? Or is MSD/MPD running?
[XRT] ERROR: See dmesg log for details. err=-110
[XRT] ERROR: Failed to load xclbin. Timeout, see dmesg for details
Failed to program device[0] with xclbin file!

Even validation fails after this; I have to start a new cloud instance to do anything else.

The 'hw_emu' target seems to work fine.

Any pointers for the next debugging steps?

emeryw
Xilinx Employee
Registered: 12-06-2019

Hi @gitbisector,

Can you please upload the outputs of the following:

dmesg
lspci -vd 10ee:
xbutil query

Those will help us see what is going on.

Best,

-Emery

gitbisector
Visitor
Registered: 09-27-2020

Hi @emeryw,

Thanks for looking into my troubles.

lspci -vd 10ee:

82:00.0 Processing accelerators: Xilinx Corporation Device 5020
	Subsystem: Xilinx Corporation Device 000e
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0
	Memory at 27ff2000000 (64-bit, prefetchable) [size=32M]
	Memory at 27ff4020000 (64-bit, prefetchable) [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: xclmgmt
lspci: Unable to load libkmod resources: error -12

82:00.1 Processing accelerators: Xilinx Corporation Device 5021
	Subsystem: Xilinx Corporation Device 000e
	Physical Slot: 1
	Flags: bus master, fast devsel, latency 0, IRQ 32
	Memory at 27ff0000000 (64-bit, prefetchable) [size=32M]
	Memory at 27ff4000000 (64-bit, prefetchable) [size=128K]
	Memory at 27fe0000000 (64-bit, prefetchable) [size=256M]
	Capabilities: <access denied>
	Kernel driver in use: xocl

xbutil query

INFO: Found total 1 card(s), 1 are usable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
System Configuration
OS name:	Linux
Release:	4.15.0-107-generic
Version:	#108~16.04.1-Ubuntu SMP Fri Jun 12 02:57:13 UTC 2020
Machine:	x86_64
Model:		SYS-1028GQ-TR
CPU cores:	16
Memory:		128824 MB
Glibc:		2.23
Distribution:	Ubuntu 16.04.6 LTS
Now:		Tue Oct 20 00:44:47 2020
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
XRT Information
Version:	2.6.655
Git Hash:	2d6bfe4ce91051d4e5b499d38fc493586dd4859a
Git Branch:	2020.1
Build Date:	2020-05-22 12:03:17
XOCL:		2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a
XCLMGMT:	2.6.655,2d6bfe4ce91051d4e5b499d38fc493586dd4859a

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Shell                           FPGA                            IDCode
xilinx_u50_gen3x16_xdma_201920_3                                0x14b77093
Vendor          Device          SubDevice       SubVendor       SerNum          
0x10ee          0x5021          0x000e          0x10ee          501211101H9D    
DDR size        DDR count       Clock0          Clock1          Clock2          
0 Byte          0               300             500             450             
PCIe            DMA chan(bidir) MIG Calibrated  P2P Enabled     OEM ID          
GEN 3x16        2               true            N/A             0x30314144(N/A) 
Interface UUID
862c7020a250293e32036f19956669e5
Logic UUID
f465b0a3ae8c64f619bc150384ace69b
DNA

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Temperature(C)
PCB TOP FRONT   PCB TOP REAR    PCB BTM FRONT   VCCINT TEMP     
22              22              N/A             26              
FPGA TEMP       TCRIT Temp      FAN Presence    FAN Speed(RPM)  
30              21              P               N/A             
QSFP 0          QSFP 1          QSFP 2          QSFP 3          
N/A             N/A             N/A             N/A             
HBM TEMP        
24              
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Electrical(mV|mA)
12V PEX         12V AUX         12V PEX Current 12V AUX Current 
12068           N/A             1137            N/A             
3V3 PEX         3V3 AUX         DDR VPP BOTTOM  DDR VPP TOP     
3372            N/A             N/A             N/A             
SYS 5V5         1V2 TOP         1V8 TOP         0V85            
5007            N/A             1804            N/A             
MGT 0V9         12V SW          MGT VTT         1V2 BTM         
899             N/A             1202            N/A             
VCCINT VOL      VCCINT CURR     VCCINT IO VOL   VCC3V3 VOL      
849             4300            850             3307            
3V3 PEX CURR    VCCINT IO CURR  HBM1V2 VOL      VPP2V5 VOL      
223             1900            1196            2490            
VCC1V2 CURR     V12 I CURR      V12 AUX0 CURR   V12 AUX1 CURR   
N/A             N/A             N/A             N/A             
12V AUX1 VOL    VCCAUX VOL      VCCAUX PMC VOL  VCCRAM VOL      
N/A             N/A             N/A             N/A             
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Card Power(W)
14
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Firewall Last Error Status
Level 0 : 0x0(GOOD)

ECC Error Status
Tag     Errors      CE Count  UE Count  CE FFA              UE FFA              
HBM[0]  (None)      0         0         0x0                 0x0                 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Memory Status
     Tag         Type        Temp(C)  Size    Mem Usage       BO count
[ 0] HBM[0]      MEM_HBM     24       256 MB  0 Byte          0       
[ 1] HBM[1]      **UNUSED**  24       256 MB  0 Byte          0       
[ 2] HBM[2]      **UNUSED**  24       256 MB  0 Byte          0       
[ 3] HBM[3]      **UNUSED**  24       256 MB  0 Byte          0       
[ 4] HBM[4]      **UNUSED**  24       256 MB  0 Byte          0       
[ 5] HBM[5]      **UNUSED**  24       256 MB  0 Byte          0       
[ 6] HBM[6]      **UNUSED**  24       256 MB  0 Byte          0       
[ 7] HBM[7]      **UNUSED**  24       256 MB  0 Byte          0       
[ 8] HBM[8]      **UNUSED**  24       256 MB  0 Byte          0       
[ 9] HBM[9]      **UNUSED**  24       256 MB  0 Byte          0       
[ a] HBM[10]     **UNUSED**  24       256 MB  0 Byte          0       
[ b] HBM[11]     **UNUSED**  24       256 MB  0 Byte          0       
[ c] HBM[12]     **UNUSED**  24       256 MB  0 Byte          0       
[ d] HBM[13]     **UNUSED**  24       256 MB  0 Byte          0       
[ e] HBM[14]     **UNUSED**  24       256 MB  0 Byte          0       
[ f] HBM[15]     **UNUSED**  24       256 MB  0 Byte          0       
[10] HBM[16]     **UNUSED**  24       256 MB  0 Byte          0       
[11] HBM[17]     **UNUSED**  24       256 MB  0 Byte          0       
[12] HBM[18]     **UNUSED**  24       256 MB  0 Byte          0       
[13] HBM[19]     **UNUSED**  24       256 MB  0 Byte          0       
[14] HBM[20]     **UNUSED**  24       256 MB  0 Byte          0       
[15] HBM[21]     **UNUSED**  24       256 MB  0 Byte          0       
[16] HBM[22]     **UNUSED**  24       256 MB  0 Byte          0       
[17] HBM[23]     **UNUSED**  24       256 MB  0 Byte          0       
[18] HBM[24]     **UNUSED**  24       256 MB  0 Byte          0       
[19] HBM[25]     **UNUSED**  24       256 MB  0 Byte          0       
[1a] HBM[26]     **UNUSED**  24       256 MB  0 Byte          0       
[1b] HBM[27]     **UNUSED**  24       256 MB  0 Byte          0       
[1c] HBM[28]     **UNUSED**  24       256 MB  0 Byte          0       
[1d] HBM[29]     **UNUSED**  24       256 MB  0 Byte          0       
[1e] HBM[30]     **UNUSED**  24       256 MB  0 Byte          0       
[1f] HBM[31]     **UNUSED**  24       256 MB  0 Byte          0       
[20] PLRAM[0]    **UNUSED**  N/A      0 Byte  0 Byte          0       
[21] PLRAM[1]    **UNUSED**  N/A      0 Byte  0 Byte          0       
[22] PLRAM[2]    **UNUSED**  N/A      0 Byte  0 Byte          0       
[23] PLRAM[3]    **UNUSED**  N/A      0 Byte  0 Byte          0       
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DMA Transfer Metrics
Chan[0].h2c:  2 KB
Chan[0].c2h:  2 KB
Chan[1].h2c:  0 Byte
Chan[1].c2h:  0 Byte
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Streams
     Tag         Flow ID  Route ID Status   Total (B/#)     Pending (B/#)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Xclbin UUID
bc161fa8-f4dc-49d3-99c9-53a0b644c836
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Compute Unit Status
CU[ 0]: hello:hello_1                   @0x1400000         (IDLE)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Partition Info:
    installation
        installed_package_dir: /opt/xilinx/firmware/u50/gen3x16-xdma/blp
        installed_package_name: xilinx-u50-gen3x16-xdma-blp
        installed_package_release: 2784799
        installed_package_version: 1
    partition_card: u50
    partition_family: gen3x16-xdma
    partition_features
        pcie
            device_ids
                5020
                    role: management_pf
                5021
                    role: user_pf
            extended_capabilities: enabled
            max_link_speed: gen3x16
            subsystem_id: 000e
            vendor_id: 10ee
    partition_identifiers
        interface_uuids
            862c7020a250293e32036f19956669e5
                type: exposed
        logic_uuid: f465b0a3ae8c64f619bc150384ace69b
    partition_name: blp
    partition_type: blp
    partition_vendor: xilinx
    vbnv_override: xilinx:u50:gen3x16_xdma:201920_3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO: xbutil query succeeded.

dmesg output attached. These might be the most interesting lines:

[Tue Oct 20 00:45:58 2020] BUG: unable to handle kernel paging request at ffffbaef49181aa8
[Tue Oct 20 00:45:58 2020] IP: icap_download_bitstream_axlf+0x186b/0x1cd0 [xclmgmt]
[Tue Oct 20 00:45:58 2020] PGD 103f948067 P4D 103f948067 PUD 203f002067 PMD 203948d067 PTE 0
[Tue Oct 20 00:45:58 2020] Oops: 0000 [#1] SMP PTI
[Tue Oct 20 00:45:58 2020] Modules linked in: veth overlay ceph fscache rbd libceph binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx5_ib(OE) mlx4_ib(OE) ib_uverbs(OE) ib_core(OE) mlx4_en(OE) mlx4_core(OE) bridge 8021q garp mrp stp llc bonding intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 input_leds joydev crypto_simd glue_helper cryptd ipmi_ssif mei_me intel_cstate intel_rapl_perf xclmgmt(OE) xocl(OE) fpga_mgr ioatdma mei libcrc32c ipmi_si ipmi_devintf
[Tue Oct 20 00:45:58 2020]  ipmi_msghandler acpi_power_meter shpchp acpi_pad mac_hid lpc_ich nfsd auth_rpcgss nfs_acl lockd knem(OE) grace sunrpc autofs4 hid_generic usbhid hid mxm_wmi ast ttm drm_kms_helper syscopyarea sysfillrect mlx5_core(OE) sysimgblt mlxfw(OE) devlink fb_sys_fops mlx_compat(OE) ixgbe igb drm i2c_algo_bit dca ahci ptp libahci pps_core mdio wmi
[Tue Oct 20 00:45:58 2020] CPU: 9 PID: 600 Comm: kworker/u32:4 Tainted: G           OE    4.15.0-107-generic #108~16.04.1-Ubuntu
[Tue Oct 20 00:45:58 2020] Hardware name: Supermicro SYS-1028GQ-TR/X10DGQ, BIOS 3.1a 04/17/2019
[Tue Oct 20 00:45:58 2020] Workqueue: mailbox.m.15728640 mailbox_recv_request [xclmgmt]
[Tue Oct 20 00:45:58 2020] RIP: 0010:icap_download_bitstream_axlf+0x186b/0x1cd0 [xclmgmt]
[Tue Oct 20 00:45:58 2020] RSP: 0018:ffffbaef47d8bcd0 EFLAGS: 00010282
[Tue Oct 20 00:45:58 2020] RAX: ffffbaef49181aa8 RBX: ffffbaef472d1000 RCX: ffff8fde7b907050
[Tue Oct 20 00:45:58 2020] RDX: 0000000000000a28 RSI: 0000000000000092 RDI: 0000000000000000
[Tue Oct 20 00:45:58 2020] RBP: ffffbaef47d8bd80 R08: 000000000001c90c R09: ffffffff9595f4c4
[Tue Oct 20 00:45:58 2020] R10: 0000000000000000 R11: 0000000000000c3f R12: ffff8fde78b1c058
[Tue Oct 20 00:45:58 2020] R13: 0000000000000041 R14: ffff8fde78b1f800 R15: 0000000000000002
[Tue Oct 20 00:45:58 2020] FS:  0000000000000000(0000) GS:ffff8fdebfc40000(0000) knlGS:0000000000000000
[Tue Oct 20 00:45:58 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Tue Oct 20 00:45:58 2020] CR2: ffffbaef49181aa8 CR3: 0000001cace0a002 CR4: 00000000001606e0
[Tue Oct 20 00:45:58 2020] Call Trace:
[Tue Oct 20 00:45:58 2020]  xclmgmt_mailbox_srv+0x426/0xee0 [xclmgmt]
[Tue Oct 20 00:45:58 2020]  ? __dev_printk+0x3c/0x80
[Tue Oct 20 00:45:58 2020]  ? _dev_info+0x64/0x80
[Tue Oct 20 00:45:58 2020]  mailbox_recv_request+0x106/0x4a0 [xclmgmt]
[Tue Oct 20 00:45:58 2020]  ? health_check_cb+0x400/0x400 [xclmgmt]
[Tue Oct 20 00:45:58 2020]  ? mailbox_recv_request+0x106/0x4a0 [xclmgmt]
[Tue Oct 20 00:45:58 2020]  process_one_work+0x14d/0x410
[Tue Oct 20 00:45:58 2020]  worker_thread+0x4b/0x460
[Tue Oct 20 00:45:58 2020]  kthread+0x105/0x140
[Tue Oct 20 00:45:58 2020]  ? process_one_work+0x410/0x410
[Tue Oct 20 00:45:58 2020]  ? kthread_bind+0x40/0x40
[Tue Oct 20 00:45:58 2020]  ret_from_fork+0x35/0x40
[Tue Oct 20 00:45:58 2020] Code: 4c 89 eb 45 89 d5 eb 50 49 63 c5 48 6b d0 28 80 7c 13 09 00 74 3e 48 8b 8d 60 ff ff ff 48 69 c0 a8 10 00 00 48 03 81 00 01 00 00 <48> 8b 38 48 85 ff 74 1c 48 8b 40 08 48 85 c0 74 13 48 8b 40 18 
[Tue Oct 20 00:45:58 2020] RIP: icap_download_bitstream_axlf+0x186b/0x1cd0 [xclmgmt] RSP: ffffbaef47d8bcd0
[Tue Oct 20 00:45:58 2020] CR2: ffffbaef49181aa8
[Tue Oct 20 00:45:58 2020] ---[ end trace 9e7254a84f91ca99 ]---

xbutil hangs after the failed download, so the xbutil output above was captured before the crash.

BTW, to verify that there is no problem with my host code, I pointed it at the xclbin generated for the 'hbm_bandwidth' example instead, and my host program then loads the FPGA image fine. Conversely, loading my xclbin with the hbm_bandwidth host code throws the same error. So the problem seems tied to my xclbin rather than to my host code.
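A related quick sanity check on the xclbin file itself, sketched in C++. The helpers `read_xclbin` and `has_xclbin_magic` are hypothetical. The sketch assumes the file begins with the 8-byte axlf magic `"xclbin2\0"`. Reading the whole file into a byte buffer mirrors what Vitis host code usually does before passing the data as the `bins` argument to `cl::Program`. This only rules out a truncated or wrong file. Since this xclbin gets as far as `icap_download_bitstream_axlf` in the driver, it would likely pass.

```cpp
#include <cassert>
#include <cstring>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: read an entire .xclbin file into memory as raw bytes.
std::vector<unsigned char> read_xclbin(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    if (!f) throw std::runtime_error("cannot open " + path);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(f),
                                      std::istreambuf_iterator<char>());
}

// Hypothetical helper: check the leading 8-byte magic (assumed "xclbin2\0").
// Detects truncated or non-xclbin files only, not driver-side failures.
bool has_xclbin_magic(const std::vector<unsigned char>& buf) {
    static const char magic[8] = {'x', 'c', 'l', 'b', 'i', 'n', '2', '\0'};
    return buf.size() >= sizeof(magic) &&
           std::memcmp(buf.data(), magic, sizeof(magic)) == 0;
}
```

Running both helpers on my xclbin and on the hbm_bandwidth one would at least confirm that both files are well-formed containers before the download is attempted.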
