Observer
Registered: 11-21-2018

xbutil hangs when trying to program Alveo u250

I am having a problem when trying to program a U250 card with my latest xclbin image.

I have previously had no problems programming the card, but having added a second kernel I now see xbutil hang when trying to program the board with the new xclbin file:

mike@pc:~$ xbutil program -p ~/builds/2-kernel-image.xclbin
INFO: Found total 1 card(s), 1 are usable

At this point it hangs.

I am able to program other images with no issues, but the new xclbin seems to cause a hang during programming.

 

If I run xbutil query after attempting to program this new xclbin image, it appears that programming has succeeded. However, the card is then reported as "unusable" by xbutil list, and xbutil reset no longer works (see the error with renderD129 below).

 

mike@pc:~$ xbutil list
INFO: Found total 1 card(s), 0 are usable
Cannot open: /dev/dri/renderD129

 

 

If I don't run xbutil query, this does not happen and I'm able to run xbutil reset and program a working image.

 

System Information

XRT version:
xrt_201910.2.2.2158_18.04-xrt


Shell version:
xilinx-u250-qdma-201910.1-2552052_18.04


xbutil query outputs before and after programming with the 'broken' xclbin are attached
"before.query" shows an old xclbin programmed
"after.query" seems to show that the new xclbin has programmed, however running this command causes the error with renderD129 making the card unusable

I have also attached the output from running xclbinutil --info on the xclbin file, and also the vivado log from the output folder "_x/logs/link/vivado.log"

 

This sounds like the same problem as in post:
https://forums.xilinx.com/t5/Alveo-Accelerator-Cards/Stuck-when-loading-Binary-file-using-ALVEO-u280-es1/m-p/1015132
however there was no resolution posted in that thread.

 

 

Xilinx Employee
Registered: 10-19-2015

Hi @mikemyrtle 

The Vivado log prints a number of critical warnings. Are you decreasing the severity of errors to critical warnings on purpose?

What's in your pre-place.tcl?

Can you send me the dmesg output from the Linux terminal after the programming hangs?

Where did you get the QDMA shell from?

Does the new xclbin work in hardware emulation?

I'm also not sure why running xbutil query changes the functionality of the card. Does running xbutil query multiple times change the report it generates?

Can you check dmesg after trying to use xbutil reset?

Regards,

M

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Observer
Registered: 11-21-2018

Hi @mcertosi,

Thanks for your quick response!

I believe all but the last of the critical warnings are related to the shell / static region? They seem to be the same set that I see when I build a single-kernel xclbin that works.

I am only decreasing the severity of one Critical Warning, which is intentional as it is required to instantiate a PLL in the partial reconfiguration region. The relevant constraint is the only thing in my pre-place.tcl (attached). This is what results in the warning on line 5525 of the log about lowering the severity from an error.

 

I attach the dmesg output after programming has hung, and also after I have subsequently killed the process.

 

The QDMA shell was downloaded from the early access lounge.

 

In the process of trying to debug this issue, I have removed all functionality from the kernels, so I haven't tested them in hw_emu as they no longer do anything. They are RTL kernels.

 

I will follow up with the results from running query multiple times.

Many thanks,

Mike

Observer
Registered: 11-21-2018

I attach the output from running query multiple times. It is not the same each time!

 

I also attach the dmesg output after running each command.


Xilinx Employee
Registered: 10-19-2015

Hi @mikemyrtle 

The early access QDMA shell might not be functioning correctly. Did it come with any example xclbins that work?

In DMESG2 it says that the firewall has tripped, which starts a process called "health_check_cb". The health check sends an xbutil reset -r from the management driver. This functionality had some bugs in the older shells, and I believe you are hitting this problem since the shell you have is a bit older than the most current XDMA shells.

 

[418022.397832] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4
[418022.397838] xclmgmt 0000:01:00.0: health_check_cb: firewall tripped, notify peer
[418022.397845] mailbox.m mailbox.m.256: mailbox_post_notify: posting request: 6 via HW
[418024.641964] mailbox.m mailbox.m.256: timeout_msg: found outstanding msg time'd out
[418024.641973] mailbox.m mailbox.m.256: timeout_msg: peer becomes dead
[418024.641981] mailbox.m mailbox.m.256: dft_post_msg_cb: failed to post msg, err=-62
[418027.518114] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4
[418027.518119] xclmgmt 0000:01:00.0: health_check_cb: firewall tripped, notify peer
[418027.518125] mailbox.m mailbox.m.256: mailbox_post_notify: posting request: 6 via HW
[418027.518184] mailbox.m mailbox.m.256: chann_worker: peer becomes active
[418029.762206] mailbox.m mailbox.m.256: timeout_msg: found outstanding msg time'd out
[418029.762215] mailbox.m mailbox.m.256: timeout_msg: peer becomes dead
[418029.762222] mailbox.m mailbox.m.256: dft_post_msg_cb: failed to post msg, err=-62
[418032.638325] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4
[418032.638330] xclmgmt 0000:01:00.0: health_check_cb: firewall tripped, notify peer
[418032.638336] mailbox.m mailbox.m.256: mailbox_post_notify: posting request: 6 via HW
[418032.638394] mailbox.m mailbox.m.256: chann_worker: peer becomes active
[418034.882448] mailbox.m mailbox.m.256: timeout_msg: found outstanding msg time'd out
[418034.882457] mailbox.m mailbox.m.256: timeout_msg: peer becomes dead
[418034.882465] mailbox.m mailbox.m.256: dft_post_msg_cb: failed to post msg, err=-62

The reset is causing a link down on the PCIe bus, and while it isn't crashing your server, it is killing the device in /dev/dri/renderD129
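
For anyone comparing several dmesg captures, the firewall trips in the log above can be pulled out with a short script. This is only a sketch: the regular expression is modelled on the check_firewall lines quoted in this thread, and other XRT versions may word the message differently. The function name firewall_trips is made up for illustration.

```python
import re

# Pattern modelled on the check_firewall lines quoted in this thread;
# this is an assumption about the message format, not an XRT guarantee.
FIREWALL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+xocl_firewall\s+\S+\s+check_firewall: "
    r"AXI Firewall (?P<index>\d+) tripped, status: (?P<status>0x[0-9a-fA-F]+)"
)


def firewall_trips(dmesg_text):
    """Return (timestamp, firewall index, status) for each trip line."""
    trips = []
    for line in dmesg_text.splitlines():
        m = FIREWALL_RE.search(line)
        if m:
            trips.append((float(m["ts"]), int(m["index"]), int(m["status"], 16)))
    return trips


# Three lines copied from the log quoted above.
sample = """\
[418022.397832] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4
[418022.397838] xclmgmt 0000:01:00.0: health_check_cb: firewall tripped, notify peer
[418027.518114] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4
"""

for ts, index, status in firewall_trips(sample):
    print(f"{ts:.6f}s firewall {index} status {status:#x}")
```

In the quoted log the trips recur roughly every five seconds with the same status, which would be consistent with a periodic health check reporting one persistent fault rather than several separate ones.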

 

 

Cannot open: /dev/dri/renderD129
ERROR: Card [0] is not ready
mike@pc:~$
mike@pc:~$
mike@pc:~$ xbutil reset
Cannot open: /dev/dri/renderD129

 

So this functionality is getting in our way, but we do have a clue from dmesg:

 

[418032.638325] xocl_firewall firewall.m.256: check_firewall: AXI Firewall 1 tripped, status: 0x4

 

xbutil query is designed to tell us what status: 0x4 means, but in our case we can use the AXI firewall user guide (PG293) to determine what that error is. 

0x4 = 4'b0100 = ERRS_RID = Bit 4 Slave can only give read data in response to an outstanding read transaction, and the RID, if any, must match an outstanding ARID.

So if that's correct then your RTL kernel is misbehaving and somehow the endpoint is trying to send DMA data without the host requesting it. I think? 

What does your host code look like?

Do you think the RTL could be doing that? 

We can verify what I'm seeing in dmesg by disabling the health_check function: 

1. sudo modinfo xclmgmt: This command lists the current configuration of the module and indicates if the health_check parameter is on or off. It also returns the path to the xclmgmt module.
2. sudo rmmod xclmgmt: This removes and therefore disables the xclmgmt kernel module.
3. sudo insmod <path to module>/xclmgmt.ko health_check=0: This reinstalls the xclmgmt kernel module with the health check disabled.
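
The three steps above can be sketched as a dry run that only prints the commands, so the sequence can be reviewed before anything is unloaded. The helper name reload_commands and the path /path/to/xclmgmt.ko are placeholders, not real locations; the actual path is whatever modinfo reports in its "filename:" field on your system.

```python
import shlex


def reload_commands(module_path, health_check=0):
    """Build the modinfo/rmmod/insmod sequence from the steps above.

    Nothing is executed here; the caller decides whether to run the commands.
    """
    return [
        ["sudo", "modinfo", "xclmgmt"],
        ["sudo", "rmmod", "xclmgmt"],
        ["sudo", "insmod", module_path, f"health_check={health_check}"],
    ]


# Placeholder path (an assumption): substitute the "filename:" value from modinfo.
for cmd in reload_commands("/path/to/xclmgmt.ko"):
    print(shlex.join(cmd))
```

Note that a parameter passed to insmod this way only lasts until the module is next reloaded or the machine reboots.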

Then we can retest and see if that makes a difference. 

Dmesg1 says some interesting stuff about clocks being mismatched. Were you able to use the RTL wizard successfully to generate the RTL kernel?

[418002.422361] icap.m icap.m.256: clock_freqs_show: Frequency mismatch, Should be 300000 khz, Now is 159997khz

I am hoping this is a reporting bug, and maybe it falls out when we start looking at the firewall trip.  
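
To see how far off the reported clock is, the mismatch line can be parsed and the two values compared. This is a minimal sketch that assumes the message format shown in the line quoted above (including its inconsistent spacing before "khz"); the function name clock_mismatch is made up for illustration.

```python
import re

# Modelled on the single clock_freqs_show line quoted in this thread.
MISMATCH_RE = re.compile(
    r"Frequency mismatch, Should be (?P<want>\d+)\s*khz, Now is (?P<got>\d+)\s*khz",
    re.IGNORECASE,
)


def clock_mismatch(line):
    """Return (expected_khz, measured_khz, measured/expected) or None."""
    m = MISMATCH_RE.search(line)
    if m is None:
        return None
    want, got = int(m["want"]), int(m["got"])
    return want, got, got / want


line = ("[418002.422361] icap.m icap.m.256: clock_freqs_show: "
        "Frequency mismatch, Should be 300000 khz, Now is 159997khz")
want, got, ratio = clock_mismatch(line)
print(f"expected {want} kHz, measured {got} kHz ({ratio:.1%} of target)")
```

A measured value at roughly half the target looks more like a different clock setting than measurement jitter, but that reading is a guess and does not rule out a reporting bug.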

DMESG3 is the same as 2: firewall trip 0x4.

DMESG4 shows both: the clocks are wrong and the firewall is tripped.

Regards,

M

Observer
Registered: 11-21-2018

Thanks @mcertosi,

I have been using the QDMA shell successfully with a design with a single kernel in the xclbin for some time.
This issue has appeared since trying to add a second kernel.

It would seem that if the firewall is tripped in dmesg2, but not in dmesg1, then it is running xbutil query that is tripping the firewall?
The first time xbutil query is run, it reports no problems with the firewall, but we can see it is tripped in dmesg (but only after the query).

Do you think this firewall trip is the problem that occurs during programming and makes it hang?
Are transactions issued to the kernels on the AXI4Lite bus during programming with xbutil program?


Looking at the firewall status, it looks to me like the status: 0x4 = 4'b0100 = RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT (p17, PG293)
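
The bit arithmetic behind this reading can be sketched as follows. The value 0x4 has only bit 2 set, whereas a flag at bit position 4 would have the value 0x10. The map below is deliberately partial and contains only the two names mentioned in this thread, at the positions implied here (an assumption to be checked against the fault status register table in PG293).

```python
# Partial map built only from the two readings discussed in this thread.
# The bit positions are assumptions; PG293 has the authoritative layout.
FAULT_BITS = {
    2: "RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT",
    4: "ERRS_RID",
}


def set_bits(status):
    """Return the positions of all bits set in a status word."""
    return [bit for bit in range(status.bit_length()) if (status >> bit) & 1]


def decode(status):
    """Name each set bit, flagging positions missing from the partial map."""
    return [
        FAULT_BITS.get(bit, f"bit {bit} (not in this partial map)")
        for bit in set_bits(status)
    ]


print(set_bits(0x4), decode(0x4))
print(set_bits(0x10), decode(0x10))
```

Under these assumptions, the status 0x4 seen in dmesg decodes to the read-transfer wait timeout and not to the RID check.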

This would make more sense than ERRS_RID, as my kernel currently has no implementation inside it, other than setting the AXI4Lite Valids = 0 and readys = 1.
I expected that this would be OK, at least for programming the xclbin onto the board, as I wasn't expecting any transactions to be issued on the bus during programming.
Do you know if this is the case?

If transactions are issued, then I can understand the firewall complaining that there is no response from the kernel within the timeout period.

Since seeing that the firewall is tripping with this error, I have added the AXI4 slave example code from the rtl kernel wizard to my kernel ("kernel_control_s_axi"). This still results in the AXI firewall tripping.
Dmesg reports the same error status = 0x4, although xbutil query reports that the "Last Error Status" is "0x0(RECS_CONTINUOUS_RTRANSFERS_MAX_WAIT)".
This was tested with the health check function disabled.

I would have expected this slave from the RTL kernel wizard to behave correctly, so I'm not sure what to try now.


I'm not sure what the clock mismatch is. None of the reported mismatched frequencies relate to the settings for the user clocks in my kernel, so I would assume they are in the static region.
I did not see this message when I retested with the different kernel.

I am using a script-driven build, which is based on using the rtl kernel wizard.

I can post the output from retesting if that would be helpful.

Best regards,

Mike

Xilinx Employee
Registered: 10-19-2015

Hi @mikemyrtle 

I think the problem is somewhere in the kernel still. 

XRT should not be issuing any AXI transfers if the host code is not initiating any AXI transfers. 

I can't say with any confidence that the 'out of the box' RTL generated with the RTL kernel wizard is correct. I think the QDMA shell + RTL flow might have some gaps in the testing until a production QDMA shell is released. 

"Do you think this firewall trip is the problem that occurs during programming that makes it hang?"

I don't think there is a firewall trip AT programming time, but possibly right after. You say you are setting the AXI4lite signals, but what about the AXI or AXI Stream interface for data? 

"I would have expected this slave from the rtl kernel wizard to behave correctly, so I'm not sure what to try now."

Possibly. Here is a resource for creating RTL kernels, although it does not cover streaming:

https://github.com/Xilinx/Vitis-Tutorials/tree/master/docs/getting-started-rtl-kernels

Here is the Vitis documentation that covers RTL streaming kernels https://www.xilinx.com/html_docs/xilinx2019_2/vitis_doc/Chunk2020182740.html

I'm wondering if you are accidentally connecting a streaming interface from the QDMA to a memory mapped interface in your kernel and that could be creating a mismatch that is confusing the tools. 

If you just duplicate your kernel that you already have working, does that work? 

Regards,

M

 

Xilinx Employee
Registered: 10-19-2015

@mikemyrtle 

Could you try upgrading your shell to the shell from this lounge? https://www.xilinx.com/member/qdma-shell.html

The latest XRT and QDMA shell (2019.2) are posted on the lounge. 

It's possible you are running into a compatibility issue with that XRT version and the shell. The older shell is not supported, and I think there have been shell fixes added to the latest QDMA shell above. 

Regards,

M
