pasmith50
Visitor
2,193 Views
Registered: ‎03-03-2021

xbmgmt flash U250

I am trying to update some U250 cards with the latest XRT (2020.2) on CentOS 8.2. The update times out trying to stop a user application, even though nothing is running and I have just power cycled the machine.

The scan works fine:

/opt/xilinx/xrt/bin/xbmgmt flash --scan
Card [0000:21:00.0]
Card type: u250
Flash type: SPI
Flashable partition running on FPGA:
xilinx_u250_xdma_201830_2,[ID=0x5d14fbe6],[SC=4.2.0]
Flashable partitions installed in system:
xilinx_u250_gen3x16_base_3,[ID=0x48810c9d17860ef5],[SC=4.6.6]

but the flash update does not:

/opt/xilinx/xrt/bin/xbmgmt flash --update --shell xilinx_u250_gen3x16_base_3 --card 0000:21:00.0
Status: SC needs updating
Current SC: 4.2.0
SC to be flashed: 4.6.6
Status: shell needs updating
Current shell: xilinx_u250_xdma_201830_2
Shell to be flashed: xilinx_u250_gen3x16_base_3
Are you sure you wish to proceed? [y/n]: y

Updating SC firmware on card[0000:21:00.0]
Stopping user function...
ERROR: Shutdown user function timeout.
Only proceed with SC update if all user applications for the target card(s) are stopped.
WARNING: Failed to update SC firmware on card [0000:21:00.0]
Updating shell on card[0000:21:00.0]
terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to open flash device on card
/opt/xilinx/xrt/bin/unwrapped/loader: line 57: 4824 Aborted (core dumped) "${XRT_PROG_UNWRAPPED}" "${XRT_LOADER_ARGS[@]}"

JohnFedakIV
Moderator
2,018 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Welcome to the Xilinx Forums! Given this is your first post, I want to highlight the Getting Started in Alveo Community topic, which contains a list of useful resources that may be helpful during your Alveo development.

When changing between shells/platforms, we recommend reverting the card to the golden/factory image first using the $ xbmgmt flash --factory_reset --card <bdf> command. This should clear the card and will require a cold boot to load the golden image. After this, please run $ xbmgmt flash --scan again to confirm that the golden image is on the card before flashing the new shell.

Please let me know how the process goes and if it gets past the issue that you are seeing.

I do also want to note that the new U250 shell/platform is a DFX-2RP, which simply means there are 3 partitions (base, shell, user). Before validating the card with $ xbutil validate (or running an application), you will need to load the shell partition using the $ xbmgmt partition command. More information is available in AR 75975.

Regards,
~John

----------------------------------------------------------------------------------
* Please don't forget to reply, kudo and accept as a solution! *
pasmith50
Visitor
1,987 Views
Registered: ‎03-03-2021

Hi John,

Thanks for your help. The first flash now works, but I see the same error when I do the second flash to update the SC:

/opt/xilinx/xrt/bin/xbmgmt flash --scan

Card [0000:21:00.0]
Card type: u250
Flash type: SPI
Flashable partition running on FPGA:
xilinx_u250_gen3x16_base_3,[ID=0x48810c9d17860ef5],[SC=4.2]
Flashable partitions installed in system:
xilinx_u250_gen3x16_base_3,[ID=0x48810c9d17860ef5],[SC=4.6.6]

 /opt/xilinx/xrt/bin/xbmgmt flash --update --shell xilinx_u250_gen3x16_base_3 --card 0000:21:00.0
Status: SC needs updating
Current SC: 4.2
SC to be flashed: 4.6.6
Are you sure you wish to proceed? [y/n]: y

Updating SC firmware on card[0000:21:00.0]
Stopping user function...
ERROR: Shutdown user function timeout.
Only proceed with SC update if all user applications for the target card(s) are stopped.
WARNING: Failed to update SC firmware on card [0000:21:00.0]

No cards were flashed.
WARNING:1 Card(s) not flashed

 

JohnFedakIV
Moderator
1,965 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Edit: Before we go ahead with my suggestion below, I want to take a look at a couple of pieces of information if possible:

  • Is there any information in dmesg about this error?
  • Can you provide the output (txt file preferred) of $ xbutil query?
  • Can you load the partition ($ xbmgmt partition --name xilinx_u250_gen3x16_xdma_shell_3_1 --card 0000:21:00.0) and then provide the result (txt file preferred) from $ xbutil validate?

It looks like the SC is in a bad state. The next step is to remove power from the server: the SC is powered by the PCIe 3.3V AUX rail, which remains powered during most cold boots and shutdowns. If you can, shut down the server and unplug it for ~10 minutes (this ensures that all of the capacitors on the 3.3V AUX line are completely discharged). This should reset the SC. Once it has reset, please attempt the SC update again as before.

Regards,
~John

pasmith50
Visitor
1,938 Views
Registered: ‎03-03-2021

Hi John,

There was a lot in dmesg. I attach it plus the xbutil query output. 

The partition load works fine

/opt/xilinx/xrt/bin/xbmgmt partition --program --name xilinx_u250_gen3x16_xdma_shell_3_1 --card 0000:21:00.0
Programming PLP on Card [0000:21:00.0]...
Partition file: /opt/xilinx/firmware/u250/gen3x16/xdma-shell/partition.xsabin
Program successfully

Thanks,

Paul

 

 

JohnFedakIV
Moderator
1,932 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for providing this information. I have a few more questions/requests:

  • After loading the partition, can you run $ xbutil validate -d 0000:21:00.0 and send the output?
  • I noticed that there are 4 cards in the system, is the xbutil query the response for the 0000:21:00.0 card ($ xbutil query -d 0000:21:00.0)?
  • How many of the 4 cards are showing this behavior?
  • dmesg does have a lot of information, and I do see some earlier firewall trips on 21:00.0. To eliminate older information, can you provide dmesg right before running the SC update (xbmgmt flash --update --shell xilinx_u250_gen3x16_base_3 --card 0000:21:00.0) and right after? That will help me focus on the information from the attempt to update the SC.

Thank you for collecting this information - it helps to get a whole view of what is happening with the card.
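
To isolate the messages from a single attempt, one option is to capture dmesg immediately before and after the command and keep only the lines that are new. A minimal Python sketch, assuming both captures come from the same boot and carry timestamps; `new_dmesg_lines` is a hypothetical helper (not part of XRT), and the sample lines are taken from the dmesg excerpt quoted later in this thread:

```python
from typing import List


def new_dmesg_lines(before: str, after: str, bdf: str = "") -> List[str]:
    """Return lines that appear in `after` but not in `before`.

    If `bdf` is given, keep only new lines that mention that PCIe address.
    """
    seen = set(before.splitlines())
    fresh = [line for line in after.splitlines() if line not in seen]
    if bdf:
        fresh = [line for line in fresh if bdf in line]
    return fresh


# Sample captures (assumed to be the text of `dmesg` before and after the update).
before = "[ 6211.942478] xclmgmt 0000:21:00.0: xclmgmt_reset_pci: Reset PCI\n"
after = before + (
    "[ 6211.942918] pcieport 0000:20:03.1: Slot(2-1): Link Down\n"
    "[ 6213.078412] xclmgmt 0000:21:00.0: can't change power state "
    "from D3hot to D0 (config space inaccessible)\n"
)

for line in new_dmesg_lines(before, after, bdf="0000:21:00.0"):
    print(line)
```

Filtering on the card's own BDF drops messages from upstream ports such as 0000:20:03.1, so it is worth running once without the filter as well.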

Regards,
~John

pasmith50
Visitor
1,905 Views
Registered: ‎03-03-2021

Hi John,

Validate output attached.

I think the query was for that card but I attach it again to be sure. 

I attach just the dmesg section that was added during the shell update attempt. 

 

 

 

pasmith50
Visitor
1,905 Views
Registered: ‎03-03-2021

Sorry, I meant to say I am seeing it on all cards.

JohnFedakIV
Moderator
1,753 Views
Registered: ‎09-04-2020

Hi @pasmith50,

Thank you for this information. Looking again at xbutil query, these commands are being run through a VM - can you run these commands directly on the machine?

When the SC is updated, the first thing that XRT does is a hot reset and I wonder if the VM is having an effect on this.

Regards,
~John

pasmith50
Visitor
1,731 Views
Registered: ‎03-03-2021

Hi John,

This is bare metal. No VM being used.  

 

 

 

JohnFedakIV
Moderator
1,701 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for the clarification - I misread "Compute Engine", listed as the model, as indicating a VM.

I attempted to recreate this on a U250 on my side (with a CentOS 8.2 machine using the same kernel and XRT versions as listed in xbutil query) and wasn't able to reproduce the issue.

A few questions/test ideas:

  • Can you provide the output of $ sudo lspci -vvd 10ee: ?
  • Does running $ xbutil reset -d 0000:21:00.0 ahead of the SC update change anything?
  • Do you have another machine/server to try this update on?

Regards,
~John

pasmith50
Visitor
1,641 Views
Registered: ‎03-03-2021

The lspci output is attached.

No, the reset doesn't change anything:

# xbutil reset -d 0000:21:00.0
All existing processes will be killed.
Are you sure you wish to proceed? [y/n]: y
# /opt/xilinx/xrt/bin/xbmgmt flash --update --shell xilinx_u250_gen3x16_base_3 --card 0000:21:00.0
Status: SC needs updating
Current SC: 4.2
SC to be flashed: 4.6.6
Are you sure you wish to proceed? [y/n]: y

Updating SC firmware on card[0000:21:00.0]
Stopping user function...
ERROR: Shutdown user function timeout.
Only proceed with SC update if all user applications for the target card(s) are stopped.
WARNING: Failed to update SC firmware on card [0000:21:00.0]

No cards were flashed.
WARNING:1 Card(s) not flashed.

Yes I should be able to get another machine to test on. I will see if it is the same. 

 

pasmith50
Visitor
1,578 Views
Registered: ‎03-03-2021

Tested on a second system and saw the same behavior. 

#/opt/xilinx/xrt/bin/xbmgmt flash --update --shell xilinx_u250_gen3x16_base_3 --card 0000:21:00.0
Status: SC needs updating
Current SC: 4.0
SC to be flashed: 4.6.6
Are you sure you wish to proceed? [y/n]: y

Updating SC firmware on card[0000:21:00.0]
Stopping user function...
ERROR: Shutdown user function timeout.
Only proceed with SC update if all user applications for the target card(s) are stopped.
WARNING: Failed to update SC firmware on card [0000:21:00.0]

No cards were flashed.
WARNING:1 Card(s) not flashed.

JohnFedakIV
Moderator
1,526 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for the feedback and testing on another machine.

In looking back at the dmesg output, it looks like the xclmgmt driver can't access the config space to turn the card back on after the reset:

 

[ 6213.078412] xclmgmt 0000:21:00.0: can't change power state from D3hot to D0 (config space inaccessible)

 

 Before that it indicates that the PCIe link does go down and the device is removed from the IOMMU group:

 

[ 6211.942478] xclmgmt 0000:21:00.0: xclmgmt_reset_pci: Reset PCI
[ 6211.942918] pcieport 0000:20:03.1: Slot(2-1): Link Down
...
[ 6211.944356] pci 0000:21:00.1: Removing from iommu group 24
[ 6211.944463] xclmgmt 0000:21:00.0: xclmgmt_remove: remove(0x00000000c4cf58ec) where pdev->dev.driver_data = 0x000000001cf5be25

 

With this, a clean PCIe reset isn't happening on the system. I checked internally and we aren't sure exactly what would cause the reset not to work. I did notice that IOMMU is enabled, does the issue happen with it disabled?

Are there any other special BIOS configurations to the PCIe bus or PCIe switches between the motherboard and the cards?

I'm looking into a way around this by directly loading the SC through the CMC. This will bypass XRT and the need for the PCIe reset. There is a tool for loading the SC FW in this fashion as part of AR 73654 and the SC FW (txt file) is installed locally in the /opt/xilinx/firmware/sc-fw/u200-u250/ folder.

Regards,
~John

JohnFedakIV
Moderator
1,282 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

I have confirmed that using the code provided in AR 73654 can update the SC FW in the pre-built shell/platforms with a couple modifications. This will bypass the need for the PCIe hot reset. Please try the following:

  • Download the SC FW tool from AR 73654
  • Hardcode the BAR address and the device ID in the core_tool_bar.c file for the respective card (for the BAR address below, I'm using the 32M region output from the lspci provided for 21:00.0):
    • Line 68: long Result = 0x30070000000;
    • Add line below: *dev_id = 0x5004;
  • Modify the core_tool_memory_map.c file for the U250 Base 3 platform:
    • Line 66 (in function MemoryMapGet_CMC_OffsetFromBAR): return 0x01E00000;
    • Line 76 (in function MemoryMapGetRAMControllerOffsetFromBAR): return MemoryMapGet_CMC_OffsetFromBAR()+0x8000;
      • *This has been corrected from the original post
    • Line 81 (in function MemoryMapGetHostMicroblazeAXIGPIOOffsetFromBAR): return MemoryMapGet_CMC_OffsetFromBAR()+0x1000;
  • Compile with the command: $ bash compile.bash
  • Run $ ./loadsc /opt/xilinx/firmware/sc-fw/u200-u250/sc-fw-u200*.txt 21
    • Ensure that loadsc has permission to be executed ($ ls -l)

I want to note that this will only work with the xilinx_u250_gen3x16_base_3 platform and the BAR address will need to be updated to match the 32M region in lspci for the card's management function (xx:00.0).
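
As a sanity check on the offsets above, the arithmetic can be sketched in Python. The constant names are illustrative (they are not identifiers from the AR 73654 tool), and the BAR base is the value for this particular card and will differ on other systems:

```python
# Values taken from the steps above; names are hypothetical.
BAR_BASE = 0x30070000000         # 32M region of 0000:21:00.0 from lspci (system-specific)
BAR_SIZE = 32 * 1024 * 1024      # size of that region in bytes

CMC_OFFSET = 0x01E00000                  # CMC offset from BAR
RAM_CTRL_OFFSET = CMC_OFFSET + 0x8000    # RAM controller offset from BAR
GPIO_OFFSET = CMC_OFFSET + 0x1000        # host Microblaze AXI GPIO offset from BAR


def physical_addresses(bar_base: int) -> dict:
    """Map each block name to the physical address the tool would access."""
    offsets = {
        "cmc": CMC_OFFSET,
        "ram_controller": RAM_CTRL_OFFSET,
        "gpio": GPIO_OFFSET,
    }
    return {name: bar_base + off for name, off in offsets.items()}


for name, addr in physical_addresses(BAR_BASE).items():
    # Every offset must fall inside the 32M BAR, otherwise the mapping is wrong.
    assert addr - BAR_BASE < BAR_SIZE
    print(f"{name}: {addr:#x}")
```

If any offset fell outside the 32M region, or the BAR base had the wrong number of digits, the tool would touch memory that does not belong to the card.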

I am interested if the PCIe hot reset issue is resolved with IOMMU disabled (or if another setting may be causing the PCIe reset to not work), but the above should provide a workaround for now.

Please let me know if you have any questions and how the process goes.

Regards,
~John

pasmith50
Visitor
1,241 Views
Registered: ‎03-03-2021

I did try booting the OS with IOMMU disabled and it had no impact. I was going to test it disabled in the BIOS but can't do that test remotely.

 

I followed your instructions and the tool core dumps:

[root@localhost cms_sc_download_tool]# ./loadsc /opt/xilinx/firmware/sc-fw/u200-u250/sc-fw-u200-u250-4.6.6-5544ce756e5c8407f1a66895edcf34df.txt 21

 

Core Tool SC DownloadFound Device: 0000:21:00.0 (BAR is 0x3007000000)
dev_id: 5004
Segmentation fault (core dumped)

JohnFedakIV
Moderator
1,229 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for the feedback on IOMMU disabled at the OS level.

Based on the output, I believe the BAR address is missing a zero: the tool reports 0x3007000000 (10 hex digits), whereas lspci shows 0x30070000000 (11 hex digits) for the 32M region.

Regards,
~John

pasmith50
Visitor
1,222 Views
Registered: ‎03-03-2021

(gdb) run /opt/xilinx/firmware/sc-fw/u200-u250/sc-fw-u200-u250-4.6.6-5544ce756e5c8407f1a66895edcf34df.txt 21
Starting program: /home/ilmnadmin/cms_sc_download_tool.mod/loadsc /opt/xilinx/firmware/sc-fw/u200-u250/sc-fw-u200-u250-4.6.6-5544ce756e5c8407f1a66895edcf34df.txt 21
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-101.el8.x86_64

 

Core Tool SC DownloadFound Device: 0000:21:00.0 (BAR is 0x3007000000)
dev_id: 5004

Program received signal SIGSEGV, Segmentation fault.
0x0000000000400e30 in Microblaze_BringOutOfReset (bus=33 '!') at core_tool_microblaze_reset.c:85
85 RegisterValue=pDeviceRAM[0];
(gdb) backtrace
#0 0x0000000000400e30 in Microblaze_BringOutOfReset (bus=33 '!') at core_tool_microblaze_reset.c:85
#1 0x0000000000401137 in main (argc=3, argv=0x7fffffffe438) at core_tool_sc_download.c:113

JohnFedakIV
Moderator
1,158 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

It looks like the debugger output is also using the incorrect BAR address. From the earlier lspci output, the BAR address should be 0x300 7000 0000 rather than 0x30 0700 0000.
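
A quick way to spot a dropped digit is to print the address in 4-digit groups, as above. A small Python sketch; `group_hex` is a hypothetical helper, not part of the tool:

```python
def group_hex(value: int, group: int = 4) -> str:
    """Format an integer as hex with digits grouped from the right."""
    digits = format(value, "x")
    # The leftmost group holds whatever digits are left over.
    first = len(digits) % group or group
    parts = [digits[:first]]
    parts += [digits[i:i + group] for i in range(first, len(digits), group)]
    return "0x" + " ".join(parts)


print(group_hex(0x30070000000))  # address shown by lspci
print(group_hex(0x3007000000))   # address reported by the tool
```

Grouped this way, the two values differ visibly in their leading group, which is easy to miss in a long run of zeros.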

Regards,
~John

pasmith50
Visitor
1,126 Views
Registered: ‎03-03-2021

Hi John,

Thanks for spotting that. I fixed it and get a little further, but I still get a SIGSEGV:

Core Tool SC DownloadFound Device: 0000:21:00.0 (BAR is 0x30070000000)
dev_id: 5004
Successfully brought Microblaze out of reset.
Opened /dev/mem

Program received signal SIGSEGV, Segmentation fault.
0x00000000004011d5 in main (argc=3, argv=0x7fffffffe458) at core_tool_sc_download.c:127
127 RAM_ControllerInterpret_MagicRegister(pDeviceRAM[HOST_REGISTER_MAGIC]);
(gdb) backtrace
#0 0x00000000004011d5 in main (argc=3, argv=0x7fffffffe458) at core_tool_sc_download.c:127

 

 

JohnFedakIV
Moderator
1,049 Views
Registered: ‎09-04-2020

Hi @pasmith50,

Thank you for running the debugger on this. I'm looking into what might cause this error.

I do want to take a step back to the original issue. Apart from the IOMMU, are there any special BIOS configurations for the PCIe bus, or PCIe switches between the motherboard and the cards?

Can you also send the System Information and BIOS Information sections from $ sudo dmidecode | less? Please remove the serial number. As an example, here is the information of interest from our R740:

Handle 0x0100, DMI type 1, 27 bytes
System Information
        Manufacturer: Dell Inc.
        Product Name: PowerEdge R740
        Version: Not Specified
        (Serial Number Removed)
        UUID: 4c4c4544-0035-3010-804a-cac04f435332
        Wake-up Type: Power Switch
        SKU Number: SKU=NotProvided;ModelName=PowerEdge R740
        Family: PowerEdge

Example BIOS information:

BIOS Information
        Vendor: Dell Inc.
        Version: 2.8.1
        Release Date: 06/26/2020
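
For anyone sharing dmidecode output publicly, the serial number can be stripped before posting. A minimal Python sketch, assuming the text has been saved from `sudo dmidecode`; `redact_serials` is a hypothetical helper and the serial value in the sample is a made-up placeholder:

```python
import re


def redact_serials(dmidecode_text: str) -> str:
    """Replace the value of every 'Serial Number:' line with a placeholder."""
    pattern = re.compile(r"^([ \t]*Serial Number:).*$", re.MULTILINE)
    return pattern.sub(r"\1 (removed)", dmidecode_text)


# Sample input; "ABC123" is a placeholder, not a real serial number.
sample = (
    "System Information\n"
    "\tManufacturer: Dell Inc.\n"
    "\tProduct Name: PowerEdge R740\n"
    "\tSerial Number: ABC123\n"
    "\tFamily: PowerEdge\n"
)

print(redact_serials(sample))
```

This only touches lines labelled "Serial Number:"; other identifying fields, such as the UUID, are left as they are.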

 

I should note that the cards are still functional without the latest SC FW version. The SC Release Notes for the U250 in AR 75174 show the differences between the versions.

Regards,
~John

pasmith50
Visitor
935 Views
Registered: ‎03-03-2021

Hi John,

There are no switches but we do have a redriver. 

I am not sure dmidecode output is going to help you much but here you are. 

Handle 0x0001, DMI type 1, 27 bytes
System Information
Manufacturer: iEi
Product Name: Compute Engine
Version: Default string
Serial Number: Default string
UUID: 03000200-0400-0500-0006-000700080009
Wake-up Type: Power Switch
SKU Number: Default string
Family: Default string

BIOS Information
Vendor: American Megatrends Inc.
Version: B596AR02.ROM
Release Date: 06/30/2020

JohnFedakIV
Moderator
803 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for the feedback on the System/BIOS information.

There are a couple paths I'm interested in going down:
1.) I'd like to take a look at the tree from the CPU to the card, which should be shown by $ lspci -td. I think the redriver is there for signal integrity and should not affect a PCIe reset.

2.) For debugging the SC update through the CMC, I want to make sure that the communication between the host and the CMC is working well. The install files of the new U250 platform include some tools for this in /opt/xilinx/firmware/cmc/u200-u250/tools/pyxbcmc/. After enabling the Python scripts for execution, I'm interested in the output of $ sudo python3 ./cmc_regdump.py -c 21 (the 21 is the bus number from the BDF). The first line should have the Register Map ID, FW Version ID, Status, and Error registers.

Regards,
~John

pasmith50
Visitor
750 Views
Registered: ‎03-03-2021

Hi John,

I attach the tree output (I used just lspci -t, since the d option gave no output).

Here is the debug SC output:

sudo python3 /opt/xilinx/firmware/cmc/u200-u250/tools/pyxbcmc/cmc_regdump.py -c 21
pyxbcmc: U250 XDMA on PCI bus 0000:21:00.0 found
0x01E08000: 74736574 0c01020b 10000801 00000000
0x01E08010: 00000000 55325858 00000000 00000000
0x01E08020: 000030b4 00003065 0000305c 00000d31
0x01E08030: 00000d1a 00000d16 00000d30 00000d1f
0x01E08040: 00000d1f 000030c0 00003079 0000306e
0x01E08050: 000009c4 000009c4 000009c4 000015c4
0x01E08060: 00001587 00001588 000004c2 000004b5
0x01E08070: 000004ba 00000740 00000732 00000732

 

 

 

JohnFedakIV
Moderator
710 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

Thank you for the correction; I had mistyped.

Looking at the tree, I see that both 0000:20:00.0 and 0000:20:03.1 are ahead of the card. Can you provide the output of $ lspci -vs 0000:20:00.0 and $ lspci -vs 0000:20:03.1? Let's also check $ lspci -vs 0000:20:03.0 to get a full picture of that device.

This should provide more information on the full PCIe path to the card. In our server, the cards come off of a single branch like below (0000:17:00.0 is an Intel PCI bridge in this case):

 

+-[0000:17]-+-00.0-[18]--+-00.0
|           |            \-00.1

 

 

The communication between the host and the CMC looks to be working as expected. The first number in the dump is the Register Map ID (also known as the magic number), and reading it is where the segfault happens. I'm wondering if the BAR address seen in $ lspci -vd 10ee: isn't the full address and if one of the upstream devices is having an effect - the lspci outputs should help determine this.
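
For reference, the first line of the register dump can be split into the four registers named earlier in this thread. A minimal Python sketch; the field order is an assumption based on that description (Register Map ID, FW Version ID, Status, Error), and `parse_dump_line` is a hypothetical helper, not part of pyxbcmc:

```python
def parse_dump_line(line: str):
    """Split a 'ADDRESS: word word word word' dump line into (address, words)."""
    addr_text, words_text = line.split(":", 1)
    address = int(addr_text, 16)
    words = [int(word, 16) for word in words_text.split()]
    return address, words


# Assumed field order for the first four 32-bit registers.
FIELD_NAMES = ("register_map_id", "fw_version_id", "status", "error")

address, words = parse_dump_line("0x01E08000: 74736574 0c01020b 10000801 00000000")
fields = dict(zip(FIELD_NAMES, words))

print(f"base offset: {address:#010x}")
for name, value in fields.items():
    print(f"{name}: {value:#010x}")
```

The dump starts at offset 0x01E08000, which is the CMC offset 0x01E00000 plus 0x8000, so the script and the modified tool are looking at the same location.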

Regards,
~John

pasmith50
Visitor
695 Views
Registered: ‎03-03-2021

Hi John,

These devices seem to be bridges.

# lspci -vs 0000:20:00.0
20:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Root Complex
Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1450
Flags: fast devsel, NUMA node 0

# lspci -vs 0000:20:03.1
20:03.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0, IRQ 46, NUMA node 0
Bus: primary=20, secondary=21, subordinate=21, sec-latency=0
I/O behind bridge: [disabled]
Memory behind bridge: [disabled]
Prefetchable memory behind bridge: 0000030060000000-00000300740fffff [size=321M]
Capabilities: [50] Power Management version 3
Capabilities: [58] Express Root Port (Slot+), MSI 00
Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [c0] Subsystem: Advanced Micro Devices, Inc. [AMD] Device 1453
Capabilities: [c8] HyperTransport: MSI Mapping Enable+ Fixed+
Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
Capabilities: [150] Advanced Error Reporting
Capabilities: [270] Secondary PCI Express
Capabilities: [2a0] Access Control Services
Capabilities: [370] L1 PM Substates
Capabilities: [380] Downstream Port Containment
Capabilities: [400] Data Link Feature <?>
Capabilities: [410] Physical Layer 16.0 GT/s <?>
Capabilities: [440] Lane Margining at the Receiver <?>
Capabilities: [488] Designated Vendor-Specific <?>
Kernel driver in use: pcieport

# lspci -vs 0000:20:03.0
20:03.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse PCIe Dummy Host Bridge
Flags: fast devsel, NUMA node 0

 

 

JohnFedakIV
Moderator
657 Views
Registered: ‎09-04-2020

Hi @pasmith50,

I shared the register dump with the CMC team, and it indicates that communication between the CMC and the SC isn't negotiated properly. Usually we see an error like "SC is not ready" (example in the Alveo Debug Guide) when the SC has gone into a bad state, but here it is relatively hidden.

With this, I'm interested in the output of $ xbutil query -d 0000:21:00.0 one more time. The sensor data in it comes from the SC, so with the communication not working as expected, I want to check whether the sensor values match the earlier output.

Edit: Can you also provide the full output of $ xbmgmt flash --scan? We are expecting to see an error message.

I mentioned similar steps in a much earlier post. Let's give them a try now that it is confirmed the SC is not in a good state:
The next step here is to remove power from the server: the SC is powered by the PCIe 3.3V AUX rail, which remains powered during most cold boots and shutdowns. Shut down the server and unplug it for >10 minutes (this ensures that all of the capacitors on the 3.3V AUX line are completely discharged). Sometimes it takes a couple of attempts. This should reset the SC. Once it has reset, please attempt the SC update again as before.

Thank you for sharing the lspci output. It does show that the BAR for the card is in the prefetchable memory region of the bridge.
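
That check can be sketched in a few lines of Python, using the window reported for 0000:20:03.1 and the 32M BAR of 0000:21:00.0. The names are illustrative, and the values are specific to this system:

```python
# Prefetchable window behind bridge 0000:20:03.1 (end address is inclusive).
WINDOW_START = 0x30060000000
WINDOW_END = 0x300740FFFFF

# 32M BAR of the card's management function, from lspci.
BAR_BASE = 0x30070000000
BAR_SIZE = 32 * 1024 * 1024


def bar_in_window(bar_base: int, bar_size: int, start: int, end: int) -> bool:
    """True if the whole BAR lies inside the inclusive [start, end] window."""
    return start <= bar_base and bar_base + bar_size - 1 <= end


window_mib = (WINDOW_END - WINDOW_START + 1) // (1024 * 1024)
print(f"window size: {window_mib}M")
print("BAR inside window:", bar_in_window(BAR_BASE, BAR_SIZE, WINDOW_START, WINDOW_END))
```

The computed window size agrees with the [size=321M] that lspci reports, while the earlier truncated address 0x3007000000 falls outside the window.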

Regards,
~John

pasmith50
Visitor
573 Views
Registered: ‎03-03-2021

Hi John,

I attach the query and scan output. I tried leaving the system powered down for 30 minutes and still got the same behavior when I powered up. 

 

JohnFedakIV
Moderator
535 Views
Registered: ‎09-04-2020

Hi @pasmith50 ,

It's good to see that one of the cards was able to get past the issue and now shows the new SC version running on the card (4.6.6). Did anything different happen with card 0000:a1:00.0?

It's interesting to see that the xbutil query is updating the sensor values even with the CMC<>SC communication not working properly. I do want to double check, in addition to the system being powered down, were the power cords pulled from the server? (The PCIe 3.3V AUX power is usually powered during a shutdown)

In addition to pulling power from the server, another idea is to uninstall and remove XRT and then reinstall XRT and the platform packages.

Regards,
~John

pasmith50
Visitor
474 Views
Registered: ‎03-03-2021

Hi John,

Card a1 I updated on an Intel server. 

Yes I pulled the power cable out. 

I will try uninstalling and re-installing. 

Thanks.

 

 

 
