Observer
988 Views
Registered: ‎06-09-2016

Alveo U50 powers down instantly from Cold Boot after flashing firmware

I have a customer using the Vivado Design Flow to develop on the Alveo U50 board. The customer has had the U50 board since January 2020. Until now, all of the Vivado projects the customer built worked fine, with one exception: the QSFP28 interface would never work with an optical QSFP transceiver. The customer recently followed the instructions in UG1370 v1.6 Chapter 4 to update the TI MSP432 Satellite Controller firmware and the MCS file stored in flash (with the latest XDMA Gen 3 x16 Shell image). This resulted in the following:

  • Good: Some of the projects where the QSFP28 optical interface would not work, suddenly started working!
  • Bad: Some of the projects now cause the U50 to shut down almost immediately after the FPGA is configured.
    • The customer was first testing these images by JTAG’ing the bitfile onto the FPGA… It appeared as though the FPGA would be fully configured, but would shut down immediately after configuration completed.
    • The customer then followed the steps in UG1371 v1.2 Chapter 2 to load an MCS file onto the U50 flash – unfortunately, this led to the board shutting down immediately after every cold boot (whereas the remainder of the server would continue to boot fine). It almost seemed like the U50 was “bricked” at this point, but the customer luckily found a workaround after a couple of days to unbrick the board and restore the MCS file with the public U50 XDMA Gen 3 x16 shell image (MCS file) for the Alveo Vitis Design Flow.

It is not clear what is causing some Vivado projects to now shut down the U50 board, whereas other projects appear to work fine. It seems as though the Satellite Controller firmware update somehow changed something which is now causing this problem. How can the customer ensure that the projects they are building will not cause the board to immediately shut down?

1 Solution

Accepted Solutions
Xilinx Employee
793 Views
Registered: ‎10-19-2015

Hi @romelh 

Thanks for bringing this to our attention, I replicated the failure and realized that the failure was from the FPGA in the example design not properly driving the HBM cat trip pin to the satellite controller. 

The cat trip pin comes from the HBM IP, out of an FPGA pin, and directly into the satellite controller as an interrupt. When this pin is driven high the satellite controller cuts all power to the board. 

The GT reference design was built before the satellite controller firmware implemented this feature and was then released to the public through a momentary lapse in testing coherency. The development team has been alerted and we have a CR filed to fix the reference design. 

Please make sure all custom designs also drive the cat trip pin correctly as directed in the XDC provided for the U50 and U280, or any future HBM enabled Alveo cards. 

@lowearthorbit I'm certainly glad we have root-caused this failure to something other than a resonance problem in our PDN. Since this is an example design that previously worked and now fails because of a firmware change, there seems to be no evidence that this is a 1-in-a-million failure (in this case alone).

Keep in mind that this failure only occurs after flashing an MCS file to the card. If you instead program a standalone bitstream, a cold reboot of the host computer will clear the error, since the FPGA loads a new bitstream from flash memory after the cold boot.

Let me know if you have any questions! 

-M

11 Replies
966 Views
Registered: ‎09-17-2018

It is possible that a design clocks so much logic that it trips the power-on reset, because the design requires more current from the power supply than is available.

What is the estimated power of the design (from Vivado)?  What is the usage (% of LUT, DFF, BRAM, DSP used)? 

Check you are not asking for more than your supply is able to supply.

At Xilinx, such a design is called a 'hammer.'  For good reason.  Typically, if every DFF is used as a shift register, passing 1, 0, 1, 0, ... (on every clock every DFF toggles 1 to 0 or 0 to 1), you can easily configure, start, and immediately trip the power-on reset threshold (crash the power rail).
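
To make the 'hammer' pattern concrete, here is a minimal Python sketch that models a circular shift register and counts what fraction of flip-flops change state on each clock. The function name, the register length of 64, and the 100-cycle run are arbitrary choices for illustration; this only models toggle activity, not actual current draw.

```python
def toggle_fraction(initial_bits, cycles):
    """Simulate a circular shift register and return the average
    fraction of flip-flops that change state per clock."""
    state = list(initial_bits)
    toggles = 0
    for _ in range(cycles):
        # Shift by one position, wrapping the last bit to the front.
        nxt = [state[-1]] + state[:-1]
        toggles += sum(a != b for a, b in zip(state, nxt))
        state = nxt
    return toggles / (cycles * len(state))


alternating = [i % 2 for i in range(64)]  # 0, 1, 0, 1, ... pattern
idle = [0] * 64                           # constant data, nothing toggles

hammer_rate = toggle_fraction(alternating, 100)
idle_rate = toggle_fraction(idle, 100)
print(hammer_rate, idle_rate)
```

With the alternating pattern every flip-flop toggles on every clock, which is the worst case for dynamic power; with constant data none of them do.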

lowearthorbit

Observer
855 Views
Registered: ‎06-09-2016

I doubt that resource utilization is the problem. Here's what Vivado is reporting for one of the designs which is causing the U50 power to cut off:

  • Total Power: 21.409 W
  • LUT%: 13.18
  • FF%: 9.58
  • BRAM%: 11.5
  • URAM%: 0
  • DSP%: 0.07

I believe that the U50 is rated at 75W for Max Total Power: https://www.xilinx.com/products/boards-and-kits/alveo/u50.html#specifications
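
As a quick sanity check on those numbers, a small Python sketch of the arithmetic (it assumes the 75 W figure above is the relevant limit and that the Vivado estimate is roughly accurate):

```python
total_power_w = 21.409  # Vivado total power estimate quoted above
card_limit_w = 75.0     # U50 max total power, as cited above

utilization = total_power_w / card_limit_w
headroom_w = card_limit_w - total_power_w
print(f"{utilization:.1%} of budget, {headroom_w:.3f} W headroom")
```

That is well under a third of the card's rated power, which is why raw power draw looks like an unlikely cause here.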

Additionally, some of the non-working designs worked fine before the Satellite Controller firmware was updated on the U50. The problem seems to be related to this firmware update, but there does not seem to be any revision history posted anywhere about the Satellite Controller firmware update or what it could impact in the design.

852 Views
Registered: ‎09-17-2018

OK,

That implies you may have excited resonance in the power distribution network (PDN).  Definitely will need Xilinx help for this.

Even with only 20 watts, one can excite a resonance, and cause a power on reset trip.  Just vary the load at the resonant frequency.

Definitely not something Xilinx will talk openly about.  Good luck.  Remember:  as a customer, you are always right (even when they try to convince you that you are wrong).

At various tech conferences, putting FPGA devices in the data center and causing problems to show up in power distribution networks was something whispered about in the hallways.  It is impossible to design a PDN if you have no idea what the current vs. time is doing.  Pretty rare occurrence, but with "a million monkeys and a million typewriters" ...

lowearthorbit

Observer
729 Views
Registered: ‎06-09-2016

@mcertosi - that was the problem! The customer modified his design so that DRAM_y_STAT_CATTRIP, described in Table 5 of the AXI HBM Memory Controller Product Guide (PG276 v1.0), drives the HBM_CATTRIP signal, which the XDC file for the U50 maps to pin J18. After doing this, every one of his designs started working properly again without the U50 quickly powering off after a cold boot. Thanks for your help!
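
For anyone who wants a pre-build sanity check, here is a rough Python sketch that scans XDC text for constraints on the HBM_CATTRIP port and confirms the pin assignment. The pin J18 comes from this thread; the helper names, the sample XDC text, and the IOSTANDARD value in it are assumptions for illustration, so take the real constraints from the official U50 XDC. Note that this only checks the constraint file. The design still has to drive the port from the HBM IP's cat trip status output, or tie it low if HBM is not used.

```python
import re


def cattrip_constraints(xdc_text, port="HBM_CATTRIP"):
    """Return {property: value} for set_property lines that target `port`."""
    pattern = re.compile(
        r'^\s*set_property\s+(\w+)\s+(\S+)\s+\[get_ports\s+[{"]?'
        + re.escape(port)
        + r'["}]?\s*\]'
    )
    found = {}
    for line in xdc_text.splitlines():
        line = line.split("#", 1)[0]  # crude comment stripping
        match = pattern.match(line)
        if match:
            found[match.group(1).upper()] = match.group(2)
    return found


def check_cattrip(xdc_text, expected_pin="J18"):
    """True if the port is constrained to the expected package pin."""
    return cattrip_constraints(xdc_text).get("PACKAGE_PIN") == expected_pin


# Illustrative sample only; IOSTANDARD here is an assumption.
SAMPLE_XDC = """
# HBM catastrophic trip output to the satellite controller
set_property PACKAGE_PIN J18 [get_ports HBM_CATTRIP]
set_property IOSTANDARD LVCMOS18 [get_ports HBM_CATTRIP]
"""

print(check_cattrip(SAMPLE_XDC))
```

A check like this could run before bitstream generation so that a project with no HBM_CATTRIP constraint is flagged before it ever reaches the card.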

One other related comment  - once the board got into the semi-bricked state (with an MCS file not driving HBM_CATTRIP correctly), the customer was able to "unbrick" the board following the solution described in this thread. He had to have the Alveo Programming cable connected to the board with Vivado Hardware Manager ready before cold-booting the board. By doing this, the U50 did not automatically power off quickly after the cold boot and he was able to load a different MCS file (which drives the HBM_CATTRIP pin correctly) such that future cold boots without the cable would not force the board to power off.

717 Views
Registered: ‎09-17-2018

I was wrong (my guess was incorrect).

I should have remembered that problems are almost always related to bits in the bitstream (you told it to behave in this way, so it does exactly what you programmed it to do).

It happens.  Bizarre problem, great troubleshooting.  But still a behavior that is entirely predictable (it was programmed by the bitstream to do exactly that).

lowearthorbit

Observer
713 Views
Registered: ‎06-09-2016

@lowearthorbit - you might not have been correct, but you provided some good insight based on experience! "Excited resonance in the power distribution network" will have to be something to watch out for the next time someone observes similar behavior.

Adventurer
676 Views
Registered: ‎12-04-2019

Hi

I am not able to figure out a way of unbricking it using the thread you have mentioned. Cold rebooting involves turning off the host machine too, right? Are you implying that I use a different machine for the programming? Could you please elaborate on the steps to be followed? It would be really helpful.

P.S. I have the XDC file ready with CAT-trip pin driven low but I am not able to reprogram the board.

Xilinx Employee
673 Views
Registered: ‎10-19-2015

Hi @naarayananrao,

Did you program a bitstream onto the card that didn't drive cat_trip correctly? Another possibility is that you loaded the U50 GT reference design into the SPI flash (programmed an MCS file into card memory), since this design has a flaw in it that causes it to shut down the card.

If you've done the above, fixing the card is tricky. 

If you have just loaded a bitstream you should be able to cold boot the server and then it should recover to a working image. 

What card are you using? 

The fix posted in this thread seems unstable and while I'm happy that user was able to do that and move forwards, I'm unsure of how robust that solution is.

Are you able to connect to the Vivado hardware manager and see the FPGA in the card? You'd have to use a programming cable to do this. 

Regards,

M

Adventurer
666 Views
Registered: ‎12-04-2019

Hello @mcertosi 

Thanks for the reply. I am using the Alveo U50 and have used the SPI flash to program the MCS file of my custom design which "didn't" drive the cat_trip. I am still not able to figure out how to unbrick it.

I am able to see the device in the hardware manager for a brief amount of time and then it disconnects as soon as the power rails go down.

Observer
648 Views
Registered: ‎06-09-2016

@naarayananrao - to answer your questions:

> Cold rebooting involves turning off the host machine too right?

Yes

> Are you implying that I use a different machine for the programming? Could you please elaborate on the steps to be followed. It would be really helpful.

Yes. Here was the feedback from my customer on how he unbricked his board:

I seem to be able to recover the U50 by attaching the JTAG programmer from a secondary computer and having Vivado Hardware Manager open and ready to talk to the FPGA when it first powers up.  In that state, the U50 does not power off like it does when the JTAG programmer is not actively connected to it.  Even if the programmer is physically connected, you cannot recover from the computer it is plugged into since the Hardware Manager in the Vivado GUI has to be up and connected to the JTAG programmer in order to catch the U50 before it powers itself off.

I hope this makes sense and helps. My customer had assumed that the board was permanently bricked and was quite surprised that this worked. When the U50 did not shut itself down again, he was just able to load a new MCS file.
