05-21-2020 06:40 PM - edited 05-22-2020 01:15 PM
I just received an Alveo U200 card. I plugged it into a PCIe slot and connected the PCIe 8-pin power cable. Currently the blue, orange, and green LEDs are on, but I can't see any red LED. After a few minutes I felt the board getting very hot, so I turned it off. Is this normal? Why is it getting so hot? Shouldn't I see a red LED?
05-25-2020 06:28 AM
Hi @nzh :
Alveo cards are available in both active and passive cooling configurations.
The difference is the fan included on actively cooled cards. The active cooling configuration includes a heat sink and a fan enclosure cover to provide appropriate cooling. The passive cooling card is designed to be installed into a data center server, where controlled airflow provides direct cooling. Which one is yours (active or passive)?
The red LED is the power-good LED. What do you mean when you say you can't see any red LED? Is the red LED on or off?
05-25-2020 08:57 AM
@panantra My Alveo card is passively cooled. But it gets hot without even being used. I am not running anything on it; I just turn on the computer, and by the time the server boots (about 3 minutes) the card feels pretty warm. I waited a couple more minutes without using it and it got even hotter.
I assume I can't see the red LED because it's off. I can see that the blue, green, orange, and yellow LEDs are on. They all turn on together as soon as the computer is turned on. Does it mean that the power is not good? The card is connected to the PCIe 8-pin power cable.
I feel that there is something wrong with the card and its power management, since the "power good" LED is not on and the card gets hot even when it's not being used.
05-25-2020 09:09 AM
Hi @nzh :
As I mentioned earlier, Passive cards do not include a built-in fan and therefore require an external mechanism to ensure proper airflow for cooling.
Note that Passive cards should not be powered without a forced airflow mechanism in place.
Please refer to DS962 https://www.xilinx.com/support/documentation/data_sheets/ds962-u200-u250.pdf (operating conditions) to get the details.
If you find any post has resolved your query, mark it as an accepted solution.
05-25-2020 09:42 AM
@panantra Thank you for your reply, but I don't think it has anything to do with the cooling system. I am not even using the board yet and it gets hot, and the "power good" LED did not turn on. It looks like there is something wrong with the power management.
05-25-2020 10:03 PM
05-26-2020 10:06 AM
Is this the first time you are powering on the FPGA?
The blue LED indicates that the FPGA is programmed.
The red LED indicates power good.
Is there an inconsistency between the documentation and what you are seeing?
The cards aren't meant to be debugged with the LEDs alone.
Can you use your host machine to debug? The lspci tool should tell you if the PCIe link is up. Once XRT is installed and the shell is loaded on the card, you can use xbutil to debug.
These are the two basic commands:
$ lspci -vd 10ee:
$ sudo xbutil flash scan
You can also debug with the JTAG/USB cable.
Can you see the FPGA with the maintenance connector plugged in?
I see you are also using "server" and "computer" interchangeably. What is the model of the server or desktop workstation you are using?
This passive card needs forced air regardless of whether it is being used. If it is on, it needs to be cooled.
05-26-2020 12:07 PM - edited 05-26-2020 12:07 PM
@mcertosi So the problem is that, according to the document, I should see a red LED illuminated at the back of the card to confirm that the power voltage is sufficient, correct? But the red LED is off. So I think there is something wrong and I don't know how to figure it out. I suspect that there is something wrong with the board. I tried this on a couple of Dell PowerEdge servers and workstations, and the red LED didn't turn on in any of them.
When I run the commands, I get this:
root@argus:/home/argus# xbutil flash scan
WARNING: The xbutil sub-command flash has been deprecated. Please use the xbmgmt utility with flash sub-command for equivalent functionality.
Card [0000:23:00.0]
    Card type: u200
    Flash type: SPI
    Flashable partition running on FPGA:
        xilinx_u200_xdma_201830_2,[ID=0x5d1211e8],[SC=4.2.0]
    Flashable partitions installed in system:
        xilinx_u200_xdma_201830_2,[ID=0x5d1211e8],[SC=4.2.0]

root@argus:/home/argus# sudo lspci -vd 10ee:
23:00.0 Processing accelerators: Xilinx Corporation Device 5000
    Subsystem: Xilinx Corporation Device 000e
    Flags: bus master, fast devsel, latency 0
    Memory at da000000 (64-bit, prefetchable) [size=32M]
    Memory at dffc0000 (64-bit, prefetchable) [size=128K]
    Capabilities:  Power Management version 3
    Capabilities:  MSI-X: Enable+ Count=33 Masked-
    Capabilities:  Express Endpoint, MSI 00
    Capabilities:  Advanced Error Reporting
    Capabilities: [1c0] #19
    Capabilities:  Access Control Services
    Capabilities:  #15
    Kernel driver in use: xclmgmt
    Kernel modules: xclmgmt

23:00.1 Processing accelerators: Xilinx Corporation Device 5001
    Subsystem: Xilinx Corporation Device 000e
    Flags: bus master, fast devsel, latency 0, IRQ 99
    Memory at dc000000 (64-bit, prefetchable) [size=32M]
    Memory at dfff0000 (64-bit, prefetchable) [size=64K]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Capabilities:  Power Management version 3
    Capabilities:  MSI-X: Enable+ Count=33 Masked-
    Capabilities:  Express Endpoint, MSI 00
    Capabilities:  Advanced Error Reporting
    Capabilities:  Access Control Services
    Capabilities:  #15
    Kernel driver in use: xocl
    Kernel modules: xocl
05-26-2020 12:58 PM
Based on the information you've provided, I do not see anything wrong with the board. While the LEDs being inconsistent with the documentation is frustrating, let's continue to debug and see if anything else is wrong with the card.
The lspci output shows the card is linked via PCIe, and the management and user physical functions are operational.
From xbutil flash scan we see the shell is programmed and the satellite controller (SC) is also communicating with XRT.
These are the main parts of the card.
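For anyone repeating this check, the "both physical functions are operational" reading can be pulled out of the lspci text mechanically. The following is only a minimal sketch: it assumes the usual `lspci -v` layout (an unindented header line per function, indented detail lines), the sample is abbreviated from the output posted above, and the function name `drivers_by_function` is made up for illustration.

```python
import re


def drivers_by_function(lspci_text):
    """Map each PCI function (e.g. '23:00.0') to its bound kernel driver, or None."""
    drivers = {}
    current = None
    for line in lspci_text.splitlines():
        # A header line starts at column 0 with bus:device.function.
        header = re.match(r"^([0-9a-fA-F:]+\.[0-7])\s", line)
        if header:
            current = header.group(1)
            drivers[current] = None
            continue
        bound = re.search(r"Kernel driver in use:\s*(\S+)", line)
        if bound and current is not None:
            drivers[current] = bound.group(1)
    return drivers


# Abbreviated from the lspci output quoted in this thread.
SAMPLE = """\
23:00.0 Processing accelerators: Xilinx Corporation Device 5000
\tSubsystem: Xilinx Corporation Device 000e
\tKernel driver in use: xclmgmt
23:00.1 Processing accelerators: Xilinx Corporation Device 5001
\tSubsystem: Xilinx Corporation Device 000e
\tKernel driver in use: xocl
"""

if __name__ == "__main__":
    for function, driver in drivers_by_function(SAMPLE).items():
        print(function, driver or "no driver bound")
```

In the output posted above, function 0 is bound to xclmgmt and function 1 to xocl, which is what this sketch reports for the sample. A function with no driver listed would show up as "no driver bound".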
Next run xbutil validate to see if the DDR and DMA are working correctly.
05-26-2020 03:15 PM - edited 05-27-2020 10:44 AM
@mcertosi This is what I get when I run validate:
xbutil validate
INFO: Found 1 cards
INFO: Validating card: xilinx_u200_xdma_201830_2
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
LINK ACTIVE, ATTENTION
Ensure Card is plugged in to Gen3x16, instead of Gen2x8
Lower performance may be experienced
WARN: == PCIE link check PASSED with warning
INFO: == Starting SC firmware version check:
INFO: == SC firmware version check PASSED
INFO: == Starting verify kernel test:
INFO: == verify kernel test PASSED
INFO: == Starting DMA test:
Host -> PCIe -> FPGA write bandwidth = 2833.39 MB/s
Host <- PCIe <- FPGA read bandwidth = 2999.45 MB/s
INFO: == DMA test PASSED
INFO: == Starting device memory bandwidth test:
Traceback (most recent call last):
  File "/opt/xilinx/xrt/test/23_bandwidth.py", line 1, in <module>
    import pyopencl as cl
  File "/usr/lib/python3/dist-packages/pyopencl/__init__.py", line 30, in <module>
    import pyopencl._cl as _cl
ModuleNotFoundError: No module named 'pyopencl._cl'
ERROR: == device memory bandwidth test FAILED
INFO: Card failed to validate.
ERROR: Some cards failed to validate.
I also tried to install pyopencl, but it looks like I already have it:
root@argus:/home/argus# python -m pip install pyopencl
Requirement already satisfied: pyopencl in /usr/lib/python3/dist-packages (2015.1)
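The xbutil validate output above is easier to scan when reduced to one status per check. Here is a rough sketch that does this from captured text. It assumes result lines end in PASSED, PASSED with warning, or FAILED, as they do in this thread; other XRT versions may format them differently. The names `RESULT` and `summarize_validate` are made up for illustration.

```python
import re

# Matches result lines such as "INFO: == DMA test PASSED".
RESULT = re.compile(
    r"==\s+(?P<name>.+?)\s+(?P<status>PASSED with warning|PASSED|FAILED)\s*$"
)


def summarize_validate(text):
    """Return (check name, status) pairs in the order they appear."""
    results = []
    for line in text.splitlines():
        match = RESULT.search(line)
        if match:
            results.append((match.group("name"), match.group("status")))
    return results


# Abbreviated from the validate output quoted in this thread.
SAMPLE = """\
INFO: == Starting AUX power connector check:
INFO: == AUX power connector check PASSED
INFO: == Starting PCIE link check:
WARN: == PCIE link check PASSED with warning
INFO: == Starting device memory bandwidth test:
ERROR: == device memory bandwidth test FAILED
"""

if __name__ == "__main__":
    for name, status in summarize_validate(SAMPLE):
        print(f"{status:<20} {name}")
```

Applied to the full output above, this would list the AUX power, SC firmware, verify kernel, and DMA checks as passed, the PCIe link check as passed with a warning, and only the device memory bandwidth test as failed.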
05-27-2020 11:00 AM
You have two things going on here that stand out to me. First, the card is not plugged into a Gen3x16 PCIe slot. This will be a bandwidth issue later.
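The negotiated link can also be read directly from sysfs without rerunning validate. This is a minimal sketch, assuming a Linux host whose kernel exposes the current_link_speed, current_link_width, max_link_speed, and max_link_width attributes under /sys/bus/pci/devices. The address 0000:23:00.0 is taken from the scan output earlier in this thread, and the helper names are made up for illustration.

```python
from pathlib import Path

LINK_ATTRS = (
    "current_link_speed",
    "current_link_width",
    "max_link_speed",
    "max_link_width",
)


def read_link_status(bdf, sysfs_root="/sys/bus/pci/devices"):
    """Read the PCIe link attributes for one device; missing files give None."""
    device = Path(sysfs_root) / bdf
    status = {}
    for attr in LINK_ATTRS:
        path = device / attr
        status[attr] = path.read_text().strip() if path.exists() else None
    return status


def is_downtrained(status):
    """True if the link runs below what the card supports, None if unknown."""
    if any(status[attr] is None for attr in LINK_ATTRS):
        return None
    return (
        status["current_link_speed"] != status["max_link_speed"]
        or status["current_link_width"] != status["max_link_width"]
    )


if __name__ == "__main__":
    # Address taken from the "Card [0000:23:00.0]" line in this thread.
    link = read_link_status("0000:23:00.0")
    print(link)
    print("downtrained:", is_downtrained(link))
```

The max_link_* values describe what the card itself supports, so a mismatch with the current_link_* values usually points at the slot or riser rather than the card. Gen2 corresponds to 5.0 GT/s and Gen3 to 8.0 GT/s.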
Second, the Python problem. If you are using Ubuntu, installing pyopencl with apt doesn't seem to work correctly. This answer record describes how we recommend fixing the problem: https://www.xilinx.com/support/answers/73055.html
Unfortunately, your problem has a slightly different error signature than what is in the answer record. You should still try it, but it might not work.
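Before and after trying the fix, it may help to confirm which interpreter is in play and whether the compiled pyopencl._cl extension loads at all. The traceback shows the pure-Python package being found while its compiled part is not. The sketch below uses only the standard library. The helper name `try_import` is made up, and it is an assumption that you run it with the same Python interpreter the validate test uses, which this thread does not state.

```python
import importlib
import sys


def try_import(module_name):
    """Return (True, module file) on success or (False, error text) on failure."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
        return False, f"{type(exc).__name__}: {exc}"
    return True, getattr(module, "__file__", "<built-in>")


if __name__ == "__main__":
    print("interpreter:", sys.executable, sys.version.split()[0])
    # The package itself, then the compiled extension named in the traceback.
    for name in ("pyopencl", "pyopencl._cl"):
        ok, detail = try_import(name)
        print(f"{name}: {'ok' if ok else 'FAILED'} ({detail})")
```

If both lines report ok after the fix, the import that failed in the validate traceback should no longer be the blocker.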
There are also some other forum posts about Python problems, for example: https://forums.xilinx.com/t5/Alveo-Accelerator-Cards/failed-to-validate-U50DD/td-p/1067758
What Linux kernel are you using?