05-31-2019 01:05 AM
Hi,
this is a follow-up of this thread:
https://forums.xilinx.com/t5/Alveo-Data-Center-Accelerator/Bandwidth-problem-on-Alveo-200/m-p/976658
I ran 240 iterations of xbutil dmatest (nimbix-xbutil-bench-stability.sh), then processed the results to get these graphs.
Graphs are obtained this way:
$ bash bench-stability.sh
$ gnuplot -p -e 'fileout="image.png"' gnuplot-settings-stability-b0w.txt
What could be the cause of this erratic behavior?
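For reference, here is a minimal sketch of what such a benchmark loop could look like (this is not the actual nimbix-xbutil-bench-stability.sh; the iteration count and log layout are assumptions):

#!/bin/bash
# Hypothetical sketch: run xbutil dmatest repeatedly and keep the raw
# output so the reported bandwidths can be extracted and plotted later.
ITERATIONS=240
LOGDIR=dmatest-logs
mkdir -p "$LOGDIR"
for i in $(seq 1 "$ITERATIONS"); do
    xbutil dmatest > "$LOGDIR/run-${i}.log" 2>&1
done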
05-31-2019 01:15 AM
On Linux, a full "sudo lspci -vvvn" or the tree variant "sudo lspci -t" will show the topology (flattened or in tree form). This helps you figure out at least whether the U2x0 accelerator is directly attached to a CPU PCIe Root Complex port.
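For example (standard pciutils invocations; the -s 02:00.1 selector below assumes the bus/device numbers shown later in this thread):

# PCIe topology as a tree (bridges, switches, endpoints)
$ sudo lspci -tv

# Full verbose dump of one function of the card; recent pciutils
# versions also print the NUMA node here
$ sudo lspci -vvv -s 02:00.1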
05-31-2019 01:48 AM
02:00.0 Processing accelerators: Xilinx Corporation Device 5000
02:00.1 Processing accelerators: Xilinx Corporation Device 5001
\-[0000:00]-+-00.0
...
            +-02.0-[02]--+-00.0
            |            \-00.1
02:00.0 1200: 10ee:5000
    Subsystem: 10ee:000e
    Physical Slot: 4
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
    Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Vector table: BAR=1 offset=00009000
        PBA: BAR=1 offset=00009fe0
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
            RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
            EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [1c0 v1] #19
    Capabilities: [400 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Kernel driver in use: xclmgmt

02:00.1 1200: 10ee:5001
    Subsystem: 10ee:000e
    Physical Slot: 4
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 32 bytes
    Interrupt: pin A routed to IRQ 26
    Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
    Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
        Vector table: BAR=1 offset=00008000
        PBA: BAR=1 offset=00008fe0
    Capabilities: [70] Express (v2) Endpoint, MSI 00
        DevCap: MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
            RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
    Capabilities: [400 v1] Access Control Services
        ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
    Kernel driver in use: xocl_xdma
I guess that's what I need to look at, but right now I'm not yet sure how to read it.
05-31-2019 01:56 AM
Looking only at the memory ranges, it seems to me that both accelerator functions are attached to the first Root Complex port of the Xeon.
I do not think PCIe switches are visible (they are transparent AFAIK), so my assumption is that there is a switch in between, unless of course that Root Complex port can provide two x16 ports. That would require more in-depth datasheets.
00:02.0 8086:2f04 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
    Memory behind bridge: c0000000-c40fffff
02:00.0 1200: 10ee:5000 Processing accelerators: Xilinx Corporation Device 5000
Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]
02:00.1 1200: 10ee:5001 Processing accelerators: Xilinx Corporation Device 5001
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]
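One way to cross-check this is to ask sysfs which NUMA node the endpoint is attached to; a quick check, assuming the 02:00.1 address from the output above (-1 means the platform did not report a node):

# NUMA node the PCIe function belongs to
$ cat /sys/bus/pci/devices/0000:02:00.1/numa_node

# CPUs considered local to that function
$ cat /sys/bus/pci/devices/0000:02:00.1/local_cpulist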
05-31-2019 07:21 AM
According to Nimbix support, there's only one card per host.
06-03-2019 10:35 AM
Hi @xil_tour,
Xilinx is aware of this issue. The variation in bandwidth is attributed to NUMA and is not believed to be an issue with the U200 card.
The Nimbix system has two NUMA nodes. The U200 is on NUMA node 0 (cpu 1-7) while the interrupt affinity mask is set across both NUMA nodes (cpu 1-16). When the CPUs of node 1 are used for IRQ processing, the performance goes down.
The IRQ CPU affinity can be changed via /proc/irq/<irq vector #>/smp_affinity.
Regards,
Deanna
06-04-2019 01:43 AM
Is this something I can set up myself?
06-04-2019 11:12 AM - edited 06-04-2019 11:13 AM
Hi @xil_tour
As Deanna mentioned, you can change this in the smp_affinity file, if you have access. However, I do not know the permissions in the Nimbix cloud environment. You would grep for your device in /proc/interrupts to get the IRQ number, then modify /proc/irq/<irq#>/smp_affinity so that it only includes the NUMA node 0 CPUs, for example.
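A sketch of what that could look like, assuming the Alveo interrupts show up under the xocl/xdma names in /proc/interrupts and that NUMA node 0 covers CPUs 0-7 (hex affinity mask ff); both are assumptions to verify on the actual system:

# Find the interrupt lines belonging to the card's driver
$ grep -i -e xocl -e xdma /proc/interrupts

# Restrict one of those IRQs (e.g. 26) to the CPUs of NUMA node 0;
# ff is the hex CPU mask for CPUs 0-7
$ echo ff | sudo tee /proc/irq/26/smp_affinity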
06-05-2019 01:58 AM
Here's the Nimbix response to Deanna's answer, which I copied/pasted to them:
Thanks for the update, it would appear to be an issue which Xilinx is aware of. I'm not sure what more we can do on the Nimbix end, short of waiting for a Xilinx fix. I'm going to close out this ticket; however, you can always reopen it by responding to this email should you need additional assistance.
I tried to change the affinity by echoing a new value over the previous one, but it seems that I don't have the rights to do so (permission error).
Here's the content of /proc/interrupts; can you tell me which lines are of interest, please?
06-11-2019 12:19 PM
Hi @xil_tour @bethe @demarco @likewise
I pinged Nimbix again and I'm going to see if we can look into this together. Ideally I'll post a solution shortly; otherwise I'll post an update.
Regards,
-M
06-12-2019 01:29 AM
Thanks a lot.
I'll stay tuned.
06-17-2019 02:04 PM - edited 06-17-2019 02:08 PM
Hi @xil_tour,
We got squared away with Nimbix and I believe they are handling the debug on their end. There isn't much I can do at this point, but please keep me posted as you continue testing.
Regards,
Matt
06-18-2019 01:46 AM
I hope this will be fixed soon, which would prove that this is not the normal behavior of a $9,000 card...
08-20-2019 01:38 PM
Thanks everyone for your contributions to this issue. Here is the solution.
Cause of the problem:
On a NUMA system, DMA performance can be greatly degraded if the CPU cores running the application sit on a different NUMA node (behind a different PCIe root complex) than the FPGA card. Although the Xilinx driver (xocl) is NUMA aware, the application is not. "xbutil dmatest" is an application and, like most applications, does not handle NUMA placement by itself.
Solution:
1. Find the Alveo card's PCI device address; in the example below, "02:00.1" is the address you are looking for:
$ lspci | grep Xilinx
02:00.0 Processing accelerators: Xilinx Corporation Device 5004
02:00.1 Processing accelerators: Xilinx Corporation Device 5005
2. Find the CPU cores close to the Alveo card:
$ cat /sys/bus/pci/devices/0000\:02\:00.1/local_cpulist
0-7
3. Run the application and lock it to the CPU cores you found above:
$ taskset -c 0,1,2,3,4,5,6,7 xbutil dmatest
You should now see good DMA test performance.
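If you prefer not to type the steps by hand, the same logic can be wrapped in a small script; this is only a sketch and assumes a single Xilinx card whose user function ends in ".1":

#!/bin/bash
# Pin xbutil dmatest to the CPUs local to the Alveo card (sketch).
# Step 1: domain-qualified PCI address of the user function (xxxx:xx:00.1)
BDF=$(lspci -D | grep -i Xilinx | awk '{print $1}' | grep '\.1$' | head -n1)
# Step 2: CPU list local to that function
CPUS=$(cat /sys/bus/pci/devices/${BDF}/local_cpulist)
# Step 3: run the DMA test locked to those CPUs
taskset -c "$CPUS" xbutil dmatest

Alternatively, if numactl is installed and the card sits on node 0, "numactl --cpunodebind=0 --membind=0 xbutil dmatest" achieves similar pinning and also binds memory allocations to the local node.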
09-06-2019 07:21 AM
Looks like the workaround is working, thanks!
Was something also fixed in hardware, or would taskset have been sufficient from the start?
10-15-2019 10:54 AM
The problem is back.
It's been a week or so I think, maybe more.
Here are the results as graphs (not polished).