Explorer

Alveo 200 bandwidth stability issue

Hi,

This is a follow-up to this thread:

https://forums.xilinx.com/t5/Alveo-Data-Center-Accelerator/Bandwidth-problem-on-Alveo-200/m-p/976658

 

I ran 240 iterations of xbutil dmatest (nimbix-xbutil-bench-stability.sh) and then processed the results into the graphs below.

The graphs were generated like this:

$ bash bench-stability.sh
$ gnuplot -p -e 'fileout="image.png"' gnuplot-settings-stability-b0w.txt
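For reference, the benchmark loop is essentially of this shape (a simplified sketch; the real script differs in details, and the grep pattern depends on the exact bandwidth lines your XRT version's xbutil dmatest prints):

        #!/usr/bin/env bash
        # Run xbutil dmatest repeatedly and log the reported bandwidth figures.
        iterations=240
        log=dmatest-raw.log
        : > "$log"
        for i in $(seq 1 "$iterations"); do
            echo "=== iteration $i ===" >> "$log"
            xbutil dmatest >> "$log" 2>&1
        done
        # Extract every "<number> MB/s" figure, one value per line, for gnuplot.
        grep -o '[0-9.]\+ MB/s' "$log" | awk '{print $1}' > bank0.dat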

What could be the cause of this erratic behavior?

[Attachments: image-write.png, image-read.png]

15 Replies

Contributor

On Linux, a full "sudo lspci -vvvn" or the tree variant "sudo lspci -t" will show the topology (flattened or in tree form). This helps you figure out at least whether the U2x0 accelerator is directly attached to a CPU PCIe Root Complex port.

 

Explorer
02:00.0 Processing accelerators: Xilinx Corporation Device 5000
02:00.1 Processing accelerators: Xilinx Corporation Device 5001
 \-[0000:00]-+-00.0
             ...
             +-02.0-[02]--+-00.0
             |            \-00.1
02:00.0 1200: 10ee:5000
	Subsystem: 10ee:000e
	Physical Slot: 4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
	Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=1 offset=00009000
		PBA: BAR=1 offset=00009fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [1c0 v1] #19
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: xclmgmt

02:00.1 1200: 10ee:5001
	Subsystem: 10ee:000e
	Physical Slot: 4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 26
	Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
	Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=1 offset=00008000
		PBA: BAR=1 offset=00008fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: xocl_xdma

I guess this is what I have to look at, but right now I'm not yet sure how to read it.

Contributor

Looking only at the memory ranges, it seems to me that both accelerator functions are attached to the 1st Root Complex port of the Xeon.

I do not think PCIe switches are visible (they are transparent AFAIK), so my assumption is that there is a switch in between, unless of course the Root Complex port has two x16 ports. That would require more in-depth datasheets.

 

00:02.0 8086:2f04 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
	Memory behind bridge: c0000000-c40fffff

02:00.0 1200: 10ee:5000 Processing accelerators: Xilinx Corporation Device 5000
Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]

02:00.1 1200: 10ee:5001 Processing accelerators: Xilinx Corporation Device 5001
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]
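
As a side note, sysfs reports directly which NUMA node a PCI device is attached to (it prints -1 when the platform exposes no NUMA information):

        $ cat /sys/bus/pci/devices/0000\:02\:00.0/numa_node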

Explorer

According to Nimbix support, there is only one card per host.

Xilinx Employee

Hi @xil_tour,

Xilinx is aware of this issue. The variation in bandwidth is attributed to NUMA and is not believed to be an issue with the U200 card.

The Nimbix system has two NUMA nodes. The U200 is on NUMA node 0 (CPUs 1-7), while the interrupt affinity mask is set across both NUMA nodes (CPUs 1-16). When the CPUs of node 1 are used for IRQ processing, the performance goes down.

The IRQ CPU affinity can be changed via /proc/irq/<irq vector #>/smp_affinity.
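
For example, to inspect the current setting (IRQ 26 here is just the vector reported in the lspci dump above; the mask is a hex bitmap of CPUs, so ff would mean CPUs 0-7):

        $ cat /proc/irq/26/smp_affinity
        $ cat /proc/irq/26/smp_affinity_list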

Regards,

Deanna

Explorer

Is this something I can set up myself?

Xilinx Employee

Hi @xil_tour 

As Deanna mentioned, you can change this in the smp_affinity file if you have access; however, I do not know the permissions in the Nimbix cloud environment. You would grep for your device in /proc/interrupts to get the IRQ number, then modify /proc/irq/<irq#>/smp_affinity so that it only reflects the CPU affinity of the NUMA node 0 CPUs, for example — see the sketch below.
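
Illustrative only — the entry name in /proc/interrupts and the IRQ number vary per system, and writing the mask requires root:

        $ grep -i xocl /proc/interrupts
        $ echo ff | sudo tee /proc/irq/26/smp_affinity    # ff = CPUs 0-7, i.e. NUMA node 0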

Explorer

Here's the Nimbix response to Deanna's reply, which I copied/pasted to them:

Thanks for the update, it would appear to be an issue which Xilinx is aware of.

I'm not sure what more we can do on the Nimbix end, short of waiting for a Xilinx fix.

I'm going to close out this ticket; however, you can always reopen it by responding to this email should you need additional assistance.

I tried to change the affinity by echoing over the previous value, but it seems that I don't have the rights to do so (permission denied).

Here's the content of /proc/interrupts; can you tell me which lines are of interest, please?

Xilinx Employee

Hi @xil_tour @bethe @demarco @likewise 

I pinged Nimbix again and I'm going to see if we can look into this together. Ideally I'll post a solution shortly; otherwise I'll post an update.

Regards,

-M

Explorer

Thanks a lot.

I'll stay tuned.

Xilinx Employee

Hi @xil_tour,

We got squared away with Nimbix, and I believe they are handling the debugging on their end. There isn't much I can do at this point, but please keep me posted as you continue testing.

Regards,

Matt 

Explorer

I hope this will be fixed soon, proving that it's not the normal behavior of a $9000 card...

Xilinx Employee (Accepted Solution)

Thanks, everyone, for your contributions to this issue. Here is the solution.

Reason for the problem:

On a NUMA system, DMA performance can be greatly degraded if the CPU cores running the application sit behind a different PCIe switch (or socket) than the FPGA card. Although the Xilinx driver (xocl) is NUMA-aware, the application is not: "xbutil dmatest" is an application and, like most applications, it does not handle NUMA placement.

 

Solution:

1. Find the Alveo card's PCI device ID; in the example below, "02:00.1" is the ID you are looking for:

        $ lspci | grep Xilinx
        02:00.0 Processing accelerators: Xilinx Corporation Device 5004
        02:00.1 Processing accelerators: Xilinx Corporation Device 5005

2. Find the CPU cores close to the Alveo card:

        $ cat /sys/bus/pci/devices/0000\:02\:00.1/local_cpulist

        0-7

3. Run the application and lock it to the CPU cores you found above:

        $ taskset -c 0,1,2,3,4,5,6,7 xbutil dmatest

You should see good DMA test performance now.
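
As a convenience, taskset -c accepts the same range syntax that local_cpulist prints, so steps 2 and 3 can in principle be combined into a single (untested) one-liner; numactl --cpunodebind achieves a similar effect and additionally binds memory allocations to the node:

        $ taskset -c $(cat /sys/bus/pci/devices/0000\:02\:00.1/local_cpulist) xbutil dmatest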

 


Explorer

Looks like the workaround is working, thanks!

Was something also fixed in hardware, or would taskset have been sufficient from the start?

[Attachments: alveo-bench-bandwidth-avg_all.png, alveo-bench-stability-readb0.png, alveo-bench-stability-writeb0.png]

Explorer

The problem is back.

It's been a week or so, I think, maybe more.

Here are the results as graphs (not polished):

 

[Attachments: alveo-bench-stability-readb0.png, alveo-bench-stability-writeb0.png, bank0-read.dat-density.png, bank0-write.dat-density.png]