Explorer
2,007 Views
Registered: ‎06-14-2018

Alveo 200 bandwidth stability issue


Hi,

This is a follow-up to this thread:

https://forums.xilinx.com/t5/Alveo-Data-Center-Accelerator/Bandwidth-problem-on-Alveo-200/m-p/976658

 

I ran 240 iterations of xbutil dmatest (nimbix-xbutil-bench-stability.sh), then processed the results to get these graphs.

The graphs were generated as follows:

$ bash bench-stability.sh
$ gnuplot -p -e 'fileout="image.png"' gnuplot-settings-stability-b0w.txt
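The collection loop can be sketched roughly like this (a hypothetical reconstruction; the actual nimbix-xbutil-bench-stability.sh is not shown in this thread):

```shell
# Hypothetical sketch of the stability benchmark: run `xbutil dmatest`
# N times and keep each raw log for later parsing and plotting with gnuplot.
run_bench() {
    n=$1
    i=1
    while [ "$i" -le "$n" ]; do
        xbutil dmatest > "dmatest-run-${i}.log" 2>&1
        i=$(( i + 1 ))
    done
}
# On a machine with the card installed: run_bench 240
```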

What could be the cause of this erratic behavior?


Accepted Solutions
Xilinx Employee
738 Views
Registered: ‎11-11-2012

Re: Alveo 200 bandwidth stability issue


Thanks, everyone, for your contributions to this issue. Here is the solution.

Cause of the problem:

      On a NUMA system, DMA performance can be severely degraded if the CPU cores running the application sit on a different NUMA node (and thus behind a different PCIe root port) than the FPGA card. Although the Xilinx driver (xocl) is NUMA-aware, the application is not. "xbutil dmatest" is an ordinary application and, like most applications, it does not handle NUMA placement.

 

Solution:

1. Find the Alveo card's device ID; in the example below, "02:00.1" is the ID you are looking for:

        $ lspci | grep Xilinx
        02:00.0 Processing accelerators: Xilinx Corporation Device 5004
        02:00.1 Processing accelerators: Xilinx Corporation Device 5005

2. Find the CPU cores close to the Alveo card:

        $ cat /sys/bus/pci/devices/0000\:02\:00.1/local_cpulist

        0-7

3. Run the application and lock it to the CPU cores you found above:

        $ taskset -c 0,1,2,3,4,5,6,7 xbutil dmatest

You should now see good, stable DMA test performance.
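The three steps can also be chained, since taskset accepts the same range syntax that local_cpulist uses. A minimal sketch, assuming a single card whose BDF you pass in yourself (the helper name is hypothetical):

```shell
# Pin a command to the CPUs that are NUMA-local to a given PCIe function.
# The BDF argument (e.g. 0000:02:00.1) comes from `lspci | grep Xilinx`.
pin_to_local_cpus() {
    bdf=$1
    shift
    cpus=$(cat "/sys/bus/pci/devices/${bdf}/local_cpulist")  # e.g. "0-7"
    taskset -c "$cpus" "$@"
}
# Usage: pin_to_local_cpus 0000:02:00.1 xbutil dmatest
```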

 

15 Replies
Contributor
1,997 Views
Registered: ‎09-24-2016

Re: Alveo 200 bandwidth stability issue


On Linux, a full "sudo lspci -vvvn" or variants of "sudo lspci -t" will show the topology (flattened or in tree form). This helps you determine at least whether the U2x0 accelerator is directly attached to a CPU PCIe root-complex port.
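As a sketch, the two read-only queries could be wrapped like this (lspci from pciutils; the wrapper name is hypothetical, and the BDF argument comes from a plain `lspci | grep Xilinx`):

```shell
# Show the PCIe bus tree plus the verbose dump of one function.
# Any bridge/switch between the root port and the card shows up in the tree.
show_topology() {
    lspci -tv            # tree form of the whole PCIe hierarchy
    lspci -vvv -s "$1"   # full capability dump of the given BDF
}
# Usage (may need root for full capability info): show_topology 02:00.1
```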

 

Explorer
1,990 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue

02:00.0 Processing accelerators: Xilinx Corporation Device 5000
02:00.1 Processing accelerators: Xilinx Corporation Device 5001
 \-[0000:00]-+-00.0
... +-02.0-[02]--+-00.0 | \-00.1
02:00.0 1200: 10ee:5000
	Subsystem: 10ee:000e
	Physical Slot: 4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
	Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=1 offset=00009000
		PBA: BAR=1 offset=00009fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd- ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [1c0 v1] #19
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: xclmgmt

02:00.1 1200: 10ee:5001
	Subsystem: 10ee:000e
	Physical Slot: 4
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 26
	Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
	Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=1 offset=00008000
		PBA: BAR=1 offset=00008fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable+ Non-Fatal+ Fatal+ Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: xocl_xdma

I guess this is what I have to look at, but right now I'm not yet sure how to read it.

Contributor
1,978 Views
Registered: ‎09-24-2016

Re: Alveo 200 bandwidth stability issue


Looking only at the memory ranges, it seems to me that both accelerator functions are attached to the first root-complex port of the Xeon.

I do not think PCIe switches are visible (they are transparent, AFAIK), so my assumption is that there is a switch in between, unless the root-complex port itself can provide two x16 ports; confirming that would require more in-depth datasheets.

 

00:02.0 8086:2f04 PCI bridge: Intel Corporation Xeon E7 v3/Xeon E5 v3/Core i7 PCI Express Root Port 2 (rev 02)
	Memory behind bridge: c0000000-c40fffff

02:00.0 1200: 10ee:5000 Processing accelerators: Xilinx Corporation Device 5000
Region 0: Memory at c2000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4000000 (32-bit, non-prefetchable) [size=128K]

02:00.1 1200: 10ee:5001 Processing accelerators: Xilinx Corporation Device 5001
Region 0: Memory at c0000000 (32-bit, non-prefetchable) [size=32M]
Region 1: Memory at c4020000 (32-bit, non-prefetchable) [size=64K]

Explorer
1,959 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


According to the Nimbix support, there's only one card per host.

Xilinx Employee
1,928 Views
Registered: ‎10-04-2016

Re: Alveo 200 bandwidth stability issue


Hi @xil_tour,

Xilinx is aware of this issue. The variation in bandwidth is attributed to NUMA and is not believed to be an issue with the U200 card.

The Nimbix system has two NUMA nodes. The U200 is on NUMA node 0 (cpu 1-7) while the interrupt affinity mask is set across both NUMA nodes (cpu 1-16). When the CPUs of node 1 are used for IRQ processing, the performance goes down. 

The IRQ CPU affinity can be changed via /proc/irq/<irq vector #>/smp_affinity.
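The value in smp_affinity is a hexadecimal CPU bitmask, so the two masks described above can be read like this (an illustrative helper, not part of any Xilinx tooling):

```shell
# mask_has_cpu <hexmask> <cpu>: succeeds if the CPU's bit is set in the mask.
# e.g. mask "ff" covers CPUs 0-7 (one NUMA node); "ffff" covers CPUs 0-15.
mask_has_cpu() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}
# Usage: mask_has_cpu ff 3 && echo "CPU 3 is in mask ff"
```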

Regards,

Deanna

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Explorer
1,911 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


Is this something I can set up myself?

Xilinx Employee
1,890 Views
Registered: ‎12-10-2013

Re: Alveo 200 bandwidth stability issue


Hi @xil_tour 

As Deanna mentioned, you can change this in the smp_affinity file, if you have access; however, I do not know the permissions in the Nimbix cloud environment. You would grep for your device in /proc/interrupts to get the IRQ number, then modify /proc/irq/<irq#>/smp_affinity so that it only reflects the NUMA node 0 CPUs, for example.
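A sketch of that procedure, assuming root access; the `cpulist_to_mask` helper is hypothetical, and the IRQ number must come from your own /proc/interrupts:

```shell
# Convert a cpulist such as "0-7" or "0-3,8" into the hex mask
# that /proc/irq/<irq#>/smp_affinity expects.
cpulist_to_mask() {
    mask=0
    for part in $(printf '%s' "$1" | tr ',' ' '); do
        lo=${part%-*}    # start of range (or the single CPU itself)
        hi=${part#*-}    # end of range (or the single CPU itself)
        cpu=$lo
        while [ "$cpu" -le "$hi" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
}
# Then, with root, something like:
#   grep xocl /proc/interrupts                 # find the IRQ number(s)
#   echo "$(cpulist_to_mask 0-7)" | sudo tee /proc/irq/<irq#>/smp_affinity
```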

Explorer
1,876 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


Here's Nimbix's reply to Deanna's response, which I copied/pasted to them:

Thanks for the update, it would appear to be an issue which Xilinx is aware of.

I'm not sure what more we can do on the Nimbix end, short of waiting for a Xilinx fix.

I'm going to close out this ticket; however, you can always reopen it by responding to this email should you need additional assistance.

I tried to change the affinity by echoing a new value over the previous one, but it seems I don't have the rights to do so (permission error).

Here's the content of /proc/interrupts; can you tell me which lines are the ones of interest, please?

Xilinx Employee
1,837 Views
Registered: ‎10-19-2015

Re: Alveo 200 bandwidth stability issue


Hi @xil_tour @bethe @demarco @likewise 

I pinged Nimbix again and I'm going to see if we can look into this together. Ideally I'll post a solution shortly; otherwise I'll post an update.

Regards,

-M

Explorer
1,798 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


Thanks a lot.

I'll stay tuned.

Xilinx Employee
1,166 Views
Registered: ‎10-19-2015

Re: Alveo 200 bandwidth stability issue


Hi @xil_tour,

We got squared away with Nimbix and I believe they are handling the debug on their end. There isn't much I can do at this point, but please keep me posted as you continue testing. 

Regards,

Matt 

Explorer
1,144 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


I hope this will be fixed soon, which would prove that this is not the normal behavior of a $9,000 card...

Explorer
653 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


Looks like the workaround is working, thanks!

Was something also fixed in hardware, or would taskset have been sufficient from the start?

alveo-bench-bandwidth-avg_all.png
alveo-bench-stability-readb0.png
alveo-bench-stability-writeb0.png
Explorer
44 Views
Registered: ‎06-14-2018

Re: Alveo 200 bandwidth stability issue


The problem is back.

It's been a week or so, I think, maybe more.

Here are the results as graphs (not polished).

 

alveo-bench-stability-readb0.png
alveo-bench-stability-writeb0.png
bank0-read.dat-density.png
bank0-write.dat-density.png