liuyz (Adventurer)

U250 Host <-> PCIe <-> FPGA bandwidth

Hi,

I am observing low bandwidth in the DMA test when using the following command:

xbutil validate

OS: RHEL 7.6

XRT/SHELL: U250 201830_2

INFO: Found 1 cards

INFO: Validating card[0]: xilinx_u250_xdma_201830_2
INFO: Checking PCIE link status: PASSED
INFO: Starting verify kernel test: 
INFO: verify kernel test PASSED
INFO: Starting DMA test
Host -> PCIe -> FPGA write bandwidth = 4475.71 MB/s
Host <- PCIe <- FPGA read bandwidth = 3086.84 MB/s
INFO: DMA test PASSED
INFO: Starting DDR bandwidth test: ..........
Maximum throughput: 47624.445312 MB/s
INFO: DDR bandwidth test PASSED
INFO: Card[0] validated successfully.

INFO: All cards validated successfully.

As a comparison, the following DMA bandwidth is shown in UG1301.

INFO: Starting DMA test
Host -> PCIe -> FPGA write bandwidth = 11346.1 MB/s
Host <- PCIe <- FPGA read bandwidth = 11333.6 MB/s
INFO: DMA test PASSED

Are there any steps I should follow to increase the DMA bandwidth, or did I miss anything here?

(I was expecting to get comparable DMA bandwidth.)

Thanks.

liuyz (Adventurer)

Here are some extra test results:

Run 1:

[007@007 ~]$ xbutil dmatest
INFO: Found total 1 card(s), 1 are usable
Total DDR size: 65536 MB
Reporting from mem_topology:
Data Validity & DMA Test on bank0
Host -> PCIe -> FPGA write bandwidth = 4306.14 MB/s
Host <- PCIe <- FPGA read bandwidth = 2925.48 MB/s
Data Validity & DMA Test on bank1
Host -> PCIe -> FPGA write bandwidth = 5290.42 MB/s
Host <- PCIe <- FPGA read bandwidth = 3421.05 MB/s
Data Validity & DMA Test on bank2
Host -> PCIe -> FPGA write bandwidth = 4452.09 MB/s
Host <- PCIe <- FPGA read bandwidth = 3057.45 MB/s
Data Validity & DMA Test on bank3
Host -> PCIe -> FPGA write bandwidth = 4374.55 MB/s
Host <- PCIe <- FPGA read bandwidth = 2878.66 MB/s
INFO: xbutil dmatest succeeded.

Run 2:

[007@007 ~]$ xbutil dmatest
INFO: Found total 1 card(s), 1 are usable
Total DDR size: 65536 MB
Reporting from mem_topology:
Data Validity & DMA Test on bank0
Host -> PCIe -> FPGA write bandwidth = 11135.3 MB/s
Host <- PCIe <- FPGA read bandwidth = 12128.7 MB/s
Data Validity & DMA Test on bank1
Host -> PCIe -> FPGA write bandwidth = 11208.1 MB/s
Host <- PCIe <- FPGA read bandwidth = 12154.4 MB/s
Data Validity & DMA Test on bank2
Host -> PCIe -> FPGA write bandwidth = 11353.1 MB/s
Host <- PCIe <- FPGA read bandwidth = 12150 MB/s
Data Validity & DMA Test on bank3
Host -> PCIe -> FPGA write bandwidth = 11259 MB/s
Host <- PCIe <- FPGA read bandwidth = 12154.4 MB/s
INFO: xbutil dmatest succeeded.

It seems the DMA performance varies a lot. Note that no other tasks were running during the tests.

Any suggestions/advice on how to get consistent DMA performance/bandwidth? Could this be related to workstation settings?

Thanks a lot.

mcertosi (Xilinx Employee)

Hi @liuyz 

I'll need some more information about your system in order to look into this. 

Can you tell me the following: 

Host workstation

Operating system

Kernel

XRT version 

Were the different results captured after a system reboot, or were the commands issued right after one another? 

Could you test a few more times (for example, with a quick loop like the one sketched below) and let me know the ratio of high-bandwidth to low-bandwidth results? 

If the data was captured on a single boot, please send me the output of $ sudo lspci -vvv -d 10ee: 

If the data was captured from multiple boots, make sure you can reproduce the issue, then send me the corresponding output of $ sudo lspci -vvv -d 10ee: 
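
In case it helps, something like this quick bash loop (just a rough sketch; the grep/awk pattern assumes the output lines look exactly like the ones you pasted) would run the dmatest several times and collect the reported numbers so we can see the high/low ratio:

$ for i in $(seq 1 10); do xbutil dmatest; sleep 5; done 2>&1 | tee dma_runs.log
$ grep "bandwidth" dma_runs.log | awk '{print $(NF-1)}' | sort -n   # all measured MB/s values, lowest to highest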

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
liuyz (Adventurer)

Hi @mcertosi ,

Thanks for the reply. Please find further information below:

* Lenovo ThinkStation P900

* RHEL

* Linux kernel 3.10.0-957.10.1.el7.x86_64

* XRT/Shell xrt_201830.2.1.1746_7.4.1708-xrt | xilinx-u250-xdma-dev-201830.2 | xilinx-u250-xdma-201830.2-2468403

Generally, the bandwidth varies between 4000 MB/s and 12000 MB/s.

After 1-2 runs, the bandwidth can generally reach around 11000 MB/s.

If I leave it for a while, open a new terminal, or reboot, the initial bandwidth can be lower, at around 4000 MB/s to 6000 MB/s.

Sometimes, different DDR banks also report different bandwidths, e.g. bank0 at ~11000 MB/s while bank4 is at ~6000 MB/s.

Sometimes, the read and write bandwidths also differ, e.g. write at ~6000 MB/s while read is at ~3500 MB/s.

lspci output is as below:

[007@007 ~]$ sudo lspci -vvv -d 10ee:
03:00.0 Processing accelerators: Xilinx Corporation Device 5004
	Subsystem: Xilinx Corporation Device 000e
	Physical Slot: 1
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	NUMA node: 0
	Region 0: Memory at a2000000 (64-bit, prefetchable) [size=32M]
	Region 2: Memory at a4000000 (64-bit, prefetchable) [size=128K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=2 offset=00009000
		PBA: BAR=2 offset=00009fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [1c0 v1] #19
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [410 v1] #15
	Kernel driver in use: xclmgmt
	Kernel modules: xclmgmt

03:00.1 Processing accelerators: Xilinx Corporation Device 5005
	Subsystem: Xilinx Corporation Device 000e
	Physical Slot: 1
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 27
	NUMA node: 0
	Region 0: Memory at a0000000 (64-bit, prefetchable) [size=32M]
	Region 2: Memory at a4020000 (64-bit, prefetchable) [size=64K]
	Region 4: Memory at 90000000 (64-bit, prefetchable) [size=256M]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [60] MSI-X: Enable+ Count=33 Masked-
		Vector table: BAR=2 offset=00008000
		PBA: BAR=2 offset=00008fe0
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 1024 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x16, ASPM not supported, Exit Latency L0s unlimited, L1 unlimited
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range BC, TimeoutDis+, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v1] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
	Capabilities: [400 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl+ DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Capabilities: [410 v1] #15
	Kernel driver in use: xocl_xdma
	Kernel modules: xocl

Thanks.

mcertosi (Xilinx Employee)

Hi @liuyz 

My test setup:

Dell Precision T3600

RHEL 7.4

Linux kernel 3.10.0-693.el7.x86_64

XRT/Shell xrt_201830.2.1.1746_7.4.1708-xrt | xilinx-u250-xdma-201830.2-2468403

 

I wasn't able to reproduce the issue exactly as you described it. I ran the dmatest 20 times over the course of an hour, and while there was some performance variance, I did not see a swing between 4000 MB/s and 12000 MB/s.

I was able to decrease the bandwidth reported by the test by running multiple tests at a time or by loading the CPU with other work. Since you mentioned you aren't running anything else, maybe you can check whether there are interrupts coming from somewhere else off-chip that your CPU needs to service while the dmatest is running. 

Can you take a look at CPU usage as well as interrupts generated while running the DMA test? 

CPU usage with $ top -i 

interrupts with $ cat /proc/interrupts, or whichever method you prefer. 

I'd expect to see that the CPU's time is being shared between the xbutil task and another process or task. 
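
For example, a rough way to check (just a sketch; there are plenty of other ways) is to snapshot /proc/interrupts right before and after a run and diff the two, while keeping top open in another terminal:

$ cat /proc/interrupts > irq_before.txt
$ xbutil dmatest
$ cat /proc/interrupts > irq_after.txt
$ diff irq_before.txt irq_after.txt   # any line that changed is an interrupt source that fired during the test
$ top -i                              # in a second terminal while the test is running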

Here is a sample result from when I ran two instances of xbutil dmatest at the same time. The CPU usage for a single task, as reported by top -i, was about 60%.

$ xbutil dmatest
INFO: Found total 1 card(s), 1 are usable
Total DDR size: 65536 MB
Reporting from mem_topology:
Data Validity & DMA Test on bank0
Host -> PCIe -> FPGA write bandwidth = 5318.18 MB/s
Host <- PCIe <- FPGA read bandwidth = 11720.9 MB/s
Data Validity & DMA Test on bank1
Host -> PCIe -> FPGA write bandwidth = 4976.19 MB/s
Host <- PCIe <- FPGA read bandwidth = 11675.5 MB/s
Data Validity & DMA Test on bank2
Host -> PCIe -> FPGA write bandwidth = 5667.56 MB/s
Host <- PCIe <- FPGA read bandwidth = 11809.7 MB/s
Data Validity & DMA Test on bank3
Host -> PCIe -> FPGA write bandwidth = 4598.17 MB/s
Host <- PCIe <- FPGA read bandwidth = 11921.2 MB/s
INFO: xbutil dmatest succeeded.

Here is the result when running the test without any other processes:

xcowtstest40:/home/mcertosi $ xbutil dmatest
INFO: Found total 1 card(s), 1 are usable
Total DDR size: 65536 MB
Reporting from mem_topology:
Data Validity & DMA Test on bank0
Host -> PCIe -> FPGA write bandwidth = 11428.2 MB/s
Host <- PCIe <- FPGA read bandwidth = 11985.3 MB/s
Data Validity & DMA Test on bank1
Host -> PCIe -> FPGA write bandwidth = 11308 MB/s
Host <- PCIe <- FPGA read bandwidth = 12089.4 MB/s
Data Validity & DMA Test on bank2
Host -> PCIe -> FPGA write bandwidth = 11261.5 MB/s
Host <- PCIe <- FPGA read bandwidth = 12094.1 MB/s
Data Validity & DMA Test on bank3
Host -> PCIe -> FPGA write bandwidth = 11234.4 MB/s
Host <- PCIe <- FPGA read bandwidth = 12103.5 MB/s
INFO: xbutil dmatest succeeded.

 

If it is none of the above, try removing the card, cleaning the slot, and testing again in either the same PCIe slot or a different one. 

-M

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
liuyz (Adventurer)

Hi @mcertosi,

Thanks for the suggestions. I did find a process (tracker-extract) taking up some CPU load while "xbutil dmatest" is running.

After disabling that process, the reported DMA bandwidth gets a lot better, with a much higher chance of reaching ~10-12 GB/s for each bank. CPU usage can reach 80-100%.

I can still observe some variance, though. Sometimes the bandwidth just holds at ~5000-8000 MB/s for both read and write, while CPU usage is only around 30-40% and no other process consumes more than 2% of the CPU time (so there are still plenty of CPU resources left unused). When the reduced bandwidth is observed on bank0, the rest of the banks are also more likely to report a reduced bandwidth.
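
In case it is useful to anyone else, this is roughly what I did to keep tracker-extract out of the way while benchmarking (only a sketch; the gsettings keys are an assumption based on the GNOME Tracker version shipped with RHEL 7 and may differ on other systems):

$ pkill -f tracker-extract    # stop the metadata extractor for this session
$ pkill -f tracker-miner-fs   # stop the file-system miner that keeps re-spawning it
$ gsettings set org.freedesktop.Tracker.Miner.Files enable-monitors false   # (assumed key) turn off file monitoring
$ gsettings set org.freedesktop.Tracker.Miner.Files crawling-interval -2    # (assumed key) disable periodic crawling
$ xbutil dmatest              # re-run the DMA test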
