xpaillard
Participant
16,608 Views
Registered: ‎04-11-2013

KC705 - PCIe - DMA - DDR3 SLOW BANDWIDTH

Hi,

 

I am currently building an architecture to transfer data from a Linux host PC to the DDR3 memory on the KC705 board.

I get only about 2.4 Gbps (gigabits per second) of bandwidth for a 1 MB (megabyte) transfer.

The setup:

Latest Ubuntu, 32-bit

The driver is based on xapp1052, with bus mastering.

The data transfer works and all the data arrives correctly.

 

Xilinx Board KC705

AXI to PCIe bridge : Gen 2.0 - x4

Max payload is 128 bytes

Max request size is 512 bytes

AXI architecture : width 128 bits, clock 125 MHz

MIG with DDR3 at max frequency (800 MHz for the DDR3, 200 MHz for this AXI branch, AXI 128 bits)

 

Custom DMA module designed with Xilinx Vivado HLS

Connected to the AXI interconnect, 128 bits, 125 MHz

Parameters: source pointer, destination pointer, size

The C code is just 2 memcpy calls, because the data has to pass through an intermediate local buffer.

The DMA reads the data from Linux RAM (with the correct address translation) and writes it into the DDR3.

A burst is 512 bytes because of the PCIe max read request size, so I loop over many memcpy calls (currently 2048 iterations to move 1 MB).

HLS reports a latency of 75 clock cycles for one loop iteration.
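As a rough cross-check of the figures above, the HLS latency can be turned into a best-case transfer time and bandwidth. This is only a sketch: it assumes that "1 MB" means 2**20 bytes (2048 bursts of 512 bytes), that the 75 cycles are counted on the 125 MHz AXI clock, and that iterations run back to back with no extra stalls.

```python
# Figures quoted in the post; 1 MB is assumed to mean 2**20 bytes.
BURST_BYTES = 512
ITERATIONS = 2048
CYCLES_PER_ITERATION = 75      # loop latency reported by Vivado HLS
AXI_CLOCK_HZ = 125e6           # assumed clock for the HLS cycle count


def hls_estimate():
    """Return (total_bytes, seconds, gbps) for the HLS-reported loop latency."""
    total_bytes = BURST_BYTES * ITERATIONS
    seconds = ITERATIONS * CYCLES_PER_ITERATION / AXI_CLOCK_HZ
    gbps = total_bytes * 8 / seconds / 1e9
    return total_bytes, seconds, gbps


total_bytes, seconds, gbps = hls_estimate()
print(f"{total_bytes} bytes in {seconds * 1e3:.3f} ms -> {gbps:.2f} Gbps")
```

Under these assumptions the HLS figure alone would allow roughly 1.2 ms per megabyte, so the DMA loop as estimated by HLS is not what limits the transfer.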

 

So the datapath is correct, because I receive the right data.

I measure the time in my driver: I start a timer just before I send the command (over PCIe, via AXI4-Lite) that starts the DMA, and I stop it when I enter the interrupt handler.

I measure about 3.7 ms for 1 MB of useful data (I leave out the overhead of the TLP headers). This time is for one direction only (read channel).

The theoretical bandwidth of PCIe Gen 2 x4 is 16 Gbps per direction!
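For reference, here is how the 16 Gbps figure and the measured rate compare. The sketch assumes that "1 MB" is 2**20 bytes and uses the 3.7 ms driver-side time as-is, so it includes the command write and the interrupt latency. The 16 Gbps figure is the rate after 8b/10b encoding and before any TLP overhead.

```python
def pcie_gen2_data_rate_gbps(lanes):
    """Per-direction data rate after 8b/10b encoding, before TLP overhead."""
    raw_gtps = 5.0 * lanes      # Gen 2 signals at 5 GT/s per lane
    return raw_gtps * 8 / 10    # 8b/10b: 8 data bits per 10 line bits


def measured_gbps(payload_bytes, seconds):
    """Payload throughput in Gbps for a timed transfer."""
    return payload_bytes * 8 / seconds / 1e9


link = pcie_gen2_data_rate_gbps(lanes=4)
# Assumption: "1 MB" is 2**20 bytes; 3.7 ms is the time measured in the driver.
measured = measured_gbps(2**20, 3.7e-3)
print(f"link: {link:.1f} Gbps, measured: {measured:.2f} Gbps, "
      f"utilisation: {measured / link:.0%}")
```

With these inputs the measured rate comes out a little under 2.3 Gbps, in the same range as the roughly 2.4 Gbps quoted above and well below the link rate.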

 

Do you have any ideas, advice, or suggestions that could help me?

Many thanks

0 Kudos
11 Replies
yenigal
Xilinx Employee
16,601 Views
Registered: ‎02-06-2013

Hi

 

Moving this post to the PCIe board, where it can be answered better.

Regards,

Satish

0 Kudos
xpaillard
Participant
16,565 Views
Registered: ‎04-11-2013

Hi, thank you for the move.

Some new information. I have added some ChipScope probes to my design. I can trigger on the start of my DMA module and stop the acquisition when the module sends the interrupt.

The latency measured for the DMA transfer is exactly what Vivado HLS reports (currently 75 clock cycles to transfer 512 bytes from Linux host RAM to the DDR3 on the FPGA board).

The latency to fetch the 512 bytes from the Linux host is 175 clock cycles (the AXI clock runs at 125 MHz). So this is the latency to get the data from Linux RAM to the PCIe IP core.

I think my problem is here, because this latency is very large. It corresponds to about 2.7 Gbps of bandwidth for the PCIe transfer alone, whereas I calculated a bandwidth of about 6 Gbps inside my FPGA design.
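The cycle counts above translate into bandwidth as sketched below. The key assumption, which is not stated in the post, is that only one 512-byte burst is in flight at a time, so each burst costs its full latency. The last line uses the 3.7 ms per 2048 bursts measured in the driver to estimate the average cost of one burst.

```python
AXI_CLOCK_HZ = 125e6
BURST_BYTES = 512


def gbps_for_cycles(cycles, burst_bytes=BURST_BYTES, clock_hz=AXI_CLOCK_HZ):
    """Throughput if each burst takes `cycles` clocks and bursts never overlap."""
    return burst_bytes * 8 / (cycles / clock_hz) / 1e9


def cycles_per_burst(total_seconds, bursts, clock_hz=AXI_CLOCK_HZ):
    """Average clock cycles per burst implied by an end-to-end time."""
    return total_seconds / bursts * clock_hz


print(f"175-cycle read: {gbps_for_cycles(175):.2f} Gbps")
print(f"75-cycle copy:  {gbps_for_cycles(75):.2f} Gbps")
print(f"3.7 ms / 2048 bursts: {cycles_per_burst(3.7e-3, 2048):.0f} cycles per burst")
```

Under that assumption the results are roughly in line with the 2.7 Gbps and 6 Gbps quoted above, and the driver-side measurement works out to somewhat more than 175 cycles per burst. That is consistent with the read latency, rather than the link rate, dominating the transfer time.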

On the Linux host, lspci -vvv reports the following for the Xilinx board:

Gen 2.0

5 GT/s
4 lanes
Max payload = 128 bytes
Max read request size = 512 bytes

I do DMA transfers directly from Linux RAM to the FPGA, so no latency or time is lost in the PCIe driver.

My bandwidth is far below the theoretical 16 Gbps.

 

The attached file is a screenshot of the ChipScope capture. It starts when the DMA module asks the PCIe IP core to read data, followed by the DMA transfer, which consists of 2 synthesized memcpy calls (one to fetch the data from the PCIe core and one to write it into the DDR3). The interrupt is not visible in this capture because I transfer 1 KB in this run.

Any ideas or suggestions?

Thank you.

Capture.JPG
0 Kudos
xpaillard
Participant
16,534 Views
Registered: ‎04-11-2013

Hi,

 

Has nobody tested the AXI Memory Mapped PCIe core (with some benchmarks)?

 

Thank you.

0 Kudos
366155
Participant
16,518 Views
Registered: ‎08-12-2013

Hi!
I ran my own PCIe-DMA-DDR3 tests on the KC705. I used the "axis pcie v1.08a" bridge. I configured PCIe as x4 Gen2, but I ran the tests on a PC with PCIe Gen1. I got the following results:
770 MB/s - transferring data from the KC705 to the PC.
680 MB/s - transferring data from the PC to the KC705.
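To put these two numbers next to the Gbps figures earlier in the thread, the sketch below converts them and compares them with the Gen1 data rate. It assumes that the link trained at x4 on the Gen1 PC (the slot width is not stated) and that MB means 10**6 bytes.

```python
def pcie_data_rate_mbytes(gt_per_s, lanes):
    """Per-direction rate in MB/s (10**6 bytes) after 8b/10b encoding."""
    bits_per_s = gt_per_s * 1e9 * lanes * 8 / 10
    return bits_per_s / 8 / 1e6


# Assumption: the link runs at Gen1 (2.5 GT/s per lane) with 4 lanes.
gen1_x4 = pcie_data_rate_mbytes(2.5, 4)
for label, mbytes in (("KC705 -> PC", 770), ("PC -> KC705", 680)):
    print(f"{label}: {mbytes} MB/s = {mbytes * 8 / 1000:.2f} Gbps, "
          f"{mbytes / gen1_x4:.0%} of Gen1 x4 ({gen1_x4:.0f} MB/s)")
```

If those assumptions hold, both directions use a large share of what a Gen1 x4 link can carry before TLP overhead.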

I need to change the payload size in the "axis pcie v1.08a" bridge. How can I do it?
Another question: dynamic address translation does not work in the "axis pcie v1.08a" bridge! I set "C_INCLUDE_BAROFFSET_REG" and use the function "XAxiPcie_SetLocalBusBar2PcieBar":
When I set address 0, the data is written to 0.
When I set an address close to 0, the data is written to 0.
When I set the address that the PC allocated for DMA, it is not clear where the data is written: I cannot find it in a memory dump!

Can anyone help?

0 Kudos
xpaillard
Participant
16,510 Views
Registered: ‎04-11-2013

Hello,

 

thank you for your answer.

 

I currently use the IP core AXI Bridge for PCI Express v2.2 (pg055.pdf) in Xilinx Vivado 2013.3.

Which tools do you use to generate your system? Your IP core is not the newest version (?)

I don't know how you can change the payload. In theory, all devices connected to the PCIe bridge must use the same max payload (the slowest device limits all devices on the PCIe bus). On my host computer, the PCI bridge (Intel chipset) has a max payload of 128 bytes, but the IP core is capable of 256 bytes.

The IP core limits the max read request size to 512 bytes.

What is your host computer configuration (Linux, Windows)? If you are on Linux, try sudo lspci -vvv to see the payload of the other devices.

Are you sure about the memory mapping of your system (FPGA side AND driver side)? Make sure you work with physical addresses when doing DMA transfers from host RAM to FPGA memory.

0 Kudos
366155
Participant
16,496 Views
Registered: ‎08-12-2013

I used ISE 14.6.
The DDR3 test on the KC705 using DMA (DMA-DDR3) gives 1800 MB/s.
The "memcpy" test (on the PC side) to the KC705 showed 7 MB/s.

"Are you sure of your memory mapping of your system (FPGA side AND driver side)? Be sure to work with physical addresses to do DMA transfers from host RAM to FPGA memory."

I copied some data from the PC to the DDR3 of the KC705 using DMA, and then I copied this data from the DDR3 of the KC705 back to the PC's memory using "memcpy".

And I found that dynamic address translation does work! The value written must be a multiple of the BAR size!
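A small check for that rule could look like the sketch below. The helper names, the 64 KB window and the buffer address are all hypothetical, and the BAR size is assumed to be a power of two. If the bridge simply ignores the low address bits, rounding down like this would also be consistent with addresses close to 0 ending up at 0.

```python
def is_valid_translation_base(pcie_addr, bar_size):
    """True if pcie_addr is a multiple of bar_size (a power of two)."""
    if bar_size <= 0 or bar_size & (bar_size - 1):
        raise ValueError("bar_size must be a power of two")
    return pcie_addr % bar_size == 0


def align_down(pcie_addr, bar_size):
    """Largest multiple of bar_size that is <= pcie_addr."""
    return pcie_addr & ~(bar_size - 1)


BAR_SIZE = 0x10000      # hypothetical 64 KB window
addr = 0x3F2A4000       # hypothetical DMA buffer address on the PC

print(is_valid_translation_base(addr, BAR_SIZE))
print(hex(align_down(addr, BAR_SIZE)))
```

With a check like this, the driver can either allocate the DMA buffer on a BAR-size boundary or program the aligned base and add the remaining offset on the AXI side.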

0 Kudos
xpaillard
Participant
16,481 Views
Registered: ‎04-11-2013

Hi,

 

Yes, you are right about the IP core version (the version is not the same under ISE and Vivado).

Good news about the address translation.

You get 1800 MB/s for the transfer inside the FPGA, DDR3-DMA-DDR3?

And the 7 MB/s is PC to DDR3 without DMA?

 

Thank you,

 

Xavier

0 Kudos
366155
Participant
16,478 Views
Registered: ‎08-12-2013

Hi.

"You obtain 1800 MB/s for transfer into FPGA DDR3-DMA-DDR3"
Yes. It's an internal transfer inside the FPGA: "DDR3-DMA-DDR3".

 

"7 MB/s is PC-DDR3 without DMA ?" 
Yes. I used "memcpy" on the PC's side.

 

Best regards, Dima.

0 Kudos
gordwait
Observer
15,438 Views
Registered: ‎09-18-2007

A question, on your host side, what did you have to do to allow PCIe cycles to access host memory?

I am working on a Zynq PCIe design, and I cannot get AXI to PCIe cycles (bus mastering) to work.

I do not know whether the problem is the FPGA core (AXI to PCIe Bridge in Vivado IP Builder 2014.1 and now 2014.2) or whether our host Linux driver is not setting up the memory area correctly (Ubuntu).

I will rent a PCIe analyzer to see whether the FPGA is even starting a PCIe cycle on the host side.

 

 

0 Kudos
mjheyd
Visitor
9,952 Views
Registered: ‎06-19-2014

Hi,

I want to do something like what you did: "make an architecture to transfer data from host PC to the DDR3", but on the ML605 board instead of the KC705 board. I don't know how. Can you help me with any guidelines or documentation?

Many thanks

 

0 Kudos
sherbin
Visitor
9,304 Views
Registered: ‎01-21-2015

Hi, have you solved your low-bandwidth problem?

Also, can you tell me which DMA core you are using? Is it CDMA?

 

thanks!

0 Kudos