11-01-2013 08:58 AM
Actually, I built an architecture to transfer data from a host Linux PC to the DDR3 on the KC705 board.
I obtain only about 2.4 Gbps (gigabits per second) of bandwidth for a transfer of 1 MB (megabytes).
The setup:
Latest Ubuntu, 32-bit
Driver written on the basis of xapp1052, with bus mastering
The data transfer works; all data arrive correctly
Xilinx KC705 board
AXI to PCIe bridge: Gen 2.0, x4
Max payload is 128 bytes
Max read request size is 512 bytes
AXI architecture: 128 bits wide, 125 MHz clock
MIG with DDR3 at max frequency (800 MHz for the DDR3, 200 MHz for this AXI branch, AXI 128 bits)
Custom DMA module designed with Xilinx Vivado HLS
Connected to the AXI architecture, 128 bits, 125 MHz
Parameters: source pointer, destination pointer, size
The C code is just 2 memcpy calls: the data has to pass through an intermediate local buffer.
This DMA reads the data from the Linux RAM (with the correct address translation) and writes it into the DDR3.
A burst is 512 bytes because of the PCIe max read request size restriction, so I loop over many memcpy calls (2048 iterations to transfer 1 MB).
HLS reports a latency of 75 clock cycles to execute one loop iteration.
So the datapath is correct, because I receive the expected data.
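For readers who have not used HLS for this: the two-memcpy loop described above might look roughly like the sketch below. This is a hypothetical reconstruction, not the poster's actual source; the function and buffer names are invented, and real HLS code would add interface pragmas for the AXI master ports.

```c
#include <string.h>
#include <stdint.h>

#define BURST_BYTES 512u  /* burst size limited by the PCIe max read request size */

/* Hypothetical sketch of the HLS DMA kernel: copy `size` bytes from the
 * host-RAM pointer to the DDR3 pointer through a small intermediate local
 * buffer, one 512-byte burst at a time (2048 iterations for 1 MB). */
void dma_copy(const uint8_t *src, uint8_t *dst, uint32_t size)
{
    uint8_t local_buf[BURST_BYTES];  /* intermediate local buffer (BRAM in HLS) */
    for (uint32_t off = 0; off < size; off += BURST_BYTES) {
        memcpy(local_buf, src + off, BURST_BYTES); /* burst read over PCIe  */
        memcpy(dst + off, local_buf, BURST_BYTES); /* burst write into DDR3 */
    }
}
```

In hardware, each memcpy is synthesized into an AXI burst on the corresponding master port, which is why the reported per-iteration latency maps directly to per-burst latency.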
I measure the time in my driver: I start a counter just before sending the command (over PCIe, via AXI-Lite) to start the DMA, and I stop it when I enter the interrupt handler.
I measure about 3.7 ms for 1 MB of payload data (I omit the overhead of the TLP headers). This time is for one direction only (read channel).
The theoretical bandwidth for PCIe Gen 2 x4 is 16 Gbps per direction!
Do you have any idea, advice, or suggestion to help me?
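As a sanity check on those numbers, the measurement above can be converted to a bandwidth with a one-line helper (1 MB in 3.7 ms works out to roughly 2.27 Gbps, consistent with the ~2.4 Gbps quoted at the top of the thread):

```c
/* Convert a measured transfer (payload bytes over elapsed seconds) to Gbps.
 * Example from this thread: 1 MB in 3.7 ms gives about 2.27 Gbps, far below
 * the 16 Gbps theoretical limit of PCIe Gen2 x4 per direction. */
double to_gbps(double bytes, double seconds)
{
    return bytes * 8.0 / seconds / 1e9;
}
```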
11-01-2013 10:00 AM
Moving this post to the PCIe board, where it can be answered better.
11-02-2013 01:00 PM
Hi, thank you for the move.
Some new information: I have added some ChipScope probes to my design. I can trigger on the start of my DMA module and stop the acquisition when the module raises the interrupt.
The latency measured for the DMA transfer is exactly what Vivado HLS reported (75 clock cycles to transfer 512 bytes from the Linux host RAM to the FPGA board's DDR3).
The latency to fetch the 512 bytes from the Linux host is 175 clock cycles (the AXI clock is 125 MHz). So this is the latency to get data from the Linux RAM into the PCIe IP core.
I think my problem is here, because this latency is very high: it corresponds to about 2.7 Gbps of bandwidth for the PCIe transfer alone, whereas inside my FPGA design I have calculated a bandwidth of about 6 Gbps.
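The cycle counts can be turned into throughput figures the same way. A rough back-of-the-envelope helper (512-byte bursts at 125 MHz yields about 2.9 Gbps for 175 cycles, close to the ~2.7 Gbps quoted, and about 6.8 Gbps for 75 cycles, matching the ~6 Gbps internal figure):

```c
/* Bandwidth implied by a per-burst latency: `burst_bytes` bytes are moved
 * every `cycles` clock periods at `clk_hz`. With 512-byte bursts on a
 * 125 MHz AXI clock:
 *   175 cycles -> ~2.9 Gbps (PCIe fetch path)
 *    75 cycles -> ~6.8 Gbps (internal datapath) */
double burst_gbps(double burst_bytes, double cycles, double clk_hz)
{
    double seconds = cycles / clk_hz;
    return burst_bytes * 8.0 / seconds / 1e9;
}
```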
On the Linux host, the command lspci -vvv reports for the Xilinx board:
Max payload = 128 bytes
Max read request size = 512 bytes
I do the DMA transfer directly from the Linux RAM to the FPGA, so there is no latency or wasted time in the PCIe driver.
My bandwidth is much lower than the theoretical 16 Gbps.
Attached is a ChipScope screen capture. It begins when the DMA module requests a read from the PCIe IP core, followed by the DMA transfer composed of the 2 synthesized memcpy operations (one to acquire data from the PCIe core, one to write the data into the DDR3). The interrupt is not visible on this capture because I transfer 1 KB on this run.
Any idea or suggestion?
11-05-2013 08:20 AM
I ran my own PCIe-DMA-DDR3 tests on the KC705. I used the "axis pcie v1.08a" bridge. I configured the PCIe as Gen2 x4, but I ran the tests on a PC with PCIe Gen1. I got the following test results:
770 MB/s - transferring data from the KC705 to the PC.
680 MB/s - transferring data from the PC to the KC705.
I need to change the "payload" setting in the "axis pcie v1.08a" bridge. How can I do it?
And another question: in the "axis pcie v1.08a" bridge, dynamic address translation does not work! I set "C_INCLUDE_BAROFFSET_REG" and use the function "XAxiPcie_SetLocalBusBar2PcieBar":
When I set address 0, the data is written to 0.
When I set an address close to 0, the data is also written to 0.
When I set the address the PC dedicated to DMA, the data is written somewhere unclear; I cannot find it in a memory dump!
Who can help?
11-05-2013 09:19 AM
Thank you for your answer.
I actually use the IP core AXI Bridge for PCI Express v2.2 (pg055.pdf) in Xilinx Vivado 2013.3.
Which tools do you use to generate your system? Your IP core does not seem to be the newest version(?).
I don't know how you can change the payload; in theory, all devices on the PCIe bus must use the same payload (the slowest device limits all the devices on the bus). On my host computer, the PCI bridge (Intel chipset) has a max payload of 128 bytes, while the IP core is capable of 256 bytes.
The IP core limits the max read request size to 512 bytes.
What is your host computer configuration (Linux, Windows)? If you are on Linux, try sudo lspci -vvv to see the payload of the other devices.
Are you sure of the memory mapping of your system (FPGA side AND driver side)? Be sure to work with physical addresses when doing DMA transfers from the host RAM to the FPGA memory.
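On the "physical addresses" point: in a Linux driver the usual way to get a buffer the FPGA can target is the DMA API. The fragment below is a hypothetical sketch (the function name and structure are invented, and it only compiles against the kernel headers); the key idea is that the FPGA must be given the bus address, never the kernel virtual address.

```c
#include <linux/pci.h>
#include <linux/dma-mapping.h>

/* Hypothetical driver fragment: allocate a DMA-coherent buffer and hand
 * its *bus* address (not the kernel virtual address) to the FPGA DMA
 * engine over AXI-Lite. */
static void *cpu_addr;      /* what the CPU reads/writes (virtual address) */
static dma_addr_t bus_addr; /* what the FPGA must use over PCIe */

static int setup_dma_buffer(struct pci_dev *pdev, size_t size)
{
    pci_set_master(pdev);  /* enable bus mastering, as in xapp1052 */
    cpu_addr = dma_alloc_coherent(&pdev->dev, size, &bus_addr, GFP_KERNEL);
    if (!cpu_addr)
        return -ENOMEM;
    /* program `bus_addr` into the DMA source-pointer register here */
    return 0;
}
```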
11-06-2013 02:06 AM
I used ISE 14.6.
The internal DDR3 test on the KC705 using DMA (DMA-DDR3) gives 1800 MB/s.
A "memcpy" test (on the PC side) to the KC705 showed 7 MB/s.
"Are you sure of the memory mapping of your system (FPGA side AND driver side)? Be sure to work with physical addresses when doing DMA transfers from the host RAM to the FPGA memory."
I copied some data from the PC to the KC705's DDR3 using DMA, and then I copied that data from the KC705's DDR3 back to the PC's memory using "memcpy".
And I found that dynamic address translation is working! The value to write must be a multiple of the size of the BAR!
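That finding is consistent with how BAR translation registers usually work: the register supplies the high-order bits of the PCIe address and the AXI offset within the BAR supplies the low-order bits, so only values aligned to the BAR size take effect. A hypothetical helper (assuming the BAR size is a power of two) that splits an arbitrary host DMA address into a programmable base and a residual offset:

```c
#include <stdint.h>

/* Split a host DMA address into a BAR-aligned base (the value to program
 * into the translation register) and the remaining offset to add to the
 * AXI-side access. Assumes `bar_size` is a power of two. */
uint64_t bar_translation(uint64_t host_addr, uint64_t bar_size, uint64_t *offset)
{
    uint64_t base = host_addr & ~(bar_size - 1);
    *offset = host_addr - base;
    return base;
}
```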
11-06-2013 06:23 AM
Yes, you are right about the IP core version (the version is not the same under ISE and Vivado).
Good news about the address translation.
You obtain 1800 MB/s for transfers inside the FPGA (DDR3-DMA-DDR3)?
And 7 MB/s is PC-to-DDR3 without DMA?
11-06-2013 06:43 AM
"You obtain 1800 MB/s for transfers inside the FPGA (DDR3-DMA-DDR3)?"
Yes. It is an internal transfer inside the FPGA, "DDR3-DMA-DDR3".
"And 7 MB/s is PC-to-DDR3 without DMA?"
Yes. I used "memcpy" on the PC's side.
Best regards, Dima.
06-12-2014 05:08 PM
A question: on your host side, what did you have to do to allow PCIe cycles to access host memory?
I am working on a Zynq PCIe design, and I cannot get AXI-to-PCIe cycles to work (bus mastering).
I do not know whether the FPGA core is the problem (AXI to PCIe Bridge in Vivado IP Builder 2014.1, and now 2014.2) or whether our host Linux driver is not creating the memory area correctly (Ubuntu).
I will rent a PCIe analyzer to see whether the FPGA is even starting a PCIe cycle on the host side.
06-19-2014 05:04 AM
I want to do something like you did ("make an architecture to transfer data from the host PC to the DDR3"), but on an ML605 board instead of the KC705 board. I don't know how; can you help me with any guideline or documentation?