01-09-2019 05:56 AM - edited 01-09-2019 06:03 AM
I am developing a project that needs high-speed memcpy (more than 2 GB/s). After running into trouble (~100 MB/s for an AXI DMA transfer into the driver's kernel space plus a memcpy to userspace), I wrote a simple memcpy benchmark and got only 1 GB/s for a pure memcpy! Even "time dd if=/dev/mem of=/dev/zero bs=1M count=100" gives 2.4 GB/s.
The board has a DDR4-2133 module, which gives 17 GB/s of raw performance.
I created the standard Vivado 2017.4 block design for this board without changing the memory settings (just verified that they are correct), and then in PetaLinux I changed nothing related to DDR4 in what it produces from the HDF.
Please help me understand where I am losing ~10x of memory speed.
01-24-2019 09:10 AM
01-24-2019 09:26 AM - edited 01-24-2019 09:26 AM
01-24-2019 09:40 AM - edited 01-24-2019 09:44 AM
The maximum rate of DDR (e.g., 17 GB/s) is only guaranteed for 64 bytes at a time. That's the amount of data that can be transferred across a DDR3/4 memory interface in the most basic transaction: the simple burst access. How long you can sustain that rate is up to you. You cannot haphazardly access DDR memory and expect a throughput like that. If you're trying to use a CPU to move data out of one spot of DDR memory and then write it into another spot, you can give up on getting anywhere close to the maximum throughput.
The key to maximizing bandwidth is maintaining continuous flows of data into or out of the memory. Once you stop a flow and then change course, you pay a throughput penalty. These penalties add up quickly when you try to move small groups of data with a CPU. Instead, use DMA to burst data from a source DDR memory location to OCM, and then from OCM to a destination DDR memory location. You'll see a substantial increase in throughput.
01-24-2019 01:17 PM - edited 01-24-2019 01:19 PM
Thank you very much.
Your board has the PS DDR interface in a x32 configuration, which is half the width of the SO-DIMM module, and you got a ~2x slower dd result.
This matches the situation I have.
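The raw-bandwidth arithmetic behind that comparison (assuming a 2133 MT/s data rate in both cases):

```latex
\text{x64 SO-DIMM:} \quad 2133\,\text{MT/s} \times 8\,\text{B/transfer} \approx 17.1\,\text{GB/s}
\text{x32 PS DDR:}  \quad 2133\,\text{MT/s} \times 4\,\text{B/transfer} \approx 8.5\,\text{GB/s}
```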
01-24-2019 03:13 PM
It sounds like you're looking for a software-centric solution. I'm a hardware guy by trade, so I'm not sure where you can find such information.
This looks to be a decent tutorial on using AXI DMA:
The goal in the tutorial should be transferable to using GDMA in the PSU, but the tutorial does use an AXI DMA IP in the PL. Most people who target Zynq/Zynq MPSoC want to move data between the PL and PSU, so it seems most information available deals with using AXI DMA. An AXI DMA IP can still move data from PSU DDR to PSU DDR, though.
Best of luck.
01-24-2019 03:24 PM - edited 01-24-2019 03:26 PM
Yes, I'm looking for a SW-centric solution.
I have read many topics related to AXI DMA, including the one at your link.
I have already implemented a working AXI DMA and get 1 GB/s of data into Linux kernel space through the driver.
The problem is that the contiguous-memory-allocation block is restricted in size (at more than ~1 GB the PetaLinux build fails), so I need to move data to userspace during the data transmission.
P.S. Yes, I have a backup plan to restrict the memory available to Linux and manage the rest manually, but for now I want the nice driver-based solution.