
Observer yuzheng
Registered: ‎12-10-2013

memory read is very slow when using mmap()


I reserved a big block of memory for video storage using the bootargs parameter memmap=xxx$xxxx.



In most cases, the logic implemented in the FPGA processes video frames in the reserved memory with no problem.


We also need to read and write the reserved memory from the ARM CPU side for further processing and networking. We use mmap() in a user-space application to get a virtual address, but read performance is very bad. It seems that the mapping obtained via mmap() is uncached.


I know that ioremap() in a kernel driver can set up such a mapping with caching enabled, but that is available in kernel space only.

How can I achieve the equivalent of ioremap() from a user-space application?


We are using a ZC706 with the v2014.4 kernel.







Scholar milosoftware
Registered: ‎10-26-2012

Re: memory read is very slow when using mmap()

Mapping non-kernel memory results in uncached memory.


If you just want to speed up processing, access the memory in large chunks. Use the NEON instructions to fetch chunks of 128-bits of data. This will give access speeds close to the DDR bandwidth.


Another approach is to memcpy() a range of data into a cacheable area (e.g. from malloc), process it there, and when done memcpy() the results back into the FPGA area for further processing. This may sound inefficient, but if the chunks fit into the cache (L1 or L2), the data won't actually be written out to DDR RAM, so the speed will be close to having cacheable memory. It is very important to keep all processing within the cache, so process large frames in tiles or scan lines to prevent cache thrashing.


If that does not help, you'll need to write a driver. The driver can arrange cacheable memory that is contiguous and accessible by both the FPGA and software. Note that the overhead of cache flushing/invalidation is large; just copying the data to another address isn't much slower. You can get rid of that overhead by using the ACP port, but that comes with a penalty on writes to DDR (the ACP can read/write cache at 1200 MB/s, but I've measured it dropping to about 250 MB/s when transfers have to go to DDR), and it won't do any good if you copy large regions that don't fit in the 512 kB L2 cache.

