06-25-2012 02:31 PM
We are using part of the DDR3 memory as shared memory between the PL and the PS. We use the area that is already reserved in xilinx_memory_init() -> memblock_remove(0xF000000, 0x1000000).
We use mmap in our user-space application to map it into virtual memory. When we memcpy 620 kbytes from the shared memory into a heap-allocated buffer it takes approximately 3.2 ms, which is very slow.
I suspect that the memory area is not cached, which hurts performance. Does anyone know how to make this memory area cacheable? If I have understood it correctly, the SCU should handle cache coherency for the PL as well?
Does anyone have any other ideas on how to speed this up?
I have heard that Xilinx has a demo application that uses this area for frame buffers between PL and PS? How is the memory used by the PS in that application?
Thanks in advance!
06-25-2012 02:49 PM
06-25-2012 05:11 PM
06-27-2012 12:02 AM
Thanks for the quick reply!
I was hoping for a user-space solution to make the area cacheable.
For now we have been able to redesign the system so that we can live with the slow copy, but that is just temporary. I'll try the ioremap_wc later.
If you can think of a way to make the area cacheable when mapping it in user space, please let me know.
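One hedged possibility (my own sketch, not from the thread): a small character driver whose mmap handler hands the reserved region to user space. `/dev/mem` and most drivers mark the vma non-cached via `pgprot_noncached()`; leaving the default `vm_page_prot` in place instead gives a cacheable user mapping, at the cost of the driver then being responsible for coherency with the PL. All names and sizes here are illustrative; this fragment is not a complete, buildable module.

```c
/* Hypothetical driver mmap handler for the reserved region. */
#define SHM_PHYS 0x0F000000UL
#define SHM_SIZE 0x01000000UL

static int shm_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > SHM_SIZE)
        return -EINVAL;

    /* A non-cached mapping would add:
     *   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
     * We deliberately skip that so the user mapping stays cacheable. */
    return remap_pfn_range(vma, vma->vm_start, SHM_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}
```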
08-31-2012 09:03 AM
According to your post, CPU writes to DDR are cached. Are CPU transactions cached by default with the Linux kernel, or do caches need to be enabled via a driver or boot option? Also, are PL DMA transactions cached in the setup you described?
I ask because I'm interested in benchmarking the performance of the HP AXI versus ACP AXI connections for transfers from PL to DDR. I expected the ACP connection to perform slightly better than a single HP connection due to the use of cache. However, so far, both interfaces have been performing identically. So I am unsure if cache is disabled for PL accesses, or if there's some other issue causing poor(er) ACP performance.
09-04-2012 09:10 AM
When you speak of caching it's not clear exactly what you are referring to. The kernel itself does use caches by default. Driver calls to ioremap are usually non-cached, which makes sense for most device I/O, but if I remember right ioremap can also map memory as cached.
One of the issues is that the Linux ARM architecture in general is considered a non-coherent architecture. The definition of coherency I'm using (which seems inconsistent or confusing in lots of places) is that for I/O, the CPU is responsible for invalidating and flushing the caches at the right time around DMA transactions. Because of this the DMA infrastructure takes care of that in Linux, but it assumes non-coherent hardware, and this is all determined at kernel build time. There is no mixed coherent and non-coherent support yet, but it appears to be coming in 3.7, as there are some patches I've tested for this.
So by default the DMA transactions are doing cache operations even for ACP, which means the performance won't be better. The other consideration is that the DMA in the PL must be set up for coherent transactions, as they can be non-coherent. All of this is still evolving a bit at Xilinx.
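To illustrate the point about default cache operations, here is a hedged kernel fragment (my own sketch, not from the thread) using the streaming DMA API. On a non-coherent ARM kernel, `dma_map_single()` / `dma_unmap_single()` perform the cache maintenance around every transfer, and that cost is paid even when the PL master sits on the ACP port, which is one reason ACP may not beat HP in benchmarks. The function name is hypothetical; this is not a buildable module on its own.

```c
/* Streaming-DMA sketch for a PL-to-DDR transfer. */
static int pl_dma_from_device(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* On a non-coherent kernel this invalidates the CPU caches for
     * the buffer, whether or not the transfer goes through ACP. */
    handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the PL DMA engine with 'handle' and wait ... */

    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
    return 0;
}
```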
Hope that helps a bit.
09-05-2012 01:24 PM
The best discussion of the issues seems to be in the linaro-mm-sig. Sadly, the Zynq kernel is not close to tracking current Linux development, so it would require back-porting patches to test the DMA API work done by them.
I'd love to see a Xilinx bleeding tree that was focused on recent kernels and the Xilinx drivers in a position to submit upstream. That would make things so much easier.