jonkar
Visitor

Zynq: DDR3 performance issue

Hi,

 

We are using part of the DDR3 memory as shared memory between the PL and the PS. We use the area that is already reserved in xilinx_memory_init() -> memblock_remove(0xF000000, 0x1000000).

 

We use mmap in our user-space application to map it into virtual memory. When we memcpy 620 kB from the shared memory into a heap-allocated buffer, it takes approximately 3.2 ms, which is very slow.
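
For reference, here is roughly how we map it (a minimal sketch, not our exact code; it assumes /dev/mem access to the region reserved by the memblock_remove() above):

/* Sketch: map the reserved region via /dev/mem and copy out of it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define SHMEM_PHYS 0x0F000000UL   /* base of the reserved region */
#define SHMEM_SIZE 0x01000000UL   /* 16 MiB */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }

    /* On ARM, an O_SYNC /dev/mem mapping is typically uncached,
     * which is presumably why the copy below is so slow. */
    void *shm = mmap(NULL, SHMEM_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, SHMEM_PHYS);
    if (shm == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    static char buf[620 * 1024];
    memcpy(buf, shm, sizeof(buf));   /* takes ~3.2 ms for us */

    munmap(shm, SHMEM_SIZE);
    close(fd);
    return 0;
}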

 

I suspect that the memory area is not cached, which hurts performance. Does anyone know how to make this memory area cacheable? If I have understood correctly, the SCU should handle cache coherency for the PL as well?

 

Does anyone have any other ideas on how to speed this up?

 

I have heard that Xilinx has a demo application that uses this area for frame buffers between the PL and PS. How is the memory used by the PS in that application?

 

Thanks in advance!

linnj
Xilinx Employee

The video frame buffer design passes kernel boot args so the kernel never claims that memory (it now uses memory locations other than those in the memblock_remove). The driver then does an ioremap with write-combining turned on (ioremap_wc).
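
Roughly, the driver side looks like this (a minimal sketch; the base address and size are placeholders, not the actual values the TRD uses):

/* Sketch of a driver mapping memory held back from the kernel
 * (e.g. via a mem= boot argument) with write-combining. */
#include <linux/io.h>
#include <linux/module.h>

#define FB_PHYS 0x0F000000UL   /* hypothetical frame buffer base */
#define FB_SIZE 0x01000000UL   /* hypothetical size, 16 MiB */

static void __iomem *fb_virt;

static int __init fb_init(void)
{
    /* Write-combined: CPU writes are buffered and merged, which is
     * much faster for streaming writes than a plain ioremap(). */
    fb_virt = ioremap_wc(FB_PHYS, FB_SIZE);
    if (!fb_virt)
        return -ENOMEM;
    return 0;
}

static void __exit fb_exit(void)
{
    iounmap(fb_virt);
}

module_init(fb_init);
module_exit(fb_exit);
MODULE_LICENSE("GPL");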

Thanks.
linnj
Xilinx Employee

After asking some others (as I'm not doing work in that area), I'm hearing that for high performance you really need to use a pull model, where a DMA in the PL pulls from the DDR. The CPU then does cached writes into DDR, and the DMA in the PL (not the hard DMA in the PS) pulls the data from the DDR.

I'm told the technical reference manual (TRM) online and the targeted reference design (TRD) for the ZC702 on xilinx.com should have more information that helps in this area.

Thanks.
jonkar
Visitor

Thanks for the quick reply!

 

I was hoping for a user-space solution to make the area cacheable.

 

For now we have been able to redesign the system so that we can live with the slow copy, but that is just temporary. I'll try the ioremap_wc later.

 

If you think of a way to make the area cacheable when mapping it in user space, please let me know.

 

Kind regards,

Jonas

nbedbury
Observer

John,

 

According to your post, CPU writes to DDR are cached.  Are CPU transactions cached by default with the Linux kernel, or do caches need to be enabled via a driver or boot option?  Also, are PL DMA transactions cached in the setup you described? 

 

I ask because I'm interested in benchmarking the performance of the HP AXI versus ACP AXI connections for transfers from PL to DDR.  I expected the ACP connection to perform slightly better than a single HP connection due to the use of cache.  However, so far, both interfaces have been performing identically.  So I am unsure if cache is disabled for PL accesses, or if there's some other issue causing poor(er) ACP performance.

 

Nick

linnj
Xilinx Employee

Hi Nick

 

When you speak of caching, it's not clear exactly what you are referring to. The kernel itself does use caches by default. Driver calls to ioremap are usually non-cached, which makes sense for most device I/O, but if I remember right the ioremap call can also request cached memory.

 

One of the issues is that ARM in general is treated by Linux as a non-coherent architecture. The definition of coherency I'm using (though it seems inconsistent or confusing in lots of places) is that for I/O, the CPU is responsible for invalidating and flushing the caches at the right times around DMA transactions. Because of this, the DMA infrastructure in Linux takes care of that, but it assumes non-coherent, and this is all determined at kernel build time. There is no mixed coherent and non-coherent support yet, but it appears to be coming in 3.7, as there are some patches I've tested for this.
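
To illustrate what "the DMA infrastructure takes care of that" means, here is a sketch using the streaming DMA API (the device, buffer, and direction are placeholders):

/* Sketch: the streaming DMA API performs the cache maintenance
 * around a transfer on a non-coherent system. */
#include <linux/dma-mapping.h>

static int start_transfer(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* On non-coherent ARM this cleans/flushes the CPU cache for the
     * buffer so the device sees current data, then returns a bus
     * address to program into the DMA engine. */
    handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(dev, handle))
        return -ENOMEM;

    /* ... program the PL DMA with 'handle', wait for completion ... */

    dma_unmap_single(dev, handle, len, DMA_TO_DEVICE);
    return 0;
}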

 

So by default the DMA transactions still incur cache operations even for ACP, which means the performance won't be better. The other consideration is that the DMA in the PL must be set up for coherent transactions, as they can also be non-coherent. All of this is still evolving a bit at Xilinx.

 

Hope that helps a bit.

balister
Adventurer

The best discussion of the issues seems to be in the linaro-mm-sig. Sadly, the Zynq kernel is not close to tracking current Linux development, so it would require backporting patches to test the DMA API work done there.

 

I'd love to see a Xilinx bleeding-edge tree focused on recent kernels, with the Xilinx drivers in shape to submit upstream. That would make things so much easier.

 

Philip
