08-30-2017 06:57 AM
My project has logic in the PL writing to a buffer in DDR (via S_AXI_HP0) that my PS code allocated with dma_alloc_coherent. The documentation for this function states that: "a write by either the device or the processor can immediately be read by the processor or device without having to worry about caching effects."
However, I'm seeing that data arrives out of order, even though my PL logic writes to the buffer in order. Periodic reads of the buffer from the PS show that, in one test, the first 50 indices to change value in a 1024-entry buffer were: 24 25 26 27 28 29 30 31 48 49 50 51 52 53 54 55 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 32 33 34 35 36 37 38 39 112 113.
If I wait some time after the PL signals that it has written all the data, everything eventually shows up. But this is a high-performance streaming data application, and I need to use the data as soon as it's available. Does the Zynq implementation of dma_alloc_coherent not adhere to this part of the documentation? Or is there something more my PL logic needs to do to ensure the write is fully flushed to memory before informing the PS that it's complete (beyond waiting for the AXI burst-write OKAY response)? Or is there something else that could be causing the data to appear late and out of order?
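For context, the PS-side allocation looks roughly like the following sketch (kernel-driver style; `my_dev`, `my_alloc`, and `BUF_ENTRIES` are placeholder names, not my actual code):

```c
#include <linux/dma-mapping.h>
#include <linux/device.h>

#define BUF_ENTRIES 1024

static u32 *buf;           /* CPU virtual address of the buffer */
static dma_addr_t buf_dma; /* bus address handed to the PL logic */

static int my_alloc(struct device *my_dev)
{
	/* Per DMA-API.txt, this should return memory that both the
	 * CPU and the device can access without explicit cache
	 * maintenance. */
	buf = dma_alloc_coherent(my_dev, BUF_ENTRIES * sizeof(*buf),
				 &buf_dma, GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* buf_dma is what gets programmed into the PL as the
	 * S_AXI_HP0 write target address. */
	return 0;
}
```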
I'm using Vivado 2016.2 and the 2016.2 Linux kernel from Xilinx's GitHub.
09-01-2017 04:25 PM
My initial thought is that the PL AXI shim will not allow out-of-order execution regardless of the IP configuration, but reordering can occur on the back end depending on other settings. Overall, your data should show up correctly, though the latency of any given command may vary. My focus would be on the PS side, but unfortunately that's not my area of expertise.
09-12-2017 08:01 AM
The first response to https://forums.xilinx.com/t5/Embedded-Linux/Flush-cache-on-Zynq-under-Linux/td-p/541815 led me to using the calls outer_inv_range() and __cpuc_flush_dcache_area() to invalidate my cache, and this seems to have worked.
Further research showed that this is not the proper way to do it, though, and I've since moved to dma_sync_single_for_cpu() instead, which also does the trick.
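For anyone landing here later: dma_sync_single_for_cpu() belongs to the streaming DMA API, which normally pairs with a dma_map_single() mapping rather than a dma_alloc_coherent() buffer. The usual sequence looks roughly like this sketch (`dev`, `buf`, and `size` are placeholders):

```c
#include <linux/dma-mapping.h>

/* Map the buffer for device->memory transfers; the PL writes into
 * it, so the direction is DMA_FROM_DEVICE. */
dma_addr_t handle = dma_map_single(dev, buf, size, DMA_FROM_DEVICE);
if (dma_mapping_error(dev, handle))
	return -EIO;

/* ... program the PL with 'handle' and wait for it to finish ... */

/* Before the CPU reads the data, invalidate any stale cache lines
 * covering the buffer. */
dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);

/* The CPU may now safely read the freshly written data. */

/* If the PL will write the buffer again, hand ownership back to
 * the device before the next transfer. */
dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);
```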
I'm curious if anyone knows: is calling these cache management functions necessary on all architectures when using a buffer allocated by dma_alloc_coherent()? Or is this a peculiarity of Zynq and/or ARM? The Linux kernel's DMA-API.txt documentation seems to say it should not be necessary -- it says the calling code must "guarantee to the platform that you have all the correct and necessary sync points" only when using dma_alloc_noncoherent().