05-13-2019 01:44 AM
I would like to know how to enable I/O coherency on the Zynq UltraScale+ architecture.
I am using the ZCU102 development board, on which a custom real-time operating system runs on the cluster of four Cortex-A53 cores. At boot time the OS builds the translation tables for the MMU and the SMMU, enabling exception level EL0 to access the GEM3 device; control is then passed to the GEM driver (executing at EL0), which builds the buffer descriptors for the GEM DMA and starts the device. When an Ethernet frame is received, the SMMU raises a permission fault because the GEM device tries to access the wrong address, 0. The problem goes away if the buffer-descriptor memory is configured as non-cacheable in the translation table: in that case the GEM device accesses the correct address and the SMMU doesn't raise any fault. However, I want to keep that memory cacheable.
This behavior suggests a coherency problem: the buffer descriptors are kept in the cache and the GEM device accesses a stale, non-coherent copy of them, even though the whole memory is configured as Normal, Outer Shareable.
Reading the documentation, I discovered that coherency in the outer domain is managed by the CCI-400 module, which is disabled by default. To enable it, at boot time I set all bits in the IOU_INTERCONNECT_ROUTE register (0xFF180408) and write 0xFFFFFFFF to the IOU_COHERENT_CTRL register (0xFF180400), but it still doesn't work.
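For reference, the boot-time writes described above can be sketched as below. The register offsets (0x400 and 0x408 from an assumed IOU_SLCR base of 0xFF180000) come straight from the addresses quoted in this post; check them against your TRM revision. The routine takes the base as a parameter so it can also be exercised off-target:

```c
#include <stdint.h>

/* Offsets quoted above: IOU_COHERENT_CTRL at 0xFF180400 and
 * IOU_INTERCONNECT_ROUTE at 0xFF180408, i.e. offsets 0x400 and 0x408
 * from an assumed IOU_SLCR base of 0xFF180000. */
#define IOU_SLCR_BASE           0xFF180000u
#define IOU_COHERENT_CTRL       0x400u
#define IOU_INTERCONNECT_ROUTE  0x408u

/* Perform the boot-time writes through a caller-supplied base pointer:
 * on target, pass (volatile uint32_t *)IOU_SLCR_BASE; off target, a
 * plain array works for testing the logic. */
static void enable_iou_coherency(volatile uint32_t *iou_slcr)
{
    iou_slcr[IOU_INTERCONNECT_ROUTE / 4u] = 0xFFFFFFFFu; /* route masters toward the CCI path */
    iou_slcr[IOU_COHERENT_CTRL / 4u]      = 0xFFFFFFFFu; /* request coherent traffic for them */
}
```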
So I also tried to set the Enable_snoops bit in the Snoop_Control_Register_S3 of the CCI-400 module, but this write raises a fault.
How can I solve the problem?
Could the problem be that the cache is physically indexed, physically tagged (PIPT) and the virtual addresses differ from the physical addresses?
Thank you very much,
05-14-2019 06:34 PM
I'll try my best to cover your numerous questions and hypotheses.
1 - Physical / logical addresses - the GEM DMA reads/writes physical memory addresses, regardless of how the MMU remaps memory sections. So you must put the physical memory addresses in the GEM DMA descriptors, not the logical addresses. The CPU sees the logical addresses; if logical differs from physical, you need to convert the addresses manually for the GEM DMA.
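Point 1 can be sketched as below. The descriptor layout and the `virt_to_phys` helper are illustrative assumptions (the sketch models a simple linear mapping via a fixed offset; a real RTOS would provide its own page-table walk), not the exact TRM field names:

```c
#include <stdint.h>

/* Hypothetical GEM RX descriptor: word 0 holds the buffer address with
 * its low bits reused as control flags. Illustrative only. */
struct gem_rx_desc {
    uint32_t addr;    /* physical buffer address + ownership/wrap bits */
    uint32_t status;
};

/* Assumed helper: with a simple linear mapping, virtual-to-physical is
 * a fixed offset; a real MMU setup needs a page-table walk instead. */
static uint32_t virt_to_phys(const void *va, uint32_t va_to_pa_offset)
{
    return (uint32_t)(uintptr_t)va + va_to_pa_offset;
}

static void gem_fill_rx_desc(struct gem_rx_desc *d, void *buf,
                             uint32_t va_to_pa_offset)
{
    /* The DMA must see the PHYSICAL address, not the CPU's virtual one
     * (unless an SMMU is translating on the device's behalf). */
    d->addr   = virt_to_phys(buf, va_to_pa_offset) & ~0x3u; /* keep ctrl bits clear */
    d->status = 0;
}
```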
2 - I/O coherency - if the memory is set as "device" or "strongly ordered" then it is certainly not cached, and therefore coherent (in fact, no other set-up is more guaranteed to be coherent than this).
3 - Snoop control - the SCU (snoop control unit) is a module in the L1 data cache that tracks what is in the cache of each core and copies cache lines, when needed, from one core's cache to another's. It's in fact what makes the caches coherent (well, L1 only, as L2 is shared amongst all cores and therefore coherent). If the memory the GEM accesses is not cached, you cannot have coherency issues, because it's simply not cached.
4 - Buffers in cache - the cache must be flushed for the descriptors and the TX buffers, and invalidated for the RX buffers, before the GEM DMA can use them. If this is not done, what is in physical memory may not match what the CPU sees through its cache.
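The flush/invalidate in point 4 amounts to walking the buffer one cache line at a time. A minimal sketch of the address arithmetic is below; the actual maintenance on each line would use the A53's `DC CVAC` (clean, before the DMA reads TX data or descriptors) and `DC IVAC` (invalidate, before the CPU reads DMA-written RX data) instructions, which can't run off-target, so only the range computation is shown:

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64u  /* Cortex-A53 L1 data cache line size */

/* Compute the line-aligned start address and the number of cache lines
 * covering [buf, buf + len). A maintenance loop would then issue
 * DC CVAC or DC IVAC on exactly these line addresses, stepping by
 * CACHE_LINE from *start. */
static size_t cache_lines_for(uintptr_t buf, size_t len, uintptr_t *start)
{
    *start = buf & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = buf + len;
    return (end - *start + CACHE_LINE - 1) / CACHE_LINE;
}
```

Note that an unaligned buffer straddles extra lines, which is exactly how maintenance on one buffer can clobber an unrelated neighbour sharing a line.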
I can assure you that you don't need to do much about the cache to get the GEM up and running. Our RTOS does not use the BSP; it sets up the cache in a straightforward manner, and our GEM driver works with both cached and non-cached buffers.
One very important thing you may have missed in the TRM -------> the GEM DMA descriptors CAN'T be in cached memory. This is because they are 32 bytes while the A53 cache line is 64 bytes, so flushing / invalidating the cache for one descriptor affects another one.
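To illustrate the 32-versus-64-byte clash: when two 32-byte descriptors share one 64-byte line, invalidating the line for one descriptor can destroy in-flight updates to its neighbour. One defensive layout (an illustration, not a TRM recommendation) pads each descriptor out to a full cache line so maintenance on one line never touches another descriptor, at the cost of doubling the descriptor ring's footprint:

```c
#include <stdint.h>

/* A 32-byte GEM descriptor padded and aligned so it occupies exactly
 * one Cortex-A53 cache line. The 8-word body is a placeholder for the
 * real descriptor fields, not the actual GEM layout. */
struct gem_desc {
    uint32_t words[8];  /* the 32-byte descriptor itself */
    uint8_t  pad[32];   /* pad out to a full 64-byte cache line */
} __attribute__((aligned(64)));
```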
05-16-2019 09:11 AM
Thank you for your answer, but I think that my question wasn't very clear.
1 - In addition to the MMU, I am also using the SMMU. In this way the GEM DMA works correctly with virtual addresses.
2 - My problem isn't with the device memory (as you say, it is "strongly ordered" memory); the problem is with the memory that contains the buffer descriptors (which is normal cacheable memory).
4 - I think that if I configure the CCI-400, it should manage coherency between the cores' caches and the GEM DMA, because the buffer-descriptor memory is shareable. In other words, I would like to avoid flushing the cache manually.
If I set the buffer descriptors' memory as non-cacheable, my system works properly. Still, it would be nice if coherency were handled by the hardware.
05-16-2019 02:41 PM
I have tried but never gotten cached memory to work properly with the PL (e.g. enabling the CCI did not seem to help). I understand there is some magic needed when the firmware DMAs data into memory - there are 4 user bits and 4 cache control bits as part of a DMA transfer.
The question is: is there any write-up on how to get cache coherence to work with firmware DMA? Caching memory is very important for CPU performance, but the cached memory is sometimes inconsistent. For example: firmware writes to memory via DMA (e.g. with the memory mover) => software reads memory. If cached, the software sometimes reads incorrect data, even when using a DMA interface (e.g. HP0) that has cache coherence enabled. I have had to flush and invalidate to get things to work properly - why?
05-17-2019 01:04 PM
I am quite illiterate from the PL point of view... but a bit better with the PS. Bus transactions originating outside the CPUs/L1/L2 caches must go through the ACP (Accelerator Coherency Port) to access the L1 & L2 caches. If what you call the firmware DMA is the PL330 DMA, then you'll have to deal with the cache flushing / invalidation yourself, because that DMA cannot transfer data through the ACP (it has no connection/path to the ACP). The ACP is only accessible at the PL level: TRM v1.11, section 1.2.1, System Features: "Accelerator coherency port (ACP) from PL (master) to PS (slave)".
05-20-2019 07:02 AM
I think I am referring to the AXI DMA, not the PL330 DMA? I am willing to be flexible on what you call it, but I want the PL firmware to specify read and write addresses into physical memory and initiate transfers without involvement of the PS. For example, I have a memory mover that the firmware drives. It specifies the transfer address, transfer count, and cache control on the command interface, and then uses a stream interface to pass data between DRAM and internal firmware registers / buffers.
I would like this data to be consistent with the PS caches, so that when the firmware writes a region to memory, any PS cache lines covering that region are invalidated. Similarly, I would ideally like cached data that has not yet been written back to memory to be returned when the firmware reads from memory.
There is a CCI check box to enable cache coherence for the HP0 and HP1 ports. Does that work? Is it possible for the firmware to write to memory without requiring the PS software to do explicit invalidates before it uses a firmware-written buffer? And is it possible for the firmware to read memory that is cached, after the PS software has written to it, without the PS software doing explicit flushes before the firmware reads?
I have yet to be successful with this type of cache coherency. Is it possible? Is there any example or tutorial on how to make this work?