Registered: ‎10-15-2015

Zynq ACP Coherency with Linux ioremap_cache

My design has several AXI DMA engines. The first of these needs to transmit and receive large, ideally contiguous buffers: tens of MBytes of data. I have connected these to AXI_HP ports, since these two streams have large memory bandwidth requirements and are probably best served there.


The other side of my data is in smaller blocks, up to 32 kBytes at a time. I have connected these to the ACP port. Some of this data needs random access, so it would strongly benefit from caching.


The software is a Linux device driver. I have limited the memory controlled by the kernel to 256 MBytes out of 512 MBytes. The upper 256 MBytes is split into two sections of 128 MBytes each.

The first of these 128 MByte sections is accessed via the AXI_HP ports, the second via the ACP port.
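As a rough sketch of the layout described above, the two reserved regions can be expressed as physical base addresses. This assumes the DDR starts at physical address 0, which is not stated in the post; the macro and function names are hypothetical.

```c
#include <stdint.h>

#define MiB (1024u * 1024u)

/* Assumption: DDR starts at physical address 0 (not stated in the post). */
#define DDR_BASE        0x00000000u
#define KERNEL_MEM_SIZE (256u * MiB)  /* memory left to the kernel (mem=256M) */
#define REGION_SIZE     (128u * MiB)  /* each of the two reserved sections */

/* Hypothetical helper: base of the section accessed via the AXI_HP ports. */
static inline uint32_t hp_region_base(void)
{
    return DDR_BASE + KERNEL_MEM_SIZE;
}

/* Hypothetical helper: base of the section accessed via the ACP port. */
static inline uint32_t acp_region_base(void)
{
    return hp_region_base() + REGION_SIZE;
}
```

Under that assumption the AXI_HP section starts at 0x10000000 and the ACP section at 0x18000000; these would be the physical addresses handed to the ioremap_* calls.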


For initial testing I mapped all this memory into kernel space using ioremap_nocache.

Like this it works, but I was hoping to be able to improve performance (from a software access point of view).


For the first section (accessed via AXI_HP) I have tried using ioremap_wc. I access the memory using:

wmb(); // From io.h note that barriers are used Before WRITE and After read.
if (copy_from_user((void *) ptr, (void __user *)param.ptr, sizeof(u_int32_t)*param.num))



if (copy_to_user((void __user *)param.ptr, (void *) ptr, sizeof(u_int32_t)*param.num))

The copy_*_user functions seem to use assembler macros that access the kernel address space memory (the I/O memory in my case) using the stmia instruction. The user buffer has to be accessed a word at a time, as each load goes through the user virtual address.

My understanding is that the stmia instruction would generate 8-word bursts with the memory marked as bufferable, even if it is not cacheable.

Testing access from the CPU, this gets only slightly faster when switching to ioremap_wc, and is much slower than when using ioremap_cache:

ioremap_nocache  161 MBytes/s

ioremap_wc  167 MBytes/s

ioremap_cache  281 MBytes/s

I was somewhat disappointed with the ioremap_wc performance, though this may be acceptable. For this area I could use ioremap_cache with explicit cache flush/invalidate calls, as the area is accessed in a few big blocks.


For the second section, accessed via the ACP port, I was expecting to be able to change to ioremap_cache and have the hardware ensure coherency. However, once I switch to ioremap_cache the system stops working completely. I assume this is a cache coherency issue with either the DMA descriptors or the DMA data. Actually it is so broken that it must be with the descriptors.


I have got the "Tie off User to ensure Coherency" ticked.

I have put the logic analyser on the AXI_MM bus to the ACP and see ARCACHE=3

I thought this was enough.

From the post

I think the problem might be with mismatched memory type attributes.

Following all this through the kernel is hard going, but looking at ./linux-xlnx/arch/arm/mm/proc-v7-2level.S

         * Memory region attributes with SCTLR.TRE=1
         *   n = TEX[0],C,B
         *   TR = PRRR[2n+1:2n]         - memory type
         *   IR = NMRR[2n+1:2n]         - inner cacheable property
         *   OR = NMRR[2n+17:2n+16]     - outer cacheable property
         *                      n       TR      IR      OR
         *   UNCACHED           000     00
         *   BUFFERABLE         001     10      00      00
         *   WRITETHROUGH       010     10      10      10
         *   WRITEBACK          011     10      11      11
         *   reserved           110
         *   WRITEALLOC         111     10      01      01
         *   DEV_SHARED         100     01
         *   DEV_NONSHARED      100     01
         *   DEV_WC             001     10
         *   DEV_CACHED         011     10
         * Other attributes:
         *   DS0 = PRRR[16] = 0         - device shareable property
         *   DS1 = PRRR[17] = 1         - device shareable property
         *   NS0 = PRRR[18] = 0         - normal shareable property
         *   NS1 = PRRR[19] = 1         - normal shareable property
         *   NOS = PRRR[24+n] = 1       - not outer shareable



It appears the WRITEBACK and DEV_CACHED end up as the same numbers, but perhaps there is more to it?
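To check that observation against the field definitions in the comment, here is a minimal sketch that extracts TR, IR and OR for a given index n. The PRRR and NMRR constants are the values as I recall them from proc-v7-2level.S; treat them as an assumption and verify against your kernel tree. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumption: register values recalled from proc-v7-2level.S, not verified here. */
#define PRRR 0xff0a81a8u
#define NMRR 0x40e040e0u

/* n = TEX[0],C,B packed into a 3-bit index. */
static inline unsigned tex_remap_index(unsigned tex0, unsigned c, unsigned b)
{
    return ((tex0 & 1u) << 2) | ((c & 1u) << 1) | (b & 1u);
}

/* TR = PRRR[2n+1:2n]: 00 strongly ordered, 01 device, 10 normal memory. */
static inline unsigned mem_type(unsigned n)
{
    return (PRRR >> (2u * n)) & 3u;
}

/* IR = NMRR[2n+1:2n]: inner cacheable property. */
static inline unsigned inner_attr(unsigned n)
{
    return (NMRR >> (2u * n)) & 3u;
}

/* OR = NMRR[2n+17:2n+16]: outer cacheable property. */
static inline unsigned outer_attr(unsigned n)
{
    return (NMRR >> (2u * n + 16u)) & 3u;
}
```

WRITEBACK and DEV_CACHED both use n = 011, so they index the same PRRR and NMRR fields and necessarily decode to the same memory type and cacheability. Any difference between them would have to come from other page table bits, such as the shareable bit, which this sketch does not model.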


What am I doing wrong?


Is there fundamentally a better way to go about this?

I went for the solution of mem=256M, with ioremap_* because I need large buffers. For the first half it is more efficient if these are contiguous.

I understand there are other ways of managing dma memory which gets the kernel to do more of the work.  However, with some of the DMAs via the ACP port and some via AXI_HP ports, could I describe this to the kernel through the device tree? How?





7 Replies
Registered: ‎10-15-2015

I have now checked the implemented design and as far as I can see ARUSER and AWUSER are pulled to CONST1 in the wrapped PS7 block. This seems correct.



Registered: ‎04-01-2015

An ordinary Linux device driver can be informed that DMA operations are coherent with a simple modification to the driver.  I have an AHCI SATA core attached to an ACP port and I inform the driver to use coherent DMA operations with the following code snippets.  The switch is performed either with a module parameter or an entry in the device tree.


   /* Module parameter to define whether DMA is coherent. */
static int arm_coherent_dma = 0;
module_param(arm_coherent_dma, int, 0);
MODULE_PARM_DESC(arm_coherent_dma, "Device is attached via ARM ACP port and DMA is coherent");


In the device driver probe function


    /* Coherent DMA set via module parameter or device tree entry. */
     if(arm_coherent_dma || of_dma_is_coherent(dev->of_node)) {
        dev_info(&pdev->dev,"Set COHERENT DMA ops via %s\n",
                               arm_coherent_dma?"module parameter":"device tree");


The effect is that the incredibly slow ARM DMA cache maintenance (coprocessor 15) instructions are bypassed and CPU utilization is much reduced.  Look at drivers/of/address.c for the definition of "of_dma_is_coherent" to find out what needs to be entered into the device tree.


Note that around Linux kernel version 3.9, Xilinx broke the L2 cache coherency by setting bit 22 in the AUX register to fix some other problem (genesis of this is unclear to me, Xilinx please fix this!).  In order for the above to work it's required to reverse this fix.  In arch/arm/mach-zynq/common.c the commented out lines are Xilinx original.  Reverted lines are below

DT_MACHINE_START(XILINX_EP107, "Xilinx Zynq Platform")
/* 64KB way size, 8-way associativity, parity disabled */
// .l2c_aux_val = 0x30400000,
// .l2c_aux_mask = 0xcfbfffff,
.l2c_aux_val = 0x30000000,
.l2c_aux_mask = 0xcfffffff,
// .l2c_aux_val = 0x00400000,
// .l2c_aux_mask = 0xffbfffff,
.l2c_aux_val = 0x00000000,
.l2c_aux_mask = 0xffffffff,
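To make the effect of these value/mask pairs concrete, here is a small sketch of how they combine with the existing aux control register contents. It assumes the kernel computes (aux & mask) | val, which is my reading of the L2 cache init code; the function name, the macro name and the sample register value in the note are hypothetical.

```c
#include <stdint.h>

/* Bit 22 of the L2 cache controller aux control register, as discussed above. */
#define L2C_AUX_BIT22 (1u << 22)

/*
 * Assumption: the kernel keeps the bits selected by aux_mask and then
 * ORs in aux_val. This mirrors my reading of the L2 cache init code.
 */
static inline uint32_t l2c_apply_aux(uint32_t aux, uint32_t aux_val,
                                     uint32_t aux_mask)
{
    return (aux & aux_mask) | aux_val;
}
```

With the Xilinx pair (val 0x00400000, mask 0xffbfffff), bit 22 ends up set whatever the starting value was. With the reverted pair (val 0x00000000, mask 0xffffffff), the register is left untouched, so bit 22 keeps whatever value the boot loader programmed rather than being actively cleared. For a hypothetical starting value of 0x02060000 the two results are 0x02460000 and 0x02060000.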


Also note that for any of this to work ARCACHE and ARUSER must be set correctly (I don't recall what the correct settings are, I'm a software guy).


Dan Ladd
Registered: ‎10-15-2015

Thank you for this information.

I made the changes to the kernel and the system appears to work as expected now.


So it appears it is necessary to revert this change to bit 22 in the L2 cache aux register to allow the ACP port to provide coherency. This negates the Shareability disable flag, but I don't claim to understand it, and I worry about what else I might be breaking by changing this.


I have the ACP transaction checker enabled, which did reject some of my illegal DMA requests: these generated bursts with length code 2 (=> 3 words) and size=3 (64 bit?). My shortest transfer is now 32 bytes and I have not seen the Slave Errors getting to the DMA.
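For reference, the burst arithmetic behind those numbers can be sketched as follows, using the standard AXI encoding in which AxLEN is the number of beats minus one and AxSIZE is log2 of the bytes per beat. The function names are hypothetical, and this does not model the ACP checker's actual acceptance rules.

```c
#include <stdint.h>

/* AXI encodes the burst length as (number of beats - 1). */
static inline uint32_t axi_burst_beats(uint32_t axlen)
{
    return axlen + 1u;
}

/* AXI encodes the beat size as log2(bytes per beat). */
static inline uint32_t axi_beat_bytes(uint32_t axsize)
{
    return 1u << axsize;
}

/* Total bytes moved by one burst. */
static inline uint32_t axi_burst_bytes(uint32_t axlen, uint32_t axsize)
{
    return axi_burst_beats(axlen) * axi_beat_bytes(axsize);
}
```

So size=3 does mean 64-bit (8-byte) beats, the rejected burst with length code 2 carried 3 beats or 24 bytes, and a 32-byte transfer corresponds to length code 3 at the same beat size.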

So I am not clear what else I have to do to ensure the cache is working without data corruption.


Thank you



Registered: ‎10-15-2015

With bit 22 cleared the system works with ioremap_cache, with various wmb() calls, but no explicit cache management.

For the case where I send multiple contiguous 16k frames and do not access them immediately from software, I actually get slightly worse and more variable performance than with coherency not working.

For my lists of 16 kByte packets, which I immediately process on IRQ (in a tasklet), I get significantly better performance with the cache coherency solution.


I have investigated the reason for the change to the aux register bit 22.

It is described here:

I don't claim to understand, but unfortunately it appears it is not safe for me to revert this fix.

Other posts suggest that this cache is important for system reliability.


So I am wondering whether there is any other way to use the ACP with Linux?



Registered: ‎10-15-2015

I have investigated the effect of bit 22 in the l2_cache_aux register. Setting this bit prevents non cached accesses being transformed into cacheable accesses.

The default behavior of the cache controller with respect to the shareable attribute is to transform Normal Memory Non-cacheable transactions into:

  • cacheable no allocate for reads

  • write through no write allocate for writes.

But setting bit 22 prevents this.

The Xilinx IP default is to make AxCACHE=0010.

The meanings in AXI and ARMv7 seem slightly different.


0010: the AXI meaning is cacheable but do not allocate; the ARMv7 meaning is outer non-cacheable.

The aux bit 22 setting only affects transfers with codes 0011 and 0010.


So I forced AWCACHE and ARCACHE to 1111 and put the kernel back to unmodified.
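As an aid to reading these 4-bit codes, here is a minimal sketch that decodes an AxCACHE value using the AXI3 bit assignment as I understand it (bit 0 bufferable, bit 1 cacheable, bit 2 read allocate, bit 3 write allocate). The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Decoded AxCACHE attributes (AXI3 naming, to the best of my knowledge). */
struct axcache_attr {
    bool bufferable;   /* bit 0 */
    bool cacheable;    /* bit 1 */
    bool read_alloc;   /* bit 2 */
    bool write_alloc;  /* bit 3 */
};

/* Split a 4-bit AxCACHE code into its individual attribute bits. */
static inline struct axcache_attr axcache_decode(unsigned axcache)
{
    struct axcache_attr a = {
        .bufferable  = (axcache & 0x1u) != 0,
        .cacheable   = (axcache & 0x2u) != 0,
        .read_alloc  = (axcache & 0x4u) != 0,
        .write_alloc = (axcache & 0x8u) != 0,
    };
    return a;
}
```

Decoded this way, 0010 is cacheable without either allocate hint, 0011 adds bufferable, and 1111 sets all four bits, i.e. cacheable and bufferable with both read and write allocate.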


This seems to work.

(The  first tests gave some trouble, but this may have been user error)


I have now tested various combinations.

With AWCACHE=1111 and the driver using ioremap_cache it works.

If I use AWCACHE=1111 but use ioremap_nocache, the data is not visible to the software immediately, but multiple attempts to read it show the data appearing a bit at a time. Presumably the data is put into the cache by the DMA but the CPU reads the DRAM directly. Other activity then causes the DMA data to be flushed from the cache into DRAM, which is then seen by the CPU.


Using AWCACHE=0010 with ioremap_nocache works, but is slower.

Using AWCACHE=0010 with ioremap_cache, does not work. Presumably the DMA puts the data into DRAM but the CPU sees the cached data.


To summarise the solution:

AXI DMA via ACP with constant over-riding AWCACHE and ARCACHE = 1111

Kernel 4.0.0, unmodified.

Memory limited to use only half of the memory via U-Boot environment variables.

Driver using ioremap_cache to see the buffers and descriptors.


So this might be a solution, though I would urge caution to anyone else trying this unless someone else can confirm that it is an acceptable solution.



Registered: ‎08-27-2013



When you say constant over-riding of AxCACHE signals = 1111, what exactly do you mean by this?  I have been hardcoding AxCACHE signals in the firmware associated with the processing system.  Is that what you are doing?


In the meantime, I am going to try testing the ACP port with the AxCACHE signals = 1111.  Previously we got the ACP port working by modifying bit 22 in the l2_cache_aux register and hardcoding the AxCACHE signals = 0011.  I can look at the device view and see the signals tied to either pwr or gnd.



Registered: ‎10-15-2015

I drive constant values onto AWCACHE and ARCACHE in the block diagram by clicking on the + to expand the AXI_ACP bus port and then add an IP block of xlconstant and set width=4 value=15 and connect to AWCACHE and ARCACHE.