Registered: ‎06-29-2015

Why does turning on L2 Cache slow down GEM (Ethernet) performance

I have a working Bare-Metal app running on Zynq that processes Ethernet data. L2 Cache has never been switched on.

I tried switching it on, but found two issues:

1) Switching it on adds a lot of latency to my Ethernet data. All buffers and BDs sit in non-cached RAM (as they always have). My ping goes up from < 1 ms to between 20 ms and 50 ms.

2) The MMU entry for the non-cached RAM was configured with TEX[2:0] as 0b101, C=0 and B=0. This means L1 cache was disabled for that region, but L2 was enabled. When I switched on L2 cache, this clearly broke my Ethernet, as buffers and BDs must be non-cached. I fixed it by changing TEX[2:0] to 0b100, BUT when it is in this state and I disable L2 cache, my Ethernet does not work at all.

Neither of these two scenarios matches my understanding of L2 cache and GEM coherency. I would think that enabling L2 cache should have no effect on the GEM, but it did (it also sped up my processing by 10%, which is expected). I would also think that with TEX[2:0] set to 0b100, enabling or disabling L2 should have no effect, but it did.



6 Replies
Xilinx Employee
Registered: ‎02-01-2008

tex=101, C=0, B=0 means L1 non-cacheable, L2 cache is write-back, write-allocate.

tex=100, C=0, B=0 means L1 non-cacheable, L2 non-cacheable
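The two encodings above can be checked with a small decoder. This is only a sketch of the ARMv7-A short-descriptor rules with TEX remap disabled (SCTLR.TRE = 0, an assumption about this setup); the function names are made up for illustration, and encodings not discussed in this thread are not decoded.

```c
#include <stdio.h>
#include <stddef.h>

/* Cache policy encoding shared by TEX[1:0] (outer) and C,B (inner)
 * when TEX[2] == 1. */
static const char *policy_name(unsigned bits)
{
    switch (bits & 0x3u) {
    case 0:  return "non-cacheable";
    case 1:  return "write-back, write-allocate";
    case 2:  return "write-through, no write-allocate";
    default: return "write-back, no write-allocate";
    }
}

/* Hypothetical helper: write a description of a TEX/C/B combination
 * into out (at most n bytes, always NUL-terminated for n > 0). */
static void describe_tex_cb(unsigned tex, unsigned c, unsigned b,
                            char *out, size_t n)
{
    unsigned cb = ((c & 1u) << 1) | (b & 1u);
    tex &= 0x7u;

    if (tex & 0x4u) {
        /* TEX[2] == 1: Normal memory, C,B = inner, TEX[1:0] = outer */
        snprintf(out, n, "Normal: inner (L1) %s, outer (L2) %s",
                 policy_name(cb), policy_name(tex & 0x3u));
    } else if (tex == 0 && cb == 0) {
        snprintf(out, n, "Strongly-ordered");
    } else if (tex == 0 && cb == 1) {
        snprintf(out, n, "Device (shareable)");
    } else if (tex == 2 && cb == 0) {
        snprintf(out, n, "Device (non-shareable)");
    } else {
        snprintf(out, n, "not decoded in this sketch");
    }
}
```

With these rules, TEX=0b101, C=0, B=0 decodes to "Normal: inner (L1) non-cacheable, outer (L2) write-back, write-allocate", and TEX=0b100, C=0, B=0 to "Normal: inner (L1) non-cacheable, outer (L2) non-cacheable". Note that both are still Normal memory.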

Is S=1? Even though you have marked the memory as non-cached, I believe the cache lines are still used for accesses because the address range is configured as memory.

The GEM will not be coherent to L1 or L2 cache since its DMA has direct access to DDR and does not use the SCU.

Since you are using the baremetal drivers, they will be calling the cache flush/invalidate functions. I believe these functions call both L1 and L2 flush/invalidate functions. I wonder if those functions falter if you have tex=100 (L2 non-cacheable).

I could possibly understand ping going up a bit due to the time required to walk through cache during flush/invalidate but not sure if it would cause a 50x slow down.

Registered: ‎06-29-2015

No, S = 0. Would it make a difference?

Could you please elaborate on "the cache lines are still used ... because the address range is configured as memory."? I find it especially puzzling that tex=100, C=0, B=0 does not work when L2 cache is disabled.

I am not using cache flush/invalidate functions. I have a section of DDR with the settings as described so that it is not cached. My GEM buffers and buffer descriptors are all linked to this memory section so that I do not have to do any cache maintenance (I generally try to avoid it if I can).

When I use tex=100, C=0, B=0 with L2 cache enabled, my ping is very puzzling. As you can see below, every second ping takes as much as 1000 ms. Changing nothing except setting tex=101 and disabling L2 brings ping times consistently down to 0.2 ms:

4 bytes from icmp_seq=0 ttl=127 time=1008.871 ms
64 bytes from icmp_seq=1 ttl=127 time=3.827 ms
64 bytes from icmp_seq=2 ttl=127 time=1008.903 ms
64 bytes from icmp_seq=3 ttl=127 time=4.020 ms
64 bytes from icmp_seq=4 ttl=127 time=1008.514 ms
64 bytes from icmp_seq=5 ttl=127 time=5.346 ms
64 bytes from icmp_seq=6 ttl=127 time=1004.805 ms
64 bytes from icmp_seq=7 ttl=127 time=0.712 ms
64 bytes from icmp_seq=8 ttl=127 time=1004.133 ms
64 bytes from icmp_seq=9 ttl=127 time=1.468 ms
64 bytes from icmp_seq=10 ttl=127 time=1002.789 ms
64 bytes from icmp_seq=11 ttl=127 time=0.233 ms
64 bytes from icmp_seq=12 ttl=127 time=1004.545 ms
64 bytes from icmp_seq=13 ttl=127 time=0.265 ms
64 bytes from icmp_seq=14 ttl=127 time=1003.857 ms
64 bytes from icmp_seq=15 ttl=127 time=0.279 ms
64 bytes from icmp_seq=16 ttl=127 time=1005.373 ms
64 bytes from icmp_seq=17 ttl=127 time=0.292 ms
64 bytes from icmp_seq=18 ttl=127 time=1008.737 ms
64 bytes from icmp_seq=19 ttl=127 time=5.721 ms
64 bytes from icmp_seq=20 ttl=127 time=1002.142 ms
64 bytes from icmp_seq=21 ttl=127 time=0.279 ms
64 bytes from icmp_seq=22 ttl=127 time=1005.727 ms
64 bytes from icmp_seq=23 ttl=127 time=1.855 ms

Registered: ‎04-13-2015

I think I can shed some light on why one out of every two pings has a very long response time. From what you wrote, I understand you've enabled caching on the GEM data buffers and the DMA descriptors.

If so, the DMA descriptors must always be in non-cached memory, because on the Zynq they are 16 bytes each while the cache lines (both L1 and L2) are 32 bytes. Any cache maintenance operation therefore always affects 2 DMA descriptors.
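The line-sharing argument can be made concrete with a little index arithmetic. This sketch takes the sizes quoted above at face value (they should be verified against your GEM configuration) and assumes the descriptor ring starts on a cache-line boundary; the helper names are hypothetical.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 32u  /* L1/L2 line size as quoted above */
#define BD_BYTES         16u  /* descriptor size as quoted above; verify */

/* Cache line (counted from the ring base) that holds descriptor bd_index. */
static size_t bd_cache_line(size_t bd_index)
{
    return (bd_index * BD_BYTES) / CACHE_LINE_BYTES;
}

/* First and last descriptor index covered by the cache line that
 * contains bd_index. Flushing or invalidating that line touches all
 * descriptors in [*first, *last], not just the one you meant. */
static void bd_line_neighbours(size_t bd_index, size_t *first, size_t *last)
{
    size_t per_line = CACHE_LINE_BYTES / BD_BYTES;

    *first = (bd_index / per_line) * per_line;
    *last  = *first + per_line - 1;
}
```

With these sizes, a maintenance operation aimed at descriptor 3 also covers descriptor 2, so a flush of one can overwrite a status update that the DMA has just written to the other.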


Registered: ‎06-29-2015

No, I've never had any of my GEM data buffers or buffer descriptors in cached memory. They were always linked to a part of DDR memory that was set up with its own block in the MMU with TEX=101, C=0, B=0.

When I enabled L2 cache it completely broke my Ethernet. It was only then that I noticed the TEX bits were wrong, enabling caching in L2 but not in L1. I had not been affected before, because L2 cache was off.

Fixing it by marking the memory as completely non-cached in both L1 and L2, with TEX=100, C=0, B=0, introduced the slow Ethernet performance issue.

I am still at a loss as to why this affects the GEM and also why enabling L2 cache, but setting TEX=100 (i.e. disabling L1 and L2 cache for the memory block) broke the Ethernet. Am I missing something else?

Registered: ‎04-13-2015

With TEX=100 the access is still "memory type", and although you've set L1 and L2 as non-cached, some cache-like features remain operational. Among them is the speculative read, where the A9 can read the memory before the instruction that performs the read is executed, possibly before the DMA data has landed in it. You should use "device type" or "strongly ordered" to truly bypass the caches.
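For reference, the difference between these memory types comes down to a few bits in the first-level section descriptor. The sketch below composes the attribute word for an ARMv7-A short-descriptor 1 MB section, assuming TEX remap is disabled, domain 0, execute allowed and full read/write access; the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Field positions in an ARMv7-A short-descriptor section entry. */
#define SECTION_TYPE    0x2u                              /* bits [1:0] = 0b10 */
#define SECTION_B(b)    (((uint32_t)(b) & 1u) << 2)
#define SECTION_C(c)    (((uint32_t)(c) & 1u) << 3)
#define SECTION_AP_RW   ((uint32_t)0x3u << 10)            /* AP[1:0] = 0b11 */
#define SECTION_TEX(t)  (((uint32_t)(t) & 7u) << 12)
#define SECTION_S(s)    (((uint32_t)(s) & 1u) << 16)

/* Hypothetical helper: attribute bits for one section (domain 0, XN = 0).
 * The section base address would be OR-ed into bits [31:20]. */
static uint32_t section_attr(unsigned tex, unsigned c, unsigned b, unsigned s)
{
    return SECTION_TEX(tex) | SECTION_C(c) | SECTION_B(b) |
           SECTION_S(s) | SECTION_AP_RW | SECTION_TYPE;
}
```

Under these assumptions, section_attr(0, 0, 0, 0) gives 0xC02 (Strongly-ordered), section_attr(0, 0, 1, 0) gives 0xC06 (shareable Device), and the two Normal-memory settings from this thread, TEX=100 and TEX=101 with C=0, B=0, S=0, give 0x4C02 and 0x5C02. In the Xilinx standalone BSP such a value is typically passed to Xil_SetTlbAttributes(); check the xil_mmu.h of your BSP version for the exact signature and predefined constants.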

Registered: ‎06-29-2015

Thanks. It makes sense but it is a bit counter-intuitive. I assume the delay is coming in on the transmit side when a ping reply is sent. The updating of the BD must only become visible to the GEM after one second. Does this sound correct?

I could change the memory to Device or Strongly ordered, but that would break my stack. It is optimised to operate directly on the data buffers in memory and, being IP, a lot of the members are not aligned. If the memory is Device or Strongly ordered, the unaligned accesses will cause data aborts.
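One possible workaround, sketched here rather than recommended, is to read unaligned protocol fields through byte accesses only, since a single-byte load is always aligned. The helper names are hypothetical, and whether the extra loads are acceptable for a stack tuned to work in place is a separate question.

```c
#include <stdint.h>

/* Read a big-endian (network order) 16-bit field one byte at a time.
 * The volatile qualifier keeps the compiler from merging the byte
 * loads into a wider, possibly unaligned, access. */
static uint16_t read_be16(const volatile uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | (uint16_t)p[1]);
}

/* Same idea for a 32-bit field such as an IPv4 address. */
static uint32_t read_be32(const volatile uint8_t *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}
```

For example, with the bytes C0 A8 01 0A stored at an odd offset in a buffer, read_be32() on that offset returns 0xC0A8010A without ever issuing a word access.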
