04-04-2018 06:36 AM
I am developing an instrumentation technique where I want to send information from lots
of places in an application with very low overhead to a peripheral where the peripheral does
some processing on the information. I am doing this on a Zynq Ultrascale+. I need to
communicate two 64b values at a time to the peripheral, for which I use a stp (store pair) instruction.
The peripheral has a 128b wide AXI full interface and is connected via AXI SmartConnect
to the PS block. The peripheral, the A53 cores, and AXI SmartConnect all run at the same
I measured that it takes about 10 cycles to execute a stp that writes two values to the
peripheral. This is a bit long for the purpose that I want to use it for. The device driver
of the peripheral uses pgprot_noncached() to mmap the registers of the peripheral in
user address space. pgprot_noncached() selects the DEVICE_nGnRnE memory attribute of ARMv8-A.
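
For reference, a minimal sketch of the kind of store and measurement described above. It is not the original poster's code: the names emit_pair() and ns_per_pair() are made up for illustration, the 16-byte alignment check is an assumption chosen to match the 128b port, and timing with clock_gettime() is only one possible way to measure. On non-AArch64 builds it falls back to two plain stores so that it still compiles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Write two 64-bit values to dst. On AArch64 this is a single stp;
 * elsewhere it falls back to two plain stores so the sketch still builds. */
static inline void emit_pair(volatile uint64_t *dst, uint64_t a, uint64_t b)
{
    /* Assumption: the target is 16-byte aligned, matching a 128-bit port. */
    assert(((uintptr_t)dst & 0xF) == 0);
#if defined(__aarch64__)
    __asm__ volatile("stp %1, %2, [%0]"
                     : /* no outputs */
                     : "r"(dst), "r"(a), "r"(b)
                     : "memory");
#else
    dst[0] = a;
    dst[1] = b;
#endif
}

/* Average wall-clock nanoseconds per emit_pair() over n back-to-back calls. */
static double ns_per_pair(volatile uint64_t *dst, unsigned long n)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < n; i++)
        emit_pair(dst, i, ~(uint64_t)i);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)n;
}
```

In real use dst would be the pointer returned by mmap() on the peripheral's device node. Multiplying the result by the core clock in GHz gives an approximate cycle count per stp, loop overhead included.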
I also tried other memory attribute settings: DEVICE_nGnRE gives the same 10 cycles,
DEVICE_nGRE gives the strange result of 61.3 cycles per stp, and with DEVICE_GRE most of
the data stored by stp never arrives at the peripheral.
I was hoping that DEVICE_nGnRE would give better results than DEVICE_nGnRnE because
then the A53s should not have to wait for a write acknowledgement from the peripheral.
Does anybody have suggestions on how I can reduce the 10 cycles that it takes to execute
a stp to the peripheral? Is there a better configuration of the A53 cores, PS system
or interconnect possible? Or is the reason that DEVICE_nGnRE is not performing
better than DEVICE_nGnRnE a limitation of A53 and are more advanced v8a cores doing
04-04-2018 10:15 AM
It's probably because you're jumping buses, so you traverse the clock domain crossing and the FIFOs.
There's no real way of getting around this if you are doing a read-modify-write type operation.
Cache coherency, DMA, etc. can hide things, but you're crossing from the clock of the GHz processor to the slower internal bus,
so it's going to have latency.
04-04-2018 10:56 AM
I am only doing stores to the peripheral. So no loads.
My expectation of DEVICE_nGnRE is that the A53 would send the data to be stored over the AXI interconnect and directly continue executing successive instructions without waiting for an acknowledgement from the target. Apparently it still waits approx. 10 cycles.
Information from ARM:
"Early Write Acknowledgement (E or nE)
This determines whether an intermediate write buffer between the processor and the slave device being accessed is allowed to send an acknowledgement of a write completion. If the address is marked as non Early Write Acknowledgement (nE), then the write response must come from the peripheral. If the address is marked as Early Write Acknowledgement (E), then it is permissible for a buffer in the interconnect logic to signal write acceptance, in advance of the write actually being received by the end device. This is essentially a message to the external memory system."
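
To make the naming of the four Device types tried in this thread concrete, here is a small sketch that decodes a MAIR_ELx attribute byte into its G, R and E flags. The Device encodings (0x00, 0x04, 0x08, 0x0C) are the architectural ones; the struct and function names are hypothetical and do not come from any kernel or vendor header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags carried by an ARMv8-A Device memory type. */
struct device_attr {
    bool gathering;       /* G: accesses may be merged into one transaction */
    bool reordering;      /* R: accesses may be reordered with each other   */
    bool early_write_ack; /* E: an intermediate buffer may acknowledge      */
};

/* Decode a MAIR_ELx attribute byte of the Device form 0b0000dd00.
 * Returns false if the byte does not encode a Device memory type. */
static bool mair_to_device_attr(uint8_t mair, struct device_attr *out)
{
    if ((mair & 0xF3) != 0)
        return false;

    unsigned dd = (mair >> 2) & 0x3;
    out->early_write_ack = dd >= 1; /* nGnRE, nGRE, GRE */
    out->reordering      = dd >= 2; /* nGRE, GRE        */
    out->gathering       = dd == 3; /* GRE only         */
    return true;
}
```

Decoding 0x00 gives Device-nGnRnE (no flags set) and 0x04 gives Device-nGnRE (only E set). Whether a given core or interconnect actually exploits the E permission is implementation-specific, so the flag alone does not guarantee a faster store.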
04-04-2018 11:16 AM
Beyond me, I'm afraid.
I'm interested to see whether you get a response, though.
04-04-2018 12:36 PM
I haven't looked closely at the address space attributes on the A53, but I will try to offer a few comments on what might be happening.
Do you have a screen capture of the AXI signals that can be shared?