07-22-2017 06:18 AM
I have created my own memory-mapped peripheral, compatible with the AXI4 Full protocol, with the help of the template provided by Vivado. I have tested the peripheral with a BFM and I am able to successfully use bursts. My peripheral is connected to AXI_GP_0.
My problem is on the Xilinx SDK side: is there any Xilinx driver in C that can generate bursts? My data are saved in RAM and I want to pass them to the PL peripheral. So far I am using Xil_Out32, but these instructions perform single transfers.
From some research I have found that I can use a DMA IP in the PL, but I would like to know whether the ARM processor itself can generate those bursts, or whether a DMA of some kind is always necessary.
Thanks in advance.
07-22-2017 06:47 PM
@kgkougkoulias Yes, apparently you need to mark your PL range as memory for the MMU to generate burst transactions. There was a thread a couple of months ago that showed how to do this. If you can't find it I can look for it again.
03-08-2018 03:40 AM
I have been around this topic for a while now, but I am not sure I've got it right. When I define the PL memory as DEVICE_MEMORY, I get 6x better performance: the number of clocks between bvalid falls from 18 to 3. But when I look at the AXI signals, this seems to me like coalescing and not a real burst. Please see the attached screenshot.
I obtain the same result when using DMA...
My question is: am I getting a burst or not?
03-08-2018 07:46 AM
@hbucher According to the Zynq TRM, one of the Device Memory access rules is that "both read and write accesses can have side effects on the system. Accesses are never cached. Speculative accesses are never performed." So I understood it wasn't cached, but in the same document one reads "A write to device memory is permitted to complete before it reaches the peripheral or memory component accessed by the write". So this is the issue you refer to, right? Would the correct choice be Strongly Ordered?
Back to my question: are the accesses shown in the picture enclosed in my previous post a burst or not? In a burst I would expect the awaddr to remain constant...
03-08-2018 02:39 PM
On Zynq, there are basically three types of accesses that can be configured in the MMU:
Strongly-ordered: a new transaction will not occur until the previous one completes.
Shareable device: a new transaction can start before the response to the previous one completes (this can give up to a 10x improvement over strongly-ordered when writing continuously to a peripheral).
Memory: uses the cache controller.
Cache can get in the way, depending on what you want to do. If you set the address range as memory, and configure both inner and outer as non-cached, then the cache controller will coalesce multiple writes to incrementing addresses into burst transactions.
You are correct that the waveform you captured is not a burst transaction. Also, I do not see any AXI4 burst signals in the waveform capture, which suggests to me that these may be AXI-Lite transactions; AXI-Lite does not support bursting. I would have expected to see signals such as AWLEN, AWSIZE, etc.
03-08-2018 03:43 PM
@johnmcd thank you for your reply. First of all, yes I have AXI4 interface, I captured the handshake signals and just some others to not clutter the screen.
Now I understand why when I change from STRONGLY_ORDERED to DEVICE_MEMORY I have such improvement in the performance (6x faster in my case).
I didn't give any background on what I am trying to do. I want to move data from AXI Ethernet Lite to the DDR, and the other way around, for incoming and outgoing Ethernet packets. I use lwIP and am making some changes to get it faster, avoiding unnecessary copies and speeding up the AXI4 interface, for example.
So for having bursts I need to set the memory as normal, correct? So far my echo server hangs after I get the first packet when I configure the memory as normal.
03-08-2018 04:28 PM
It's been a long time since I've looked at the ethernetLite. It is a slow interface since there is basically a ping/pong bram buffer within the core. As the ping block is written by the IP, you can read the pong block. So your cpu will be very busy.
So, thinking aloud: if you set up the ethernetLite address space as memory, non-cached, and if the ethernetLite TX FIFO has an address range instead of a single address (so that you can do incremental-address writes to the core and therefore get coalescing), you may need to use a data barrier instruction before accessing other addresses, just to make sure the writes complete.
So the CPU copies data to/from DDR/enetLite. DDR could be cacheable since you are dealing with DMA accessing the same DDR space, but the enetLite address space must not have cache enabled. If your enetLite address range is set as 'memory' non-cached, you may run into issues when reading/writing register locations. Use chipscope to verify. If you really want to move ahead with enetLite, and you want fast data movement using the CPU, then it might help to set up two virtual addresses to the same enetLite physical address: virtual address #1 is memory, non-cached, and virtual address #2 is device/strongly-ordered. Use #2 for register reads/writes and #1 for data FIFO access (as long as the FIFO has an address range and not a single keyhole address).
I do question why you don't use either CDMA with ethernetLite or axi_dma with a hardcore enet block. CDMA is for axi to axi transactions, and axi_dma is for axiStream to/from axi transactions.
LWIP should have the necessary stuff to manage cache as you are probably well aware if you are digging around to do zero copy or equivalent. I'm not sure if LWIP includes enetLite with CDMA. You could probably tell me that.
So regarding a hanging echo server, I'm wondering if it has to do with register accesses to the enetLite while coalescing is occurring. Easy way to check is chipscope, and use xsdb to issue single read and write accesses to fifo space vs register space.
03-09-2018 02:38 AM
Thank you a lot for this detailed answer. My application is all about low latency, not so much throughput, and you might be surprised that Enet Lite has lower latencies than GEM for any packet size. The cost is CPU occupation, sure.
You gave me a good insight and now I know how to continue my development. I will post here again whenever I am able to accomplish what I want or get stuck.
03-10-2018 08:24 AM
@hbucher That's off-topic, but thanks for the comment. 200 ns is very good, certainly. The PHYs using cat 5 cables need at least 350 ns, not considering the higher layers, but I am still investigating more standard, lower costs interfaces using cables or at most plastic optical fibre. In industrial environments, with hundreds of nodes, we need to keep it simple and (relatively) cheap.