dbemmann
Observer
Registered: ‎05-08-2018

Least processor-heavy way to get BRAM data


A PL core periodically generates 1 KByte of data ("the chunk") into a BRAM every 10us. I am looking for the least processor-heavy way to get this data into the Cortex-A9 for further processing, i.e. I want to maximize the time that the Cortex-A9 can spend on other tasks. In order to make that happen, I need to minimize the latency of reads made by the processor to wherever it gets the data from. On the other hand, the latency of moving the data to that place is not critical.

The easiest solution would be reading from an AXI BRAM Controller... but then I'd have the latency of the AXI bus in every read, which will stall the processor while waiting for the data (right?)

So I assume I'd be better off using DMA to move the data to OCM or DDR and then let the processor read from there. 1KB = 256 * 32bits, so it fits into one burst.

I read that I can use AXI DataMover or AXI DMA, but it seems that all it does is read from an AXI Stream slave (as a master) and write to the PS's slave port (also as a master), so do I really need such a middleman? Wouldn't it be better to develop an AXI master to directly write to the PS?

Once the data is written to OCM or DDR, does the cache know about it?

How many processor cycles of latency can I typically expect for reading such data? Is there any better way?

Accepted Solutions
dgisselq
Scholar
Registered: ‎05-21-2015

@dbemmann ,

> I've only spent about a week with AXI and IP Integrator and my impression is that AXI is a very mighty sword frequently used for tasks that could be handled by a pocket knife

Nice quote.  Can I use it?

As for your problem: back up a moment and imagine the ideal solution. Now tell us what it is.

The reality, though, is that the processor speaks AXI. All you need to do is speak AXI to talk to it. The next question is what's the cheapest way to accomplish that. Most PL memory is unpredictable, which (appropriately) keeps it from ever being cached, and that in turn keeps it from being accessed with bursts. If you use the DMA, you can burst your data to ... more memory? That does seem like a waste. How about we cut out the middleman? Does the processor really need to see the data at all? The processor would be the slow item in the chain.

It is also possible to cut out the AXI-Lite interface to the data mover (DMA): you can script the DMA via an AXI stream. You can also cut out the interconnect between the DataMover's AXI master port and the CPU. It all depends on what you need to do.

Dan


15 Replies
dbemmann
Observer
Registered: ‎05-08-2018

I found an answer record on OCM access http://www.xilinx.com/support/answers/50826.htm - unfortunately I cannot get it to work on my board.

But still the question remains why people would piece together a RAM + a RAM controller (AXI slave) + a DMA core (AXI master to the RAM controller, AXI lite slave itself) + an interconnect (AXI lite master to the DMA core, slave to the PS), which makes 4 cores, when it seems that one master (that reads the RAM and puts the data on the bus) would do it? I do assume that all of these cores add latency and overhead logic, so I have trouble understanding why so many example designs chain together tons of cores.

drjohnsmith
Teacher
Registered: ‎07-09-2009

Assuming the RAM is on the PL side and the processor is on the PS side, then DMA is the thing.

Regarding your general comment along the lines of "a better way?": how are you going to connect the RAM on the PL to the processor side, apart from going through the bridges in the chip, which use the AXI interface?

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
dbemmann
Observer
Registered: ‎05-08-2018

So did I get it right that "DMA" just boils down to having a PL master write to a PS AXI slave port? There isn't more that the AXI DMA does, right? Just request data from a slave and pass it on as a master. Or did I miss something else that needs to be done in order to make the PS "see" the data (due to caching etc.)?

My own core works on a BRAM and once its results are ready, I would like to make that data available to the PS in such a way that minimizes the PS cycles spent on reading it. That's why I was thinking about copying the data to another memory in the PS which can be accessed faster. It seems the OCM would be a good choice, but I'm new to Zynq so I cannot speak from experience.

 how are you going to connect the ram on the pl to the processor side

That is part of my question, but putting 3 more IP cores between the RAM and the PS doesn't sound very efficient. Another obstacle is that the AXI DMA IP itself is controlled through yet another AXI port... which is inconvenient because I want to control it from within my own IP (not from a processor), so I would have to add yet another AXI port to my IP only to write two start addresses (which are constants) and set one bit to start the DMA. All this sounds like a massive overkill in terms of protocols.

So if all that these IPs wired together achieve is reading a BRAM and writing its content to an AXI slave port on the PS, then I assume it would be more straightforward and more efficient to have one IP that does just that.

Speaking in analogy: this feels like wanting to connect an iPod to a stereo and being presented with solutions consisting of 4 different audio adapters chained together in series, so I'm asking myself if I should really consider that or rather solder one proper cable. I've only spent about a week with AXI and IP Integrator and my impression is that AXI is a very mighty sword frequently used for tasks that could be handled by a pocket knife

drjohnsmith
Teacher
Registered: ‎07-09-2009

DMA: wiki it. "All" it does is move data from one range of addresses in memory to another. The processor sets it up, then leaves it until it's "triggered", so the processor overhead is very minimal.

Next up, your question about AXI: what is your alternative suggestion? The processor talks AXI, and the interfaces between the PS and PL are AXI buses. This is because the PS and the PL are talking to many different addresses. The PL and the PS can work independently until you need to get them to talk. As PL and PS are both asynchronous processes, in that they do not generally have a specific clock time at which to talk to each other, some sort of queue / back-pressure system is needed.

Your BRAM does not talk AXI, yet it needs to get its data to an AXI processor. Also, any DDR memory you are using is on an AXI interface; that way the processor does not need to change for different types of memory, it's just AXI.

Give us your idea.

 

dbemmann
Observer
Registered: ‎05-08-2018

Thank you for your continued interest in my problem - it's great to get feedback from experienced engineers!

My point was that the BRAM is not an AXI peripheral, so it seems wasteful to hook the BRAM up to a controller to make it speak AXI just so it can talk to the DMA, which then reads from it through AXI and writes to the PS... and I'd have to build another AXI-Lite master to set up the DMA (because its control data originates from within the PL, not the PS). That solution instantiates 5 AXI interfaces.

> what is your alternative suggestion ?

My alternative suggestion would be building a custom IP for the task at hand: reading directly from the BRAM (through its native port), and writing to the PS as an AXI master. That solution instantiates only 1 AXI interface.

I believe the custom IP would be smaller and possibly faster (in terms of copying the data). On the downside, I realize that it would take much more time to develop such an IP, so to get up and running, I will start with the existing IPs and then optimize later.

Challenges I still have to find out about are cache coherency and read latency of the processor.

vanmierlo
Mentor
Registered: ‎06-10-2008

It sounds like you want to transfer 1 kB every 10 us to PS memory. That is 100 MB/s, or 800 Mb/s. This is feasible, but already quite a lot of bus load. The AXI bus will also be busy doing other things, so you should consider using a FIFO along the way.

The AXI-DMA engine takes data on an AXI-Stream slave input (not master, not AXI-Bus). The BRAM has neither an AXI-Bus nor an AXI-Stream interface. If you want/need to stay with the BRAM you need to bridge that. But wouldn't you be better off generating the data as an AXI-Stream right away instead of dumping it in a BRAM?

The AXI DataMover also takes an AXI-Stream as input and writes to an AXI-Bus, but it accepts its control (task description) from another AXI-Stream instead of an AXI-Lite register interface. I'm not sure this will make it any easier for you.

What do you intend to do with these blocks of data? Process them immediately? Log them for later inspection? Send them out over Ethernet?

drjohnsmith
Teacher
Registered: ‎07-09-2009

I think the point is: the BRAM is not AXI, but everything else you are connecting to is AXI. So the options are either to make the BRAM AXI, using the IP, or to make the rest of the system not AXI, a MUCH bigger task.

 

dbemmann
Observer
Registered: ‎05-08-2018

> 100MB/s [...] quite some bus load

@vanmierlo correct. Actually my source data words have a resolution of 24 bits. I might get away with 20 bits, but no less. So I'm doing 32 bit transfers (with 8 bits unused), to get my data aligned in memory. Alternatively, I could "pack" 4 words into 3*32bit in the PL and reduce bus traffic by 25%, but then I'd have to unpack in the PS and it doesn't seem a good idea to put that extra load on the processor. I assume AXI doesn't have a hardware feature to do this unpacking, or does it?
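To get a feel for what the PS-side unpacking would cost, here is a minimal C sketch (my own hypothetical helper names, and an assumed little-endian bit layout, not anything Xilinx provides) of packing four 24-bit samples into three 32-bit words and unpacking them again:

```c
#include <stdint.h>

/* Hypothetical sketch: pack four 24-bit samples into three 32-bit words
 * and recover them on the PS side. Layout assumption: samples are packed
 * LSB-first, so each 32-bit word carries pieces of adjacent samples. */

/* Pack s[0..3] (each holding a 24-bit value) into w[0..2]. */
static void pack4x24(const uint32_t s[4], uint32_t w[3])
{
    w[0] = (s[0] & 0xFFFFFF)      | (s[1] << 24);   /* s0 + low 8 bits of s1  */
    w[1] = ((s[1] >> 8) & 0xFFFF) | (s[2] << 16);   /* mid of s1 + low 16 of s2 */
    w[2] = ((s[2] >> 16) & 0xFF)  | (s[3] << 8);    /* top 8 of s2 + all of s3 */
}

/* Inverse: recover the four 24-bit samples from the three packed words. */
static void unpack4x24(const uint32_t w[3], uint32_t s[4])
{
    s[0] = w[0] & 0xFFFFFF;
    s[1] = ((w[0] >> 24) | (w[1] << 8))  & 0xFFFFFF;
    s[2] = ((w[1] >> 16) | (w[2] << 16)) & 0xFFFFFF;
    s[3] = (w[2] >> 8) & 0xFFFFFF;
}
```

The unpack side is a handful of shifts, ORs and masks per sample, which is cheap per element but still adds a few instructions to every read in the processing loop; measuring it against the 25% bus saving would settle the trade-off.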

> Wouldn't you be better off generating the data as an AXI-Stream right away instead of dumping it in a BRAM?

That's a valid question. Short answer: I can't. (Long answer: the algorithm in the PL calculates data iteratively and has to re-read previous values in an order not known at design time, so the RAM in the PL is needed anyways).

> What do you intend to do with these blocks of data? Process them immediately?

The PS needs to process all of them in each 10 us interval, doing further calculations. I got DMA to work (needs to be tuned later), and I'm now focusing on the PS. First thing to do after the DMA is to invalidate the address range in the cache (otherwise the PS reads stale data from the cache). I measured the time the PS takes to read 256 u32 values in a for-loop:

  • From BRAM (via AXI): 46696 cycles (182.4 cycles per u32)
  • From DDR (uncached): 4266 cycles (16.7 cycles per u32)
  • From DDR again (cached): 3634 cycles (14.2 cycles per u32)
  • From OCM (uncached): 3678 cycles (14.4 cycles per u32)
  • From OCM again (cached): 3634 cycles (14.2 cycles per u32)
  • These measurements are standalone, with nothing else running, and the figures include the time spent on the for-loop itself
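For reference, the kind of loop being measured can be sketched as below. This portable stand-in (hypothetical function names) times the pass with clock(); on a Zynq standalone BSP one would instead bracket the loop with reads of a hardware timer, such as the Cortex-A9 global timer, to get cycle counts:

```c
#include <stdint.h>
#include <time.h>

#define NWORDS 256  /* 1 KByte = 256 x 32-bit words */

/* Read NWORDS u32 values in a plain for-loop, as in the figures above.
 * The volatile qualifier forces an actual load per element. */
uint32_t read_loop(const volatile uint32_t *buf)
{
    uint32_t sum = 0;
    for (int i = 0; i < NWORDS; i++)
        sum += buf[i];
    return sum;
}

/* Time one pass over the buffer. clock() is a coarse, portable stand-in
 * for the cycle-accurate hardware timer a Zynq measurement would use. */
long measure_ticks(const volatile uint32_t *buf, uint32_t *sum_out)
{
    clock_t t0 = clock();
    *sum_out = read_loop(buf);
    clock_t t1 = clock();
    return (long)(t1 - t0);
}
```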

My interpretation:

  • AXI clock was only 50 MHz, which explains the sluggish AXI-BRAM, but it's still by far the slowest method
  • Cached OCM and cached DDR take the same time -> expected -> cache is working
  • Reading cache is only 1% faster than reading uncached OCM -> surprising
  • Reading OCM is only 16% faster than reading DDR -> surprising
  • Overall, the achieved speed is very disappointing

But I guess I'm mostly measuring access latency here, and the processor sits idle while waiting for the data, so the measured times have limited relevance for a real application. Once I have code processing the acquired data, I really do hope that the processor puts meaningful instructions between issuing a read and processing the incoming result, so it can get much faster. Is that realistic? I have no idea how to write a test case to measure the time truly spent on reading data (isolating it from other code). But how many processor cycles does it "really" take to read a value from the OCM (assuming there is enough parallelism in the program to make use of the latency cycles)?
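One common way to hand the processor that kind of parallelism is to unroll the read loop with independent accumulators, so several loads can be outstanding instead of serializing on a single sum dependency. Whether this actually helps on a given Cortex-A9 build has to be measured; this is just a sketch (hypothetical function name) of the idea:

```c
#include <stdint.h>

#define NWORDS 256  /* must be a multiple of 4 for this unroll factor */

/* Sketch: unroll by four with independent accumulators so the compiler
 * and CPU can keep several loads in flight. The four partial sums have
 * no dependency on each other inside the loop body. */
uint32_t read_loop_unrolled(const uint32_t *buf)
{
    uint32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < NWORDS; i += 4) {
        s0 += buf[i];
        s1 += buf[i + 1];
        s2 += buf[i + 2];
        s3 += buf[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

Comparing this against the plain for-loop on the same buffer would show how much of the measured time is hideable latency rather than raw throughput.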

drjohnsmith
Teacher
Registered: ‎07-09-2009

BRAM via AXI: are you moving lumps of, say, 256 bytes, or moving data as it becomes available? The overhead of moving between two buses is always going to be significant. What size of interface are you using between the PL and the PS?

As has been said, the fastest, lowest-overhead way to move data that you cannot stream is to use DMA and move it directly in large lumps between the PL and the DRAM.

The PS side and PL side will do what you tell them to do. If you tell them to wait for data, then do a small read before they can get on, the average data rate is going to be limited. You can't fight the tools or the silicon, so you have to learn the best way to keep them both active as much as possible. You need to look in detail at the arrangement of the buses and the interfaces between them.

 

dbemmann
Observer
Registered: ‎05-08-2018

As I said, "I measured the time the PS takes to read 256 u32 values in a for-loop" - so basically the PS reads ready data as fast as it can, I just compared different sources. The number "BRAM via AXI" is not relevant, since it was clear from the beginning that PS reads through AXI would be too slow - that's why I wanted to use DMA in the first place. So let's focus on the other numbers instead. They are not related to the DMA transfer, but to what happens AFTER the DMA transfer, i.e. the PS reading from either cache, OCM or DDR. This is what needs optimization.

drjohnsmith
Teacher
Registered: ‎07-09-2009

So do you have DMA set up now? At the beginning you were saying not to use it; my apologies if you are now.

Is the DMA on the PS or PL side? Which bridge does the transfer take? Do you have a picture of how the buses and bridges are interconnected? Are any of the routes blocking? How wide is your access to the DRAM? Are you reading it effectively, or across boundaries?

There are lots of different things that can affect the performance, and I have not seen a block diagram of how you are moving data and when. That is the key to effective speed: trying to keep all the parts busy. Every time the data crosses a bus, there is going to be a FIFO / buffer structure to cross, and that adds a delay. If the delay is, say, 4 clocks, and your AXI on the PS is running at 100 MHz, then that will incur an overhead of 40 ns, which at the processor clock speed is a good few cycles.

Now if you were moving, say, 10 KBytes, that 40 ns (or whatever it is) is immaterial, but if you are moving 4 bytes per read, then that 40 ns each time adds up.
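That amortization argument can be made concrete with a back-of-envelope model: a fixed per-transaction overhead spread over the transfer size. The numbers below (40 ns overhead, 32-bit bus at 100 MHz, so 4 bytes per 10 ns beat) are illustrative assumptions, not measurements:

```c
#include <stdint.h>

/* Toy model: effective throughput of a transfer of `bytes` bytes, given
 * a fixed per-transaction overhead and a fixed time per 32-bit beat.
 * Assumed numbers for illustration: 40 ns overhead, 100 MHz 32-bit bus. */
double effective_mbytes_per_s(uint32_t bytes)
{
    const double overhead_ns = 40.0;   /* assumed fixed cost per transaction */
    const double ns_per_beat = 10.0;   /* 100 MHz, one 4-byte beat per clock */
    double beats    = bytes / 4.0;
    double total_ns = overhead_ns + beats * ns_per_beat;
    return bytes / total_ns * 1000.0;  /* bytes per ns, scaled to MB/s */
}
```

Under these assumptions a single 4-byte read achieves about 80 MB/s, while a 1 KByte burst approaches the 400 MB/s bus limit, which is the point of moving data in large lumps.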

 

 

vanmierlo
Mentor
Registered: ‎06-10-2008

@dbemmann I usually bring the data in on the ACP port. This way I don't have to worry about flushing the cache. It also means 64 bits per clock cycle. Further, the AXI bus can easily run at 100 MHz, and probably faster.

Since both the L2 cache and the OCM are directly connected to the SCU, running at the same CPU_6x4x clock, I don't think L2 caching will make any difference. For the CPU core the L1 D-cache should still help.

dbemmann
Observer
Registered: ‎05-08-2018

@vanmierlo During the weekend I made ACP work as well. I know it's working because the PS reads up-to-date data without having to invalidate the cache. I tried to measure the time for invalidating the cache range, but for some reason I always measure 0 cycles, even with DMB instructions before and after the invalidation - so I don't know how much time I am saving on cache invalidation. The reading speed of the PS is basically the same, no matter whether the data came in via the HP or ACP port.

I experimented with different caching policies, but I only found out how to make it slower. When the range is set to non-cacheable, DDR takes much longer to read. But with any of the caching policies enabled, DDR and OCM are almost equally fast. I suppose that depends on the size of the block: OCM will make a difference for applications where the data doesn't fit into the cache, but not for my 1 KByte block.

I think I misunderstood snooping and coherency at first. I thought writing to the ACP port would make the SCU store the data in L1. But apparently "coherency" only means that stale data gets evicted from the caches, not that new data gets allocated in them. So the only way to get data into L1 is having the processor read it (and produce a cache miss)? It's not possible to write-through into L1?

vanmierlo
Mentor
Registered: ‎06-10-2008

Indeed I don't think it is possible to have the ACP write into any of the four L1 caches (I+D, 2 cores), only to evict from them. Similarly one core cannot write into L1 of the other core.

And when writing into OCM, I doubt you can even fill L2, as it seems pointless to do so.
