Foxchild
Visitor
Registered: ‎10-03-2020

Experienced opinion needed for CNN implementation

For my thesis I want to implement a CNN (convolutional neural network) on a Zynq-7000 board (Arty Z7). I have some experience with Xilinx (mostly Microblaze designs) but now I am struggling with the outline of my project.

I need to do the following:

  1. Continuously receive frames of RD (range-Doppler) maps at defined intervals over Ethernet
  2. Run the CNN on each frame
  3. Transfer the processed data to a PC

Some details, if you care:

  • RD maps:
    • 96x96 points
    • Complex values, 32-bit floating point per component
    • In total one frame is 96x96x2x4 = 73,728 bytes (~74 kB)
  • The receive period is variable (~100 ms)
  • The CNN has only three layers, with 8-bit weights
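
The frame size and the average data rate implied by these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the constant names are made up for illustration, and the period is taken as the nominal ~100 ms even though the post says it is variable.

```python
# Figures taken from the list above; PERIOD_S assumes the nominal ~100 ms.
POINTS_PER_LINE = 96
LINES_PER_FRAME = 96
COMPONENTS = 2            # real and imaginary part
BYTES_PER_COMPONENT = 4   # 32-bit float
PERIOD_S = 0.100

line_bytes = POINTS_PER_LINE * COMPONENTS * BYTES_PER_COMPONENT
frame_bytes = LINES_PER_FRAME * line_bytes
rate_bytes_per_s = frame_bytes / PERIOD_S

print(f"line:  {line_bytes} bytes")
print(f"frame: {frame_bytes} bytes ({frame_bytes / 1000:.1f} kB)")
print(f"rate:  {rate_bytes_per_s / 1000:.0f} kB/s "
      f"({rate_bytes_per_s * 8 / 1e6:.1f} Mbit/s)")
```

One frame per 100 ms works out to well under 1 MB/s, which is a modest load for either a 100 Mbit/s or a gigabit Ethernet link.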

Now I could use some experienced advice on how best to implement this system. This is my idea:

  1. Configure the Ethernet interface on the Arty Z7 in order to receive a frame
  2. Store the frame via DMA in BRAM on PL
  3. Use custom DMA on PL in order to stream RD map through CNN logic
  4. Store processed data in another BRAM on PL
  5. Use UART in order to transfer processed data from BRAM in PL to PC
  6. Repeat

I'd be happy to hear any better ideas! I have no experience with Vitis AI. Might this be a better solution here?

Otherwise, here are my questions regarding this implementation:

  • The receive buffer of the Ethernet interface isn't big enough for a whole frame, so I guess I'll have to handle the 96x96-point map as 96 separate lines (96x2x4 = 768 bytes each) and transfer each line from the receive buffer to BRAM with the PS DMA. Is there a better way?
  • Is it even possible for the PS DMA to write directly to PL BRAM?
  • Do I have to reconfigure the PS DMA every time I receive a line (in order to write to the correct consecutive memory region)? Is it even practical to use DMA then?
  • I first need to do some scaling of the received RD map in floating-point math. Would it be practical to use one of the A9 cores for this? Then it would make more sense to store a whole frame in OCM first, right?
  • What would be the best solution in order to gather the processed data from BRAM again? PS DMA again? Or is it possible to directly access PL DRAM with the UART interface?
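
Whatever moves the data (CPU copy or DMA), the "correct consecutive memory region" for each line is just the line index times 768 bytes. The sketch below models that in Python on a host PC. It is not Zynq driver code, and the 2-byte line-index header is a hypothetical packet format invented for the example, not something specified in this thread.

```python
import struct

LINES = 96
LINE_BYTES = 96 * 2 * 4            # 96 complex points, two 32-bit floats each
FRAME_BYTES = LINES * LINE_BYTES   # 73,728 bytes
HEADER = struct.Struct("<H")       # hypothetical 2-byte line index (assumption)


def make_packet(line_idx, values):
    """Build one hypothetical line packet: index header plus 192 floats."""
    return HEADER.pack(line_idx) + struct.pack("<192f", *values)


def place_line(frame, packet):
    """Copy one line payload to its offset in the frame buffer."""
    (line_idx,) = HEADER.unpack_from(packet, 0)
    payload = packet[HEADER.size:]
    if line_idx >= LINES or len(payload) != LINE_BYTES:
        raise ValueError("malformed line packet")
    offset = line_idx * LINE_BYTES          # destination of this line
    frame[offset:offset + LINE_BYTES] = payload
    return line_idx


frame = bytearray(FRAME_BYTES)
received = set()
# Deliver the lines in reverse order: placement depends only on the index.
for idx in reversed(range(LINES)):
    received.add(place_line(frame, make_packet(idx, [float(idx)] * 192)))

first_value_of_line_5 = struct.unpack_from("<f", frame, 5 * LINE_BYTES)[0]
print(len(received), first_value_of_line_5)
```

Tracking the set of received line indices also gives a simple way to detect when a frame is complete or when a line was dropped.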

Here's what I have working so far...

  • Receive data via Ethernet (without DMA)
  • AXI Stream IP for CNN
  • Transmit data via UART

ANY help is appreciated! And thanks in advance!


Accepted Solutions
drjohnsmith
Teacher
Registered: ‎07-09-2009

Ah, if you're asking about latency between the PL and the PS, then I suggest that you have a big hill to climb.

The PL runs on a different clock from the PS. Depending on the chip, there are a number of different routes between the two sides, each with different characteristics.

For instance, you imply direct access from the PL to the DRAM. If you do that, then while you're transferring, the ARM is "stalled" unless it can operate out of local memory. You also need the ARM to set up the DMA, which may or may not be a problem.

Have a hunt around, e.g.

https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/Zynq-to-PL-BRAM-Read-latency/td-p/1110380

Also, how are you going to communicate control between the PL and the PS? Interrupts? These have to cross the clock boundary, and that adds latency.

A MicroBlaze is effectively a single-clock processor, whilst the Zynq, with its multiple buses and PL/PS split, is much more complex.

These might help:

https://www.aldec.com/en/company/blog/145--demystifying-axi-interconnection-for-zynq-soc-fpga

https://www.mit.bme.hu/system/files/oktatas/targyak/10107/lecture_zynq-slides.pdf

https://www.xilinx.com/support/answers/47266.html

 

 

 

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>

6 Replies
drjohnsmith
Teacher
Registered: ‎07-09-2009

The Arty seems to me to be well underpowered for your needs.

The Ethernet traffic is going to be extremely "bursty"; you're going to need a fair bit of buffering and processing to extract the raw data.

Why use the slow serial link to the PC when you already have Ethernet?

Scatter-gather DMA might be of use for moving data between different memory areas.

Be aware of the links between the PL and the PS. These are "fast" but have a "large" latency, and depending on how the transfer is done they can block the PS or the PL from working while it is in progress.

I'd suggest starting with the Ethernet side and the PS, and ignoring the PL. See what you can do with that; you will then become familiar with the internal data path structure of the Zynq. Then work out what you can, and what you should, move to the PL.

As an example of the extremes: you could put an Ethernet interface in the PL, with a hardware packet filter that feeds the data you want directly to the CNN in the PL, and then send the results out on the same Ethernet link. The Zynq PS would only be used for control (which IP to look at, etc.).

Or you could do it all in software, using the Zynq's ARM processors.

BTW: floating-point maths in the PL is slow and takes up a lot of space. Assuming the ARM is running at 1 GHz and the PL at 250 MHz, the ARM can be faster at a single floating-point operation than the PL, but if you can build, say, 16 floating-point units in the PL, all running in parallel, then the PL is faster.
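
To put rough numbers on that trade-off, here is a deliberately crude model. It assumes every floating-point unit, on the ARM or in the PL, delivers one result per clock cycle once its pipeline is full. That is an idealisation made up for this sketch (it ignores NEON, memory bandwidth, pipeline latency and PL resource limits), and the function name is hypothetical.

```python
def results_per_second(clock_hz, units, results_per_cycle=1.0):
    """Idealised peak throughput: every unit retires results_per_cycle each clock."""
    return clock_hz * units * results_per_cycle


ARM_CLOCK_HZ = 1e9     # clock assumed in the post above
PL_CLOCK_HZ = 250e6    # clock assumed in the post above

arm = results_per_second(ARM_CLOCK_HZ, units=1)
for units in (1, 4, 16):
    pl = results_per_second(PL_CLOCK_HZ, units)
    print(f"{units:2d} PL unit(s): {pl / arm:.2f}x the single ARM unit")
```

Under these assumptions the break-even point is four parallel units, and 16 units give about a 4x advantage; real designs will land somewhere else, so treat this only as a way to frame the question.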

 

     

 

joancab
Mentor
Registered: ‎05-11-2015

74 kB every 100 ms is about 740 kB/s (roughly 6 Mbit/s). Honestly, you may not need to mess around with the PL at all, given a dual-core CPU running at several hundred MHz, unless your CNN is huge (and if it is, it won't fit on an Arty Z7 either).

Have a look at the SDSoC flow; at least it automates the DMA transfers, so that's one less thing you need to care about.

Foxchild
Visitor
Registered: ‎10-03-2020

Alright, I will start with a CPU only implementation and then see if I can accelerate anything with the PL! Thanks!
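
A CPU-only starting point could begin as a plain software reference model like the one below, which can later serve as a golden model when comparing against PL results. Everything specific in it is an assumption for illustration: the thread does not say what the floating-point scaling is (peak-normalised magnitude is used here), what the layers look like (a single 3x3 layer with ReLU stands in), or what the weights are (all ones).

```python
def scale_and_quantize(rd_map):
    """Map complex samples to unsigned 8-bit magnitudes, normalised to the peak."""
    mags = [[abs(v) for v in row] for row in rd_map]
    peak = max(max(row) for row in mags) or 1.0   # avoid dividing by zero
    return [[round(255 * m / peak) for m in row] for row in mags]


def conv2d_valid_relu(image, kernel):
    """'Valid' 2D cross-correlation (the CNN convention) with integer accumulation."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(max(acc, 0))               # ReLU
        out.append(row)
    return out


# Synthetic 96x96 RD map with a single target at row 10, column 20.
rd_map = [[0j] * 96 for _ in range(96)]
rd_map[10][20] = 3 + 4j                           # magnitude 5

KERNEL = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]        # placeholder 8-bit weights

quantized = scale_and_quantize(rd_map)
features = conv2d_valid_relu(quantized, KERNEL)
print(len(features), len(features[0]), features[9][19])
```

Timing a model like this on the A9 (rewritten in C) would show whether the PL is needed at all for a three-layer network at ten frames per second.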

I will look into SDSoC. Do you think Vitis AI could benefit me, too? I have the feeling that Vitis AI is more about training NNs on hardware, or am I wrong?

Foxchild
Visitor
Registered: ‎10-03-2020

I wanted to use UART in order to transfer data to the PC again, because it has to be a different PC than the one sending the RD maps. Now, I could use a network switch since the Arty has only one RJ45 connector, but I'd rather try to keep the device count low.
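
Whether UART is fast enough depends on how much data the CNN produces per frame, which isn't stated here. The sketch below works out the byte budget per 100 ms period. The 115200 baud rate and 8N1 framing are assumptions picked as a common default, not values from this thread.

```python
def uart_payload_bytes_per_s(baud, data_bits=8, start_bits=1, stop_bits=1,
                             parity_bits=0):
    """Payload bytes per second for an asynchronous serial link."""
    bits_per_byte = start_bits + data_bits + parity_bits + stop_bits
    return baud / bits_per_byte


BAUD = 115200                     # assumed baud rate
PERIOD_S = 0.100                  # nominal frame period
RAW_FRAME_BYTES = 96 * 96 * 2 * 4

rate = uart_payload_bytes_per_s(BAUD)
budget_per_period = rate * PERIOD_S
raw_frame_seconds = RAW_FRAME_BYTES / rate

print(f"payload rate:          {rate:.0f} bytes/s")
print(f"budget per period:     {budget_per_period:.0f} bytes")
print(f"raw frame would take:  {raw_frame_seconds:.1f} s")
```

So under these assumptions the link keeps up only if the processed result is around a kilobyte per frame or less; an unprocessed frame would take several seconds.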

Why is there latency between the PS and the PL? I thought the PL could be connected to an AXI master and memory-mapped to the CPU. Wouldn't that be the same as with a MicroBlaze?

You're right... I'll try to do everything on the PS first and then move on to the PL.

One more question... There is a DMA block at the Ethernet interface of the Zynq. Does that mean that the Ethernet interface has its own DMA controller? Or does it mean that the PS DMA has access to the Ethernet interface? In my mind it would be a simple task to configure the DMA to write continuously to OCM, for instance, but I guess in reality it's going to be rather difficult.

drjohnsmith
Teacher
Registered: ‎07-09-2009

Vitis, IMHO, is a great way to generate tons of very big code that always needs a bigger chip than you expect, unless you have a DEEP knowledge of how the code is going to be implemented on the FPGA, i.e. you code for the tools.

Maybe in ten years' time, when a Virtex UP costs a few dollars...

 
