01-03-2021 04:03 AM
For my thesis I want to implement a CNN (convolutional neural network) on a Zynq-7000 board (Arty Z7). I have some experience with Xilinx (mostly Microblaze designs) but now I am struggling with the outline of my project.
I need to do the following:
Some details if you care:
Now I could use some advice on how to implement this system. This is my idea:
I'd be happy to hear any better ideas!
Otherwise, here are my questions regarding this implementation:
Here's what I have working so far...
ANY help is appreciated! And thanks in advance!
01-03-2021 07:55 AM
Ah,
if you're asking about latency between PL and PS,
then I suggest that you have a big hill to climb.
The PL runs on a different clock than the PS.
Depending on the chip, there are a number of different routes between the two sides,
each with different characteristics.
For instance, you imply direct access from the PL to the DRAM.
If you do that, then while you're transferring, the ARM is "stalled" unless it can operate out of local memory.
You also need the ARM to set up the DMA?
That may or may not be a problem.
have a hunt
e.g.
Also, how are you going to communicate control between the PL and PS?
Interrupts?
These have to cross the clock boundary and have latency in them.
A MicroBlaze is effectively a single-clock processor,
whilst the Zynq, with its multiple buses and PL/PS split, is much more complex.
This might help
https://www.aldec.com/en/company/blog/145--demystifying-axi-interconnection-for-zynq-soc-fpga
https://www.mit.bme.hu/system/files/oktatas/targyak/10107/lecture_zynq-slides.pdf
https://www.xilinx.com/support/answers/47266.html
01-03-2021 04:54 AM - edited 01-03-2021 04:57 AM
The Arty seems to me to be well underpowered for your needs.
The Ethernet traffic is going to be extremely "bursty"; you're going to need a fair bit of buffering and processing to extract the raw data.
Why use the slow serial link to the PC when you already have Ethernet?
Scatter-gather DMA might be of use to move data around from/to different memory areas.
Be aware of the links between the PL and PS:
these are "fast" but have a "large" latency, and can block the PS or PL from working whilst they are transferring, depending on how it is done.
I'd suggest:
start off with the Ethernet side and the PS, and ignore the PL.
See what you can do on that;
you will then become familiar with the internal data path structure of the Zynq processors.
Then work out what you can and what you should move to the PL.
As an example of the extremes,
you could do an Ethernet interface on the PL side, with a hardware packet filter that feeds the data you want directly to the CNN in the PL,
then send the results out on the same Ethernet. The Zynq side would only be used for control (what IP to look at, etc.).
Or you could do it all in software, using the Zynq ARM processors.
BTW: floating-point maths in the PL is slow and takes up a lot of space. Assuming the ARM is running at 1 GHz and the PL is running at 250 MHz, the ARM can be faster at a single floating-point operation than the PL,
but if you can make, say, 16 floating-point units in the PL, all running in parallel, then the PL is faster.
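To put rough numbers on the 1 GHz vs. 250 MHz comparison, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each floating-point unit (the ARM's or one in the PL) completes one operation per clock cycle once its pipeline is full; real units have multi-cycle latency and the actual figures depend on the design. The function names are made up for this example.

```python
def flops_per_second(clock_hz: int, units: int, ops_per_cycle: int = 1) -> int:
    """Peak throughput of `units` parallel FP units (assumed: 1 op/cycle each)."""
    return clock_hz * units * ops_per_cycle


def break_even_units(cpu_hz: int, pl_hz: int) -> int:
    """Smallest number of PL units that matches one CPU unit (ceiling division)."""
    return -(-cpu_hz // pl_hz)


ARM_CLOCK_HZ = 1_000_000_000  # clock assumed in the post above
PL_CLOCK_HZ = 250_000_000     # clock assumed in the post above

arm = flops_per_second(ARM_CLOCK_HZ, units=1)
pl_single = flops_per_second(PL_CLOCK_HZ, units=1)
pl_parallel = flops_per_second(PL_CLOCK_HZ, units=16)
needed = break_even_units(ARM_CLOCK_HZ, PL_CLOCK_HZ)

print(f"ARM, 1 unit:   {arm / 1e9:.2f} GFLOP/s")
print(f"PL, 1 unit:    {pl_single / 1e9:.2f} GFLOP/s")
print(f"PL, 16 units:  {pl_parallel / 1e9:.2f} GFLOP/s")
print(f"PL units needed to match the ARM: {needed}")
```

Under these assumptions a single PL unit is four times slower than the ARM, the break-even point is four parallel units, and 16 units give about four times the ARM's peak rate.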
01-03-2021 05:05 AM
74 kB every 100 ms is 740 kB/s (roughly 6 Mbit/s). Honestly, with a dual-core ARM CPU running at several hundred MHz, you may not need to mess around with the PL at all, unless your CNN is huge (but if so, it won't fit on an Arty Z7 either).
Have a look at the SDSoC flow; at least it automates the DMA transfers, so that's one thing less you need to care about.
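Working the numbers through: 74 kB every 100 ms comes to 740 kB/s, i.e. just under 6 Mbit/s. A minimal sketch of that arithmetic follows, assuming 1 kB = 1000 bytes and, as a conservative assumption that is not stated in the thread, a 100 Mbit/s link.

```python
MAP_BYTES = 74_000            # one 74 kB map (assuming 1 kB = 1000 bytes)
PERIOD_MS = 100               # one map every 100 ms
LINK_BITS_PER_S = 100_000_000 # assumed link speed, for illustration only

# Bytes per second: scale the per-period amount up to a full second.
bytes_per_s = MAP_BYTES * 1000 // PERIOD_MS
mbit_per_s = bytes_per_s * 8 / 1e6
link_utilisation = bytes_per_s * 8 / LINK_BITS_PER_S

print(f"{bytes_per_s / 1000:.0f} kB/s")
print(f"{mbit_per_s:.2f} Mbit/s")
print(f"{link_utilisation:.1%} of the assumed link")
```

Either way, the raw data rate is small compared with what the Ethernet link and the CPU can move, which is why trying a CPU-only version first is reasonable.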
01-03-2021 06:03 AM
Alright, I will start with a CPU only implementation and then see if I can accelerate anything with the PL! Thanks!
I will look into SDSoC. Do you think Vitis AI could be of benefit for me, too? I have the feeling that Vitis AI is more about training of NNs on hardware, or maybe not?
01-03-2021 06:11 AM
I wanted to use UART in order to transfer data to the PC again, because it has to be a different PC than the one sending the RD maps. Now, I could use a network switch since the Arty has only one RJ45 connector, but I'd rather try to keep the device count low.
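Whether UART is fast enough depends on how much result data has to leave the board per frame. A rough sketch of the budget follows, assuming 115200 baud with 8N1 framing and the 100 ms frame period mentioned in the thread; the baud rate and framing are assumptions, not something stated above, and the helper names are made up for this example.

```python
def uart_payload_bytes_per_s(baud: int, data_bits: int = 8,
                             start_bits: int = 1, stop_bits: int = 1,
                             parity_bits: int = 0) -> float:
    """Payload bytes per second after start/stop/parity framing overhead."""
    bits_per_frame = start_bits + data_bits + parity_bits + stop_bits
    return baud * data_bits / bits_per_frame / 8


def max_result_bytes(baud: int, period_ms: int) -> int:
    """Largest result (in bytes) that fits into one period at the given baud rate."""
    return int(uart_payload_bytes_per_s(baud) * period_ms / 1000)


BAUD = 115_200   # assumed baud rate
PERIOD_MS = 100  # one result per 100 ms frame

rate = uart_payload_bytes_per_s(BAUD)
budget = max_result_bytes(BAUD, PERIOD_MS)

print(f"UART payload rate: {rate:.0f} bytes/s")
print(f"Max result size per {PERIOD_MS} ms frame: {budget} bytes")
```

With these assumptions, a CNN output of a few hundred bytes per frame fits comfortably, but anything close to the size of the input maps does not.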
Why is there latency between PS and PL? I thought the PL could be connected to an AXI master and memory mapped to the CPU? Wouldn't that be the same as with a MicroBlaze?
You're right... I'll try to do everything on PS first and then move on to PL.
One more question... There is a DMA block at the Ethernet interface of the Zynq. Does that mean that the Ethernet interface has its own DMA controller? Or does it mean that the PS DMA has access to the Ethernet interface? In my mind it would be a simple task to configure the DMA to continuously write to OCM, for instance, but I guess in reality it's going to be rather difficult.
01-03-2021 07:44 AM
Vitis,
IMHO,
a great way to generate tons of very big code that always needs a bigger chip than you expect,
unless you have a DEEP knowledge of how the code is going to be implemented on the FPGA, i.e. you code for the tools.
Maybe in ten years' time, when a Virtex UP costs a few dollars...