MicroZed Chronicles: The Deep Learning Processing Unit

Xilinx Employee

This content is republished from the MicroZed Chronicles, with permission from the author and Hackster.io.

 

A few weeks ago we looked at the Xilinx Deep Neural Network Development Kit and the DNNDK framework.

 


Deep Learning Processor Unit in the design

 

In this blog we are going to take a deep dive into the element at the heart of the DNNDK: the Deep Learning Processor Unit, or DPU as it is commonly called.

Using the DPU with DNNDK enables us to implement Convolutional Neural Networks (CNNs) in our Zynq and Zynq MPSoC solutions.

The DPU is instantiated in the programmable logic, and requires connections to both the processor and the external memory. The external memory stores both the instructions and images for classification, while the processor responds to interrupts from the DPU to synchronize operation.

DPU internal architecture (source Xilinx PG338)

 

From an interfacing point of view, the DPU is very simple, consisting of multiple AXI interfaces, interrupts, clocks, and resets.

  • Master DPU Instruction Interface (32 bits)
  • Two Master DPU Data Interfaces (128 bits)
  • Slave DPU Interface (32 bits)


DPU interfaces

 

Aside from the slave AXI clock, the DPU core uses two clocks: the master AXI interface clock (m_axi_dpu_clk) and a clock twice this frequency (dpu_2x_clk). To achieve timing closure, these clocks need to be synchronized; therefore, a clock wizard should be used to generate both clocks.

To ensure maximum performance, the master AXI clock should be set to 333 MHz, which is the maximum clock rate for the AXI interfaces. Of course, this means dpu_2x_clk must be clocked at 666 MHz. Ensure the matched routing option is enabled in the clock wizard.
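To put numbers on the clocking and interface widths described above, here is a minimal sketch. It assumes both 128-bit data masters run at m_axi_dpu_clk and transfer one beat every cycle, which gives a theoretical upper bound only; the function names are hypothetical and not part of any Xilinx tool.

```python
# Sketch only: relates the DPU clocks and AXI data widths quoted in the text.
# Assumption: each data master moves one full-width beat per clock cycle,
# so the result is a theoretical ceiling, not a measured bandwidth.

AXI_CLK_MHZ = 333        # m_axi_dpu_clk, from the text
DATA_WIDTH_BITS = 128    # width of each master data interface
NUM_DATA_MASTERS = 2     # two master data interfaces


def dpu_2x_clk_mhz(axi_clk_mhz: int) -> int:
    """dpu_2x_clk runs at twice the master AXI clock."""
    return 2 * axi_clk_mhz


def peak_data_bandwidth_mbytes(axi_clk_mhz: int,
                               width_bits: int = DATA_WIDTH_BITS,
                               masters: int = NUM_DATA_MASTERS) -> int:
    """Upper bound on data interface bandwidth in MB/s (10^6 bytes/s)."""
    bytes_per_beat = width_bits // 8
    return axi_clk_mhz * bytes_per_beat * masters


if __name__ == "__main__":
    print("dpu_2x_clk (MHz):", dpu_2x_clk_mhz(AXI_CLK_MHZ))
    print("peak data bandwidth (MB/s):", peak_data_bandwidth_mbytes(AXI_CLK_MHZ))
```

Real throughput will be lower, as it depends on the interconnect, the memory controller, and the access pattern of the network being run.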


Clock wizard configuration

 

Once the AXI interfaces are connected, we can then assign the memory addresses. To ensure we can work with the DNNDK, we need to assign at least 16 MB of memory to the DPU.
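The 16 MB requirement is easy to sanity check before building. The sketch below is a hypothetical helper, not a Vivado API. Beyond the minimum size from the text, it also checks that the range is a power of two and that the base is aligned to the size, which is an assumption about how address ranges are usually assigned. The base address in the example is purely illustrative.

```python
# Sketch only: sanity check for the address range assigned to the DPU.
# The 16 MB minimum comes from the text; the power-of-two and alignment
# checks are assumptions about typical AXI address assignment.

MIN_DPU_RANGE = 16 * 1024 * 1024  # 16 MB


def check_dpu_range(base: int, size: int) -> list[str]:
    """Return a list of problems found; an empty list means the range looks fine."""
    problems = []
    if size < MIN_DPU_RANGE:
        problems.append(f"range is {size} bytes, DNNDK needs at least {MIN_DPU_RANGE}")
    if size <= 0 or size & (size - 1):
        problems.append("range size is not a power of two")
    elif base % size:
        problems.append("base address is not aligned to the range size")
    return problems


if __name__ == "__main__":
    # Hypothetical base address, chosen only for illustration.
    print(check_dpu_range(0x8F00_0000, 16 * 1024 * 1024))
    print(check_dpu_range(0x8F00_0000, 8 * 1024 * 1024))
```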

To work with the DNNDK, the first DPU interrupt must be connected to IRQ 10. IRQ1[7:0] carries IRQs 8 to 15, so IRQ 10 is bit 2 of IRQ1[7:0]. We can use concatenate and constant blocks to ensure the correct interrupt is used.
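The IRQ arithmetic can be sketched in a few lines. This assumes the sixteen PL to PS interrupts are split across two 8-bit ports, IRQ0[7:0] for IRQs 0 to 7 and IRQ1[7:0] for IRQs 8 to 15; the function names are hypothetical. The second function shows the layout that the concatenate and constant blocks need to produce.

```python
# Sketch only: map a PL-to-PS IRQ number to its port and bit position.
# Assumption: two 8-bit ports, IRQ0[7:0] = IRQs 0..7, IRQ1[7:0] = IRQs 8..15.

def irq_location(irq: int) -> tuple[str, int]:
    """Return (port name, bit index) for a PL-to-PS IRQ number."""
    if not 0 <= irq <= 15:
        raise ValueError("PL-to-PS IRQ number must be in the range 0..15")
    port, bit = divmod(irq, 8)
    return f"IRQ{port}", bit


def concat_inputs(bit: int, width: int = 8) -> list[str]:
    """Inputs to the concatenate block, index 0 = bit 0.

    Unused bits are tied to a constant zero; the DPU interrupt sits at `bit`.
    """
    return ["dpu_interrupt" if i == bit else "const_0" for i in range(width)]


if __name__ == "__main__":
    port, bit = irq_location(10)
    print(port, bit)
    print(concat_inputs(bit))
```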


Connecting the IRQs correctly

 

With the design now connected into the processing system, we can focus a little more on the configuration of the IP core itself.

The first thing we need to decide is the number of DPU cores the DPU IP should contain. We can have between one and three cores in our solution.

Customizing the DPU IP

 

The second is the actual architecture of the cores. There are eight available architectures. The number in the architecture name Bxxx is the peak number of operations per clock cycle. To provide a range of peak operations, the different core architectures have different levels of pixel, input channel, and output channel parallelism.

Architecture configurations (Source Xilinx PG338)
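The relationship between the architecture name and the three parallelism levels can be checked with a short sketch. The parallelism values below are my recollection of the PG338 table and should be verified against the figure above; the factor of two assumes a multiply and an accumulate are counted as two operations.

```python
# Sketch only: peak operations per clock = 2 * PP * ICP * OCP.
# The (PP, ICP, OCP) values are recalled from PG338 and are an assumption
# here; check them against the architecture table before relying on them.

ARCHITECTURES = {
    # name: (pixel parallelism, input channel parallelism, output channel parallelism)
    "B512":  (4, 8, 8),
    "B800":  (4, 10, 10),
    "B1024": (8, 8, 8),
    "B1152": (4, 12, 12),
    "B1600": (8, 10, 10),
    "B2304": (8, 12, 12),
    "B3136": (8, 14, 14),
    "B4096": (8, 16, 16),
}


def peak_ops_per_clock(pp: int, icp: int, ocp: int) -> int:
    """Each multiply-accumulate counts as two operations."""
    return 2 * pp * icp * ocp


if __name__ == "__main__":
    for name, (pp, icp, ocp) in ARCHITECTURES.items():
        print(f"{name}: PP={pp} ICP={icp} OCP={ocp} -> {peak_ops_per_clock(pp, icp, ocp)} ops/clk")
```

With these values, the computed figure matches the number in each architecture name, which is a useful consistency check on the table.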

 

Selecting the DSP cascade length is, as always, a trade-off between resource utilization and timing performance. Longer cascade lengths use less logic but offer worse timing performance, while shorter cascade lengths use more logic yet offer better timing performance. Using longer cascade lengths is therefore more useful in smaller devices where logic resources are limited.

The final DSP option, low or high DSP usage, relates to how the DPU IP core implements DSP elements.

  • Low: DSP elements are used for multiplication only
  • High: DSP elements are used for multiplication and accumulation

Again, the low setting is intended for smaller devices which offer limited DSP resources.

The final option is whether we desire UltraRAM to be used in the DPU IP. This is not available on all devices, but when it is, it can be used in place of BRAM.
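The configuration choices walked through above can be collected into one record, which is handy when comparing builds. This is a hypothetical sketch, not the IP's actual parameter interface: the field names and defaults are my own, and only the constraints stated in the text (one to three cores, eight architectures, low or high DSP usage) are validated.

```python
# Sketch only: a record of the DPU IP options discussed in the text.
# Field names and default values are hypothetical, not Vivado parameter names.
from dataclasses import dataclass

VALID_ARCHS = ("B512", "B800", "B1024", "B1152",
               "B1600", "B2304", "B3136", "B4096")


@dataclass(frozen=True)
class DpuConfig:
    num_cores: int = 1            # one to three cores per DPU IP
    arch: str = "B4096"           # one of the eight Bxxx architectures
    dsp_cascade_length: int = 4   # assumed default; longer = less logic, worse timing
    dsp_usage: str = "high"       # "low": multiply only, "high": multiply and accumulate
    use_ultraram: bool = False    # only meaningful on devices that have UltraRAM

    def __post_init__(self):
        if not 1 <= self.num_cores <= 3:
            raise ValueError("num_cores must be between 1 and 3")
        if self.arch not in VALID_ARCHS:
            raise ValueError(f"arch must be one of {VALID_ARCHS}")
        if self.dsp_cascade_length < 1:
            raise ValueError("dsp_cascade_length must be at least 1")
        if self.dsp_usage not in ("low", "high"):
            raise ValueError("dsp_usage must be 'low' or 'high'")


if __name__ == "__main__":
    # A configuration aimed at a smaller device with limited DSP resources.
    print(DpuConfig(num_cores=1, arch="B1152", dsp_usage="low"))
```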

Once all this is configured as desired, we can implement the design. When I implemented the above design, the utilization was as shown below:

Utilization and Power for ZU9EG implementation

 


Implemented floor plan

 

If you want to understand a little more about the DPU, take a look at the Technical Reference Design available freely here.

Now that we have a Vivado bitstream, we need to integrate it with the rest of the DNNDK stack and start using the solution for our CNN application.

Keep an eye on my Hackster project, as there will be an in-depth tutorial appearing there soon!

 

See My FPGA / SoC Projects: Adam Taylor on Hackster.io

Get the Code: ATaylorCEngFIET (Adam Taylor)

Access the MicroZed Chronicles Archives with over 260 articles on the Zynq / Zynq MPSoC updated weekly at MicroZed Chronicles.