
Versal ACAP AI Engines for Dummies


By Olivier Tremois, AI Engine Tools Technical Marketing and Florent Werbrouck, Xilinx Technical Support Product Application Engineer


Introduction to Versal™ ACAPs


Versal™ Adaptive Compute Acceleration Platforms (ACAPs) are the latest generation of Xilinx devices, built on the TSMC 7 nm FinFET process technology. They combine Scalar Engines (the Processor System (PS)), Adaptable Engines (the Programmable Logic (PL)), and Intelligent Engines, all connected together through a high-bandwidth Network-on-Chip (NoC).



In this article, the focus is on the AI Engines which are part of the Intelligent Engines.

Introduction to the Xilinx AI Engines

The AI Engines are included in some Xilinx Versal ACAPs. They are organized as a two-dimensional array of AI Engine tiles, connected together with memory, stream, and cascade interfaces. This array can contain up to 400 tiles on current ACAP devices (for example, the VC1902 device). The array also includes an AI Engine interface, located at the last row of the array, which allows the array to communicate with the rest of the device (PS, PL, and NoC).



The AI Engine interface includes PL and NoC interface tiles and a configuration tile. The PL communicates with the AI Engine array over AXI4-Stream interfaces through both the PL and NoC interface tiles. The NoC communicates with the AI Engine array over memory-mapped AXI4 interfaces through the NoC interface tiles.




Note that a direct memory-mapped AXI4 communication channel is only available from the NoC to the AI Engine tiles, and not from the AI Engine tiles to the NoC.

Note: The exact number of PL and NoC interface tiles is device specific. The Versal Architecture and Product Data Sheet: Overview (DS950) lists the size of the AI Engine array.


AI Engine Tile Architecture


Let's now have a closer look at the array and see what is inside an AI Engine tile.



Each AI Engine tile includes:

  • One tile interconnect module, which handles AXI4-Stream and memory-mapped AXI4 input/output
  • One memory module, which includes a 32 KB data memory divided into eight memory banks, a memory interface, DMA, and locks
  • One AI Engine

The AI Engine can access up to four memory modules, in all four directions, as one contiguous block of memory. This means that, in addition to the memory local to its tile, the AI Engine can access the local memory of three neighboring tiles (unless the tile is located on an edge of the array):

  • The memory module on the north
  • The memory module on the south
  • The memory module on the east or west depending on the row and the relative placement of the AI Engine and memory module.

AI Engine Architecture


The AI Engine is a highly optimized processor with the following highlights:

  • A 32-bit scalar RISC processor (called the Scalar Unit)
  • A 512-bit SIMD vector unit featuring a fixed-point/integer vector unit and a single-precision floating-point (SPFP) vector unit
  • Three address generator units (AGUs)
  • Very long instruction word (VLIW) support
  • Three data memory ports (two load ports and one store port)
  • A direct stream interface (two input streams and two output streams)





Programming the AI Engine array

AI Engine tiles come in arrays of tens or hundreds of tiles. Creating a single program embedding directives to specify the parallelism would be a tedious, almost impossible task. That is why the programming model of the AI Engine array is close to Kahn Process Networks, in which autonomous computing processes are connected to each other by communication edges, forming a network of processes.

In the AI Engine framework, edges of the graph are buffers and streams and the computing processes are called kernels. The kernels are instantiated and connected together and to the rest of the design (NoC or PL) within graphs.

The programming flow has two stages:

Single Kernel Programming:

A kernel describes a specific computing process. One kernel will run on a single AI Engine tile. However, note that multiple kernels can run on the same AI Engine tile, sharing the processing time.

Any C/C++ code can be used to program the AI Engine. The scalar processor will handle the majority of the code. If your goal is to design a high-performance kernel, then you will target the vector processor using specialized functions called intrinsics. These functions are dedicated to the vector processor of the AI Engine and will allow you to extract massive processing performance from it.

Xilinx will provide pre-built kernels included in libraries that users will be able to use in their custom graphs.

Graph programming:

Xilinx will provide a C++ framework to create graphs from kernels. This framework includes declarations for graph nodes and connections. These nodes can be either in the AI Engine array or within the Programmable Logic (HLS kernels). To give full control over kernel location, a set of methods will constrain the placer (kernels, buffers, system memory, ...). A graph instantiates and connects the kernels together using buffers and streams. It also describes the data transfers between the AI Engine array and the rest of the ACAP device (PL, DDR).

Xilinx will provide pre-built graphs included in libraries that users will then be able to use in their applications.


At runtime and during simulation, the AI Engine application is controlled by the PS.

Xilinx will provide multiple APIs, such as the following, depending on the application OS:

  • Xilinx Run Time (XRT) and OpenCL for Linux applications
  • Bare-metal drivers

References and further information

For more information on Versal ACAPs, visit the Versal product pages on xilinx.com.

For more information about the Versal AI Engine, refer to the Versal ACAP AI Engine Architecture Manual (AM009).

Note that the Versal ACAP AI Engine is still Early Access until the 2020.2 release. No information beyond what is documented in AM009 will be provided.

The AIE programming tools are also in Early Access. They will be in public access starting from the 2020.2 release.
