By Olivier Tremois, AI Engine Tools Technical Marketing and Florent Werbrouck, Xilinx Technical Support Product Application Engineer
Introduction to Versal™ ACAPs
Versal™ Adaptive Compute Acceleration Platforms (ACAPs) are the latest generation of Xilinx devices, built on the TSMC 7 nm FinFET process technology. They combine Scalar Engines (the Processor System, or PS), Adaptable Engines (the Programmable Logic, or PL), and Intelligent Engines, all connected through a high-bandwidth Network-on-Chip (NoC).
This article focuses on the AI Engines, which are part of the Intelligent Engines.
Introduction to the Xilinx AI Engines
The AI Engines are included in some Xilinx Versal ACAPs. They are organized as a two-dimensional array of AI Engine tiles connected together through memory, stream, and cascade interfaces. This array can contain up to 400 tiles on current ACAP devices (for example, the VC1902 device). The array also includes an AI Engine interface, located on the last row of the array, which allows the array to communicate with the rest of the device (PS, PL, and NoC).
The AI Engine interface includes PL and NoC interface tiles and a configuration tile. The PL communicates with the AI Engine array through AXI4-Stream interfaces in both the PL and NoC interface tiles. The NoC communicates with the AI Engine array through memory-mapped AXI4 interfaces in the NoC interface tiles.
Note that a direct memory-mapped AXI4 channel is only available from the NoC to the AI Engine tiles, and not from the AI Engine tiles to the NoC.
Let's now have a closer look at the array and see what is inside an AI Engine tile.
Each AI Engine tile includes:
One tile interconnect module which handles AXI4-Stream and memory-mapped AXI4 input/output
One memory module which includes a 32 KB data memory divided into eight memory banks, a memory interface, DMA, and locks.
One AI Engine
The AI Engine can access up to four memory modules, one in each of the four directions, as a single contiguous block of memory. This means that in addition to the memory module local to its tile, the AI Engine can access the local memory of three neighboring tiles (unless the tile is located on an edge of the array):
The memory module on the north
The memory module on the south
The memory module on the east or west depending on the row and the relative placement of the AI Engine and memory module.
AI Engine Architecture
The AI Engine is a highly optimized processor whose highlights include:
32-bit scalar RISC processor (called Scalar Unit)
A 512-bit SIMD vector unit featuring a fixed-point/integer vector unit and a single-precision floating-point (SPFP) vector unit
Three address generator units (AGU)
Very long instruction word (VLIW) support
Three data memory ports (two load ports and one store port)
Direct stream interface (two input streams and two output streams)
Programming the AI Engine array
AI Engine tiles come in arrays of tens to hundreds of units. Creating a single program with embedded directives to specify the parallelism would be a tedious, almost impossible task. That is why the programming model of the AI Engine array is close to Kahn Process Networks, where autonomous computing processes are connected to each other by communication edges, forming a network of processes (cf. https://perso.ensta-paris.fr/~chapoutot/various/kahn_networks.pdf).
In the AI Engine framework, edges of the graph are buffers and streams and the computing processes are called kernels. The kernels are instantiated and connected together and to the rest of the design (NoC or PL) within graphs.
The programming flow consists of two stages:
Single Kernel Programming:
A kernel describes a specific computing process. One kernel will run on a single AI Engine tile. However, note that multiple kernels can run on the same AI Engine tile, sharing the processing time.
Any C/C++ code can be used to program the AI Engine; the scalar processor will handle the majority of the code. If your goal is to design a high-performance kernel, you will target the vector processor using specialized functions called intrinsics. These functions are dedicated to the vector processor of the AI Engine and allow you to extract massive processing performance from it.
Xilinx will provide pre-built kernels included in libraries that users will be able to use in their custom graphs.
Graph Programming:
Xilinx will provide a C++ framework to create graphs from kernels. This framework includes declarations for graph nodes and connections. These nodes can be located either in the AI Engine array or in the Programmable Logic (HLS kernels). To give full control over kernel location, a set of methods will constrain the placer (kernels, buffers, system memory, ...). A graph instantiates the kernels and connects them together using buffers and streams. It also describes the data transfers between the AI Engine array and the rest of the ACAP device (PL, DDR).
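As an illustration, a one-kernel graph in such a framework could look like the following sketch. This is modeled on the adf API; the kernel function simple, the source file path, and the window sizes are placeholder assumptions, and building it requires the Vitis AI Engine tools rather than a plain C++ compiler.

```cpp
#include <adf.h>

// Sketch of a one-kernel graph: data enters the array through the graph
// input, flows through the kernel over window buffers, and exits again.
class SimpleGraph : public adf::graph {
private:
    adf::kernel k;                        // the computing process
public:
    adf::port<input>  in;                 // graph input (e.g., from the PL)
    adf::port<output> out;                // graph output
    SimpleGraph() {
        k = adf::kernel::create(simple);  // 'simple' is a user kernel function
        adf::source(k) = "kernels/simple.cc";
        adf::runtime<adf::ratio>(k) = 0.9;  // share of one tile's processing time
        adf::connect< adf::window<128> >(in, k.in[0]);   // 128-byte window edges
        adf::connect< adf::window<128> >(k.out[0], out);
    }
};
```

Note how the graph carries both the network structure (kernels and edges) and placement-relevant information such as the runtime ratio, which tells the tools how much of a tile's processing time the kernel needs.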
Xilinx will provide pre-built graphs included in libraries that users will then be able to use in their applications.
During runtime and simulation, the AI Engine application is controlled by the PS.
Xilinx will provide multiple APIs, such as the following, depending on the application OS:
Xilinx Run Time (XRT) and OpenCL for Linux applications