In the previous entry in the AI Engine Series, we looked at the graph file, which is the top level of an AI Engine application. We saw how this graph file is used to instantiate kernels and connect them to each other and to the ports of the AI Engine array.
In this entry we will look at the kernels themselves. In the template we are examining, the two kernels, called first and second, implement the same function, which is called simple.
An AI Engine kernel is a C/C++ program which is written using specialized intrinsic calls that target the VLIW vector processor. The AI Engine kernel code is compiled using the AI Engine compiler (aiecompiler) that is included in the Vitis™ core development kit. The AI Engine compiler compiles the kernels to produce an ELF file that is run on one AI Engine.
Multiple kernels can run on a single AI Engine; however, a single kernel cannot run on multiple AI Engines. It is up to users to partition their function into multiple kernels when it needs to run across multiple AI Engines.
Open the file kernels.cc under src/kernels:
As in the graph file, we can see that the adf dataflow library header (adf.h) has been included. The other header, include.h, defines the parameters for the kernel. In this case, only the number of samples, NUM_SAMPLES, is defined, with a value of 32.
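Based on that description, include.h likely amounts to a single parameter definition. This is a reconstruction from the text, not the verbatim file:

```cpp
// include.h (sketch): kernel parameters used by kernels.cc.
// The article states that only NUM_SAMPLES is defined, with a value of 32.
#ifndef INCLUDE_H
#define INCLUDE_H

#define NUM_SAMPLES 32  // number of cint16 samples processed per kernel invocation

#endif
```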
On line 6, we can see the prototype of the kernel function:
void simple(input_window_cint16 * in, output_window_cint16 * out)
First, note that the function does not return a value (its return type is void). This is a requirement for all kernels.
Then we can see that the kernel function has two parameters:
The first parameter, called in, is an input window interface of 16-bit complex samples (input_window_cint16). This means that each input sample is a 32-bit word (16 bits for the real part and 16 bits for the imaginary part).
The second parameter, called out, is an output window interface of 16-bit complex samples (output_window_cint16).
Lines 8 to 13 contain the kernel's processing loop:
Each iteration of the loop consumes one 16-bit complex input sample (using window_readincr), which is stored in the variable c1, and produces one 16-bit complex output sample (using window_writeincr) taken from the variable c2.
For each sample, the real part of the output is the sum of the real and imaginary parts of the input, and the imaginary part of the output is the difference of the real and imaginary parts of the input.
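Outside the AIE toolchain, this arithmetic can be checked with a plain C++ scalar model. This is a sketch only: Cint16 and simple_model are illustrative names, not the adf API, and the window read/write calls are replaced by vector accesses.

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-in for the adf cint16 type.
struct Cint16 { int16_t real; int16_t imag; };

// Scalar reference model of the kernel's per-sample computation:
// out.real = in.real + in.imag, out.imag = in.real - in.imag.
std::vector<Cint16> simple_model(const std::vector<Cint16>& in) {
    std::vector<Cint16> out;
    out.reserve(in.size());
    for (const Cint16& c1 : in) {
        Cint16 c2;
        c2.real = static_cast<int16_t>(c1.real + c1.imag);
        c2.imag = static_cast<int16_t>(c1.real - c1.imag);
        out.push_back(c2);
    }
    return out;
}
```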
One thing to note is that this operation does not use AIE intrinsics (the functions that target the vector processor), so it will run only on the scalar unit and will not take advantage of the vector processing unit. As a result, the kernel will not achieve the best performance the AI Engine can offer.
Because the loop has 32 iterations, each consuming one 32-bit sample (a 16-bit complex sample) and producing one 32-bit sample, the kernel requires a 128-byte (32*32/8) input window and a 128-byte output window. If you remember from the previous article, this is the value that was set in the graph for connecting the kernels.
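The sizing arithmetic can be written out explicitly (illustrative constants only, not part of the template):

```cpp
#include <cstddef>

// Each cint16 sample is 4 bytes: 16-bit real part + 16-bit imaginary part.
constexpr std::size_t NUM_SAMPLES      = 32;
constexpr std::size_t BYTES_PER_SAMPLE = 4;
constexpr std::size_t WINDOW_BYTES     = NUM_SAMPLES * BYTES_PER_SAMPLE;

// 32 samples x 32 bits / 8 bits-per-byte = 128 bytes per window.
static_assert(WINDOW_BYTES == 128, "window size must match the graph connection");
```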
The kernels from the template we are looking at are accessing data using windows. Another type of access available for AIE kernels is stream access.
Window Based Access
When a kernel uses window-based access, it reads directly from the local memory of the AI Engine it is running on, or from the local memory of one of the neighboring AI Engines.
The kernel starts running on the AI Engine only when the input window is full. This means that the first invocation of the kernel incurs an initial latency while the input window is filled. However, because the memory can be used as a ping-pong buffer, the next set of data can be written into memory during kernel execution, ready for the next iteration.
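The ping-pong idea can be sketched in plain C++ (an illustration of the concept only; in a real design the adf tooling manages these buffers for you):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

constexpr std::size_t WINDOW = 32;  // samples per window, as in the template
using Window = std::array<int16_t, WINDOW>;

// Two halves of memory: the producer fills one half while the consumer
// (the kernel) reads the other; the roles swap once a window is full.
struct PingPong {
    Window buf[2];
    int write_idx = 0;                            // half currently being filled
    Window& write_buffer() { return buf[write_idx]; }
    Window& read_buffer()  { return buf[1 - write_idx]; }
    void swap() { write_idx = 1 - write_idx; }    // call when the window is full
};
```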
Stream Based Access
When a kernel uses stream-based access, it reads directly from an AXI-Stream interface. In this case the kernel reads the data sample by sample.
Using streams can introduce back pressure on an upstream kernel if the downstream kernel cannot process the data fast enough. Conversely, streams can stall a downstream kernel if the upstream kernel cannot produce data fast enough.
In the first three entries of this series, we have covered the different files of an AI Engine application. In the next article, we will run the AI Engine compiler targeting x86 and then run the x86 simulator.