
AI Engine Series 8 - Introduction to the Run-Time Ratio parameter

florentw
Moderator

Introduction

 

In the Versal AI Engine 2 article, we noticed a line in the graph file defining the run-time ratio parameter for each kernel instance.

In this article, we will see how this parameter can impact the resource utilization and the performance of the AI Engine application.


Requirements

 

This article assumes that you have gone through the previous entries in the AI Engine Series.


What is the Run-Time Ratio?

 

The run-time ratio is a parameter with a value between 0 and 1 which should be defined for each kernel in a graph. It expresses, as a fraction, how much of the processing time of a single AI Engine core the kernel requires.

For example, a run-time ratio of 0.5 means that the kernel only needs 50% of the processing time of one AI Engine core.

Depending on the run-time ratios of the kernels, one kernel or multiple kernels can be mapped onto one AI Engine.


Computing the run-time ratio for a kernel

 

The run-time ratio of a kernel can be computed using the following equation:

run-time ratio = (cycles for one run of the kernel)/(cycle budget)

The cycle budget is the number of instruction cycles a kernel can take to either consume data from its input (when dealing with a rate-limited input data stream) or produce a block of data on its output (when dealing with a rate-limited output data stream).

It is defined by the following equation:

Cycle Budget = Block Size * (AI Engine Clock Frequency/Sample Frequency)

For example, take a kernel which processes a window of 128 samples where the input sample frequency (for example, from an ADC) is 245.76 MHz. With the AI Engine clock at 1 GHz (1000 MHz), the cycle budget is 128 × (1000/245.76) ≈ 520 cycles.

This means that with the AI Engine Array running at 1GHz, the kernel needs to be executed in less than 520 AI Engine clock cycles (because the next input window would be ready after 520 clock cycles).

We will cover how to estimate the cycles for one run of a kernel in a future article.


Impact of the run-time ratio on performance / resource utilization

 

We will now look at the impact of changing the value of the run-time ratio using the example application created in the previous articles.

We saw in the Versal AI Engine 2 article that the run-time ratio was set to 0.1 for the 2 kernels in the graph. In the Versal AI Engine 6 article, we saw in Vitis Analyzer that the 2 kernels were mapped to the same tile (tile [25,0]), so they share the processing time of the core located on this tile.

Because the 2 kernels run on the same core, they execute sequentially rather than simultaneously, so a single buffer (rather than a ping-pong buffer) is instantiated between them.

01.png

Finally, in the Versal AI Engine Series 5 article, we saw in the simulation output that the last sample of the first iteration of the graph (marked by TLAST) was output after 714 ns and that the first sample of the second iteration was output after a simulation time of 1197 ns.

First we will change the run-time ratio to 0.4 for both kernels and analyze how the graph is impacted.

Open the source file project.h and, on lines 26 and 27, change the run-time ratio for the kernels first and second to 0.4:

 

runtime<ratio>(first) = 0.4;
runtime<ratio>(second) = 0.4;
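For context, these two lines sit inside the graph class in project.h. A minimal sketch of where they fit is shown below; only the runtime&lt;ratio&gt; lines are taken from this article, while the kernel function name, port names, and window size are illustrative assumptions (kernel source-file assignments and platform details are omitted):

```cpp
#include <adf.h>
using namespace adf;

// Sketch only: "simple", the port names, and window<128> are placeholders.
class simple_graph : public graph {
public:
    port<input>  gr_in;
    port<output> gr_out;
    kernel first, second;

    simple_graph() {
        first  = kernel::create(simple);  // "simple" is a placeholder kernel
        second = kernel::create(simple);

        // The run-time ratio lines from this article: each kernel may use
        // up to 40% of one AI Engine core's processing time.
        runtime<ratio>(first)  = 0.4;
        runtime<ratio>(second) = 0.4;

        connect<window<128>>(gr_in, first.in[0]);
        connect<window<128>>(first.out[0], second.in[0]);
        connect<window<128>>(second.out[0], gr_out);
    }
};
```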

 

Build the Simple_application targeting Emulation-AIE and run the aiesimulator (Select Run As > Emulation-AIE).

Open the compilation summary (Emulation-AIE/Work/project.aiecompile_summary) in Vitis Analyzer. Look at the graph view and set the view as group by tile.

We can see that changing the run-time ratio has not impacted the resource utilization. The 2 kernels are still located on the same tile (and so are running on the same core) and they are still communicating via a single buffer.

Because the sum of the run-time ratios is still under 1, the aiecompiler still decides to group the 2 kernels together.

If you open the simulation output file (Emulation-AIE/aiesimulator_output/output.txt), you will see that the last sample of the first iteration of the graph is still output after 714 ns and that the first sample of the second iteration is still output after a simulation time of 1197 ns.

It is important to note that the run-time ratio does not schedule the execution of the kernels.

The kernel will be fired as soon as the data is available and the core is not already running another kernel. Even if the run-time ratio is set to 0.1 for a kernel, the kernel might be running 100% of the time on the AI Engine core (assuming there is only one kernel mapped to the core). In our case, because the two kernels are running the same functions, they might be running 50% of the time with the run-time ratio set to 0.1 or set to 0.4.

We can now try to increase the run-time ratio for the two kernels to 0.6.

 

runtime<ratio>(first) = 0.6;
runtime<ratio>(second) = 0.6;

 

Build the Simple_application targeting Emulation-AIE and run the aiesimulator (Select Run As > Emulation-AIE).

Open the compilation summary (Emulation-AIE/Work/project.aiecompile_summary) in Vitis Analyzer.

When we look at the graph view we can see that the usage has now changed:

02.JPG

Each kernel is mapped to a different tile.

The kernel named first runs on tile [25,0] and the kernel named second runs on tile [24,0], so the 2 kernels run on 2 different cores. There is also now a ping-pong buffer (as indicated by the two names buf1 and buf1d) implemented between the 2 kernels.

From this we can see that increasing the run-time ratio might increase the resource utilization of the graph.

Open the simulation output file (Emulation-AIE/aiesimulator_output/output.txt).

We can see that the last sample of the first iteration of the graph is now output after 724 ns. This means that the latency has slightly increased (+10 ns) for the first iteration. This is probably due to the lock management for the ping-pong buffer between the 2 kernels. However, the first sample of the second iteration now arrives after 966 ns (231 ns earlier than the previous result).

As the 2 kernels now run on 2 different cores, they can run in parallel, increasing the throughput of the graph. Thus, we can see that increasing the run-time ratio might increase the performance of the graph.

Important notes:

  • Increasing the run-time ratio to get 1 kernel per core might not increase the performance of the graph.
    This is because the graph might be limited by its input or output throughput.
    Thus, having a run-time ratio higher than required might result in inefficient use of the resources.

  • Reducing the run-time ratio might not result in a reduction of the resource utilization.
    This is because the compiler maps kernels to the same core only when it makes sense.
    For example, here the 2 kernels communicate through a memory, so they are already dependent on each other: the second kernel cannot start processing the data until the first has completed its execution.