
Xilinx Unveiled the Secret Sauce of the Ultimate AI Inference Compute at XDF ‒ Versal AI Engine Array


By Ambrose Finnerty, Xilinx DSP Technical Marketing Management

 

Thank you for joining us at XDF!

Our developer community and everyone who joined us at Xilinx Developer Forum (XDF) Silicon Valley helped make it our biggest, most successful developer event yet. XDF Silicon Valley brought together over 1,100 attendees from 24 countries for 80+ sessions and 40+ exhibitor demos. If you weren't able to make it to XDF, we've got you covered: Xilinx reporters were everywhere to capture the highlights, and we'll be posting recap blogs over the next few weeks. Today's post is all about the Versal AI Engine.

 

A new multicore vector processor architecture delivering the high compute efficiency required across a wide range of evolving markets and applications

In many dynamic and evolving markets, such as 5G cellular or ADAS, applications demand ever-increasing compute acceleration while remaining power efficient. With Moore's Law running out of steam, moving to the latest IC process node no longer provides the traditional benefits of lower power and cost with better performance.

 


 

Architectural innovation is needed to deliver the compute these evolving applications require. To that end, Xilinx has included new scalable AI Engine arrays in the Versal™ AI Core series, delivering a 20X compute performance improvement for AI inference and 5X for 5G wireless, with greater power efficiency than the prior generation, as shown in Figure 1.

 

Figure 1. AI Engine Application Performance and Power Efficiency

 

Xilinx Reinvents Multicore Compute

Traditional single-core and multicore processors cannot provide the efficient compute acceleration these flexible workloads require. Cache-based multicore architectures such as GPUs rely on a rigid, hierarchical memory with fixed shared interconnect, so they cannot deliver deterministic throughput and latency: data replication, cache misses, and contention on the shared interconnect limit their compute efficiency.

 

Figure 2. AI Engine Array

 

The AI Engine array, with adaptable, non-blocking shared interconnect between AI Engine tiles and local distributed memory, delivers a deterministic, high-bandwidth multicore compute engine with very high compute efficiency. The array is built from AI Engine tiles (see Figure 2); its size varies across the Versal AI Core series, up to a maximum of 400 tiles in the largest device (VC1902). Key to this efficient compute is the tile-based architecture (Figure 3): each AI Engine tile includes 32KB of local memory that can be shared with neighboring AI Engines, high-bandwidth non-blocking interconnect, and the AI Engine core itself, an ISA-based VLIW/SIMD vector processor with its own program memory.

Figure 3. AI Engine: Tile-based Architecture
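
To make the tile concept more concrete, here is a minimal plain-C++ sketch of the kind of work a single tile performs: a fixed-width multiply-accumulate over data held in a small local buffer. The buffer size, lane count, and function names are illustrative assumptions; this is not the actual AI Engine toolchain API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Illustrative sizes only: each AI Engine tile has 32KB of local data memory
// that neighboring tiles can also access. Here one tile's working set is
// modeled as a fixed-size array of int16 samples.
constexpr std::size_t kLocalWords = 1024;   // hypothetical working-set size
constexpr std::size_t kLanes      = 8;      // conceptual SIMD width

using LocalBuffer = std::array<int16_t, kLocalWords>;

// Conceptual vector multiply-accumulate: process kLanes samples per step,
// mimicking how a VLIW/SIMD core consumes data streamed from local memory
// without going through a cache hierarchy.
int64_t vector_mac(const LocalBuffer& a, const LocalBuffer& b) {
    int64_t acc[kLanes] = {0};
    for (std::size_t i = 0; i < kLocalWords; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            acc[lane] += int32_t(a[i + lane]) * int32_t(b[i + lane]);
        }
    }
    int64_t total = 0;
    for (std::size_t lane = 0; lane < kLanes; ++lane) total += acc[lane];
    return total;
}

int main() {
    LocalBuffer x{}, w{};
    x.fill(2);
    w.fill(3);
    std::printf("dot product = %lld\n",
                static_cast<long long>(vector_mac(x, w)));
    return 0;
}
```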

 

Adaptable data movement avoids the interconnect bottlenecks to which other multicore processors are susceptible. This data movement architecture, combined with a flexible memory hierarchy that is both local and shareable, enables very high bandwidth with no cache misses or data replication. It also allows data transfers between AI Engines to be overlapped in time with compute in the engines themselves.

Figure 4. AI Engine delivers high compute efficiency
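
Overlapping data movement with compute is essentially a ping-pong (double) buffering scheme. Below is a plain-C++ sketch of that idea, using std::async to stand in for the interconnect/DMA transfer; the block size and helper names are made up for illustration and do not reflect the AI Engine programming model.

```cpp
#include <cstddef>
#include <cstdio>
#include <future>
#include <numeric>
#include <vector>

// While the core computes on one block, the "interconnect" fetches the next
// one, so transfer time is hidden behind compute.
constexpr std::size_t kBlock = 4096;
using Block = std::vector<float>;

Block fetch_block(std::size_t index) {           // stands in for a DMA transfer
    return Block(kBlock, static_cast<float>(index));
}

float compute(const Block& b) {                  // stands in for the vector kernel
    return std::accumulate(b.begin(), b.end(), 0.0f);
}

int main() {
    constexpr std::size_t kNumBlocks = 8;
    float total = 0.0f;

    // Prefetch the first block, then overlap each compute with the next fetch.
    std::future<Block> next = std::async(std::launch::async, fetch_block, 0);
    for (std::size_t i = 0; i < kNumBlocks; ++i) {
        Block current = next.get();
        if (i + 1 < kNumBlocks)
            next = std::async(std::launch::async, fetch_block, i + 1);
        total += compute(current);               // runs while the next fetch is in flight
    }
    std::printf("sum over all blocks = %f\n", total);
    return 0;
}
```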

 

Unified Software Development Environment

As shown in Figure 5, Xilinx will deliver a unified software development environment for full-chip programming. In this environment, the AI Engines can be programmed at different levels of abstraction, from C/C++ to AI frameworks such as Caffe or TensorFlow, leveraging optimized AI and 5G software libraries.

Figure 5. Versal ACAP Development Tools

 

Using common frameworks, data scientists can accelerate AI inference workloads in the data center on Versal AI Core devices using Xilinx's pre-packaged IP overlays, or domain-specific architectures (DSAs). The DSA partitions compute-intensive functions (e.g., convolution layers) to the AI Engines and supporting functions to the Adaptable Engines, while also using the large on-chip memory capacity of the Adaptable Engines (block RAM/UltraRAM) for weight and activation buffering. This substantially reduces latency and power compared with going to off-chip memory. See Figure 6.

Figure 6. AI Inference Mapping on Versal ACAP

 

These DSAs deliver low-latency, real-time inference leadership versus Nvidia GPUs, which cannot achieve high compute efficiency in this regime. In a 75W power envelope, Xilinx projections show a 4X performance advantage over next-generation GPUs for GoogLeNet. See Figure 7.

Figure 7. AI Engine Delivers Real-Time Inference Leadership

 

For more information on Versal ACAP, visit: http://www.xilinx.com/versal.

For more information on the Xilinx AI Engine and what this new multicore vector processor engine enables, please read WP506 - Xilinx AI Engines and Their Applications.

 

3 Comments
jmcclusk (Scholar)

The AI Engine array is an amazing vector engine... but in reading the white paper, it seems there are no capabilities to handle FP16 data, only FP32. The latest GPUs handle this data format with ease, and I can only conclude that the Everest project managers had to make a hard choice on scheduling and silicon area to leave this format out. Or perhaps it's a judgment call on how popular FP16 will be for machine learning applications.

Xilinx Employee

@jmcclusk Compute within the AI Engine is optimized for ML inference and advanced signal processing, and INT8 precision is a very common choice for ML inference use cases today. FP32 and FP16 are precision formats commonly used for training networks but are less prevalent in inference. For deployed inference use cases where latency and power matter, lower precisions like INT8 are most common; they have been shown to retain the accuracy of most models while decreasing the amount of memory needed for the inference workload.
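
For illustration only, the following generic C++ sketch (not Xilinx tooling; the weight values are made up) shows a minimal symmetric quantization of FP32 weights to INT8 with a single per-tensor scale, which is one reason INT8 inference needs a quarter of the memory of FP32 while staying close to the original values.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Map FP32 weights into INT8 using one per-tensor scale chosen so the
// largest magnitude maps to 127, then dequantize to compare with the
// originals. Weight values are arbitrary examples.
int main() {
    std::vector<float> weights = {0.12f, -0.50f, 0.73f, -0.08f, 0.31f};

    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    const float scale = max_abs / 127.0f;

    for (float w : weights) {
        int8_t q   = static_cast<int8_t>(std::lround(w / scale));  // quantize
        float back = q * scale;                                    // dequantize
        std::printf("fp32 % .3f -> int8 %4d -> % .3f\n", w, int(q), back);
    }
    return 0;
}
```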

emmasmith (Newbie)

Thank you so much for this. I was looking into this issue and tried to tinker around to check if it was possible, but couldn't get it done. Now that I have seen the way you did it, thanks. With regards.