By Ambrose Finnerty, Xilinx DSP Technical Marketing Management
Thank you for joining us at XDF!
Our developer community and everyone who joined us at Xilinx Developer Forum (XDF) Silicon Valley helped to make it our biggest, most successful developer event yet. XDF Silicon Valley brought together over 1,100 attendees from 24 countries for 80+ sessions and 40+ exhibitor demos. If you weren’t able to make it to XDF, we've got you covered. Xilinx reporters were everywhere to capture the highlights. We're going to post recap blogs for the next few weeks. Today is all about Versal™ AI Engine.
A new multicore vector processor architecture delivering the high compute efficiency required across a wide range of evolving markets and applications
In many dynamic and evolving markets, such as 5G cellular or ADAS, applications are pushing for ever-increasing compute acceleration while remaining power efficient. With Moore’s Law running out of steam, moving to the latest and greatest IC process node no longer provides the traditional benefits of lower power and cost with better performance.
Architectural innovation is needed to deliver the compute these evolving applications require. To that end, Xilinx has included new scalable AI Engine arrays in the Versal™ AI Core series, delivering a 20X compute performance improvement for AI inference and a 5X improvement for 5G wireless, with greater power efficiency than the prior generation, as shown in Figure 1.
Figure 1. AI Engine Application Performance and Power Efficiency
Xilinx Reinvents Multicore Compute
Traditional single and multicore processors are unable to provide the efficient compute acceleration required for the flexible workloads in these applications. Traditional cache-based multicore architectures, such as GPUs, cannot deliver deterministic throughput and latency because their rigidly structured hierarchical memory relies on a fixed, shared interconnect. Data replication, cache misses, and blocking on that shared interconnect all limit compute efficiency.
Figure 2. AI Engine Array
The AI Engine array, with adaptable, non-blocking shared interconnect between AI Engine tiles and local distributed memory, delivers a deterministic, higher-bandwidth multicore compute engine with unrivaled compute efficiency. The AI Engine array is composed of AI Engine tiles (see Figure 2). The size of the array varies across the Versal AI Core series, with a maximum of 400 tiles in the largest device (VC1902). Key to enabling efficient compute is the tile-based architecture, where each AI Engine tile includes 32 KB of local memory that can be shared across neighboring AI Engines, a high-bandwidth non-blocking interconnect, and the AI Engine core with an ISA-based VLIW/SIMD vector processor and program memory.
Figure 3. AI Engine: Tile-based Architecture
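As a quick worked example, the tile count and per-tile memory quoted above imply the total local memory available across the largest array. The short Python sketch below only multiplies the two published figures; the function and constant names are illustrative, and the conversion assumes 1 MB = 1024 KB.

```python
# Figures quoted in the text above.
TILES_IN_LARGEST_DEVICE = 400   # maximum AI Engine tiles (VC1902)
LOCAL_MEMORY_PER_TILE_KB = 32   # local memory per AI Engine tile, in KB


def total_local_memory_kb(num_tiles, kb_per_tile=LOCAL_MEMORY_PER_TILE_KB):
    """Aggregate local memory across an array of tiles, in KB."""
    return num_tiles * kb_per_tile


total_kb = total_local_memory_kb(TILES_IN_LARGEST_DEVICE)
total_mb = total_kb / 1024  # assumption: 1 MB = 1024 KB

print(f"{total_kb} KB total, about {total_mb} MB")
# prints "12800 KB total, about 12.5 MB"
```

Because the memory is distributed, this total is spread across the tiles rather than sitting behind a single shared cache.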
Adaptable data movement avoids the interconnect “bottlenecks” to which other multicore processors are susceptible. This data movement architecture and a flexible memory hierarchy, which is local and shareable, enable very high bandwidth while ensuring no cache misses or data replication. Additionally, this flexibility allows data transfers between AI Engines to be overlapped in time with the compute running on the AI Engines themselves.
Figure 4. AI Engine delivers high compute efficiency
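To see why overlapping data transfers with compute matters, consider a simple timing model. The Python sketch below is not Xilinx code. It assumes a generic two-buffer (ping-pong) scheme in which the next block of data is transferred while the current block is processed, and the block count and time units are hypothetical values chosen only for illustration.

```python
def serial_time(num_blocks, transfer, compute):
    """Each block is transferred, then computed, with no overlap."""
    return num_blocks * (transfer + compute)


def overlapped_time(num_blocks, transfer, compute):
    """Ping-pong buffering: while block i is computed, block i+1 is transferred.

    The first transfer and the last compute cannot be hidden. In between,
    each step costs whichever of the two phases is longer.
    """
    if num_blocks == 0:
        return 0
    return transfer + (num_blocks - 1) * max(transfer, compute) + compute


# Hypothetical figures in arbitrary time units.
blocks, t_transfer, t_compute = 8, 3, 4

t_serial = serial_time(blocks, t_transfer, t_compute)
t_overlap = overlapped_time(blocks, t_transfer, t_compute)

print(f"serial: {t_serial}, overlapped: {t_overlap}")
# prints "serial: 56, overlapped: 35"
```

In this model, once the pipeline is full the transfers are hidden behind compute, so the total time approaches the compute time alone.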
Unified Software Development Environment
As shown in Figure 5, Xilinx will deliver a unified software development environment for full chip programming. In this environment, the AI Engines can be programmed at different levels of abstraction, from C/C++ to AI frameworks like Caffe or TensorFlow, leveraging optimized AI and 5G software libraries.
Figure 5. Versal ACAP Development Tools
Using common frameworks, data scientists can accelerate AI inference workloads in the Data Center on Versal AI Core devices using Xilinx’s pre-packaged IP overlay or domain-specific architectures (DSAs). The DSA partitions the compute-intensive functions (e.g., convolution layers) to the AI Engines and the supporting functions to the Adaptable Engines, while also using the large on-chip memory capacity in the Adaptable Engines (block RAM/UltraRAM) for weights and activation buffering. This substantially reduces latency and power compared to going to off-chip memory. See Figure 6.
Figure 6. AI Inference Mapping on Versal ACAP
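The partitioning described above can be pictured with a toy example. The Python sketch below is purely illustrative and does not reflect any Xilinx tool or API. The layer list is hypothetical, and it assumes that only convolution counts as compute intensive (the one example the text gives), with every other layer treated as a supporting function.

```python
# Hypothetical network description, invented for illustration only.
layers = [
    {"name": "conv1", "kind": "convolution"},
    {"name": "pool1", "kind": "pooling"},
    {"name": "conv2", "kind": "convolution"},
    {"name": "softmax", "kind": "softmax"},
]

# Assumption: only convolution is treated as compute intensive,
# following the example given in the text.
COMPUTE_INTENSIVE_KINDS = {"convolution"}


def assign_engine(layer):
    """Map a layer to the engine type that would run it in this toy model."""
    if layer["kind"] in COMPUTE_INTENSIVE_KINDS:
        return "AI Engines"
    return "Adaptable Engines"


mapping = {layer["name"]: assign_engine(layer) for layer in layers}

for name, engine in mapping.items():
    print(f"{name} -> {engine}")
```

A real DSA makes this decision inside the pre-packaged overlay, so a data scientist working from a framework would not write such a mapping by hand.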
These DSAs deliver leadership in low-latency, real-time inference versus Nvidia GPUs, where high compute efficiency is not achievable. In a 75W power envelope, Xilinx projections show a 4X performance advantage over next-generation GPUs for GoogLeNet. See Figure 7.
Figure 7. AI Engine Delivers Real-Time Inference Leadership