A large part of a programmable logic developer’s time is not spent implementing RTL, but verifying RTL functionality and behavior. A few weeks ago, a young engineer asked me about simulation and its role in the development process. I intend to create several blogs focusing on how we can use Vivado XSIM to verify our design. However, RTL verification can be a wider task than simply running a simulation.
The Vitis™ AI development environment is Xilinx’s development platform for AI inference on Xilinx hardware platforms. It consists of optimized IP, tools, models, and example designs.
Designed with high efficiency and ease of use in mind, it unleashes the full potential of AI acceleration on Xilinx FPGAs and Adaptive Compute Acceleration Platforms (ACAPs). To help embedded engineers understand how to get started with deep learning, Doulos, in conjunction with Xilinx, has organized a one-day online training workshop.
Over the last couple of weeks, we have examined how Vivado can help us identify design issues that might impact the implementation.
However, as all engineers know, the earlier we can find an issue, the easier and cheaper it is to correct (both financially and in time spent). What is even better is to avoid the issue in the first place. This is where using Xilinx UltraFast design methodologies can offer significant benefits when implementing designs.
By following the rules outlined below, we can create a design that encounters fewer issues later in the flow. The UltraFast design methodology provides significant insight into why each rule exists; however, five of the most critical rules are summarized below.
More than a decade ago, we anticipated the pervasiveness of PCI Express® and began offering integrated blocks for it in our devices. Over the years, we have refined our integrated block offering with each new Xilinx architecture. We now see PCI Express applied in nearly all our developers’ markets.
The Versal™ architecture continues to offer a Programmable Logic Integrated Block for PCI Express (PL PCIE), further improved over those in prior architectures, and adds an Integrated Block for PCI Express® with DMA and Cache Coherent Interconnect (CPM). Architecturally, CPM is one component of the Versal architecture integrated shell, the whole of which is timing closed and resides “outside” the programmable logic (PL). The illustration below shows where CPM resides, with PL PCIE included for context:
Xilinx® UltraScale+™ FPGAs and SoCs with GTY transceivers can support a PCI Express® Gen4 interface. Design Gateway’s NVMe Host Controller IP core is designed to leverage the GTY transceivers to support the latest PCIe Gen4 NVMe SSDs. The IP core is implemented on Xilinx’s Virtex® UltraScale+ FPGA VCU118 Evaluation Kit and achieves very fast read/write performance: more than 4 GB/s.
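A quick line-rate calculation puts the 4 GB/s figure in context. It assumes an x4 link, which is typical for NVMe SSDs but is not stated above, and uses the standard per-lane rates of 8 GT/s (Gen3) and 16 GT/s (Gen4) with 128b/130b encoding. These are raw line rates; packet and protocol overhead make the achievable throughput somewhat lower.

```latex
\[
BW \approx \frac{N_{\text{lanes}} \times R_{\text{lane}}\,[\text{GT/s}] \times \tfrac{128}{130}}{8}\ \text{GB/s}
\]
\[
\text{Gen3 x4: } \frac{4 \times 8 \times \tfrac{128}{130}}{8} \approx 3.94\ \text{GB/s}
\qquad
\text{Gen4 x4: } \frac{4 \times 16 \times \tfrac{128}{130}}{8} \approx 7.88\ \text{GB/s}
\]
```

Under that x4 assumption, sustained transfers above 4 GB/s exceed what a Gen3 x4 link can carry, which is why the Gen4 capability of the GTY transceivers matters here.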
One of the most time-consuming elements of implementing an FPGA design is often not the design capture, but achieving the necessary timing performance. To achieve timing closure, we may need to adjust the design by inserting pipeline stages and use constraints to correctly define the clocks, their relationships, and even the locations of logic elements.
Vivado has many capabilities that help us understand how our design is implemented and the challenges it presents during implementation.
One of these tools is the Design Analysis Report, which enables the user to understand design challenges (e.g., congestion) and make changes to the design or constraints.
Every day, new devices that are part of the Internet of Things (IoT) appear in homes, offices, hospitals, factories, and thousands of other places. They need to be connected to the Internet, and huge amounts of raw data must be collected, stored, and processed in the cloud.
Many data centers are available to store the data; however, only some provide features specifically for IoT applications. One of the most complete cloud-based IoT services available is Amazon Web Services (AWS) IoT Greengrass. It enables edge devices to act locally on the data they generate and provides secure bidirectional communication between the IoT devices and the AWS cloud for management, analytics, and storage.
By using AWS IoT Greengrass, devices can keep their data in sync even when not connected to the Internet. They can also run AWS Lambda functions and make predictions based on machine learning models provided by AWS.
In the last blog, we installed the Alveo U200 accelerator card and ensured the card could be correctly validated. In this blog, we are going to look at what remains to be done to run the first executable program on the Alveo card itself.
First, let us do a little recap. In the previous blog, we installed the XRT and the deployment target platform. Once installed, these are available under the /opt/Xilinx/XRT and /opt/Xilinx/DSA directories in the file system. We used some of the available commands to update and validate the Alveo card.
To start creating our own application on the Alveo card, we also need to install the development target platform. Once installed, this will also be in the /opt directory and will provide the acceleration platform.
seL4 is a formally verified microkernel that was built with security and performance in mind. It is a very attractive software solution for projects that have rigorous security and/or safety requirements. While seL4 is a great option, one drawback is that less has been built around it than around other, more mature solutions such as a traditional OS. For example, seL4 supports relatively few platforms, and few drivers have been developed for it.
DornerWorks is trying to mitigate this by building up the ecosystem around seL4 and giving developers the tools they need to use seL4 for their systems. One way we are building up this ecosystem is by porting seL4 to more platforms that can really take advantage of its benefits.
In most of the blogs I write, I demonstrate or explain an FPGA or SoC design technique. This one, however, is going to be a little different because I am going to ask a question.
How do you architect your programmable logic designs?
The question came to mind as I was working on the architecture for three FPGAs as part of a satellite development. Of course, due to the end application, this architecture will be subject to several reviews by both the prime contractor and the appropriate space agency. Therefore, I want the architecture to show as much detail as possible, without making the drawings difficult to maintain, so that my design team can work from it with ease.
I get asked this question a lot in my interactions with customers. Okay, maybe not phrased exactly like that, but more or less along the same lines: “Why should I move to Versal™ ACAP and is now the right time to do so?” It’s a great question, and the answer is...
This is a follow-on to last week’s demonstration of developing embedded software using Vitis 2020.1.
In this week’s blog, we are going to look at how we can debug applications running on the processing system using Vitis. To be able to do this, we need a JTAG debugger because the MicroZed does not have on-board JTAG USB capability like some other boards.
To start debugging the MicroZed, the Digilent JTAG HS2 / HS3 programming cables are the most cost-effective option. Of course, for larger developments, you might want to consider the SmartLynq from Xilinx.
Traditionally, the RF and digital design worlds were separate. That changed when the Xilinx® Zynq® UltraScale+™ RFSoC brought them together in one device. The Zynq® UltraScale+™ RFSoC revolutionized RF and wireless systems by introducing a programmable device with integrated RF data converters. MATLAB® and Simulink® help engineers work across both RF and digital domains, get the most out of their RF hardware, and save time and effort in developing and deploying their wireless processing algorithms on highly integrated devices like the Xilinx Zynq UltraScale+ RFSoC.
In the last blog, we looked at how to create a MicroZed project in Vivado 2020.1, just as we did when we started this blog nearly seven years ago. Having created the hardware build in Vivado, in this blog we are going to create a simple hello world program using Vitis.
The first thing we need to do in Vivado is export the XSA. This is a compressed file that contains several elements enabling Vitis to create software applications. Within an XSA file, you will find information on the address map, the IP included, and, of course, the bit file, although the bit file is not necessary to get started with software development.
When I started this blog nearly seven years ago now (the first one was published 9/30/2013), we developed the solution in Vivado 2013.2.
Following that, we have gone on to examine how we can use Zynq-7000 SoCs, Zynq MPSoCs, and Xilinx 7 Series FPGAs in 354 blogs covering a range of topics and applications, all of which built on the basics introduced in these blogs. Since then, we have seen several iterations of Vivado and the introduction of Vitis. Therefore, I think it is a good idea to recap a little and show the basic flow again, this time using Vivado and Vitis 2020.1.
As next-generation networks are deployed to support an increasingly diverse mix of high-bandwidth applications, network vendors and data center operators need to rapidly scale packet processing capability while both minimizing capex/opex and preserving the flexibility to adapt to future connectivity standards. To meet these requirements, Xilinx is excited to announce the latest addition to the Kintex® UltraScale+™ FPGA portfolio: the Kintex UltraScale+ KU19P FPGA. The KU19P FPGA delivers the optimized mix of resources and high-throughput connectivity needed to efficiently accelerate network processing while maintaining the strong balance of performance, price, and power inherent to the entire Kintex FPGA portfolio.
Xilinx has standardized on the AXI4 bus for the vast majority of its IP. To easily connect and operate in that environment, your custom IP should also use the AXI4 bus interfaces. These interfaces often add to the time required to implement and test your custom logic. Xilinx provides an IP Wizard that can be used to generate AXI4 interfaces so that they can be incorporated into your code, but the generated code usually requires some amount of modification and, therefore, still requires testing to verify the operation of the interface.
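Much of the testing effort for AXI4 interfaces comes down to the VALID/READY handshake used on every AXI channel: a beat transfers only on a clock edge where both VALID and READY are high, and once a source asserts VALID it must hold VALID and its data stable until the handshake completes. The cycle-level Python model below is a minimal sketch of that rule, useful for reasoning about test cases; it is not the code generated by the IP Wizard, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Beat:
    cycle: int  # clock cycle in which the handshake completed
    data: int   # data value transferred in that cycle


def simulate_channel(valid: Sequence[bool], ready: Sequence[bool],
                     data: Sequence[int]) -> List[Beat]:
    """Return the beats transferred on one AXI channel.

    Each list holds one sample per rising clock edge. A beat transfers
    only when VALID and READY are both high in the same cycle.
    """
    return [Beat(cycle, d)
            for cycle, (v, r, d) in enumerate(zip(valid, ready, data))
            if v and r]


def check_valid_stable(valid: Sequence[bool], ready: Sequence[bool],
                       data: Sequence[int]) -> Optional[int]:
    """Return the first cycle where the source breaks the handshake rule, else None.

    If VALID was high and READY low in the previous cycle (a stall), the
    source must keep VALID high and its data unchanged in this cycle.
    """
    for cycle in range(1, len(valid)):
        stalled = valid[cycle - 1] and not ready[cycle - 1]
        if stalled and (not valid[cycle] or data[cycle] != data[cycle - 1]):
            return cycle
    return None
```

For example, with VALID = [1, 1, 1, 0, 1] and READY = [0, 1, 1, 1, 0], beats transfer in cycles 1 and 2 only; the source waits through the cycle-0 stall with its data held, so no violation is reported.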
Synchronization technology is pervasive in many industry sectors: finance, telecom, industrial, automotive, and aerospace & defense. These markets have several applications that heavily rely on synchronization.
While geo-localization is probably the best-known application relying on this technology, it is not the only one. Many synchronization techniques are available; ultimately, they can be classified into two main categories.
A few weeks ago, I demonstrated how to create a project from scratch using the Arm Cortex-M1. In that flow, I demonstrated an approach that fuses the Vivado bit file with the ELF file generated by the Arm Keil MDK. This flow works well for simple applications; however, when we create more complex designs, we need to be able to debug the software running on the board.
To capitalize on the benefits of system integration, design teams require state-of-the-art hardware architectures, design flows, and a proven methodology that maximizes productivity from concept through implementation and debugging. The Vivado® Design Suite offers a new approach for ultra-high productivity with next-generation C/C++ and IP-based design. To maximize system performance and enable accelerated and predictable design cycles, Vivado Design Suite 2020.1 introduces new and improved features. Let’s take a look at some of the major highlights in the 2020.1 release.
Introducing the ground-breaking Zynq® UltraScale+™ RFSoC ZCU208 Evaluation kit, specially built for system architects and RF designers. This revolutionary platform delivers the power of an adaptable radio platform in a power-efficient, high-performance development system with full software programmability. Eight integrated SD-FEC cores provide forward error correction at 80% lower power consumption than soft implementations, making the ZCU208 ideal for DOCSIS, Microwave Backhaul, and Small Cell applications.
The main difference between the new JPEG XS standard and existing codecs developed by JPEG, MPEG, or other standardization committees is that compression efficiency is not its primary target. While other codecs focus primarily on high compression efficiency, disregarding latency or complexity, JPEG XS addresses the question: “How can we ultimately and reliably replace uncompressed video?” JPEG XS handles increasing resolutions, frame rates, and numbers of streams while safeguarding all the advantages of an uncompressed stream.
As you will be aware, I do a lot of High-Level Synthesis (HLS) design for clients, especially for image processing applications. One of the great things about HLS is the productivity it brings when creating the application and its verification.
However, when the performance of our HLS block is not as expected, being able to find the critical path that causes the violation is crucial. We have previously looked at the analysis view and the optimizations that can be used to increase performance.
In this blog, we are going to examine how to find the timing and initiation interval (II) violations within our HLS designs and, of course, how to correct them.
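To see why an initiation interval violation matters so much, a common first-order model for a pipelined loop is: total cycles ≈ iteration latency + II × (trip count - 1). The sketch below applies that model; the 1920×1080 trip count and 10-cycle iteration latency are hypothetical numbers chosen only for illustration, and real HLS reports may add loop entry/exit overhead.

```python
def pipelined_loop_cycles(trip_count: int, iteration_latency: int, ii: int) -> int:
    """Approximate cycles for a pipelined loop (first-order model).

    The first iteration takes `iteration_latency` cycles to leave the
    pipeline; each later iteration completes `ii` cycles after the
    previous one. Loop entry/exit overhead is ignored.
    """
    if trip_count < 1 or iteration_latency < 1 or ii < 1:
        raise ValueError("trip_count, iteration_latency and ii must all be >= 1")
    return iteration_latency + ii * (trip_count - 1)


if __name__ == "__main__":
    pixels = 1920 * 1080  # hypothetical: one pixel processed per loop iteration
    depth = 10            # hypothetical iteration latency in clock cycles
    for ii in (1, 2):     # target II of 1 versus a violated II of 2
        print(f"II={ii}: {pipelined_loop_cycles(pixels, depth, ii)} cycles per frame")
```

With these numbers, II=1 gives 2,073,609 cycles per frame and II=2 gives 4,147,208, so missing the target II roughly halves throughput, while the pipeline depth barely matters for long loops.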
In some resource-limited, high-performance, and low-latency scenarios, we strive for lower power consumption and higher performance without losing accuracy for AI inference. Low power consumption and high accuracy are especially critical in edge applications and low-latency ADAS. While 8-bit quantization can produce high accuracy, it requires more hardware resources. Extremely low-bit quantization, such as binary or ternary, often has a large accuracy degradation. Therefore, a full-process hardware-friendly quantization solution of 4-bit activations and 4-bit weights (4A4W) is proposed as a better accuracy/resource trade-off. With INT4 optimization, Xilinx can achieve up to a 77% performance boost on real hardware in comparison with INT8 and can achieve comparable accuracy to the full-precision models.
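The arithmetic behind 4-bit quantization can be illustrated with a basic symmetric, per-tensor scheme that maps real values onto the signed INT4 range [-8, 7]. This is a minimal sketch for intuition only, not the Xilinx 4A4W flow described above, which is a more complete hardware-friendly process; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit integer range


def quantize_int4(values: Sequence[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to signed 4-bit integers.

    The scale maps the largest magnitude onto INT4_MAX (7), so -8 is
    unused in this symmetric scheme. Values are rounded with Python's
    round() (halves go to even) and clamped to [-8, 7].
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
    q = [min(INT4_MAX, max(INT4_MIN, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: Sequence[int], scale: float) -> List[float]:
    """Map INT4 codes back to approximate real values."""
    return [x * scale for x in q]
```

For example, quantizing [0.7, -0.3, 0.0, -0.7] gives codes [7, -3, 0, -7] with a scale of about 0.1; each reconstructed value is within half a step (scale / 2) of the original, which is the precision traded away for the smaller hardware footprint compared with INT8.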
One of the most challenging and exciting aspects of programmable logic design can be achieving timing closure and ensuring data is transferred safely across all clock domains. Typically, at this point in the design, Vivado may have generated a significant number of warnings relating to timing, design rule checks (DRC), and methodology checks.
When we open the implementation view to observe these methodology, DRC, and CDC issues, there will, of course, be numerous messages of several different severities, from informational and advisory to warnings and critical warnings.
In last week’s blog, I examined the elements required to generate the image processing chain in the Vivado Design Suite. This week, we are going to examine the more complex element: the PetaLinux build. This requires that we be familiar with device trees.
Once the design is completed in Vivado, the first thing we need to do is export the XSA and create a new PetaLinux project. If you are unsure how to do this, please check out my PetaLinux miniseries.
Intelligence on board trains and on the ground in wayside equipment makes traveling fast, safe, reliable, and comfortable. Controlling features like doors, lighting, cameras, air conditioning, brakes, lavatories, and displays, to name a few, requires the right balance of computational performance, real-time capability, and reliability. Combined with Artificial Intelligence (AI) and connected sensors, processes in and around a cabin can be improved for the benefit of passengers. Efficiency, durability, and safety can be optimized by advanced traction control using optimized motor control algorithms.
Over the course of this series we have looked at image processing several times, mostly using bare-metal or PYNQ. However, for many applications and indeed for Vitis acceleration applications, we need to create a PetaLinux-based image processing chain.
This is exactly what we are going to do over the next couple of blogs. Of course, we will start in Vivado, targeting the Ultra96-V2 board together with the MIPI interface board, the JTAG/UART board, and the Digilent Pcam 5C camera.
Edge applications such as advanced driver-assistance systems (ADAS) and autonomous driving (AD) in next-generation cars are fueling the need for large amounts of sensor data from image, radar, and lidar sensors to be captured and processed to make intelligent decisions in real time. AD platforms are still in their infancy with evolving architectures. These platforms are expected to have many different configurations (number of sensors, resolutions, and types of sensors) needing very flexible yet optimal architectures for edge use cases.