Introducing the ground-breaking Zynq® UltraScale+™ RFSoC ZCU208 Evaluation Kit, specially built for system architects and RF designers. This revolutionary platform delivers the power of an adaptable radio platform in a power-efficient, high-performance development system with full software programmability. Eight integrated SD-FEC cores provide forward error correction at 80% lower power consumption than soft implementations, making the ZCU208 ideal for DOCSIS, Microwave Backhaul, and Small Cell applications.
The main difference between the new JPEG XS standard and existing codecs developed by JPEG, MPEG, or other standardization committees is that compression efficiency is not its first target. While other codecs primarily focus on high compression efficiency, disregarding latency or complexity, JPEG XS addresses the following question: “How can we ultimately and reliably replace uncompressed video?” JPEG XS handles increasing resolutions, frame rates, and growing numbers of streams while safeguarding all the advantages of an uncompressed stream.
As you will be aware, I do a lot of High-Level Synthesis (HLS) design for clients, especially for image processing applications. One of the great things about HLS is the productivity it brings when creating the application and its verification.
However, when the performance of our HLS block is not as expected, being able to find the critical path causing the violation is crucial. We have looked before at the analysis view and the potential optimizations which can be used to increase performance.
In this blog, we are going to examine how we can focus on finding the timing and initiation interval violations within our HLS designs and, of course, correcting them.
In some resource-limited, high-performance, and low-latency scenarios, we strive for lower power consumption and higher performance without losing accuracy for AI inference. Low power consumption and high accuracy are especially critical in edge applications and low-latency ADAS. While 8-bit quantization can produce high accuracy, it requires more hardware resources. Extremely low-bit quantization, such as binary or ternary, often has a large accuracy degradation. Therefore, a full-process hardware-friendly quantization solution of 4-bit activations and 4-bit weights (4A4W) is proposed as a better accuracy/resource trade-off. With INT4 optimization, Xilinx can achieve up to a 77% performance boost on real hardware in comparison with INT8 and can achieve comparable accuracy to the full-precision models.
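Xilinx's full 4A4W flow is hardware-aware and considerably more involved, but as a rough illustration of the underlying idea, here is a minimal NumPy sketch of symmetric uniform 4-bit quantization. The scale-selection rule and the example values are illustrative assumptions, not the actual Vitis AI algorithm.

```python
import numpy as np

def quantize_4bit(x, scale):
    """Uniformly quantize to signed 4-bit integers in [-8, 7]."""
    return np.clip(np.round(x / scale), -8, 7).astype(np.int8)

def dequantize(q, scale):
    """Map 4-bit codes back to floating point."""
    return q.astype(np.float32) * scale

# Example: per-tensor scale chosen so the largest magnitude lands on the grid.
w = np.array([0.9, -0.31, 0.07, -1.2], dtype=np.float32)
scale = np.abs(w).max() / 7
w_q = quantize_4bit(w, scale)      # 4-bit integer codes
w_hat = dequantize(w_q, scale)     # reconstructed approximation of w
```

Each 4-bit code covers a step of `scale`, so the reconstruction error of any in-range value is at most half a step, which is the accuracy/resource trade-off the 4A4W scheme exploits.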
One of the most challenging and exciting aspects of programmable logic design can be achieving timing closure and ensuring data is safely transferred across all clock domains. Typically, at this point of the design there could be a significant number of warnings generated by Vivado relating to timing, design rule checks, and methodology checks.
When we open the implementation view to observe these methodology, DRC, and CDC issues, there will, of course, be numerous messages of several different severities, ranging from informational and advisory to warnings and critical warnings.
In last week's blog, I examined the elements required to generate the image processing chain in the Vivado Design Suite. This week we are going to examine the more complex element of it: the PetaLinux build. This requires that we be familiar with device trees.
Once the design is completed in Vivado, the first thing we need to do is export the XSA and create a new PetaLinux project. If you are unsure how to do this, please check out my PetaLinux miniseries.
Intelligence on board trains and on the ground in wayside equipment makes traveling fast, safe, reliable, and comfortable. Controlling various features like doors, lighting, cameras, air conditioning, brakes, lavatories, and displays, to name a few, requires the right balance of computational performance, real-time capability, and reliability. Combined with Artificial Intelligence (AI) and connected sensors, processes in and around a cabin can be improved for the benefit of the passenger. Efficiency, durability, and safety can be optimized by advanced traction control with optimized motor control algorithms.
Over the course of this series we have looked at image processing several times, mostly using bare-metal or PYNQ. However, for many applications and indeed for Vitis acceleration applications, we need to create a PetaLinux-based image processing chain.
This is exactly what we are going to do over the next couple of blogs. Of course, we will be starting in Vivado and targeting the Ultra96-V2 board, along with the MIPI interface board, the JTAG/UART board, and the Digilent Pcam 5C camera.
Edge applications such as advanced driver-assistance systems (ADAS) and autonomous driving (AD) in next-generation cars are fueling the need for large amounts of sensor data from image, radar, and lidar sensors to be captured and processed to make intelligent decisions in real time. AD platforms are still in their infancy with evolving architectures. These platforms are expected to have many different configurations (number of sensors, resolutions, and types of sensors) needing very flexible yet optimal architectures for edge use cases.
One of the great things about Zynq and Zynq MPSoC devices and the MicroBlaze microprocessor is that they can run embedded Linux operating systems. This gives us the ability to easily work with networking and communications and to leverage high-level open-source frameworks.
Of course, the great thing about using FPGA-based SoCs is their highly flexible nature, which allows new peripherals to be added as needed. For embedded Linux solutions, this flexibility can be an issue as the kernel needs to know the hardware it is running on, its configuration, and also what peripherals are available.
DesignLinx and its customers have been early adopters of the Xilinx® SDAccel™ development environment for both cloud and on-premises applications, using the SDAccel development environment to target both Amazon AWS F1 and Xilinx Alveo™ data center accelerator cards with accelerated software. Along with SDSoC and the Xilinx SDK, the SDAccel flow is now part of the Vitis™ unified software platform in version 2019.2, allowing developers to use a single platform for all software tasks on Xilinx devices.
I must admit, I have not worked with the board that this series of blogs is named after since I created the Vitis Acceleration Platform. However, in the past week, I have had two clients reach out to me for help regarding controlling and communicating with custom IP they developed in the programmable logic.
This got me thinking that what would help them is a PYNQ image for the MicroZed board. This helps in several ways:
PYNQ comes with drivers for most PL peripherals and IP. As such, we can focus on the configuration and behavior of the IP.
The PL design will be undergoing changes in development, and PYNQ enables the new overlay to be uploaded and tested with ease.
Of course, to be able to provide this image, I first need to create a PYNQ image for the MicroZed 7020. Doing so is quite straightforward, and it is what I am going to demonstrate. To do this, we will need a virtual machine with the latest PYNQ repository cloned. You can see how to do this here.
Just a week ago, Xilinx announced the arrival of Radiation Tolerant (RT) Kintex UltraScale (XQRKU060), our newest addition to the Space-Grade (XQR) portfolio. The new device enables a broader array of functionality and delivers a significant increase in performance compared to our previous generations and to other FPGA vendors' offerings. By adding an UltraScale device, we are skipping three process nodes, from 65nm to 20nm. The product table below shows how RT Kintex® UltraScale™ stacks up to our previous XQR devices, Virtex®-4QV (V4QV) and Virtex-5QV (V5QV).
I do a lot of work for clients using the PYNQ framework on a wide range of different boards. I have noticed a few great things about it, especially when creating custom overlays, so I thought they would make for a good blog. A few weeks ago I wrote about the benefit of using PYNQ in development, so in this blog, we are going to examine a few more interesting aspects of PYNQ.
Understanding Overlays – We can determine what IP blocks are included in an overlay and which driver is being used by simply using the command print(<overlay>.__doc__). This is especially useful when we are working with custom overlays and we want to understand what they contain and how to use them.
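For example, the introspection described above looks like the following on a PYNQ-enabled board. The bitstream name and the IP instance name here are placeholders for whatever your own block design contains.

```python
from pynq import Overlay

# Load a custom overlay (bitstream name is an example; substitute your own).
ol = Overlay("custom_design.bit")

# Print the auto-generated documentation: the IP blocks present in the
# overlay and the driver PYNQ has bound to each of them.
print(ol.__doc__)

# Each IP core then appears as an attribute of the overlay, e.g. a
# hypothetical AXI GPIO instance named 'axi_gpio_0' in the block design:
# gpio = ol.axi_gpio_0
```

The same information is also available programmatically through the overlay's IP dictionary, which is handy when scripting tests against a custom design.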
One of the key elements of any embedded system is the ability to perform operations at specific intervals, like reading a sensor or updating calculations for example. The best way to do this is to use a timer which triggers a periodic interrupt, indicating to the system it is time to perform the required action.
In the Arm Cortex-M1 core, we have been examining the system timer, which is controlled by the SysTick registers. The SysTick timer uses a total of four registers: control and status (SYST_CSR), reload value (SYST_RVR), current value (SYST_CVR), and calibration value (SYST_CALIB).
Xilinx is partnering with Monolithic Power Systems (MPS) to hold an informational webinar on getting up to speed with the new Zynq® UltraScale+™ RFSoC ZCU216 evaluation kit. This webinar reviews various topics from ground level knowledge of the Zynq UltraScale+ RFSoC device to getting started on the ZCU216 evaluation kit, to an ultra-low noise power solution from MPS, and much more.
Last week we examined how to implement an Arm Cortex-M1 processor core from scratch using the IP provided by the Arm DesignStart program. In this week’s blog, I am going to demonstrate how to configure the software element of the build.
Having used the Vivado Design Suite to generate the bitstream containing the processor and its tightly coupled instruction and data memories, we now need to create a board support package and the actual application. Once these have been created, we will be updating the tightly coupled instruction and data memories in the bitstream with the new application.
One of the more popular online and in-person classes I present at conferences and in webinars discusses how to implement Arm Cortex-M1 and Cortex-M3 processors in Xilinx programmable logic devices.
In this class, we start with an existing reference design, learn about the tool flow, and implement an application based on the provided reference design. While providing an excellent introduction to working with the Arm Cortex-M1 and Cortex-M3 processors, it does not show how to create Arm Cortex-M1 and Cortex-M3 solutions from scratch.
Over the next few blogs, I am going to explain this, beginning with how to implement an Arm Cortex-M1 processor within the programmable logic.
At DesignLinx Hardware Solutions, we use PetaLinux to create custom Linux images in support of our customers' custom Xilinx-based products. When I first heard about PetaLinux, I will admit it: I was skeptical. I come from an embedded Linux background and have done numerous projects involving pure Yocto/Bitbake/OE and integrating Linux within different SoC platforms. Yocto is a great way to create a custom embedded Linux distribution. From building everything from source to its super extensible interface, Yocto allows users to create a custom Linux distribution for their products.
The problem is that Yocto is hard. There is quite a steep learning curve that can make adopting it tough if not painful. Additionally, without a speedy build machine, full images can often take many hours to build (depending on the number of packages). When I finally tried to use PetaLinux, I was pleasantly surprised. It seemed to have many of the advantages of Yocto without the learning curve and build time.
Achieving higher resolution is a never-ending race for camera, TV, and display manufacturers. After the emergence of 4K ultra high definition (Ultra HD) imaging in the market, it became the main standard for today’s multimedia products. 4K consumers are everywhere, from live sports broadcasting to video conferencing on our mobile devices.
4K Ultra HD brings us bigger screens, which gives the viewer an immersive experience. With this standard, the pixelation problem for big screens was solved. There are, however, many technical challenges in developing systems to process 4K Ultra HD resolution data. As an example, a 4K frame size is 3840 x 2160 pixels (about 8.3 Mpixel) and is refreshed at 60 Hz, equating to about 500 Mpixel/sec. This requires a high-performance system to process 4K frames in real time. Another bottleneck is power consumption, particularly for embedded devices where power is critical. Being low power yet high performance, the Xilinx® Zynq® UltraScale+™ MPSoC has shown strong potential to tackle these challenges. In this blog, you'll learn all you need to know to start developing a 4K video conferencing project using the Zynq UltraScale+ MPSoC.
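The roughly 500 Mpixel/s throughput figure quoted above is easy to verify with back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope 4K UHD pixel-rate check.
width, height, fps = 3840, 2160, 60

pixels_per_frame = width * height            # 8,294,400, i.e. about 8.3 Mpixel
pixels_per_second = pixels_per_frame * fps   # 497,664,000, i.e. about 500 Mpixel/s

print(f"{pixels_per_frame / 1e6:.1f} Mpixel per frame")
print(f"{pixels_per_second / 1e6:.1f} Mpixel per second")
```

Note this counts raw pixels only; the byte rate is several times higher once the bits per pixel of the chosen colour format are factored in.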
High-Level Synthesis is great for implementing algorithms. However, there are times as we develop our HLS IP that we need to think about how it interfaces with the rest of the system beyond its main AXI interfaces.
This can be challenging in HLS, as it often means we need to be able to wait on external signals or to wait for several clock cycles.
In this blog, we are going to look at how we can implement such structure in our HLS algorithms.
The COVID-19 pandemic has forced many FPGA designers to work from home while still facing challenging engineering deadlines. Taking boards or lab setups home may not be an option – particularly when a group needs to collaborate through shared hardware / Devices-under-Test (DUT). Having a physically distributed team use the hardware from a central location presents challenges such as usage administration, swapping SD cards, power-cycling boards, and handling GPIOs, UARTs, etc.
The Xilinx® Versal™ silicon architecture and software tools provide a way to drastically improve image quality, speed, and accuracy in medical ultrasound systems using advanced imaging techniques. This greatly improves ultrasound-based diagnostic ability in complicated procedures.
High-speed real-time data acquisition and processing form the fundamental part of all innovative design developments. Taking this into account, iWave Systems has successfully developed and demonstrated a high-speed analog data acquisition and processing system over the JESD204B serial interface on our Zynq® UltraScale+™ MPSoC Development Platform. The JESD204B interface offers seamless connectivity between the AD-FMCDAQ2-EBZ data converters and the Zynq UltraScale+ MPSoC platform, accelerating the development of analog-based designs.
Over the last few blogs (P1, P2, P3), we have looked in depth at High-Level Synthesis (HLS) and its use in image processing.
HLS provides real advantages for image processing as it allows us to focus on our algorithm. We can also achieve very high frame rates when working with HLS, with a little thought about the optimizations we apply.
A few weeks ago we looked at reading a line of data from DDR memory such that we could create a simple test pattern.
For many applications, however, we want the ability to inject a two-dimensional image into the image processing stream. This gives us the ability to test our image processing algorithms' performance using synthetic images.
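As a rough illustration of the kind of synthetic frame we might inject, here is a simple test-pattern generator sketched in NumPy. The pattern layout, resolution, and colour choices are arbitrary assumptions for the example; in a real system the frame would be written to memory for the pipeline to fetch.

```python
import numpy as np

def test_pattern(width=1280, height=720):
    """Generate a simple RGB test image: a grey ramp over colour bars."""
    frame = np.zeros((height, width, 3), dtype=np.uint8)

    # Horizontal grey ramp across the top half.
    ramp = np.linspace(0, 255, width, dtype=np.uint8)
    frame[: height // 2] = ramp[None, :, None]

    # Eight vertical colour bars across the bottom half.
    bars = [(255, 255, 255), (255, 255, 0), (0, 255, 255), (0, 255, 0),
            (255, 0, 255), (255, 0, 0), (0, 0, 255), (0, 0, 0)]
    bar_w = width // 8
    for i, colour in enumerate(bars):
        frame[height // 2 :, i * bar_w : (i + 1) * bar_w] = colour
    return frame

img = test_pattern()
```

Patterns like ramps and bars are useful precisely because their expected output after each processing stage is easy to predict and check.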
In markets across the world, continuous demand for higher bandwidth scales beyond what today's technologies and form factors can support. The demand is for more efficient, pervasive compute that scales beyond what CPU and GPU technologies can match.
The Versal™ Premium series provides breakthrough heterogeneous integration, high-performance compute, connectivity, and security in an adaptable platform with a minimized power and area footprint. This highly integrated platform allows users to focus on their unique core competencies and novel algorithms, rather than designing connectivity and memory infrastructure, to achieve the earliest possible time to market.
One of the great things about image processing is we can layer video streams on top of each other. This gives us the ability to do picture-in-picture and to overlay text and graphics on the screen.
When it comes to displaying information, typically the information we want to display will result from sensor data which has been gathered by a processor: for example, temperature, pressure, altitude, or navigation information.
Displaying this information can be achieved in several different ways. If the processor is capable enough, it can read the sensors, process the information, and then create its own frame buffer in DDR memory, which can be applied as an overlay to the output image.
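Conceptually, applying such an overlay is a per-pixel alpha blend. Here is a minimal NumPy sketch of that blend; it is an illustration of the arithmetic only, not a specific Xilinx API, and the frame sizes and values are made up for the example.

```python
import numpy as np

def apply_overlay(frame, overlay, alpha):
    """Blend an RGB overlay onto a video frame.

    `alpha` is an (H, W) array in [0, 1]: 0 keeps the underlying video
    pixel, 1 replaces it with the overlay pixel.
    """
    a = alpha[..., None]  # broadcast the alpha plane over the colour channels
    out = frame.astype(np.float32) * (1.0 - a) + overlay.astype(np.float32) * a
    return out.astype(np.uint8)

# A small grey frame with an opaque red square overlaid in one corner.
frame = np.full((4, 4, 3), 128, dtype=np.uint8)
overlay = np.zeros_like(frame)
overlay[..., 0] = 255                 # pure red overlay plane
alpha = np.zeros((4, 4), dtype=np.float32)
alpha[:2, :2] = 1.0                   # opaque only in the top-left corner
blended = apply_overlay(frame, overlay, alpha)
```

A video mixer in the programmable logic performs an equivalent blend in hardware, pixel by pixel, as the streams pass through it.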
One of the great and often unknown elements of the Vivado Design Suite and Vitis Unified Software Platform is how easy it can be to create a complex application.
The SP701 board is designed for industrial applications, including image processing. To support this, it contains a MIPI CSI interface along with MIPI DSI and HDMI outputs.
Creating image processing systems can be complex: you need to configure the input image capture stream, implement image recovery (e.g., demosaicing to convert raw data to RGB pixels), set up frame buffers using VDMA, and finally create the image output path.
Architecting this image processing pipeline and configuring the settings can be time consuming.
However, if you have an SP701 board, there is a much faster way to get an image processing system up and running. Simply open the example design which is provided with the MIPI CSI-2 RX subsystem.
Xilinx® Alveo™ data center accelerator cards provide programmability, scalability, and performance across any server deployment. These products provide a low latency, power efficient solution that can be easily installed throughout a data center. Adaptable accelerator cards can be deployed to unlock dramatic throughput and latency improvements for demanding compute, network, and storage workloads. From machine learning inference, video transcoding, and data analytics to computational storage, electronic trading, and financial risk modeling, the Alveo cards bring programmability, flexibility, and high throughput while allowing low latency performance advantages to any server deployment.