Pentek’s new video discusses the broad product line of more than 20 Jade XMC, PCIe, AMC, compact PCI, and VPX boards based on the Xilinx Kintex UltraScale FPGA family. The Pentek Jade modules are designed for high-performance data-acquisition and signal-processing applications with on-board ADCs and DACs as fast as 5G samples/sec.
The broad Jade product line illustrates how a company can take a basic idea and use programmable logic to develop comprehensive, multi-member product lines while minimizing engineering effort by leveraging the numerous resources included in the broad line of mutually compatible Xilinx Kintex UltraScale FPGAs. The Jade family represents the latest generation of related products that Pentek has based on three successive generations of Xilinx FPGAs. This latest generation from Pentek is 13% lighter, uses 23% less power, and costs about 30% less than the preceding generation, partly due to using next-generation Xilinx devices.
The Jade product line illustrates this concept especially well: Pentek has not only developed a comprehensive line of board-level products, it has also created a set of support tools called the Navigator Design Suite that provides BSPs and software support for the Jade modules using Pentek-supplied IP for the on-board FPGAs. A companion tool called the Navigator FPGA Design Kit allows you to develop your own IP for high-speed data acquisition and signal processing. The Navigator BSP package and the Navigator FPGA Design Kit are closely linked so that the software and hardware IP dovetail.
Here’s the 4-minute Pentek video:
Note: For additional information on the Pentek Jade product line, see “Pentek kicks its radar, SDR DSP architecture up a notch from Cobalt to Onyx to Jade by jumping to Kintex UltraScale FPGAs.”
By Adam Taylor
Having looked at how we can optimize the Zynq SoC’s PS (processing system) for power during operation and when we wish for the Zynq SoC to enter sleep mode, I now want to round off our look at power-reduction techniques by looking at how we can reduce power consumption within the Zynq SoC’s PL (programmable logic) using design techniques. Obviously, one of the first things we should do is enable power optimization within the implementation flow, which optimizes the design for power efficiency. However, the Vivado tools can only optimize the design as presented. So let’s see what we can do to ensure that we present the best design possible.
Setting Power Optimization within Vivado
One of the first places to start is to ensure that we are familiar with the structure of the CLBs and slices used to implement our creations within the Zynq SoC’s PL. If you are not as familiar as you should be, these PL components are detailed in the 7 Series FPGAs CLB User Guide (UG474).
Each CLB contains two slices. These slices provide the LUTs (lookup tables), storage elements, etc. used to implement the logic in your design. The first thing we can do to optimize power consumption in our programmable-logic design is to consider the polarity, synchronicity, and grouping of the control signals to these CLBs and slices. When we talk about a control signal, we mean the clock, clock enable, set/reset, and distributed-RAM write enables used within a slice.
Storage elements in a Programmable Logic Slice
Looking at the storage elements shown above, you can see that except for the CLK control signal, which has a mux to enable its inversion, all other signals are active high. If we declare them as active low or asynchronous, we will require an extra LUT to invert the signal and additional routing resources to connect the inverter. These extra logic and routing resources increase power consumption.
Grouping of control signals relates to how a specific group of control signals—e.g. the clock, reset and clock enable—behave. Creating many different control groups within a design or module makes it more difficult for the placer to locate elements within different control groups close together. The end result will require more routing which makes timing closure more difficult and increases power consumption.
We also need to consider how we use and configure the PL’s I/O resources. For instance, we must give proper consideration to limiting drive strength and slew rate. We should also consider using the lowest I/O voltage supported by the receiving device. For example, we may be able to use reduced-swing LVDS in place of LVDS.
More advanced design techniques that we can use relate to the use of hard macros within the PL and how the tools use this logic. One of the biggest savings can be achieved by using a smaller device, which clearly reduces overall power. There are two main techniques we can use to reduce the size of the required device. The first of these is resource time sharing, which uses the same on-chip logic resources for different functions at different times. A second approach is to use a common core for processing multiple inputs and outputs if possible. However, this technique increases complexity during design capture because we must consider multiplexing and sequencing needs.
Once we have completed our design, we can run the XPE tool within Vivado to estimate power consumption and predict junction temperature (very important!). Hopefully, we’ll get the power reduction we require. However, if we do not, we can perform “what if” scenarios as detailed in UG907, which also contains other low-power design techniques.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
Magicians are very good at creating the illusion of levitating objects but the Institute for Integrated Systems at Ruhr University Bochum (RUB) has developed a system that does the real thing—quite precisely. The system levitates a steel ball using an electromagnet controlled by an Avnet PicoZed SOM, which in turn is based on a Xilinx Zynq-7000 SoC. An FMCW (frequency-modulated, continuous-wave) radar module jointly developed by RUB and the Fraunhofer Institute senses the ball’s position and that data feeds a PID control loop that controls the pulse-width-modulated current supplied to an electromagnet that levitates the steel ball.
FMCW radar sensor module jointly developed by RUB and the Fraunhofer Institute
The entire system was developed using the Xilinx SDSoC development environment with hardware acceleration used for the critical paths in the control loop resulting in fast, repeatable, real-time system response. The un-accelerated code runs on the Zynq SoC’s dual-core ARM Cortex-A9 processor and the code translated into hardware by SDSoC resides in the Zynq SoC’s programmable logic. SDSoC seamlessly manages the interaction between the system’s software and the hardware accelerators and the Zynq SoC provides a single-chip solution to the sensor-driven-control design problem.
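The control loop itself is a textbook discrete PID. The sketch below is a generic software formulation, not the RUB implementation (all names and gains are hypothetical); in the actual system the position measurement comes from the FMCW radar and the critical arithmetic runs in SDSoC-generated hardware:

```c
#include <assert.h>

/* Hypothetical discrete PID controller of the kind described above:
 * the measured ball position is compared against a setpoint and the
 * output would set the electromagnet's PWM duty cycle. All names and
 * gains are illustrative, not taken from the RUB design. */
typedef struct {
    float kp, ki, kd;   /* proportional, integral, derivative gains */
    float integral;     /* accumulated error */
    float prev_error;   /* error from the previous sample */
} pid_state;

float pid_step(pid_state *pid, float setpoint, float measured, float dt)
{
    float error = setpoint - measured;
    pid->integral += error * dt;
    float derivative = (error - pid->prev_error) / dt;
    pid->prev_error = error;
    return pid->kp * error + pid->ki * pid->integral + pid->kd * derivative;
}
```

Running a function like this at the radar's sample rate is exactly the kind of tight, deterministic loop that benefits from moving the critical path into the programmable logic.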
Here’s a 3-minute video that captures the entire demo:
It’s amazing what you can do with a few low-cost video cameras and FPGA-based, high-speed video processing. One example: the Virtual Flying Camera that Xylon has implemented with just four video cameras and a Xilinx Zynq-7000 SoC. This setup gives the driver a flying, 360-degree view of a car and its surroundings. It’s also known as a bird’s-eye view, but in this case the bird can fly around the car.
Many such implementations of this sort of video technology use GPUs for the video processing, but Xylon uses the programmable logic in the Zynq SoC using custom hardware designed with Xylon logicBRICKS IP cores. The custom hardware implemented in the Zynq SoC’s programmable logic enables very fast execution of complex video operations including camera lens-distortion corrections, video frame grabbing, video rotation, perspective changes, as well as the seamless stitching of four processed video streams into a single display output—and all this occurs in real time. This design approach assures the lowest possible video processing delay at significantly lower power consumption when compared to GPU-based implementations.
A Xylon logi3D Scalable 3D Graphics Controller soft-IP core—also implemented in the Zynq SoC’s programmable logic—renders a 3D vehicle and the surrounding view on the driver’s information display. The Xylon Surround View system permits real-time 3D image generation even in programmable SoCs without an on-chip GPU, as long as there’s programmable logic available to implement the graphics controller. The current version of the Xylon ADAS Surround View Virtual Flying Camera system runs on the Xylon logiADAK Automotive Driver Assistance Kit that is based on the Xilinx Zynq-7000 All Programmable SoC.
Here’s a 2-minute video of the Xylon Surround View system in action:
If you’re attending the CAR-ELE JAPAN show in Tokyo next week, you can see the Xylon Surround View system operating live in the Xilinx booth.
Jan Gray’s FPGA.org site has just published a blog post detailing the successful test of the GRVI Phalanx massively parallel accelerator framework, with 1680 open-source RISC-V processor cores running simultaneously on one Xilinx Virtex UltraScale+ VU9P. (That’s a mid-sized Virtex UltraScale+ FPGA.) According to the post, this is the first example of a kilocore RISC-V implementation and represents “the most 32-bit RISC cores on a chip in any technology.”
That’s certainly worth a picture (is a picture worth 1000 cores?):
1680 RISC-V processor cores run simultaneously on a Xilinx VCU118 eval kit with a Virtex UltraScale+ VU9P FPGA
The GRVI Phalanx design consists of 210 processing clusters, each comprising eight RISC-V processor cores, 128Kbytes of multiported RAM, and a 300-bit Hoplite NoC router. Here’s a block diagram of one such Phalanx cluster:
GRVI Phalanx Cluster Block Diagram
Note: Jan Gray contacted Xcell Daily after this post first appeared and wanted to clarify that the RISC-V ISA may be open-source and there may be open-source implementations of the RISC-V processor, but the multicore GRVI Phalanx is a commercial design and is not open-source.
Yesterday, National Instruments (NI) along with 15 partners announced the grand opening of the new NI Industrial IoT Lab located at NI’s headquarters in Austin, TX. The lab is a working showcase for Industrial IoT technologies, solutions, and systems architectures and will address challenges including interoperability and security in the IIoT space. The partner companies working with NI on the lab include:
NI’s Jamie Smith (on the left), Business and Technology Director, opens the new NI Industrial IoT Lab in Austin, TX
Next week, the Xilinx booth at the CAR-ELE JAPAN show at Tokyo Big Sight will hold a variety of ADAS (Advanced Driver Assistance Systems) demos based on Xilinx Zynq SoC and Zynq UltraScale+ MPSoC devices from several companies including:
The Zynq UltraScale+ MPSoC and the original Zynq SoC offer a unique mix of 32- and 64-bit ARM processors plus the heavy-duty processing you get from programmable logic, which is needed to process and manipulate video and to fuse data from a variety of sensors such as video and still cameras, radar, lidar, and sonar to create maps of the local environment.
If you are developing any sort of sensor-based electronic systems for future automotive products, you might want to come by the Xilinx booth (E35-38) to see what’s already been explored. We’re ready to help you get a jump on your design.
Avnet has just announced the 1x1 version of its PicoZed SDR 2x2 SOM that you can use for rapid development of software-defined radio applications. The 62x100mm form factor for the PicoZed SDR 1x1 SOM is the same as that used for the 2x2 version but the PicoZed SDR 1x1 SOM uses the Analog Devices AD9364 RF Agile Transceiver instead of the AD9361 used in the PicoZed SDR 2x2 SOM. Another difference is that the 2x2 version of the PicoZed SDR SOM employs a Xilinx Zynq Z-7035 SoC and the 1x1 SOM uses a Zynq Z-7020 SoC.
Avnet’s Zynq-based PicoZed SDR 1x1 SOM
One final difference: The Avnet PicoZed SDR 1x1 sells for $549 and the PicoZed SDR 2x2 sells for $1095. So if you liked the idea of the original PicoZed SDR SOM but wished for a lower-cost entry point, your wish is granted, with immediate availability.
Work started on CCIX, the cache-coherent interconnect for accelerators, a little over a year ago. The CCIX specification describes an interconnect that makes workload handoff from server CPUs to hardware accelerators as simple as passing a pointer. This capability enables a whole new class of accelerated data center applications.
Xilinx VP of Silicon architecture Gaurav Singh discussed CCIX at the recent Xilinx Technology Briefing held at SC16 in Salt Lake City. His talk covers many CCIX details and you can watch him discuss these topics in this 9-minute video from the briefing:
The video below shows Ravi Sunkavalli, the Xilinx Sr. Director of Data Center Solutions, discussing how advanced FPGAs like devices based on the Xilinx UltraScale architecture can aid you in developing high-speed networking and storage equipment as data centers migrate to faster internal networking speeds. Sunkavalli posits that CPUs, which are largely used for networking and storage applications connected with today’s 10G networks, quickly run out of gas at 40G and 100G networking speeds. FPGAs can provide “bump-in-the-wire” acceleration for high-speed networking ports thanks to the large number of fast compute elements and the high-speed transceivers incorporated into devices like the Xilinx UltraScale and UltraScale+ FPGAs.
Examples of networking applications already handled by FPGAs include VNF (Virtual Network Functions) such as VPNs, firewalls, and security. FPGAs are already being used to implement high-speed data center storage functions such as error correction, compression, and security.
The following 8-minute video was recorded during a Xilinx technology briefing at the recent SC16 conference in Salt Lake City:
All Internet-connected video devices produce data streams that are processed somewhere in the cloud, said Xilinx Chief Video Architect Johan Janssen during a talk at November’s SC16 conference in Salt Lake City. FPGAs are well suited to video acceleration and deliver better compute density than cloud servers based on microprocessors. One example Janssen gave during his talk shows a Xilinx Virtex UltraScale VU190 FPGA improving the video-stream encoding rate from 3 to 60fps while cutting power consumption by half when compared to the performance of a popular Intel Xeon microprocessor executing the same encoding task. In power-constrained data centers, that’s a 40x efficiency improvement with no increase in electrical or heat load. In other words, it costs a lot less operationally to use FPGAs for video encoding in data centers.
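The 40x figure follows directly from the two numbers quoted: a 20x throughput gain (3fps to 60fps) delivered at half the power. As a quick sanity check:

```c
#include <assert.h>

/* Performance-per-watt gain: (throughput ratio) / (power ratio).
 * The fps and power figures below are the ones quoted in the talk. */
double perf_per_watt_gain(double fps_before, double fps_after,
                          double power_ratio)
{
    return (fps_after / fps_before) / power_ratio;
}
```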
Here’s the 7-minute video of Janssen’s talk at SC16:
Last November at SC16 in Salt Lake City, Xilinx Distinguished Engineer Ashish Sirasao gave a 10-minute talk on deploying deep-learning applications using FPGAs with significant performance/watt benefits. Sirasao started by noting that we’re already knee-deep in machine-learning applications: spam filters; cloud-based and embedded voice-to-text converters; and Amazon’s immensely successful, voice-operated Alexa are all examples of extremely successful machine-learning apps in broad use today. More—many more—will follow. These applications all have steep computing requirements.
There are two phases in any machine-learning application. The first is training and the second is deployment. Training is generally done using floating-point implementations so that application developers need not worry about numeric precision. Training is a one-time event, so energy efficiency isn’t all that critical.
Deployment is another matter however.
Putting a trained deep-learning application in a small appliance like Amazon’s Alexa calls for attention to factors such as energy efficiency. Fortunately, said Sirasao, the arithmetic precision of the application can change from training to mass deployment and there are significant energy-consumption gains to be had by deploying fixed-point machine-learning applications. According to Sirasao, you can get accurate machine inference using 8- or 16-bit fixed-point implementations while realizing a 10x gain in energy efficiency for the computing hardware and a 4x gain in memory energy efficiency.
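The float-to-fixed-point conversion Sirasao describes can be sketched as simple scale-and-round quantization. This is illustrative only; production deployment flows choose per-layer or per-channel scale factors and calibrate them against real data:

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch of post-training quantization: map float values in
 * roughly [-127*scale, +127*scale] onto int8 with one scale factor. */
int8_t quantize(float x, float scale)
{
    float r = x / scale;
    int q = (int)(r + (r >= 0.0f ? 0.5f : -0.5f));  /* round to nearest */
    if (q > 127)  q = 127;    /* saturate to the int8 range */
    if (q < -128) q = -128;
    return (int8_t)q;
}

float dequantize(int8_t q, float scale)
{
    return (float)q * scale;  /* recover an approximation of x */
}
```

The round trip loses at most half a quantization step, which is why 8- or 16-bit inference can remain accurate while the arithmetic gets dramatically cheaper.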
The Xilinx DSP48E2 block implemented in the company’s UltraScale and UltraScale+ devices is especially useful for these machine-learning deployments because its DSP architecture can perform two independent 8-bit operations per clock per DSP block. That translates into nearly double the compute performance, which in turn results in much better energy efficiency. There’s a Xilinx White Paper on this topic titled “Deep Learning with INT8 Optimization on Xilinx Devices.”
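The packing trick behind the two-INT8-ops-per-DSP claim (described in the white paper) can be illustrated in plain C: two 8-bit weights are packed into one wide operand, far enough apart that their partial products can’t overlap, and a single wide multiply yields both results. This sketch uses non-negative operands for clarity; the real DSP48E2 flow also handles signed operands with a correction term:

```c
#include <assert.h>
#include <stdint.h>

/* One wide multiply producing two 8-bit products that share the
 * operand 'a'. w1 and w2 sit 18 bits apart, and since w2 * a fits in
 * 18 bits (255 * 255 < 2^18) the partial products never collide. */
void packed_mul(uint8_t a, uint8_t w1, uint8_t w2,
                uint32_t *p1, uint32_t *p2)
{
    uint64_t packed  = ((uint64_t)w1 << 18) | w2;  /* pack both weights */
    uint64_t product = packed * a;                 /* single multiply */
    *p2 = (uint32_t)(product & 0x3FFFF);           /* low 18 bits: w2 * a */
    *p1 = (uint32_t)(product >> 18);               /* upper bits:  w1 * a */
}
```

In the DSP48E2, the 27x18 multiplier plays the role of the wide multiply here, which is how one slice delivers two INT8 MACs per clock.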
Further, Xilinx recently announced its Acceleration Stack for machine-learning (and other cloud-based applications), which allows you to focus on developing your application rather than FPGA programming. You can learn about the Xilinx Acceleration Stack here.
Finally, here’s the 10-minute video with Sirasao’s SC16 talk:
Nextera Video is helping the broadcast video industry migrate to video-over-IP as quickly as possible with an FPGA IP core developed for Xilinx UltraScale and other Xilinx FPGAs that compresses 4K video using Sony’s low-latency, noise-free NMI (Network Media Interface) packet protocols to achieve compression ratios of 3:1 to 14:1. The company’s products can transport compressed 4Kp60 video between all sorts of broadcast equipment over standard 10G IP switches, which significantly lowers equipment and operating costs for broadcasters.
Here’s a quick video that describes Nextera’s approach:
By Adam Taylor
As I discussed last week, one method we can use to reduce the power is to put the Zynq SoC in low-power mode when we detect that the system is idle. The steps required to enter and leave the low-power mode appear in the technical reference manual (section 24.4 of UG585). However, it’s always good to see an actual example to understand the power savings we get by entering this mode.
We’ll start with a running system. The current draw (344.9mA) appears on the DMM’s display in the upper left part of this image:
MicroZed with the DMM measuring current
We follow these steps from the TRM to place the Zynq SoC’s ARM Cortex-A9 processor into sleep mode:
Implementing most of these steps requires that we use the standard Xil_Out32() approach to modify the desired register contents, as we have done in many examples throughout this blog series. There are, however, some registers we need to interact with using inline assembly language. We will now look at this in more detail because it is a little different from using the Xil_OutXX() functions.
We need to use assembler for two reasons: to interact with the CP15 co-processor registers and to execute the wait-for-interrupt (WFI) instruction. You will notice that the CP15 registers are not defined within the TRM, so no address-space details are provided. However, we can still access them from within our SDK application.
We’ll use a bare-metal approach to demonstrate how we enter sleep mode. The generated BSP will provide the functions and macros to interact with the Zynq SoC’s CP15 registers. There are three files that we need to use:
We can use two macros contained within xpseudo_asm_gcc.h to perform the writes we need to make to the CP15 power-control register. These are the macros MFCP, which allows us to read a register, and MTCP, which allows us to write to a register. We can find the address of the register we want to interact with within the file xreg_cortexa9.h, as shown in the image below:
Actual code within the power-down application
The last element we need is the final WFI instruction to wait for the wake-up source interrupt. Again, we use inline assembler, just as we did previously when we issued the SEV instruction to wake up the second processor as part of the AMP example.
Defining the WFI instruction
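Putting the pieces together, the sleep-entry sequence is: program the PS registers per the TRM, set the CP15 power-control bit, then execute WFI. The sketch below mimics that sequence with host-side stubs so the flow can be illustrated without Zynq hardware; the register values and names here are illustrative stand-ins, not the real UG585 addresses, and on the target you would use the real Xil_Out32(), the MTCP macro, and the actual wfi instruction:

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-ins for the hardware-access primitives. */
static uint32_t fake_mmio_reg;   /* stands in for a PS power register */
static uint32_t fake_cp15_pwr;   /* stands in for the CP15 power-control register */
static int wfi_called;

static void xil_out32_stub(uint32_t *reg, uint32_t value) { *reg = value; }
static void mtcp_stub(uint32_t *cp15, uint32_t value)     { *cp15 = value; }
static void wfi_stub(void) { wfi_called = 1; }  /* real code: __asm__("wfi") */

/* Sleep-entry sequence sketch, mirroring UG585 section 24.4:
 * 1. program the PS registers, 2. set the CP15 power-control bit,
 * 3. wait for the wake-up interrupt. Values are illustrative. */
void enter_sleep_sketch(void)
{
    xil_out32_stub(&fake_mmio_reg, 0x1);  /* step 1: Xil_Out32() writes */
    mtcp_stub(&fake_cp15_pwr, 0x1);       /* step 2: MTCP macro write   */
    wfi_stub();                           /* step 3: WFI, core sleeps   */
}
```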
When I put all of this together and ran the code on the MicroZed dev board, I noted a 100mA drop in the overall current draw, which equates to a 29% drop in power—from 1.72W to 1.22W. You can see the overall effect in this image. Note the new reading on the DMM.
Resultant current consumption after entering sleep mode
This is a considerable reduction in power. However, you may be surprised it is not more. Remember that we still have elements of the Zynq SoC powered up. We can power these elements down as well to achieve even lower power dissipation. For example, we can power down the Zynq SoC’s PL. While powering down the PL results in a longer wake-up time because the PL must be reconfigured after waking up, the resulting power saving is greater. This does require that we architect the power-supply design so that specific voltage rails can be powered down.
Next week we will look at how we can develop our PL application for lower power dissipation in operation.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
As a winter-break project, Edward at MicroCore Labs ported his FPGA-based MCL86 8088 processor core to a vintage IBM PCjr. The MCL86 core combined with the minimum-mode 8088 BIU consumes 1.5% of the smallest Kintex-7 FPGA, the K70T. Four of the Kintex-7 FPGA’s 135 block RAMs hold the processor’s microcode. Edward disabled the cycle-accuracy governor in the core and added 128Kbytes of internal RAM using the same Kintex-7 FPGA that he was using to implement the processor. The result: the world's fastest IBM PCjr running Microsoft DOS 2.1.
IBM PCjr sped up by a MicroCore Labs MCL86 processor core implemented in a Kintex-7 FPGA
Will the world beat a path to Edward’s door for fast antiquated personal computers? Probably not. The PCjr, code name “Peanut,” was IBM’s ill-fated attempt to enter the home market with a cost-down version of the original IBM PC. When it was announced on November 1, 1983, the entire market quickly developed a “Peanut” allergy. Adjectives such as “toylike,” “pathetic,” and “crippled” were used to describe the PCjr in the press.
The machine’s worst feature, the one to come under the most criticism, was its “Chiclet” keyboard (named after a chewing gum with a shape similar to the keyboard’s keys). IBM had gone from making the world’s best keyboard on the IBM PC to the world’s worst on the PCjr. After a year and a half of sales that dropped off a cliff as soon as the discounts ended, IBM killed the machine.
So what’s the point of MicroCore’s Franken-Peanut then?
It nicely demonstrates the vast implementation power of even the smallest Xilinx FPGAs. MicroCore Labs’ MCL86 processor core easily fits in low-cost FPGAs from the Spartan-6 and Spartan-7 product lines.
Finally, here’s a very short video of MicroCore’s jazzed-up IBM PCjr playing a little Bach:
It’s been fascinating to watch Apertus’ efforts to develop the crowd-funded Axiom Beta open 4K cinema camera over the past few years. It’s based on On Semi and CMOSIS image sensors and a MicroZed dev board sporting a Xilinx Zynq Z-7030 SoC. Apertus released a Team Talk video last November with Max Gurresch and Sebastian Pichelhofer discussing the current state of the project and focusing on the mechanical and housing aspects of the project.
Here’s a photo from the video showing a diagram of the camera’s electronic board stack including the MicroZed board:
Axiom Beta open-source 4K Cinema Camera Electronic Board Stack
And here’s a rendering of the current thinking for a camera enclosure, which is discussed at length in the video.
Axiom Beta open-source 4K Cinema Camera Housing Concept
If you’re interested in following the detailed thought processes of this complex imaging product that also addresses many of the issues connected with a crowd-funded project like the Axiom Beta camera, watch this video:
For more information about the Axiom Beta 4K cinema camera, see “How to build 4K Cinema Cameras: The Apertus Prescription includes Zynq ingredients.”
S2C wants you to get into system prototyping with the super-capable Xilinx Kintex UltraScale FPGA fast, so it’s running a short-term, limited-time, limited-quantity promo cutting the price of a proto package in half. You get a bundle including the company’s Single KU115 Prodigy Logic Module, the 8Gbyte Prodigy DDR4 Memory Module, and the Prodigy GPIO Extension Module for half the regular price.
Prodigy Kintex UltraScale Proto Package with DDR4, GPIO extension modules
The Kintex UltraScale KU115 FPGA is a DSP monster with 5520 DSP48E2 DSP slices, 1.451 million system logic cells, 75.9Mbits of BRAM, and 52 16.3Gbps GTH serial transceiver ports (48 of which are brought out to connectors on the S2C Prodigy Logic Module), and 832 I/O pins (656 of which are brought out to connectors on the S2C Prodigy Logic Module).
Want that S2C deal? (Of course you do!) Click here.
Better do it fast though, before S2C changes its mind.
Anand V Kulkarni, Engineering Manager, Atria Logic India Pvt Ltd, Bangalore, India
Atria Logic’s H.264 codec IP blocks (the AL-H264E-4KI422-HW encoder and the AL-H264D-4KI422-HW decoder) achieve UHD 4K@60fps video with each running on a Xilinx Zynq Z-7045 SoC as shown in the figure below.
Block Diagram of Atria Logic UHD H.264 Codec Solution
Atria Logic’s AL-H264E-4KI422-HW is a hardware-based, feature-rich, low-latency, high-quality, H.264 (AVC) UHD Hi422 Intra encoder IP core. The AL-H264E-4KI422-HW encoder pairs with the Atria Logic AL-H264D-4KI422-HW low-latency decoder IP.
The IP cores’ features include:
When devising a plan for evaluating our UHD Encoder and Decoder IP cores and to meet 4K@60fps performance requirements, we needed a flexible, powerful platform. We settled on the Xilinx ZC706 evaluation kit based on the Zynq Z-7045 SoC because:
The H.264 encoder supports the H.264 Hi422 (High-422) profile at Level 5.1 (3840x2160p30) for Intra-only coding. Support for 10-bit video content means that there is no grayscale or color degradation in terms of banding. Support for YUV 4:2:2 video content means that there is better color separation—especially noticeable for red colors—which makes images appear sharper. These video-quality attributes are especially important for medical-imaging applications.
Atria Logic UHD H.264 Encoder IP Block Diagram
Support for Intra-only encoding allows the H.264 encoder to operate at frame-level latencies. A macroblock-line-level pipelined architecture further reduces the latency to the sub-frame level: about 0.3msec. A pipelined design that processes 8 pixels/clock allows the core to encode 4K@60fps in real time.
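A quick sanity check on the 8-pixels/clock figure: 3840x2160 pixels at 60 frames/sec is roughly 498Mpixels/sec, so an 8-pixel/clock pipeline needs only about a 62MHz fabric clock for the active pixels (before overheads such as blanking), comfortably achievable in the Zynq Z-7045’s programmable logic. The arithmetic, as a sketch:

```c
#include <assert.h>

/* Minimum fabric clock (MHz) needed to sustain a given active-pixel
 * rate at a given pipeline width. Ignores blanking and other overhead. */
double min_clock_mhz(int width, int height, int fps, int pixels_per_clock)
{
    double pixels_per_sec = (double)width * height * fps;
    return pixels_per_sec / pixels_per_clock / 1e6;
}
```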
Implementation of the Atria Logic H.264 encoder consumes only 78% of the Zynq Z-7045 SoC’s programmable logic and DSP resources and 55% of the available RAM, leaving ample room for other required circuitry.
The H.264 decoder supports the H.264 Hi422 (High-422) profile at Level 5.1 (3840x2160p30) for Intra-only coding. As with the encoder, support for 10-bit video content means that there is no grayscale or color degradation in terms of banding. The decoder also supports YUV 4:2:2 video content. Support for Intra-only decoding using a pipelined architecture allows the decoder to operate at frame-level latencies.
Atria Logic UHD H.264 Decoder IP Block Diagram
Low latency is important for any closed-loop man/machine application. When the Atria Logic AL-H264E-4KI422-HW encoder is connected to the Atria Logic AL-H264D-4KI422-HW low-latency decoder via an IP network, the glass-to-glass latency is about 0.6msec (excluding transmission latency), a small fraction of a single frame time at 60fps.
An efficient implementation of the Atria Logic H.264 decoder only takes up 68% of the Zynq Z-7045 SoC’s programmable logic resources, 35% of available DSP resources, and 45% of the available RAM, leaving ample room for implementation of any other required circuitry.
The HDMI transceiver (GTX) module transmits and receives the serial HDMI TX and RX data, converting between these serial streams and on-chip parallel data streams as needed. It employs the Zynq SoC’s high-speed GT transceivers as the HDMI PHY.
The TX subsystem consists of the transmitter core, AXI video bridge, video timing controller, and an optional HDCP module. An AXI video stream carries two or four pixels per clock into the HDMI TX subsystem and supports 8, 10, and 12 bits per component. This stream conforms to the video protocol defined in the Video IP chapter of the AXI Reference Guide (UG761). The TX subsystem’s video bridge converts the incoming video AXI-stream to native video and the video timing controller generates the native video timing. The audio AXI stream transports multiple channels of uncompressed audio data into the HDMI TX subsystem. The Zynq Z-7045 SoC’s ARM Cortex-A9 processor controls the HDMI TX subsystem’s transmitter blocks through the CPU interface.
The HDMI RX subsystem incorporates three AXI interfaces. A video bridge converts captured native video to AXI streaming video and outputs the video data through the AXI video interface using the video protocol defined in the Video IP chapter of the AXI Reference Guide (UG761). The video timing controller measures the video timing. Received audio is transmitted through the AXI streaming audio interface. A CPU interface provides processor access to the peripherals’ control and status data.
The HDCP module is optional and is not included in the standard deliverables.
By Adam Taylor
I thought I would kick off the new year with a few blogs that look at the Zynq SoC’s power-management options. These options are important for many Zynq-based systems that are designed to run from battery power or other constrained power sources.
There are several elements of the design we can look at, from the system and board level down to the PS and PL levels inside of the Zynq SoC. At the system level, we can look at component selection. We can use low-voltage devices wherever possible because they will have a lower static power consumption. We can also use lower-power DRAM by selecting components like LPDDR in place of DDR2. One of the simpler choices would be selecting a single-core Zynq SoC as opposed to a dual-core device.
Within the Zynq SoC itself, there are several things we can do both within the PS and PL to reduce power. There are two categories we can consider when it comes to reducing power consumption in Zynq-based systems:
The first option allows us to reduce the power consumption after we have detected that the system has been inactive for a period and should therefore enter a low-power mode to prolong operating life on a battery charge. The second option allows us to make the best use of the battery capacity while operating. I will demonstrate the savings to be had with the Zynq SoC’s sleep mode and how to enter it in a follow-up blog. For the moment, I want to look at what we can do within the Zynq SoC’s PS to reduce power consumption. Most of these techniques relate to how we configure the clocking architecture within the PS.
As you can see in the diagram below, the Zynq SoC’s clocking architecture is very flexible. We can use this flexibility to reduce the power consumption of the Zynq PS.
Zynq SoC Clocking Architecture
The first approach we can take is to trade off performance against power consumption. We can reduce power consumption within the Zynq SoC’s PS simply by selecting a lower APU frequency. Of course, this also reduces APU performance. However, as engineers, one of our roles is to understand the overall system requirements and balance them. CMOS dynamic power dissipation scales with frequency, so reducing the APU frequency has the potential to significantly reduce PS power dissipation. We can apply the same trade-off to the DDR SDRAM, trading memory bandwidth for reduced power.
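To first order, CMOS dynamic power follows P = C·V²·f, which is why cutting the APU clock rate cuts PS dynamic power proportionally. A toy model of this relationship (the capacitance and voltage values below are illustrative, not Zynq measurements):

```c
#include <assert.h>

/* First-order CMOS dynamic power model: P = C * V^2 * f.
 * C = effective switched capacitance, V = supply voltage, f = clock. */
double dynamic_power(double c_farads, double v_volts, double f_hz)
{
    return c_farads * v_volts * v_volts * f_hz;
}
```

Note that this covers only dynamic power; static (leakage) power is unaffected by the clock rate, which is one reason measured savings are smaller than the frequency ratio alone would suggest.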
Clock Configuration in the Zynq SoC – Reducing the APU frequency
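The frequency/power relationship above follows from the standard CMOS dynamic-power approximation P ≈ α·C·V²·f. Here is a quick back-of-the-envelope sketch of that scaling; the capacitance and activity-factor constants are purely illustrative placeholders, not Zynq SoC characterization data:

```python
# Illustrative only: shows how CMOS dynamic power scales linearly with
# clock frequency (P_dyn ~ alpha * C * V^2 * f). The constants below are
# made-up placeholders, not Zynq SoC figures.

def dynamic_power(freq_hz, vdd=1.0, c_eff=1e-9, activity=0.15):
    """Approximate CMOS dynamic power in watts."""
    return activity * c_eff * vdd ** 2 * freq_hz

p_full = dynamic_power(666e6)   # APU at 666 MHz
p_half = dynamic_power(250e6)   # APU reduced to 250 MHz

# Power falls in direct proportion to frequency:
print(f"relative dynamic power at 250 MHz: {p_half / p_full:.2f}")  # -> 0.38
```

In other words, with voltage held constant, cutting the APU clock from 666MHz to 250MHz removes roughly 62% of the APU's dynamic power contribution, which is why the frequency trade-off is worth considering first.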
Along with reducing the frequency of the APU, we can also implement a clocking scheme that reduces the number of PLLs used within the PS. The Zynq PS has three available PLLs named the ARM, IO, and DDR PLL. The clocking architecture allows downstream connections to use any one of the PLL sources, so a clocking scheme that uses fewer than all three PLLs results in lower power dissipation as unused PLLs can be disabled and their power consumption eliminated.
In addition, the application being developed may not require the use of all peripherals within the PS. We can therefore use the Zynq SoC’s clock-gating facilities to reduce power consumption by not clocking unused peripherals, further reducing the power consumption of the PS within the Zynq.
I performed a very simple experiment with a MicroZed board by inserting an ammeter into the USB power port supplying power to the MicroZed. This is a simple way to monitor the board’s overall power consumption. Running the Zynq PS alone with no design in the programmable logic, the MicroZed drew a current of 364mA @ 5V (1.82W) with the default MicroZed configuration.
I ran a few simple experiments to see the effect on the overall power consumption of reducing the clock frequency from 666MHz to 250MHz and of selecting only one PLL—the DDR PLL—to clock the design. Running just from the DDR PLL reduced the current consumption to only 308mA, a 16% reduction. However, I did have to deactivate the unused PLLs myself in my application. Reducing the frequency of the APU alone only reduced the overall current consumption to 345mA, a 6% reduction. So we see that turning off unused PLLs can have a big effect on power consumption.
If we want to gate the clocks to unused peripherals within the PS, we can use the Zynq SoC’s APER register to disable the clocks to that peripheral.
APER Control Register Bits
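As a sketch of what such clock gating involves, the helper below clears a peripheral's clock-enable bit in a copy of the APER_CLK_CTRL register value. The register address (0xF800012C in the SLCR) and the bit positions reflect my reading of the Zynq-7000 TRM (UG585) and should be verified against the TRM before use; an actual write also requires unlocking the SLCR and privileged access (e.g. via /dev/mem), which is omitted here:

```python
# Sketch: computing an APER_CLK_CTRL value with a peripheral's AMBA clock
# gated off. Address and bit positions follow my reading of the Zynq-7000
# TRM (UG585) -- confirm them against the TRM before relying on this.

# SLCR APER_CLK_CTRL register address (assumption -- check UG585).
APER_CLK_CTRL_ADDR = 0xF800012C

# A few example clock-enable bit positions within APER_CLK_CTRL:
APER_BITS = {
    "SPI0": 14,
    "I2C0": 18,
    "UART0": 20,
    "UART1": 21,
    "GPIO": 22,
}

def gate_peripheral(aper_value, peripheral):
    """Return aper_value with the named peripheral's clock-enable bit cleared."""
    return aper_value & ~(1 << APER_BITS[peripheral])

# Example: starting from a value with only UART1 enabled, gate its clock.
val = 1 << APER_BITS["UART1"]
print(hex(gate_peripheral(val, "UART1")))  # -> 0x0
```

The read-modify-write pattern matters here: only the targeted enable bit should change, so the clocks to peripherals that are still in use remain untouched.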
For a final experiment, I relocated the program to execute from the Zynq SoC’s on-chip RAM and disabled the DDR memory. For many applications, this may not be feasible but for some it may, so I thought it worthy of a test. Relocating the code further reduced the current consumption to 270mA (a 26% reduction) when combined with peripheral gating, APU frequency reduction, and running from one PLL alone.
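The combined saving quoted above follows directly from the two measured supply currents, assuming a constant 5V rail:

```python
# Percentage saving from the combined techniques, computed from the
# measured MicroZed supply currents (constant 5 V rail assumed).

baseline_ma = 364   # default MicroZed configuration
combined_ma = 270   # one PLL + reduced APU clock + gating + code in OCM

saving = (baseline_ma - combined_ma) / baseline_ma * 100
print(f"combined saving: {saving:.0f}%")   # -> combined saving: 26%
```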
Next time we will look at how we can place the processor into sleep mode.
Baumer’s new line of intelligent LX VisualApplets industrial cameras delivers image and video processing with image resolutions to 20Mpixels at high frame rates. The cameras contain sufficient FPGA-accelerated, image-processing power to perform real-time image pre-processing according to application-specific programming created using Silicon Software’s VisualApplets graphical programming environment. This pre-processing improves an imaging system’s throughput and real-time response while reducing the amount of data uploaded to a host.
Baumer intelligent LX VisualApplets industrial camera
The Baumer LX VisualApplets cameras perform this pre-processing in camera using an internal Xilinx Spartan-6 LX150 FPGA and 256Mbytes of DDR3 SDRAM. The cameras support a GigE Vision interface over 100m of cable. Notably, these new industrial cameras recently won a Platinum-level award in the Vision Systems Design 2016 Innovators Awards Program.
The LX VisualApplets industrial camera product family includes seven models with sensor resolutions ranging from 2Mpixels to 20Mpixels, all based on CMOSIS image sensors. Here’s a table listing details for the seven 2D and 3D models in the product family:
Here’s a lighthearted, 3.5-minute whiteboard video that concisely describes the advantages of in-camera, FPGA-based, image-stream pre-processing:
You can now download the latest version of the Vivado Design Suite HLx Editions, release 2016.4, which adds support for multiple Xilinx UltraScale+ devices including the Virtex UltraScale+ XCVU11P and XCVU13P FPGAs and board support packages for the Zynq UltraScale+ MPSoC ZCU102-ES2 and Virtex UltraScale+ VCU118-ES1 boards.
Download the latest version here.
By Adam Taylor
To wrap up this blog for the year, we are going to complete the SDSoC integration using the shared library.
To recap, we have generated a bit file using the Xilinx SDSoC development environment that implements the matrix multiply example using the PL (programmable logic) on the base PYNQ platform, which we previously defined using SDSoC. The final step is to get it all integrated and the first step is to upload the following files to the PYNQ board:
- the bit file (.bit)
- the TCL file (.tcl)
- the shared library (.so)
The names are slightly different as I generated them as part of the previous blog.
Using a program like WinSCP, I uploaded these three files to the PYNQ bitstream directory, the same place we uploaded our previous design to.
The next step is to develop the Jupyter notebook so that we can drive the new overlay that we have created. To get this up and running we need to do the following:
This is very similar to what we have done previously with the exception of creating the CFFI interface, so that is where the rest of this blog will focus.
The first thing we need to do is find the name of the accelerated function within the shared library, because SDSoC gives it a different name from the original function. We can find the renamed files under <project>/<build config>/_sds/swstubs while the hardware files are under <project>/<build config>/_sds/p0/ipi.
If you already have the shared library on your PYNQ board, then you can use the command nm -D <path & shared library name> to examine its contents if you access the PYNQ via an SSH session.
With the name of the function known, we can create the CFFI class within our Jupyter notebook. For this example, the class needs two functions: one for initialization and another to interact with the library. The more complicated of the two is the initialization, in which we must define the location of the shared library within the file system. As mentioned earlier, I uploaded the shared library to the same location as the bit and TCL files. We also need to declare the functions contained within the shared library and then finally open the shared library.
The second function within the class is what we call when we wish to make use of the shared library. We can then make use of this class as we do any other within the rest of our program. In fact, this approach is used often in Python development to bind together C and Python.
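A minimal sketch of such a wrapper class is below. Note that the SDSoC-mangled function name and the on-board library path shown in the commented usage are hypothetical placeholders (this is exactly why we looked the real name up with nm earlier); substitute the names from your own build:

```python
# Sketch of a CFFI wrapper class in the style used to drive an
# SDSoC-generated shared library from a PYNQ Jupyter notebook.
import cffi

class OverlayLibrary:
    """Thin CFFI wrapper around a shared library."""

    def __init__(self, lib_path, signatures):
        # Declare the C functions the library exports, then open it.
        self.ffi = cffi.FFI()
        self.ffi.cdef(signatures)
        self.lib = self.ffi.dlopen(lib_path)

    def call(self, func_name, *args):
        # Invoke one of the declared functions by name.
        return getattr(self.lib, func_name)(*args)

# Hypothetical usage on the PYNQ board -- the path and mangled function
# name below are placeholders, not real SDSoC output:
#
# mmult = OverlayLibrary(
#     "/home/xilinx/pynq/bitstream/libmmult.so",
#     "void _p0_mmult_accel_0(float *a, float *b, float *c);")
# mmult.call("_p0_mmult_accel_0", a_buf, b_buf, c_buf)
```

Keeping the library path and function declarations as constructor arguments means the same class can wrap any SDSoC-generated .so, which is the binding pattern between C and Python that the rest of the notebook relies on.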
This example shows just how easily we can create overlays using SDSoC and interface with them using Python and the PYNQ development system. If you want to try it and do not currently have a license for SDSoC, you can obtain a free 60-day evaluation here with the new release.
As I mentioned up top, this is the last blog of 2016; I will resume writing in the New Year. To give you a taste of what we are going to be looking at in 2017, amongst other things I will be featuring:
Until then, have a great Christmas and New Year and thanks for reading the series.
Code is available on Github as always.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
All of Adam Taylor’s MicroZed Chronicles are cataloged here.
You can develop and deploy FPGA-accelerated cloud apps using the Xilinx SDAccel development environment with no downloads and no local FPGA hardware using a new Web-based service from Nimbix. This service runs on a Nimbix platform named JARVICE, which is specifically designed for Big Data and Big Compute workloads.
Here’s a new 2.5-minute video demonstrating the Nimbix platform in action:
You can develop apps and deploy them on JARVICE. The Nimbix service is available as a subscription and as a pay-as-you-go service for only a few bucks per hour.
For more information about the Xilinx SDAccel development environment for cloud-based apps, see “Removing the Barrier for FPGA-Based OpenCL Data Center Servers.” To read about applications created using SDAccel, see:
VadaTech’s new AMC596 “FPGA carrier” answers the question, “How much processing power can you pack into the MicroTCA AMC form factor?” The answer is: a lot.
VadaTech MicroTCA AMC596 FPGA Carrier
The AMC596 combines the processing horsepower of a Xilinx Virtex UltraScale VU440 FPGA with a QorIQ P2040 quad-core communications processor (that’s a team of four PowerPC e500mc processor cores, each running at 1.2GHz). All of this processing power plus 8Gbytes of 64-bit DDR4 SDRAM for the Virtex UltraScale FPGA and 1Gbyte of DDR3 SDRAM dedicated to the QorIQ processor fits on a board measuring a mere 73.5x180.6mm.
Here’s a block diagram of VadaTech’s AMC596 “FPGA carrier”:
The Virtex UltraScale VU440 FPGA on the VadaTech AMC596 is the largest 20nm Xilinx Virtex UltraScale device and brings 4.433M logic cells, 2880 DSP48E2 slices, and 88.6Mbits of Block RAM to the party. You can build a lot of things with that many on-chip resources. VadaTech’s Web page for the AMC596 suggests that it’s ideal for ASIC prototyping or emulation, but of course there are a lot of interesting things you can do with this much processing power in a small form factor.
Powerful things. Fast things.
For the last three years, I’ve searched for a book on designing with FPGAs—specifically Xilinx FPGAs—that I can recommend to people who are just starting out. The introduction of the Vivado Design Suite in April 2012 complicated things because existing books on FPGA-based design did not discuss Vivado.
Now, there’s a 260-page book from Springer that I can recommend. The book title is “Designing with Xilinx FPGA: Using Vivado.” That’s pretty self-explanatory, isn’t it?
You might be able to teach yourself all of the topics in this book by collecting dozens of different documents on the Xilinx.com Web site. You’ll find it a lot easier just to get this book.
The simplest way to explain the book’s contents is to list the clearly labeled chapters, each written by a different author:
1. State-of-the-Art Programmable Logic
2. Vivado Design Tools
3. IP Flows
4. Gigabit Transceivers
5. Memory Controllers
6. Processor Options
7. Vivado IP Integrator
8. SysGen for DSP
10. C-Based Design
13. Stacked Silicon Interconnect (SSI)
14. Timing Closure
15. Power Analysis and Optimization
16. System Monitor
17. Hardware Debug
18. Emulation using FPGAs
19. Partial Reconfiguration and Hierarchical Design
This newly published book covers extremely current topics including Stacked Silicon Interconnect (the Xilinx designation for 3D ICs) and the Xilinx Zynq UltraScale+ MPSoC. Of course, just as it says in the book title, the text heavily discusses the use of the Vivado Design Suite to develop designs with Xilinx devices.
This unlikely new project on the Instructables Web site uses a $189 Digilent ZYBO trainer board (based on a Xilinx Zynq Z7010 SoC) to track balloons with an attached Webcam and then pop them with a high-powered semiconductor laser. The tracking system is programmed with OpenCV.
Here’s a view down the bore of the laser:
And there’s a 1-second video of the system in action on the Instructables Web page.
Fun aside, this system demonstrates that even the smallest Zynq SoC can be used for advanced embedded-vision systems. You can get more information about embedded-vision systems based on Xilinx silicon and tools at the new Embedded Vision Developer Zone.
Note: For more information about Digilent’s ZYBO trainer board, see “ZYBO has landed. Digilent’s sub-$200 Zynq-based Dev Board makes an appearance (with pix!)”
The latest version of the Xilinx SDSoC Development Environment for Zynq UltraScale+ MPSoCs and Zynq-7000 SoCs, 2016.3, is now available for download and includes the following features:
Complete release notes for SDSoC 2016.3 are available here.
Chomping at the bit to start working with the Zynq UltraScale+ MPSoC? How about a low-cost starter kit from Avnet called UltraZed? The $895 kit includes the UltraZed-EG SOM with a Zynq UltraScale+ ZU3EG MPSoC, an I/O carrier card with a large assortment of breakout headers, a power supply, and miscellaneous accessories. Here’s a photo of the assembled kit with the UltraZed-EG SOM and the I/O carrier card:
Avnet UltraZed Starter Kit featuring the Xilinx Zynq UltraScale+ ZU3EG MPSoC
The Zynq UltraScale+ ZU3EG MPSoC used in this kit incorporates a tremendous amount of processing power including four 64-bit ARM Cortex-A53 processors, two ARM Cortex-R5 processors, and a dual-core ARM Mali-400 GPU. There’s also the immense processing power of the high-performance UltraScale+ programmable logic on the device—this is a 16nm FinFET device, after all—which you can harness as an accelerator for nearly any video-processing, graphics, DSP, networking, or communications task.
According to the Avnet site, the UltraZed Starter Kit is in stock now. Go get one.
Do you have a big job to do? How about a terabit router bristling with optical interconnect? Maybe you need a DSP monster for phased-array radar or sonar. Beamforming for advanced 5G applications using MIMO antennas? Some other high-performance application with mind-blowing processing and I/O requirements?
You need to look at Xilinx Virtex UltraScale+ FPGAs with their massive data-flow and routing capabilities, massive memory bandwidth, and massive I/O bandwidth. These attributes sweep away design challenges caused by performance limits of lesser devices.
Now you can quickly get your hands on a Virtex UltraScale+ Eval Kit so you can immediately start that challenging design work. The new eval kit is the Xilinx VCU118 with an on-board Virtex UltraScale+ VU9P FPGA. Here’s a photo of the board included with the kit:
Xilinx VCU118 Eval Board with Virtex UltraScale+ VU9P FPGA
The VCU118 eval kit’s capabilities spring from the cornucopia of on-chip resources provided by the Virtex UltraScale+ VU9P FPGA including:
If you can’t build what you need with the VCU118’s on-board Virtex UltraScale+ VU9P FPGA—and it’s sort of hard to believe that’s even possible—just remember, there are even larger parts in the Virtex UltraScale+ FPGA family.
Basler’s PowerPacks for Embedded Vision give you everything you need to develop smarter vision applications, including:
Here’s a short, 2.5-minute video describing the Basler dart camera module, which is “about the same size” as a postage stamp:
You can use this plug-and-play kit to develop a variety of embedded vision designs for applications such as mobile inspection and the Industrial Internet of Things (IIoT).
Note the use of a Zynq-based SOM in these Basler kits. There’s a reason for that. If you need real-time video processing and flexible interfacing both for the camera and for your final product, then the Xilinx Zynq Z-7000 SoC series is something you should be considering because of the Zynq family’s unmatched high-speed processing and interfacing capabilities. Need more processing power for really big vision-processing systems? No problem. Take a look at the Zynq UltraScale+ MPSoC device family.
Note: For more information about Basler’s lean BCON LVDS interface for video applications and pylon software for embedded-vision applications, click here. For an excellent comparison of USB 3.0 versus BCON for LVDS interfaces used in embedded-vision applications, click here to see Basler’s 5-page note on the topic.