Thirty years ago, my friends and co-workers Jim Reyer and KB and I would drive to downtown Denver for a long lunch at a dive Mexican bar officially known as “The Brewery Bar II.” But the guy who owned it, the guy who was always perched on a stool inside the door to meet and seat customers, was named Abe Schur, so we called these trips “Abe’s runs.” This week, I found myself in downtown Denver again at the SC17 supercomputer conference at the Colorado Convention Center. The Brewery Bar II is still in business and only 15 blocks away from the convention center, so on a fine, sunny day, I set out on foot for one more Abe’s run.
I arrived about 45 minutes later.
I walked in the front door and 30 years instantly evaporated. I couldn’t believe it but the place didn’t look any different. The same rickety tables. The same neon signs on the wall. The same bar. The same weird red, flocked wallpaper. It was all the same except my friends weren’t there with me and Abe wasn’t sitting on a stool. I’d already known that he’d passed away many years ago.
Also the same was the crowded state of the place at lunch time. The waitress (they don’t have servers at Abe’s) told me there were no tables available but I could eat at the bar. I took a place at the end of the bar and sat next to a guy typing on a laptop. That wasn’t the same as it was 30 years ago.
The bartender came up and asked me what I wanted to drink. I said I’d not been in for more than 25 years and asked if they still served “Tinys.” A Tiny is Abe’s-speak for a large beer. He said “Of course,” so I ordered a Tiny iced tea. (Not the Long Island variety.)
Then he asked me what I wanted to eat. There’s only one response for that at Abe’s and since they still understood what a Tiny was, I answered without ever touching a menu: “One special relleno, green, with sour cream as a neutron moderator.” He asked me if I wanted the green chile hot, mild, or half and half. Thirty years ago, I’d have ordered hot. My digestive system now has three more decades’ worth of mileage on it, so I ordered half and half. Good thing. The chile’s hotness still registered a 6 or 7 on the Abe’s 1-to-10 scale.
After I ordered, the guy with the laptop next to me said “The rellenos are still as good as they were 25 years ago.” Indeed, that’s what he was eating. Abe’s hot rellenos had broken the ice, and so we started talking. The laptop guy’s name was Scott and he maintains cellular antenna installations on towers and buildings. His company owns a lot of cell tower sites in the Denver area.
Scott is very familiar with the changes taking place in cellular infrastructure and cell-site ownership, particularly with the imminent arrival of the latest 5G gear. He told me that the electronics is migrating up the towers to be as near the antennas as possible. “All that goes up there now is 48V power and a fiber,” he said. Scott is also familiar with the migration of the electronics directly into the antennas.
It turns out that Scott is also a ham radio operator, so we talked about equipment. He’s familiar with and has repaired just about everything that’s been on the market going back to tube-based gear but he was especially impressed with the new all-digital Icom rig he now uses most of the time. Scott’s not an engineer, but hams know a ton about electronics, so we started discussing all sorts of things. He’s especially interested in the newer LDMOS power FETs. So much so that he’s lost interest in using high-voltage transmitter tubes. "Why mess with high voltage when I can get just as far with 50V?" he mused.
I was wearing my Xilinx shirt from the SC17 conference, so I took the opportunity to start talking about the very relevant Xilinx Zynq UltraScale+ RFSoC, which is finding its way into a lot of 5G infrastructure equipment. Scott hadn’t heard about it, which really isn’t surprising considering how new it is, but after I described it he said he looked forward to maybe finding one in his next ham rig.
The special relleno, green with sour cream, arrived and one bite immediately took me back three decades again. The taste had not changed one morsel. Scott and I continued to talk for an hour. Sadly, the relleno didn’t last nearly that long.
Scott and I left Abe's together. He got into his truck and I started the 15-block walk back to the convention center. The conversation and the food formed one of those really remarkable time bubbles you sometimes stumble into—and always at Abe’s.
During a gala black-tie ceremony held on November 15 at The Brewery in central London, the Xilinx Zynq UltraScale+ RFSoC won the top spot among the six competitors in the IET Innovation Awards’ Communications category, although you might not figure that out from the citation on the E&T Magazine Web site:
Note: E&T is the IET's award-winning monthly magazine and associated website.
Xilinx’s Giles Peckham (center) accepts the IET Innovation Award in the Communications Category for the
Zynq UltraScale+ RFSoC from Professor Will Stewart (IET Communications Policy Panel, on left) and
Rick Edwards (Awards emcee, television presenter, and writer/comic, on right). Photo courtesy of IET.
Classifying the Xilinx Zynq UltraScale+ RFSoC device family, with its integrated multi-gigasample/sec RF ADCs and DACs, soft-decision forward error correction (SD-FEC) IP blocks, UltraScale architecture programmable logic fabric, and Arm Cortex-A53/Cortex-R5 multi-core processing subsystem, as an “antenna interface device,” or even a “Massive-MIMO Antenna Interface” device, sells the RFSoC short in my opinion. Thanks to its extremely high integration level, the Zynq UltraScale+ RFSoC is a category killer for the many, many applications that need “high-speed analog in, high-speed analog out, digital processing in the middle” capabilities, though it most assuredly will also reduce the size, power, and complexity of traditional antenna structures, as cited in the IET Innovation Awards literature. As this award suggests, there’s simply no other device like the Zynq UltraScale+ RFSoC on the market. (If you drill down to here on the IET Innovation Awards Web page, you’ll find that the Zynq UltraScale+ RFSoC was indeed Xilinx’s entry in the Communications category this year.)
Zynq UltraScale+ RFSoC Conceptual Diagram
The UK-based IET is one of the world’s largest engineering institutions with more than 168,000 members in 150 countries and so winning one of the IET’s annual Innovation Awards is an honor not to be taken lightly. This year, the Communications category of the IET Innovation Awards was sponsored by GCHQ (Government Communications Headquarters), the UK’s intelligence and security organization responsible for providing signals intelligence and information assurance to the UK’s government and armed forces.
For more information about the IET Innovation Awards and to see all of the various categories, click here for an animated brochure.
For more information about the Zynq UltraScale+ RFSoC, see:
According to an announcement released today:
“Xilinx, Inc. (XLNX) and Huawei Technologies Co., Ltd. today jointly announced the North American debut of the Huawei FPGA Accelerated Cloud Server (FACS) platform at SC17. Powered by Xilinx high performance Virtex UltraScale+ FPGAs, the FACS platform is differentiated in the marketplace today.
“Launched at the Huawei Connect 2017 event, the Huawei Cloud provides FACS FP1 instances as part of its Elastic Compute Service. These instances enable users to develop, deploy, and publish new FPGA-based services and applications through easy-to-use development kits and cloud-based EDA verification services. Both expert hardware developers and high-level language users benefit from FP1 tailored instances suited to each development flow.
"...The FP1 demonstrations feature Xilinx technology which provides a 10-100x speed-up for compute intensive cloud applications such as data analytics, genomics, video processing, and machine learning. Huawei FP1 instances are equipped with up to eight Virtex UltraScale+ VU9P FPGAs and can be configured in a 300G mesh topology optimized for performance at scale."
Huawei’s FP1 FPGA accelerated cloud service is available on the Huawei Public Cloud today. To register for the public beta, click here.
One of the several demos in the Xilinx booth during this week’s SC17 conference in Denver was a working demo of the CCIX (Cache Coherent Interconnection for Accelerators) protocol, which simplifies the design of offload accelerators for hyperscale data centers by providing low-latency, high-bandwidth, fully coherent access to server memory. The demo shows L2 switching acceleration using an FPGA to offload a host processor. The CCIX protocol manages a hardware cache in the FPGA, which is coherently linked to the host processor’s memory. Cache updates take place in the background without software intervention through the CCIX protocol. If cache entries are invalidated in the host memory, the CCIX protocol automatically invalidates the corresponding cache entries in the FPGA’s memory.
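The coherency behavior described above can be sketched in software terms. This is only a toy Python model, not real CCIX hardware: the class names and the explicit snoop callback are illustrative stand-ins for work the protocol performs in hardware, in the background, with no software intervention.

```python
class CoherentCache:
    """Stand-in for the FPGA-side hardware cache managed by CCIX."""
    def __init__(self):
        self.lines = {}            # address -> cached value

    def snoop_invalidate(self, addr):
        # In real CCIX hardware this invalidation happens automatically;
        # here we model it as a callback driven by the "host".
        self.lines.pop(addr, None)

    def read(self, addr, memory):
        if addr not in self.lines:           # miss: re-fetch coherently
            self.lines[addr] = memory[addr]
        return self.lines[addr]


class HostMemory(dict):
    """Host memory that notifies attached caches on every write."""
    def __init__(self, *caches):
        super().__init__()
        self._caches = caches

    def write(self, addr, value):
        self[addr] = value
        for cache in self._caches:           # coherency traffic
            cache.snoop_invalidate(addr)


fpga = CoherentCache()
mem = HostMemory(fpga)
mem.write(0x1000, "flow-table-v1")
assert fpga.read(0x1000, mem) == "flow-table-v1"

mem.write(0x1000, "flow-table-v2")   # host update invalidates the FPGA copy
assert 0x1000 not in fpga.lines      # stale line is gone, no software needed
assert fpga.read(0x1000, mem) == "flow-table-v2"
```

The point the demo makes is exactly the last three lines: the accelerator never runs invalidation code of its own; it simply re-reads and always sees the current host data.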
Senior Staff Design Engineer Sunita Jain gave Xcell Daily a 3-minute explanation of the demo, which shows a 4.5x improvement in packet transfers using CCIX versus software-controlled transfers:
There’s one thing to note about this demo. Although the CCIX standard calls for using the PCIe protocol as a transport layer running at 25Gbps/lane, which is faster than PCIe Gen4, this demo illustrates only the CCIX protocol itself and uses the significantly slower PCIe Gen1 transport layer.
For more information about the CCIX protocol as discussed in Xcell Daily, see:
This week at SC17 in Denver, Everspin was showing some impressive performance numbers for the MRAM-based nvNITRO NVMe Accelerator Card that the company introduced earlier this year. As discussed in a previous Xcell Daily blog post, the nvNITRO NVMe Accelerator Card is based on the company’s non-volatile ST-MRAM chips and a Xilinx Kintex UltraScale KU060 FPGA implements the MRAM controller and the board’s PCIe Gen3 x8 host interface. (See “Everspin’s new MRAM-based nvNITRO NVMe card delivers Optane-crushing 1.46 million IOPS (4Kbyte, mixed 70/30 read/write).”)
The target application of interest at SC17 was high-frequency trading, where every microsecond shaved off system response times adds dollars directly to the bottom line, so the ROI on a product like the nvNITRO NVMe Accelerator Card, which cuts transaction times, is easy to calculate.
Everspin MRAM-based nvNITRO NVMe Accelerator Card
It turns out that a common thread, and one of the bottlenecks, in high-frequency trading applications is the use of the Apache Log4j event-logging utility. Incoming packets arrive at a variable rate (the traffic is bursty), and the Log4j logging utility needs to keep up with the highest possible burst rates to ensure that every event is logged. Piping these events directly into SSD storage sets a low limit on the burst rate that a system can handle. Inserting an nvNITRO NVMe Accelerator Card as a circular buffer in series with the incoming event stream as shown below boosts Log4j performance by 9x.
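The buffering arrangement can be sketched with a few lines of Python. This is a toy model with invented capacity and event names, not Everspin's implementation: a fast in-series buffer absorbs bursts while a slower sink (the SSD-backed logger) drains at its own pace.

```python
from collections import deque


class BurstBuffer:
    """Toy circular buffer placed in front of a slow log sink,
    analogous to the nvNITRO arrangement described above."""
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0          # events lost when a burst overflows

    def ingest(self, event):
        # Fast path: bursts land here at line rate.
        if len(self.buf) >= self.capacity:
            self.dropped += 1     # burst exceeded the buffer depth
        else:
            self.buf.append(event)

    def drain(self, n):
        """Slow sink (e.g. an SSD-backed Log4j appender) pulls n events."""
        out = []
        for _ in range(min(n, len(self.buf))):
            out.append(self.buf.popleft())
        return out


bb = BurstBuffer(capacity=1000)
for i in range(800):              # burst arrives faster than the sink drains
    bb.ingest(f"event-{i}")
logged = bb.drain(100)            # sink catches up during quiet periods
assert bb.dropped == 0 and len(logged) == 100 and len(bb.buf) == 700
```

The deeper and faster the buffer, the larger the burst that survives intact, which is why putting non-volatile MRAM in this position raises the burst rate the whole logging chain can sustain.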
Proof of efficacy appears in the chart below, which shows the much lower latency and much better determinism provided by the nvNITRO card:
One more thing of note: As you can see by one of the labels on the board in the photo above, Everspin’s nvNITRO card is now available as Smart Modular Technologies’ MRAM NVM Express Accelerator Card. Click here for more information.
Ryft is one of several companies now offering FPGA-accelerated applications based on Amazon’s AWS EC2 F1 instance. Ryft was at SC17 in Denver this week with a sophisticated, cloud-based data-analytics demo built on machine learning and deep learning. The demo classified 50,000 images from one data file using a neural network, merged the classified image files with log data from another file to create a super metadata file, and then provided fast image retrieval using many criteria including image classification, a watch-list match (“look for a gun” or “look for a truck”), and geographic location using the Google Earth database. The entire demo used geographically separated servers containing the files, working in conjunction with Amazon’s AWS Cloud. The point of the demo was to show Ryft’s ability to provide “FPGAs as a Service” (FaaS) in an easy-to-use manner with any neural network of your choice, any framework (Caffe, TensorFlow, MXNet), and a RESTful API.
This was a complex, live demo and it took Ryft’s VP of Products Bill Dentinger six minutes to walk me through the entire thing, even moving as quickly as possible. Here’s the 6-minute video of Bill giving a very clear explanation of the demo details:
Note: Ryft does a lot of work with US government agencies and as of November 15 (yesterday), Amazon’s AWS EC2 F1 instance based on Xilinx Virtex UltraScale+ FPGAs is available on GovCloud. (See “Amazon’s FPGA-accelerated AWS EC2 F1 instance now available on Amazon’s GovCloud—as of today.”)
“Amazon EC2 F1 instances are now available in the AWS GovCloud (US) region.” Amazon posted the news on its AWS Web site today and the news was announced by Amazon’s Senior Director Business Development and Product Gadi Hutt during his introductory speech at a special half-day Amazon AWS EC2 F1 instance dev lab held at SC17 in Denver the same morning. According to the Amazon Web page, “With this launch, F1 instances are now available in four AWS regions, specifically US East (N. Virginia), US West (Oregon), EU (Ireland) and AWS GovCloud (US).”
Nearly 100 developers attended the lab and listened to Hutt’s presentation along with two AWS F1 instance customers, Ryft and NGCodec. The presentations were followed by a 2-hour hands-on lab.
Amazon's Gadi Hutt presents to an AWS EC2 F1 hands-on lab at SC17
The Amazon EC2 F1 compute instance allows you to create custom hardware accelerators for your application using cloud-based server hardware that incorporates multiple Xilinx Virtex UltraScale+ VU9P FPGAs. Each Amazon EC2 F1 compute instance can include as many as eight FPGAs, so you can develop extremely large and capable custom acceleration engines with this technology. According to Amazon, the FPGA-accelerated F1 instance can accelerate applications in diverse fields such as genomics research, financial analysis, and video processing (in addition to security/cryptography and machine learning) by as much as 30x over general-purpose CPUs.
For more information about Amazon’s AWS EC2 F1 instance in Xcell Daily, see:
This week, if you were in the Xilinx booth at SC17, you would have seen demos of the new Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit (available in actively and passively cooled versions). Both versions are based on Xilinx Virtex UltraScale+ VU9P FPGAs with 64Gbytes of on-board DDR4 SDRAM.
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, actively cooled version
Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, passively cooled version
Xilinx had several VCU1525 Acceleration Development Kits running various applications at SC17. Here’s a short 90-second video from the show floor presenting two of those applications, edge-to-cloud video analytics and machine learning, narrated by Xilinx Senior Engineering Manager Khang Dao:
Note: For more information about the Xilinx Virtex UltraScale+ FPGA VCU1525 Acceleration Development Kit, contact your friendly neighborhood Xilinx or Avnet sales representative.
By Adam Taylor
Being able to see internal software variables in our Zynq-based embedded systems in real time is extremely useful during system bring-up and for debugging. I often use an RS-232 terminal during commissioning to report important information like register values from my designs and have often demonstrated that technique in previous blog posts. Information about variable values in a running system provides a measure of reassurance that the design is functioning as intended and, as we progress through the engineering lifecycle, provides verification that the system is continuing to work properly. In many cases we will develop custom test software that reports the status of key variables and processes to help prove that the design functions as intended.
This approach works well and has for decades, since the earliest days of embedded design. However, there’s now a better solution for Zynq-based systems: one that allows us to read the contents of processor memory and extract the information we need without affecting the target’s operation and without adding a single line of code to the running target. It’s called μC/Probe and it comes from Micrium, the same company that has long offered the µC/OS RTOS for a wide variety of processors including the Xilinx Zynq SoC and Zynq UltraScale+ MPSoC.
Micrium’s μC/Probe tool allows us to create custom graphical user interfaces that display the memory contents of interest in our systems designs. With this capability, we can create a virtual dashboard that provides control and monitoring of key system parameters and we can do this very simply by dragging and dropping indicator, display, and control components onto the dashboard and associating them with variables in the target memory. In this manner it is possible to both read and write memory locations using the dashboard.
When it comes to using Micrium’s μC/Probe tool with our Zynq solution, we have choices regarding interfacing:
For this first example, I am going to use a Segger J Link JTAG pod to create a simple demonstration of the capabilities. However, the second interface option proves useful for Zynq-based boards that lack a separate JTAG header and instead use a USB-to-JTAG device, or if you do not have a J Link probe. We will look at using Micrium’s μC/Probe tool with the second interface option in a later blog.
Of course, the first thing we need to do is create a test application and determine the parameters to observe. The Zynq SoC’s XADC is perfect for this because it provides quantized values of the device temperature and voltage rails. These are ideal parameters to monitor during an embedded system’s test, qualification, and validation so we will use these parameters in this blog.
The test application example will merely read these values in a continuous loop. That’s a very simple program to write (see MicroZed Chronicles 7 and 8 for more information on how to do this). To understand which variables we can monitor or interact with using μC/Probe, we need to know that the tool reads in and parses the ELF file produced by SDK to get pointers to the memory values of interest. For μC/Probe to parse the ELF properly, the ELF needs to contain debugging data in the DWARF format. That means that within SDK, we need to set the compile option -gdwarf-2 to ensure that we use the appropriate version of DWARF. Failure to use this switch will result in μC/Probe being unable to read and parse the generated ELF.
We set this compile switch in the C/C++ build settings for the application, as shown below:
Setting the correct DWARF information in Xilinx SDK
With the ELF file properly created, I made a bootable SD card image of the application and powered on the MicroZed. To access the memory, I connected the Segger J Link, which uses a Xilinx adaptor cable to mate with the MicroZed board’s JTAG connector.
MicroZed with the Segger J Link
All I needed to do now was to create the virtual dashboard. Within μC/Probe, I loaded the ELF by clicking on the ELF button in the symbol browser. Once loaded, we can see a list of all the symbols that can be used on μC/Probe’s virtual dashboard.
ELF loaded, and available symbols displayed
For this example, which monitors the Zynq SoC’s XADC internal signals including device temperature and power, I added six Numeric Indicators and one Angular Gauge Quadrant. Adding graphical elements to the data screen is very simple: find the display element you desire in the toolbox, drag it onto the data screen, and drop it in place.
Adding a graphical element to the display
To display information from the running Zynq SoC on μC/Probe’s Numeric Indicators and on the Gauge, I needed to associate each indicator and gauge with a variable in memory. We use the Symbol viewer to do this. Select the variable you want and drag it onto the display indicator as shown below.
Associating a variable with a display element
If you need to scale the display to use the full variable range or otherwise customize it, hold the mouse over the appropriate display element and select the properties editor icon on the right. The properties editor lets you scale the range, enter a simple transfer function, or increase the number of decimal places if desired.
Formatting a Display Element
Once I’d associated all the Numeric Indicators and the Gauge with appropriate variables but before I could run the project and watch the XADC values in real time, one final thing remained: I needed to inform the project how I wished to communicate with the target and select the target processor. For this example, I used Segger’s J Link probe.
Configuring the communication with the target
With this complete, I clicked “run” and captured the following video of the XADC data being captured and displayed by μC/Probe.
All of this was pretty simple and very easy to do. Of course, this short example has just scratched the surface of the capabilities of Micrium’s μC/Probe tool. It is possible to implement advanced features such as oscilloscopes, bridges to Microsoft Excel, and communication with the target using terminal windows or more advanced interfaces like USB. In the next blog, we will look at how we can use some of these advanced features to create a more in-depth and complex virtual dashboard.
I think I am going to be using Micrium’s μC/Probe tool in many blogs going forward where I want to interact with the Zynq as well.
You can find the example source code on GitHub.
Adam Taylor’s Web site is http://adiuvoengineering.com/.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
First Year E-Book here
First Year Hardback here
Second Year E-Book here
Second Year Hardback here
Mercury Systems recently announced the BuiltSAFE GS Multi-Core Renderer, which runs on the multi-core Arm Cortex-A53 processor inside Xilinx Zynq UltraScale+ MPSoCs. The BuiltSAFE GS Multi-Core Renderer, a high-performance, small-footprint OpenGL library designed to render highly complex 3D graphics in safety-critical embedded systems, is certifiable to DO-178C at the highest design assurance level (DAL-A) as well as to the highest Automotive Safety Integrity Level (ASIL D). Because it runs on the CPU, the Multi-Core Renderer’s performance scales up with more CPU cores, and it can run on Zynq UltraScale+ CG MPSoC variants that do not include the Arm Mali-400 GPU.
According to Mercury’s announcement:
“Hardware certification requirements (DO-254/ED80) present huge challenges when using a graphics-processing unit (GPU), and the BuiltSAFE GS Multi-Core Renderer is the ideal solution to this problem. It uses a deterministic, processor architecture-independent model optimized for any multicore-based platform to maximize performance and minimize power usage. All of the BuiltSAFE Graphics Libraries use industry standard OpenGL API specifications that are compatible with most new and legacy applications, but it can also be completely tailored to meet any customer requirements.”
Please contact Mercury Systems for more information about the BuiltSAFE GS Multi-Core Renderer.
The new Mellanox Innova-2 Adapter Card teams the company’s ConnectX-5 Ethernet controller with a Xilinx Kintex UltraScale+ KU15P FPGA to accelerate computing, storage, and networking in data centers. According to the announcement, “Innova-2 is based on an efficient combination of the state-of-the-art ConnectX-5 25/40/50/100Gb/s Ethernet and InfiniBand network adapter with Xilinx UltraScale FPGA accelerator.” The adapter card has a PCIe Gen4 host interface.
Mellanox’s Innova-2 PCIe Adapter Card
Key features of the card include:
Innova-2 is available in multiple, pre-programmed configurations for security applications with encryption acceleration such as IPsec or TLS/SSL. Innova-2 boosts performance by 6x for security applications while reducing total cost of ownership by 10x when compared to alternatives.
Innova-2 enables SDN and virtualized acceleration and offloads for Cloud infrastructure. The on-board programmable resources allow deep-learning training and inferencing applications to achieve faster performance and better system utilization by offloading algorithms into the card’s Kintex UltraScale+ FPGA and the ConnectX acceleration engines.
The adapter card is also available as an unprogrammed card, open for customers’ specific applications. Mellanox provides configuration and management tools to support the Innova-2 Adapter Card across Windows, Linux, and VMware distributions.
Please contact Mellanox directly for more information about the Innova-2 Adapter Card.
William Wong, Technology Editor for ElectronicDesign.com, just published an article titled “Hypervisors Step Up Security for Arm Cortex-A” and the first item he discusses is Lynx Software Technologies’ LynxSecure Separation Kernel Hypervisor running on the Xilinx Zynq UltraScale+ MPSoC. Wong writes, “The 64-bit, Arm Cortex-A ARMv8 architecture supports virtual machines (VMs), but it requires hypervisor software to deliver this functionality.”
LynxSecure 6.0, the latest version of the company’s Separation Kernel Hypervisor, was announced late last month. The initial port of this new hypervisor to the Arm architecture targets the multiple Arm Cortex-A53 processors in the Zynq UltraScale+ MPSoC. (Previous versions supported only the x86 microprocessor architecture.) LynxSecure uses the MMU, SMMU, and virtualization capabilities found on Armv8 processors to fully isolate operating systems and applications, allowing them access only to the devices allocated to them.
According to Lynx, it chose the Zynq UltraScale+ MPSoC as the first porting target for the LynxSecure hypervisor “for its broad market applicability, early customer interest, and the long-term relationship between Lynx and Xilinx.”
For more information about the LynxSecure Separation Kernel Hypervisor, see this brochure or contact Lynx Software Technologies directly.
Accolade’s new Flow-Shunting feature for its FPGA-based ANIC network adapters lets you more efficiently drive packet traffic through existing 10/40/100GE data center networks by offloading host servers. It does this by eliminating the processing and/or storage of unwanted traffic flows, as identified by the properly configured Xilinx UltraScale FPGA on the ANIC adapter. By offloading servers and reducing storage requirements, flow shunting can deliver operational cost savings throughout the data center.
The new Flow Shunting feature is a subset of the existing Flow Classification capabilities built into the FPGA-based Advanced Packet Processor in the company’s ANIC network adapters. (The company has written a technology brief explaining the capability.) Here’s a simplified diagram of what’s happening inside of the ANIC adapter:
The Advanced Packet Processor in each ANIC adapter performs a series of packet-processing functions including flow classification (outlined in red). The flow classifier inspects each packet, determines whether the packet is part of a new flow or an existing flow, and then updates the associated lookup table (LUT), which resides in a DRAM bank, with the flow classification. The LUT has room to store as many as 32 million unique IP flow entries. Each flow entry includes standard packet-header information (source/destination IP, protocol, etc.) along with flow metadata including total packet count, byte count, and the time a packet was last seen. The same flow entry tracks information about both flow directions to maintain a bi-directional context. With this information, the ANIC adapter can take specific actions on an individual flow, such as forwarding, dropping, or re-directing its packets.
These operations form the basis for flow shunting, which permits each application to decide from which flow(s) it does and does not want to receive data traffic. Intelligent, classification-based flow shunting allows an application to greatly reduce the amount of data it must analyze or handle, which frees up server CPU resources for more pressing tasks.
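The classification-and-shunting logic described above can be sketched in Python. This is an illustrative model only, with invented field names, not Accolade's hardware design: the key point it shows is how sorting the endpoints makes both directions of a flow hash to one bi-directional entry carrying counters and a shunting action.

```python
ACTIONS = {"forward", "drop", "redirect"}


def flow_key(src_ip, dst_ip, proto, sport, dport):
    # Sort the two endpoints so packets in either direction hash to the
    # same entry, giving the bi-directional context mentioned above.
    a, b = (src_ip, sport), (dst_ip, dport)
    return (proto,) + (a + b if a <= b else b + a)


class FlowTable:
    """Toy stand-in for the DRAM-resident flow LUT."""
    def __init__(self):
        self.entries = {}

    def classify(self, pkt, now):
        key = flow_key(pkt["src"], pkt["dst"], pkt["proto"],
                       pkt["sport"], pkt["dport"])
        e = self.entries.setdefault(
            key, {"packets": 0, "bytes": 0, "last_seen": None,
                  "action": "forward"})
        e["packets"] += 1               # flow metadata updates
        e["bytes"] += pkt["len"]
        e["last_seen"] = now
        return e["action"]              # drives forward/drop/redirect

    def shunt(self, pkt, action):
        """The application decides what to do with this flow."""
        assert action in ACTIONS
        key = flow_key(pkt["src"], pkt["dst"], pkt["proto"],
                       pkt["sport"], pkt["dport"])
        self.entries[key]["action"] = action


ft = FlowTable()
up = {"src": "10.0.0.1", "dst": "10.0.0.2", "proto": 6,
      "sport": 4000, "dport": 443, "len": 1500}
dn = {"src": "10.0.0.2", "dst": "10.0.0.1", "proto": 6,
      "sport": 443, "dport": 4000, "len": 64}
ft.classify(up, now=1.0)
ft.classify(dn, now=1.1)             # reverse direction, same entry
assert len(ft.entries) == 1          # bi-directional context maintained
ft.shunt(up, "drop")                 # shunt: stop delivering this flow
assert ft.classify(up, now=1.2) == "drop"
```

Once a flow is marked "drop," every subsequent packet of that flow is discarded in the adapter, which is precisely how shunting spares the host CPU from touching unwanted traffic.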
For more information about Accolade’s UltraScale-based ANIC network adapters, see “Accolade 3rd-gen, dual-port, 100G PCIe Packet Capture Adapter employs UltraScale FPGAs to classify 32M unique flows at once.”
Today, Microsoft, Mocana, Infineon, Avnet, and Xilinx jointly introduced a highly integrated, high-assurance IIoT (industrial IoT) system based on the Microsoft Azure Cloud and Microsoft’s Azure IoT Device SDK and Azure IoT Edge runtime package, Mocana’s IoT Security Platform, Infineon’s OPTIGA TPM (Trusted Platform Module) 2.0 security cryptocontroller chip, and the Avnet UltraZed-EG SOM based on the Xilinx Zynq UltraScale+ EG MPSoC.
The Mocana IoT Security Platform stack looks like this:
Mocana IoT Security Platform stack
Here’s a photo of the dev board that combines all of these elements:
The Avnet UltraZed-EG SOM appears in the lower left and the Infineon OPTIGA TPM 2.0 security chip resides on a Pmod carrier plugged into the top of the board.
If you’re interested in learning more about this highly integrated IIoT hardware/software solution, click here.
The November/December 2017 issue of the ARRL’s QEX magazine carries an article written by Stefan Scholl (DC9ST) titled “The Panoradio: A Modern Software Defined Radio with Direct Sampling.” This article describes the implementation of an open-source software-defined radio (SDR) based on an Avnet Zedboard—which in turn is based on a Xilinx Zynq Z-7020 SoC—and an Analog Devices AD9467-FMC-250EBZ board based on the 16-bit, 250Msamples/sec AD9467 ADC.
Stefan Scholl’s Panoradio SDR is based on a Zedboard (the green board on the left, with a Zynq Z-7020 SoC) and an AD9467-FMC-250EBZ ADC board (the blue board on the right)
The Panoradio’s features include:
Beyond the comprehensive design, Scholl’s article contains one of the most concise arguments for the adoption of SDRs that I’ve seen:
“The extensive use of digital signal processing has many advantages over analog circuits: Analog processing is often limited by the laws of physics, that can hardly be overcome. Digital processing is limited only by circuit complexity—a better performance (sensitivity, dynamic range, spurs, agility, etc.) is achieved by more complex calculations and larger bit widths. Since semiconductor technology has continuously advanced following Moore’s Law, very complex systems can be built today and it is possible to achieve extraordinary accuracy and performance for digital signals with comparatively little effort.”
Scholl then lists numerous SDR advantages:
The Panoradio’s DSP section consumes less than 50% of the Zynq Z-7020 SoC’s PL (programmable logic) resources. As Scholl writes: “This quite low utilization would allow for even more complex DSP.” Note that this is the resource utilization in a low-end Zynq SoC. There are several Zynq SoC family members with significantly more PL resources available.
At the end of the article, Scholl writes: “Interestingly, the bottleneck is not the FFTs or the communication with the IP cores, but the drawing routines for the waterfall plots. The Zynq does not have any graphics acceleration core, which could speed up the drawing process. However, Xilinx has realized this bottleneck and included graphics acceleration in the Zynq successors: the UltraScale MPSoC, and the Zynq UltraScale+.”
(Actually, there’s just one device family here, the Zynq UltraScale+ MPSoC, with Mali-400 GPUs in the EG and EV variants, but Scholl will no doubt be even more interested in the new Zynq UltraScale+ RFSoCs, which incorporate 4Gsamples/sec ADCs and 6.4Gsamples/sec DACs—perfect for SDR applications.)
For more information on the Zynq UltraScale+ RFSoC, see:
By Adam Taylor
So far, all of my image-processing examples have used only one sensor and produced one video stream within the Zynq SoC or Zynq UltraScale+ MPSoC PL (programmable logic). However, if we want to work with multiple sensors or overlay information like telemetry on a video frame, we need to do some video mixing.
Video mixing merges several different video streams together to create one output stream. In our designs we can use this merged video stream in several ways:
To do this within our Zynq SoC or Zynq UltraScale+ MPSoC system, we use the Video Mixer IP core, which comes with the Vivado IP library. This IP core mixes as many as eight image streams plus a final logo layer. The image streams are provided to the core via AXI Streaming or AXI memory-mapped inputs. You can select which one on a stream-by-stream basis. The IP Core’s merged-video output uses an AXI Stream.
To demonstrate how we can use the video mixer, I am going to update the MiniZed FLIR Lepton project to use the 10-inch touch display and merge in a second video stream generated by a TPG (test pattern generator). Using the 10-inch touch display gives me a larger screen to demonstrate the concept. This screen has been sitting in my office for a while now, so it’s time it became useful.
Upgrading to the 10-inch display is easy. All we need to do in the Vivado design is increase the pixel clock frequency (fabric clock 2) from 33.33MHz to 71.1MHz. Along with adjusting the clock frequency, we also need to set the ALI3 controller block to 71.1MHz.
Now include a video mixer within the MiniZed Vivado design. Enable layer one and select a streaming interface with global alpha control enabled. Enabling a layer’s global alpha control allows the video mixer to blend that layer on a pixel-by-pixel basis, merging pixels according to the defined alpha value rather than simply overriding the pixels on the layer beneath. The alpha value for each layer ranges from 0 (transparent) to 1 (opaque) and is defined within an 8-bit register.
Insertion of the Video Mixer and Video Test Pattern Generator
Enabling layer 1, for AXI streaming and Global Alpha Blending
The FLIR camera provides the first image stream. However, we need a second image stream for this example, so we’ll instantiate a video TPG core and connect its output to the video mixer’s layer 1 input. For the video mixer and test pattern generator, be sure to use the high-speed video clock used in the image-processing chain. Build the design and export it to SDK.
In SDK, we use the API defined in xv_mix.h to configure the video mixer. This API provides the functions needed to control it.
The principle of the mixer is simple. There is a master layer and you declare the vertical and horizontal size of this layer using the API. For this example using the 10-inch display, we set the size to 1280 pixels by 800 lines. We can then fill this image space using the layers, either tiling or overlapping them as desired for our application.
Each layer has an alpha register to control blending along with X and Y origin registers and height and width registers. These registers tell the mixer how it should create the final image. Positional location for a layer that does not fill the entire display area is referenced from the top left of the display. Here’s an illustration:
Video Mixing Layers, concept. Layer 7 is a reduced-size image in this example.
To demonstrate the effects of layering in action, I used the test pattern generator to create a 200x200-pixel checkerboard pattern with the video mixer’s TPG layer alpha set to opaque so that it overrides the FLIR Image. Here’s what that looks like:
Test Pattern FLIR & Test Pattern Generator Layers merged, test pattern has higher alpha.
Then I set the alpha to a lower value, enabling merging of the two layers:
Test Pattern FLIR & Generator Layers merged, test pattern alpha lower.
We can also use the video mixer to tile images as shown below. I added three more TPGs to create this image.
Four tiled video streams using the mixer
The video mixer is a good tool to have in our toolbox when creating image-processing or display solutions. It is very useful if we want to merge the outputs of multiple cameras working in different parts of the electromagnetic spectrum. We’ll look at this sort of thing in future blogs.
You can find the example source code on GitHub.
Adam Taylor’s Web site is http://adiuvoengineering.com/.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
First Year E Book here
First Year Hardback here.
Second Year E Book here
Second Year Hardback here
I’ve written several times about Amazon’s AWS EC2 F1 instance, a cloud-based acceleration service based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”) The VINEYARD Project, a pan-European effort to significantly increase the performance and energy efficiency of data centers by leveraging the advantages of hardware accelerators, is using Amazon’s EC2 F1 instance to develop Apache Spark accelerators. VINEYARD project coordinator Christoforos Kachris of ICCS/NTU gave a presentation on "Hardware Acceleration of Apache Spark on Energy Efficient FPGAs" at SPARK Summit 2017 and a video of his presentation appears below.
Kachris’ presentation details experiments on accelerating machine-learning (ML) applications running on the Apache Spark cluster-computing framework by developing hardware-accelerated IP. The central idea is to create ML libraries that can be seamlessly invoked by programs simply by calling the appropriate library. No other program changes are needed to get the benefit of hardware acceleration. Raw data passes from a Spark Worker through a pipe, a Python API, and a C API to the FPGA acceleration IP and returns to the Spark Worker over a similar, reverse path.
The VINEYARD development team first prototyped their idea by creating a small model of the AWS EC2 F1 cloud-based system using four Digilent PYNQ-Z1 dev boards networked together via Ethernet and the Python-based, open-source PYNQ software development environment. Digilent’s PYNQ-Z1 dev boards are based on Xilinx Zynq Z-7020 SoCs. Even this small prototype doubled performance relative to a Xeon server.
Having proved the concept, the VINEYARD development team scaled up to the AWS EC2 F1 and achieved a 3x to 10x performance improvement (cost normalized against an AWS instance with non-accelerated servers).
Here’s the 26-minute video presentation:
The smart factories envisioned by Industrie 4.0 are not in our future. They’re here now, and they rely heavily on networked equipment. Everything needs to be networked. In the past, industrial and factory applications relied on an entire zoo of different and incompatible networking “standards,” and some of these were actually standards. Today, the networking standard for Industrie 4.0 and IIoT applications is pretty well understood to be Ethernet—and that’s a problem because Ethernet is not deterministic. In the world of factory automation, non-deterministic networks are a “bad thing.”
Enter TSN (time-sensitive networking).
TSN is a set of IEEE 802 substandards—extensions to Ethernet—that enable deterministic Ethernet communications. Xilinx’s Michael Zapke recently published an article about TSN in the EBV Blog titled “Time-Sensitive Networking (TSN): Converging networks for Industry 4.0.” After discussing TSN briefly, Zapke’s article veers to practical matters and discusses implementing TSN protocols within the Xilinx Zynq UltraScale+ MPSoC’s PS (Processing System) and PL (Programmable Logic) using Xilinx’s 1G/100M TSN Subsystem LogiCORE IP, because you need both hardware and software to make TSN protocols work correctly.
Note: If you’re attending the SPS IPC Drives trade fair late this month in Nuremberg, you can see this IP in action.
Green Hills recently announced availability of its safety-critical, secure INTEGRITY RTOS for Xilinx Zynq UltraScale+ MPSoCs. The INTEGRITY RTOS has an 18-year history of use in safety-critical avionics, industrial, medical, and automotive applications. The RTOS uses the Zynq UltraScale+ MPSoC’s hardware memory protection to isolate and protect embedded applications and it creates secure partitions that guarantee the resources each task needs to run correctly.
The company’s comprehensive announcement covers the release of:
Please contact Green Hills directly for more information about the INTEGRITY RTOS and associated Green Hills products.
Today’s blog post from Adam Taylor, “Adam Taylor’s MicroZed Chronicles Part 222, UltraZed Edition Part 20: Zynq Watchdogs,” discusses the three watchdogs incorporated into the Zynq UltraScale+ MPSoC’s PS (processing system). Adam does a terrific job of describing the watchdog capabilities in the Zynq UltraScale+ MPSoC and how you can use them to bulletproof your firmware and software against crashes. Meanwhile, I think a bit more background on watchdogs in general might be of help.
In my opinion, the best general article about watchdogs and their uses is “Great Watchdog Timers For Embedded Systems,” by Jack Ganssle. Jack wrote this article in 2011 and updated it in 2016. The article starts with a disaster. The Clementine spacecraft, launched in 1994, successfully mapped the moon and then was lost en route to near-Earth asteroid Geographos. Evidence suggests that the spacecraft’s dual-redundant processor’s software threw a floating-point exception, locked up, and essentially exhausted the fuel for the spacecraft’s maneuvering thrusters while putting the spacecraft into an 80 RPM spin from which it could not be recovered. Although the spacecraft’s 1750 microprocessor had a hardware watchdog, it was not used.
Clementine spacecraft. Photo credit: NASA/JPL
You do not want this sort of thing happening to your design.
Later in the article, Jack writes, “The best watchdog is one that doesn't rely on the processor or its software. It's external to the CPU, shares no resources, and is utterly simple, thus devoid of latent defects.”
That describes the three hardware watchdogs in the Zynq UltraScale+ MPSoC.
By Adam Taylor
As engineers, we cannot assume the systems we design will operate as intended 100% of the time. Unexpected events and failures do occur. Depending upon the application, the protections implemented against these unexpected failures vary. For example, a safety-critical system used for an industrial or automotive application requires considerably more failure-mode analysis and much better protection mechanisms than a consumer application. One simple protection mechanism that can be implemented quickly in any application is a watchdog, and this blog post describes the three watchdogs in the Zynq UltraScale+ MPSoC.
Watchdogs are intended to protect the processor against the software application crashing and becoming unresponsive. A watchdog is essentially a counter. In normal operation, the application software prevents this counter from reaching its terminal count by resetting it periodically. Should the application software crash and allow the watchdog to reach its terminal count by failing to reset the counter, the watchdog is said to have expired.
Upon expiration, the watchdog generates a processor reset or non-maskable interrupt to enable the system to recover. The watchdog is designed to protect against software failures so it must be implemented physically in silicon. It cannot be a software counter, because the same crash it guards against could halt that counter too.
Preventing the watchdog’s expiration can be complicated. We do not want crashes to be masked, leaving the watchdog unable to trigger. It is good practice for the application software to restart the watchdog in the main body of the application. Using an interrupt service routine to restart the watchdog opens the possibility that the main application has crashed but the ISR is still being serviced. In that situation, the watchdog keeps being restarted even though there is a crash to recover from.
The Zynq UltraScale+ MPSoC provides three watchdogs in its processing system (PS):
The FPD and LPD watchdogs can be configured to generate a reset, an interrupt, or both should a timeout occur. The CSU can only generate an interrupt, which can then be actioned by the APU, RPU, or the PMU. We use the PMU to manage the effects of a watchdog timeout, configuring its behavior via its global registers.
The FPD and LPD watchdogs are clocked either from an internal 100MHz clock or from an external source connected to the MIO or EMIO. The FPD and LPD watchdogs can output a reset signal via MIO or EMIO. This is helpful if we wish to alert functions in the PL that a watchdog timeout has occurred.
Each watchdog is controlled by four registers:
To ensure that writes to the Zero Mode, Counter Control, and Restart registers are intentional and not the result of an incorrect software operation, write accesses to these registers require that a specific key, different for each register, be included in the written data word for the write to take effect.
Zynq UltraScale+ MPSoC Timer and Watchdog Architecture
To include the FPD or LPD watchdogs in our design, we need to enable them in Vivado. You do so using the I/O configuration tab of the MPSoC customization dialog.
Enabling the SWDT (System Watchdog Timer) in the design
For my example in this blog post, I enabled the external resets and connected them to an ILA within the PL so that I can capture the reset signal when it’s generated.
Simple Test Design for the Zynq UltraScale+ MPSoC watchdog
To configure and use the watchdog, we use SDK and the API defined in xwdtps.h. This API allows us to configure, initialize, start, restart, and stop the selected watchdog with ease.
To use the watchdog to its fullest extent, we also need to configure the PMU to respond to the watchdog error. This is simple and requires writes to the PMU error registers (ERROR_SRST_EN_1 and ERROR_EN_1) to enable the PMU to respond to a watchdog timeout. This will also cause the PS to assert its Error OUT signal and will result in LED D3 illuminating on the UltraZed SOM when the timeout occurs.
For this example, I also used the PMU persistent global registers, which are cleared only by a power-on reset, to keep a count of watchdog events. This count increments each time a watchdog timeout occurs. After the example design has reset six times, the code finally begins to restart the watchdog and stays alive.
Because the recorded watchdog error persists across the resets it causes, we must take care to clear it after each reboot. Failure to do so results in a permanent reset cycle, because the timeout error is cleared only by a power-on reset or a software action. To prevent this from happening, we need to clear the watchdog fault indication in the PMU’s global ERROR_STATUS_1 register at the start of the program.
Reset signal being issued to the programmable logic
Running my example application, it was obvious from the terminal output that the timeout was occurring and that the number of occurrences was incrementing. The screen shot below shows the initial error status register and the error mask register. The error status register is shown twice: the first instance shows the watchdog error in the system and the second confirms that it has been cleared. The reset reason also indicates the cause of the reset; in this case it’s a PS reset.
Looping Watchdog and incrementing count
We have touched lightly on the PMU’s error-handling capabilities. We will explore these capabilities more in a future blog.
Meanwhile, you can find the example source code on GitHub.
If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.
First Year E Book here
First Year Hardback here.
Second Year E Book here
Second Year Hardback here
The Xilinx Zynq UltraScale+ MPSoC incorporates two to four 64-bit Arm Cortex-A53 processors. If you’re trying to figure out why you want those in your next design and how you will harness them, there’s a new, free, 1-hour Webinar just for you on November 2 titled “Shift Gears with a 64-bit ARM-Powered Zynq UltraScale+ MPSoC.”
Topics to be covered:
Doulos, a Xilinx Authorized Training Provider and ARM Approved Training Center, is presenting the Webinar.
It’s next week, so register here. Now.
Now that Amazon has made the FPGA-accelerated AWS EC2 F1 instance based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs generally available, the unbound imaginations of some really creative people have been set free. Case in point: the cloud-based FireSim hardware/software co-design environment and simulation platform for designing and simulating systems based on the open-source RocketChip RISC-V processor. The Computer Architecture Research Group at UC Berkeley is developing FireSim. (See “Bringing Datacenter-Scale Hardware-Software Co-design to the Cloud with FireSim and Amazon EC2 F1 Instances.”)
Here’s what FireSim looks like:
According to the AWS blog cited above, FireSim addresses several hardware/software development challenges. Here are some direct quotes from the AWS blog:
1: “FPGA-based simulations have traditionally been expensive, difficult to deploy, and difficult to reproduce. FireSim uses public-cloud infrastructure like F1, which means no upfront cost to purchase and deploy FPGAs. Developers and researchers can distribute pre-built AMIs and AFIs, as in this public demo (more details later in this post), to make experiments easy to reproduce. FireSim also automates most of the work involved in deploying an FPGA simulation, essentially enabling one-click conversion from new RTL to deploying on an FPGA cluster.”
2: “FPGA-based simulations have traditionally been difficult (and expensive) to scale. Because FireSim uses F1, users can scale out experiments by spinning up additional EC2 instances, rather than spending hundreds of thousands of dollars on large FPGA clusters.”
3: “Finding open hardware to simulate has traditionally been difficult. Finding open hardware that can run real software stacks is even harder. FireSim simulates RocketChip, an open, silicon-proven, RISC-V-based processor platform, and adds peripherals like a NIC and disk device to build up a realistic system. Processors that implement RISC-V automatically support real operating systems (such as Linux) and even support applications like Apache and Memcached. We provide a custom Buildroot-based FireSim Linux distribution that runs on our simulated nodes and includes many popular developer tools.”
4: “Writing hardware in traditional HDLs is time-consuming. Both FireSim and RocketChip use the Chisel HDL, which brings modern programming paradigms to hardware description languages. Chisel greatly simplifies the process of building large, highly parameterized hardware components.”
Using high-speed FPGA technology to simulate hardware isn’t a new idea. Using an inexpensive, cloud-based version of that same FPGA technology to develop hardware and software from your laptop while sitting in a coffee house in Coeur d’Alene, Timbuktu, or Ballarat—now that is something new.
National Instruments’ (NI’s) PXI FlexRIO modular instrumentation product line has been going strong for more than a decade and the company has just revamped its high-speed PXI analog digitizers and PXI digital signal processors by upgrading to high-performance Xilinx Kintex UltraScale FPGAs. According to NI’s press release, “With Kintex UltraScale FPGAs, the new FlexRIO architecture offers more programmable resources than previous Kintex-7-based FlexRIO modules. In addition, the new mezzanine architecture fits both the I/O module and the FPGA back end within a single, integrated 3U PXI module. For high-speed communication with other modules in the chassis, these new FlexRIO modules feature PCIe Gen 3 x8 connectivity for up to 7 GB/s of streaming bandwidth.” (Note that all Kintex UltraScale FPGAs incorporate hardened PCIe Gen1/2/3 cores.)
The change was prompted by a tectonic shift in analog converter interface technology—away from parallel LVDS converter interfaces and towards newer, high-speed serial protocols like JESD204B. As a result, NI’s engineers re-architected the PXI modules, splitting them into interface-compatible mezzanine I/O modules and a trio of back-end FPGA carrier/PCIe-interface cards—NI calls them “FPGA back ends”—based on three pin-compatible Kintex UltraScale FPGAs: the KU035, KU040, and KU060. These devices allow NI to offer three different FPGA resource levels with the same pcb design.
Modular NI PXI FlexRIO Module based on Xilinx Kintex UltraScale FPGAs
The new products in the revamped FlexRIO product line include:
Digitizer Modules – New PXI FlexRIO Digitizers offer higher-speed sample rates and wide bandwidth without compromising dynamic range. The 16-bit, 400MHz PXIe-5763 and PXIe-5764 operate at 500 Msamples/sec and 1 Gsamples/sec respectively. (Note: NI previewed these modules earlier this year at NI Week. See “NI’s new FlexRIO module based on Kintex UltraScale FPGAs serves as platform for new modular instruments.”)
Here’s a diagram showing you the flow for LabVIEW 2017’s Xilinx Vivado Project Export capability:
NI’s use of three Kintex UltraScale FPGA family members to develop these new FlexRIO products illustrates several important benefits associated with FPGA-based design.
First, NI has significantly future-proofed its modular PXIe FlexRIO design by incorporating the flexible, programmable I/O capabilities of Xilinx FPGAs. JESD204B is an increasingly common analog interface standard, easily handled by the Kintex UltraScale FPGAs. In addition, those FPGA I/O pins driving the FlexRIO Mezzanine interface connector are bulletproof, fully programmable, 16.3Gbps GTH/GTY ports that can accommodate a very wide range of high-speed interfaces—nearly anything that NI’s engineering teams might dream up in years to come.
Second, NI is able to offer three different resource levels based on three pin-compatible Kintex UltraScale FPGAs using the same pcb design. This pin compatibility is not accidental. It’s deliberate. Xilinx engineers labored to achieve this compatibility just so that system companies like NI could benefit. NI recognized and took advantage of this pre-planned compatibility. (You'll find that same attention to detail in every one of Xilinx's All Programmable device families.)
Third, NI is able to leverage the same Kintex UltraScale FPGA architecture for its high-speed Digitizer Modules and for its Coprocessor Modules, rather than using two entirely dissimilar chips—one for I/O control in the digitizers and one as the programmable computing engine in the Coprocessor Modules. The same programmable Kintex UltraScale FPGA architecture suits both applications well. The benefit here is the ability to develop common drivers and other common elements for both types of FlexRIO module.
For more information about these new NI FlexRIO products, please contact NI directly.
Note: Xcell Daily has covered the high-speed JESD204B interface repeatedly in the past. See:
Envious of all the cool FPGA-accelerated applications showing up on the Amazon AWS EC2 F1 instance like the Edico Genome DRAGEN Genome Pipeline that set a Guinness World Record last week, the DeePhi ASR (Automatic Speech Recognition) Neural Network announced yesterday, Ryft’s cloud-based search and analysis tools, or NGCodec’s RealityCodec video encoder?
Well, you can shake off that green monster by signing up for the free, live, half-day Amazon AWS EC2 F1 instance and SDAccel dev lab being held at SC17 in Denver on the morning of November 15 at The Studio Loft in the Denver Performing Arts Complex (1400 Curtis Street), just across the street from the Colorado Convention Center where SC17 is being held. Xilinx is hosting the lab and technology experts from Xilinx, Amazon Web Services, Ryft, and NGCodec will be available onsite.
Here’s the half-day agenda:
8:00 AM Doors open, Registration, and Continental Breakfast
9:00 AM Welcome, Technology Discussion, F1 Developer Use Cases and Demos
9:35 AM Break
9:45 AM Hands-on Training Begins
12:00 PM Developer Lab Concludes
A special guest speaker from Amazon Web Services is also on the agenda.
Lab instruction time includes:
Seats are necessarily limited for a lab like this, so you might want to get your request in immediately. Where? Here.
One of the several products announced at DeePhi’s event held yesterday in Beijing was the DP-S64 ASR (Automatic Speech Recognition) Acceleration Solution, a Neural Network (NN) application that runs on Amazon’s FPGA-Accelerated AWS EC2 F1 instance. The AWS EC2 F1 instances’ FPGA acceleration is based on multiple Xilinx Virtex UltraScale+ VU9P FPGAs. (See “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available.”)
For details and to apply for a free trial of DeePhi’s DP-S64, send an email to firstname.lastname@example.org.
For information about the other NN products announced yesterday by DeePhi, see “DeePhi launches vision-processing dev boards based on a Zynq SoC and Zynq UltraScale+ MPSoC, companion Neural Network (NN) dev kit.”
Yesterday, DeePhi Tech announced several new deep-learning products at an event held in Beijing. All of the products are based on DeePhi’s hardware/software co-design technologies for neural network (NN) and AI development and use deep compression and Xilinx All Programmable technology as a foundation. Central to all of these products is DeePhi’s Deep Neural Network Development Kit (DNNDK), an integrated framework that permits NN development using popular tools and libraries such as Caffe, TensorFlow, and MXNet to develop and compile code for DeePhi’s DPUs (Deep Learning Processor Units). DeePhi has developed two FPGA-based DPUs: the Aristotle Architecture for convolutional neural networks (CNNs) and the Descartes Architecture for Recurrent Neural Networks (RNNs).
DeePhi’s DNNDK Design Flow
DeePhi’s Aristotle Architecture
DeePhi’s Descartes Architecture
DeePhi’s approach to NN development using Xilinx All Programmable technology uniquely targets the company’s carefully optimized, hand-coded DPUs instantiated in programmable logic. In the new book “FPGA Frontiers” published by Next Platform Press, DeePhi’s co-founder and CEO Song Yao describes using his company’s DPUs: “The algorithm designer doesn’t need to know anything about the underlying hardware. This generates instruction instead of RTL code, which leads to compilation in 60 seconds.” The benefits are rapid development and the ability to concentrate on NN code development rather than the mechanics of FPGA compilation, synthesis, and placement and routing.
Part of yesterday’s announcement included two PCIe boards oriented towards vision processing that implement DeePhi’s Aristotle Architecture DPU. One board, based on the Xilinx Zynq Z-7020 SoC, handles real-time CNN-based video analysis including facial detection for more than 30 faces simultaneously for 1080p, 18fps video using only 2 to 4 watts. The second board, based on a Xilinx Zynq UltraScale+ ZU9 MPSoC, supports simultaneous, real-time video analysis for 16 channels of 1080p, 18fps video and draws only 30 to 60 watts.
DeePhi PCIe NN board based on a Xilinx Zynq Z-7020 SoC
DeePhi PCIe NN board based on a Xilinx Zynq UltraScale+ ZU9 MPSoC
For more information about these products, please contact DeePhi Tech directly.
Here’s a great, very short video from Samtec showing its Firefly QSFP28 copper “flyover” cable carrying four lanes of 28Gbps serial data error free over six feet of copper Twinax cable with plenty of margin. (It’s a 3-foot cable with the signals looped back at one end in a passive loop with no repeaters.)
That’s an amazing feat! The 28Gbps data streams are both transmitted from and received by the same Xilinx VCU118 Dev Kit board, which is based on a Xilinx Virtex UltraScale+ VU9P FPGA. Samtec’s video uses its FMC+ Active Loopback Card for the demo.
If you do this sort of thing on a pcb—even using a super-advanced, super-high-speed, ultra-low-loss pcb dielectric material like Panasonic’s MEGTRON6—your on-board reach is only five to eight inches. So Samtec’s Firefly system really extends your reach for these very high-speed data streams.
There are two main reasons why this high-speed signal-transmission system works:
Here’s the 2.5-minute Samtec video demo:
And here’s Samtec’s original 3-minute Firefly/UltraScale+ demo video, mentioned in the above video:
Last month, I described BrainChip’s neuromorphic BrainChip Accelerator PCIe card, a spiking neural network (SNN) based on a Xilinx Kintex UltraScale FPGA. (See “FPGA-based Neuromorphic Accelerator board recognizes objects 7x more efficiently than GPUs on GoogleNet, AlexNet.”) Now, BrainChip Holdings has announced that it has shipped a BrainChip Accelerator to “a major European automobile manufacturer” for evaluation in ADAS (Advanced Driver Assist Systems) and autonomous driving applications. According to BrainChip’s announcement, the company’s BrainChip Accelerator “provides a 7x improvement in images/second/watt, compared to traditional convolutional neural networks accelerated by Graphics Processing Units (GPUs).”
BrainChip Accelerator card with six SNNs instantiated in a Kintex UltraScale FPGA
Do you need to pack the maximum amount of embedded resources into a minimum space? Then take a look at the just-announced Aldec TySOM-3-ZU7EV Embedded Prototyping Board. Aldec’s TySOM-3-ZU7EV board jams a Xilinx Zynq UltraScale+ ZU7EV MPSoC, DDR4 SoDIMM, WiFi, Bluetooth, two FMC sites, HDMI input and output ports, a DisplayPort connection, a QSFP+ optical cage, and a Pmod connector into a dense, 100x144mm footprint. Don’t let the word “Prototyping” in the product name fool you; in my opinion, this board looks like a good production foundation for many embedded designs.
The on-board Zynq UltraScale+ ZU7EV MPSoC itself provides you with a quad-core, 64-bit ARM Cortex-A53 MPCore processor; a dual-core, 32-bit ARM Cortex-R5 MPCore processor capable of lockstep operation that’s especially useful for safety-critical designs; an ARM Mali-400 MP2 GPU; four tri-mode Ethernet ports; two USB3.0 ports; a variety of other useful peripherals; and a very big chunk of Xilinx UltraScale+ programmable logic with 1720 DSP48E2 slices and more than 30Mbits of high-speed, on-chip SRAM in three forms (distributed RAM, BRAM, and UltraRAM). That’s an immense amount of processing capability—ample for a wide variety of embedded designs.
The board also includes a SoDIMM socket capable of holding 4Gbytes of DDR4 SDRAM, a MicroSD card socket capable of holding SD cards with as much as 32Gbytes of Flash memory, 256Mbits of on-board QSPI Flash memory, and 64Kbits of EEPROM.
Here’s a photo of this packed board:
Aldec TySOM-3-ZU7EV Embedded Prototyping Board
Here’s a diagram showing you the board’s diminutive dimensions:
Aldec TySOM-3-ZU7EV Embedded Prototyping Board Dimension Diagram
And finally, here’s the TySOM-3-ZU7EV board’s block diagram:
Aldec TySOM-3-ZU7EV Embedded Prototyping Board Block Diagram
If you’re visiting Arm TechCon this week in the Santa Clara Convention Center, you’ll find Aldec in the Expo Hall.