
Xilinx has announced at HUAWEI CONNECT 2017 that Huawei’s new, accelerated cloud service and its FPGA Accelerated Cloud Server (FACS) are based on Xilinx Virtex UltraScale+ VU9P FPGAs. The Huawei FACS platform allows users to develop, deploy, and publish new FPGA-based services and applications on the Huawei Public Cloud with a 10-50x speed-up for compute-intensive cloud applications such as machine learning, data analytics, and video processing. Huawei has more than 15 years of experience in the development of FPGA systems for telecom and data center markets. "The Huawei FACS is a fully integrated hardware and software platform offering developer-to-deployment support with best-in-class industry tool chains and access to Huawei's significant FPGA engineering expertise," said Steve Langridge, Director, Central Hardware Institute, Huawei Canada Research Center.

 

The FPGA Accelerated Cloud Server is available on the Huawei Public Cloud today. To register for the public beta, please visit http://www.hwclouds.com/product/fcs.html. For more information on the Huawei Cloud, please visit www.huaweicloud.com.

 

 

For more information, see this page.

 

 

 

Yesterday, Premier Farnell announced that it has added the Xilinx All Programmable device product line including Zynq SoCs, Zynq UltraScale+ MPSoCs, and FPGAs to its line card. That means Xilinx All Programmable devices are available from Farnell element14 in Europe, Newark element14 in North America, and element14 in APAC. Premier Farnell is a business unit of Avnet, Inc.

 

 

Adam Taylor’s MicroZed Chronicles, Part 214: Addressing VDMA Issues

by Xilinx Employee, 09-05-2017

 

By Adam Taylor

 

 

Video Direct Memory Access (VDMA) is one of the key IP blocks used within many image-processing applications. It allows frames to be moved between the Zynq SoC’s and Zynq UltraScale+ MPSoC’s PS and PL with ease. Once the frame is within the PS domain, we have several processing options available. We can implement high-level image processing algorithms using open-source libraries such as OpenCV and acceleration stacks such as the Xilinx reVISION stack if we wish to process images at the edge. Alternatively, we can transmit frames over Gigabit Ethernet, USB3, PCIe, etc. for offline storage or later analysis.

 

It can be infuriating when our VDMA-based image-processing chain does not work as intended. Therefore, we are going to look at a simple VDMA example and the steps we can take to ensure that it works as desired.

 

The simple VDMA example shown below contains the basic elements needed to provide VDMA output to a display. The processing chain starts with a VDMA read that obtains the current frame from DDR memory. To correctly size the data stream width, we use an AXIS subset converter to convert 32-bit data read from DDR memory into a 24-bit format that represents each RGB pixel with 8 bits. Finally, we output the image with an AXIS-to-video output block that converts the AXIS stream to parallel video with video data and sync signals, using timing provided by the Video Timing Controller (VTC). We can use this parallel video output to drive a VGA, HDMI, or other video display output with an appropriate PHY.

 

This example outlines a read case from the PS to the PL and corresponding output. This is a more complicated case than performing a frame capture and VDMA write because we need to synchronize video timing to generate an output.

 

 

 

Image1.jpg

 

 

Simple VDMA-Based Image-Processing Pipeline

 

 

 

So what steps can we take if the VDMA-based image pipeline does not function as intended? To correct the issue:

 

  1. Check Resets and Clocks as we would when debugging any application. Ensure that the reset polarity is correct for each module, as there will be mixed polarities. Ensure that the pixel clock is correct for the required video timing and that it is supplied to both the VTC and the AXIS-to-Video-Out blocks. Also make sure the clock driving the AXIS network can support the required image throughput.
  2. Check that the Clock Enables on both the VTC and AXIS-to-Video-Out blocks are tied to the correct level to enable the clocks.
  3. Check that the VTC is correctly configured, especially if you are using the AXI interface to define the configuration through the application software. When configuring the VTC using AXI, it is important to make sure we have set the source registers to the VTC generator, enabled register updates, and defined the timing parameters required.
  4. Check the connections between the VTC and AXIS-to-Video-Out Blocks. Ensure that the horizontal and vertical blanking signals are also connected along with the horizontal and vertical syncs.
  5. Check the AXIS-to-Video-Out timing mode. If we are using VDMA, the timing mode of the AXIS-to-Video-Out block should be set to master. This enables the AXIS-to-Video-Out block to assert back pressure on the AXIS data stream to halt the frame-buffer output. This mechanism permits the AXIS-to-Video-Out block to manage the flow of pixels, enabling synchronization and lock. You may also want to increase the size of the internal buffer from the default.
  6. Check that the AXIS-to-Video-Out VTC_ce signal is not connected to the VTC gen clock enable as is the case when configured for slave operation. This will prevent the AXIS-to-Video-Out block from being able to lock to the AXIS video stream.
  7. Insert ILAs. Inserting these within the design allows us to observe the detailed workings of the AXI buses. When commissioning a new image-processing pipeline, I insert ILA blocks on the VTC output and the VDMA MM-to-AXIS port so that I can observe the generated timing signals and the VDMA output stream. When observing the AXI Stream, the tuser signal identifies the start of frame and the tlast signal marks the end of line. You may also want to observe the AXIS-to-Video-Out block’s 32-bit status output, which indicates the locked status along with additional debug information.
  8. Ensure that HSize and Stride are set correctly. These are defined by the application software and configure the VDMA with frame-store information. HSize represents the horizontal size of the image and Stride represents the distance in memory between the image lines. Both HSize and Stride are defined in bytes. As such, when working with U32 or U16 types, take care to set these values to reflect the number of bytes used (see the sketch after this list).
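
To make the byte math in point 8 concrete, here is a minimal sketch assuming an 800-pixel-wide frame of u32 pixels and the standalone XAxiVdma driver (the function name is illustrative):

    #include "xaxivdma.h"

    /* Both fields of XAxiVdma_DmaSetup are byte counts, not pixel counts. */
    void set_hsize_and_stride(XAxiVdma_DmaSetup *Setup)
    {
        Setup->HoriSizeInput = 800 * sizeof(u32);  /* HSize: 3200 bytes per line */
        Setup->Stride        = 800 * sizeof(u32);  /* Stride >= HSize, also in bytes */
    }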

 

 

Hopefully, by the time you have checked these points, the issue with your VDMA-based image-processing pipeline will have been identified and you can start developing the higher-level image-processing algorithms needed for the application.

 

 

Code is available on GitHub as always.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here.

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg 

 

by Anthony Boorsma, DornerWorks

 

 

Why aren’t you getting all of the performance that you expect after moving a task or tasks from the Zynq PS (processing system) to its PL (programmable logic)? If you used SDSoC to develop your embedded design, there’s help available. Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Fine Tune Your Heterogeneous Embedded System with Emulation Tools.”

 

 

 

Thanks to Xilinx’s SDSoC Development Environment, offloading portions of your software algorithm to a Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to meet system performance requirements is straightforward. Once you have familiarized yourself with SDSoC’s data-transfer options for moving data back and forth between the PS and PL, you can select the data mover that represents the best choice for your design. SDSoC’s software estimation tool then shows you the expected performance results.
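
For example (a hedged sketch, not from the original post), SDSoC’s choice of data mover can be steered with SDS pragmas placed ahead of the accelerated function’s prototype; the function name and array sizes here are illustrative:

    /* Request contiguous copies and a simple AXI DMA data mover for the
       accelerated function's arrays. */
    #pragma SDS data copy(a[0:1024], b[0:1024], c[0:1024])
    #pragma SDS data data_mover(a:AXIDMA_SIMPLE, b:AXIDMA_SIMPLE, c:AXIDMA_SIMPLE)
    void mmult_accel(const float a[1024], const float b[1024], float c[1024]);

Note that AXIDMA_SIMPLE assumes the buffers are physically contiguous (for instance, allocated with sds_alloc()).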

 

Yet when performing the ultimate test of execution—on real silicon—the performance of your system sometimes fails to match expectations and you need to discover the cause… and the cure. Because you’ve offloaded software tasks to the PL, your existing software debugging/analysis methods do not fully apply because not all of the processing occurs in the PS.

 

You need to pinpoint the cause of the unexpected performance gap. Perhaps you made a sub-optimal choice of data mover. Perhaps the offloaded code was not a good candidate for offloading to the PL. You cannot cure the performance problem without knowing its cause.

 

Just how do you investigate and debug system performance on a Zynq-based heterogeneous embedded system with part of the code running in the PS and part in the PL?

 

If you are new to the world of debugging PL data processing, you may not be familiar with the options you have for viewing PL data flow. Fortunately, if you used SDSoC to accelerate software tasks by offloading them to the PL, there is an easy solution. SDSoC has an emulation capability for viewing the simulated operation of your PL hardware in the context of your overall system.

 

This emulation capability allows you to identify any timing issues with the data flow into or out of the auto-generated IP blocks that accelerate your offloaded software. The same capability can also show you if there is an unexpected slowdown in the offloaded software acceleration itself.

 

Using this tool can help you find performance bottlenecks. You can investigate these potential bottlenecks by watching your data flow through the hardware via the displayed emulation signal waveforms. Similarly, you can investigate the interface points by watching the data signals transfer data between the PS and the PL. This information provides key insights that help you find and fix your performance issues.

 

To demonstrate how you can debug and emulate a hardware-accelerated function, we’ll focus on one IP block: the matrix multiplier from the Xilinx Multiply and Add (MMADD) example, shown in Figure 1.

 

 

 

Image1.jpg

 

Figure 1: Multiplier IP block with Port A expanded to show its signals

 

 

 

We will look at the waveforms for the signals to and from this Mmult IP block in the emulation. Specifically, we will view the A_PORTA signals shown in the figure above. These signals represent the data input for matrix A, which corresponds to the software input parameter A to the matrix multiplier function.

 

To get started with the emulation, enable generation of the “emulation model” configuration for the build in the SDSoC project settings, as shown in Figure 2.

 

 

 

Image2.jpg 

 

 

Figure 2: The mmult Project Settings needed to enable emulation

 

 

 

Next, rebuild your project as normal. After building your project with emulation-model support enabled in the configuration, run the emulator by selecting “Start/Stop Emulation” under the “Xilinx Tools” menu option. When a window opens, select “Start” to start the emulator. SDSoC then automatically launches an instance of Xilinx Vivado, which opens the auto-generated PL project that SDSoC created for you as a subproject within your SDSoC project.

 

We specifically want to view the A_PORTA signals of the Mmult IP block. These signals must be added to the Wave Window to be viewed during a simulation. The available Mmult signals can be viewed in the Objects pane by selecting the mmult_1 block in the Scopes pane. To add the A_PORTA signals to the Wave Window, select all of the “A_*” signals in the Objects pane, right click, and select “Add to Wave Window” as shown in Figure 3.

 

 

 

Image3.jpg

 

 

Figure 3: Behavioral Simulation – mmult_1 signals highlighted

 

 

 

Now you can run the emulation and view the signal states in the waveform viewer. Start the emulator by clicking “Run All” from the “Run” drop-down menu as shown in Figure 4.

 

 

 

Image4.jpg

 

 

Figure 4: Start emulation of the PL

 

 

 

Back in SDSoC’s toolchain environment, you can now run a debugging session that connects to this emulation session as it would to your software running on the target. From the “Run” menu option, select “Debug As -> 1 Launch on Emulator (SDSoC Debugger)” to start the debug session as shown in Figure 5.

 

 

 

Image5.jpg

 

 

Figure 5: Connect Debug Session to run the PL emulation

 

 

 

Now you can step or run through your application test code and view the signals of interest in the emulator. Shown below in Figure 6 are the A_PORTA signals we highlighted earlier and their signal values at the end of the PL logic operation using the Mmult and Add example test code.

 

 

Image6.jpg

 

Figure 6: Emulated mmult_1 signal waveforms

 

 

 

These signals tell us a lot about the performance of the offloaded code now running in the PL and we used familiar emulation tools to obtain this troubleshooting information. This powerful debugging method can help illuminate unexpected behavior in your hardware-accelerated C algorithm by allowing you to peer into the black box of PL processing, thus revealing data-flow behavior that could use some fine-tuning.

 

 

Fidus Systems based the design of its Sidewinder-100 PCIe NVMe Storage Controller on a Xilinx Zynq UltraScale+ MPSoC ZU19EG for many reasons, but among the most important are: PCIe Gen3/4 capability; high-speed, bulletproof SerDes for the board’s two 100Gbps-capable QSFP optical network cages; the vast I/O flexibility inherent in Xilinx All Programmable devices to control DDR SDRAM, to drive the two SFF-8643 Mini SAS connectors for off-board SSDs, etc.; the immense processing capability of the six on-chip ARM processor cores (four 64-bit ARM Cortex-A53 MPCore processors and two 32-bit ARM Cortex-R5 MPCore processors); and the big chunk of on-chip programmable logic based on the Xilinx UltraScale architecture. The same attributes that make the Zynq UltraScale+ MPSoC a good foundation for a high-performance NVMe controller like the Sidewinder-100 also make the board an excellent development target for a truly wide variety of hardware designs—just about anything you might imagine.

 

The Sidewinder-100’s significant performance advantage over SCSI and SAS storage arrays comes from its use of NVMe Over Fabrics technology to reduce storage transaction latencies. In addition, there are two on-board M.2 connectors available for docking NVMe SSD cards. The board also accepts two DDR4 SO-DIMMs that are independently connected to the Zynq UltraScale+ MPSoC’s PS (processing system) and PL (programmable logic). That independent connection allows the PS-connected DDR4 SO-DIMM to operate at 1866Mtransfers/sec and the PL-connected DDR4 SO-DIMM to operate at 2133Mtransfers/sec.

 

All of this makes for a great PCIe Gen4 development platform, as you can see from this photo:

 

 

Fidus Sidewinder-100 NVMe Storage Controller.jpg

 

Fidus Sidewinder-100 PCIe NVMe Storage Controller

 

 

Because Fidus is a design house, it had general-purpose uses in mind for the Sidewinder-100 PCIe NVMe Storage Controller from the start. The board makes an excellent, ready-to-go development platform for any sort of high-performance PCIe Gen 3 or Gen4 development and Fidus would be happy to help you develop something else using this platform.

 

Oh, and one more thing. Tucked onto the bottom of the Sidewinder-100 PCIe NVMe Storage Controller Web page is this interesting PCIe Power and Loopback Adapter:

 

 

Fidus PCIe Power and Loopback Adapter.jpg 

 

Fidus PCIe Power and Loopback Adapter

 

 

It’s just the thing you’ll need to bring up a PCIe card on the bench without a motherboard. After all, PCIe Gen4 motherboards are scarce at the moment, and this adapter looks like it should cost a lot less than a motherboard with a big, power-hungry processor on board. Just look at that tiny DC power connector that operates the adapter!

 

 

Please contact Fidus Systems directly for more information about the Sidewinder-100 PCIe NVMe Storage Controller and the PCIe Power and Loopback Adapter.

 

 

 

 

If you’re teaching digital design (or learning it), the Digilent Nexys4-DDR FPGA Trainer Board based on the Xilinx Artix-7 A100T FPGA is a very good teaching platform. It provides ample programmable logic to work with (15,850 logic slices, 13.14Mbits of on-chip SRAM, and 240 DSP48E1 slices) along with 128Mbytes of DDR2 SDRAM and a good mix of peripherals, and it’s paired with the industry’s most advanced system design tool: Xilinx Vivado.

 

RS University and Digilent have partnered to provide academics with a free half-day workshop on teaching digital systems using FPGAs (and the Nexys4-DDR Trainer Board). The half-day workshop will take place at Coventry University on October 25, 2017 in the Engineering and Computing Building. (More info here, registration here.)

 

 

Digilent Nexys4-DDR Trainer Board.jpg 

 

 

 

Now that video has rapidly proliferated across all markets, a commitment to “Any Media over Any Network” requires another commitment: any-to-any video transcoding. That’s because the video you want is often not coded in the format you want (compression standard, bit rate, frame rate, resolution, color depth, etc.). As a result, transcoding has become a big deal, and supporting the myriad video formats already available, plus the new ones to come, is a big challenge.

 

Would you like some help? Wish granted.

 

Xilinx’s Pro AV & Broadcast Video Systems Architect Alex Luccisano is presenting two free, 1-hour Webinars on September 26 that cover video transcoding and how you can use Xilinx Zynq UltraScale+ EV MPSoCs for real-time, multi-stream video transcoding in your next design.

 

 

Click here for the 7:00 am (PST), 14:00 (GMT) Webinar on September 26.

 

Click here for the 10:00 am (PST), 17:00 (GMT) Webinar on September 26.

 

 

Avnet publishes article that serves as a Buyer’s Guide for its Zynq-based Dev Boards and SOMs

by Xilinx Employee, 08-29-2017

 

Avnet just published an article titled “Zynq SoMs Decrease Customer Development Times and Costs” that provides a brief-but-good buyer’s guide for several of its Zynq-based dev boards and SOMs including the MicroZed (based on the Xilinx Zynq Z-7010 or Z-7020 SoCs), PicoZed (based on the Zynq Z-7010, Z-7015, Z-7020, or Z-7030 SoCs), and the Mini-Module Plus (based on the Xilinx Zynq Z-7045 or Z-7100 SoCs). These three boards give you pre-integrated access to nearly the entire broad line of Zynq Z-7000 dual-ARM-core SoCs.

 

 

Avnet PicoZed SOM.jpg 

 

Avnet PicoZed SOM

 

 

The article also lists several important points to consider when contemplating a make-or-buy decision for a Zynq-based board including:

 

 

  • “Designing the high-speed DDR3 interface for Zynq requires a deep understanding of transmission line theory. The PCB layout calls for matching trace lengths, controlling impedances and using proper termination. If designed improperly, several PCB spins and months of development times can be wasted.”

 

 

  • “Avnet jumpstarts Zynq-based software, firmware and HDL development by providing the necessary tools to get started. MicroZed, PicoZed and Avnet’s entire portfolio of Zynq-based SoM have board support packages (BSPs) available.”

 

 

Whichever way you choose to go, the Zynq SoC (and the more powerful Zynq UltraScale+ MPSoC) gives you a unique blend of software-based processor horsepower and programmable logic that delivers hardware-level performance when and where you need it in your design.

 

 

 

A recent Sensorsmag.com article written by Nick Ni and Adam Taylor titled “Accelerating Sensor Fusion Embedded Vision Applications” discusses some of the sensor-fusion principles behind, among other things, 3D stereo vision as used in the Carnegie Robotics MultiSense stereo cameras discussed in today’s earlier blog titled “Carnegie Robotics’ FPGA-based GigE 3D cameras help robots sweep mines from a battlefield, tend corn, and scrub floors.” We’re putting a large number of sensors into systems, and turning the deluge of raw sensor data into usable information is a tough computational job.

 

Describing some of that job’s particulars consumes the first half of Ni and Taylor’s article. The second half of the article then discusses some implementation strategies based on the new Xilinx reVISION stack, which is built on top of Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.

 

If there are a lot of sensors in your next design, particularly image sensors, be sure to take a look at this article.

 

 

Carnegie Robotics currently uses a Spartan-6 FPGA in its GigE 3D imaging sensors to fuse video feeds from the two video cameras in the stereo pair; to generate 2.1 billion correspondence matches/sec from the left and right camera video streams; to then generate 15M points/sec of 3D point-cloud data from the correspondence matches; which in turn helps the company’s robots to make safe movement decisions and avoid obstacles while operating in unknown, unstructured environments. The company’s 3D sensors are used in unmanned vehicles and robots, which generally weigh between 100 and 1000 pounds, operate in a variety of such unstructured environments in applications as diverse as agriculture, building maintenance, mining, and battlefield mine sweeping. All of this is described by Carnegie Robotics’ CTO Chris Osterwood in a new 3-minute “Powered by Xilinx” video, which appears below.

 

The company is a spinout of Carnegie Mellon University’s National Robotics Engineering Center (NREC), one of the world’s premier research and development organizations for advanced field robotics, machine vision and autonomy. It offers a variety of 3D stereo cameras including:

 

 

  • The MultiSense S7, a rugged, high-resolution, high-data-rate, high-accuracy GigE 3D imaging sensor.
  • The MultiSense S21, a long-range, low-latency GigE imaging sensor based on the S7 stereo-imaging sensor but with wide (21cm) separation between the stereo camera pair for increased range.
  • The MultiSense SL, a tri-modal GigE imaging sensor that fuses high-resolution, high-accuracy 3D stereo vision from the company’s MultiSense S7 stereo-imaging sensor with laser ranging (0.4 to 10m).

 

 

 

Carnegie Robotics MultiSense SL Tri-Modal 3D Imaging Sensor.jpg

 

 

Carnegie Robotics MultiSense SL Tri-Modal 3D Imaging Sensor

 

 

 

 

All of these Carnegie Robotics cameras consume less than 10W, thanks in part to the integrated Spartan-6 FPGA, which uses 1/10 of the power required by a CPU to generate 3D data from the 2.1 billion correspondence matches/sec. The MultiSense SL served as the main perceptual “head” sensor for the six ATLAS robots that participated in the DARPA Robotics Challenge Trials in 2013. Five of these robots placed in the top eight finishers during the DARPA trials.

 

The video below also briefly discusses the company’s plans to migrate to a Zynq SoC, which will allow Carnegie Robotics’ sensors to perform more in-camera computation and will further reduce the overall robotic system’s size, weight, power consumption and image latency. That’s a lot of engineering dimensions all being driven in the right direction by the adoption of the more integrated Zynq SoC All Programmable technology.

 

Earlier this year, Carnegie Robotics and Swift Navigation announced that they were teaming up to develop a line of multi-sensor navigation products for autonomous vehicles, outdoor robotics, and machine control. Swift develops precision, centimeter-accurate GNSS (global navigation satellite system) products. The joint announcement included a photo of Swift Navigation’s Piksi Multi—a multi-band, multi-constellation RTK GNSS receiver clearly based on a Zynq Z-7020 SoC.

 

 

 

 

Swift Piksi Multi GNSS Receiver.jpg 

 

 

Swift Navigation Piksi Multi multi-band, multi-constellation RTK GNSS receiver, based on a Zynq SoC.

 

 

 

There are obvious sensor-fusion synergies between the product-design trajectory based on the Zynq SoC as described by Chris Osterwood in the “Powered by Xilinx” video below and Swift Navigation’s existing, Zynq-based Piksi Multi GNSS receiver.

 

Here’s the Powered by Xilinx video:

 

 

 

 

 

 

Even though I knew this was coming, it’s still hard to write this blog post without grinning. Last week, acknowledged FPGA-based processor wizard Jan Gray of Gray Research LLC presented a Hot Chips poster titled “GRVI Phalanx: A Massively Parallel RISC-V FPGA Accelerator Framework: A 1680-core, 26 MB SRAM Parallel Processor Overlay on Xilinx UltraScale+ VU9P.” Allow me to unpack that title and the details of the GRVI Phalanx for you.

 

Let’s start with the 1680 “austere” processing elements in the GRVI Phalanx, which are based on the 32-bit RISC-V processor architecture. (Is that parallel enough for you?) The GRVI processing element design follows “Jan’s Razor”: In a chip multiprocessor, cut nonessential resources from each CPU, to maximize CPUs per die. Thus, a GRVI processing element is a 3-stage, user-mode RV32I core minus a few nonessential bits and pieces. It looks like this:

 

 

GRVI Processing Element.jpg

 

A GRVI Processing Element

 

 

 

Each GRVI processing element requires ~320 LUTs and runs at 375MHz. Typical of a Jan Gray design, the GRVI processing element is hand-mapped and hand-floorplanned into the UltraScale+ architecture and then stamped 1680 times into the Virtex UltraScale+ VU9P FPGA on a VCU118 Eval Kit.

 

Now, dropping a bunch of processor cores onto a large device like the Virtex UltraScale+ VU9P FPGA is interesting but less than useful unless you give all of those cores some memory to operate out of, some way for the processors to communicate with each other and with the world beyond the FPGA package, and some way to program the overall machine.

 

Therefore, the GRVI processing elements are packaged in clusters containing as many as eight processing elements with 32 to 128Kbytes of RAM and additional accelerator(s). Each cluster is tied to the other on-chip clusters and to the external-world I/O through a Hoplite router on an NoC (network on chip) with 100Gbps links between nodes. The Hoplite router is an FPGA-optimized, directional router designed for a 2D torus network.

 

A GRVI Phalanx cluster looks like this:

 

 

GRVI Phalanx Cluster.jpg

 

A GRVI Phalanx Cluster

 

 

 

Currently, Gray’s paper says there is a multithreaded C++ compiler with a message-passing runtime layered on top of a RISC-V RV32IMA GCC compiler, with future plans to support OpenCL, P4, and other programming tools.

 

In development: an 80-core educational version of the GRVI Phalanx instantiated in the programmable logic of a “low-end” Zynq Z-7020 SoC on the Digilent PYNQ-Z1 board.

 

Now if all that were not enough (and you will find a lot more packed into Gray’s poster), there’s a Xilinx Virtex UltraScale+ VU9P available to you. It’s as near as your keyboard and Web browser on the Amazon AWS EC2 F1.2XL and F1.16XL instances and Jan Gray is working on putting the GRVI Phalanx on that platform as well.

 

Incredibly, it’s all in that Hot Chips poster.

 

Good news: Spartan-7 7S50FT196 ES devices with 52,160 logic cells and 120 DSP48E1 slices will be available starting this month. However, maybe you’re planning on using a smaller member of the Spartan-7 family like the 7S25, 7S15, or 7S6, and you’d like to start on the hardware design, including the PCB, now. Is there a way?

 

Yes, there is.

 

Start your development now with the Spartan-7 7S50FT196 ES FPGA in the 15x15mm FTGB196 package with 100 3.3V SelectIO HR I/O pins. That Spartan-7 FPGA in that package is footprint-compatible with the 7S25, 7S15, and 7S6 devices in the same package. It’s the smallest package that’s footprint-compatible across all four devices, and it takes you down to the Spartan-7 7S6 FPGA’s 6000 logic cells and 10 DSP slices should one of the smaller devices meet your needs. You get a head start on your development program with no need for a future PCB turn due to the component change-out. You also get an immediate hardware upgrade path should your future needs demand a larger FPGA.

 

As they say, “Operators are standing by.”

 

 

Spartan-7 Family Table with FTGB196 Pin Compatibility.jpg

 

 

 

 

 

by Anthony Boorsma, DornerWorks

 

Need more performance in your embedded design? Got a new feature you need to wedge into an existing design? Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Manage Dynamic Requirements in Your Embedded System.”

 

 

 

Let’s say you have developed some novel technology and you must deal with complex processing demands. Some requirements can be met using software executing on a microprocessor. Additionally, there are requirements for unique hardware—perhaps an unusually fast external interface—that require a custom logic implementation. To meet these requirements, you select a programmable SoC like the Xilinx Zynq SoC or Zynq UltraScale+ MPSoC, which provides an embedded, heterogeneous system architecture with both a powerful processing system (PS) and flexible programmable logic (PL). You plan to use the PS predominantly for your application and algorithm development and the PL for a custom interface. You start your development cycle with this plan in place.

 

At first, everything goes according to plan. Your design envelope for PS utilization is right on target for the final product. Resource utilization leaves room for future upgrades or for platform reuse when you want to create additional variants. The PL implementation provides you with the flexibility and power you need for your custom interface and even has some extra resources available.

 

Then unexpectedly, you learn of a requirements change that adds one or more new system features. Due to the nature of this feature or features, perhaps because of the complexity, you implement the new feature(s) in software running on the PS—which loads the PS more heavily. If this is a severe case, the new feature(s) could push the PS well beyond its performance limits. At this point, it is too late in the product life cycle to start over without missing or renegotiating deadlines. What do you do if you want to make your current deadline and still include future upgrade capability?

 

By now, this hypothetical situation may seem all too real.

 

There are design alternatives that can help you in these situations. Intensive optimization of the source code to bring software performance within spec is a potential option—one frequently used with microcontrollers and processors when a PL is not available. However, as is often the case with innovative designs that push the hardware envelope, software optimization alone may be insufficient to hit performance targets.

 

Alternatively, offloading computationally complex portions of your application to the PL is an option because you selected a heterogeneous system with a PS and a PL. Even if this offloading process turns out not to be an option for the new feature(s), there are likely opportunities waiting for you in the existing code to offload tasks to the PL that will free up needed PS capacity.

 

Not all code blocks make sense to move to the PL because you incur overhead when transferring data back and forth between the PS and PL. Code sections that require significant data transfer with relatively little processing will not gain enough from the PL’s parallel processing to offset that overhead.

 

An example of a good opportunity for offloading an algorithm from software to hardware is a block of code that:

 

  • Needs to execute frequently
  • Involves above-average computation
  • Lacks strict data-movement requirements
  • Already uses parallel software processing

 

A code block with complex independent loops processing relatively small amounts of data is an ideal candidate for algorithm acceleration in the PL because the overhead of the induced data-transfer latency is minor compared to the increased performance gained by parallel PL processing. Other blocks to consider are those with strict data-latency requirements where software-scheduling mechanisms and interrupts would introduce unwanted spikes in data-processing latencies. These tasks are best served by PL-based hardware processing.
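
As an illustration only (not code from the post), this is the shape of loop that fits those criteria: the iterations are independent, the math per byte moved is high, and the data footprint is small:

    /* Compute-heavy, independent iterations, small data footprint: a good
       candidate for PL offload. */
    #define N 256

    void correlate(const short in[N], const short taps[N], int out[N])
    {
        for (int lag = 0; lag < N; lag++) {       /* iterations are independent */
            int acc = 0;
            for (int i = 0; i < N - lag; i++) {
                acc += in[i] * taps[i + lag];     /* much math per byte moved */
            }
            out[lag] = acc;
        }
    }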

 

If there are no ideal candidate code blocks or features for hardware offloading, you can still leverage the PL. Select a code block that is difficult to compute for a processor and let the PL do the work. Free up the PS for the work it does best and you still realize the benefit of PL offloading.

 

In fact, you will realize a benefit even if the PL is slower than the PS when executing offloaded code. For example, this situation might occur where the offloaded processing logic on the PL is not a natural fit for acceleration. Yet, the PS is still freed up to perform other tasks that it could not do without offloading. As long as the PL implementation is fast enough, you will see the benefit. The PL’s availability provides you with an opportunity for parallelization and balancing of the processing load across the PS and the PL, improving overall system performance.

 

You might even find that this approach becomes a standard procedure during development of future designs as a load-balancing mechanism to improve consistent baseline performance when adjusting for dynamic requirements over the development life cycle of your new technology. This benefit alone makes it advantageous to consider the selection of a heterogeneous SoC for your next embedded system.

 

 

 

By Adam Taylor

 

With the hardware platform built using the Zynq-based Avnet MiniZed dev board, the next step in this adventure is to write the software so we can display images on the 7-inch touch display. To do this, we need to write a bare-metal software application that does the following:

 

  • Configure the video timing controller (VTC) to generate the timings required for the 800x480-pixel WVGA (Wide Video Graphics Array) display.
  • Create three frame buffers within the PS (processing system) DDR SDRAM.
  • Configure the FLIR Lepton IR camera and store images in the current write frame buffer.
  • Configure the VDMA to read from the current read frame buffer.

 

The first step is to configure the VTC to generate video timing signals for the desired resolution. Failing to do this correctly means that the AXI-Stream-to-Video-Out block won’t lock to the AXIS video stream.

 

The VTC is a core component present in most image-processing pipelines (ISPs). The VTC’s function is not limited to generating timing signals; it also detects video input timing. This feature allows the VTC to lock its timing generation to input video streams. That’s a key capability if the ISP needs to be agile and adapt on the fly to changes in input resolution.

 

 

 

 

Image1.jpg 

 

 

 

The VTC generator can be configured either by its own registers, which we update by writing to those registers directly, or by the VTC detector registers. For this exercise, we need to set the VTC generator register sources correctly because we are only using the generator half of the VTC and not the detector half. The VTC’s power-on default is to take configuration data from the detector registers, which is not the mode we wish to use here. To set the VTC register source, we’ll use a variable of the structure type XVtc_SourceSelect in conjunction with the function XVtc_SetSource().

 

 

 

Image2.jpg
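
In case the code in the screenshot doesn’t reproduce well, here is a minimal sketch of the same step, assuming the standalone XVtc driver and an already-initialized instance (the instance and function names are illustrative):

    #include <string.h>
    #include "xvtc.h"

    /* Point every generator timing register at the VTC's own (generator)
       register set rather than the detector's. */
    void vtc_set_generator_sources(XVtc *VtcInstPtr)
    {
        XVtc_SourceSelect Sources;

        memset(&Sources, 0, sizeof(Sources));
        Sources.VChromaSrc     = 1;
        Sources.VActiveSrc     = 1;
        Sources.VBackPorchSrc  = 1;
        Sources.VSyncSrc       = 1;
        Sources.VFrontPorchSrc = 1;
        Sources.VTotalSrc      = 1;
        Sources.HActiveSrc     = 1;
        Sources.HBackPorchSrc  = 1;
        Sources.HSyncSrc       = 1;
        Sources.HFrontPorchSrc = 1;
        Sources.HTotalSrc      = 1;

        XVtc_SetSource(VtcInstPtr, &Sources);
    }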

 

 

Together these lines of code set the VTC control-register bits 8 to 26, which determine the source for each register. Each of these bits controls a specific generator register source. For example, bit 8 controls the Frame Horizontal Size register. Setting this bit to “0” instructs the VTC to use the detector settings while a “1” instructs the VTC to use the generator’s internal register settings.

 

Failing to do this results in writes to the generator registers having no effect on the generated video timing, which can be a rather frustrating issue to track down.

 

With the correct register source set, the next step is to write the timing parameters. We need the following settings for the 7-inch touch display:

 

 

 

Image3.jpg 

 

 

 

These parameters are stored in a variable of the XVtc_Timing type. We write them into the VTC using the XVtc_SetGeneratorTiming() function:

 

 

Image4.jpg
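
Because the screenshot carries the actual numbers, here is a hedged sketch of the same step; the porch and sync widths below are placeholders, so take the real values for your 800x480 panel from its datasheet:

    #include <string.h>
    #include "xvtc.h"

    void vtc_set_wvga_timing(XVtc *VtcInstPtr)
    {
        XVtc_Timing Timing;

        memset(&Timing, 0, sizeof(Timing));
        Timing.HActiveVideo = 800;  /* pixels per line */
        Timing.HFrontPorch  = 40;   /* placeholder */
        Timing.HSyncWidth   = 48;   /* placeholder */
        Timing.HBackPorch   = 40;   /* placeholder */
        Timing.VActiveVideo = 480;  /* lines per frame */
        Timing.V0FrontPorch = 13;   /* placeholder */
        Timing.V0SyncWidth  = 3;    /* placeholder */
        Timing.V0BackPorch  = 29;   /* placeholder */
        Timing.Interlaced   = 0;    /* progressive scan */

        XVtc_SetGeneratorTiming(VtcInstPtr, &Timing);
        XVtc_RegUpdateEnable(VtcInstPtr);   /* latch the new generator values */
        XVtc_EnableGenerator(VtcInstPtr);
    }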

 

 

 

Of course, the VDMA and the frame buffers must also be aligned with the VTC. The current design uses three frame buffers to store the output images. Each frame buffer is based on the u32 type and declared as a one-dimensional array containing the total number of pixels in the image.

 

The u32 type is ideal for the frame buffer because each pixel in the 7-inch touch display requires eight-bit Red, Green, and Blue values. Therefore, we need 24 bits per pixel. Each frame buffer has an associated pointer that we’ll use for frame-buffer access. We initialize these pointers just after the program starts.
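
A minimal sketch of those declarations, assuming the 800x480 display (all names are illustrative):

    #include "xil_types.h"

    #define WIDTH   800
    #define HEIGHT  480

    /* Three frame buffers, one u32 per pixel (8-bit R, G, and B packed
       into the low 24 bits). */
    static u32 FrameBuf[3][WIDTH * HEIGHT];

    static u32 *ReadFrame;   /* the VDMA displays from here */
    static u32 *WriteFrame;  /* camera frames land here */

    void init_frame_pointers(void)
    {
        ReadFrame  = FrameBuf[0];
        WriteFrame = FrameBuf[1];
    }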

 

We use the VDMA to display the contents of the frame buffer. The key VDMA configuration parameters are stored within a variable of the type XAxiVdma_DmaSetup. It is here that we define the vertical and horizontal sizes, the stride, and the frame-store addresses. The DMA is then configured using this data and the XAxiVdma_DmaConfig() and XAxiVdma_DmaSetBufferAddr() functions. One very important thing to remember here is that the horizontal size and stride are entered in bytes. So in this example, they are set to 800 * 4 because each u32 word consists of four bytes.

 

 

Image6.jpg 
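
Here is a minimal sketch of that read-channel setup, assuming an initialized XAxiVdma instance and an array holding the physical addresses of the three frame buffers (both names are illustrative):

    #include "xaxivdma.h"

    #define WIDTH   800
    #define HEIGHT  480

    int vdma_start_read(XAxiVdma *VdmaInstPtr, UINTPTR FrameAddr[])
    {
        XAxiVdma_DmaSetup ReadCfg;

        ReadCfg.VertSizeInput       = HEIGHT;
        ReadCfg.HoriSizeInput       = WIDTH * 4;  /* bytes: four per u32 pixel */
        ReadCfg.Stride              = WIDTH * 4;  /* bytes between line starts */
        ReadCfg.FrameDelay          = 0;
        ReadCfg.EnableCircularBuf   = 1;          /* cycle through the buffers */
        ReadCfg.EnableSync          = 0;
        ReadCfg.PointNum            = 0;
        ReadCfg.EnableFrameCounter  = 0;
        ReadCfg.FixedFrameStoreAddr = 0;

        if (XAxiVdma_DmaConfig(VdmaInstPtr, XAXIVDMA_READ, &ReadCfg) != XST_SUCCESS)
            return XST_FAILURE;
        if (XAxiVdma_DmaSetBufferAddr(VdmaInstPtr, XAXIVDMA_READ, FrameAddr) != XST_SUCCESS)
            return XST_FAILURE;

        return XAxiVdma_DmaStart(VdmaInstPtr, XAXIVDMA_READ);
    }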

 

 

 

We’ll use code from the previous example (p1 & p2) to interface with the FLIR Lepton IR camera. This code communicates with the camera over I2C and SPI interfaces. Once an image has been received from the camera, the code copies it into the frame buffer. However, to use most of the available display area, we’ll apply a simple digital zoom to scale up the 80x60-pixel image from the Lepton 2 camera. To do this, we output each pixel eight times in both dimensions to generate a 640x480-pixel image that we position within the 7-inch touch display’s 800x480 pixels. We set the remaining pixels to a constant color. As this is a touch display, this remaining space would be ideal for command buttons and other user-interface elements.
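
A minimal sketch of that digital zoom (illustrative names; assumes an 80x60 8-bit Lepton frame and the u32 display buffer described earlier). Writing the 8-bit value into bits 15:8 maps it to the green channel, which produces the green cast mentioned below:

    #include "xil_types.h"

    #define LEP_W   80
    #define LEP_H   60
    #define ZOOM    8    /* 80x60 -> 640x480 */
    #define DISP_W  800

    void zoom_lepton_frame(const u8 lepton[LEP_H][LEP_W], u32 *frame_buf)
    {
        for (int y = 0; y < LEP_H; y++) {
            for (int x = 0; x < LEP_W; x++) {
                u32 pixel = ((u32)lepton[y][x]) << 8;  /* green channel */
                for (int dy = 0; dy < ZOOM; dy++) {
                    for (int dx = 0; dx < ZOOM; dx++) {
                        frame_buf[(y * ZOOM + dy) * DISP_W + (x * ZOOM + dx)] = pixel;
                    }
                }
            }
        }
    }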

 

Putting all this together results in the image below. The green coloring comes from mapping the 8-bit Lepton image data into the green channel of the display.

 

 

 

Image5.jpg 

 

 

This combination of the FLIR Lepton camera and the Zynq-based MiniZed dev board results in a very compact and cost-efficient thermal-imaging solution. The next step in our journey is to get the MiniZed’s wireless communications working with PetaLinux so that we can transmit these images over the air.

 

 

I have uploaded the initial complete design to GitHub and it is available here.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here.

 

 

 MicroZed Chronicles hardcopy.jpg

  

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg 

 

 

The latest teardown video from EEVblog’s Dave Jones looks at the ZeroPlus LAP-F1 logic analyzer with 40 or 64 channels, a maximum internal sample rate of 1Gsamples/sec for timing analysis, and 200MHz (dual-edge) state analysis. This 55-minute video looks thoroughly at the hardware and software of the ZeroPlus LAP-F1 analyzer and the lower-cost LAP-C analyzer. The LAP-C analyzers are 10x slower and appear to be based on an ASIC that ZeroPlus designed several years ago. As Dave tears into the newer, faster LAP-F1 logic analyzer, he finds (at 22:44 in the video) a Xilinx Kintex-7 160T FPGA—he calls it “a bit of a beast”—performing the logic functions including data capture, storage in the analyzer’s DDR3-1600 SDRAM capture memory, triggering, and protocol decoding. (From Dave’s perspective, extensive protocol-decoding support may well be the standout feature of the ZeroPlus analyzers.)

 

Dave’s analysis of the LAP-F1 hardware starts at 18:00 in the video. He looks at the LAP-F1 software starting at 45:08 in the video.

 

 

ZeroPlus LAP-F1 Logic Analyzer.jpg 

 

ZeroPlus LAP-F1 Logic Analyzer, Based on a Xilinx Kintex-7 160T FPGA

 

 

 

The LAP-F1 employs an unusual design for the logic probes. It uses USB 3.0 connectors and wiring to connect the analyzer to the probes, but does not use the USB 3.0 protocol to talk to the probes. Instead, the USB 3.0 connectors and wiring are employed to take advantage of the differential-pair wiring and controlled impedance of the cabling. The logic analyzer talks to the probe cables and then active probes designed for various logic levels plug onto the probe cables. Dave approves of the hardware design, but he’s a bit hard on the software design.

 

 

 

ZeroPlus Logic Probe Cables.jpg

 

ZeroPlus LAP-F1 Logic Probe Cables

 

 

 

ZeroPlus Logic Analyzer Active Probes.jpg

 

ZeroPlus LAP-F1 Logic Analyzer Active Probes

 

 

 

The Kintex-7 160T FPGA is a good choice for this sort of design because it can support as many as 192 high-speed, differential I/O pairs, which is more than sufficient for this design.

 

 

 

ZeroPlus LAP-F1 Logic Analyzer Kintex-7 160T.jpg

 

There’s a Xilinx Kintex-7 160T FPGA at the heart of the ZeroPlus LAP-F1 Logic Analyzer

 

 

 

Here’s the EEVblog teardown video:

 

 

 

 

 

 

 

 

 

 

 

 

Engineering Advisory Explicit Content.jpg

 

The following blog post contains explicitly competitive information. If you do not like to read such things or if you live in a country where you’re not supposed to read such things, then stop reading.

 

 

 

In this blog post, I will discuss device performance in a competitive context. Now, whenever you read about “the competition” on a vendor’s Web site, you need to take the information provided with a big grain of salt. It’s hard to believe anything one vendor says about the competition, which is why I so rarely attempt such comparisons in the Xcell Daily blog.

 

This post is an exception.

 

With that caveat stated, let’s rush in where angels fear to tread.

 

There’s a new 18-page White Paper on the Xilinx.com Web site titled “Measuring Device Performance and Utilization: A Competitive Overview” and written by Frederic Rivoallon, the Vivado HLS and RTL Synthesis Product Manager here at Xilinx. Rivoallon’s White Paper “compares actual Kintex UltraScale FPGA results to Intel’s (formerly Altera) Arria 10, based on publicly available OpenCores designs.” (OpenCores.org declares itself to be “the world’s largest site/community for development of hardware IP cores as open source.”) The data for this White Paper was generated in June, 2017 and is based on the latest versions of the respective design tools available at that time (Vivado Design Suite 2017.1 and Quartus Prime v16.1).

 

Cutting to the chase, here’s the White Paper’s conclusion, conveniently summarized in the same White Paper’s introduction:

 

“Verifiable results based on OpenCores designs demonstrate that the Xilinx UltraScale architecture delivers a two-speed-grade performance boost over competing devices while implementing 20% more design content. This boost equates to a generation leap over the closest competitive offering.”

 

I place in evidence Exhibit 1 (actually Figure 1 in the White Paper), which compares Kintex UltraScale FPGA device utilization versus Arria 10 device utilization and shows that it’s much harder to use all of the Arria 10’s device capacity than it is for the Kintex UltraScale device:

 

 

 

wp496 Figure 1.jpg 

 

 

 

It’s quite reasonable for you to ask “why is this so?” at this point. In fact, you certainly should. I’m told, and the White Paper explains, that there’s a fundamental architectural reason for this significant utilization disparity. You see it in the architectural difference between a Xilinx UltraScale CLB and an Arria ALM (adaptive logic module). Here’s the picture (which is Figure 2 in the White Paper):

 

 

 

wp496 Figure 2.jpg 

 

 

 

You can see that the two 6-input LUTs in the Arria 10 ALM share four inputs while the two 6-input LUTs in the UltraScale device have independent inputs. (Xilinx UltraScale+ devices employ the same LUT configuration.) There’s no sleight of hand here. Given enough routing resources (which the Xilinx UltraScale architecture has) and a sufficiently clever place-and-route tool (which Vivado has), you will be able to use both 6-input LUTs more often if they have independent inputs than if they have several shared inputs. Hence the greater maximum usable resource capacity for UltraScale and UltraScale+ devices.

 

And now for Exhibit 2. Here’s the associated performance graph showing FMAX for the various OpenCores IP cores (Figure 3 in the White Paper):

 

 

 

wp496 Figure 3.jpg 

 

 

 

As you might expect from a Xilinx White Paper, the UltraScale device performs better after placement and routing. There are many more such Exhibits (charts and graphs) for you to peruse in the White Paper and Xilinx does not always win.

 

So what?

 

Well, the purpose of this blog post is twofold. First, I wanted you to be aware of this White Paper. If you’ve read this far, that goal has been achieved. Second, I don’t want you to take my word for it. I am reporting what’s stated in the White Paper but you should know that this White Paper was created in response to a similar White Paper published a few months back by “the competition.” No surprise, the competition’s White Paper came to different conclusions.

 

So who is right?

 

As a former Editor-in-Chief of both EDN Magazine and Microprocessor Report, I am well aware of benchmarks. In fact, EEMBC, the industry alliance that developed industry-standard benchmarks for embedded systems, was based on a hands-on project conducted by former EDN editor Markus Levy in 1996 while I was EDN’s Editor-in-Chief. Markus founded EEMBC a year later. I devoted a portion of Chapter 3 in my book “Designing SoCs with Configured Cores” to microprocessor benchmarking and I wrote an entire chapter (Chapter 10) about the history of microprocessor benchmarking for the textbook titled “EDA for IC System Design, Verification, and Testing,” published in 2006. That chapter also discussed some of the many ways to achieve the results you desire from benchmarks. FPGA benchmarks are in a similar state of affairs, going back at least to the 1990s and the famous/infamous PREP benchmark suite.

 

Here’s what Alexander Carlton at HP in Cupertino, California wrote way back in 1994 in his article on the SPEC Web site titled “Lies, **bleep** Lies, and Benchmarks”:

 

“It has been said that there are three classes of untruths, and these can be classified (in order from bad to worse) as: Lies, **bleep** Lies, and Benchmarks. Actually, this view is a corollary to the observation that ‘Figures don't lie, but liars can figure...’ Regardless of the derivation of this opinion, criticism of the state of performance marketing has become common in the computer industry press.”

 

 

[Editorial note: The blogging tool has modified the article's title to meet its Victorian sense of propriety.]

 

 

To my knowledge, no shenanigans were used to achieve the above FPGA benchmark results (I did ask) but I nevertheless caution you to be careful when interpreting the numbers. Here’s how I’d view these White Paper benchmark results:

 

Your mileage may vary. (Even the US EPA says so.) The only benchmark truly indicative of the device utilization and performance you’ll get for your design is… your design. Benchmarks are merely surrogates for your design.

 

So go ahead. Download and read the new Xilinx “Measuring Device Performance and Utilization: A Competitive Overview” White Paper, get educated, and then start asking questions.

 

 

MYIR Tech’s 91x63mm Z-turn Lite is a flexible SBC (single-board computer)/dev board that’s offered in a $69 version populated with a Xilinx Zynq Z-7007S SoC with one ARM Cortex-A9 processor core or in a $75 version populated with a dual-core Xilinx Zynq Z-7010 SoC. The two versions of the Zynq SoC are pin-compatible, making it much easier for MYIR to offer two versions of the board using the same pcb layout. Both versions of the Z-turn Lite dev board also include 512Mbytes of DDR3 SDRAM, eMMC Flash memory, QSPI Flash memory, and a TF (SD) card slot.

 

 

 

MYIR Tech Zynq-Based Z-turn Lite Top.jpg 

 

MYIR Tech Z-turn Lite SBC (single-board computer)/dev board (top)

 

 

 

The above photo of the top of the Z-turn Lite board initially threw me off, as it may you. It looks like the board has a few standard I/O ports (10/100/1000 Ethernet, USB OTG, UART, and JTAG), but no ports that break out the Zynq SoC PL’s (programmable logic’s) many programmable I/O pins for I/O-centric applications such as sensor fusion. That’s why you need to look at the bottom of the board as well because that’s where you’ll find the breakout/expansion connector for an additional 84 PL I/O pins:

 

 

 

MYIR Tech Zynq-Based Z-turn Lite Bottom.jpg 

 

MYIR Tech Z-turn Lite SBC (single-board computer)/dev board (bottom)

 

 

 

MYIR’s use of the single- and dual-core Zynq SoC with the same board layout gives you a lot of scalability with respect to PS (processing system) horsepower and some scalability with respect to PL capacity. (Feel free to use this trick yourself.) The PL in the Zynq Z-7007S SoC has 23K logic cells and 66 DSP48E1 slices. The PL in the Zynq Z-7010 SoC has 28K logic cells and 80 DSP48E1 slices.

 

For more information about the Z-turn Lite SBC/dev board, please contact MYIR directly.

 

 

The XA-RX PCIe XMC module from Innovative Integration presents eight 16-bit, 125Msamples/sec ADCs to the world through eight SSMC RF connectors. That’s 1Gsamples/sec of aggregate ADC sample bandwidth at 16 bits/sample, supplied by two Analog Devices AD9653 quad ADCs. It’s best to have some on-board processing when you’ve got that much data coming in that fast, and the XA-RX module funnels the sampled data directly into a Xilinx Artix-7 A200T FPGA for local processing, as shown in this block diagram:

 

 

 

Innovative Integration XA-RX Block Diagram.jpg 

 

 

Innovative Integration’s XA-RX XMC module block diagram

 

 

 

The Artix-7 A200T’s 740 (!) DSP48E1 slices can handle some pretty significant processing tasks, and its programmable I/O and sixteen GTP SerDes transceivers easily handle the high-speed I/O needs of the ADCs’ serial LVDS ports, control of the 1Gbyte DDR3L-1600 SDRAM, and the XMC connectors’ PCIe Gen2 and Aurora ports. As you can see from the block diagram, the Artix-7 FPGA implements all of the XA-RX module’s on-board logic including control, signal processing, buffering, and system-interface functions. Not bad for a supposedly “low-end” FPGA. That Artix-7 A200T is a rather capable device and a good fit for the performance needs of this application.

 

Here’s a photo of the XA-RX board:

 

 

 

Innovative Integration XA-RX.jpg

 

 

Innovative Integration’s XA-RX XMC module

 

 

 

 

Applications for Innovative Integration’s XA-RX XMC module include:

 

  • High speed stimulus-response
  • Radar
  • Medical ultrasound and MRI
  • High-speed imaging
  • Quadrature radio receivers
  • Diversity radio receivers
  • Test equipment

 

 

Innovative Integration also supplies data-acquisition, logging, and analysis sample applications with the XA-RX module, along with Windows/Linux drivers, C++ host tools, and VHDL/MATLAB tools via the company’s Framework Logic toolset.

 

For more information about the XA-RX module, please contact Innovative Integration directly.

 

 

 

Hardent, a Xilinx Authorized Training Partner, has announced a 3-day embedded design class based on the Xilinx Zynq UltraScale+ MPSoC and you can attend either in person at one of several North American locations or live over the Internet. Here’s a course outline:

 

  • Zynq UltraScale+ MPSoC Architecture Overview
  • Zynq MPSoC Processor System (PS)
  • The Application Processing Unit (APU)
  • The Real-Time Processing Unit (RPU)
  • The Platform Management Unit (PMU)
  • The Quick Emulator (QEMU)
  • System-Level Features
  • Boot and Configuration
  • Coherency
  • AXI Interfaces between the PS and PL (Programmable Logic)
  • Power Management
  • Clocks and Resets
  • DDR and QoS
  • Security and Safety
  • System Protection
  • Security and Software
  • ARM TrustZone Technology
  • Linux and the MPSoC {Lectures, Labs}
  • Symmetric Multi-Processor Linux
  • Yocto
  • PetaLinux
  • Virtualization
  • HW-SW Virtualization
  • Introduction to the Xen Hypervisor
  • OpenAMP
  • The Software Ecosystem
  • Software Ecosystem Support
  • FreeRTOS
  • Software Stack

 

 

There are eleven scheduled classes, and the first one starts today.

 

For more information and to register, click here.

 

 

Now that Amazon has made the FPGA-accelerated Amazon EC2 F1 compute instance generally available to all AWS customers (see “AWS makes Amazon EC2 F1 instance hardware acceleration based on Xilinx Virtex UltraScale+ FPGAs generally available”), just about anyone can get access to the latest Xilinx All Programmable UltraScale+ devices from anywhere, just as long as you have an Internet connection and a Web browser. Xilinx has just published a new video demonstrating the use of its Vivado IP Integrator, a graphical-based design tool, with the AWS EC2 F1 compute instance.

 

Why use Vivado IP Integrator? As the video says, there are five main reasons:

 

  • Simplified connectivity
  • Block automation
  • Connectivity automation
  • DRC (design rule checks)
  • Advanced hardware debug

 

 

Here’s the 5-minute video:

 

 

 

 

 

 

 

Baidu details FPGA-based Cloud acceleration with 256-core XPU today at Hot Chips in Cupertino, CA

by Xilinx Employee, 08-22-2017

 

Xcell Daily covered an announcement by Baidu about its use of Xilinx Kintex UltraScale FPGAs for the acceleration of cloud-based applications last October. (See “Baidu Adopts Xilinx Kintex UltraScale FPGAs to Accelerate Machine Learning Applications in the Data Center.”) Today, Baidu discussed more architectural particulars of its FPGA-acceleration efforts at the Hot Chips conference in Cupertino, California—according to Nicole Hemsoth’s article appearing on the NextPlatform.com site (“An Early Look at Baidu’s Custom AI and Analytics Processor”).

 

Hemsoth writes:

 

“…Baidu has a new processor up its sleeve called the XPU… The architecture they designed is aimed at this diversity with an emphasis on compute-intensive, rule-based workloads while maximizing efficiency, performance and flexibility, says Baidu researcher, Jian Ouyang. He unveiled the XPU today at the Hot Chips conference along with co-presenters from FPGA maker, Xilinx…

 

“’The FPGA is efficient and can be aimed at specific workloads but lacks programmability,’ Ouyang explains. ‘Traditional CPUs are good for general workloads, especially those that are rule-based and they are very flexible. GPUs aim at massive parallelism and have high performance. The XPU is aimed at diverse workloads that are compute-intensive and rule-based with high efficiency and performance with the flexibility of a CPU,’ Ouyang says. The part that is still lagging, as is always the case when FPGAs are involved, is the programmability aspect. As of now there is no compiler, but he says the team is working to develop one…

 

“’To support matrix, convolutional, and other big and small kernels we need a massive math array with high bandwidth, low latency memory and with high bandwidth I/O,” Ouyang explains. “The XPU’s DSP units in the FPGA provide parallelism, the off-chip DDR4 and HBM interface push on the data movement side and the on-chip SRAM provide the memory characteristics required.’”

 

According to Hemsoth’s article, “The XPU has 256 cores clustered with one shared memory for data synchronization… Somehow all 256 cores are running at 600MHz.”

 

For more details, see Hemsoth’s article on the NextPlatform.com Web site.

 

Pinnacle’s Denali-MC Real-Time, HDR-Capable Image Signal Processor Supports 29 CMOS Image Sensors

by Xilinx Employee ‎08-22-2017 10:33 AM - edited ‎08-22-2017 10:41 AM (5,434 Views)

 

Pinnacle Imaging Systems’ configurable Denali-MC HDR video and HDR still ISP (Image Signal Processor) IP can support 29 different HDR-capable CMOS image sensors (including nine Aptina/ON Semi, six Omnivision, and eleven Sony devices) and twelve different pixel-level gain and frame-set HDR methods using 16-bit processing. The IP can be useful in a wide variety of applications including but certainly not limited to:

 

  • Surveillance/Public Safety
  • ADAS/Autonomous Driving
  • Intelligent Traffic Systems
  • Body Cameras
  • Machine Vision

 

 

Pinnacle Denali-MC ISP Core.jpg

 

Pinnacle’s Denali-MC Image Signal Processor Core Block Diagram

 

 

 

Pinnacle has implemented its Denali-MC IP on a Xilinx Zynq Z-7045 SoC (from the photo on the Denali-MC product page, it appears that Pinnacle used a Xilinx Zynq ZC706 Eval Kit as the implementation vehicle) and has produced this impressive 3-minute video of the IP in real-time action:

 

 

 

 

Please contact Pinnacle directly for more information about the Denali-MC ISP IP. The data sheet for the Denali-MC ISP core is here.

 

 

 

Whatever you’re designing, chances are there are at least a few people out there in the world somewhere who want to break into it. That’s one reason why the Xilinx Zynq UltraScale+ MPSoC makes a good design foundation. The Zynq MPSoC family offers many system-level protection mechanisms to help keep your systems secure, starting with the ARM TrustZone hardware built into the APU’s ARM Cortex-A53 processor cores, which maintains isolation between secure and non-secure processes, and continuing with the XMPU (Xilinx Memory Protection Unit), the XPPU (Xilinx Peripheral Protection Unit), and the SMMU (System Memory Management Unit).

 

The Zynq UltraScale+ MPSoC Technical Reference Manual discusses these features, but if you’d like a fast path to more detail, consider attending the free, 1-hour “System Protection Features of Zynq UltraScale+ MPSoCs” Webinar being given on Wednesday, September 13. Hardent, a Xilinx Authorized Training Provider, is teaching the course.

 

Register here.

 

 

By Anthony Boorsma, DornerWorks

 

Having some trouble choosing between Vivado HLS and SDSoC? Here’s some advice from DornerWorks, a Premier Xilinx Alliance Program member. This blog is adapted from a recent post on the DornerWorks Web site titled “Algorithm Implementation and Acceleration on Embedded Systems.”

 

 

How does an engineer already experienced and comfortable with working in the Zynq SoC’s software-based PS (processing system) domain take advantage of the additional flexibility and processing power of the Zynq SoC’s PL (programmable logic)? The traditional path is education and training: learning to program the PL using an HDL such as Verilog or VHDL. Another way is to learn and use a tool that allows you to take a software-based design written exclusively for the ARM 32-bit processors in the PS and transfer some or most of the tasks to the PL, without writing HDL descriptions.

 

One such tool is Xilinx’s Vivado High-Level Synthesis (HLS). By leveraging the capabilities of HLS, you can prototype a design using the Zynq PS and then move functionality to the PL to boost performance. The advantage of this tool is that it generates IP blocks that can be used in the programmable logic of Xilinx FPGAs as well as Xilinx Zynq SoCs and Zynq UltraScale+ MPSoCs.

 

Logic optimization occurs when Vivado HLS synthesizes your algorithm’s C model and creates RTL. Code directives (essentially guidelines for the tool’s optimization process) allow you to guide the HLS tool’s synthesis from the C-model source to the RTL that ultimately becomes part of the bitstream programmed into the FPGA. If you are working with an existing algorithm modeled in C, C++, or SystemC and need to implement this algorithm in custom logic for added performance, then HLS is a great tool choice.
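To make these directives concrete, here is a minimal, hypothetical C function showing two of the most commonly used Vivado HLS pragmas. The function, array size, and partitioning factor are illustrative assumptions, not taken from any particular design:

#define N 128

// Hypothetical HLS kernel: multiply-accumulate over two fixed-size vectors.
int vec_mac(const int a[N], const int b[N])
{
    // Split each array across several memories so the pipelined loop
    // below can read more than one element per clock cycle.
#pragma HLS ARRAY_PARTITION variable=a cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=b cyclic factor=4

    int acc = 0;
    for (int i = 0; i < N; i++) {
        // Ask the tool to start a new loop iteration every clock cycle.
#pragma HLS PIPELINE II=1
        acc += a[i] * b[i];
    }
    return acc;
}

PIPELINE overlaps loop iterations to raise throughput, while ARRAY_PARTITION reshapes the memories feeding the loop so the pipeline is not starved. Much of HLS optimization work is a dialog of this kind between directives and the synthesis reports.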

 

However, be aware that the data movers that transfer data between the Zynq PS and the PL must be manually configured for performance when using Vivado HLS. This can become a complicated process when there’s significant data transfer between the domains.

 

A recent innovation that simplifies data-mover configuration is the Xilinx SDSoC (Software-Defined System on Chip) Development Environment for use with Zynq SoCs and Zynq UltraScale+ MPSoCs. SDSoC builds on Vivado HLS by using it to perform the C-to-RTL conversion, with the convenient addition of automatically generated data movers, which greatly simplifies configuring the connection between the software running on the Zynq PS and the accelerated algorithm executing in the Zynq PL. SDSoC also allows you to guide data-mover generation through a set of pragmas for making specific data-mover choices. These SDSoC directive pragmas give you control over the automatically generated data movers while still requiring only minimal manual configuration. The code-directive pragmas for RTL optimization available in Vivado HLS are also available in SDSoC and can be used in tandem with the SDSoC pragmas to optimize both the PL algorithm and the automatically generated data movers.
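For a flavor of how these pragmas read in practice, here is a hypothetical prototype for a matrix-multiply function in the style of the template project discussed below. The names, sizes, and pragma selections are assumptions for illustration, not the template’s actual source:

#define N 32  /* assumed matrix dimension */

/* Tell SDSoC that each array is streamed sequentially, should be
 * copied between PS and PL (rather than shared), and should travel
 * over a simple AXI DMA data mover. */
#pragma SDS data access_pattern(a:SEQUENTIAL, b:SEQUENTIAL, c:SEQUENTIAL)
#pragma SDS data copy(a[0:N*N], b[0:N*N], c[0:N*N])
#pragma SDS data data_mover(a:AXIDMA_SIMPLE, b:AXIDMA_SIMPLE, c:AXIDMA_SIMPLE)
void mmult(const float a[N*N], const float b[N*N], float c[N*N]);

Left to itself, SDSoC infers data movers from the function signature and call site; pragmas like these are how you overrule the defaults when they don’t match your data flow.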

 

It is possible to disable the SDSoC auto-generated data movers and use only the HLS optimizations. Shown below are two IP block diagrams: one generated with the auto-configured SDSoC data movers and one without them.

 

The following screen shots come from a Xilinx template project, included with the SDx installation, that demonstrates the acceleration of a software matrix multiplication and addition algorithm. We used the SDx 2016.4 toolchain and targeted an Avnet ZedBoard with a standalone OS configuration for this example.

 

 

Image1.jpg

 

 

Here is a screen shot of the same block, but without the SDSoC data movers. (We have disabled the automatic generation of data movers within SDSoC by manually declaring the AXI HLS interface directives for both the mmult and madd accelerated IP blocks, as sketched in the code after the screen shot.)

 

 

Image2.jpg 
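As an illustration of that manual declaration, here is a sketch of what the interface pragmas can look like. The port and bundle names are hypothetical, and the template’s actual code may differ:

#define N 32  /* assumed matrix dimension, as before */

void mmult(const float a[N*N], const float b[N*N], float c[N*N])
{
    /* Master AXI ports for the data arrays, addressed via slave-set offsets... */
#pragma HLS INTERFACE m_axi port=a offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=b offset=slave bundle=gmem
#pragma HLS INTERFACE m_axi port=c offset=slave bundle=gmem
    /* ...plus an AXI-Lite slave port for control and the address offsets. */
#pragma HLS INTERFACE s_axilite port=a bundle=control
#pragma HLS INTERFACE s_axilite port=b bundle=control
#pragma HLS INTERFACE s_axilite port=c bundle=control
#pragma HLS INTERFACE s_axilite port=return bundle=control

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float sum = 0.0f;
            for (int k = 0; k < N; k++)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}

With the interfaces pinned down this way, the block presents standard AXI master and AXI-Lite control ports, and the software side becomes responsible for buffer management and for driving the control registers.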

 

 

To achieve the best algorithm performance, be prepared to familiarize yourself with, and use, both the SDSoC and Vivado HLS user guides and data sheets. SDSoC provides a superset of Vivado HLS’s capabilities.

 

If you are developing and accelerating your model from first principles, want the flexibility of testing and proving out a design in software first, and don’t intend to use a Zynq SoC, then the Vivado HLS toolset is the place to start. A design started in HLS is transferable to SDSoC if requirements change. Alternatively, if using a Zynq-based system is possible, it is worthwhile to start with SDSoC right away.

 

 

 

Adam Taylor’s MicroZed Chronicles, Part 212: Building an IoT Application with the MiniZed Dev Board

by Xilinx Employee ‎08-21-2017 10:34 AM - edited ‎08-21-2017 10:48 AM (9,443 Views)

 

By Adam Taylor

 

Avnet’s Zynq-based MiniZed is one of the most interesting dev boards we have looked at in this series. Thanks to its small form factor and its WiFi and Bluetooth capabilities, it is ideal for demonstrating Internet of Things (IoT) applications. We are now going to combine the FLIR Lepton camera module with the MiniZed and use them to create a simple IoT application.

 

 

Image1.jpg 

 

 

The approach I am going to follow for this demonstration is to update the MiniZed PetaLinux hardware design to do the following:

 

  • Interface with the FLIR Lepton camera module
  • Implement a video-processing pipeline that supports a 7-inch touch display connected to the MiniZed’s Pmod ports

 

The use of the local 7-inch touch display has two purposes. First, it demonstrates that the FLIR Lepton camera and the MiniZed are working correctly before I invest too much time in getting WiFi image transmission working. Second, the touch display could be used for local control and display if required, for example, in an industrial IoT (IIoT) application.

 

Opening the existing MiniZed Vivado project, you will notice it contains the Zynq SoC (a single-core Zynq, for the first time in this series) and an RTL block that interfaces with the WiFi and Bluetooth radio modules. This interface uses the processing system’s (PS’s) SDIO0 controller for WiFi and UART0 for Bluetooth. When we develop software, we must therefore remember to define STDIN/STDOUT as PS UART1 if we need a UART for debugging.

 

To this diagram we will add the following IP blocks:

 

  • Quad SPI Core – Configured for single-mode operation. Receives the VoSPI from the Lepton.
  • Video Timing Controller – Generates the video timing signals for display output.
  • VDMA – Reads an image from the PS DDR and converts it into a PL (programmable logic) AXI Stream.
  • AXI Stream to Video Out – Converts the AXI Streamed video data to parallel video with timing synchronization provided by the Video Timing Core.
  • Zed_ALI3_Controller – Display controller for the 7-inch touch-screen display.

 

The Zed_ALI3_Controller IP block can be downloaded from the Avnet GitHub repository. Once it is downloaded, running the Tcl script within the Vivado project will create an IP block we can include in our design.

 

The clocking architecture is now a little more complicated and includes the new Zed_ALI3_Controller block. This module generates the pixel clock, which is supplied to the VTC and the AXIS-to-Video-Out blocks. Zynq-generated clocks provide the 33.33MHz reference clock to the Zed_ALI3_Controller and the clocks for the AXI networks.

 

This demonstration uses two AXI networks. The first is the General-Purpose (GP) network. The software uses this GP AXI network to configure IP blocks within the PL, including the VDMA and VTC.

 

The second AXI network uses the High-Performance (HP) AXI interface to transfer images from the PS DDR memory into the image-processing stream in the PL.
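To make the division of labor concrete, here is a minimal bare-metal sketch that uses the standalone XAxiVdma driver over the GP network to point the VDMA read channel at a frame buffer, which the VDMA then fetches over the HP network. The device-ID macro, display resolution, and frame-buffer address are assumptions; check xparameters.h and your display timing for the real values:

#include "xaxivdma.h"
#include "xparameters.h"
#include "xstatus.h"

#define H_RES      800           /* assumed 7-inch display width, pixels */
#define V_RES      480           /* assumed 7-inch display height, lines */
#define BYTES_PP   4             /* 32-bit pixels as stored in PS DDR    */
#define FRAME_ADDR 0x10000000    /* assumed frame-buffer address         */

static XAxiVdma Vdma;

int init_vdma_read(void)
{
    XAxiVdma_Config *cfg = XAxiVdma_LookupConfig(XPAR_AXI_VDMA_0_DEVICE_ID);
    if (cfg == NULL ||
        XAxiVdma_CfgInitialize(&Vdma, cfg, cfg->BaseAddress) != XST_SUCCESS)
        return XST_FAILURE;

    XAxiVdma_DmaSetup rd = {0};
    rd.VertSizeInput     = V_RES;
    rd.HoriSizeInput     = H_RES * BYTES_PP;   /* line length in bytes */
    rd.Stride            = H_RES * BYTES_PP;
    rd.EnableCircularBuf = 1;                  /* cycle through the frame stores */

    if (XAxiVdma_DmaConfig(&Vdma, XAXIVDMA_READ, &rd) != XST_SUCCESS)
        return XST_FAILURE;

    /* Reuse one buffer for every frame store; the array depth must match
     * the number of frame stores configured in the VDMA core. */
    UINTPTR addrs[3] = { FRAME_ADDR, FRAME_ADDR, FRAME_ADDR };
    if (XAxiVdma_DmaSetBufferAddr(&Vdma, XAXIVDMA_READ, addrs) != XST_SUCCESS)
        return XST_FAILURE;

    return XAxiVdma_DmaStart(&Vdma, XAXIVDMA_READ);
}

Once the read channel is running, software simply writes pixels into the frame buffer in DDR, and the PL side streams them out to the display continuously.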

 

 

Image2.jpg

 

The complete block diagram

 

 

 

We will connect the FLIR Lepton camera module to the MiniZed shield connector as we did previously (p1 & p2), making use of the shield’s I2C and SPI connections.

 

The I2C pins are already mapped in the constraints file, which is used for the temperature and motion sensors. Therefore, all we need to do is add the SPI I/O pin locations and standards.

 

The FLIR Lepton camera’s AREF supply pin is not enabled on this shield connector. Therefore, to power the camera as in the previous example, we take 5V from a flying lead connected between the opposite shield connector’s 5V supply and the back of the FLIR Lepton camera.

 

 

Image3.jpg

 

FLIR Lepton Connected to the MiniZed in the Shield Header

 

 

 

To output the image to the 7-inch display, we’ll need both Pmod connectors. The required pin-out appears below. The differential pins on the Pmod connectors carry the video output lines, with the I/O standard set to TMDS_33.

 

 

 

Image4.jpg

 

Pmod Pinout

 

 

 

With the basic hardware design in place, all that remains is to generate the software builds. Initially, I will build a bare-metal application to verify that the design functions as intended. This step-by-step process stems from my strong belief in incremental verification as a project progresses.
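A hedged sketch of how that first bare-metal test might begin appears below: it initializes the AXI Quad SPI core with the polled XSpi driver and pulls individual VoSPI packets from the Lepton. The device-ID macro is an assumption; the 164-byte packet size and the 0x0F discard rule come from the Lepton VoSPI specification:

#include "xspi.h"
#include "xparameters.h"
#include "xstatus.h"

#define VOSPI_PKT_BYTES 164   /* 2-byte ID + 2-byte CRC + 160-byte payload */

static XSpi Spi;

int lepton_spi_init(void)
{
    XSpi_Config *cfg = XSpi_LookupConfig(XPAR_AXI_QUAD_SPI_0_DEVICE_ID);
    if (cfg == NULL ||
        XSpi_CfgInitialize(&Spi, cfg, cfg->BaseAddress) != XST_SUCCESS)
        return XST_FAILURE;

    /* Lepton VoSPI is SPI mode 3: clock idles high, data sampled on the
     * second edge; we drive the slave select manually. */
    XSpi_SetOptions(&Spi, XSP_MASTER_OPTION | XSP_CLK_ACTIVE_LOW_OPTION |
                          XSP_CLK_PHASE_1_OPTION | XSP_MANUAL_SSELECT_OPTION);
    XSpi_Start(&Spi);
    XSpi_IntrGlobalDisable(&Spi);           /* polled operation */
    return XSpi_SetSlaveSelect(&Spi, 0x01); /* first (only) slave */
}

/* Read one VoSPI packet; returns 1 if it carries a valid video line. */
int lepton_read_packet(u8 pkt[VOSPI_PKT_BYTES])
{
    if (XSpi_Transfer(&Spi, pkt, pkt, VOSPI_PKT_BYTES) != XST_SUCCESS)
        return 0;
    /* Packets whose ID nibble is 0xF are discard packets (no video). */
    return (pkt[0] & 0x0F) != 0x0F;
}

Collecting 60 valid packets in order yields one 80x60 frame from the Lepton, which the application can then copy into the VDMA frame buffer for display.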

 

Notes:

 

  • You need to install the MiniZed board definition files into your Vivado /data/boards/board_files directory to work with the MiniZed dev board. If you have not already done so, they are available here.

 

  • This blog welcomes Daniel Taylor, born today.

 

 

 

Code is available on GitHub as always.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

Digilent has added yet more members to the Arty FPGA dev board family with two flavors of the new Arty S7, which is based on medium-sized Xilinx Spartan-7 FPGAs. The $89 Arty S7-25 incorporates a Spartan-7 S25 device and the $109 Arty S7-50 incorporates a Spartan-7 S50 device, which offers significantly more resources, as you can see in this chart:

 

 

 

Arty S7 dev board specs.jpg 

 

 

Other than the difference in FPGAs, the two boards are identical, including 256Mbytes of DDR3L SDRAM, 128Mbits of Quad SPI flash memory, four Digilent Pmod ports, combined Arduino/chipKIT shield connectors, and some switches and LEDs. Here’s a photo of the board:

 

 

Arty S7 dev board.jpg 

 

 

The Arty S7 FPGA Dev Board, based on a Xilinx Spartan-7 FPGA

 

 

 

And here’s the block diagram for the board:

 

 

 

Digilent Arty S7 Block Diagram.jpg

 

 

The Arty S7 Dev Board Block Diagram

 

 

 

Please contact Digilent for more information about the Arty S7 Dev Board.

 

Two time-sensitive announcements from Xilinx about time-sensitive networking (TSN) for IIoT applications

by Xilinx Employee ‎08-18-2017 02:18 PM - edited ‎08-18-2017 02:18 PM (5,724 Views)

 

Time-sensitive networking (TSN) makes the IIoT (Industrial Internet of Things) run on time, and if you are developing any IIoT or Industrie 4.0 equipment, you’ll need to know about and then use the deterministic TSN protocol. Xilinx has two time-sensitive TSN announcements you need to know about sooner rather than later.

 

First, Xilinx’s Product Manager for Industrial Applications Michael Zapke will present a free TSN Webinar on September 7 at 7am PDT. Register for Michael’s TSN Webinar here.

 

Second, Xilinx has just put its IEEE-compliant 100M/1G TSN Subsystem IP core (with one year of maintenance) on sale for a “significantly reduced price,” now through September 29, via a TSN Headstart program. (Sorry, that’s all I’m allowed to say about the sale.) However, you will definitely want to check into this sale if you’re developing IIoT equipment based on Xilinx’s Zynq SoC or Zynq UltraScale+ MPSoC.

 

If you want to learn more about this sale and wish to request access to additional information about the TSN Subsystem IP core, click here.

 

Note: The number of TSN Headstart program participants is limited, so act sooner rather than later—like now—if this offer interests you.

 

 

Xilinx TSN Diagram.jpg 

 

 

 

For more TSN coverage in Xilinx’s Xcell Daily blog, see:

 

 

 

 

 

 

 

Every device family in the Xilinx UltraScale+ portfolio (Virtex UltraScale+ FPGAs, Kintex UltraScale+ FPGAs, and Zynq UltraScale+ MPSoCs) has members with 28Gbps-capable GTY transceivers. That’s likely to be important to you as the number and variety of small, 28Gbps interconnect options grow. You have many choices in such interconnect these days, including:

 

 

  • QSFP28 Optical
  • QSFP28 Direct-Attach Copper
  • SFP28 Optical
  • SFP28 Direct-Attach Copper
  • Samtec FireFly AOC (Active Optical Cable or Twinax ribbon cable)

 

 

The following 5.5-minute video demonstrates all of these interfaces operating with 25.78Gbps lanes on Xilinx VCU118 and KCU116 Eval Kits, as concisely explained (as usual) by Xilinx’s “Transceiver Marketing Guy” Martin Gilpatric. Martin also discusses some of the design challenges associated with these high-speed interfaces.

 

But first, as a teaser, I could not resist showing you the wide-open IBERT eye on the 25.78Gbps Samtec FireFly AOC:

 

 

 

Kintex Ultrascale Firefly AOC IBERT Eye.jpg 

 

 

 

Now that’s a desirable eye.

 

Here’s the new video:

 


Amazon Web Services (AWS) is now offering the Xilinx SDAccel Development Environment as a private preview. SDAccel empowers hardware designers to easily deploy their RTL designs in the AWS F1 FPGA instance. It also automates the acceleration of code written in C, C++, or OpenCL by building application-specific accelerators on the F1. This limited-time preview is hosted in a private GitHub repo and supported through an AWS SDAccel forum. To request early access, click here.

 

Last September at the GNU Radio Conference in Boulder, Colorado, Ettus Research announced the RFNoC & Vivado Challenge for SDR (software-defined radio). Ettus’ RFNoC (RF Network on Chip) is designed to allow you to efficiently harness the latest-generation FPGAs for SDR applications without being an expert firmware or FPGA developer. Today, Ettus Research and Xilinx announced the three challenge winners.

 

Ettus’ GUI-based RFNoC design tool allows you to create FPGA applications as easily as you can create GNU Radio flowgraphs. This includes the ability to seamlessly transfer data between your host PC and an FPGA. It dramatically eases the task of FPGA off-loading in SDR applications. Ettus’ RFNoC is built upon Xilinx’s Vivado HLS.

 

Here are the three winning teams and their projects:

 

 

 

 

 

Finally, here’s a 5-minute video announcing the winners along with the prizes they have won:

 

 

 

About the Author
Steve Leibson is the Director of Strategic Marketing and Business Planning at Xilinx. He started as a system design engineer at HP in the early days of desktop computing, then switched to EDA at Cadnetix, and subsequently became a technical editor for EDN Magazine. He has served as Editor in Chief of EDN Magazine, Embedded Developers Journal, and Microprocessor Report. He has extensive experience in computing, microprocessors, microcontrollers, embedded systems design, design IP, EDA, and programmable logic.