
 

By Adam Taylor

 

At the end of the Sysmon AMS blogs, I briefly introduced the several PLLs within the Zynq UltraScale+ MPSoC. That introduction suggests it’s time to talk about the clocking architecture of the MPSoC device.

 

As with the original Zynq SoC, the PS (processing system) in the Zynq UltraScale+ MPSoC is the system master. So we will initially focus upon its clocking architecture.  Within the PS there are three main clock inputs:

 

  • PS Reference Clock (PSS_REF_CLK)
  • Alternate PS Reference Clock (PSS_ALT_REF_CLK)
  • Video Reference Clock (PSS_VIDEO_REF_CLK)

 

While the PS reference clock has a dedicated input pin, PSS_ALT_REF_CLK and PSS_VIDEO_REF_CLK enter via the MIO and are enabled or disabled in Vivado on the I/O Configuration customization tab. If we plan to use these clocks, we need to ensure there is no conflict with other planned uses of the MIO.

 

 

 

Image1.jpg

 

 

Enabling the Alternate reference clock and the video clock

 

 

Once these have been enabled, we can configure them on the clock configuration input clock tab as shown below:

 

 

Image2.jpg 

 

 

Internally, the PS has four clock groups that provide all the required clocks:

 

  • Main Clock Group (MCG): This group covers the Zynq UltraScale+ MPSoC’s LPD and FPD power domains. Within the MCG, we find the five PLLs in the PS (the DDR, APU, VIDEO, RPU, and IO PLLs). The first three PLLs are within the FPD while the last two are within the LPD.
  • Secure Clock Group (SCG): This group provides the clocks for the Zynq UltraScale+ MPSoC’s PMU and the CSU. It is generated internally via a ring oscillator.
  • Real Time Clock Group (RTC): This group provides the clock for the RTC and requires an external crystal attached to two dedicated Zynq UltraScale+ MPSoC PS I/O pins (PS_PADI, PS_PADO).
  • Interface Clock Group (ICG): This group consists of clocks provided to the PS via interfaces (e.g. from the PL (programmable logic) as part of the AXI transactions).

 

 

We’ll now focus on the MCG because this is the group with which we will have the most interaction. Within this group, we choose which of the five PLLs is used to clock the Zynq UltraScale+ MPSoC’s processors and peripherals within the LPD and FPD. We can do this via the clock configuration -> output clocks tab. Here we can configure the clocking for both the low- and full-power domains.

 

 

Image3.jpg 

 

 

To generate a PLL output frequency as close as possible to the desired frequency, we may want to change the PLL input-clock source. Several potential clock sources can be used to clock each of the PLLs within the Zynq UltraScale+ MPSoC.

 

As mentioned above, we can use PS_REF_CLK, PS_ALT_REF_CLK, or PS_VIDEO_REF_CLK. These clocks are directly input into the PS. We can also use one of the four GT_REF_CLKs or the AUX_REF_CLK. The latter reference clock is provided from the PL while the former clocks are provided by the PS-GTR transceivers. The relevant PLL control register selects which of these clocks drives the PLL. These registers reside in the CRL_APB module for low-power-domain PLLs and in the CRF_APB module for full-power-domain PLLs.
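For readers who want to see this at the register level, here is a minimal bare-metal sketch of switching a PLL’s reference source. The base address, register offset, and PRE_SRC field position below are my assumptions from the Zynq UltraScale+ register map (verify them against UG1087 before use), and a production sequence would also bypass and re-lock the PLL around the write.

#include "xil_io.h"

/* Assumed addresses and field positions -- verify against UG1087. */
#define CRL_APB_BASE       0xFF5E0000U               /* LPD clock control module (assumed) */
#define IOPLL_CTRL         (CRL_APB_BASE + 0x20U)    /* IOPLL control register (assumed offset) */
#define PLL_PRE_SRC_SHIFT  20U                       /* PRE_SRC field, bits [22:20] (assumed) */
#define PLL_PRE_SRC_MASK   (0x7U << PLL_PRE_SRC_SHIFT)

/* Select a new reference-clock source for the IOPLL.
 * A real sequence would first bypass the PLL, write the new source,
 * then re-lock and un-bypass the PLL. */
static void set_iopll_source(u32 src_sel)
{
    u32 reg = Xil_In32(IOPLL_CTRL);

    reg &= ~PLL_PRE_SRC_MASK;
    reg |= (src_sel << PLL_PRE_SRC_SHIFT) & PLL_PRE_SRC_MASK;
    Xil_Out32(IOPLL_CTRL, reg);
}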

 

We can select which of the four GT reference clocks is provided as the GT_REF_CLK using the Serial Input Output Unit (SIOU) module’s CRX_CNTRL register.

 

Now that we understand the Zynq UltraScale+ MPSoC’s clocking and how we set the desired frequency for each of the subsystems, we will explore the subsystems in more detail in the MicroZed Chronicles blogs that follow.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg 

 

 

 

 

 

 

Image Matters’ Origami B20 module, based on a Xilinx Kintex UltraScale KU060 FPGA, is a small 94x53mm module that you can use to perform all sorts of high-speed processing. (See “Image Matters launches Origami Ecosystem for developing advanced 4K/8K video apps using the FPGA-based Origami module.”) For example, you can use it for a variety of video-compression applications using various IP compression cores including MPEG, JPEG-2000, and TICO. You can also use it for cloud-computing and neural-network applications such as image detection. The key thing is that the small Origami B20 module puts everything you need to run the FPGA on one small module including SDRAM, Flash memory, the power supply, a backup battery, and security features (including tamper protection).

 

Here’s a short, 2.5-minute, Powered by Xilinx video with more information about the Origami B20 module:

 

 

 

 

 

By Adam Taylor

 

A couple of weeks ago, I talked about the Xilinx reVISION stack and the support it provides for OpenVX and OpenCV. One of the most exciting things I explained was how we can accelerate several OpenCV functions (which include the OpenVX Core functions) using the Zynq SoC’s programmable logic. What I did not look at was the other significant part of the reVISION stack: its support for machine learning.

 

Machine learning is increasingly important for embedded-vision applications because it helps systems evolve from being vision-enabled to being vision-guided autonomous systems. Machine learning is often used in embedded-vision applications to identify and classify information contained within an image. The embedded-vision system uses these identifications and classifications to make informed decisions in real time, enabling increased interaction with the environment.

 

For those unfamiliar with machine learning, it is most often implemented through the creation and training of a neural network. Neural networks are modelled upon the human cerebral cortex in that each neuron receives an input, processes it, and communicates the processed signal to other neurons. Neural networks typically consist of an input layer, internal (hidden) layer(s), and an output layer.

 

 

Image1.jpg

 

 

 

Those familiar with machine learning may have come across the term “deep learning.” This is where there are several hidden layers in the neural network, allowing more complex machine-learning algorithms to be implemented.

 

When working with neural networks in embedded-vision applications, we need to use a 2D network. This is where Convolutional Neural Networks (CNNs) are used. CNNs are deep-learning networks that contain several convolutional and sub-sampling layers along with a separate, fully connected network to perform the final classification. Within the convolution layer, the input image will be broken down into several overlapping smaller tiles.

 

The results from this convolution layer are used to create an activation map via an activation layer placed in the network before further sub-sampling and additional convolution stages, all preceding the final, fully connected network. The exact implementation of the CNN varies depending upon the network architecture implemented (GoogLeNet, SSD, AlexNet). However, a CNN will typically contain at least the following elements (a minimal C sketch of these stages follows the list):

 

 

  • Convolution – Identifies features within the image
  • Rectified Linear Unit (ReLU) – Activation layer that creates an activation map following the convolution
  • Max Pooling – Performs sub-sampling between layers
  • Fully Connected layer – Performs the final classification
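To make these elements concrete, here is a minimal, single-channel C sketch of a 3x3 convolution with ReLU activation followed by 2x2 max pooling. It is purely illustrative: real CNN layers add multiple input/output channels, padding, strides, biases, and trained weights.

#include <stdio.h>

#define IN_DIM   8
#define K_DIM    3
#define CONV_DIM (IN_DIM - K_DIM + 1)  /* 6x6 feature map, no padding */
#define POOL_DIM (CONV_DIM / 2)        /* 3x3 map after 2x2 max pooling */

/* Convolution stage: slide a 3x3 kernel over the image, then apply
 * the ReLU activation to build the activation map. */
static void conv3x3_relu(float in[IN_DIM][IN_DIM],
                         float k[K_DIM][K_DIM],
                         float out[CONV_DIM][CONV_DIM])
{
    for (int r = 0; r < CONV_DIM; r++) {
        for (int c = 0; c < CONV_DIM; c++) {
            float acc = 0.0f;
            for (int i = 0; i < K_DIM; i++)
                for (int j = 0; j < K_DIM; j++)
                    acc += in[r + i][c + j] * k[i][j];
            out[r][c] = (acc > 0.0f) ? acc : 0.0f;  /* ReLU */
        }
    }
}

/* Max-pooling stage: 2x2 sub-sampling of the activation map. */
static void maxpool2x2(float in[CONV_DIM][CONV_DIM],
                       float out[POOL_DIM][POOL_DIM])
{
    for (int r = 0; r < POOL_DIM; r++) {
        for (int c = 0; c < POOL_DIM; c++) {
            float m = in[2 * r][2 * c];
            if (in[2 * r][2 * c + 1]     > m) m = in[2 * r][2 * c + 1];
            if (in[2 * r + 1][2 * c]     > m) m = in[2 * r + 1][2 * c];
            if (in[2 * r + 1][2 * c + 1] > m) m = in[2 * r + 1][2 * c + 1];
            out[r][c] = m;
        }
    }
}

int main(void)
{
    float img[IN_DIM][IN_DIM] = { { 0 } };   /* dummy input image */
    float edge_k[K_DIM][K_DIM] = {           /* made-up edge-detect kernel */
        { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 }
    };
    float fmap[CONV_DIM][CONV_DIM], pooled[POOL_DIM][POOL_DIM];

    img[3][4] = 1.0f;                        /* a single bright pixel */
    conv3x3_relu(img, edge_k, fmap);
    maxpool2x2(fmap, pooled);
    printf("pooled[1][1] = %f\n", pooled[1][1]);
    return 0;
}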

 

 

The weights used for each of these elements are determined via training, and one of the CNN’s advantages is the relative ease of training the network. Training requires large data sets and high-performance computers to correctly determine the weights for each stage.

 

To ease the development of machine-learning applications, many engineers use a framework like Caffe, which supports the implementation and training of machine learning. The use of frameworks allows us to work at a higher level and maximize reuse. Using a framework, we don’t need to start from scratch each time we develop an application.

 

The Xilinx reVISION stack provides an integrated Caffe framework flow, which allows us to take the prototxt definition of the network and the trained weights to deploy the machine-learning application. (Note that network training is separate and distinct from deployment.) To enable this, the Xilinx reVISION stack provides several hardware-accelerated functions that can be implemented within the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic) to create the machine-learning inference engine. The reVISION stack also provides examples for a wide range of network structures, enabling us to get up and running with our machine-learning application without the need to initially compile the PL design. Once we are happy with the machine-learning application, we can then use the SDSoC flow to develop our own embedded-vision application containing the optimized machine-learning application.

 

 

Image2.jpg 

 

 

Using the Zynq PL provides an optimized implementation that delivers faster response times when interacting with the embedded-vision system’s environment. This is especially true as machine-learning applications are increasingly implemented using fixed-point integer arithmetic such as INT8, which is ideal for implementation in the PL’s DSP elements.

 

Machine learning is going to be a hot area for several applications. So I will be coming back to this topic in detail as the MicroZed Chronicles progress—with some examples of course.

 

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

 

MicroZed Chronicles Second Year.jpg 

 

 

If you’re going to strap two 12Gsamples/sec, 16-bit DACs and two 6.4Gsamples/sec, 12-bit ADCs into your VPX/AMC module, you’d better include massive real-time DSP horsepower to tame them. That’s exactly what VadaTech has done with its VPX599 and AMC599 modules by placing a Xilinx Kintex UltraScale KU115 FPGA (along with 16 or 20Gbytes of high-speed DDR4 SDRAM) on the modules’ digital carrier board mated to an FMC analog converter board.

 

 

 

VadaTech AMC599.jpg 

 

VadaTech AMC599 ADC/DAC Module

 

 

VadaTech VPX599.jpg

 

VadaTech VPX599 ADC/DAC Module

 

 

 

Here’s a block diagram of the AMC599 module (the VPX599 block diagram is quite similar):

 

 

 

VadaTech AMC599 Block Diagram.jpg
 

 

VadaTech AMC599 ADC/DAC Module Block Diagram

 

 

 

At these conversion rates, raw data streams to and from the host CPU are quite impractical so you must, repeat must, have on-board processing and local storage—and what other processing genie besides a Xilinx UltraScale FPGA would you trust to handle and process those sorts of extreme streams?

 

 

Please contact VadaTech directly for more information on the VPX599 and AMC599 modules.

 

 

 

 

 

 

The just-announced VICO-4 TICO SDI Converter from Village Island employs visually lossless 4:1 TICO compression to funnel 4K60p video (carried on four 3G-SDI streams or one 12G-SDI stream) onto a single 3G-SDI output stream, which reduces infrastructure costs for transport, cabling, routing, and compression in broadcast networks.

 

 

 

Village Island VICO-4.jpg

 

 

VICO-4 4:1 SDI Converter from Village Island

 

 

 

Here’s a block diagram of what’s going on inside of Village Island’s VICO-4 TICO SDI Converter:

 

 

Village Island VICO-4 Block Diagram.jpg 

 

And here’s a diagram showing you what broadcasters can do with this sort of box:

 

 

Village Island VICO-4 Distribution Diagram.jpg

 

 

 

The reason this is even possible in a real-time broadcast environment is that the lightweight intoPIX TICO compression algorithm has very low latency (just a few video lines) when implemented in hardware as IP. (Software-based, frame-by-frame video compression is totally out of the question in an application such as this because it introduces too much delay.)

 

Looking at the VICO-4’s main (and only) circuit board reveals one main chip implementing the 4:1 compression and signal multiplexing. And that chip is… a Xilinx Kintex UltraScale KU035 FPGA. It has plenty of on-chip programmable logic for the TICO compression IP and it has sixteen 16.3Gbps transceiver ports—more than enough to handle the 3G- and 12G-SDI I/O required by this application.

 

 

Village Island VICO-4 pcb.jpg 

 

 

Note: Paltek in Japan is distributing Village Island’s VICO-4 board in Japan as an OEM component. The board needs 12Vdc at ~25VA.

 

 

 

For more information about TICO compression IP, see:

 

 

 

 

 

 

 

 

Next week at OFC 2017 in Los Angeles, Acacia Communications, Optelian, Precise-ITC, Spirent, and Xilinx will present the industry’s first interoperability demo supporting 200/400GbE connectivity over standardized OTN and DWDM. Putting that succinctly, the demo is all about packing more bits/λ, so that you can continue to use existing fiber instead of laying more.

 

Callite-C4 400GE/OTN Transponder IP from Precise-ITC instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will map native 200/400GbE traffic—generated by test equipment from Spirent—into 2x100 and 4x100 OTU4-encapsulated signals. The 200GbE and 400GbE standards are still in flux, so instantiating the Precise-ITC transponder IP in an FPGA allows the design to quickly evolve with the standards with no BOM or board changes. Concise translation: faster time to market with much less risk.

 

 

Precise-ITC Callite-4 IP.jpg

 

Callite-C4 400GE/OTN Transponder IP Block Diagram

 

 

 

Optelian’s TMX-2200 200G muxponder, scheduled for release later this year, will muxpond the OTU4 signals into 1x200Gbps or 2x200Gbps DP-16QAM using Acacia Communications’ CFP2-DCO coherent pluggable transceiver.

 

 

The Optelian and Precise-ITC exhibit booths at OFC 2017 are 4139 and 4141 respectively.

 

 

 

By Adam Taylor

 

In looking at the Zynq UltraScale+ MPSoC’s AMS capabilities so far, we have introduced the two slightly different Sysmon blocks residing within the Zynq UltraScale+ MPSoC’s PS (processing system) and PL (programmable logic). In this blog, I am going to demonstrate how we can get the PS Sysmon up and running on both the ARM Cortex-A53 and Cortex-R5 processor cores in the Zynq UltraScale+ MPSoC’s PS. There is little difference between using the two processor types, but I think it important to show how to use both.

 

The process to use the Sysmon is the same as it is for many of the peripherals we have looked at previously with the MicroZed Chronicles:

 

  1. Look Up the configuration of the Sysmon Peripheral (XSysMonPsu_LookupConfig)
  2. Initialize the Sysmon Peripheral (XSysMonPsu_CfgInitialize)
  3. Reset the Sysmon (XSysMonPsu_Reset)
  4. Set the Sequencer to safe mode while we update its configuration (XSysMonPsu_SetSequencerMode)
  5. Disable the alarms (XSysMonPsu_SetAlarmEnables)
  6. Set the Sequencer Enables for the channels we want to sample (XSysMonPsu_SetSeqChEnables)
  7. Set the ADC Clock Divisor (XSysMonPsu_SetAdcClkDivisor)
  8. Set the Sequencer Mode (XSysMonPsu_SetSequencerMode)

 

The function names in parentheses are the driver functions we use to perform each operation, provided we pass the correct parameters. In the simplest case, as in this example, we can then poll the output registers using the XSysMonPsu_GetAdcData() function. All of these functions are defined within the file xsysmonpsu.h, which is available under the Board Support Package Lib Src directory in SDK.
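Put together, the sequence looks roughly like the sketch below. The function names are the real xsysmonpsu.h driver calls listed above, but the exact argument lists, sequencer-mode constants, and the channel mask are my assumptions from memory of the driver, so verify them against xsysmonpsu.h in your BSP before use.

#include "xparameters.h"
#include "xstatus.h"
#include "xsysmonpsu.h"

/* Placeholder: OR together the XSYSMONPSU_SEQ_CH*_<Param>_MASK values
 * (from xsysmonpsu_hw.h) for the channels you want to sample. */
#define SEQ_CHANNEL_MASK  0x0 /* hypothetical -- build from the real masks */

static XSysMonPsu SysMonInst;

int init_ps_sysmon(void)
{
    /* Steps 1-2: look up the configuration and initialize the driver. */
    XSysMonPsu_Config *cfg = XSysMonPsu_LookupConfig(XPAR_XSYSMONPSU_0_DEVICE_ID);
    if (cfg == NULL)
        return XST_FAILURE;
    XSysMonPsu_CfgInitialize(&SysMonInst, cfg, cfg->BaseAddress);

    /* Step 3: reset the Sysmon. */
    XSysMonPsu_Reset(&SysMonInst);

    /* Step 4: park the sequencer in safe mode while reconfiguring. */
    XSysMonPsu_SetSequencerMode(&SysMonInst, XSM_SEQ_MODE_SAFE, XSYSMON_PS);

    /* Step 5: disable all alarms. */
    XSysMonPsu_SetAlarmEnables(&SysMonInst, 0x0, XSYSMON_PS);

    /* Step 6: enable the channels we want the sequencer to sample. */
    XSysMonPsu_SetSeqChEnables(&SysMonInst, SEQ_CHANNEL_MASK, XSYSMON_PS);

    /* Steps 7-8: set the ADC clock divisor and restart the sequencer. */
    XSysMonPsu_SetAdcClkDivisor(&SysMonInst, 32, XSYSMON_PS);
    XSysMonPsu_SetSequencerMode(&SysMonInst, XSM_SEQ_MODE_CONTINPASS, XSYSMON_PS);

    /* Polling example: read the raw on-chip temperature value. */
    u32 raw_temp = XSysMonPsu_GetAdcData(&SysMonInst, XSM_CH_TEMP, XSYSMON_PS);
    (void)raw_temp;

    return XST_SUCCESS;
}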

 

Examining the functions, you will notice that each of the functions used in steps 4 to 8 requires an input parameter called SysmonBlk. You must pass this parameter to the function. This parameter is how we select which Sysmon (within the PS or the PL) we want to address. For this example, we specify the PS Sysmon using XSYSMON_PS, which is also defined within xsysmonpsu.h. If we want to address the PL Sysmon, we use the XSYSMON_PL definition, which we will be looking at next time.

 

Another useful header file is xsysmonpsu_hw.h. Within this file, we can find the definitions required to correctly select the channels we wish to sample in the sequencer. These are defined in the format:

 

 

XSYSMONPSU_SEQ_CH*_<Param>_MASK

 

 

This simple example samples the following within the PS Sysmon:

 

  1. Temperature
  2. Low Power Core Supply Voltage
  3. Full Power Core Supply Voltage
  4. DDR Supply Voltage
  5. Supply voltage for PS IO banks 0 to 3

 

We can use the conversion functions provided within xsysmonpsu.h to convert the raw values supplied by the ADC into temperature and voltage. However, the PS IO banks are capable of supporting 3.3V logic. As such, the conversion macro from raw reading to voltage is not correct for these IO banks or for the HD banks in the PL. (We will look at the different IO bank types in another blog.)

 

The full-scale voltage is 3V for most of the voltage conversions. However, in line with UG580 (page 43), we need to use a full scale of 6V for the PS IO banks. Otherwise we will see a value only half of what we expect for a bank’s supply voltage. With this in mind, my example contains a conversion function at the top of the source file to be used for these IO banks, ensuring that we get the correct value.
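A conversion helper of the kind described might look like the sketch below. The 16-bit result width mirrors what the stock 3V conversion assumes; check XSysMonPsu_RawToVoltage() in xsysmonpsu.h before relying on the exact scaling.

/* Raw-to-voltage conversion for the 3.3V-capable PS IO banks, which use
 * a 6V full scale (per UG580) instead of the 3V full scale assumed by
 * the stock XSysMonPsu_RawToVoltage() macro. A 16-bit, left-justified
 * ADC result is assumed here. */
static float SysMonPsu_RawToPsIoVoltage(u32 raw)
{
    return ((float)raw * 6.0f) / 65536.0f;
}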

 

The Zynq UltraScale+ MPSoC architecture permits both the APU (the ARM Cortex-A53 processors) and the RPU (the ARM Cortex-R5 processors) to address the Sysmon. To demonstrate this, the same file was used in applications first targeting an ARM Cortex-A53 processor in the APU and then targeting the ARM Cortex-R5 processor in the RPU. I used Core 0 in both cases.

 

The only difference between these two cases was the need to create new applications that select the core to be targeted and then updating the FSBL to load the correct core. (See “Adam Taylor’s MicroZed Chronicles, Part 172: UltraZed Part 3—Saying hello world and First-Stage Boot” for more information on how to do this.)

 

 

 

Image1.jpg

 

Results when using the ARM Cortex-A53 Core 0 Processor

 

 

 

Image2.jpg

 

Results when using the ARM Cortex-R5 Core 0 Processor

 

 

 

When I ran the same code, which is available in the GitHub repository, I received the results shown above in the terminal program, demonstrating the code working on both the ARM Cortex-A53 and ARM Cortex-R5 cores.

 

Next time we will look at how we can use the PL Sysmon.

 

 

 

Code is available on GitHub as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here

 

 

MicroZed Chronicles hardcopy.jpg 

  

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

 

 

 

EETimes’ Junko Yoshida with some expert help analyzes this week’s Xilinx reVISION announcement

by Xilinx Employee, 03-15-2017

 

Image3.jpg

This week, EETimes’ Junko Yoshida published an article titled “Xilinx AI Engine Steers New Course” that gathers some comments from industry experts and from Xilinx with respect to Monday’s reVISION stack announcement. To recap, the Xilinx reVISION stack is a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference.

 

(See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)

 

As Xilinx Senior Vice President of Corporate Strategy Steve Glaser tells Yoshida, “Xilinx designed the stack to ‘enable a much broader set of software and systems engineers, with little or no hardware design expertise, to develop intelligent vision-guided systems easier and faster.’”

 

Yoshida continues:

 

While talking to customers who have already begun developing machine-learning technologies, Xilinx identified ‘8 bit and below fixed point precision’ as the key to significantly improve efficiency in machine-learning inference systems.

 

 

Yoshida also interviewed Karl Freund, Senior Analyst for HPC and Deep Learning at Moor Insights & Strategy, who said:

 

“Artificial Intelligence remains in its infancy, and rapid change is the only constant.” In this circumstance, Xilinx seeks “to ease the programming burden to enable designers to accelerate their applications as they experiment and deploy the best solutions as rapidly as possible in a highly competitive industry.”

 

 

She also quotes Loring Wirbel, a Senior Analyst at The Linley group, who said:

 

What’s interesting in Xilinx's software offering, [is that] this builds upon the original stack for cloud-based unsupervised inference, Reconfigurable Acceleration Stack, and expands inference capabilities to the network edge and embedded applications. One might say they took a backward approach versus the rest of the industry. But I see machine-learning product developers going a variety of directions in trained and inference subsystems. At this point, there's no right way or wrong way.

 

 

There’s a lot more information in the EETimes article, so you might want to take a look for yourself.

 

 

 

 

Next week at the OFC Optical Networking and Communication Conference & Exhibition in Los Angeles, Xilinx will be in the Ethernet Alliance booth demonstrating the industry’s first, standard-based, multi-vendor 400GE network. A 400GE MAC and PCS instantiated in a Xilinx Virtex UltraScale+ VU9P FPGA will be driving a Finisar 400GE CFP8 optical module, which in turn will communicate with a Spirent 400G test module over a fiber connection.

 

In addition, Xilinx will be demonstrating:

 

 

PAM4 Eye.jpg

 

 

  • The world’s first complete FlexE 1.0 solution showcasing bonding, sub-rating and channelization on UltraScale+ FPGAs.

 

  • LLDP packet snooping on transport line cards to allow SDN controllers to build network topology maps, which aid data-center network automation.

 

  • Optical technology abstraction in DCI transport.

 

If you’re visiting OFC, be sure to stop by the Xilinx booth (#1809).

 

 

 

PLDA has announced the XpressRICH4-AXI PCIe 4.0 configurable IP block, which ties an on-chip AXI bus to PCIe 4.0. The IP block complies with the PCI Express Base 4.0r7 specification and supports endpoint, root-port, and dual-mode configurations. The IP supports Xilinx Virtex-7, Virtex UltraScale, and Kintex UltraScale devices and can be used for ASIC design as well.

 

Here’s a block diagram of the core:

 

 

PLDA XpressRICH4-AXI PCIe 4 IP.jpg

 

 

PLDA XpressRICH4-AXI PCIe 4.0 configurable IP Block Diagram

 

 

Please contact PLDA directly for more information about this IP.

 

 

 

This week at Embedded World in Nuremberg, Lynx Software Technologies is demonstrating its port of the LynxSecure Separation Kernel hypervisor to the ARM Cortex-A53 processors on the Xilinx Zynq UltraScale+ MPSoC. According to Robert Day, Vice President of Marketing at Lynx, "ARM designers are now able to run safety critical environments alongside a general purpose OS like Linux or LynxOS RTOS on the same Xilinx processor without compromising safety, security or real-time performance. Use cases include automotive systems based on environments such as AUTOSAR RTA-BSW from ETAS and avionics designs using LynxOS-178 RTOS from Lynx. Designers can match the security of air-gap hardware partitioning without incurring the cost, power and size overhead of separate hardware."

 

The LynxSecure port to the Zynq UltraScale+ MPSoC supports modular software architectures and tight integration with the Zynq UltraScale+ MPSoC’s FPGA fabric for hosting bare-metal applications, trusted functions, and open-source projects on a single SoC with secure partitioning.  You have the option to decide which functions run in software using LynxSecure bare-metal apps and which functions you need to hardware-accelerate through the Zynq UltraScale+ MPSoC’s FPGA fabric.

 

The LynxSecure technology was designed to satisfy high-assurance computing requirements in support of the NIST, NSA Common Criteria, and NERC CIP evaluation processes, which are used to regulate military and industrial computing environments.

 

The LynxSecure Separation Kernel hypervisor provides:

 

  • Safety & Security
  • Domain Isolation
  • Trusted Execution Environments
  • Reference Monitor Plugins (e.g. firewalls, IDS, encryption, guards)

 

 

Here’s a diagram of the LynxSecure Separation Kernel hypervisor architecture:

 

 

 

LynxSecure Architecture.jpg 

 

 

 

Please contact Lynx Software Technologies directly for information about the LynxSecure Separation Kernel hypervisor.

 

 

 

Image3.jpg

Today, EEJournal’s Kevin Morris has published a review article titled “Teaching Machines to See: Xilinx Launches reVISION” following Monday’s announcement of the Xilinx reVISION stack for developing vision-guided applications. (See “Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge.”)

 

Morris writes:

 

“But vision is one of the most challenging computational problems of our era. High-resolution cameras generate massive amounts of data, and processing that information in real time requires enormous computing power. Even the fastest conventional processors are not up to the task, and some kind of hardware acceleration is mandatory at the edge. Hardware acceleration options are limited, however. GPUs require too much power for most edge applications, and custom ASICs or dedicated ASSPs are horrifically expensive to create and don’t have the flexibility to keep up with changing requirements and algorithms.

 

“That makes hardware acceleration via FPGA fabric just about the only viable option. And it makes SoC devices with embedded FPGA fabric - such as Xilinx Zynq and Altera SoC FPGAs - absolutely the solutions of choice. These devices bring the benefits of single-chip integration, ultra-low latency and high bandwidth between the conventional processors and the FPGA fabric, and low power consumption to the embedded vision space.

 

Later on, Morris gets to the fly in the ointment:

 

“Oh, yeah. There’s still that ‘almost impossible to program’ issue.”

 

And then he gets to the solution:

 

“reVISION, announced this week, is a stack - a set of tools, interfaces, and IP - designed to let embedded vision application developers start in their own familiar sandbox (OpenVX for vision acceleration and Caffe for machine learning), smoothly navigate down through algorithm development (OpenCV and NN frameworks such as AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN), targeting Zynq devices without the need to bring in a team of FPGA experts. reVISION takes advantage of Xilinx’s previously-announced SDSoC stack to facilitate the algorithm development part. Xilinx claims enormous gains in productivity for embedded vision development - with customers predicting cuts of as much as 12 months from current schedules for new product and update development.

 

“In many systems employing embedded vision, it’s not just the vision that counts. Increasingly, information from the vision system must be processed in concert with information from other types of sensors such as LiDAR, SONAR, RADAR, and others. FPGA-based SoCs are uniquely agile at handling this sensor fusion problem, with the flexibility to adapt to the particular configuration of sensor systems required by each application. This diversity in application requirements is a significant barrier for typical “cost optimization” strategies such as the creation of specialized ASIC and ASSP solutions.

 

“The performance rewards for system developers who successfully harness the power of these devices are substantial. Xilinx is touting benchmarks showing their devices delivering an advantage of 6x images/sec/watt in machine learning inference with GoogLeNet @batch = 1, 42x frames/sec/watt in computer vision with OpenCV, and ⅕ the latency on real-time applications with GoogLeNet @batch = 1 versus “NVidia Tegra and typical SoCs.” These kinds of advantages in latency, performance, and particularly in energy-efficiency can easily be make-or-break for many embedded vision applications.”

 

 

But don’t take my word for it, read Morris’ article yourself.

 

 

 

 

Using the Xilinx RFSoC for Satcom applications

by Xilinx Employee, 03-13-2017

 

By Dr. Rajan Bedi, Spacechips

 

Several of my satcom ground-segment clients and I are considering Xilinx's recently announced RFSoC for future transceivers and I want to share the benefits of this impending device. (Note: For more information on the Xilinx RFSoC, see “Xilinx announces RFSoC with 4Gsamples/sec ADCs and 6.4Gsamples/sec DACs for 5G, other apps. When we say “All Programmable,” we mean it!”)

 

Direct RF/IF sampling and direct DAC up-conversion are currently being used very successfully in-orbit and on the ground. For example, bandpass sampling provides flexible RF frequency planning with some spacecraft by directly digitizing L- and S-band carriers to remove expensive and cumbersome superheterodyne down-conversion stages. Today, many navigation satellites directly re-construct the L-band carrier from baseband data without using traditional up-conversion. Direct RF/IF Sampling and direct DAC up-conversion have dramatically reduced the BOM, size, weight, power consumption, as well as the recurring and non-recurring costs of transponders. Software-defined radio (SDR) has given operators real scalability, reusability, and reconfigurability. Xilinx's new RFSoC will offer further hardware integration advantages for the ground segment.

 

The Xilinx RFSoC integrates multi-Gsamples/sec ADCs and DACs into a 16nm Zynq UltraScale+ MPSoC. At this geometry and with this technology, the mixed-signal converters draw very little power and economies of scale make it possible to add a lot of digital post-processing (Small A/Big D!) to implement functions such as DDC (digital down-conversion), DUC (digital up-conversion), AGC (automatic gain control), and interleaving calibration.

 

While CMOS scaling has improved ADC and DAC sample rates, which results in greater bandwidths at lower power, the transconductance of transistors and the size of the analog input/output voltage swing are reduced for analog designs, which impacts G/T at the satellite receiver. (G/T is antenna gain-to-noise-temperature, a figure of merit in the characterization of antenna performance where G is the antenna gain in decibels at the receive frequency and T is the equivalent noise temperature of the receiving system in kelvins. The receiving system’s noise temperature is the summation of the antenna noise temperature and the RF-chain noise temperature from the antenna terminals to the receiver output.)
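In the usual link-budget notation (a standard relation, not something specific to the RFSoC), this figure of merit is:

G/T\;[\mathrm{dB/K}] = G_{\mathrm{dB}} - 10\log_{10}\!\left(T_{\mathrm{ant}} + T_{\mathrm{RF\,chain}}\right)

so any reduction in the receive chain’s dynamic range or any added front-end noise shows up directly as a lower figure of merit.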

 

Integrating ADCs and DACs with Xilinx's programmable MPSoC fabric shrinks physical footprint, reduces chip-to-chip latency, and completely eliminates the external digital interfaces between the mixed-signal converters and the FPGA. These external interfaces typically consume appreciable power. For parallel-I/O connections, they also need large amounts of pc board space and are difficult to route.

 

There will be a number of devices in the Xilinx RFSoC family, each containing different ADC/DAC combinations targeting different markets. Depending on the number of integrated mixed-signal converters, Xilinx is predicting a 55% to 77% reduction in footprint compared to current discrete implementations using JESD204B high-speed serial links between the FPGA and the ADCs and DACs, as illustrated below. Integration will also benefit clock distribution both at the device and system level.

 

 

RFSoC Footprint Reduction 2.jpg

 

Figure 1: RFSoC device concept (Source Xilinx)

 

 

The RFSoC’s integrated 12-bit ADCs can each sample up to 4Gsamples/sec, which offers flexible bandwidth and RF frequency-planning options. The analog input bandwidth of each ADC appears to be 4GHz, which allows direct RF/IF sampling up to the S-band.

 

Direct RF/IF sampling obeys the bandpass Nyquist Theorem when oversampling at 2x the information bandwidth (or greater) and undersampling the absolute carrier frequencies. For example, the spectrum below shows a 48.5MHz-wide L-band signal centered at 1.65GHz, digitized using an undersampling rate of 140.5Msamples/sec. The resulting oversampling ratio is 2.9 with the information located in the 24th Nyquist zone. Digitization aliases the bandpass information to the first Nyquist zone, which may or may not be baseband depending on your application. If not, the RFSoC's integrated DDC moves the alias to dc, allowing the use of a low-pass filter.
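These numbers can be verified with the usual bandpass-sampling arithmetic (my worked check of the example above):

\mathrm{Nyquist\ zone} = \left\lfloor\frac{f_c}{f_s/2}\right\rfloor + 1 = \left\lfloor\frac{1650}{70.25}\right\rfloor + 1 = 24, \qquad \mathrm{OSR} = \frac{f_s}{BW} = \frac{140.5}{48.5} \approx 2.9

The alias falls at f_c \bmod f_s = 1650 - 11 \times 140.5 = 104.5\,\mathrm{MHz}; because that exceeds f_s/2, the image folds to 140.5 - 104.5 = 36\,\mathrm{MHz} in the first Nyquist zone, spectrally inverted as expected for an even-numbered zone.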

 

 

Direct L-band sampling.jpg

 

 

Figure 2: Direct L-Band Sampling

 

 

As the sample rate increases, the noise spectral density spreads across a wider Nyquist region with respect to the original signal bandwidth. Each time the sampling frequency doubles, the noise spectral density decreases by 3dB as the noise re-distributes across twice the bandwidth, which increases dynamic range and SNR. Understandably, operators want to exploit this processing gain! A larger oversampling ratio also moves the aliases further apart, relaxing the specification of the anti-aliasing filter. Furthermore, oversampling increases the correlation between successive samples in the time-domain, allowing the use of a decimating filter to remove some samples and reduce the interface rate between the ADC and the FPGA.
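The processing gain mentioned here is the familiar oversampling relation (standard converter theory rather than an RFSoC-specific figure):

\Delta\mathrm{SNR} = 10\log_{10}\!\left(\frac{f_s}{2\,BW}\right)\ \mathrm{dB}

so each doubling of the sample rate contributes 10\log_{10}(2) \approx 3\,\mathrm{dB}, matching the 3dB-per-octave behavior described above.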

 

The RFSoC’s integrated 14-bit DACs operate up to 6.4Gsamples/sec, which also offers flexible bandwidth and RF frequency-planning options.

 

Just like any high-frequency, large bandwidth mixed-signal device, designing an RFSoC into a system requires careful consideration of floor-planning, front/back-end component placement, routing, grounding, and analog-digital segregation to achieve the required system performance. The partitioning starts at the die and extends to the module/sub-system level with all the analog signals (including the sampling clock) typically on one side of an ADC or DAC. Given the RFSoC's high sampling frequencies, at the pcb level, analog inputs and outputs must be isolated further to prevent crosstalk between adjacent channels and clocks, and from digital noise.

 

At low carrier frequencies, the performance of an ADC or DAC is limited by its resolution and linearity (DNL/INL). However, at higher signal frequencies, SNR is determined primarily by the sampling clock’s purity. For direct RF/IF applications, minimizing jitter will be key to achieving the desired performance, as shown below:
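Curves like those in Figure 3 follow the classic jitter-limited SNR bound for an ideal ADC sampling a full-scale sine wave (standard mixed-signal theory; the worked number below is my own illustration):

\mathrm{SNR}_{\mathrm{jitter}} = -20\log_{10}\!\left(2\pi f_{\mathrm{in}} t_{\mathrm{jitter}}\right)

For example, 100fs of RMS clock jitter at f_in = 1.65GHz bounds the achievable SNR at roughly 59.7dB, no matter how many bits the converter has.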

 

 

SNR of an ideal ADC vs analog input frequency and clock jitter.jpg 

 

Figure 3: SNR of an ideal ADC vs analog input frequency and clock jitter

 

 

While there are aspects of the mixed-signal processing that could be improved, from the early announcements and information posted on their website, Xilinx has done a good job with the RFSoC. Although not specifically designed for satellite communication, but more so for 5G MIMO and wireless backhaul, the RFSoC's ADCs and DACs have sufficient dynamic range and offer flexible RF frequency-planning options for many ground-segment OEMs.

 

The specification of the RFSoC's ADC will allow ground receivers to directly digitize the information broadcast at traditional satellite communication frequencies at L- and S-band as well as the larger bandwidths used by high-throughput digital payloads. Thanks to its reprogrammability, the same RFSoC-based architecture with its wideband ADCs can be re-used for other frequency plans without having to re-engineer the hardware.

 

The RFSoC's DAC specification will allow ground transmitters to directly construct approximately 3GHz of bandwidth up to the X-band (9.6GHz). Xilinx says that first samples of RFSoC will become available in 2018 and I look forward to designing the part into satcom systems and sharing my experiences with you.

 

 

 

Dr. Rajan Bedi pioneered the use of Direct RF/IF Sampling and direct DAC up-conversion for the space industry with many in-orbit satellites currently using these techniques. He was previously invited by ESA and NASA to present his work and was also part of the project teams which developed many of the ultra-wideband ADCs and DACs currently on the market. These devices are successfully operating in orbit today. Last year, his company, Spacechips, was awarded High-Reliability Product of the Year for advancing Software Defined Radio.

 

Spacechips provides space electronics design consultancy services to manufacturers of satellites and spacecraft around the world. The company also helps OEMs assess the benefits of COTS components and exploit the advantages of direct RF/IF sampling and direct DAC up-conversion. Prior to founding Spacechips, Dr. Bedi headed the Mixed-Signal Design Group at Airbus Defence & Space in the UK for twelve years. Rajan is the author of Out-of-this-World Design, the popular, award-winning blog on Space Electronics. He also teaches direct RF/IF sampling and direct DAC up-conversion techniques in his Mixed-Signal and FPGA courses which are offered around the world. Rajan offers a series of unique training courses, Courses for Rocket Scientists, which teach and compare all space-grade FPGAs as well as the use of COTS Xilinx UltraScale and UltraScale+ parts for implementing spacecraft IP. Rajan has designed every space-grade FPGA into satellite systems!

 

 

As part of today’s reVISION announcement of a new, comprehensive development stack for embedded-vision applications, Xilinx has produced a 3-minute video showing you just some of the things made possible by this announcement.

 

Here it is:

 

 

Adam Taylor’s MicroZed Chronicles, Part 177: Introducing the reVision stack

by Xilinx Employee, 03-13-2017

 

By Adam Taylor

 

Several times in this series, we have looked at image processing using the Avnet EVK and the ZedBoard. Along with the basics, we have examined object tracking using OpenCV running on the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PS (processing system) and using HLS with its video library to generate image-processing algorithms for the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL (programmable logic, see blogs 140 to 148 here).

 

Xilinx’s reVISION is an embedded-vision development stack that provides support for a wide range of frameworks and libraries often used for embedded-vision applications. Most exciting, from my point of view, is that the stack includes acceleration-ready OpenCV functions.

 

Image1.jpg 

 

 

The stack itself is split into three layers. Once we select or define our platform, we will be mostly working at the application and algorithm layers. Let’s take a quick look at the layers of the stack:

 

  1. Platform layer: This is the lowest level of the stack and is the one on which the remaining stack layers are built. This layer includes platform definitions of the hardware and the software environment. Should we choose not to use a predefined platform, we can generate a custom platform using Vivado.

 

  2. Algorithm layer: Here we create our application using SDSoC and the platform definition for the target hardware. It is within this layer that we can use the acceleration-ready OpenCV functions along with predefined and optimized implementations for Convolutional Neural Network (CNN) developments such as inference accelerators within the PL.

 

  3. Application development layer: The highest layer of the stack. Here, high-level frameworks such as Caffe and OpenVX are used to complete the application.

 

As I mentioned above, one of the most exciting aspects of the reVISION stack is the ability to accelerate a wide range of OpenCV functions using the Zynq SoC’s or Zynq UltraScale+ MPSoC’s PL. We can group the OpenCV functions that can be hardware-accelerated in the PL into four categories:

 

  1. Computation – Includes functions such as absolute difference between two frames, pixel-wise operations (addition, subtraction and multiplication), gradient, and integral operations
  2. Input Processing – Supports bit-depth conversions, channel operations, histogram equalization, remapping, and resizing.
  3. Filtering – Supports a wide range of filters including Sobel, Custom Convolution, and Gaussian filters.
  4. Other – Provides a wide range of functions including Canny/Fast/Harris edge detection, thresholding, SVM, HoG, LK Optical Flow, Histogram Computation, etc.

 

What is very interesting with these function calls is that we can optimize them for resource usage or performance within the PL. The main optimization method is specifying the number of pixels to be processed during each clock cycle. For most accelerated functions, we can choose to process either one or eight pixels per clock. Processing more pixels per clock cycle reduces latency but increases resource utilization, while processing one pixel per clock minimizes the resource requirements at the cost of increased latency. We control the number of pixels processed per clock via the function call.
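As a rough illustration of that trade-off (my own back-of-envelope numbers, ignoring pipeline fill and blanking intervals):

\mathrm{cycles\ per\ frame} \approx \frac{\mathrm{rows} \times \mathrm{cols}}{\mathrm{pixels\ per\ clock}}

For a 1080p frame, that is 1920 \times 1080 = 2{,}073{,}600 cycles at one pixel per clock versus roughly 259,200 cycles at eight; at an assumed 150MHz PL clock, about 13.8ms versus 1.7ms per frame.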

 

Over the next few blogs, we will look more at the reVISION stack and how we can use it. However, in the best Blue Peter tradition, the image below shows the result of running the reVISION Harris corner-detection OpenCV function accelerated within the PL.

 

 

Image2.jpg

 

 

Accelerated Harris Corner Detection in the PL

 

 

 

 

Code is available on GitHub as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E-Book here
  • First Year Hardback here

 

 

MicroZed Chronicles hardcopy.jpg 

 

 

 

  • Second Year E-Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

Xilinx reVISION stack pushes machine learning for vision-guided applications all the way to the edge

by Xilinx Employee, 03-13-2017

 

Image3.jpg

Today, Xilinx announced a comprehensive suite of industry-standard resources for developing advanced embedded-vision systems based on machine learning and machine inference. It’s called the reVISION stack, and it allows design teams without deep hardware expertise to use a software-defined development flow to combine efficient machine-learning and computer-vision algorithms with Xilinx All Programmable devices to create highly responsive systems. (Details here.)

 

The Xilinx reVISION stack includes a broad range of development resources for platform, algorithm, and application development including support for the most popular neural networks: AlexNet, GoogLeNet, SqueezeNet, SSD, and FCN. Additionally, the stack provides library elements such as pre-defined and optimized implementations for CNN network layers, which are required to build custom neural networks (DNNs and CNNs). The machine-learning elements are complemented by a broad set of acceleration-ready OpenCV functions for computer-vision processing.

 

For application-level development, Xilinx supports industry-standard frameworks including Caffe for machine learning and OpenVX for computer vision. The reVISION stack also includes development platforms from Xilinx and third parties, which support various sensor types.

 

The reVISION development flow starts with a familiar, Eclipse-based development environment; the C, C++, and/or OpenCL programming languages; and associated compilers all incorporated into the Xilinx SDSoC development environment. You can now target reVISION hardware platforms within the SDSoC environment, drawing from a pool of acceleration-ready, computer-vision libraries to quickly build your application. Soon, you’ll also be able to use the Khronos Group’s OpenVX framework as well.

 

For machine learning, you can use popular frameworks including Caffe to train neural networks. Within one Xilinx Zynq SoC or Zynq UltraScale+ MPSoC, you can use Caffe-generated .prototxt files to configure a software scheduler running on one of the device’s ARM processors to drive CNN inference accelerators—pre-optimized for and instantiated in programmable logic. For computer vision and other algorithms, you can profile your code, identify bottlenecks, and then designate specific functions that need to be hardware-accelerated. The Xilinx system-optimizing compiler then creates an accelerated implementation of your code, automatically including the required processor/accelerator interfaces (data movers) and software drivers.

 

The Xilinx reVISION stack is the latest in an evolutionary line of development tools for creating embedded-vision systems. Xilinx All Programmable devices have long been used to develop such vision-based systems because these devices can interface to any image sensor and connect to any network—which Xilinx calls any-to-any connectivity—and they provide the large amounts of high-performance processing horsepower that vision systems require.

 

Initially, embedded-vision developers used the existing Xilinx Verilog and VHDL tools to develop these systems. Xilinx introduced the SDSoC development environment for HLL-based design two years ago and, since then, SDSoC has dramatically and successfully shortened development cycles for thousands of design teams. Xilinx’s new reVISION stack now enables an even broader set of software and systems engineers to develop intelligent, highly responsive embedded-vision systems faster and more easily using Xilinx All Programmable devices.

 

And what about the performance of the resulting embedded-vision systems? How do their performance metrics compare against systems based on embedded GPUs or the typical SoCs used in these applications? Xilinx-based systems significantly outperform the best of this group, which employ Nvidia devices. Benchmarks of the reVISION flow using Zynq SoC targets against the Nvidia Tegra X1 have shown as much as:

 

  • 6x better images/sec/watt in machine learning
  • 42x higher frames/sec/watt for computer-vision processing
  • 1/5th the latency, which is critical for real-time applications

 

Image1.jpg 

 

There is huge value to having a very rapid and deterministic system-response time and, for many systems, the faster response time of a design that's been accelerated using programmable logic can mean the difference between success and catastrophic failure. For example, the figure below shows the difference in response time between a car’s vision-guided braking system created with the Xilinx reVISION stack running on a Zynq UltraScale+ MPSoC relative to a similar system based on an Nvidia Tegra device. At 65mph, the Xilinx embedded-vision system’s response time stops the vehicle 5 to 33 feet faster depending on how the Nvidia-based system is implemented. Five to 33 feet could easily mean the difference between a safe stop and a collision.
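As a sanity check on those numbers (my own distance = speed × time arithmetic, not Xilinx’s published derivation): 65mph is 65 × 5280/3600 ≈ 95.3 feet/sec, so every 10ms of extra response latency adds roughly 0.95 feet of travel. A 5-to-33-foot difference in stopping distance therefore corresponds to roughly 50 to 350ms of response-time difference between the two implementations.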

 

 

Image2.jpg 

 

(Note: This example appears in the new Xilinx reVISION backgrounder.)

 

 

The last two years have generated more machine-learning technology than all of the advancements over the previous 45 years and that pace isn't slowing down. Many new types of neural networks for vision-guided systems have emerged along with new techniques that make deployment of these neural networks much more efficient. No matter what you develop today or implement tomorrow, the hardware and I/O reconfigurability and software programmability of Xilinx All Programmable devices can “future-proof” your designs whether it’s to permit the implementation of new algorithms in existing hardware; to interface to new, improved sensing technology; or to add an all-new sensor type (like LIDAR or Time-of-Flight sensors, for example) to improve a vision-based system’s safety and reliability through advanced sensor fusion.

 

Xilinx is pushing even further into vision-guided, machine-learning applications with the new Xilinx reVISION Stack and this announcement complements the recently announced Reconfigurable Acceleration Stack for cloud-based systems. (See “Xilinx Reconfigurable Acceleration Stack speeds programming of machine learning, data analytics, video-streaming apps.”) Together, these new development resources significantly broaden your ability to deploy machine-learning applications using Xilinx technology—from inside the cloud to the very edge.

 

 

You might also want to read “Xilinx AI Engine Steers New Course” by Junko Yoshida on the EETimes.com site.

 

 

Dave Embedded to show new ONDA SOM based on Zynq UltraScale+ MPSoC at Embedded World 2017 next week

by Xilinx Employee, 03-09-2017

 

I just received an email from Dave Embedded Systems announcing that the company will be showing its new ONDA SOM (System on Module) based on Xilinx Zynq UltraScale+ MPSoCs at next week’s Embedded World 2017 in Nuremberg. Here’s a board photo:

 

 

Dave ONDA Zynq UltraScale Plus SOM.jpg

 

 

Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC (Note: Facsimile Image)

 

 

 

And here’s a photo of the SOM’s back side showing the three 140-pin, high-density I/O connectors:

 

 

 

Dave ONDA Zynq UltraScale Plus SOM Back Side.jpg

 

 

Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC (Back Side)

 

 

 

Thanks to the multiple processors and programmable logic in the Zynq UltraScale+ MPSoC, the ONDA board packs a lot of processing power into its small 90x55mm board. Dave Embedded Systems plans to offer versions of the ONDA SOM based on the Zynq UltraScale+ ZU2, ZU3, ZU4, and ZU5 MPSoCs, so there should be a wide range of price/performance points to pick from while standardizing on one uniformly sized platform.

 

Here’s a block diagram of the board:

 

 

Dave ONDA Zynq UltraScale Plus SOM Block Diagram.jpg 

 

Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC, Block Diagram

 

 

Please contact Dave Embedded Systems for more information about the ONDA SOM.

 

 

 

 

A LinkedIn blog published last month by Alfred P. Neves of Wild River Technology describes a DesignCon 2017 tutorial titled “32 to 56Gbps Serial Link Analysis and Optimization Methods for Pathological Channels.” (You can get a copy of the paper here on the Wild River Web site. Registration required.) Co-authors of the tutorial included Al Neves and Tim Wang Lee of Wild River Technology, Heidi Barnes and Mike Resso of Keysight, and Jack Carrel and Hong Ahn of Xilinx.

 

The tutorial discussed ways to test pathological channels at these nose-bleed serial speeds and those methods employed the bulletproof GTY SerDes on a Xilinx 16nm UltraScale+ FPGA for the 32Gbps transmitters and receivers as well as the Wild River ISI-32 loss platform and XTALK-32 crosstalk platform and Keysight test equipment.

 

Here’s a photo of the test setup showing the Xilinx UltraScale+ FPGA characterization board on the right, the Wild River test platforms on the left, and the Keysight test equipment in the background:

 

 

Wind River Technology ISI-32 Test Platform with UltraScale FPGA.jpg

 

 

If you don’t want to scan the DesignCon tutorial presentation, you can also watch a free 1-hour recorded Webinar about the topic on the Keysight web site. Click here.

 

Everspin announces MRAM-based NVMe accelerator board and a new script for adapting FPGAs to MRAMs

by Xilinx Employee, 03-08-2017

 

MRAM (magnetic RAM) maker Everspin wants to make it easy for you to connect its 256Mbit DDR3 ST-MRAM devices (and its soon-to-be-announced 1Gbit ST-MRAMs) to Xilinx UltraScale FPGAs, so it now provides a software script for the Vivado MIG (Memory Interface Generator) that adapts the MIG DDR3 controller to the ST-MRAM’s unique timing and control requirements. Everspin has been shipping MRAMs for more than a decade and, according to this EETimes.com article by Dylan McGrath, it’s still the only company to have shipped commercial MRAM devices.

 

Nonvolatile MRAM’s advantage is that it has no wearout failure, as opposed to Flash memory for example. This characteristic gives MRAM huge advantages over Flash memory in applications such as server-class enterprise storage. MRAM-based storage cards require no wear leveling and their read/write performance does not degrade over time, unlike Flash-based SSDs.

 

As a result, Everspin also announced its nvNITRO line of NVMe storage-accelerator cards. The initial cards, the 1Gbyte nvNITRO ES1GB and 2Gbyte nvNITRO ES2GB, deliver 1,500,000 IOPS with 6μsec end-to-end latency. When Everspin's 1Gbit ST-MRAM devices become available later this year, the card capacities will increase to 4 to 16Gbytes.

 

Here’s a photo of the card:

 

 

Everspin nvNITRO Accelerator Card.jpg 

 

Everspin nvNITRO Storage Accelerator

 

 

 

If it looks familiar, perhaps you’re recalling the preview of this board from last year’s SC16 conference in Salt Lake City. (See “Everspin’s NVMe Storage Accelerator mixes MRAM, UltraScale FPGA, delivers 1.5M IOPS.”)

 

If you look at the photo closely, you’ll see that the hardware platform for this product is the Alpha Data ADM-PCIE-KU3 PCIe accelerator card loaded with 1 or 2Gbyte Everspin ST-MRAM DIMMs. Everspin has added its own IP to the Alpha Data card, based on a Kintex UltraScale KU060 FPGA, to create an MRAM-based NVMe controller.

 

As I wrote in last year’s post:

 

“There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.”

 

It’s still an interesting product, and now Everspin has formally announced it.

 

 

 

By Lei Guan, MTS Nokia Bell Labs (lei.guan@nokia.com)

 

Many wireless communications signal-processing stages, for example equalization and precoding, require linear convolution functions. In particular, complex linear convolution will play a very important role in future-proofing massive MIMO systems through frequency-dependent, spatial-multiplexing filter banks (SMFBs), which enable efficient utilization of wireless spectrum (see Figure 1). My team at Nokia Bell Labs has developed a compact, FPGA-based SMFB implementation.

 

 

Figure 1.jpg

 

Figure 1 - Simplified diagram of SMFB for Massive MIMO wireless communications

 

 

 

Architecturally, linear convolution shares the same structure used for discrete finite impulse response (FIR) filters, employing a combination of multiplications and additions. Direct implementation of linear convolution in FPGAs may not satisfy the user constraints regarding key DSP48 resources, even when using the compact semi-parallel implementation architecture described in “Xilinx FPGA Enables Scalable MIMO Precoding Core” in the Xilinx Xcell Journal, Issue 94.

 

From a signal-processing perspective, the discrete FIR filter describes the linear convolution function in the time domain. Because linear convolution in the time domain is equivalent to multiplication in the frequency domain, an alternative algorithm—called “fast linear convolution” (FLC)—is a good candidate for FPGA implementation. Unsurprisingly, such an implementation is a game of trade-offs between space and time, between silicon area and latency. In this article, we mercifully skip the math for the FLC operation (but you will find many more details in the book “FPGA-based Digital Convolution for Wireless Applications”). Instead, let’s take a closer look at the multi-branch FLC FPGA core that our team created.
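The core idea can be stated in one line (standard fast-convolution theory; the segmentation details are covered in the book cited above). Each input segment is filtered as

y_{\mathrm{seg}} = \mathrm{IFFT}\!\left(\mathrm{FFT}(x_{\mathrm{seg}}) \odot H\right), \qquad H = \mathrm{FFT}(h)

with overlapping input segments and discarded wrap-around samples (overlap-save) stitching the per-segment results back into a continuous filtered stream.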

 

The design targets supplied by the system team included:

 

  • The FLC core should be able to operate on multi-rate LTE systems (5MHz, 10MHz and 20MHz).
  • Each data stream to an antenna pair requires a 160-tap complex asymmetric FIR-type linear convolution filter. For example, if we are going to transmit 4 LTE data streams via 32 antennas, we require 4 × 32 = 128 160-tap FIR filters.
  • The core should be easily stackable or cascadable.
  • Core latency should be less than one tenth of one time slot of an LTE-FDD radio frame (i.e. 50μsec).

 

Figure 2 shows the top-level design of the resulting FLC core in the Vivado System Generator Environment. Figure 3 illustrates the simplified processing stages at the module level with four branches as an example.

 

 

Figure 2.jpg

 

 

Figure 2 - Top level of the FLC core in Xilinx Vivado System Generator

 

 

Figure 3.jpg

 

 

Figure 3 - Illustration of multi-branch FLC-core processing (using 4 branches as an example)

 

 

 

The multi-branch FLC-core contains the following five processing stages, isolated by registers for logic separation and timing improvement:

 

  1. InBuffer Stage: This module caches the incoming continuous, slow-rate (30.72MSPS) data stream and reproduces the data as bursty data segments at a higher processing rate (e.g., 368.64MSPS) so that the functions in the multiple branches of the later processing stages—the FFT, CM, and IFFT modules—can share the DSP48-intensive blocks in a TDM manner, resulting in a very compact implementation. Our FPGA engineer built a very compact buffer based on four dual-port block RAMs, as shown in Figure 4.

 

Figure 4.jpg

 

Figure 4 - Simple dual-port-RAM-based input data buffer and reproduction stage


  2. FFT Stage: To save valuable R&D time at the prototyping stage, we used the existing Xilinx FFT IP core directly. This core is easily configured via the provided GUI; we chose pipelined streaming I/O to minimize the FFT core’s idle processing time. We also selected natural-order output ordering to maintain correct processing for the subsequent IFFT operation.
  3. Complex Multiplication (CM) Stage: After converting the data from the time domain to the frequency domain, we added a complex-multiplication processing stage to perform the convolution in the frequency domain. We implemented a fully pipelined complex multiplier using three DSP48 blocks at a latency cost of 6 clock cycles (see the sketch after this list). We instantiated a dual-port, 4096-word RAM to store eight FLC coefficient groups. Each coefficient group contains 512 I&Q frequency-domain coefficients converted by another FFT core. We implement multiple parallel complex multiplications using only one high-speed TDM-based CM to minimize DSP48 utilization.
  4. IFFT Stage: This module provides the IFFT function. It was configured similarly to the FFT module.
  5. OutBuffer Stage: At this stage, the processed data streams are interleaved at the data-block level. We passed this high-speed sequential data stream to 8 parallel buffer modules built from dual-port RAMs. Each module buffers and re-assembles the bursty segmental convolution data into a final data stream at the original data rate. Delay lines are required to synchronize the eight data streams.
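Two of the arithmetic details above are easy to model. The sketch below (my own illustration, not Nokia's RTL) shows the classic three-multiplier complex product that lets one complex multiplication map onto three DSP48 blocks instead of four, plus the preparation of a 512-point frequency-domain coefficient group from a 160-tap time-domain filter:

import numpy as np

def cmult3(a, b, c, d):
    # (a + jb) * (c + jd) using 3 real multiplies instead of 4:
    # the standard trick for mapping a complex multiplier onto 3 DSP48s.
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2        # (real part, imaginary part)

assert cmult3(1, 2, 3, 4) == (-5, 10)    # (1+2j)*(3+4j) = -5+10j

h = np.random.randn(160) + 1j * np.random.randn(160)   # 160-tap filter
coeff_group = np.fft.fft(h, 512)   # 512 I&Q frequency-domain coefficients,
                                   # as produced offline by another FFT core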

 

Table 1 compares the performance of our FLC design with that of a semi-parallel solution. Our compact FLC core, implemented in Xilinx UltraScale and UltraScale+ FPGAs, creates a cost-effective, power-efficient, single-chip, frequency-dependent Massive MIMO spatial-multiplexing solution for actual field trials. For more information, please contact the author.

 

 

Table 1.jpg

 

 

 

Last month, the European AXIOM Project took delivery of its first board based on a Xilinx Zynq UltraScale+ ZU9EG MPSoC. (See “The AXIOM Board has arrived!”) The AXIOM project (Agile, eXtensible, fast I/O Module) aims to research new software/hardware architectures for Cyber-Physical Systems (CPS).

 

 

AXIOM Project Board Based on Zynq UltraScale MPSoC.jpg

 

 

AXIOM Project Board based on Xilinx Zynq UltraScale+ MPSoC

 

 

 

The board, in fact, presents the pinout of an Arduino Uno, so you can attach an Arduino Uno-compatible shield to the board. This enables fast prototyping and exposes the FPGA I/O pins in a user-friendly manner.

 

Here are the board specs:

 

  • Wide boot capabilities: eMMC, Micro SD, JTAG
  • Heterogeneous 64-bit ARM/FPGA processor: Xilinx Zynq UltraScale+ ZU9EG MPSoC
    • Quad-core, 64-bit ARM Cortex-A53 @ 1.2GHz
    • Dual-core, 32-bit ARM Cortex-R5 @ 500MHz
    • DDR4 @ 2400MT/s
    • Mali-400 GPU @ 600MHz
    • 600K System Logic Cells
  • Swappable SO-DIMM RAM (up to 32Gbytes) for the Processing System, plus a soldered 1Gbyte RAM for Programmable Logic
  • 12 GTH transceivers @ 12.5 Gbps (8 on USB Type C connectors + 4 on HS connector)
  • Easy rapid prototyping, because of the Arduino UNO pinout

 

You can see the AXIOM board for the first time during next week’s Embedded World 2017 at the SECO UDOO booth, the SECO booth, and the EVIDENCE booth.

 

Please contact the AXIOM Project for more information.

 

 

 

 

A paper describing the superior performance of an FPGA-based speech-recognition implementation over similar implementations on CPUs and GPUs won a Best Paper Award at FPGA 2017, held in Monterey, CA last month. The paper—titled “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA” and written by authors from Stanford U, DeePhi Tech, Tsinghua U, and Nvidia—describes a speech-recognition algorithm using LSTM (Long Short-Term Memory) models with load-balance-aware pruning implemented on a Xilinx Kintex UltraScale KU060 FPGA. The implementation runs at 200MHz and draws 41W (for the FPGA board), slotted into a PCIe chassis. Compared to Core i7 CPU and Pascal Titan X GPU implementations of the same algorithm, the FPGA-based implementation delivers 43x/3x more raw performance and 40x/11.5x better energy efficiency, according to the FPGA 2017 paper. So the FPGA implementation is both faster and more energy-efficient. Pick any two.

 

Here’s a block diagram of the resulting LSTM speech-recognition design:

 

 

Speech Recognition Engine Block Diagram.jpg 

 

 

 

The paper describes the algorithm and implementation in detail, which probably contributed to this paper winning the conference’s Best Paper Award. This work was supported by the National Natural Science Foundation of China.

 

 

 

 

Adam Taylor’s MicroZed Chronicles Part 175 Analog Mixed Signal UltraZed Edition Part 5

by Xilinx Employee 03-06-2017 11:11 AM - edited 03-06-2017 11:12 AM (1,496 Views)

 

By Adam Taylor

 

Without a doubt, some of the most popular MicroZed Chronicles blogs I have written about the Zynq-7000 SoC are those explaining how to use the Zynq SoC’s XADC. In this blog, we are going to look at how to use the Zynq UltraScale+ MPSoC’s Sysmon, which replaces the XADC within the MPSoC.

 

 

Image5.jpg

 

 

 

The MPSoC contains not one but two Sysmon blocks: one located within the MPSoC’s PS (processing system) and another within the MPSoC’s PL (programmable logic). The capabilities of the two Sysmon blocks differ slightly. The processors in the MPSoC’s PS can access both Sysmon blocks through the MPSoC’s memory space, but the two blocks have different sampling rates and external-interfacing abilities. (Note: the PL must be powered up before the PL Sysmon can be accessed by the MPSoC’s PS. As such, we should check the PL Sysmon control register to ensure that it is available before we perform any operations that use it.)

 

The PS Sysmon samples its inputs at 1Msamples/sec while the PL Sysmon has a reduced sampling rate of 200Ksamples/sec. However, the PS Sysmon cannot sample external signals. Instead, it monitors the Zynq MPSoC’s internal supply voltages and die temperature. The PL Sysmon can sample external signals and is very similar to the Zynq SoC’s XADC, having both a dedicated VP/VN differential input pair and the ability to interface to as many as sixteen auxiliary differential inputs. It can also monitor on-chip voltage supplies and temperature.

 

 

 

Image1.jpg

 

 

Sysmon Architecture within the Zynq UltraScale+ MPSoC

 

 

 

Just as with the Zynq SoC’s XADC, we can set upper and lower alarm limits for the ADC channels within both the PL and PS Sysmon in the Zynq UltraScale+ MPSoC. You can use these limits to generate an interrupt should a configured bound be exceeded, as modeled in the sketch below. We will look at exactly how to do this in another blog once we understand the basics.
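Conceptually, the alarm comparison is nothing more than a per-channel window check. A trivial model (illustration only, with made-up limit values):

def alarm_tripped(sample, lower, upper):
    # True means the hardware would assert the alarm (and, if enabled, an interrupt)
    return sample < lower or sample > upper

print(alarm_tripped(0.85, lower=0.83, upper=0.88))   # in the window:   False
print(alarm_tripped(0.91, lower=0.83, upper=0.88))   # above the bound: True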

 

The two diagrams below show the differences between the PS and PL Sysmon blocks in the Zynq UltraScale+ MPSoC:

 

 

 

Image2.jpg 

Zynq UltraScale+ MPSoC’s PS System Monitor (UG580)

 

 

 

 

Image3.jpg

 

Zynq UltraScale+ MPSoC’s PL Sysmon (UG580)

 

 

 

Interestingly, the SYSMONE4 block in the MPSoC’s PL provides direct register access to the ADC data. This is useful when using either the VP/VN or auxiliary VP/VN inputs to interface with sensors that do not require high sample rates. This arrangement permits downstream signal processing, filtering, and transfer functions to be implemented in logic.

 

Both MPSoC Sysmon blocks require 26 ADC clock cycles to perform a conversion. Therefore, if we are sampling at 200Ksamples/sec using the PL Sysmon, we require a 5.2MHz ADC clock. For the PS Sysmon to sample at 1Msamples/sec, we need to provide a 26MHz ADC clock.
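The arithmetic is simple enough to check in a couple of lines (my own helper, not a Xilinx API):

def required_adc_clock_hz(sample_rate_sps, cycles_per_conversion=26):
    # Each Sysmon conversion consumes 26 ADC clock cycles.
    return sample_rate_sps * cycles_per_conversion

print(required_adc_clock_hz(200_000))     # PL Sysmon:  5_200_000 ->  5.2MHz
print(required_adc_clock_hz(1_000_000))   # PS Sysmon: 26_000_000 -> 26MHz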

 

We set the AMS modules’ clock within the MPSoC Clock Configuration dialog, as shown below:

 

 

Image4.jpg

 

 

Zynq UltraScale+ MPSoC’s AMS clock configuration

 

 

 

The eagle-eyed will notice that I have set the clock to 52MHz and not 26MHz. This is because the PS Sysmon’s clock divisor has a minimum value of 2, so setting the clock to 52MHz results in the desired 26MHz ADC clock. The minimum divisor for the PL Sysmon is 8, although in this case a divisor of 10 is needed to produce the desired 5.2MHz clock. You also need to pay careful attention to the actual frequency generated, not just the requested frequency: you may not always get the exact frequency you want (as is the case here), and that deviation directly affects the achievable sample rate.
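The divisor reasoning can be modeled the same way (again, an illustrative helper only, with the minimum-divisor values taken from the discussion above):

def adc_divisor(ams_clk_hz, target_adc_clk_hz, min_divisor):
    # Pick the smallest legal integer divisor, then report the clock we actually get.
    div = max(min_divisor, round(ams_clk_hz / target_adc_clk_hz))
    return div, ams_clk_hz / div

print(adc_divisor(52_000_000, 26_000_000, min_divisor=2))   # PS: (2, 26000000.0)
print(adc_divisor(52_000_000,  5_200_000, min_divisor=8))   # PL: (10, 5200000.0)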

 

Next time in the UltraZed Edition of the MicroZed Chronicles, we will look at the software required to communicate with both the PS and PL Sysmon in the Zynq UltraScale+ MPSoC.

 

 

References

 

UltraScale Architecture System Monitor User Guide, UG580

 

Zynq UltraScale+ MPSoC Register Reference

 

 

 

 

Code is available on Github as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

 

 MicroZed Chronicles hardcopy.jpg

 

 

  • Second Year E Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

 

 

Today, Aldec announced its latest FPGA-based HES prototyping board—the HES-US-440—with a whopping 26M ASIC-gate capacity. This board is based on the Xilinx Virtex UltraScale VU440 FPGA and also incorporates a Xilinx Zynq Z-7100 SoC that acts as the board’s peripheral controller and host interface. The announcement includes a new release of Aldec’s HES-DVM Hardware/Software Validation Platform, which adds simulation-acceleration and emulation use modes to the HES-US-440 board’s physical prototyping capabilities. You can also use the board directly to implement HPC (high-performance computing) applications.

 

 

Aldec HES-US-440 Prototyping Board.jpg

 

 

Aldec HES-US-440 Prototyping Board, based on a Xilinx Virtex UltraScale VU440 FPGA

 

 

 

The Aldec HES-US-440 board packs a wide selection of external interfaces to ease your prototyping work, including four FMC HPC connections; PCIe; USB 3.0 and USB 2.0 OTG; a UART/USB bridge; QSFP+; 1Gbps Ethernet; HDMI; and SATA. The board also carries on-board NAND and SPI flash memories and incorporates two microSD slots.

 

Here’s a block diagram of the HES-US-440 prototyping board:

 

 

Aldec HES-US-440 Prototyping Board Block Diagram.jpg

 

 

Aldec HES-US-440 Prototyping Board Block Diagram

 

 

For more information about the Aldec HES-US-440 prototyping board and Aldec’s HES-DVM Hardware/Software Validation Platform, please contact Aldec directly.

 

 

Here are four online training classes in March that cover various technical design aspects of Xilinx UltraScale and UltraScale+ FPGAs and the Zynq UltraScale+ MPSoC:

 

 

Date                   Class

03/09/2017         Zynq UltraScale+ MPSoC for the Software Developer
03/15/2017         Serial Transceivers in UltraScale Series FPGAs/MPSoCs – Part I – Transceiver Design Methodology
03/22/2017         Serial Transceivers in UltraScale Series FPGAs/MPSoCs – Part II – Debugging Techniques and PCB Design
03/23/2017         Zynq UltraScale+ MPSoC for the System Architect

 

 

 

These four classes will be taught by three Xilinx Authorized Training Providers: Faster Technology, Xprosys, and Hardent. Click here for registration details.

 

 

 

 

Today, Cadence announced the Protium S1 FPGA-Based Prototyping Platform, which delivers as many as 200M ASIC gates of prototyping capacity per chassis for hardware/software integration, software development, system validation, and hardware regression using one to eight Xilinx Virtex UltraScale VU440 FPGAs as a foundation. That’s double the capacity of the previous version of the Protium FPGA-Based Prototyping Platform, which topped out at 100M ASIC gates. The Protium S1 combines these Virtex UltraScale FPGA boards with a complete implementation and debug software suite, permitting ultra-fast design bring-up. The new Protium S1 platform is compatible with Cadence’s Palladium platforms and SpeedBridge adapters, paving the way for a smooth transition of SoC designs from an existing emulation environment into a high-performance FPGA-based prototype.

 

Here’s a 4-minute Protium S1 introductory video from Cadence:

 

 

 

 

 

Xcell Daily has covered the Samtec FireFly mid-board interconnect system several times, but now there’s a new 3.5-minute video demo of a PCIe-specific version of the FireFly optical module. In the demo, FireFly optical PCIe modules convey PCIe signals between a host PC and a video card over 100m of optical fiber in real time, and the video passed over this link plays smoothly. That’s quite a feat for a small module like the FireFly, and it creates new possibilities for designing distributed systems.

 

The PCIe-specific version of the Samtec FireFly module handles PCIe sidebands and other PCIe-specific protocol requirements. These modules match up well with the PCIe controllers found in Xilinx UltraScale and UltraScale+ devices as well as many 7 series FPGAs and Zynq SoCs. As Kevin Burt of Samtec’s Optical Group explains, the mid-board design of the FireFly system allows you to locate the modules adjacent to the driving chips (FPGAs in this case), which improves the signal integrity of the PCB design.

 

Here’s the Samtec video:

 

 

 

 

 

 

For additional coverage of the Samtec FireFly system, see:

 

 

 

 

 

Adam Taylor’s MicroZed Chronicles Part 174: UltraZed Edition Part 4

by Xilinx Employee 02-27-2017 09:26 AM - edited 02-27-2017 09:28 AM (1,546 Views)

 

By Adam Taylor

 

Having looked at how we can quickly and easily get the Zynq UltraScale+ MPSoC up and running, I now want to examine the architecture of the system in a little more detail, starting with the processors’ global address map. I am not going to look in detail at the contents of the address map yet. Initially, I want to explore how it is organized, and in particular how the 32-bit ARM Cortex-R5 processors in the Zynq UltraScale+ MPSoC’s RPU (Real-time Processing Unit) and the 64-bit ARM Cortex-A53 processors in the APU (Application Processing Unit) share their address spaces.

 

The ARM Cortex-A53 processors use a 40-bit address bus, which can address up to 1Tbyte of memory. Compare this to the 32-bit address bus of the ARM Cortex-R5 processors, which can only address a 4Gbyte address space. The Zynq UltraScale+ MPSoC architects therefore had to consider how these address spaces would work together. The solution they came up with is pretty straightforward.

 

The memory map of the Zynq UltraScale+ MPSoC is organized so that the PMU (Platform Management Unit), MIO peripherals, DDR controller, the PL (programmable logic), etc. all fall within the first 4Gbytes of addressable space, so that the APU and the RPU can both address these resources. The APU additionally has access to the DDR and PCIe controllers and the PL in the space above 4Gbytes, up to the 1Tbyte address limit. The lower 4Gbytes of address space supports 32-bit addressing for some peripherals. One example is the PCIe controller, which supports 32-bit addressing via a 256Mbyte address range in the lower 4Gbytes and up to 256Gbytes (using 64-bit addressing) in the full address map.
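A toy model of the consequence (my own illustration, using just the bus widths described above, not real Xilinx register definitions): any resource placed below the 4Gbyte boundary is reachable by both processor clusters, while anything above it is APU-only.

RPU_LIMIT = 1 << 32          # 32-bit Cortex-R5: 4Gbytes
APU_LIMIT = 1 << 40          # 40-bit Cortex-A53: 1Tbyte

def reachable_by(addr):
    cores = []
    if addr < RPU_LIMIT:
        cores.append("RPU (Cortex-R5)")
    if addr < APU_LIMIT:
        cores.append("APU (Cortex-A53)")
    return cores

print(reachable_by(0x0800_0000))     # low in the map: both APU and RPU
print(reachable_by(0x8_0000_0000))   # above 4Gbytes: APU only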

 

Image1.jpg

 

MPSoC Global Address Map

 

 

It goes without saying that only the APU can access the address space above 4Gbytes. However, the more observant amongst us will have noticed that there also appears to be a 36-bit addressable mode. Using a 36-bit address provides for faster address translation because the table walker uses only three stages instead of the four needed for a 40-bit address. Therefore, 36-bit addressing should be used where possible to optimize system performance.

 

Address translation is the role of the System Memory Management Unit (SMMU), which has been designed to transform addresses from a virtual address space to a physical address space when using a virtualized environment. The SMMU can provide the following translations if desired:

 

 

Virtual Address (VA) -> Intermediate Physical Address (IPA) -> Physical Address (PA)

 

 

Within the SMMU, these are defined as stage one (VA to IPA) and stage two (IPA to PA) translations. Depending upon the use case, we can perform a stage-one-only, a stage-two-only, or a combined stage one and stage two translation, as modeled below. To understand more about the SMMU—which is a complex subject—I would recommend reading chapters 3 and 10 of the Zynq UltraScale+ MPSoC TRM (UG1085) and the ARM SMMU architecture specification.
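A toy two-stage lookup makes the flow concrete (purely conceptual; the real SMMU performs multi-level page-table walks, and these dictionaries and addresses are invented for illustration):

stage1 = {0x1000: 0x8000}      # guest OS view:   VA  -> IPA
stage2 = {0x8000: 0x40000}     # hypervisor view: IPA -> PA

def smmu_translate(va, use_stage1=True, use_stage2=True):
    ipa = stage1[va] if use_stage1 else va
    return stage2[ipa] if use_stage2 else ipa

print(hex(smmu_translate(0x1000)))                    # 0x40000 (stage 1 + 2)
print(hex(smmu_translate(0x1000, use_stage2=False)))  # 0x8000  (stage 1 only)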

 

 

Image2.jpg 

 

SMMU translation schemes

 

 

 

Now that we understand a little more about the Zynq UltraScale+ MPSoC’s global memory map, we will look at exactly what is contained within this memory map and how we can configure and use this map with both the APU and the RPU cores over the next few blogs.

 

 

 

 

Code is available on Github as always.

 

If you want e-book or hardback versions of previous MicroZed Chronicles blogs, you can get them below.

 

 

 

  • First Year E Book here
  • First Year Hardback here.

 

 

 

 MicroZed Chronicles hardcopy.jpg

 

 

  • Second Year E Book here
  • Second Year Hardback here

 

 

MicroZed Chronicles Second Year.jpg

 

 

Adam Taylor just published an EETimes review of the Xilinx RFSoC, announced earlier this week. (See “Game-Changing RFSoCs from Xilinx.”) Taylor has a lot of experience with high-speed analog converters. He has designed systems based on them, so his perspective is that of a system designer who has used these types of devices and knows where the potholes are. He has also worked for a semiconductor company that made them, so he knows what to look for from a deep, device-level perspective.

 

Here’s the capsulized summary of his comments in EETimes:

 

 

“The ADCs are sampled at 4 Gsps (gigasamples per second), while the DACs are sampled at 6.4 Gsps, all of which provides the ability to work across a very wide frequency range. The main benefit of this, of course, is a much simpler RF front end, which reduces not only PCB footprint and the BOM cost but -- more crucially -- the development time taken to implement a new system.”

 

 

“…these devices offer many advantages beyond the simpler RF front end and reduced system power that comes from such a tightly-coupled solution.”

 

 

“These devices also bring with them a simpler clocking scheme, both at the device-level and the system-level, ensuring clock distribution while maintaining low phase noise / jitter between the reference clock and the ADCs and DACs, which can be a significant challenge.”

 

 

“These RFSoCs will also simplify the PCB layout and stack, removing the need for careful segregation of high-speed digital signals from the very sensitive RF front-end.”

 

 

Taylor concludes:

 

 

“I, for one, am very excited to learn more about RFSoCs and I cannot wait to get my hands on one.”

 

 

For more information about the new Xilinx RFSoC, see “Xilinx announces RFSoC with 4Gsamples/sec ADCs and 6.4Gsamples/sec DACs for 5G, other apps. When we say ‘All Programmable,’ we mean it!” and “The New All Programmable RFSoC—and now the video.”

 

 

Avnet’s new $499 UltraZed PCIe I/O carrier card for its UltraZed-EG SoM (System on Module)—based on the Xilinx Zynq UltraScale+ MPSoC—gives you easy access to the SoM’s 180 user I/O pins, 26 MIO pins from the Zynq MPSoC’s MIO, and 4 GTR transceivers from the Zynq MPSoC’s PS (processing system) through:

  • A PCIe x1 edge connector
  • Two Digilent PMOD connectors
  • An FMC LPC connector
  • USB and microUSB, SATA, DisplayPort, and RJ45 connectors
  • An LVDS touch-panel interface
  • A SYSMON header
  • Pushbutton switches and LEDs

 

 

Avnet UltraZed PCIe IO Carrier Card Image.jpg

 

 

$499 UltraZed PCIe I/O Carrier Card for the UltraZed-EG SoM

 

 

That’s a lot of connectivity to track in your head, so here’s a block diagram of the UltraZed PCIe I/O carrier card:

 

 

Avnet UltraZed PCIe IO Carrier Card.jpg

 

UltraZed PCIe I/O Carrier Card Block Diagram

 

 

 

For information on the Avnet UltraZed SOM, see “Look! Up in the sky! Is it a board? Is it a kit? It’s… UltraZed! The Zynq UltraScale+ MPSoC Starter Kit from Avnet” and “Avnet UltraZed-EG SOM based on 16nm Zynq UltraScale+ MPSoC: $599.” Also, see Adam Taylor’s MicroZed Chronicles about the UltraZed:

 

 

 

 

 

 

 
