Displaying articles for: 03-05-2017 - 03-11-2017
The amazing “snickerdoodle one”—a low-cost, single-board computer with wireless capability based on the Xilinx Zynq Z-7010 SoC—is once more available for purchase on the Crowd Supply crowdsourcing Web site. Shipments are already going out to existing backers and, if you missed out on the original crowdsourcing campaign, you can order one for the post-campaign price of $95. That’s still a huuuuge bargain in my book. (Note: There is a limited number of these boards available, so if you want one, now’s the time to order it.)
In addition, you can still get the “snickerdoodle black” with a faster Zynq Z-7020 SoC and more SDRAM that also includes an SDSoC software license, all for $195. Finally, snickerdoodle’s creator krtkl has added two mid-priced options: the snickerdoodle prime and snickerdoodle prime LE—also based on Zynq Z-7020 SoCs—for $145.
The krtkl snickerdoodle low-cost, single-board computer based on a Xilinx Zynq SoC
Ryan Cousins at krtkl sent me this table that helps explain the differences among the four snickerdoodle versions:
For more information about krtkl’s snickerdoodle SBC, see:
I just received an email from Dave Embedded Systems announcing that the company will be showing its new ONDA SOM (System on Module) based on Xilinx Zynq UltraScale+ MPSoCs at next week’s Embedded World 2017 in Nuremberg. Here’s a board photo:
Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC (Note: Facsimile Image)
And here’s a photo of the SOM’s back side showing the three 140-pin, high-density I/O connectors:
Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC (Back Side)
Thanks to the multiple processors and programmable logic in the Zynq UltraScale+ MPSoC, the ONDA board packs a lot of processing power into its small 90x55mm board. Dave Embedded Systems plans to offer versions of the ONDA SOM based on the Zynq UltraScale+ ZU2, ZU3, ZU4, and ZU5 MPSoCs, so there should be a wide range of price/performance points to pick from while standardizing on one uniformly sized platform.
Here’s a block diagram of the board:
Dave Embedded Systems ONDA SOM based on the Xilinx Zynq UltraScale+ MPSoC, Block Diagram
Please contact Dave Embedded Systems for more information about the ONDA SOM.
A LinkedIn blog published last month by Alfred P Neves of Wild River Technology describes a DesignCon 2017 tutorial titled “32 to 56Gbps Serial Link Analysis and Optimization Methods for Pathological Channels.” (You can get a copy of the paper here on the Wild River Web site. Registration required.) Co-authors of the tutorial included Al Neves and Tim Wang Lee of Wild River Technology, Heidi Barnes and Mike Resso of Keysight, and Jack Carrel and Hong Ahn of Xilinx.
The tutorial discussed ways to test pathological channels at these nose-bleed serial speeds. The methods employed the bulletproof GTY SerDes on a Xilinx 16nm UltraScale+ FPGA for the 32Gbps transmitters and receivers, along with the Wild River ISI-32 loss platform, the XTALK-32 crosstalk platform, and Keysight test equipment.
Here’s a photo of the test setup showing the Xilinx UltraScale+ FPGA characterization board on the right, the Wild River test platforms on the left, and the Keysight test equipment in the background:
If you don’t want to scan the DesignCon tutorial presentation, you can also watch a free 1-hour recorded Webinar about the topic on the Keysight web site. Click here.
On Thursday, March 30, two member companies from the IIConsortium (Industrial Internet Consortium)—Cisco and Xilinx—are presenting a free, 1-hour Webinar titled “How the IIoT (Industrial Internet of Things) Makes Critical Data Available When & Where it is Needed.” The discussion will cover machine learning and how self-optimization plays a pivotal role in enhancing factory intelligence. Other IIoT topics covered in the Webinar include TSN (time-sensitive networking), real-time control, and high-performance node synchronization. The Webinar will be presented by Paul Didier, the Manufacturing Solution Architect for the IoT SW Group at Cisco Systems, and Dan Isaacs, Director of Connected Systems at Xilinx.
By Adam Taylor
Embedded vision is one of my many FPGA/SoC interests. Recently, I have been doing some significant development work with the Avnet Embedded Vision Kit (EVK). (For more info on the EVK and its uses, see Issues 114 to 126 of the MicroZed Chronicles.) As part of my development, I wanted to synchronize the EVK display output with an external source, which is also useful if we want to synchronize multiple image streams.
Implementing this is straightforward provided we have the correct architecture. The main element we need is a buffer between the upstream camera/image sensor chain and the downstream output-timing and -processing chain. VDMA (Video Direct Memory Access) provides this buffer by allowing us to store frames from the upstream image-processing pipeline in DDR SDRAM and then read the frames out into a downstream processing pipeline with different timing.
The architectural concept appears below:
VDMA buffering between upstream and downstream with external sync
For most downstream chains, we use a combination of the video timing controller (VTC) and AXI Stream to Video Out IP blocks, both provided in the Vivado IP library. These two IP blocks work together. The VTC provides output timing and generates signals such as VSync and HSync. The AXI Stream to Video Out IP Block synchronizes its incoming AXIS stream with the timing signals provided by the VTC to generate the output video signals. Once the AXI Stream to Video Out block has synchronized with these signals, it is said to be locked and it will generate output video and timing signals that we can use.
The VTC itself is capable of both detecting input video timing and generating output video timing. These can be synchronized if you desire. If no video input timing signals are available to the VTC, then the input frame sync pulse (FSYNC_IN) serves to synchronize the output timing.
Enabling Synchronization with FSYNC_IN or the Detector
If FSYNC_IN alone is used to synchronize the output, we need to use not only FSYNC_IN but also the VTC-provided frame sync out (FSYNC_OUT) and GEN_CLKEN to ensure correct synchronization. GEN_CLKEN is an input enable that allows the VTC generator output stage to be clocked.
The FSYNC_OUT pulse can be configured to occur at any point within the frame. For this application, it has been configured to be generated at the very end of the frame. This configuration can take place in the VTC re-configuration dialog within Vivado for a one-time approach or, if an AXI Lite interface is provided, the pulse can be repositioned through that interface at run time.
The algorithm used to synchronize the VTC to an external signal is:
Should GEN_CLKEN not be disabled, the VTC will continue to run freely and will generate the next frame sequence. Issuing another FSYNC_IN while this is occurring will not result in re-synchronisation. Instead, the AXI Stream to Video Out IP block will be unable to synchronize the AXIS video with the timing information and will lose lock.
Therefore, to control the enabling of the GEN_CLKEN we need to create a simple RTL block that implements the algorithm above.
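To make the gating idea concrete, here is a small behavioral sketch in Python rather than RTL. It is not the design from the Vivado project. It assumes the simplest rule consistent with the description above: assert GEN_CLKEN when FSYNC_IN arrives, and de-assert it when FSYNC_OUT marks the end of the frame. The class names `GenClkEnGate` and `ToyTimingGenerator`, and the frame and sync periods, are hypothetical and chosen only for illustration.

```python
class GenClkEnGate:
    """Assumed gating rule: FSYNC_IN enables, FSYNC_OUT disables."""

    def __init__(self):
        self.gen_clken = False

    def clock(self, fsync_in, fsync_out):
        if fsync_in:
            self.gen_clken = True    # external sync: let the generator run
        elif fsync_out:
            self.gen_clken = False   # end of frame: hold until the next sync
        return self.gen_clken


class ToyTimingGenerator:
    """Stand-in for the VTC generator: a frame counter with a clock enable."""

    def __init__(self, frame_ticks):
        self.frame_ticks = frame_ticks
        self.count = 0

    def clock(self, enable):
        if not enable:
            return None, False       # generator is frozen this tick
        position = self.count
        fsync_out = position == self.frame_ticks - 1  # pulse at end of frame
        self.count = (self.count + 1) % self.frame_ticks
        return position, fsync_out


def simulate(frame_ticks, sync_period, total_ticks, gated=True):
    """Return the ticks at which each output frame starts."""
    gate = GenClkEnGate()
    vtc = ToyTimingGenerator(frame_ticks)
    frame_starts = []
    fsync_out = False
    for tick in range(total_ticks):
        fsync_in = tick % sync_period == 0
        enable = gate.clock(fsync_in, fsync_out) if gated else True
        position, fsync_out = vtc.clock(enable)
        if position == 0:
            frame_starts.append(tick)
    return frame_starts


if __name__ == "__main__":
    # Gated: every frame starts on an external sync tick.
    print(simulate(8, 11, 40))               # [0, 11, 22, 33]
    # Free-running: frame starts drift away from the sync ticks.
    print(simulate(8, 11, 40, gated=False))  # [0, 8, 16, 24, 32]
```

In the gated run the generator pauses after each frame and restarts on the external pulse, so frame starts line up with the sync period. In the free-running run the frames follow the generator's own period and drift relative to the external source.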
Vivado Project Demonstrating the concept
When simulated, this design resulted in the VTC synchronizing to the FSYNC_IN signal as intended. It also worked the same when I implemented it in my EVK kit, allowing me to synchronize the output to an external trigger.
Code is available on Github as always.
If you want E book or hardback versions of previous MicroZed chronicle blogs, you can get them below.
MRAM (magnetic RAM) maker Everspin wants to make it easy for you to connect its 256Mbit DDR3 ST-MRAM devices (and its soon-to-be-announced 1Gbit ST-MRAMs) to Xilinx UltraScale FPGAs, so it now provides a software script for the Vivado MIG (Memory Interface Generator) that adapts the MIG DDR3 controller to the ST-MRAM’s unique timing and control requirements. Everspin has been shipping MRAMs for more than a decade and, according to this EETimes.com article by Dylan McGrath, it’s still the only company to have shipped commercial MRAM devices.
Nonvolatile MRAM’s advantage is that it has no wearout failure, as opposed to Flash memory for example. This characteristic gives MRAM huge advantages over Flash memory in applications such as server-class enterprise storage. MRAM-based storage cards require no wear leveling and their read/write performance does not degrade over time, unlike Flash-based SSDs.
As a result, Everspin also announced its nvNITRO line of NVMe storage-accelerator cards. The initial cards, the 1Gbyte nvNITRO ES1GB and 2Gbyte nvNITRO ES2GB, deliver 1,500,000 IOPS with 6μsec end-to-end latency. When Everspin's 1Gbit ST-MRAM devices become available later this year, the card capacities will increase to 4 to 16Gbytes.
Here’s a photo of the card:
Everspin nvNITRO Storage Accelerator
If it looks familiar, perhaps you’re recalling the preview of this board from last year’s SC16 conference in Salt Lake City. (See “Everspin’s NVMe Storage Accelerator mixes MRAM, UltraScale FPGA, delivers 1.5M IOPS.”)
If you look at the photo closely, you’ll see that the hardware platform for this product is the Alpha Data ADM-PCIE-KU3 PCIe accelerator card, loaded with 1 or 2Gbyte Everspin ST-MRAM DIMMs. Everspin has added its own IP to the Alpha Data card, based on a Kintex UltraScale KU060 FPGA, to create an MRAM-based NVMe controller.
As I wrote in last year’s post:
“There’s a key point to be made about a product like this. The folks at Alpha Data likely never envisioned an MRAM-based storage accelerator when they designed the ADM-PCIE-KU3 PCIe accelerator card but they implemented their design using an advanced Xilinx UltraScale FPGA knowing that they were infusing flexibility into the design. Everspin simply took advantage of this built-in flexibility in a way that produced a really interesting NVMe storage product.”
It’s still an interesting product, and now Everspin has formally announced it.
By Lei Guan, MTS Nokia Bell Labs (firstname.lastname@example.org)
Many wireless communications signal-processing stages, for example equalization and precoding, require linear convolution functions. In particular, complex linear convolution will play a very important role in future-proofing massive MIMO systems through frequency-dependent, spatial-multiplexing filter banks (SMFBs), which enable efficient utilization of wireless spectrum (see Figure 1). My team at Nokia Bell Labs has developed a compact, FPGA-based SMFB implementation.
Figure 1 - Simplified diagram of SMFB for Massive MIMO wireless communications
Architecturally, linear convolution shares the same structure used for discrete finite impulse response (FIR) filters, employing a combination of multiplications and additions. Direct implementation of linear convolution in FPGAs may not satisfy the user constraints regarding key DSP48 resources, even when using the compact semi-parallel implementation architecture described in “Xilinx FPGA Enables Scalable MIMO Precoding Core” in the Xilinx Xcell Journal, Issue 94.
From a signal-processing perspective, the discrete FIR filter describes the linear convolution function in the time domain. Because linear convolution in the time domain is equivalent to multiplication in the frequency domain, an alternative algorithm called “fast linear convolution” (FLC) is a good candidate for FPGA implementation. Unsurprisingly, such an implementation is a game of trade-offs between space and time, between silicon area and latency. In this article, we mercifully skip the math for the FLC operation (but you will find many more details in the book “FPGA-based Digital Convolution for Wireless Applications”). Instead, let’s take a closer look at the multi-branch FLC FPGA core that our team created.
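For readers who want to see the time-domain/frequency-domain equivalence in action, here is a minimal software sketch in Python. It is not the FPGA core described in this article. It simply zero-pads two complex sequences to a power-of-two length, multiplies their FFTs, and inverse-transforms the product, then compares the result against a direct multiply-and-add convolution. The function names and the sample data are illustrative assumptions.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def direct_convolution(x, h):
    """Time-domain linear convolution: the FIR-style multiply and add."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def fast_linear_convolution(x, h):
    """Frequency-domain linear convolution using zero-padded FFTs."""
    out_len = len(x) + len(h) - 1
    n = 1
    while n < out_len:          # pad so circular wrap-around cannot occur
        n *= 2
    x_freq = fft(list(x) + [0j] * (n - len(x)))
    h_freq = fft(list(h) + [0j] * (n - len(h)))
    y_freq = [a * b for a, b in zip(x_freq, h_freq)]
    y = fft(y_freq, inverse=True)
    return [v / n for v in y[:out_len]]  # normalize the inverse transform


if __name__ == "__main__":
    x = [1 + 2j, 3 - 1j, 0.5j, -2 + 0j]   # sample complex input
    h = [2 - 1j, -1 + 1j, 0.25 + 0j]      # sample complex filter taps
    fast = fast_linear_convolution(x, h)
    slow = direct_convolution(x, h)
    print(max(abs(a - b) for a, b in zip(fast, slow)))  # tiny rounding error
```

The direct version needs one complex multiplication per input sample per tap, while the FFT version replaces the per-tap work with one multiplication per frequency bin plus the transforms. That trade is where the space-versus-latency balancing mentioned above comes from.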
The design targets supplied by the system team included:
Figure 2 shows the top-level design of the resulting FLC core in the Vivado System Generator Environment. Figure 3 illustrates the simplified processing stages at the module level with four branches as an example.
Figure 2 - Top level of the FLC core in Xilinx Vivado System Generator
Figure 3 - Illustration of multi-branch FLC-core processing (using 4 branches as an example)
The multi-branch FLC-core contains the following five processing stages, isolated by registers for logic separation and timing improvement:
Figure 4 - Simple Dual-Port RAM based input data buffer and reproduce stage
Table 1 compares the performance of our FLC design and a semi-parallel solution. Our compact FLC core implemented with Xilinx UltraScale and UltraScale+ FPGAs creates a cost-effective, power-efficient, single-chip frequency dependent Massive MIMO spatial multiplexing solution for actual field trials. For more information, please contact the author.
Last month, the European AXIOM Project took delivery of its first board based on a Xilinx Zynq UltraScale+ ZU9EG MPSoC. (See “The AXIOM Board has arrived!”) The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new software/hardware architectures for Cyber-Physical Systems (CPS).
AXIOM Project Board based on Xilinx Zynq UltraScale+ MPSoC
The board presents the pinout of an Arduino Uno, so you can attach an Arduino Uno-compatible shield to it. This pinout enables fast prototyping and exposes the FPGA I/O pins in a user-friendly manner.
Here are the board specs:
You can see the AXIOM board for the first time during next week’s Embedded World 2017 at the SECO UDOO Booth, at the SECO booth, and at the EVIDENCE booth.
Please contact the AXIOM Project for more information.
A simple press release last month from the UK’s U of Bristol announced a 5G Massive MIMO milestone jointly achieved by BT, the Universities of Bristol and Lund, and National Instruments (NI): serving 2Gbps to 24 users simultaneously using a 20MHz LTE channel. That’s just short of 100 bits/sec/Hz and improves upon today’s LTE system capacity by 10x. The system that achieved this latest LTE milestone is the same Massive MIMO SDR system, built on NI USRP RIO dual-channel SDR radios, that delivered 145.6 bps/Hz in 5G experiments last year. (See “Kapow! NI-based 5G Massive MIMO SDR proto system “chock full of FPGAs” sets bandwidth record: 145.6 bps/Hz in 20MHz channel.”)
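The spectral-efficiency figure is simply the sum rate divided by the channel bandwidth. The short Python sketch below reproduces that arithmetic using the rounded numbers quoted above (the press release says “circa two Gbits/s,” so 2Gbps is an upper-end approximation). The function names are illustrative.

```python
def spectral_efficiency(sum_rate_bps, bandwidth_hz):
    """Aggregate spectral efficiency in bits/sec/Hz."""
    return sum_rate_bps / bandwidth_hz


def sum_rate(efficiency_bps_per_hz, bandwidth_hz):
    """Aggregate throughput in bits/sec for a given efficiency."""
    return efficiency_bps_per_hz * bandwidth_hz


if __name__ == "__main__":
    bandwidth_hz = 20_000_000      # 20MHz LTE channel
    sum_rate_bps = 2_000_000_000   # roughly 2Gbps across all users
    users = 24

    total = spectral_efficiency(sum_rate_bps, bandwidth_hz)
    print(total)            # 100.0
    print(total / users)    # average per-user share of the total

    # The earlier 145.6 bps/Hz result, expressed as a sum rate in 20MHz.
    print(sum_rate(145.6, bandwidth_hz))
```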
According to the press release:
“Initial experiments took place in BT’s large exhibition hall and used 12 streams in a single 20MHz channel to show the real-time transmission and simultaneous reception of ten unique video streams, plus two other spatial channels demonstrating the full richness of spatial multiplexing supported by the system.
“The system was also shown to support the simultaneous transmission of 24 user streams operating with 64QAM on the same radio channel with all modems synchronising over-the-air. It is believed that this is the first time such an experiment has been conducted with truly un-tethered devices, from which the team were able to infer a spectrum efficiency of just less than 100bit/s/Hz and a sum rate capacity of circa two Gbits/s in this single 20MHz wide channel.”
The NI USRP SDRs are based on Xilinx Kintex-7 325T FPGAs. Again, quoting from the press release:
“The experimental system uses the same flexible SDR platform from NI that leading wireless researchers in industry and academia are using to define 5G. To achieve accurate, real-time performance, the researchers took full advantage of the system's FPGAs using LabVIEW Communications System Design and the recently announced NI MIMO Application Framework. As lead users, both the Universities of Bristol and Lund worked closely with NI to implement, test and debug this framework prior to its product release. It now provides the ideal foundations for the rapid development, optimization and evaluation of algorithms and techniques for massive MIMO.”
Here’s a BT video describing this latest milestone in detail:
A paper describing the superior performance of an FPGA-based, speech-recognition implementation over similar implementations on CPUs and GPUs won a Best Paper Award at FPGA 2017 held in Monterey, CA last month. The paper, titled “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA” and written by authors from Stanford U, DeePhi Tech, Tsinghua U, and Nvidia, describes a speech-recognition algorithm using LSTM (Long Short-Term Memory) models with load-balance-aware pruning implemented on a Xilinx Kintex UltraScale KU060 FPGA. The implementation runs at 200MHz and draws 41W (for the FPGA board) when slotted into a PCIe chassis. Compared to Core i7 CPU/Pascal Titan X GPU implementations of the same algorithm, the FPGA-based implementation delivers 43x/3x more raw performance and 40x/11.5x better energy efficiency, according to the FPGA 2017 paper. So the FPGA implementation is both faster and more energy-efficient. Pick any two.
Here’s a block diagram of the resulting LSTM speech-recognition design:
The paper describes the algorithm and implementation in detail, which probably contributed to this paper winning the conference’s Best Paper Award. This work was supported by the National Natural Science Foundation of China.
By Adam Taylor
Without a doubt, some of the most popular MicroZed Chronicles blogs I have written about the Zynq 7000 SoC explain how to use the Zynq SoC’s XADC. In this blog, we are going to look at how we can use the Zynq UltraScale+ MPSoC’s Sysmon, which replaces the XADC within the MPSoC.
The MPSoC contains not one but two Sysmon blocks. One is located within the MPSoC’s PS (processing system) and another within the MPSoC’s PL (programmable logic). The capabilities of the PL and PS Sysmon blocks are slightly different. While the processors in the MPSoC’s PS can access both Sysmon blocks through the MPSoC’s memory space, the different Sysmon blocks have different sampling rates and external interfacing abilities. (Note: the PL must be powered up before the PL Sysmon can be accessed by the MPSoC’s PS. As such, we should check the PL Sysmon control register to ensure that it is available before we perform any operations that use it.)
The PS Sysmon samples its inputs at 1Msamples/sec while the PL Sysmon has a reduced sampling rate of 200Ksamples/sec. However, the PS Sysmon does not have the ability to sample external signals. Instead, it monitors the Zynq MPSoC’s internal supply voltages and die temperature. The PL Sysmon can sample external signals and it is very similar to the Zynq SoC’s XADC, having both a dedicated VP/VN differential input pair and the ability to interface to as many as sixteen auxiliary differential inputs. It can also monitor on-chip voltage supplies and temperature.
Sysmon Architecture within the Zynq UltraScale+ MPSoC
Just as with the Zynq SoC’s XADC, we can set upper and lower alarm limits for ADC channels within both the PL and PS Sysmon in the Zynq UltraScale+ MPSoC. You can use these limits to generate an interrupt should the configured bound be exceeded. We will look at exactly how we can do this in another blog once we understand the basics.
The two diagrams below show the differences between the PS and PL Sysmon blocks in the Zynq UltraScale+ MPSoC:
Zynq UltraScale+ MPSoC’s PS System Monitor (UG580)
Zynq UltraScale+ MPSoC’s PL Sysmon (UG580)
Interestingly, the Sysmone4 block in the MPSoC’s PL provides direct register access to the ADC data. This will be useful if using either the VP/VN or Aux VP/VN inputs to interface with sensors that do not require high sample rates. This arrangement permits downstream signal processing, filtering, and transfer functions to be implemented in logic.
Both MPSoC Sysmon blocks require 26 ADC clock cycles to perform a conversion. Therefore, if we are sampling at 200Ksamples/sec using the PL Sysmon, we require a 5.2MHz ADC clock. For the PS Sysmon to sample at 1Msamples/sec, we need to provide a 26MHz ADC clock.
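As a quick check of that arithmetic, the required ADC clock is the sample rate multiplied by the 26 clock cycles each conversion takes. The Python sketch below is just that calculation. The function name is illustrative and nothing here touches real hardware.

```python
CYCLES_PER_CONVERSION = 26  # ADC clock cycles per conversion, from the text


def required_adc_clock_hz(sample_rate_sps):
    """ADC clock needed to sustain the given sample rate."""
    return CYCLES_PER_CONVERSION * sample_rate_sps


if __name__ == "__main__":
    print(required_adc_clock_hz(200_000))    # PL Sysmon: 5200000 (5.2MHz)
    print(required_adc_clock_hz(1_000_000))  # PS Sysmon: 26000000 (26MHz)
```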
We set the AMS modules’ clock within the MPSoC Clock Configuration dialog, as shown below:
Zynq UltraScale+ MPSoC’s AMS clock configuration
The eagle-eyed will notice that I have set the clock to 52MHz and not 26MHz. This is because the PS Sysmon’s clock divisor has a minimum value of 2, so setting the clock to 52MHz results in the desired 26MHz clock. The minimum divisor is 8 for the PL Sysmon, although in this case the clock would need to be divided by 10 to get the desired 5.2MHz clock. You also need to pay careful attention to the actual frequency and not just the requested frequency to get the best performance. The actual frequency determines the sample rate, and you may not always get the exact frequency you want, as is the case here.
Next time in the UltraZed Edition of the MicroZed Chronicles, we will look at the software required to communicate with both the PS and PL Sysmon in the Zynq UltraScale+ MPSoC.
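The relationship between the AMS clock, the divisor, and the resulting sample rate can be sketched in a few lines of Python. This is only the arithmetic described above, using the minimum divisors quoted in the text and the 26-cycle conversion time. The function name is illustrative, and the 51.999MHz "actual" frequency is a made-up value used purely to show how a small clock error shifts the sample rate.

```python
CYCLES_PER_CONVERSION = 26          # ADC clock cycles per conversion
MIN_DIVISOR = {"PS": 2, "PL": 8}    # minimum divisors quoted in the text


def sample_rate_sps(ams_clock_hz, divisor, sysmon):
    """Sample rate produced by a given AMS clock and Sysmon clock divisor."""
    if divisor < MIN_DIVISOR[sysmon]:
        raise ValueError(f"{sysmon} Sysmon divisor must be at least "
                         f"{MIN_DIVISOR[sysmon]}")
    adc_clock_hz = ams_clock_hz / divisor
    return adc_clock_hz / CYCLES_PER_CONVERSION


if __name__ == "__main__":
    requested_hz = 52_000_000
    print(sample_rate_sps(requested_hz, 2, "PS"))    # 1Msamples/sec
    print(sample_rate_sps(requested_hz, 10, "PL"))   # 200Ksamples/sec

    # Hypothetical actual clock slightly below the requested 52MHz.
    actual_hz = 51_999_000
    print(sample_rate_sps(actual_hz, 2, "PS"))       # just under 1Msamples/sec
```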
Code is available on Github as always.