The module connects to the network via its optical SFP port and generates disciplined reference clocks that are used by the rest of the local node for network timing synchronization. With the Spartan-6 FPGA clocking at 160 MHz, no measurable output clock jitter was detected in the latest design tests.
Prior Xcell Daily coverage of CERN’s White Rabbit protocol:
China Mobile Research Institute (CMRI) and Xilinx have just signed an MOU for development of the Next Generation Fronthaul Interface (NGFI) to be used in C-RANs (Cloud Radio Access Networks), RRUs (Remote Radio Units), large-scale antenna systems, 3D MIMO (multiple-input, multiple-output) antenna systems, and other 5G infrastructure components. Where does NGFI fit in a C-RAN network architecture? Here’s a block diagram to illustrate, taken from CMRI’s NGFI White Paper:
As the above figure shows, NGFI is the fronthaul interface between baseband processors and remote radio heads in the radio network infrastructure. NGFI shifts some BBU (baseband unit) processing functions to the RRUs, which alters BBU and RRU architecture. As a result, this next-generation architecture redefines the BBU as the Radio Cloud Center (RCC). Collections of RRUs, possibly managed by Radio Access Units (RAUs), become Radio Remote Systems (RRS). In addition, the existing fronthaul protocol using point-to-point connections morphs into a many-to-many fronthaul network that employs a packet-exchange protocol.
NGFI is a work in progress. Xilinx is contributing a validated NGFI reference design based on the Zynq SoC to the NGFI R&D effort. The Xilinx reference design will serve as a baseline framework for ongoing 4.5G/5G wireless network research, and the All Programmable Zynq SoC’s hardware, software, and I/O programmability give researchers the ability to quickly develop and evolve the NGFI design from that baseline as envisioned by CMRI.
A research group at Yonsei University working on future wireless communications systems demonstrated a real-time, full-duplex LTE radio system at IEEE Globecom in Austin, Texas last December. The team is using a novel antenna approach and has been working with National Instruments SDR platforms and the LabVIEW graphical programming environment. The full-duplex prototype is based on the LTE downlink standard with the following system specifications:
There’s a very big reason that explains why Xilinx won this award.
Equipped with the right IP and software, Xilinx All Programmable FPGAs and SoCs can implement single-chip systems, replacing ASSPs and ASICs entirely. This capability is critical in leading-edge applications where shipping volumes have not yet become—and may never become—sufficiently large to attract the ASSP vendors or to justify the NRE associated with cutting-edge, nanometer ASIC development.
This is precisely the current situation for networking products that implement 100G and 400G Ethernet and other equally high-speed networking protocols including OTN.
Quite simply, there is no better way to implement these products—especially with the introduction of the latest Xilinx 20nm UltraScale and 16nm UltraScale+ devices.
Xilinx owns 100% of the 400G Ethernet platforms announced to date.
The Leading Lights awards are the telecom industry's most prestigious awards program and focus on next-generation communications technologies, applications, services, and strategies. Xilinx won the 2015 Leading Lights award for Outstanding Components Vendor out of a field of seven finalists in the category.
Hitech Global has announced the HTG-728 100G NIC (network interface card) PCIe Gen3 board based on the Xilinx Virtex-7 H580T FPGA. On board, there’s a CFP2 (4x25Gbps) and a CFP4 (4x25Gbps) card cage for optical Ethernet modules. For board-to-board and other in-box system interconnect, there are also two Avago MiniPOD optical receiver/transmitters, each with twelve TX and RX lanes for an aggregate bandwidth of 120Gbps, alongside the board’s x16 PCIe Gen3 interface.
Hitech Global HTG-728 100G NIC PCIe Gen3 board
Here’s a block diagram of the board:
Hitech Global HTG-728 100G NIC Block Diagram
Clearly, the board is extracting a lot of benefit from the Virtex-7 H580T FPGA’s eight 28.05Gbps GTZ and forty-eight 13.1Gbps GTH SerDes ports.
Alpha Data ADM-PCIE-7V3 PCIe accelerator board based on a Xilinx Virtex-7 FPGA
CAPI as implemented on IBM’s POWER8 systems provides a high-performance way to implement client-specific, computation-heavy algorithms on an FPGA. These accelerated algorithms can replace application programs running on a POWER8 processor core. Using CAPI, POWER8 systems can treat the FPGA-based accelerator as a coherent peer to the POWER8 processors. Because of CAPI's peer-to-peer coherent relationship with the POWER8 processors, data-intensive programs are easily offloaded to the FPGA and these offloaded functions operate as part of the application, which results in higher system performance with a much smaller programming investment. In IBM’s view, this approach allows hybrid computing to be successful across a much broader range of applications.
Here’s a block diagram of the way this all works:
Alpha Data CAPI Acceleration Development Kit Hardware Block Diagram
The Alpha Data CAPI Acceleration Development Kit includes the PSL (Power Service Layer), which resides on the FPGA and provides the infrastructure connection to the POWER8 chip; examples of user-defined AFUs (Accelerator Function Units); as well as CAPI-specific OS Kernel extensions and library functions. This kit includes all the components needed to significantly reduce development time.
Note: This announcement is part of the OpenPOWER Summit taking place today in Beijing.
“At data rates above about 10 Gbits/s, the frequency response and impedance mismatches from the transmitting end of one SERDES (serializer-deserializer) to the receiving end of another SERDES causes eye-closing ISI (inter-symbol interference). The combination of pre/de-emphasis at the transmitter and equalization at the receiver fixes enough of that ISI to reopen the eye so it can operate at a reasonable BER (bit error ratio). The receiver usually employs two types of equalization: CTLE (continuous time linear equalization) at its input and DFE (decision feedback equalization) that feeds back ISI corrections following identification of 1s and 0s by the decision circuit.”
That’s how Ransom Stephens’ EDN article “Why FEC plays nice with DFE” starts out. This readable article discusses the use of RS-FECs (Reed-Solomon Forward Error Correction) to trade off a bit of bandwidth for a big BER reduction—especially handy when dealing with high-speed serial communications like 100G Ethernet.
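The size of that bandwidth trade is easy to quantify. Here’s a minimal sketch using RS(528,514), the Reed-Solomon code IEEE 802.3 specifies for several 100G Ethernet PHYs (the specific code choice here is for illustration):

```python
# Bandwidth cost of Reed-Solomon FEC: a few percent of line rate buys a
# large BER improvement. RS(n, k) sends n symbols for every k data symbols.
def rs_overhead(n: int, k: int) -> float:
    """Fractional rate overhead of an RS(n, k) code."""
    return (n - k) / k

# RS(528, 514), used by several 100G Ethernet PHYs, corrects up to
# (n - k) // 2 symbol errors per codeword.
n, k = 528, 514
overhead = rs_overhead(n, k)
correctable = (n - k) // 2

print(f"overhead: {overhead:.2%}")                   # ~2.72% of bandwidth
print(f"correctable symbol errors: {correctable}")   # 7 per codeword
```

Less than 3% of the link rate spent on parity buys several orders of magnitude of BER improvement, which is exactly the trade the article describes.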
Last week, Xilinx CTO and Senior Vice President Ivo Bolsens presented two Ross Freeman Awards for Technical Innovation to the hardware and software innovation teams responsible for developing the relevant technologies. I interviewed Ivo after the awards to get more detail. This blog contains Ivo’s remarks about the hardware award, for the 20nm GTY 30.5Gbps SerDes in Xilinx Virtex UltraScale devices.
While some applications need a wideband front end, others require the ability to filter and tune to a narrower band of spectrum. It can be inherently inefficient for an ADC to sample, process and burn the power to transmit a wideband spectrum, when only a narrow band is required in the application. An unnecessary system burden is created when the data link consumes a large bank of high-speed transceivers within a Xilinx FPGA, only to then decimate and filter the wideband data in subsequent processing. The Xilinx FPGA transceiver resources can instead be better allocated to receive the lower bandwidth of interest and channelize the data from multiple ADCs. Additional filtering can be done within the FPGA’s polyphase filter bank channelizer for frequency-division multiplexed (FDM) applications.
High-performance GSPS ADCs are now bringing the digital downconversion (DDC) function further up in the signal chain to reside within the ADC in a design solution based on Xilinx FPGAs. This approach offers several new design options to a high-speed system architect. However, because this function is relatively new to the ADC, there are design-related questions that engineers may have about the operation of the DDC blocks within GSPS ADCs. Let’s clear up some of the more common questions so that designers can begin using this new technique with more confidence.
WHAT IS DECIMATION? In the simplest definition, decimation is the method of observing only a periodic subportion of the ADC output samples while ignoring the rest. The result is to effectively reduce the sample rate of the ADC by downsampling. Sample decimation alone only reduces the effective sample rate; it does not correspondingly act as a low-pass filter. Without frequency translation and digital filtering, decimation will merely fold the harmonics of the fundamental and other spurious signals on top of one another in the frequency domain.
WHAT IS THE ROLE OF THE DDC? Since decimation by itself does not prevent the folding of out-of-band signals, how does the DDC make this happen? To get the full performance benefit of DDCs, the design must also contain a filter-and-mixer component that’s used as a companion to the decimation function. Digital filtering effectively removes the out-of-band noise from the narrowly defined bandwidth that is set by the decimation ratio.
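A software sketch of the mix/filter/decimate pipeline shows why the filter matters (all frequencies, the filter design, and the decimation ratio here are illustrative assumptions, not figures from any particular ADC):

```python
import numpy as np

# A DDC in three steps: (1) mix the band of interest to baseband, (2)
# low-pass filter to reject everything outside the reduced Nyquist band,
# (3) keep every Mth sample. Decimating without steps 1-2 would fold the
# out-of-band tone right on top of the wanted one.
fs = 1024.0                 # input sample rate
M = 8                       # decimation ratio -> output rate fs/M = 128
n = np.arange(4096)
wanted = np.exp(2j * np.pi * 200.0 * n / fs)    # in-band tone
blocker = np.exp(2j * np.pi * 330.0 * n / fs)   # out-of-band tone
x = wanted + blocker

# (1) NCO at 190 Hz moves the 200 Hz tone down to 10 Hz
mixed = x * np.exp(-2j * np.pi * 190.0 * n / fs)

# (2) windowed-sinc low-pass FIR, 40 Hz cutoff (inside fs/(2M) = 64 Hz);
# the blocker now sits at 140 Hz and would otherwise fold to 12 Hz
taps = 129
t = np.arange(taps) - (taps - 1) / 2
h = np.sinc(2 * 40.0 / fs * t) * np.hamming(taps)
h /= h.sum()
filtered = np.convolve(mixed, h, mode="same")

# (3) decimate and locate the surviving tone
y = filtered[::M]
freqs = np.fft.fftfreq(y.size, d=M / fs)
peak_hz = freqs[np.argmax(np.abs(np.fft.fft(y)))]
print(peak_hz)   # only the wanted tone survives, now at 10 Hz
```

Delete steps (1) and (2) from the sketch and the blocker’s alias lands within 2 Hz of the wanted tone, which is exactly the folding problem the DDC exists to prevent.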
HOW WIDE SHOULD THE DDC FILTERS BE? The decimation ratios for DDCs are typically based on integer factors that are powers of 2 (2, 4, 8, 16, etc.). However, the decimation factor could actually be any ratio based on the DDC architecture, including fractional decimation.
The decimation of the ADC samples removes the need to send unwanted information downstream in the signal chain only to be discarded anyway. Because this data is filtered out, the output data bandwidth needed on the back end of the ADC is reduced. That reduction is partially offset by the doubling of data from the complex I/Q output. For example, a decimate-by-16 filter producing both I and Q data reduces the wideband output data by a net factor of 8.
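That bookkeeping can be captured in one line (the factor-of-2 term assumes a complex I/Q output):

```python
# Net output-bandwidth reduction of a DDC: decimation divides the sample
# rate by M, but a complex (I/Q) output doubles the data per sample.
def ddc_data_reduction(decimation: int, complex_output: bool = True) -> float:
    samples_per_output = 2 if complex_output else 1
    return decimation / samples_per_output

print(ddc_data_reduction(16))   # 8.0 -- the decimate-by-16 example above
print(ddc_data_reduction(8))    # 4.0
```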
This minimized data rate reduces the complexity of system layout by lowering the number of output JESD204B lanes from the ADC. The reduction in ADC output bandwidth can allow the design of a compact system that otherwise may not be achievable. For example, the use of a single decimate-by-8 DDC allows the same Xilinx Artix-7 FPGA system to support four times more ADCs by reducing the output bandwidth of the ADCs to just two output data lanes.
Note: This blog post is a short excerpt from a much larger technical article that appeared in the most recent issue of Xcell Journal. To read Ian Beavers’ full article with pages of technical details on digital downconversion in GSPS ADCs, see “Rethinking Digital Downconversion in Fast, Wideband ADCs.”
For many engineers, working in the frequency domain does not come as naturally as working within the time domain, probably because the frequency domain is associated with complicated mathematics. However, to unlock the true potential of Xilinx FPGA-based solutions, you need to feel comfortable working within both of these domains.
Depending upon the type of signal—repetitive or nonrepetitive, discrete or nondiscrete—there are a number of methods you can use to convert between time and frequency domains, including Fourier series, Fourier transforms and Z transforms. Within electronic signal processing and FPGA applications in particular, you will most often be interested in one transform: the discrete Fourier transform (DFT), which is a subset of the Fourier transform. Engineers use the DFT to analyze signals that are periodic and discrete—that is, they consist of a number of n-bit samples evenly spaced at a sampling frequency that in many applications is supplied by an ADC within the system.
From telecommunications to image processing, radar and sonar, it is hard to think of a more powerful and adaptable analysis technique to implement within an FPGA than the Fourier transform. Indeed, the DFT forms the foundation for one of the most commonly used FPGA-based applications: It is the basis for generating the coefficients of the finite impulse response (FIR) filter (see Xcell Journal issue 78, “Ins and Outs of Digital Filter Design and Implementation”).
However, its use is not just limited to filtering. The DFT and IDFT are used in telecommunications processing to perform channelization and recombination of the telecommunication channels. In spectral-monitoring applications, they are used to determine what frequencies are present within the monitored bandwidth, while in image processing the DFT and IDFT are used to handle convolution of images with a filter kernel to perform, for example, image pattern recognition. All of these applications are typically implemented using a more efficient algorithm to calculate the DFT than a direct evaluation of the DFT equation.
All told, the ability to understand and implement a DFT within your FPGA is a skill that every FPGA developer should have.
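As a concrete reference point, here is a direct O(N²) evaluation of the DFT sum, checked against NumPy’s FFT, which computes the identical result in O(N log N)—the efficiency gap that makes FFT cores the practical choice in an FPGA:

```python
import numpy as np

# Direct evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N).
def dft(x):
    x = np.asarray(x, dtype=complex)
    N = x.size
    n = np.arange(N)
    k = n.reshape((N, 1))
    return np.exp(-2j * np.pi * k * n / N) @ x   # N x N matrix-vector product

x = np.random.default_rng(0).standard_normal(64)
# Same answer as the FFT, at vastly greater cost for large N.
assert np.allclose(dft(x), np.fft.fft(x))
print("direct DFT matches FFT")
```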
Note: This blog post is a short excerpt from a much larger technical article that appeared in the most recent issue of Xcell Journal. To read Adam Taylor’s full article with pages of technical details on implementing FFTs in FPGAs, see “Coming to Grips with the Frequency Domain.”
SDx Central’s just-published article “P4 Language Aims to Take SDN Beyond OpenFlow” reports on the new P4 open-source, data-plane programming language from the just-launched P4 Language Consortium, started by Jennifer Rexford and Nick McKeown. While OpenFlow is a table-driven method for describing and programming packet processing, P4 is a programming language, which provides more flexible ways of controlling networking equipment.
Professor Rexford, from Princeton University, has been involved with the Open Networking Foundation since its early days. McKeown, a professor at Stanford University, helped birth the OpenFlow protocol and sparked the current groundswell of SDN developments.
Xilinx is an industry member of the P4 Language Consortium.
The latest release of the Xilinx SDAccel development environment, 2015.1, boasts many enhancements to accelerate the development of OpenCL, C, and C++ applications. There has also been significant expansion along three other dimensions: the addition of four new hardware development platforms, four new SDAccel-optimized libraries, and the creation of a new design-services ecosystem with six initial member companies.
Among the new SDAccel-optimized libraries are the:
AuvizLA for BLAS (Basic Linear Algebra Subprograms) applications
AuvizDNN for Machine Learning DNN (deep neural network) applications
ArrayFire open-source Machine Learning and OpenCL library (formerly available only for GPU acceleration)
Finally, recognizing that some companies would like design help in developing FPGA-accelerated data-center applications in OpenCL, C, and C++, Xilinx has formed a global ecosystem of Xilinx Alliance Members offering relevant design services for SDAccel. The new design services members of the Xilinx Alliance include:
We recently used Xilinx’s SDAccel development environment to compile and optimize a video-watermarking application written in OpenCL for an FPGA accelerator card. Video content providers use watermarking to brand and protect their content. Our goal was to design a watermarking application that would process high-definition (HD) video at a 1080p resolution with a target throughput of 30fps running on an Alpha Data ADM-PCIE-7V3 card.
The SDAccel development environment enables designers to take applications captured in OpenCL and compile them to an FPGA without requiring knowledge of the underlying FPGA implementation tools. The video-watermarking application serves as a perfect way to introduce the main optimization techniques available in SDAccel.
The main function of the video-watermarking algorithm is to overlay a logo at a specific location on a video stream. The logo used for the watermark can be either active or passive. An active logo is typically represented by a short, repeating video clip, while a passive logo is a still image. The most common technique among broadcasting companies that brand their video streams is to use a company logo as a passive watermark, so that was the aim of our example design. The application inserts a passive logo with pixel-by-pixel granularity.
The input and output frames are two-dimensional arrays in which pixels are expressed using the YCbCr color space. In this color space, each pixel is represented in three components: Y is the luma component, Cb is the chroma blue-difference component and Cr is the chroma red-difference component. Each component is represented by an 8-bit value, resulting in a total of 24 bits per pixel.
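A minimal sketch of such a per-pixel passive overlay (the blend factor, mask convention, and function name here are illustrative assumptions, not the article’s exact equations):

```python
import numpy as np

# Wherever the logo mask is set, blend the logo into the frame; elsewhere
# pass the input pixel through. Each pixel carries three 8-bit components
# (Y, Cb, Cr), matching the 24-bit-per-pixel format described above.
def watermark(frame, logo, mask, alpha=0.5):
    """frame, logo: HxWx3 uint8 YCbCr arrays; mask: HxW bool array."""
    blended = (1 - alpha) * frame + alpha * logo
    out = np.where(mask[..., None], blended, frame)
    return out.astype(np.uint8)

frame = np.full((4, 4, 3), 100, dtype=np.uint8)
logo = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:2, :2] = True                      # logo occupies the top-left corner
out = watermark(frame, logo, mask)
print(out[0, 0, 0], out[3, 3, 0])        # 150 100
```

Because every pixel is independent, this operation pipelines and parallelizes naturally in FPGA fabric, which is what makes it a good SDAccel optimization exercise.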
The system on which we executed the application is shown in the figure below. It is composed of an Alpha Data ADM-PCIE-7V3 card communicating with an x86 processor over a PCIe link. In this system, the host processor retrieves the input video stream from disk and transfers it to the device global memory. The device global memory is the memory on the FPGA card that is directly accessible from the FPGA. In addition to placing the video frames in device global memory, the logo and mask are transferred from the host to the accelerator card and placed in on-chip memory to take advantage of the low latency of BRAM memories. The code that runs on the host processor is responsible for sending a video frame to the FPGA accelerator card, launching the accelerator and then retrieving the processed frame from the FPGA accelerator card.
System Overview for the Video Watermarking Application
The optimizations necessary when creating applications like this one using SDAccel are software optimizations. Thus, these optimizations are similar to the ones required to extract performance from other processing fabrics, such as GPUs. As a result of using SDAccel, the details of getting the PCIe link to work, drivers, IP placement and interconnect became a non-issue, allowing us as designers to focus solely on the target application.
Alpha Data, the maker of the FPGA-based Alpha Data ADM-PCIE-7V3 PCIe accelerator card discussed in this article, has joined the OpenPOWER Foundation. The Foundation is a group of technology companies working collaboratively to build advanced server, networking, storage, and acceleration technology, as well as industry-leading open-source software, aimed at delivering more choice, control, and flexibility to developers of next-generation hyperscale and cloud data centers.
As the infamous saying goes, you can’t be too rich or too thin. Whether or not that’s true, in the world of network switching you truly can’t have too many Ethernet ports. Time was, the number of Gigabit Ethernet ports you could have on one FPGA was limited to the number of SerDes ports on the device. That’s no longer true. With the advent of Xilinx UltraScale All Programmable devices, you can now use low-power LVDS SelectIO pins (in addition to SerDes transceivers) for 1000Base-X Ethernet ports. If you find that assertion tough to swallow, here’s a 5-minute video, complete with a technical explanation, eye diagrams, and J-BERT jitter histograms, showing that the UltraScale LVDS SelectIO pins deliver far better low-jitter I/O performance than Gigabit Ethernet requires, even with nearly all of the FPGA’s on-chip logic resources toggling:
How many ports can you get on one FPGA using LVDS SelectIO pins to implement Gigabit Ethernet on Xilinx UltraScale devices? Well, of course, that depends on the size of the device. I’m told that it’s certainly possible to fit 40 Gigabit Ethernet ports on one Kintex UltraScale KU040 device. That’s the second smallest Kintex UltraScale FPGA, the one that entered full production last December (see “First Kintex UltraScale FPGA enters full production. Two dev boards now available to help you design advanced new systems”). My quick calculations show that 40 ports worth of Gigabit Ethernet won’t come close to filling the device even with the MACs. UltraScale devices really do alter reality when it comes to system-design assumptions.
So, would 40 fully configurable, low-jitter Gigabit Ethernet ports on one chip help you with your next design?
Note: The maximum number of differential HP I/O pairs on an UltraScale KU040 FPGA is 192.
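A quick pin-budget check against the note above (assuming each 1000Base-X port consumes one LVDS TX pair and one LVDS RX pair — my assumption for illustration, not a Xilinx figure):

```python
# Rough LVDS pin budget for Gigabit Ethernet on a KU040.
hp_pairs = 192          # differential HP I/O pairs on a KU040 (see note)
pairs_per_port = 2      # one TX pair + one RX pair per 1000Base-X port
max_ports = hp_pairs // pairs_per_port
print(max_ports)        # 96 -- so 40 ports leaves more than half the pairs free
```

On pins alone the 40-port figure is comfortable; the practical limit is more likely logic resources for the MACs, which the text says also fit easily.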
Mission-critical enterprise servers often use specialized hardware for application acceleration, including graphics processing units (GPUs) and digital signal processors (DSPs). Xilinx’s new SDAccel development environment removes programming as a gating issue to FPGA utilization in this application by providing developers with a familiar CPU/GPU-like environment.
Convolutional Neural Networks (CNNs) and deep learning are revolutionizing all sorts of recognition applications from image and speech recognition to big data mining. Baidu’s Dr. Ren Wu, a GPU application pioneer, gave a keynote at last week’s Embedded Vision Summit 2015 announcing worldwide accuracy leadership in analyzing the ImageNet Large Scale Visual Recognition Challenge data set using Baidu’s GPU-based deep-learning CNN. (See “Baidu Leads in Artificial Intelligence Benchmark” and Baidu’s paper.) GPUs are currently the implementation technology of choice for CNN researchers—because of their familiar programming model—but GPUs have prohibitive power consumption. Meanwhile and also at the Embedded Vision Summit, Auviz Systems founder and CEO Nagesh Gupta presented results of related work on image-processing CNNs. Auviz Systems has been developing FPGA-based middleware IP for data centers that cuts application power consumption.
This week at the Embedded Vision Summit, TeraDeep demonstrated real-time video classification from streaming video using its deep-learning neural network IP running on a Xilinx Kintex-7 FPGA, the same FPGA fabric you find in a Zynq Z-7045 SoC. Image-search queries run in data center servers usually use CPUs and GPUs, which consume a lot of power. Running the same algorithms on a properly configured FPGA can reduce the power consumption by 3x-5x according to Vinayak Gokhale, a hardware engineer at TeraDeep, who was running the following demo in the Xilinx booth at the event:
Note that this demo can classify the images using as many as 40 categories simultaneously without degrading the real-time performance.
Renesas is using its R8A20686BG-G 80Mbit Dual-Port Interlaken-LA TCAM to store search data. The device is designed for large table searches; is capable of performing 2 billion searches/sec; supports 80-, 160-, 320- and 640-bit search keys; and connects to a packet processor using 12-lane 10.3125/12.5Gbps Interlaken serial ports.
In this demo, the Renesas TCAM is connected to a Xilinx Eval Board carrying a Xilinx Virtex-7 FPGA. A custom Programmable Packet Processor created by the Xilinx SDNet development environment generates and feeds search keys to the Renesas device, which performs the searches in real time and passes search results back to the Programmable Packet Processor. A MicroBlaze RISC processor instantiated in the Virtex-7 FPGA handles table maintenance in the Renesas TCAM.
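The TCAM’s behavior can be sketched in a few lines of software — a sequential model of what the hardware does across all entries in parallel (the entries, masks, and priorities below are made up for illustration):

```python
# A TCAM matches a search key against stored (value, care-mask) entries
# and returns the first (highest-priority) hit. Bits where the mask is 0
# are "don't care".
def tcam_lookup(entries, key):
    """entries: list of (value, mask, result) tuples; returns result or None."""
    for value, mask, result in entries:
        if (key & mask) == (value & mask):
            return result
    return None

# Longest-prefix-style route table: the /24 entry shadows the /8 entry.
entries = [
    (0xC0A80100, 0xFFFFFF00, "port 3"),   # 192.168.1.0/24
    (0xC0000000, 0xFF000000, "port 1"),   # 192.0.0.0/8
]
print(tcam_lookup(entries, 0xC0A80105))   # port 3
print(tcam_lookup(entries, 0xC0FFEE00))   # port 1
```

The hardware advantage is that the Renesas part evaluates every stored entry against the key simultaneously, which is how it sustains 2 billion searches per second.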
Here’s a photo of the working demo system:
The board on the left is the Xilinx Virtex-7 Eval Board and the board on the right has the Renesas S-series Network Search Engine IC. The boards are linked through a high-speed CFP cable, appearing at the bottom of the photo. The rainbow ribbon cable between the boards carries a low-speed housekeeping connection employed by the MicroBlaze processor instantiated in the Virtex-7 FPGA.
Emerging RF-class data converters—namely, RF DACs and RF ADCs—architecturally make it possible to create compact multiband transceivers. But the nonlinearities inherent in these new devices can be a stumbling block. For instance, nonlinearity of the RF devices has two faces in the frequency domain: in-band and out-of-band. In-band nonlinearity refers to the unwanted frequency terms within the TX band, while out-of-band nonlinearity consists of the undesired frequency terms outside the TX band.
Here at Bell Labs Ireland, we have created a flexible software-and-hardware platform to rapidly evaluate RF DACs that are potential candidates for next-generation wireless systems. The three key elements of this R&D project are a high-performance Xilinx FPGA, Xilinx intellectual property (IP), and MATLAB. We tried to minimize the FPGA resource usage while keeping the system as flexible as possible. A system block diagram appears below:
We picked the latest Analog Devices RF-DAC evaluation boards (AD9129 and AD9739A) and the Xilinx ML605 evaluation board. The ML605 board comes with a Virtex-6 XC6VLX240T-1FFG1156 FPGA device, which contains fast-switching I/Os (up to 710 MHz) and SerDes units (up to 5 Gbps) for interfacing the RF DACs.
The FPGA portion of the design includes a clock distribution unit, a state machine-based system control unit and a DDS core-based multitone generation unit, along with two units built around Block RAM: a small BRAM-based control message storage unit (cRAM core) and a BRAM array-based user data storage unit (dRAM core).
The clock is the life pulse of the FPGA. In order to ensure that multiple clocks are properly distributed across FPGA banks, we chose Xilinx’s clock-management core, which provides an easy, interactive way of defining and specifying clocks. A compact instruction core built around a state machine serves as the system control unit.
We designed two testing strategies: a continuous-wave (CW) signal test (xDDS) and a wideband signal test (xRAM). Multitone CW testing has long been the preferred choice of RF engineers for characterizing the nonlinearity of RF components. Keeping the same testing philosophy, we created a tunable four-tone logic core based on a direct digital synthesizer (DDS), which actually uses a pair of two-tone signals to stimulate the RF DAC in two separate frequency bands. By tuning the four tones independently, we can evaluate the linearity performance of the RF DAC—that is, the location and the power of the intermodulation spurs in the frequency domain. CW signal testing is an inherently narrowband operation. To further evaluate the RF DAC regarding wideband performance, we need to drive it with concurrent multiband, multimode signals, such as dual-mode UMTS and LTE signals at 2.1 GHz and 2.6 GHz, respectively.
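A software sketch of the four-tone stimulus — the phase-accumulator form is the standard DDS construction; the frequencies below are illustrative assumptions, chosen to land on exact FFT bins:

```python
import numpy as np

# A DDS derives each tone from a phase accumulator: the phase advances by
# 2*pi*f/fs per sample and the amplitude is sin(phase). The four-tone CW
# stimulus (a pair of two-tone signals, one pair per band) is the sum of
# four such tones.
def dds_tone(freq_hz, fs_hz, n_samples):
    phase = 2 * np.pi * freq_hz / fs_hz * np.arange(n_samples)
    return np.sin(phase)

fs = 8192.0
n = 8192
tones_hz = [2000.0, 2100.0, 2600.0, 2700.0]   # two tones in each of two bands
stimulus = sum(dds_tone(f, fs, n) for f in tones_hz)

# The spectrum contains exactly the four programmed tones.
spec = np.abs(np.fft.rfft(stimulus))
peaks = np.sort(np.argsort(spec)[-4:])
print(peaks)   # [2000 2100 2600 2700]
```

An ideal DAC would reproduce only these four lines; intermodulation spurs at sums and differences of the tone frequencies are the nonlinearity signature the test is designed to expose.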
We chose MATLAB as the software host, simply because it has many advantages in terms of digital signal processing (DSP) capability. What’s more, MATLAB also provides a handy tool called GUIDE for laying out a graphical user interface (GUI). The figure below illustrates the GUI that we created for the platform:
Note: This blog is an excerpt. To read the full article in the latest issue of Xcell Journal, click here.
Speed is the name of the game for digital radio design and for many other high-speed systems as well. No surprise there. It should also not be a surprise that there are special device features and design techniques that yield more performance—sometimes a lot more performance—if only you know what to use and how. A new on-demand Xilinx video Webinar that gives you this know-how has just been posted, and it’s free.
For example, Pecot discusses specific ways to optimize FPGA-based system implementations—starting at the architectural level—to get maximum clock rates and maximum performance with exceptional resource utilization. From the Webinar:
An Ethernet-based technology called White Rabbit, born at CERN, the European Organization for Nuclear Research, promises to meet the precise timing needs of high-speed, widely distributed applications including 100G Ethernet and 5G mobile telecom networks, smart grids, high-frequency trading, and geopositioning systems. Named after the time-obsessed rabbit in Alice in Wonderland, White Rabbit is based on, and is compatible with, standard mechanisms such as PTPv2 (IEEE-1588v2) and Synchronous Ethernet, but is properly modified to achieve subnanosecond accuracy. White Rabbit inherently performs self-calibration over long-distance links and is capable of distributing time to a very large number of devices with very small degradation.
From the very beginning, Seven Solutions, based in Granada, Spain, has collaborated in the design of White Rabbit products including not only the electronics but also the firmware and gateware. The company also provides customization and turnkey solutions based on this technology. As an extension of Ethernet, White Rabbit technology is being evaluated for possible inclusion in the next Precision Time Protocol standard (IEEE-1588v3) in the framework of a high-accuracy profile. Standardization would facilitate WR’s integration with a wide range of diverse technologies in the future.
We introduced the concept of the IP stack in the previous instalment of the MicroZed Chronicles. (See “Adam Taylor’s MicroZed Chronicles Part 79: Zynq SoC Ethernet Part III.”) The next step is to use this stack in our design. The SDK development environment gives us the ability to include a lightweight IP stack (lwIP) when we create a BSP. lwIP is an open-source IP stack used in a number of embedded systems. It was originally developed by the Swedish Institute of Computer Science to reduce the resources required to implement an IP stack.
The SSD Guy, Jim Handy, reports that Baidu—China’s leading search engine company—has created a radical SSD architecture called “Software-Defined Flash” (SDF) that maximally exploits the inherent performance of the SSD’s Flash memory chips. It does this by exposing the Flash chips’ individual channels to host software running on a server, allowing the server to organize its own data and better schedule data accesses. The experimenters report that their design approach extracts 95% of the raw bandwidth from the Flash chips while making 99% of the Flash memory capacity available for user data. These results appear in a paper titled “SDF: Software-Defined Flash for Web-Scale Internet Storage Systems,” presented at the ASPLOS conference held last year in Salt Lake City.
Baidu’s data centers store hundreds of petabytes of data, with a daily data processing volume reaching dozens of petabytes. With ever-increasing performance requirements for storage systems, Baidu is discovering that hard disks and conventional SSDs are becoming inadequate. The SDF represents Baidu’s latest effort to increase data-center performance.
A radically different Flash-based storage architecture like the SDF presented in this paper cannot use conventional SSD controller ASSPs, so the development team created a custom SDF architecture and built a PCIe SDF board based on Xilinx Virtex-5 and Spartan-6 FPGAs as shown in the diagram below:
Baidu SDF controller architecture
The Virtex-5 FPGA implements the SDF’s PCIe DMA and interface, performs chip-to-chip bridging among the FPGAs, and serves as the master controller for four Spartan-6 FPGAs. Each Spartan-6 FPGA implements eleven channels of independent Flash translation layer (FTL) control for a total of 44 channels per board. Each Flash channel controls two commodity Micron 8Gbyte MLC Flash chips, resulting in a total capacity of 704Gbytes per SDF board.
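The channel and capacity figures above multiply out as follows:

```python
# Channel and capacity arithmetic for the SDF board described above.
spartan6_count = 4        # Spartan-6 FPGAs mastered by the Virtex-5
channels_per_fpga = 11    # independent FTL channels per Spartan-6
chips_per_channel = 2     # Micron 8GB MLC Flash chips per channel
gbytes_per_chip = 8

channels = spartan6_count * channels_per_fpga
capacity_gb = channels * chips_per_channel * gbytes_per_chip
print(channels, capacity_gb)   # 44 704
```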
The design team expended significant effort to reduce board costs by including only required features, unlike commercial SSDs, which are designed for general-purpose storage and varying workloads. For example, the SDFs are mainly used as hard-disk caches, and because rarely accessed data is not expected to reside in the cache for long, the SDF does not conduct static wear leveling.
In addition, the SDF hardware does not include a DRAM cache while conventional SSDs usually include a large DRAM cache to reduce access latency. Data consistency can be compromised by a power failure when the data resides in a DRAM cache, so a cached controller design must include a battery or capacitor to prevent data loss by making it possible to ride through power outages or by allowing time to shut down gracefully. However, batteries and capacitors add hardware cost. In Baidu’s storage infrastructure, server host memory caches recently accessed data, so the SDF design eliminates the DRAM cache and the associated costs of that cache memory.
The Baidu experimenters conducted a number of benchmark tests on the SDF design and the paper referenced above provides several pages of detailed results. The paper concludes:
“Our experimental measurements show that SDF can deliver about 95% of the raw Flash bandwidth and provide 99% of the Flash capacity for user data. SDF increases the I/O bandwidth by 3 times and reduces per-GByte hardware cost by 50% on average compared with Baidu’s commodity-SSD-based system.”
Over the last two instalments of this blog we have looked at the Ethernet MACs (Media Access Controllers) within the Zynq SoC’s PS (processor system), including an in-depth exploration of a MAC usage example. The Ethernet MAC is a fundamental building block that allows us to implement an IP stack and thus create network-enabled solutions for our engineering challenges.
Would you like to brush up on SDN fundamentals? Let Gordon Brebner, a Distinguished Engineer at Xilinx, help. Gordon has been working on FPGA-based networking hardware for 25 years (!) and he’s got this stuff nailed. There’s nothing like training from someone who really knows the topic and Gordon compresses a lot of knowledge into a short, easily digested tutorial. Just click on his 30-minute video below.
Note: There are a few audio gaps in this video; Gordon suggests you just imagine the words that are missing.
You can implement several key wireless applications including radio and wireless backhaul very efficiently using Zynq SoC devices. Radio applications are especially good examples of this, where the Zynq SoC with both on-chip processor cores and programmable logic can implement fully-integrated hardware and software systems that handle all digital front-end processing. Every wireless application has different performance requirements and needs an appropriate OS.
In the previous blog we introduced the Zynq SoC’s Gigabit Ethernet Controller, which provides Media Access Controller (MAC) capability. This is the first step in being able to establish an IP stack. Now we will look at how we can configure the MAC to send and receive packets using the example provided by Xilinx with the SDK, which demonstrates how the MAC works.