UltraRAM primitives, also referred to as URAMs, are available in the Xilinx UltraScale+™ architecture and can be used to efficiently implement large, deep memories.
Such memories are typically not suitable for implementation in other memory resources because of their size and performance requirements.
URAM primitives provide configurable pipeline stages together with dedicated cascade connections that enable high-speed memory access. Both the pipeline stages and the cascade connections are configured through attributes on the primitives.
This blog entry describes methods for achieving optimal timing performance by configuring the URAM matrix to use pipeline registers.
Note: this article was co-authored by Pradip K Kar, Satyaprakash Pareek, and Chaithanya Dudha.
The Need for Pipelining:
A large, deep memory is implemented by connecting several URAM primitives in a matrix structure.
The matrix consists of rows and columns of URAMs. The URAMs within a column are cascaded using the built-in cascade circuit, and the columns are interconnected through an external cascade circuit, referred to as the horizontal cascade circuit.
As an example, Figure 1 shows how a 64K-deep x 72-bit-wide memory is decomposed into a 4x4 URAM matrix.
Fig. 1: A URAM matrix of 4 rows x 4 columns implementing a 64K-deep, 72-bit-wide memory
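The 16 URAMs in Figure 1 follow from the primitive's geometry: a URAM288 stores 4K words of 72 bits. The Python sketch below works through that sizing arithmetic; the helper names are hypothetical. It counts the primitives needed and splits the address into the bits that index a word inside one URAM and the bits that select a URAM, but it does not model how Vivado chooses the rows-by-columns arrangement.

```python
import math

# URAM288 geometry: 4K words x 72 bits (288 Kb) per primitive.
URAM_DEPTH = 4096
URAM_WIDTH = 72


def uram_count(depth: int, width: int) -> int:
    """Minimum number of URAM288 primitives for a depth x width memory."""
    return math.ceil(depth / URAM_DEPTH) * math.ceil(width / URAM_WIDTH)


def address_split(depth: int) -> tuple[int, int]:
    """Return (local_bits, select_bits) for a memory of the given depth.

    local_bits index a word inside one URAM; select_bits pick which URAM
    in the matrix holds that word.
    """
    total_bits = math.ceil(math.log2(depth))
    local_bits = URAM_DEPTH.bit_length() - 1  # log2(4096) = 12
    return local_bits, max(total_bits - local_bits, 0)


print(uram_count(64 * 1024, 72))   # 16
print(address_split(64 * 1024))    # (12, 4)
```

For the 64K x 72 memory this gives 16 URAMs and a 16-bit address, of which the low 12 bits address a word inside a URAM and the upper 4 bits pick one of the 16 primitives in the matrix.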
Without pipelining, deep cascade structures result in long clock-to-out delays on memory reads. For example, the URAM matrix above reaches only about 350 MHz by default. To access the memory at higher speeds, pipeline registers must be inserted. Vivado Synthesis does this automatically, provided that enough output latency (register stages) is specified in the design.
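Pipelining raises the achievable clock frequency but adds cycles of read latency, so throughput and absolute latency move in different directions. The Python sketch below works through that arithmetic. It assumes, for illustration only, that the unpipelined matrix reads with 2 cycles of latency at the 350 MHz figure above and that a fully pipelined matrix uses 10 cycles (8 pipeline stages plus the 2 URAM registers) at 800 MHz; these operating points are hypothetical pairings, not measured results.

```python
def read_latency_ns(freq_mhz: float, latency_cycles: int) -> float:
    """Absolute read latency in nanoseconds for a clock frequency and cycle count."""
    period_ns = 1000.0 / freq_mhz
    return latency_cycles * period_ns


def reads_per_second(freq_mhz: float) -> float:
    """Peak throughput, assuming one read can be issued every clock cycle."""
    return freq_mhz * 1e6


# Hypothetical operating points (see the assumptions in the text above).
operating_points = [("unpipelined", 350.0, 2), ("fully pipelined", 800.0, 10)]

for label, freq, cycles in operating_points:
    print(f"{label}: {read_latency_ns(freq, cycles):.2f} ns latency, "
          f"{reads_per_second(freq) / 1e6:.0f} M reads/s")
```

Under these assumptions the pipelined memory delivers more than twice the read throughput, while each individual read takes longer in wall-clock time, so pipelining pays off when the design streams reads rather than waiting on each one.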
Specifying a Pipeline in an RTL Design:
There are two ways to specify a pipeline in an RTL design: by instantiating the memory through the Xilinx Parameterized Macros (XPM) flow, or by inferring the memory from behavioral RTL.
If the design uses XPM to create the URAM memory, the pipeline requirement is specified as a parameter on the XPM instance. The parameter “READ_LATENCY_A/B” sets the read latency of the memory.
The number of pipeline stages available is the latency value minus two. For example, setting the latency to 10 makes eight register stages available for pipelining; the remaining two registers are used by the URAM itself.
Fig. 2: Using XPM to set pipelining
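The "latency minus two" rule is simple, but it is easy to be off by two when translating a recommended stage count into an XPM parameter value. The Python sketch below encodes only that rule; the helper names are hypothetical and nothing here calls a Vivado or XPM API. Rejecting latencies below 2 is an assumption of this sketch, not a documented XPM constraint.

```python
URAM_INTERNAL_REGS = 2  # registers used by the URAM itself, per the rule above


def available_pipeline_stages(read_latency: int) -> int:
    """Pipeline register stages available for a given READ_LATENCY_A/B value."""
    if read_latency < URAM_INTERNAL_REGS:
        raise ValueError("this sketch assumes READ_LATENCY >= 2")
    return read_latency - URAM_INTERNAL_REGS


def read_latency_for_stages(stages: int) -> int:
    """READ_LATENCY_A/B value needed to provide the requested pipeline stages."""
    if stages < 0:
        raise ValueError("stages must be non-negative")
    return stages + URAM_INTERNAL_REGS


print(available_pipeline_stages(10))  # 8
print(read_latency_for_stages(8))     # 10
```

For example, a synthesis message recommending 8 pipeline stages corresponds to a READ_LATENCY_A of 10.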
If the URAM is instead inferred from RTL written with the templates provided in the Vivado user guide, as many register stages as needed can be added at the URAM output. The only requirement is that the enable of the pipeline registers must be pipelined along with the data.
Figure 3 shows the data and enable pipeline.
Fig. 3: Data and enable pipeline specification at the output of the URAM block
Figure 4 shows an example of pipelining a RAM in RTL.
Fig. 4: A Verilog template to specify data and enable pipelines
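To see how the enable travels with the data, it can help to model the output registers cycle by cycle. The Python sketch below is a behavioral model only; it is not the Verilog template in Figure 4 and is not synthesizable, and the class name and interface are hypothetical. Each stage holds a data word and an enable bit: the enable bits shift forward one stage per clock, and a data register loads only when the enable of the stage feeding it is high, so a word and its enable reach the output together and idle cycles leave the data registers unchanged.

```python
class DataEnablePipeline:
    """Cycle-level model of N output register stages after a RAM read port."""

    def __init__(self, stages: int):
        if stages < 1:
            raise ValueError("need at least one stage")
        self.data = [0] * stages
        self.en = [False] * stages

    def clock(self, din: int, en_in: bool) -> tuple[int, bool]:
        """Apply one rising clock edge; return the last stage's (data, enable)."""
        # Compute the next state from the current state (non-blocking semantics).
        next_en = [en_in] + self.en[:-1]      # enable always shifts forward
        next_data = list(self.data)           # data holds unless enabled
        if en_in:
            next_data[0] = din
        for i in range(1, len(self.data)):
            if self.en[i - 1]:                # gated by the delayed enable
                next_data[i] = self.data[i - 1]
        self.data, self.en = next_data, next_en
        return self.data[-1], self.en[-1]


pipe = DataEnablePipeline(stages=3)
inputs = [(10, True), (0, False), (20, True), (0, False), (0, False), (0, False)]
outputs = [pipe.clock(d, e) for d, e in inputs]
valid = [d for d, e in outputs if e]
print(valid)  # [10, 20]
```

Each read enters on one clock and appears at the output, with its enable asserted, two clocks later, so downstream logic can use the final enable bit as a "data valid" qualifier.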
Analyzing the Log File:
Vivado Synthesis issues different messages about URAM pipelining depending on the context. The table below lists some of the messages to look for in the vivado.log file and the corresponding action to take.
Note that the recommended number of pipeline stages assumes a fully pipelined matrix, which achieves maximum performance (800 MHz+). The recommendation does not depend on the actual timing constraints of the design.
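Because these messages follow a fixed format, vivado.log can also be scanned programmatically. The Python sketch below is a minimal example under stated assumptions: the function name is hypothetical, and the regular expressions assume the message text matches the examples quoted in the table; other Vivado releases may word these messages differently.

```python
import re

# Patterns assume the exact message wording shown in this article's examples.
NO_PIPE = re.compile(
    r'\[Synth 8-6057\] Memory: "(?P<mem>[^"]+)" defined in module: '
    r'"(?P<module>[^"]+)" implemented as Ultra-Ram has no pipeline registers'
)
UNDER_PIPE = re.compile(
    r"\[Synth 8-6013\] UltraRAM (?P<mem>\S+) is under-pipelined.*?"
    r"Pipeline stages found = (?P<found>\d+); "
    r"Recommended pipeline stages =\s*(?P<rec>\d+)"
)


def scan_uram_messages(lines):
    """Yield (memory, stages_found, stages_recommended) per URAM pipeline message.

    Synth 8-6057 (no pipeline) does not print a recommendation, so it is
    reported with 0 stages found and None as the recommendation.
    """
    for line in lines:
        m = NO_PIPE.search(line)
        if m:
            yield m["mem"], 0, None
            continue
        m = UNDER_PIPE.search(line)
        if m:
            yield m["mem"], int(m["found"]), int(m["rec"])


# In practice: with open("vivado.log") as f: results = list(scan_uram_messages(f))
sample = [
    'WARNING: [Synth 8-6057] Memory: "uram00/ram1/mem_reg" defined in module: '
    '"top_sp_no_pipe" implemented as Ultra-Ram has no pipeline registers. '
    'It is recommended to use pipeline registers to achieve high performance',
    'CRITICAL WARNING: [Synth 8-6013] UltraRAM uram00/ram1/mem_reg is '
    'under-pipelined and may not meet performance target : '
    'Pipeline stages found = 1; Recommended pipeline stages =8',
]
results = list(scan_uram_messages(sample))
for mem, found, rec in results:
    print(mem, found, rec)
```

Combined with the "latency minus two" rule, a recommendation of 8 stages can be turned directly into an XPM READ_LATENCY value of 10.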
URAM with no pipeline
WARNING: [Synth 8-6057] Memory: "uram00/ram1/mem_reg" defined in module: "top_sp_no_pipe" implemented as Ultra-Ram has no pipeline registers. It is recommended to use pipeline registers to achieve high performance
Increase the latency or insert a few pipeline stages.
URAM is severely under-pipelined
CRITICAL WARNING: [Synth 8-6013] UltraRAM uram00/ram1/mem_reg is under-pipelined and may not meet performance target : Pipeline stages found = 1; Recommended pipeline stages =8