
Achieving optimal timing performance by automatic pipelining of a URAM matrix in Vivado Synthesis

Xilinx Employee

Introduction:

UltraRAM primitives, also referred to as URAMs, are available in Xilinx UltraScale+™ Architecture and can be used to efficiently implement large and deep memory.

Typically, such memories are not suitable for implementation using other memory resources due to their size and performance requirements.

The URAM primitives have configurable pipeline attributes and dedicated cascade connections that enable high-speed memory access. Pipeline stages and cascade connections are configured using attributes on the primitives.

This blog entry describes methods for achieving optimal timing performance by configuring the URAM matrix to use pipeline registers.

Note: this article was co-authored by Pradip K Kar, Satyaprakash Pareek, and Chaithanya Dudha.

The Need For Pipelining:

A large and deep memory is implemented from available URAM primitives by connecting several URAMs in a matrix structure.

The matrix consists of rows and columns of URAMs. The URAMs in one column are cascaded using a built-in cascading circuit, and several columns of URAMs are interconnected via an external cascading circuit, referred to as a horizontal cascade circuit.

As an example, Figure 1 shows a matrix decomposition into a 4x4 URAM matrix for a 64K deep x 72-bit wide memory.

Fig. 1: A URAM matrix of 4 rows x 4 columns implementing a 64K deep x 72-bit wide memory
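As a rough sanity check on the matrix size, the sketch below counts how many URAM primitives a memory of a given shape needs, assuming the standard 4K deep x 72-bit wide UltraRAM block. The function name is hypothetical, and the way the tool arranges the blocks into rows and columns is decided by synthesis, not by this arithmetic.

```python
# Assumed size of one UltraRAM block: 4K deep x 72 bits wide (288 Kb).
URAM_DEPTH = 4096
URAM_WIDTH = 72


def uram_count(depth: int, width: int) -> int:
    """Return the number of URAM blocks needed for a depth x width memory."""
    blocks_deep = -(-depth // URAM_DEPTH)   # ceiling division
    blocks_wide = -(-width // URAM_WIDTH)   # ceiling division
    return blocks_deep * blocks_wide


if __name__ == "__main__":
    # The 64K x 72 example from this article.
    print(uram_count(64 * 1024, 72))
```

For the 64K x 72 memory this gives 16 blocks, which is consistent with the 4x4 matrix in Figure 1.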

Without pipelining, deep cascade structures result in large clock-to-out delays on memory accesses. For example, the URAM matrix above can achieve about 350 MHz by default. To access the memory at higher speeds, pipeline registers must be inserted. Vivado Synthesis does this automatically, provided that enough output latency (register stages) is specified in the netlist.

Specifying a Pipeline in an RTL Design:

There are two ways to specify the use of a pipeline in an RTL design, either by using the XPM flow, or by inferring the memory with behavioral RTL.

If the RTL design uses XPM to create URAM memory, the user can specify the pipeline requirement as a parameter on the XPM instance. The READ_LATENCY_A and READ_LATENCY_B parameters capture the latency requirement for the memory.

The number of pipeline stages available is the latency value minus two. For example, if the latency is set to ten, eight register stages are available for pipelining. The other two registers are used to create the URAM itself.
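The "latency minus two" rule can be written down as a one-line helper. This is only a sketch of the arithmetic described above. The function name is hypothetical, and clamping small latencies to zero is an assumption made here for illustration, not documented XPM behavior.

```python
def available_pipeline_stages(read_latency: int) -> int:
    """Pipeline stages left for the URAM matrix for a given READ_LATENCY value.

    Two registers are used to create the URAM itself, so they are subtracted.
    Clamping to zero for latencies below two is an assumption of this sketch.
    """
    return max(0, read_latency - 2)


if __name__ == "__main__":
    # The example from the text: a latency of ten leaves eight stages.
    print(available_pipeline_stages(10))
```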

Fig. 2: Using XPM to set pipelining

If the user creates the URAM by writing RTL based on the templates provided in the Vivado user guide, they can create as many register stages as needed at the output of the URAM. The only requirement is that the enable for the pipeline registers is pipelined along with the data.

Figure 3 shows the data and enable pipeline.

Fig. 3: Data and enable pipeline specification at the output of the URAM block

Figure 4 shows an example of pipelining a RAM in RTL.

Fig. 4: A Verilog template to specify data and enable pipelines
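To make the data-plus-enable idea concrete without the figure, here is a small cycle-level Python model of a RAM followed by a chain of enabled pipeline registers. It is a behavioral sketch only: it is not the Verilog template from Figure 4, and the class and signal names (PipelinedRamModel, mem_en, memreg) are assumptions chosen for illustration. Each data register loads only when the matching delayed copy of the read enable is set, which is the requirement described above.

```python
class PipelinedRamModel:
    """Behavioral model: RAM output register plus `stages` enabled registers."""

    def __init__(self, depth: int, stages: int):
        if stages < 1:
            raise ValueError("this sketch needs at least one pipeline stage")
        self.mem = [0] * depth
        self.stages = stages
        self.memreg = 0                      # register at the RAM read port
        self.en = [False] * (stages + 1)     # pipelined copies of mem_en
        self.data = [0] * stages             # pipelined copies of the data
        self.dout = 0                        # final output register

    def clock(self, mem_en: bool, addr: int = 0) -> int:
        """Advance one clock edge; all registers update from the old state."""
        memreg = self.mem[addr] if mem_en else self.memreg

        # The enable shifts one position per clock, alongside the data.
        en = [mem_en] + self.en[:-1]

        data = list(self.data)
        if self.en[0]:
            data[0] = self.memreg
        for i in range(self.stages - 1):
            if self.en[i + 1]:
                data[i + 1] = self.data[i]

        dout = self.data[-1] if self.en[self.stages] else self.dout

        self.memreg, self.en, self.data, self.dout = memreg, en, data, dout
        return self.dout


if __name__ == "__main__":
    ram = PipelinedRamModel(depth=16, stages=2)
    ram.mem[3] = 0xAB
    outputs = [ram.clock(mem_en=True, addr=3)]
    outputs += [ram.clock(mem_en=False) for _ in range(3)]
    print(outputs)
```

In this model the read data reaches dout on clock edge number stages + 2, counting the edge that issues the read as the first one. Because every register is gated by its own enable, a stage that has no valid data to take simply holds its previous value.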

Analyzing the Log file:

Vivado Synthesis issues different messages related to the pipelining of URAM depending on context and scenario. The table below illustrates some of the messages to look for in the vivado.log file and the corresponding action to take.

Note that the recommended number of pipeline stages is based on fully pipelining the matrix, which can achieve maximum performance (800 MHz+). The recommendation does not depend on the actual timing constraint.

Scenario: URAM with no pipeline
Message: WARNING: [Synth 8-6057] Memory: "uram00/ram1/mem_reg" defined in module: "top_sp_no_pipe" implemented as Ultra-Ram has no pipeline registers. It is recommended to use pipeline registers to achieve high performance
Action: Increase the latency or insert a few pipeline stages.

Scenario: URAM is severely under-pipelined
Message: CRITICAL WARNING: [Synth 8-6013] UltraRAM uram00/ram1/mem_reg is under-pipelined and may not meet performance target : Pipeline stages found = 1; Recommended pipeline stages =8
Action: Increase the latency or insert a few pipeline stages.

Scenario: URAM with reasonable pipelining
Message: INFO: [Synth 8-5813] UltraRAM uram00/ram1/mem_reg: Pipeline stages found = 4; Recommended pipeline stages =8
Action: Check whether timing is met. Increase the latency if performance is not met.

Scenario: URAM with more pipeline stages than needed
Message: INFO: [Synth 8-5813] UltraRAM uram00/ram1/mem_reg: Pipeline stages found = 10; Recommended pipeline stages =8
Action: Reduce the latency; otherwise FF utilization will increase significantly.

Scenario: Pipelining result
Message: INFO: [Synth 8-5814] Pipeline result for URAM (uram00/ram1/mem_reg): Matrix size= (4 cols x 4 rows) | Pipeline stages => ( available = 10, absorbed = 8 )
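If you want to check these messages from a script rather than by eye, a few lines of Python are enough. The sketch below pulls the "found" and "recommended" stage counts out of a log line and maps them to the actions listed above. The function names and the regular expression are assumptions written against the message text quoted in this article; the wording may differ between Vivado versions.

```python
import re

# Matches e.g. "Pipeline stages found = 4; Recommended pipeline stages =8"
_STAGES = re.compile(
    r"Pipeline stages found\s*=\s*(\d+);\s*"
    r"Recommended pipeline stages\s*=\s*(\d+)"
)


def parse_pipeline_message(line: str):
    """Return (found, recommended) from a log line, or None if absent."""
    match = _STAGES.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def suggest_action(found: int, recommended: int) -> str:
    """Map the stage counts to the actions described in this article."""
    if found < recommended:
        return "increase latency if timing is not met"
    if found > recommended:
        return "reduce latency to save flip-flops"
    return "fully pipelined"


if __name__ == "__main__":
    line = ("INFO: [Synth 8-5813] UltraRAM uram00/ram1/mem_reg: "
            "Pipeline stages found = 4; Recommended pipeline stages =8")
    found, recommended = parse_pipeline_message(line)
    print(found, recommended, suggest_action(found, recommended))
```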

 

 

 

Timing Performance Estimates:

The table below illustrates the relation between the number of pipeline registers and the maximum estimated frequency achievable.

Note that actual timing numbers will still depend on the final place & route result.

These numbers are based on a speed grade -2 Virtex® UltraScale+™ part and on our example project, a 64K x 72 URAM implemented as a 4x4 matrix.

 

| Pipeline stages | Pipelines absorbed in URAM | Pipeline resources used | Datapath delay on critical path (ns) | Estimated max frequency (MHz) |
| --- | --- | --- | --- | --- |
| 0 | 0 | N/A | 2.7 | 370 |
| 1 | 1/1 | OREG | 2.15 | 465 |
| 2 | 2/2 | OREG, FDRE | 1.632 | 612 |
| 4 | 4/4 | OREG, REGCAS, FDRE, IREG_PRE | 1.376 | 726 |
| 6 | 6/6 | OREG, REGCAS, FDRE, IREG_PRE | 1.376 | 726 |
| 8 | 8/8 | OREG, REGCAS, FDRE, IREG_PRE | 1.1 | 909 |
| 10+ | 8/10+ | OREG, REGCAS, FDRE, IREG_PRE | 1.1 | 909 |
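The estimated frequencies follow directly from the datapath delays: the maximum frequency is the reciprocal of the critical-path delay. The short sketch below reproduces the frequency column from the delay column. The function name is hypothetical, and truncating to whole MHz is an assumption made here because it matches the figures quoted in this article.

```python
def max_frequency_mhz(delay_ns: float) -> float:
    """Estimated maximum clock frequency in MHz for a critical-path delay in ns."""
    return 1000.0 / delay_ns


if __name__ == "__main__":
    # Datapath delays (ns) quoted in the article for 0, 1, 2, 4/6 and 8+ stages.
    for delay in (2.7, 2.15, 1.632, 1.376, 1.1):
        print(delay, int(max_frequency_mhz(delay)))
```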

 

The datapath delay has one or more of the following components:

Tco = 1.38 ns, clock to cascade out on URAM

Tco = 0.82 ns, clock to cascade out on URAM with OREG=true

Tco = 0.726 ns, clock to data out on URAM with OREG=true, CASCADE_ORDER = LAST

URAM -> URAM cascade delay = 0.2 ns

URAM -> LUT net delay = 0.3 ns

LUT propagation delay = 0.125 ns

LUT -> LUT net delay = 0.2 ns

LUT5 -> FF delay = 0.05 ns
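A path delay is simply the sum of the components that lie on it. The sketch below adds up one hypothetical path using the component values listed above. The choice of components on the path is an assumption made purely for illustration and is not taken from the timing reports behind the table, so the total is not expected to match any particular row.

```python
# Component delays in ns, as listed in this article.
DELAYS_NS = {
    "tco_cascade": 1.38,            # clock to cascade out
    "tco_cascade_oreg": 0.82,       # clock to cascade out, OREG=true
    "tco_dataout_oreg_last": 0.726, # clock to data out, OREG=true, LAST
    "uram_to_uram_cascade": 0.2,
    "uram_to_lut_net": 0.3,
    "lut_propagation": 0.125,
    "lut_to_lut_net": 0.2,
    "lut5_to_ff": 0.05,
}


def path_delay_ns(components) -> float:
    """Sum the named component delays along a path."""
    return sum(DELAYS_NS[name] for name in components)


if __name__ == "__main__":
    # Hypothetical path: last URAM output register, one LUT, then a flip-flop.
    path = ["tco_dataout_oreg_last", "uram_to_lut_net",
            "lut_propagation", "lut5_to_ff"]
    print(round(path_delay_ns(path), 3))
```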

Conclusion:

The URAM primitives are a powerful way to create very large RAM structures. They are designed to be easily cascaded to create even larger RAMs in your design.

However, cascading too many of these structures together can create a large delay through the RAM. Taking the time to fully pipeline your RAMs will yield benefits in the long run.

3 Comments
Scholar drjohnsmith
Lovely example. Examples amplify the text wonderfully and are much appreciated. I can guess what it means, but can we have a VHDL one as well as a Verilog one? Then I would not have to guess.
Xilinx Employee

Good idea.  Give me a few days to convert this over to VHDL and I will post.

Xilinx Employee

I have attached two files, test.v and test.vhd. test.v is the original file that was used for this article, and test.vhd is its VHDL equivalent. There are also many Language Templates in Vivado that show how to do this. In Vivado, click on Language Templates in the Flow Navigator and search for UltraRAM. Those examples are a little different from the ones here but accomplish the same type of functionality.