11-28-2018 10:35 AM
I'm using 2018.2 on Win10 x64. I have a design that uses a multiplier IP (VLNV xilinx.com:ip:mult_gen:12.0). As a result, the associated library HDL files are copied from the <$XilinxInstallFolder>/data/ip/xilinx/mult_gen_v12_0/hdl folder into the <$ProjectFolder>/Vivado.srcs/ip/Multiplier/hdl folder.
While developing the surrounding code, I found an interesting function available from a package within that multiplier library: the <mult_gen_v12_0_14_calc_fully_pipelined_latency> function in the <mult_gen_v12_0_14_pkg> package within the <mult_gen_v12_0_14> library. So, one of my files has the following code:
27:  library mult_gen_v12_0_14;
28:  use mult_gen_v12_0_14.mult_gen_v12_0_14_pkg.all;
...
108: R := mult_gen_v12_0_14_calc_fully_pipelined_latency( -- <<<<<<<<<<<<<<<<
         family    => pFAMILY,
         a_width   => pWIDTH_A,
         a_type    => UNSIGNED_FLAG,
         b_width   => pWIDTH_B,
         b_type    => UNSIGNED_FLAG,
         mult_type => 1,   -- INTEGER; 1: Parallel/DSP
         opt_goal  => 1,   -- INTEGER; 1: Speed
         ccm_imp   => 0,   -- INTEGER; irrelevant for the Parallel/DSP configuration
         b_value   => "0"  -- STRING; irrelevant for the Parallel/DSP configuration
     );
But at synthesis, Vivado is complaining with the following message:
WARNING: [Synth 8-1090] 'mult_gen_v12_0_14_pkg' is not compiled in library mult_gen_v12_0_14 ["W:/EGL0051/Sources/HDL/MultiplierPackage.vhd":28]
ERROR: [Synth 8-2150] illegal named association in array index ["W:/EGL0051/Sources/HDL/MultiplierPackage.vhd":108]
The warning is because the package is not found. The error is because the function is undefined.
Question: How can I make this work? How do I use functions from the packages that come with the Xilinx IPs?
11-29-2018 06:07 PM
Sorry, I cannot find Xilinx documentation on the multiplier package you described.
As you know, using Xilinx IP (especially an undocumented Xilinx package) does not make your code very portable. Sure, we should (or need to) use Xilinx IP in some cases (eg. the Clocking Wizard). However, multiplication is not one of those cases.
I’m sure you know that it is easy to write portable code that describes multiplication, as shown by the following VHDL example.
P1: process(clk)
begin
  if rising_edge(clk) then
    C <= A*B;
  end if;
end process P1;
Vivado will then decide (without your help) to implement the multiplication using either LUTs or the DSP48E1. However, if A and B are wide or clk is fast, then the multiply may fail timing analysis. In that case, you can add some pipelining, as shown below.
P2: process(clk)
begin
  if rising_edge(clk) then
    Ap1 <= A; Ap2 <= Ap1;
    Bp1 <= B; Bp2 <= Bp1;
    Cp2 <= Ap2*Bp2;
    Cp1 <= Cp2;
    C   <= Cp1;
  end if;
end process P2;
Again, the above VHDL is portable code. And again, (without your help) Vivado will probably implement this using the DSP48E1. Further, the DSP48E1 will (without your help) pull in the registers (Apx, Bpx, Cpx) to fully pipeline its operation and help your project pass timing analysis.
So, at least for multiplication, please reconsider your plan to use Xilinx IP or Xilinx undocumented packages.
When life is simple, don't make it complicated :-)
à votre santé
11-30-2018 05:16 AM - edited 11-30-2018 05:22 AM
The multiplier IP I'm using is the standard Xilinx Multiplier IP that is available from the IP catalog under <Math Functions/Multipliers/Multiplier>. It is documented in PG108, which you can easily retrieve from DocNav.
Of course, I understand your point. But I have a concern. I'm using that Multiplier (and other Xilinx math IPs as well) in a pipeline. That pipeline has "parallel" data paths that need to be properly delayed/aligned. So, with the A*B type code, I have no clue about the resulting latency (because Vivado is free to implement whatever it thinks is best). Using the Xilinx IP makes the latency predictable and allows me to have the delay lines on the parallel paths sized automatically. Am I missing something here?
I'd like to add that I'm more interested in scalability than portability. I mean I want the whole thing to scale based on the data bit width and the size of the convolution kernel. Convolution is a nice example: with a 3x3 kernel, that is 9 multiplies followed by a 9-input add-reduce. If you go up to 5x5, it's 25 products and a deeper, 25-input add-reduce tree. I need the latency of all that to be predictable to delay the other data paths properly (ie size the delay elements automatically).
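For illustration, the delay lines described above can be written once as a generic, scalable entity. The following is only a sketch of that idea, assuming VHDL; the entity name, generics, and port names are all hypothetical, not the poster's actual code:

```vhdl
-- Hypothetical sketch of a generic delay line. DEPTH would be driven by a
-- latency constant computed for the parallel multiply/add-reduce path.
library ieee;
use ieee.std_logic_1164.all;

entity DelayLine is
  generic (
    WIDTH : positive := 16;
    DEPTH : positive := 4   -- e.g. multiplier latency + add-reduce latency
  );
  port (
    clk : in  std_logic;
    d   : in  std_logic_vector(WIDTH-1 downto 0);
    q   : out std_logic_vector(WIDTH-1 downto 0)
  );
end entity DelayLine;

architecture rtl of DelayLine is
  type tap_array_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal taps : tap_array_t;
begin
  process(clk)
  begin
    if rising_edge(clk) then
      taps(0) <= d;                 -- first tap samples the input
      for i in 1 to DEPTH-1 loop
        taps(i) <= taps(i-1);       -- shift the remaining taps
      end loop;
    end if;
  end process;
  q <= taps(DEPTH-1);               -- output after DEPTH clock cycles
end architecture rtl;
```

With DEPTH wired to a latency constant, changing a top-level generic re-sizes every delay line at elaboration time.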
11-30-2018 11:31 AM
The multiplier IP I'm using is the standard Xilinx Multiplier IP..
Yes, the Multiplier IP described in PG108 is standard – but you said “I found an interesting function available from a package within that multiplier”. This “interesting function” is what I was referring to as “undocumented”. This “interesting function” may disappear from future versions of the Multiplier IP (because things change).
So, with the A*B type code, I have no clue about the resulting latency…
You actually have full control of the latency by the way you write your HDL. This is going to be an interesting (for me anyway) story about the DSP48. So, bear with me.
In my last response to you, the VHDL process called P1 shows the multiply operation, C<=A*B. Things done in a VHDL clocked process must complete in one clock cycle – or fail timing analysis. So, if everything in the P1 process passes Vivado timing analysis then you know that C<=A*B completes with latency 1.
As a reminder, Vivado can implement multiplication using either LUTs or the DSP48E1. With a VHDL attribute statement (see UG901, pg60) you can specify the method you prefer as follows:
attribute use_dsp : string;
attribute use_dsp of C : signal is "no"; -- "no" = use LUTs, "yes" = use DSP48
The DSP48 operates with full performance when all of its internal registers are enabled. For C<=A*B, this means the DSP48 wants to enable two registers on each of the inputs that accept A and B – and it wants to enable two registers on the output that sends C (see UG479 on about pg14). The DSP48 cannot just enable these internal registers anytime it wants – because enabling registers changes latency. For example, if the DSP48 is used in process P1 for C<=A*B then the DSP48 cannot enable any of its internal registers – because the process P1 very clearly specifies that latency of the multiplication must be 1.
Now, the process I called P2 also computes C<=A*B, but with latency 5. If you specify that the multiplication is to be done with LUTs (ie. attribute use_dsp of Cp2 is "no") then P2 is kinda boring. However, if you specify that the multiplication is to be done using the DSP48, then things get interesting.
All things done in the P2 process must complete in one clock cycle (or fail timing analysis). However, Vivado synthesis is smart enough to know that it can: 1) remove the pipeline registers (Ap1, Ap2, Bp1, Bp2, Cp1, Cp2) from this process, 2) enable the registers inside the DSP48 instead, and 3) still have C<=A*B computed with latency 5!!
You might want to read that last paragraph again. But yes, this really happens! It is called DSP48 “register pull-in”. And yes, when register pull-in occurs, the registers (Ap1, Ap2, Bp1, Bp2, Cp1, Cp2) will not be found in the netlist of your implemented design.
So, by placing pipeline registers in front of the inputs to a multiply – or after its output – you are providing registers that the DSP48 can pull in. You don’t have to supply all six pipeline registers shown in process P2; you can supply just enough to get the calculation latency that you need. However, as you know from using the IP, pipeline registers on the output of the DSP48 generally help more (in terms of passing timing analysis) than pipeline registers on its inputs.
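As an illustration of "supply just enough registers", here is a sketch of a latency-3 multiply: one input register stage plus two output-side stages. The entity and signal names are hypothetical, and how many of these stages the DSP48 actually absorbs is ultimately up to synthesis:

```vhdl
-- Hypothetical latency-3 multiply sketch; the DSP48 can pull these
-- registers into its internal pipeline stages during synthesis.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity mult_lat3 is
  generic (WIDTH_A : positive := 18; WIDTH_B : positive := 18);
  port (
    clk : in  std_logic;
    a   : in  unsigned(WIDTH_A-1 downto 0);
    b   : in  unsigned(WIDTH_B-1 downto 0);
    c   : out unsigned(WIDTH_A+WIDTH_B-1 downto 0)
  );
end entity mult_lat3;

architecture rtl of mult_lat3 is
  signal a_r : unsigned(WIDTH_A-1 downto 0);
  signal b_r : unsigned(WIDTH_B-1 downto 0);
  signal p   : unsigned(WIDTH_A+WIDTH_B-1 downto 0);
begin
  process(clk)
  begin
    if rising_edge(clk) then
      a_r <= a;          -- stage 1: input registers
      b_r <= b;
      p   <= a_r * b_r;  -- stage 2: product register
      c   <= p;          -- stage 3: output register
    end if;
  end process;
end architecture rtl;
```

Because all three stages are written explicitly in the HDL, the latency is fixed at 3 regardless of how the registers end up distributed between the DSP48 and the fabric.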
12-03-2018 05:48 AM
I agree with what you're saying. What's not clear to me is what is going to happen if you instruct the tool NOT to use DSP slices. In this case, is the multiply operation going to be possible in a single clock cycle? If the multiplier is implemented using LUTs and FFs, I would expect its latency to be somewhat larger.
I'm asking because I'm facing the same issue with the Adder/Subtractor. In this case, I can't allow the tool to use DSPs: I need the DSPs for the multiplies, so all my adders are LUT-based. And then latency is a function of input bit-width, amongst other parameters. And I have multiple use cases where that latency changes. When switching from one use case to the other (ie editing the top-level generics), I need the delay lines on the parallel data paths to adjust.
Playing around with the Xilinx Adder/Subtractor IP, I found latency is a function of the bit width of the operands (for a LUT-FF based adder).
Latency = ceil(max(Operand bit width)/12.0);
This is not documented either... and that puzzles me a bit. The only reason I know of for spreading an addition over multiple clock cycles is the length of the carry chain. So, if I was writing C <= A + B and forcing the tool to use LUTs and FFs, I really wonder what I would end up with... I feel the Xilinx IP is breaking down the top-level addition into chunks (and increasing latency) to keep the length of the carry chains manageable. In the end, that left me with the impression that the IP generator is aware of what the underlying FPGA can / can't do and that's why I made the decision to use the Xilinx IPs rather than using the VHDL * and +/- operators.
Using the Xilinx IP is painful (which brings me back to the original post) but it provides predictable results (as per the above equation). But as you mentioned, all of that may break in a future release. So I have opened the door to perpetual maintenance... :-(
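For what it's worth, the empirically observed formula above can be captured in a small VHDL function (declared in a package of your own). To be clear, this is only a sketch of the poster's reverse-engineered equation, not a documented Xilinx formula, and the function name is hypothetical; the relationship may not hold across devices or tool versions:

```vhdl
-- Hypothetical helper based ONLY on the empirically observed
--   Latency = ceil(max(operand bit width)/12)
-- for the LUT/FF-based Adder/Subtractor IP; not a documented equation.
function lut_adder_latency(width_a, width_b : positive) return positive is
  variable w : positive := width_a;
begin
  if width_b > w then
    w := width_b;          -- w = max(width_a, width_b)
  end if;
  return (w + 11) / 12;    -- integer form of ceil(w/12)
end function lut_adder_latency;
```

A constant such as `constant ADD_LATENCY : positive := lut_adder_latency(pWIDTH_A, pWIDTH_B);` could then size the delay lines, mirroring what the IP's package function does for the multiplier.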
Similarly, I also need an AddReduce module which must be configurable for any number of operands of any bit width. That is making latency predictability even harder.
Again, I'm looking for portable and scalable code with sufficient pipelining to achieve timing closure up to ~300MHz, but with known, predictable latency such that I can use equations to automatically adjust the depth of the delay lines of the parallel data paths.
Does all that make sense?
Many thanks for helping!
12-03-2018 06:31 AM
I'm not sure what you're thinking about the mult_gen_v12_0_14_calc_fully_pipelined_latency function. By its nature (it's a VHDL function), on its own it cannot have any pipelining at all. I assume it's just some sort of wrapper around the actual DSP multiplier in the DSP48. The pipelining would come from calling this function in a clocked process, with other associated signals in that process inferring other registers.
The Synthesis tool is very good at inferring primitives, especially registers and DSPs, and then moving them around into appropriate locations to help improve timing. The same would hold for non DSP elements.
Honestly, the best and easiest solution is to write the easiest to understand and most portable code in the first instance. If it works at your desired frequency - great. That was easy. Otherwise you can start tweaking it by breaking up the pipeline to help guide the synthesis, and maybe eventually using the IP blocks to get it to work. You will have the schematics to help you on the way.
Make your life easiest with easy to understand and portable code. Then you can iterate it until you get what you need.
12-03-2018 07:09 AM
The mult_gen_v12_0_14_calc_fully_pipelined_latency function is NOT generating any hardware. It is simply reporting the latency of the resulting multiplier based on its generics. Once this value is known, it becomes possible to compute the depth of delay lines on other paths in the pipeline, such that all data paths are aligned.
The depth of those delay lines can be automatically computed from an equation calling that function. So, something like:

constant MULT_LATENCY : POSITIVE := mult_gen_v12_0_14_calc_fully_pipelined_latency(multiplier configuration parameters);
...
DEPTH => MULT_LATENCY + 2, -- for example

If the configuration of the multiplier is changed, then MULT_LATENCY gets updated and the depth of the DelayLine is automatically adjusted. So, truly scalable code.
And rest assured, I fully agree with your line of thought. This stuff was first made to work from simple code. But now I'm at the point where I'm being asked to make it fully reconfigurable and scalable by changing some top-level generics ONLY (ie without opening or having to worry about the underlying code). I'm asked to make this thing as simple to reconfigure as changing a #define in 'C' code. And I still think it's possible.
I would like not to rely on that mult_gen_v12_0_14_calc_fully_pipelined_latency function. Even more, I would like to get rid of the Xilinx multiplier and adder IPs, and use VHDL * and +/- instead. But I first need to make sure this is indeed going to work.
12-03-2018 06:13 PM
Hi Claude & Richard,
..what is going to happen if you instruct the tool NOT to use DSP slices. In this case, is the multiply operation going to be possible in a single clock cycle? If the multiplier is implemented using LUTs and FFs, I would expect its latency to be somewhat larger.
Since the DSP48 was specifically designed to do math, I suspect synthesis will not be able to build a DSP48-equivalent with LUTs and registers.
From reading your other comments, it sounds like the question you really want answered is:
“How does the performance of inferred-math (ie. C <= A*B) compare to the performance of instantiated-math (ie. Xilinx IP)?”.
My experience with BRAM, indicates that for small BRAM you can infer or instantiate with similar (if not identical) performance. However, large BRAM has better performance when instantiated. I suspect the same is true for math. That is, for larger bit-widths, instantiated-math will have better performance (ie. lower latency) than inferred-math.
..latency is a function of input bit-width amongst other parameters. … I need the delay lines on the parallel data paths to adjust.
Instead of adjusting, would it be possible to design for the worst-case (ie. largest) bit-widths? That is, your design would run with the same latency regardless of bit-width (as long as the bit-width was less than the worst-case design bit-width).
If you can "design for worst-case bit-width", then you can (in theory) infer all your math. That is, at the worst-case bit-width, you pipeline the inferred math operations until everything passes timing analysis. Then, you selectively pipeline a little more to balance latency along parallel paths. With this approach, your latencies are all strictly specified by your HDL. In the future, your design will either continue to work exactly the same (ie. with the same latency) or it will fail timing analysis. And it will only fail timing analysis if synthesis gets worse at inferring things (which is not likely).
I think Richard might approve of this approach, since he said:
Make your life easiest with easy to understand and portable code. Then you can iterate it until you get what you need.
12-04-2018 05:13 AM
Hi Mark & Richard,
You both had great comments. Thanks for sharing! I'll give all this some more thought.
Designing for the worst-case is something I usually don't like but that may be possible here.
I feel like we've been all around the topic and found no magic solution but you definitely made me look at the whole thing from a different perspective, which is often all we need!
Have a great day and thanks again for sharing with me & the community on this.