Visitor
569 Views
Registered: ‎04-24-2020

DSP48E1 time multiplexing and inference

I'm developing logic for the Artix-7 and have implemented a module that mixes two 32-bit signed integer values (VAL1 and VAL2) using a 16-bit mixer value (MIX). It is essentially a fixed-point version of the equation VAL1 * ((MIX_RANGE - MIX) / MIX_RANGE) + VAL2 * (MIX / MIX_RANGE). This appears to work both in simulation and after synthesis (with the issue described below). Synthesis infers DSPs for the multiplies and adds I am using. I am also time-multiplexing the module to process multiple streams of data at a clock frequency of 148.5 MHz.

The issue I'm having with the synthesized version (the simulation is fine) is an unexpected offset of 1 array position in the destination result array. An incrementing index is used to multiplex the operation, and it increments after the calculation process runs. I am beginning to suspect that the DSP operations are exceeding their time window and the index is being updated before the calculation completes, so the result gets stored at the next index in the array. Is this possible, even though the synthesis tools infer the math operations automatically? No indexes are skipped; the results are just offset by 1. It's quite possible something else is going on too, though. Any tips on determining the maximum frequency of DSP operations would be much appreciated; I was not able to find any information on this.

I'm also getting several DSP48E1 pipeline warnings, such as: DSP input is not pipelined. Pipelining DSP48 input will improve performance. In addition, I still have a number of timing issues. I have a 148.5 MHz oscillator connected to the GTP transceiver, and I send its TXOUTCLK to an MMCM to generate 196.608 MHz for audio processing purposes. Many of the warnings I see stem from this relationship, such as:

TIMING #1 Critical Warning Invalid clock redefinition on a clock tree. The primary clock gtpLink/linkIfaceInst/audioClocksInst/gtpToAudioClkInst/inst/clk_in1 is defined downstream of clock gtpLink/linkIfaceInst/gtpgen.gtwizard_0_i/U0/gtwizard_0_init_i/gtwizard_0_i/gt0_gtwizard_0_i/gtpe2_i/TXOUTCLK and overrides its insertion delay and/or waveform definition

TIMING #1 Critical Warning The clocks clk_out1_GtpToAudioClk and gtpLink/linkIfaceInst/gtpgen.gtwizard_0_i/U0/gtwizard_0_init_i/gtwizard_0_i/gt0_gtwizard_0_i/gtpe2_i/TXOUTCLK are related (timed together) but they have no common primary clock. The design could fail in hardware. To find a timing path between these clocks, run the following command: report_timing -from [get_clocks clk_out1_GtpToAudioClk] -to [get_clocks gtpLink/linkIfaceInst/gtpgen.gtwizard_0_i/U0/gtwizard_0_init_i/gtwizard_0_i/gt0_gtwizard_0_i/gtpe2_i/TXOUTCLK]

 

I suspect some of these issues could be the cause of the strange behavior I'm seeing, but I'm having trouble resolving them.

Thanks in advance for any tips on this.

Visitor
547 Views
Registered: ‎04-24-2020

After looking over the timing violations in the reports, I found several relating to the code in question. I am working through these now and am hopeful the issues will be resolved once they are addressed. Any info on pipelining DSP48 resources, or on the other warnings related to feeding the GTP TXOUTCLK into an MMCM, would still be appreciated.

Visitor
504 Views
Registered: ‎04-24-2020

By looking at the timing report, I confirmed that the DSP mix calculation takes longer than a single clock cycle. In fact, it looks like 3 clock cycles would be required (at 148.5 MHz).

It looks like I need to customize the DSP slice(s) and pipeline the stages. I looked through the resulting schematic but could not find the DSPs. I thought the inferred DSP logic would be a good reference point to start from, but the schematic doesn't seem to show it. I also see there is a DSP48 Macro IP, which I am looking into now. I'm new to working with Xilinx FPGA DSP48 resources, so any tips on the best way to customize DSP logic would be most appreciated. Ideally I would not have to delve too heavily into DSP intricacies, for example by using my existing VHDL logic to generate a starting point.

Valuable lessons learned: don't ignore timing report warnings! Also, synchronous logic seems to be less problematic with regard to timing, especially when non-trivial conditional logic is involved.

491 Views
Registered: ‎01-22-2015

@ElementGreen 

Here are some thoughts on using the DSP48:

  1. The Artix-7 has lots (> 39) of DSP48s - see table 4 in document DS180.  So, probably no need for you to use them sparingly or to multiplex their use.

  2. The bible for the DSP48 in 7-Series FPGAs is UG479.  There, you'll find that the DSP48 can do a 25x18 two’s complement multiply.

  3. Table 31 of the datasheet, DS181, for the Artix-7 shows that the DSP48 can do this 25x18 multiply in one clock cycle at clock frequencies over 150MHz.   With full pipelining, the multiply can be done in 4 clock cycles at frequencies over 360MHz.

  4. You can instantiate the DSP48 as shown in UG953, but most of us prefer to infer it and to infer the pipeline registers used by the DSP48.

  5. Hopefully, you are familiar with VHDL.  The following VHDL snippets show how to infer the DSP48 without pipelining for the multiply operation, (P=A*B).  Note how I have used the VHDL attribute, USE_DSP, to tell Vivado that I want to use the DSP48.  See UG901 for more information on USE_DSP.
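        -- Declarations (architecture declarative part); unsigned comes from ieee.numeric_std.
        -- ?? is a placeholder for your operand width.
        constant NBIT : integer := ??;
        signal A,B : unsigned((NBIT-1) downto 0);
        signal P : unsigned((2*NBIT-1) downto 0);
        --
        -- Ask Vivado to map the logic driving P onto a DSP48.
        attribute USE_DSP : string;
        attribute USE_DSP of P : signal is "YES";
        --
        -- Architecture body: registered multiply, P is valid one clock after A and B.
        PR1: process(clk1)
        begin
            if rising_edge(clk1) then
                P <= A * B;
            end if;
        end process PR1;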
  6. The VHDL shown above specifies that the multiply must be done in 1 clock cycle.  If this is not possible then "P <= A*B" will fail timing analysis.

  7. If "P <= A*B" fails timing analysis then you can infer pipeline registers for the DSP48 to help "P <= A*B" pass timing analysis.  Inferring pipeline registers can be done using the VHDL shown below.  This VHDL will enable two pipeline registers on the DSP48 inputs and two pipeline registers on the DSP48 output, resulting in almost full-speed operation for the DSP48.
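        -- Declarations (architecture declarative part); ?? is a placeholder for your operand width.
        constant NBIT : integer := ??;
        signal A,B,A2,B2,A3,B3 : unsigned((NBIT-1) downto 0);
        signal P,P2,P3,P4 : unsigned((2*NBIT-1) downto 0);
        --
        -- The attribute goes on the signal that holds the product.
        attribute USE_DSP : string;
        attribute USE_DSP of P4 : signal is "YES";
        --
        PR2: process(clk1)
        begin
            if rising_edge(clk1) then
                -- Two register stages on each input.
                A2 <= A;
                A3 <= A2;
                B2 <= B;
                B3 <= B2;
                -- Registered product.
                P4 <= A3 * B3;
                -- Further register stages on the output; P is the final result.
                P3 <= P4;
                P2 <= P3;
                P <= P2;
            end if;
        end process PR2;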
  8. When you open the Vivado implemented design for your project, you will see the DSP48 block(s).  However, you will not see the pipeline registers (A2, A3, B2, B3, P2, P3) because the DSP48 has "pulled in" these registers.  This term "pulled in" means that Vivado has removed the pipeline registers from your VHDL and enabled the pipeline registers within the DSP48 block.

  9. I know that your multiplication is more than the 25x18 max capability of a single DSP48.  However, the VHDL coding examples shown above can still be used.  Since your design will probably use multiple DSP48s, you can continue to infer pipeline registers until your design passes timing analysis or the pipeline registers are no longer "pulled in" by the DSP48s.
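
The snippets above use unsigned operands, while the values in the original question are signed. A minimal signed sketch of the same idea follows, sized so that one multiply fits the 25x18 limit from point 2. The entity name, port names and the three-stage layout (input register, product register, output register) are assumptions chosen for illustration, not taken from any Xilinx template. Whether Vivado absorbs all three stages into the DSP48 should be checked in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: signed 25x18 multiply with three register stages.
entity smult_pipe is
    port (
        clk : in  std_logic;
        a   : in  signed(24 downto 0);  -- 25 bits: widest signed A operand of one DSP48E1 multiply
        b   : in  signed(17 downto 0);  -- 18 bits: widest signed B operand
        p   : out signed(42 downto 0)   -- 25 + 18 = 43-bit product
    );
end entity smult_pipe;

architecture rtl of smult_pipe is
    signal a_r : signed(24 downto 0) := (others => '0');  -- input register stage
    signal b_r : signed(17 downto 0) := (others => '0');  -- input register stage
    signal m_r : signed(42 downto 0) := (others => '0');  -- product register stage
    signal p_r : signed(42 downto 0) := (others => '0');  -- output register stage

    attribute USE_DSP : string;
    attribute USE_DSP of m_r : signal is "YES";
begin
    -- No asynchronous reset is used: the DSP48E1 registers only offer
    -- synchronous resets, so an asynchronous one would keep the registers in fabric.
    process (clk)
    begin
        if rising_edge(clk) then
            a_r <= a;
            b_r <= b;
            m_r <= a_r * b_r;  -- numeric_std "*" returns a'length + b'length bits
            p_r <= m_r;
        end if;
    end process;

    p <= p_r;  -- valid three clocks after a and b are presented
end architecture rtl;
```

A new pair of operands can be presented every clock; the three stages only add latency, not a throughput penalty.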

Cheers,
Mark

 

Visitor
450 Views
Registered: ‎04-24-2020

Thank you for the useful information!  In particular I did not know about how to infer a DSP48 with pipelining.

I'm happy to report that I ended up getting my DSP logic working and it is passing timing analysis too. What I did was use the DSP48 Macro IP and defined it as an A*B+C function. I then enabled pipelining only on Tier 4. This latches the A, B, and C inputs as I expected. I was able to do the 4 multiplies/adds with a single DSP in 5 cycles at 148.5 MHz (4 calculation cycles and 1 setup). I believe I can reduce this to 3 cycles by adding a second DSP, with each DSP running 2 of the calculations in parallel. The remaining 2 calculations for each DSP depend on each other, however, so I believe that is the limit. The DSP48 Macro IP seems quite powerful, and I think I prefer it to trying to get the Xilinx synthesis tools to infer the DSP48. This is being used on up to 80 audio streams running at up to 192 kHz, so time multiplexing seems fairly elegant in this scenario, and I will likely need some more DSPs for other subsystems. The Artix-7 chip I'm using has 90 of them, which definitely seems like plenty, but lower power consumption is also an advantage of using as few as possible.

I still get a couple warnings about the lack of pipelining on port C (the add value) and the output port P.  I don't understand what advantage there would be to adding such pipelining, since at the moment I do setup of input ports, run the calculation, and then latch the output (3 cycles total if done without overlapping in parallel pipeline stages).  I guess pipelining each register would result in additional cycles for staging the input values (and possibly the output value too?), but since they would be overlapped it would just involve some extra initial cycles, but would still operate at roughly the same speed?  I can't see the advantage though in this scenario, since it is functioning at the clock frequency I'm running it at.  Maybe if I wanted to run it at a higher rate?

One thing I found exceedingly helpful, is I created a spreadsheet for the pipeline stages, with each stage as columns, and then placed the DSP inputs and other logic variables as rows.  It then became more obvious what actions needed to be done in each pipeline stage and I was able to break up the equation for the 32 bit mix function.

Now that I have a better understanding of the DSP48, it certainly opens up a lot of possibilities.

Cheers!

437 Views
Registered: ‎01-22-2015

Congratulations on getting things working!

"What I did was use the DSP48 Macro IP and defined it as an A*B+C function."
Yes, that IP (see document PG148) with its associated wizard makes the DSP48 easy to use.

"I still get a couple warnings about the lack of pipelining ......  I don't understand what advantage there would be to adding such pipelining."
Vivado throws warnings if the DSP48 is not fully pipelined.  These warnings are just suggestions, and you can ignore them if your design is passing timing analysis.  Once you have 3 pipeline registers (MREG, PREG, and an input register), the DSP48 has near-FMAX capability.

"One thing I found exceedingly helpful, is I created a spreadsheet for the pipeline stages..."
Can you describe this in more detail? Thanks!

Mark

Visitor
419 Views
Registered: ‎04-24-2020

Thanks for confirming that pipelining is about achieving maximum frequency and not necessary if timing requirements are met.

One thing I haven't figured out yet is whether the DSP can use its previous P output as an input without the additional pipeline latching stage. This would be necessary to implement the faster 2-cycle calculation with two DSPs that I mentioned in my previous comment. At the moment, the 4-cycle calculation with 1 DSP is staggered, which allows one additional cycle to latch the P output results.

Here is an example of the type of spreadsheet table I mentioned for helping with pipelining. In this example, the two 32-bit source values are fetched from block memory and are each split into two 16-bit values (Uint and Frac).

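| Stage      | Init0   | Init1 | Init2   | 0                   | 1            | 2              | 3              | 4             |
|------------|---------|-------|---------|---------------------|--------------|----------------|----------------|---------------|
| DSP1 Calc  |         |       |         | (Frac1*nMix)+hRange | (Uint1*nMix) | (Frac2*Mix)+P0 | (Uint2*Mix)+P1 |               |
| A          |         |       | Frac1   | Uint1               | Frac2        | Uint2          |                | Frac1         |
| B          |         |       | nMix    | nMix                | Mix          | Mix            |                | nMix          |
| C          |         |       | hRange  | 0                   | P0           | P1             |                | hRange        |
| bMemAddr   | Src1(0) |       | Src2(0) |                     |              | Src1(Next)     |                | Src2(Next)    |
| Result (P) |         |       |         |                     | P0           | P1             | P2             | P3 + P2 >> 16 |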
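
As a cross-check for the table, the four DSP steps can be written as a plain behavioral reference function for a testbench. This is only a sketch under stated assumptions: MIX_RANGE is taken to be 65536 and hRange to be 32768 (neither value is given in the thread), `>> 16` is read as applying to P2 only, and the package and function names are made up for illustration. It models the arithmetic, not the pipeline timing.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical reference model of the 4-step mix; for simulation cross-checks only.
package mix_ref_pkg is
    function mix32 (val1, val2 : signed(31 downto 0);
                    mix        : unsigned(15 downto 0)) return signed;
end package mix_ref_pkg;

package body mix_ref_pkg is
    function mix32 (val1, val2 : signed(31 downto 0);
                    mix        : unsigned(15 downto 0)) return signed is
        -- Assumed values, not stated in the thread.
        constant MIX_RANGE : unsigned(16 downto 0) := to_unsigned(65536, 17);
        constant H_RANGE   : signed(47 downto 0)   := to_signed(32768, 48);
        variable mix_s, n_mix   : signed(17 downto 0);
        variable frac1, frac2   : signed(17 downto 0);
        variable uint1, uint2   : signed(17 downto 0);
        variable p0, p1, p2, p3 : signed(47 downto 0);
    begin
        -- Mix and nMix = MIX_RANGE - Mix are non-negative, so zero-extend to 18 bits.
        mix_s := signed(resize(mix, 18));
        n_mix := signed(resize(MIX_RANGE - resize(mix, 17), 18));
        -- Lower halves (Frac) are treated as unsigned and zero-extended.
        frac1 := signed(resize(unsigned(val1(15 downto 0)), 18));
        frac2 := signed(resize(unsigned(val2(15 downto 0)), 18));
        -- Upper halves (Uint) carry the sign and are sign-extended.
        uint1 := resize(val1(31 downto 16), 18);
        uint2 := resize(val2(31 downto 16), 18);
        -- The four steps from the "DSP1 Calc" row.
        p0 := resize(frac1 * n_mix, 48) + H_RANGE;
        p1 := resize(uint1 * n_mix, 48);
        p2 := resize(frac2 * mix_s, 48) + p0;
        p3 := resize(uint2 * mix_s, 48) + p1;
        -- Final combine: P3 + (P2 >> 16).
        return resize(p3 + shift_right(p2, 16), 32);
    end function mix32;
end package body mix_ref_pkg;
```

With these assumptions, every operand is at most 18 bits signed, which is why each step fits a single DSP48 multiply-add. Under the same assumptions the function should return VAL1 for MIX = 0, which is an easy first check against the hardware.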

 

Here is an example table for what would need to happen for a 2-cycle calculation. Note that there is no extra stage for latching the output. I still don't know if this is possible. I suppose an alternative would be to process multiple sources at the same time and stagger the results to achieve two total cycles of calculation, but that also complicates things significantly.

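| Stage     | Init0   | Init1 | Init2   | 0                   | 1              | 2                   |
|-----------|---------|-------|---------|---------------------|----------------|---------------------|
| DSP1 Calc |         |       |         | (Frac1*nMix)+hRange | (Frac2*Mix)+P0 |                     |
| DSP1 A    |         |       | Frac1   | Frac2               |                | Frac1               |
| DSP1 B    |         |       | nMix    | Mix                 |                | nMix                |
| DSP1 C    |         |       | hRange  | 0                   |                | hRange              |
| DSP2 Calc |         |       |         | (Uint1*nMix)        | (Uint2*Mix)+P0 |                     |
| DSP2 A    |         |       | Uint1   | Uint2               |                | Uint1               |
| DSP2 B    |         |       | nMix    | Mix                 |                | nMix                |
| DSP2 C    |         |       | 0       | 0                   |                | 0                   |
| bMemAddr  | Src1(0) |       | Src2(0) | Src1(Next)          |                | Src2(Next)          |
| Result    |         |       |         |                     |                | DSP2P + DSP1P >> 16 |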
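
On the open question of feeding P back without an extra latching stage: as described in UG479, the DSP48E1 adder can take the slice's own registered P output as one of its operands, which is how multiply-accumulate is normally built. A minimal inference sketch of that pattern follows, assuming 18-bit operands as in the table. The entity name, port names and the `load` control are hypothetical, and whether Vivado maps this onto a single DSP48 with internal feedback should be confirmed in the synthesis report.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: multiply-add with optional feedback of the previous result.
entity macc_sketch is
    port (
        clk  : in  std_logic;
        load : in  std_logic;            -- '1': p = a*b + c, '0': p = a*b + previous p
        a    : in  signed(17 downto 0);
        b    : in  signed(17 downto 0);
        c    : in  signed(47 downto 0);
        p    : out signed(47 downto 0)
    );
end entity macc_sketch;

architecture rtl of macc_sketch is
    signal a_r, b_r : signed(17 downto 0) := (others => '0');
    signal c_r      : signed(47 downto 0) := (others => '0');
    signal load_r   : std_logic := '0';
    signal p_r      : signed(47 downto 0) := (others => '0');
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- Input register stage; load is delayed with the operands it belongs to.
            a_r    <= a;
            b_r    <= b;
            c_r    <= c;
            load_r <= load;
            -- Output register stage: the adder takes either c or the previous result.
            if load_r = '1' then
                p_r <= resize(a_r * b_r, 48) + c_r;
            else
                p_r <= resize(a_r * b_r, 48) + p_r;
            end if;
        end if;
    end process;

    p <= p_r;
end architecture rtl;
```

Driving `load` high for the first step and low for the second would correspond to the two calculation stages in the table above, with the second step picking up the previous result directly from the output register.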