Adventurer
Registered: ‎09-18-2018

How to order sequence of operations without knowing how long a mathematical operation takes?

Jump to solution

Hello,

I had previously been following what another developer had been doing with regard to the ordering of mathematical operations. That person used IP cores and case statements, as shown in the example below.

The following two operations would become the Verilog code shown below them:

       a = (b * c)<<16;
       d = a * x;   
always @(posedge CLK, posedge RST) begin
    if(RST == 1) begin
        a <= 0;
        d <= 0;
        count <= 0;
    end else begin
        count <= count + 1;
        case(count)
          2:  begin 
                 mult1 <= b;
                 mult2 <= c;
              end
          27: a <= MullResult << 16;
          28: begin
                 mult1 <= a;
                 mult2 <= x;
              end
          53: d <= MullResult;
          endcase
    end
end       


Multiplier   Multiplier1(
                .CLK(CLK),
                .A(mult1),
                .B(mult2),
                .P(MullResult)
            );

By doing it that way, you know exactly how long the operation takes to complete.

However, I decided to do it as follows

always @(posedge CLK, posedge RST) begin
    if(RST == 1) begin
        a <= 0;
        d <= 0;
    end else begin
        a = (b * c) << 16;
        d = a * x;
    end
end

This way the operations are more obvious; however, the issue is not knowing how long the operation takes.

How long does (b*c)<<16 take? I looked at the elaborated schematic and I see RTL_MULT, which doesn't tell me much. Does the output of RTL_MULT only change after the multiply operation has completed? How long does a multiply take to complete? What if some things in my code need to know whether or not the mathematical operation has completed? If you say it takes one clock cycle, is that true all of the time? And how do I decide whether to choose an IP core multiply over the c = a*b statement?

In general:

1. For mathematical operations, is the left side only updated after the right side has finished computing its value? Is it the same whether I use blocking or non-blocking assignments?

2. How can I determine how long it takes to complete a mathematical operation, e.g. RTL_MULT?

3. When should I use a = c*b over an IP multiplication core, and vice versa?

4. What is a good way to structure my code to make sure operations are performed in a specific order?

Thanks,

Stephen

1 Solution

Accepted Solutions
Guide
Registered: ‎01-23-2009

If it does not complete in one clock cycle, should the tool give me a timing error?

What this code says is that between the flip-flops that hold b and c and the flip-flops that hold a there is a combinatorial network which computes (b*c)<<16. This network creates paths from b->a and from c->a. For timing to be met, the delay on all of these paths must be less than the clock period specified for the (presumably common) clock shared by a, b, and c. If not, then you will get timing failures during synthesis and implementation.

If a timing error is not generated by the tool and the operation does not complete in one clock cycle, what would "a" be equal to after one clock cycle? Would it be equal to the previously computed value?

A timing error is generated. If you were to try to run the system anyway (either in a real FPGA or with full back-annotated timing simulation) the results would be garbage - on the clock edge after b and/or c change, the value of a will simply be corrupt - it may be partly the old value, partly the new value, and even partly neither. Even worse, it could go metastable - the individual flip-flops might not even hold valid 0's and 1's.

Also, if I don't have the proper timing constraints set up, is it possible I would not get a timing error even though it takes more than one clock cycle?

Yes. You always need complete and correct constraints for your system to work. Without constraints the tools cannot know what a "clock cycle" is, and hence can't tell you if you meet timing or not. However...

Currently, the only timing constraint I have is for the clock that is used for all the modules.

For these paths (that start and end at internal flip-flops) the only constraint required is the correct clock - as long as you have a proper create_clock at the clock port of the design (along with a set_input_jitter command), then these paths are constrained.
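As a sketch, such a baseline constraint might look like this in XDC (the 10 ns period and the jitter value are assumptions for illustration; CLK is the clock port from the code above):

```tcl
# Define the system clock on the top-level clock port (period in ns).
create_clock -name sys_clk -period 10.000 [get_ports CLK]

# Add clock-source jitter on top of the ideal clock (value is illustrative).
set_input_jitter sys_clk 0.100
```

With only this in place, every register-to-register path clocked by CLK is constrained and checked by the timing tools.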

Should the tool return an error If the shift doesn't complete in 1 clock cycle.

The concept of timing paths is true for all paths. In this case there is combinatorial logic between the flip-flops of MullResult and the flip-flops of "a" (assuming MullResult is a flip-flop). Again, this logic creates paths that must take less than one clock cycle. That being said, a shift by a constant is actually "free" - there is no actual logic on these paths. In this code, the upper bits a[N-1:16] are loaded with the value of MullResult (where N is the size of a) and the lower 16 bits are loaded with 0. These paths are therefore just wires (with no logic on them) from MullResult to a. In a "sane" system, there is no risk of this failing timing.

Even if MullResult does not come from flip-flops, but is the result of combinatorial logic, this simply means that the combinatorial logic of the shift operator (which, as explained, is non-existent) is concatenated with the combinatorial logic that calculates MullResult. All the paths in this combined network need to meet timing, and will be checked by the tools.
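To make the "just wires" point concrete, the constant shift is equivalent to a concatenation (the widths below are assumptions for illustration):

```verilog
// Assuming a is 48 bits wide and MullResult is 32 bits wide, these two
// assignments describe identical hardware: no gates, only routing.
a <= MullResult << 16;        // shift by a constant
a <= {MullResult, 16'h0000};  // explicit concatenation of the same wires
```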

Right now I am leaning towards using a case construct and an IP multiplication core. However, I still have the shift operator that needs to be processed.

From the point of view of static timing analysis (which is what is done during synthesis and implementation), the tools don't care what the functionality is, nor how that functionality was created (from an IP core or RTL logic) - it is simply concerned with paths. If the paths between flip-flops can operate in one clock cycle (with clock skew, and other uncertainties taken into account, which the tool takes care of), then the design meets timing. If not, then it will have timing errors.

Avrum


7 Replies
Advisor
Registered: ‎01-28-2008

Hi @shall785 

  In general it's best to write your own math implementation or use an IP core, instead of letting the synthesizer infer the logic. You get more control over what is implemented and how much latency it has, and you can keep an eye on resource usage.

  It's likely your peer developer generated the multiplier IP with a known latency of 25 clock cycles and is using the state machine to pace the data through the multiplier until the valid result comes out. This is a valid approach; as long as the state machine is matched to the IP latency, the output will be correct.

  Another approach is to use a shift register: when data enters the multiplier, put a pulse into the shift register. Following the example, on each clock cycle the shift register moves the pulse along, and on clock 25 the pulse comes out the other end, validating the multiplier output.

  In the code below, I use the shift register valid_sreg to track the initial pulse: it shifts until bit 24 is set, at which point the multiplication result is valid and the shift-left-16 is applied.

reg [24:0] valid_sreg;   // one bit per clock of multiplier latency
reg        in_val;       // pulse that enters the shift register

always @(posedge CLK, posedge RST) begin
  if(RST == 1) begin
    a <= 0;
    d <= 0;
    valid_sreg <= 25'h0;
  end else begin
    in_val <= 1'b0;
    if (init_pulse) begin
      mult1 <= b;
      mult2 <= c;
      in_val <= 1'b1;
    end else if (valid_sreg[24]) begin
      a <= MullResult << 16;
    end
    valid_sreg <= {valid_sreg[23:0], in_val};
  end
end       

  Of course the code itself is just a demonstration of a concept, not a solution in itself. Walking pulses like this are useful in math computations and other pipelined constructs where latency tracking is critical. Note that the output of the IP may step through intermediate values until the correct result appears at its output.

 

Thanks,

-Pat

 

Give kudos if helpful. Accept as solution if it solves your problem.
https://tuxengineering.com/blog

Adventurer
Registered: ‎09-18-2018

Hello Pat,

Thanks for the response.

It looks like it would be best to share IP cores across multiple operations; however, I may do it the other way sometimes, depending on the situation.

Some more questions:

1. For a multiplication operation like a = (b*c)<<16, or something like it, is the left-hand side only updated after (b*c)<<16 has been computed? Does it matter whether I use blocking (=) or non-blocking (<=) assignments?

2. Is there any way to know how long an inferred operation like RTL_MULT will take?

Thanks,
Stephen

 

 

Adventurer
Registered: ‎09-18-2018

Hello Pat,

In addition to my previous two questions: does a shift (by any number of bits) take one clock cycle?

Stephen

Registered: ‎06-21-2017

If you code the shift, or anything else, to complete in one clock cycle, then it will complete in one clock cycle. 

Guide
Registered: ‎01-23-2009

I think you have a basic misunderstanding of RTL.

RTL (Register Transfer Level) code explicitly states the registers and describes the transfers between them. The number of clock cycles taken by an operation is therefore determined solely by how you code the registers around the operation. So your code snippet

always @(posedge CLK, posedge RST) begin
    if(RST == 1) begin
        a <= 0;
        d <= 0;
    end else begin
        a <= (b * c) << 16;  // always use non-blocking assignments for inference of flip-flops
        d <= a * x;
    end
end

is stating that the "a" register is updated on each clock with the value of (b*c)<<16 and the "d" register is updated each clock with a*x - you are explicitly stating that there is one clock cycle available for each operation.

The next question is "is that reasonable?" This will depend entirely on the clock frequency and the width of the operands. Multiplication is a complex operation: if you infer a one-clock-cycle multiplication, the entire multiply is done combinatorially - that will be big and will take a lot of nanoseconds. If the number of nanoseconds it takes is less than your clock period, then this will pass timing. If your multiplier (for a given number of bits) takes more nanoseconds than the one clock period you have allowed it, then you will fail timing - your design will not work.

For this reason, there are many approaches to multiplication that use pipelining. When an operation is pipelined, you break the operation into smaller/simpler combinatorial steps that are carried out over successive clock cycles. Then as long as the nanoseconds required for each step of the multiplication is less than your clock period, the design will meet timing. But this is an architectural choice - you need to pipeline the multiplier in your RTL to make this happen.
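As a minimal sketch of such a pipelined multiplier (the module and signal names are illustrative, not the poster's code):

```verilog
// Three-cycle-latency multiplier: register the operands, register the
// product of the registered operands, then register the result again.
// Each stage's combinatorial logic only has to fit in one clock period,
// and the extra registers help the tools pack the multiply into a DSP48.
module mult_pipe #(parameter W = 16) (
    input                 CLK,
    input      [W-1:0]    b,
    input      [W-1:0]    c,
    output reg [2*W-1:0]  p
);
    reg [W-1:0]   b_r, c_r;   // stage 1: input registers
    reg [2*W-1:0] m_r;        // stage 2: registered product

    always @(posedge CLK) begin
        b_r <= b;
        c_r <= c;
        m_r <= b_r * c_r;
        p   <= m_r;           // stage 3: output register
    end
endmodule
```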

There are a number of approaches here:

  • Pipeline it manually
    • design an architecture yourself to do the multiplication
  • Target it to the DSP48 multiplier in the FPGA
    • There are specific coding styles and recommended pipelining that will help the tools pack it into the DSP48
  • Use an IP wizard or IP core to implement the multiplier
    • This allows you to create multipliers with different pipelining characteristics and that do or do not use the DSP48
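In Vivado, the use_dsp synthesis attribute can steer this choice explicitly; a sketch, applied to an illustrative signal:

```verilog
// Ask synthesis to implement this multiply in DSP48 slices rather than
// fabric LUTs. (use_dsp = "no" would force it into fabric instead.)
(* use_dsp = "yes" *) reg [31:0] prod;

always @(posedge CLK)
    prod <= b * c;
```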

In all of these cases, though, the number of clocks required to do the multiplication is determined by how the multiplication module is created; the tool never "dynamically" determines how many clocks an operation takes (this is one of the fundamental concepts of RTL).

The other alternative is to use High Level Synthesis. In HLS, you code the algorithm you want and the HLS compiler determines how to pipeline it. But this is a totally different tool flow and a totally different approach to how to design for FPGAs (i.e. it is NOT RTL).

Avrum

Adventurer
Registered: ‎09-18-2018

Hello Avrum,

Thank you very much for taking so much time to write your reply.  

So, by writing 

a <= (b * c)<<16; 

am I inferring that (b*c)<<16 will take one clock cycle to complete?   If it does not complete in one clock cycle, should the tool give me a timing error?  If a timing error is not generated by the tool and the operation does not complete in one clock cycle, what would "a" be equal to after one clock cycle? Would it be equal to the previously computed value?

Also, if I don't have the proper timing constraints set up, is it possible I would not get a timing error even though it takes more than one clock cycle?  Currently, the only timing constraint I have is for the clock that is used for all the modules. 

In my original post, I posted code that used an IP core and a case construct.  One of the case items was 

27: a <= MullResult << 16;

Should the tool return an error if the shift doesn't complete in one clock cycle?

Right now I am leaning towards using a case construct and an IP multiplication core. However, I still have the shift operator that needs to be processed.

Thanks,

Stephen
