- Timing failure due to too many CARRY4 components i...


jaga_nitc

Visitor


06-15-2016 04:51 AM

7,009 Views

Registered:
05-31-2016

*I implemented a noise-removal system using spectral subtraction in VHDL. I am using a Kintex-7 evaluation board with Vivado 2015.4. I am using 16-bit samples to perform the FFT. The frame length is 64, so I have created four variables of 4096 bits each to store 64 values of 64 bits each. Some variables are 2048 bits wide to store 64 values of 32 bits each (the output of the FFT).*

*I have packaged the noise removal system as an IP called Speech_Enhancement.*

*The system is working fine during simulation. The operating frequency of the IP is 50MHz.*

*.bit file generation is successful, but my design failed to meet timing constraints.*

*I checked the paths that violate timing and found that too many CARRY4 components are created along the path, which causes a very large path delay. I don't understand the reason for so many CARRY4 components.*

*I have attached the related report files here. Please go through the files and suggest some solutions...*

1 Solution

Accepted Solutions


gszakacs

Professor


06-15-2016 07:53 AM

12,955 Views

Registered:
08-14-2007

Slack (VIOLATED) : -141.698ns (required time - arrival time)

Source: mb_subsystem_i/speech_enhancement_0/U0/speech_enhancement_ip_0/sample_index_reg[0]_rep__5_replica_1/C

(rising edge-triggered cell FDRE clocked by mmcm_clkout0 {rise@0.000ns fall@10.000ns period=20.000ns})

Destination: mb_subsystem_i/speech_enhancement_0/U0/speech_enhancement_ip_0/transfer_func_sq_reg[3968]/D

(rising edge-triggered cell FDRE clocked by mmcm_clkout0 {rise@0.000ns fall@10.000ns period=20.000ns})

Path Group: mmcm_clkout0

Path Type: Setup (Max at Slow Process Corner)

Requirement: 20.000ns (mmcm_clkout0 rise@20.000ns - mmcm_clkout0 rise@0.000ns)

Data Path Delay: 161.413ns (logic 102.088ns (63.246%) route 59.323ns (36.752%))

**Logic Levels: 1441 (CARRY4=1373 LUT2=7 LUT3=58 LUT4=1 LUT5=1 RAMS64E=1)**

Clock Path Skew: -0.225ns (DCD - SCD + CPR)

Destination Clock Delay (DCD): 3.432ns = ( 23.432 - 20.000 )

Source Clock Delay (SCD): 3.914ns

Clock Pessimism Removal (CPR): 0.256ns

Clock Uncertainty: 0.094ns ((TSJ^2 + DJ^2)^1/2) / 2 + PE

Total System Jitter (TSJ): 0.071ns

Discrete Jitter (DJ): 0.174ns

Phase Error (PE): 0.000ns

The number of logic levels indicates that you're trying to do too much in one clock period. Most likely you have multiple levels of arithmetic, like Y = A + B + C + D - (E + F + G + H). In order to run at 50 MHz, you'd need to pipeline this operation over several clock periods. Alternatively, if you don't really need a new result every cycle (input data doesn't change at the clock rate), you could set a multicycle path.
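To make the pipelining idea concrete, here is a small cycle-level model in Python. It is only a sketch of the principle, not VHDL and not the poster's actual design: the class name `PipelinedSum`, the three-stage split, and the use of `None` for "no valid data" are all assumptions chosen for illustration. Each call to `tick` stands for one rising clock edge, and every stage register is updated from the values the registers held before that edge, as flip-flops would be.

```python
class PipelinedSum:
    """Cycle-level model of Y = A + B + C + D - (E + F + G + H),
    split over three register stages (hypothetical example)."""

    def __init__(self):
        self.s1 = None  # stage 1 registers: four pairwise sums
        self.s2 = None  # stage 2 registers: (A+B+C+D, E+F+G+H)
        self.y = None   # stage 3 register: final result

    def tick(self, inputs):
        """Advance one clock edge. `inputs` is a tuple (A..H) or None."""
        # Compute all next-state values from the CURRENT register contents
        # first, then update every register together.
        next_y = None if self.s2 is None else self.s2[0] - self.s2[1]
        next_s2 = (
            None
            if self.s1 is None
            else (self.s1[0] + self.s1[1], self.s1[2] + self.s1[3])
        )
        if inputs is None:
            next_s1 = None
        else:
            a, b, c, d, e, f, g, h = inputs
            next_s1 = (a + b, c + d, e + f, g + h)
        self.s1, self.s2, self.y = next_s1, next_s2, next_y
        return self.y


pipe = PipelinedSum()
outputs = [
    pipe.tick((1, 2, 3, 4, 5, 6, 7, 8)),  # edge 1: operands enter stage 1
    pipe.tick(None),                      # edge 2: partial sums reach stage 2
    pipe.tick(None),                      # edge 3: result appears
]
print(outputs)
```

In this model each stage contains only one level of addition or subtraction, so the longest combinational path is a single carry chain rather than several in series. The price is latency: the result appears three clock edges after the operands are presented, although a new set of operands can still be accepted on every edge. How many stages a real design needs depends on its own timing report.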

-- Gabor

5 Replies


jaga_nitc

Visitor


06-15-2016 06:25 PM

6,956 Views

Registered:
05-31-2016

Dear Gabor,

Thank you very much for the quick reply.

Your guess about using multiple levels of arithmetic in one clock cycle seems to be true.

I will try to distribute the operations to different clock cycles and get back to you...

I don't understand the reason for creating 1373 CARRY4 elements. If possible, could you please clarify that?

Thanks again for the help,

Jaga


avrumw

Guide


06-16-2016 07:17 AM

6,911 Views

Registered:
01-23-2009

Each CARRY4 (as its name implies) is responsible for the propagation of the carry for 4 bits during an addition/subtraction operation.

So, if you add two numbers that are each 1024 bits wide, you will end up with 256 CARRY4 elements. By definition (since this is carry propagation), these will be in series, and hence will contribute 256 CARRY4 elements to the critical path.

Having 1373 carry elements means you are trying to do something on the order of 5500 bits' worth of addition/subtraction in one clock period. Whether that is 10 cascaded additions with operands of 550 bits each, or 550 cascaded additions with operands of 10 bits each (or any combination), is what you need to determine and re-architect - this is simply too much logic to do in one clock cycle; it needs to be pipelined.

Remember - carry chains will be used for any addition-based operation. Clearly this includes addition and subtraction, but it also includes numerical comparison (<, <=, >, >=), since these operations are implemented using subtraction (unless one of the operands being compared is constant).
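The arithmetic behind those numbers can be checked with a few lines of Python. This is just a back-of-the-envelope sketch; the function names are made up for the example, and the only hardware fact assumed is the one stated above, that one CARRY4 handles 4 bits of carry propagation.

```python
import math

BITS_PER_CARRY4 = 4  # one CARRY4 propagates the carry for 4 bits


def carry4_count(adder_width_bits):
    """CARRY4 elements in series for one adder of the given width."""
    return math.ceil(adder_width_bits / BITS_PER_CARRY4)


def bits_spanned(carry4_in_series):
    """Total bits of carry propagation covered by a chain of CARRY4s."""
    return carry4_in_series * BITS_PER_CARRY4


print(carry4_count(1024))  # 256, the 1024-bit example above
print(bits_spanned(1373))  # 5492, i.e. "on the order of 5500 bits"
```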

Avrum


jaga_nitc

Visitor


06-16-2016 05:59 PM

6,877 Views

Registered:
05-31-2016

Thank you Avrum for that clarification.

I think the operation responsible for the timing failure is a **"64-bit division"**.

The statement goes something like this,

* transfer_func_sq(4095 *downto

The variables used here are of type unsigned, and I am using the ieee.numeric_std package.

Please suggest an efficient way of achieving 64-bit division. I am using Kintex-7 evaluation board. I am using 50MHz clock, so if possible I want to complete the division operation within 1 or 2 clock cycles. The utilization report suggests that I still have around 50% of resources left to use.

Thanks,

Jaga


gszakacs

Professor


06-16-2016 06:40 PM

6,868 Views

Registered:
08-14-2007

Division typically requires a much longer pipeline than one or two cycles. It's certainly possible to have a divider that takes a new value on every clock cycle or two; however, the latency through the divider will be on the order of tens of cycles, depending on the length of the operands. If you are dividing by a constant, or by a number that doesn't change often, you can get a much lower latency by calculating the inverse of the divisor and then doing a multiplication instead.
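Both approaches can be sketched as behavioural models in Python. These are illustrations only, not a description of any particular divider core: the function names, the choice of a restoring (shift-and-subtract) algorithm, and the 128-bit scaling of the reciprocal are assumptions made for the example. In the first function, each loop iteration corresponds to the work a simple hardware divider would do in one clock cycle or one pipeline stage, which is why a 64-bit operand leads to a latency of tens of cycles.

```python
def restoring_divide(dividend, divisor, width=64):
    """Shift-and-subtract division producing one quotient bit per iteration.
    Assumes 0 <= dividend < 2**width and divisor > 0."""
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        # Shift the next dividend bit (MSB first) into the remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        # One trial subtraction per step: a single carry chain per cycle.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


def reciprocal_constant(divisor, width=64):
    """Precompute m = ceil(2**(2*width) / divisor). This is done once,
    whenever the divisor changes, not on every sample."""
    shift = 2 * width
    m = -(-(1 << shift) // divisor)  # integer ceiling division
    return m, shift


def divide_by_reciprocal(dividend, m, shift):
    """Replace the division with one multiplication and one shift.
    Assumes 0 <= dividend < 2**(shift // 2)."""
    return (dividend * m) >> shift


q, r = restoring_divide(1000, 7)
print(q, r)  # 142 6

m, shift = reciprocal_constant(7)
print(divide_by_reciprocal(1000, m, shift))  # 142
```

With the reciprocal scaled by 2**128, the multiply-and-shift result matches integer division for 64-bit dividends and divisors, but the multiplier becomes very wide. A real design would normally use a narrower reciprocal and a pipelined multiplier, trading some precision or extra analysis for area.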

-- Gabor