Adventurer
432 Views
Registered: ‎07-24-2016

Timing Closure in a Big Demux


Hi all,

 

demux.png

 

I will implement a big demultiplexer in an XC7VX690T-2FFG1761C (VC709), driven by configuration logic and pointing to 1024 different 32-bit wide registers. The logic's clock frequency is 320 MHz. The configuration logic presents wrData (32 bits wide) and the address (10 bits wide) in one cycle, and after X clock cycles it asserts a write_enable pulse. wrData and the address stay stable for X clock cycles after the assertion of write_enable. As per Xilinx's recommendations, I have set a multicycle path in the .xdc. It looks roughly like this:

 

set_multicycle_path $X -setup -from [get_cells *configuration_logic*/reg_wr_data_reg[*]] -to [get_cells *destination_logic*/register_XXXX_reg[*]]
set_multicycle_path $X-1 -hold -from [get_cells *configuration_logic*/reg_wr_data_reg[*]] -to [get_cells *destination_logic*/register_XXXX_reg[*]]

I still have not implemented all 1024 registers; a script will do that, and will also generate the demux's .vhd from a template. So far it works for two dummy registers, but I am concerned that I will not achieve timing closure when I scale up, and also that I will cause congestion. I would like some advice on different implementation styles, in order to get the optimal result. So here are some demux examples:

 

Example 1:

The simplest:

 

process(clk)
begin
    if(rising_edge(clk))then
        if(address = "0000000000" and write_enable = '1')then
            register_0000 <= wrData;
        else
            register_0000 <= register_0000;
        end if;
    end if;
end process;

process(clk)
begin
    if(rising_edge(clk))then
        if(address = "0000000001" and write_enable = '1')then
            register_0001 <= wrData;
        else
            register_0001 <= register_0001;
        end if;
    end if;
end process;

 

Example 2:

Here, the 10-bit address is broken down in "sections", and each section is mapped into a one-hot style bus:

 

process(clk)
begin
    if(rising_edge(clk))then
        case address(9 downto 8) is  -- 2-bit slice, so binary (not hex) choices
        when "00"   => section_enable <= X"1";
        when "01"   => section_enable <= X"2";
        when "10"   => section_enable <= X"4";
        when others => section_enable <= X"8";
        end case;
    end if;
end process;

process(clk)
begin
    if(rising_edge(clk))then
        case address(7 downto 4) is
        when X"0"   => subsection_enable <= X"0001";
        when X"1"   => subsection_enable <= X"0002";
        when X"2"   => subsection_enable <= X"0004";
        when X"3"   => subsection_enable <= X"0008";
        when X"4"   => subsection_enable <= X"0010";
        when X"5"   => subsection_enable <= X"0020";
        when X"6"   => subsection_enable <= X"0040";
        when X"7"   => subsection_enable <= X"0080";
        when X"8"   => subsection_enable <= X"0100";
        when X"9"   => subsection_enable <= X"0200";
        when X"A"   => subsection_enable <= X"0400";
        when X"B"   => subsection_enable <= X"0800";
        when X"C"   => subsection_enable <= X"1000";
        when X"D"   => subsection_enable <= X"2000";
        when X"E"   => subsection_enable <= X"4000";
        when others => subsection_enable <= X"8000";
        end case;
    end if;
end process;

process(clk)
begin
    if(rising_edge(clk))then
        case address(3 downto 0) is                
        when X"0"   => register_enable <= X"0001";
        when X"1"   => register_enable <= X"0002";
        when X"2"   => register_enable <= X"0004";
        when X"3"   => register_enable <= X"0008";
        when X"4"   => register_enable <= X"0010";
        when X"5"   => register_enable <= X"0020";
        when X"6"   => register_enable <= X"0040";
        when X"7"   => register_enable <= X"0080";
        when X"8"   => register_enable <= X"0100";
        when X"9"   => register_enable <= X"0200";
        when X"A"   => register_enable <= X"0400";
        when X"B"   => register_enable <= X"0800";
        when X"C"   => register_enable <= X"1000";
        when X"D"   => register_enable <= X"2000";
        when X"E"   => register_enable <= X"4000";
        when others => register_enable <= X"8000";
        end case;
    end if;
end process;

process(clk)
begin
    if(rising_edge(clk))then
        if((section_enable(0) = '1') and (subsection_enable(0) = '1') and (register_enable(0) = '1'))then
            if(write_enable = '1')then
                register_0000 <= wrData;
            else
                register_0000 <= register_0000;
            end if;
        else
            register_0000 <= register_0000;
        end if;
    end if;
end process;

process(clk)
begin
    if(rising_edge(clk))then
        if((section_enable(0) = '1') and (subsection_enable(0) = '1') and (register_enable(1) = '1'))then
            if(write_enable = '1')then
                register_0001 <= wrData;
            else
                register_0001 <= register_0001;
            end if;
        else
            register_0001 <= register_0001;
        end if;
    end if;
end process;

 

 

Example 3:

A more extreme variation on Example 2: here I was thinking that maybe I could map the entire 10-bit address into a 1024-bit wide one-hot representation of the address. The wrData would then fan out to all destination registers, but each destination register's CE pin would be driven by the AND of write_enable with just one bit of the 1024-bit one-hot address representation.
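A sketch of what I have in mind (the entity and register names below are just placeholders for illustration, not my real design):

```vhdl
-- Sketch only: names are placeholders. Shows two of the 1024 registers.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity onehot_demux is
    port(
        clk           : in  std_logic;
        address       : in  std_logic_vector(9 downto 0);
        write_enable  : in  std_logic;
        wrData        : in  std_logic_vector(31 downto 0);
        register_0000 : out std_logic_vector(31 downto 0);
        register_0001 : out std_logic_vector(31 downto 0)
    );
end entity;

architecture rtl of onehot_demux is
    signal onehot_address : std_logic_vector(1023 downto 0) := (others => '0');
begin
    -- Register the 1024-bit one-hot representation of the 10-bit address.
    process(clk)
    begin
        if rising_edge(clk) then
            onehot_address <= (others => '0');
            onehot_address(to_integer(unsigned(address))) <= '1';
        end if;
    end process;

    -- Each destination register's CE is its one one-hot bit ANDed with write_enable.
    process(clk)
    begin
        if rising_edge(clk) then
            if onehot_address(0) = '1' and write_enable = '1' then
                register_0000 <= wrData;
            end if;
            if onehot_address(1) = '1' and write_enable = '1' then
                register_0001 <= wrData;
            end if;
        end if;
    end process;
end architecture;
```

The registered decode adds one cycle of latency, but since the address is stable for X cycles before write_enable, that extra cycle should be absorbed.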

 

Is there really any difference between all three examples? Am I thinking too much here, and maybe the most straightforward implementation (Example 1) with a long multicycle path is enough? Does Example 2 make Vivado's work easier and implement this quicker? Does Example 3 make the routing easier, and all complexity is devoted to the conversion of the 10-bit address to the 1024-bit one-hot representation?

 

Or maybe I am just trying to do Vivado's work here and I should not worry?

 

1 Solution

Accepted Solutions
Historian
391 Views
Registered: ‎01-23-2009

Re: Timing Closure in a Big Demux


I think you are overthinking this...

 

The "demux" you are describing doesn't actually exist. Each register is enabled when the address matches its own address and WE is asserted. Since your address is 10 bits wide, this is a function of 11 bits. The only real issue is that you will have a fanout of 1024 for each bit of the address...

 

I suspect that no matter how you describe it, the end results will be the same, assuming you describe the address logic as purely combinatorial. Vivado (and other synthesis tools) are very efficient at converting a "bubble" of combinatorial logic into the most efficient implementation, regardless of how the original was described (not always true, but it should be in this case).

 

The fact that you are giving the address multiple cycles to propagate through 10 of the 11 inputs of this cone of logic will make up for the large fanout. My guess is that this won't really be a problem.

 

That being said, "functional" multicycle paths (MCPs) like this can be a bit of a pain. If possible it is better to pipeline than to do MCPs. In this case, simply registering the addresses will allow the tools to replicate them for fanout control. Of course this will cost additional flip-flops, but may be worth it - it is SO easy to get MCPs wrong, and SO hard to detect when you have (you just get devices that are unreliable and it is really hard to diagnose the problem back to a bad MCP).
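In the OP's VHDL, that single extra stage is just the following sketch (the _r signal names here are hypothetical):

```vhdl
-- Sketch: pipeline instead of MCP. One extra register stage on the
-- high-fanout signals lets the tools replicate address_r / wrData_r
-- for fanout control. Signal names are hypothetical.
process(clk)
begin
    if rising_edge(clk) then
        address_r      <= address;
        wrData_r       <= wrData;
        write_enable_r <= write_enable;  -- keep WE aligned with the delayed address/data
    end if;
end process;
```

The demux logic then uses address_r, wrData_r and write_enable_r, with no MCP constraints at all.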

 

So back to coding it - since the coding style doesn't matter for the implementation, you should find the coding style that is simplest to write. One of the simplest is to code this as a two-dimensional array - 1024 entries of 32 bits each. When done this way, the demux is simply a de-reference of the address (forgive the Verilog pseudo-code)

 

always @(posedge clk)
begin
  if (write_enable)
  begin
     register[address]  <= wrData;
  end
end

This is certainly easier than anything else. While this may look like a RAM (and on its own it actually would be implemented as a RAM - well, at the moment, a write-only memory), whether it gets mapped to a BRAM, a distributed RAM, or flip-flops will be determined by what you do with the contents of "register" - if multiple entries are used on a given clock, then the tools will have no choice but to implement this as an array of flip-flops.
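The same style in the OP's VHDL would look like this sketch (the type and signal names are arbitrary):

```vhdl
-- VHDL equivalent of the 2-D array coding style (type/signal names arbitrary).
-- In the architecture declarative region:
--   type reg_array_t is array (0 to 1023) of std_logic_vector(31 downto 0);
--   signal registers : reg_array_t;

process(clk)
begin
    if rising_edge(clk) then
        if write_enable = '1' then
            -- the demux is just an indexed assignment
            registers(to_integer(unsigned(address))) <= wrData;
        end if;
    end if;
end process;
```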

 

BUT, that being said, this will be 32K flip-flops for your registers - that's a lot! Are you sure that these all need to be registers? Are there some where the access pattern allows them to be moved to a RAM? For example, if a chunk of them form a large lookup table where only one entry is used per clock, then moving these to a BRAM will save TONS of space (and routing resources, and fanout on your address, etc...)

 

WARNING!!! NEVER USE MCPs for the address inputs of a block RAM. This is specifically illegal and will result in the RAM contents being corrupted - even when you are not doing a write! This is explicitly stated in the Memory Resources User Guide - for example, for the 7 series, UG473 v1.12 p. 12.

 

"The setup time of the block RAM address and write enable pins must not be violated.
Violating the address setup time (even if write enable is Low) can corrupt the data
contents of the block RAM."

 

Avrum

7 Replies
Explorer
423 Views
Registered: ‎08-16-2018

Re: Timing Closure in a Big Demux


I think that will blow up at routing. I've seen smaller things that were impossible to route.

What about partitioning into, say, 2^5 demuxes of 2^5 outputs each? It may take more area, but hopefully not a lot more. It will certainly increase the latency.

When you say "a script will generate the 1024 registers" - you could do that with GENERATE, couldn't you?
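Something like this sketch (assuming the 1024 registers could live in one array, which may not match your naming scheme):

```vhdl
-- Sketch of the GENERATE idea; assumes the registers can be an array
-- (registers : array (0 to 1023) of std_logic_vector(31 downto 0)).
gen_regs : for i in 0 to 1023 generate
    process(clk)
    begin
        if rising_edge(clk) then
            if write_enable = '1' and unsigned(address) = i then
                registers(i) <= wrData;
            end if;
        end if;
    end process;
end generate;
```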

Adventurer
415 Views
Registered: ‎07-24-2016

Re: Timing Closure in a Big Demux


    What about partitioning in, say, 2^5 demux of 2^5 outputs each?

 

Can you elaborate more on that? (i.e. example)

 

 

   when you say "a script will generate the 1024 registers", you could do that with GENERATE, couldn't you?

 

Not really, as the destination registers (i.e. register_XXXX) will not be an array of 32-bit registers; they will have user-defined names generated by the aforementioned script (e.g. the 32-bit register at address "1010001100" will be register_cnt_limit).

Explorer
405 Views
Registered: ‎08-16-2018

Re: Timing Closure in a Big Demux


Take the 5 MS bits of the address and feed them to a 1-to-32 demux carrying the 32-bit data.

Now you have 32 32-bit channels. On each of those channels place another 1-to-32 demux, whose select bits are the remaining 5 bits of the address. You end up with 32x32 32-bit channels. In total you need 33 1-to-32 demux blocks.
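A sketch of the idea, shown for one channel (signal names are placeholders):

```vhdl
-- Sketch of the two-level split; names are placeholders.
-- Stage 1: the 5 MS address bits pick 1 of 32 channels.
process(clk)
begin
    if rising_edge(clk) then
        channel_enable <= (others => '0');
        channel_enable(to_integer(unsigned(address(9 downto 5)))) <= '1';
    end if;
end process;

-- Stage 2: within channel 0, the 5 LS bits pick 1 of its 32 registers.
process(clk)
begin
    if rising_edge(clk) then
        if channel_enable(0) = '1' and write_enable = '1' then
            registers_ch0(to_integer(unsigned(address(4 downto 0)))) <= wrData;
        end if;
    end if;
end process;
```

Stage 1 cuts the fanout of the lower address bits by 32, at the cost of one extra cycle of latency.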

 

Oh, I see. Well, I still think there might be a way (alias...?). Anyway, if a quick bunch of lines in Python or whatever does the job, that's it.

 

 

Explorer
388 Views
Registered: ‎08-16-2018

Re: Timing Closure in a Big Demux


There is nothing more beautiful than something simple (and working)

Visitor israelgr
319 Views
Registered: ‎04-14-2018

Re: Timing Closure in a Big Demux


Hello AvrumW, Johnvivm,

Cbakal mentioned two requirements in his question:

1) decoding the address.

2) accessing the configuration logic's registers.

 

If I understood correctly, both of you suggested a method for writing into the configuration logic, but the registers need to be individually accessible from the "other" side of the configuration logic (as I understand from the question).

Meaning, if it is a traditional memory, then only one value is exposed at a time, not all 1024 registers.

Another question about the suggested implementation: should these registers support other functionality expected from a register, like "clear after write"? If so, can memory-based logic support that and generate a pulse?

 

Regards,

Israel.

Explorer
312 Views
Registered: ‎08-16-2018

Re: Timing Closure in a Big Demux


The memory I call 'traditional' runs on glucose carried by blood.

 
