Adventurer
Registered: 07-15-2013

array implementation: 2014.4 vs 2015.2


Hello,

I have a simple code for array calculations, as follows:

const int size = 76800;  // array depth; matches the 76800 words in the reports below

void arraySum(int A[size], int B[size], int result[size])
{
	int sum[size];
	int mult[size];
	int diff[size];

	for(int i=0; i<size; i++)
	{
		sum[i] = A[i]+B[i];
		diff[i] = A[i]-B[i];
		mult[i] = A[i]*B[i];
	}

	for( int z=0; z<size; z++)
	{
		result[z] = sum[z] + diff[z] + mult[z];
	}
}

The interesting thing is that the array implementation looks different between HLS 2014.4 and HLS 2015.2, as follows:

HLS 2015.2


    * Memory: 
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    | Memory |    Module    | BRAM_18K| FF| LUT| Words | Bits| Banks| W*Bits*Banks|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    |sum_U   |arraySum_sum  |      256|  0|   0|  76800|   32|     1|      2457600|
    |mult_U  |arraySum_sum  |      256|  0|   0|  76800|   32|     1|      2457600|
    |diff_U  |arraySum_sum  |      256|  0|   0|  76800|   32|     1|      2457600|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    |Total   |              |      768|  0|   0| 230400|   96|     3|      7372800|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
HLS 2014.4

    * Memory: 
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    | Memory |    Module    | BRAM_18K| FF| LUT| Words | Bits| Banks| W*Bits*Banks|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    |sum_U   |arraySum_sum  |      160|  0|   0|  76800|   32|     1|      2457600|
    |mult_U  |arraySum_sum  |      160|  0|   0|  76800|   32|     1|      2457600|
    |diff_U  |arraySum_sum  |      160|  0|   0|  76800|   32|     1|      2457600|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+
    |Total   |              |      480|  0|   0| 230400|   96|     3|      7372800|
    +--------+--------------+---------+---+----+-------+-----+------+-------------+

So why does each version produce a different BRAM utilization for the same code: 160 BRAM_18Ks in HLS 2014.4 versus 256 in HLS 2015.2?

 

Thanks 

Karim

15 Replies
Teacher
Registered: 03-31-2012

Re: array implementation: 2014.4 vs 2015.2

What is the latency for each implementation? Probably the two versions are making different choices. Also, are the two solutions using the same target chip?
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


Yes, both are using the same chip, xc7z045ffg900-2 (ZC706 board).

HLS 2014.4
+ Timing (ns): 
    * Summary: 
    +---------+-------+----------+------------+
    |  Clock  | Target| Estimated| Uncertainty|
    +---------+-------+----------+------------+
    |default  |  10.00|      8.18|        1.25|
    +---------+-------+----------+------------+

+ Latency (clock cycles): 
    * Summary: 
    +--------+--------+--------+--------+---------+
    |     Latency     |     Interval    | Pipeline|
    |   min  |   max  |   min  |   max  |   Type  |
    +--------+--------+--------+--------+---------+
    |  691202|  691202|  691203|  691203|   none  |
    +--------+--------+--------+--------+---------+
HLS 2015.2
+ Timing (ns): 
    * Summary: 
    +--------+-------+----------+------------+
    |  Clock | Target| Estimated| Uncertainty|
    +--------+-------+----------+------------+
    |ap_clk  |  10.00|      8.16|        1.25|
    +--------+-------+----------+------------+

+ Latency (clock cycles): 
    * Summary: 
    +--------+--------+--------+--------+---------+
    |     Latency     |     Interval    | Pipeline|
    |   min  |   max  |   min  |   max  |   Type  |
    +--------+--------+--------+--------+---------+
    |  691202|  691202|  691203|  691203|   none  |
    +--------+--------+--------+--------+---------+
Teacher
Registered: 03-31-2012

Re: array implementation: 2014.4 vs 2015.2

Actually, even 160 is too much. If one builds 32-bit memories out of 18Kb BRAMs at 512x36, one needs only 150 of them, plus a couple of levels of muxes to select which one is in use.
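
A quick sanity check of this mapping (my own sketch, not from the original post), assuming each 18Kb BRAM is configured as 512x36:

#include <cstdio>

int main() {
    const int words = 76800;       // total array depth from the reports
    const int bank_depth = 512;    // one 18Kb BRAM configured as 512 x 36
    // 76800 / 512 = 150 exactly, so 150 banks hold one 32-bit array
    std::printf("banks: %d\n", (words + bank_depth - 1) / bank_depth);
    return 0;
}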
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
Explorer
Registered: 07-13-2015

Re: array implementation: 2014.4 vs 2015.2


For what it's worth, Vivado 2015.1 matches Vivado 2015.2 (256 BRAM_18K blocks each).

 

At a guess, to make 160 RAMs (in 2014.4) it has put sixteen 8K*2-bit RAMs in parallel to get 32 bits, and then lined up ten of those groups in series to reach the requested depth (76800). I'd guess that this saves logic compared to using 160 512*36-bit blocks, since for that you'd need a 160-way multiplexer to select which BRAM_18K was being used at any time.

 

I'm not sure how it's managed to use 256 BRAM_18Ks. I can't find any logical mapping that'd use that many.

 

In the produced Verilog code, the RAM turns up like this:

 

(* ram_style = "block" *)reg [DWIDTH-1:0] ram[MEM_SIZE-1:0];

(DWIDTH=32, MEM_SIZE=76800)

 

HLS has just asked for a RAM big enough to hold everything, and its own resource usage figures seem to be a guess at how Vivado (non-HLS) will behave during its synthesis/implementation process.

Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


I think we have started to look at how to build these arrays, but from my point of view we forgot to ask ourselves why two versions of the same tool behave differently when translating the same piece of code. Why such inconsistency?

I will share the synthesis reports for both so we can have a look at them.

Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


@muzaffer @evanslatyer

I can explain the 160 for 2014.4 easily.

Simply, each bit is built separately from the others:

76800/(16*1024) = 4.6875, so approximately 5 BRAM_18Ks per bit

then for 32-bit values it will be 32*5 = 160 BRAM_18Ks

 

but for 2015.2 

I have no idea !!!
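
The same kind of sanity check for the 16K x 1 mapping described above (again my own sketch, not tool output):

#include <cstdio>

int main() {
    const int words = 76800;         // array depth
    const int width = 32;            // element width in bits
    const int bram_bits = 16 * 1024; // one BRAM_18K configured as 16K x 1
    // ceil(76800 / 16384) = 5 BRAMs per bit-slice; 5 * 32 bit-slices = 160
    const int per_bit = (words + bram_bits - 1) / bram_bits;
    std::printf("BRAM_18K estimate: %d\n", per_bit * width);
    return 0;
}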

Teacher
Registered: 03-31-2012

Re: array implementation: 2014.4 vs 2015.2

One can do a 16-to-1 mux in a single slice, so a 150-to-1 mux is not too bad, but I do like the 8Kx2 solution too. It's a matter of usage: waste 10 18K BRAMs, or use more logic.
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
Explorer
Registered: 07-13-2015

Re: array implementation: 2014.4 vs 2015.2


That seems to be how HLS works. It'll happily map a 3-bit * 2-bit multiplier (used to select an array address) into a DSP48E, even though it would only occupy a couple of LUTs anyway. I gather that when you get to the main synthesis/implementation (in Vivado, non-HLS) you can get these implemented in LUTs.

Moderator
Registered: 04-17-2011

Re: array implementation: 2014.4 vs 2015.2

There is a roadmap to have a core for implementing the multiplication operation in LUTs instead of DSPs in HLS itself. Stay tuned :)
Regards,
Debraj
----------------------------------------------------------------------------------------------
Kindly note- Please mark the Answer as "Accept as solution" if information provided is helpful.

Give Kudos to a post which you think is helpful and reply oriented.
----------------------------------------------------------------------------------------------
Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


But no one has given me an explanation for this difference in implementation between the two versions.

Accepted Solution
Xilinx Employee
Registered: 08-17-2011

Re: array implementation: 2014.4 vs 2015.2


Hello All & @eng_karim

 

I took the example code provided at the top of this thread and ran it through the 2 different VHLS versions, using a Kintex-7.

After export_design -evaluate verilog, both designs use 768 BRAMs (results below).

It looks like 2014.4 was making wrong estimations, and 2015.3/2015.x are in line with Vivado.

 

FWIW, the RTL code is nearly identical in the 2 versions.

 

I don't want to steer the argument off course, but the line   result[z] = sum[z] + diff[z] + mult[z];   could be written   result[z] = A[z] + A[z] + A[z]*B[z];   and not use RAMs at all... maybe that's all you need?
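
To make that concrete, here is a minimal sketch (mine, not from the original post) of the fused version: with no intermediate arrays, there is nothing for HLS to map into block RAM.

const int size = 76800;

void arraySum_fused(int A[size], int B[size], int result[size])
{
	for (int z = 0; z < size; z++)
	{
		// (A+B) + (A-B) + (A*B) reduces algebraically to 2*A + A*B
		result[z] = A[z] + A[z] + A[z]*B[z];
	}
}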

 

If you don't like what you get as a result, then please just recode it in a way that makes more sense to you.

I'm not joking: use your C code to tell the tool exactly what you want instead of letting it infer a RAM for you, because all VHLS does is write (* ram_style = "block" *)reg [DWIDTH-1:0] ram[MEM_SIZE-1:0]; with DWIDTH=32 and MEM_SIZE=76800. The RTL synthesis or implementation is going to build the actual BRAM logic - maybe making the RAMs 1 bit wide and 16K deep, as mentioned before.

 

If you want something else, then just write it.

I give an example at the end of this post: minimum BRAMs, trading lots of BRAM for lots of LUTs to build the multiplexers; the export_design -evaluate report is also included.

 

 

Implementation tool: Xilinx Vivado v.2014.4
Device target:       xc7k325tffg676-2
Report date:         Tue Sep 01 10:05:41 PDT 2015

#=== Resource usage ===
SLICE:          443
LUT:            688
FF:             231
DSP:              3
BRAM:           768
SRL:              0
#=== Final timing ===
CP required:    10.000
CP achieved:    8.611
Timing met

 

 

Implementation tool: Xilinx Vivado v.2015.3
Device target:       xc7k325tffg676-2
Report date:         Tue Sep 01 10:04:55 PDT 2015

#=== Resource usage ===
SLICE:          453
LUT:            690
FF:             231
DSP:              3
BRAM:           768
SRL:              0
#=== Final timing ===
CP required:    10.000
CP achieved:    8.921
Timing met

 

const int size = 76800;
const int bram_size = 512;   // one 18K BRAM configured as 512 x 36


// 150 banks (76800/512); the high address bits select the bank,
// the low bits address within it
struct my_ram
{
    int ram_array[size/bram_size][bram_size];
    my_ram() {
// partition the first dimension so each bank becomes its own physical RAM
#pragma HLS array_partition variable=ram_array complete dim=1
    }
    int read(int addr) { return ram_array[addr/bram_size][addr%bram_size];}
    void write(int addr, int value) {ram_array[addr/bram_size][addr%bram_size]=value;}
};


void top(int A[size], int B[size], int result[size])
{
	my_ram sum,mult,diff;

	for(int i=0; i<size; i++)
	{
		sum.write( i,A[i]+B[i]);
		diff.write(i,A[i]-B[i]);
		mult.write(i,A[i]*B[i]);
	}

	for( int z=0; z<size; z++)
	{
		result[z] = sum.read(z) + diff.read(z) + mult.read(z);
	}
}

 

 

Implementation tool: Xilinx Vivado v.2015.3
Device target:       xc7k325tffg676-2
Report date:         Tue Sep 01 10:37:00 PDT 2015

#=== Resource usage ===
SLICE:         2221
LUT:           6154
FF:             329
DSP:              3
BRAM:           450
SRL:              0
#=== Final timing ===
CP required:    10.000
CP achieved:    9.490
Timing met

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.


Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


Thanks @herver.

Actually, I wrote this example just to report the difference in implementation between the two versions.

 

This point is not clear to me:

"After export_design -evaluate verilog, both designs use 768 BRAMs (results below).

It looks like 2014.4 was making wrong estimations, and 2015.3/2015.x are in line with Vivado."

Do you mean that the synthesis results changed after export? From which report did you get those values, or did you check with Vivado?

Also, why did you use a Kintex-7?

 

Xilinx Employee
Registered: 08-17-2011

Re: array implementation: 2014.4 vs 2015.2


Let me try to explain again.

What I'm saying is that the reports from C synthesis in VHLS are only estimates.

To get better numbers, run RTL synthesis and implementation using Vivado.

Either try it yourself directly in Vivado or, better, use the export design feature in VHLS.

Use either the VHLS GUI or the VHLS Tcl command line. In the GUI, do export design and select evaluate. On the Tcl command line, the equivalent command is export_design -evaluate verilog.

Since the RTL synthesis/implementation resource results in 2014.4 and 2015.x are the same, I conclude that the resource reporting for this design is erroneous in 2014.4 and better in 2015.x.

Is that clearer?

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.
Adventurer
Registered: 07-15-2013

Re: array implementation: 2014.4 vs 2015.2


I got it.

Thanks for your patience, @herver.

This means that I shouldn't rely on the estimations anymore.

I have run it and got these values for the ZC706.

It is a bit strange to me that the two versions didn't give similar results:

 

HLS 2014.4
#=== Resource usage ===
SLICE:          517
LUT:            825
FF:             169
DSP:              3
BRAM:           768
SRL:             20
#=== Final timing ===
CP required:    10.000
CP achieved:    7.826

HLS 2015.2
#=== Resource usage ===
SLICE:          446
LUT:            709
FF:             189
DSP:              3
BRAM:           768
SRL:             20
#=== Final timing ===
CP required:    10.000
CP achieved:    9.193
Xilinx Employee
Registered: 08-17-2011

Re: array implementation: 2014.4 vs 2015.2


Glad we have come to a common understanding.

 

The difference in LUT/FF (and, as a consequence, slices) does not worry me too much; you can see that the achieved timing is also different.

 

The input C design is the same, but 2 tool versions are used to generate the RTL, and they are potentially different: it may well be that the generated microarchitecture is different.

Imagine that those are 2 valid solutions in the exploration space of possible RTL implementations of the C code that meet the 10ns worst-case clock period.

 

If you cross-reference with my results, both were very similar (also recall that the FPGAs used are different).

 

If you want to explore this topic further, you can try running Vivado RTL synthesis and implementation on the RTL generated by VHLS 2014.4 using Vivado 2015.2, and vice versa.

- Hervé

SIGNATURE:
* New Dedicated Vivado HLS forums* http://forums.xilinx.com/t5/High-Level-Synthesis-HLS/bd-p/hls
* Readme/Guidance* http://forums.xilinx.com/t5/New-Users-Forum/README-first-Help-for-new-users/td-p/219369

* Please mark the Answer as "Accept as solution" if information provided is helpful.
* Give Kudos to a post which you think is helpful and reply oriented.