UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Participant valato
Participant
2,213 Views
Registered: ‎09-29-2017

Histogram synthesis

Jump to solution

Hi,

 

I am trying to implement histogram computation (at 150MHz on Zync 7020 484-1) but I am clueless. I tried many modifications to the following code but either I get II=2:

 

WARNING: [SCHED 204-68] Unable to enforce a carried dependence constraint (II = 1, distance = 1, offset = 1)
   between 'store' operation (Compression/compression.cpp:75) of variable 'histogram_V_load', Compression/compression.cpp:75 on local variable 'acc.V' and 'load' operation ('acc_V_1_load_1', Compression/compression.cpp:74) on local variable 'acc.V'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 2, Depth: 6.

 

or it does not work in cosim (when using false inter dependence):

 

RTL Simulation : 0 / 2 [0.00%] @ "116000"
// ERROR : Due to pragma (Compression/compression.cpp:34:1), dependence error (loop distance = 1) is detected in "`AUTOTB_DUT_INST.grp_linear_remap_fu_223"
//       : From memory access "histogram_V_address1" = 0x19bb @ "312261000"
//       : To memory access "histogram_V_address0" = DEP_address_0_to[0][13:0] = 0x19bb @ "312261000"

 

which is expected since different loop iterations can access the same address of histogram.

Here is my code:

 

#define GRAY_LEVELS (1 << 14) //* 14bit input
typedef ap_uint<11> iterator_t;
typedef ap_uint<19> histogram_t;
typedef ap_uint<14> luma_t;
typedef ap_uint<16> pixel_t;

static histogram_t histogram[GRAY_LEVELS];

void linear_remap(axi_stream & imgInput, axi_stream & imgOutput, iterator_t width, iterator_t height, pixel_t histFrom, pixel_t histTo, unsigned char mode, pixel_t param1, pixel_t param2)
{
#pragma HLS DEPENDENCE variable=histogram intra false
//#pragma HLS DEPENDENCE variable=histogram inter false

    pixel_t valueFrom, valueTo;
    if (mode == 0)
    {
        valueFrom = histFrom;
        valueTo = histTo;
    }
    else
    {
        valueFrom = param1;
        valueTo = param2;
    }
    pixel_t valueRange = valueTo - valueFrom;
    ap_uint<32> coef = (((ap_uint<32>)0xffff) << 16) / valueRange; //* 16.16

    histogram_t acc = 0;
    luma_t luma_old = 0;

    for (iterator_t y = 0; y < height; y++)
    {
MY_PRAGMA(HLS LOOP_TRIPCOUNT max=IMAGE_HEIGHT)
        for (iterator_t x = 0; x < width; x++)
        {
MY_PRAGMA(HLS LOOP_TRIPCOUNT max=IMAGE_WIDTH)
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_FLATTEN OFF
            axi_packet pixel;
            imgInput >> pixel;
            pixel_t value = pixel.data;

            luma_t luma;
            luma = value.range(15, 2); //* upper 14bit valid

            if (luma == luma_old)
            {
                acc++;
            }
            else
            {
                histogram[luma_old] = acc; //* problematic 'load' operation
                acc = histogram[luma]+1; //* problematic 'store' operation
            }
            luma_old = luma;

            pixel_t valueOut;
            if (value < valueFrom)
            {
                valueOut = 0;
            }
            else if (value > valueTo)
            {
                valueOut = 0xffff;
            }
            else
            {
                ap_uint<32> temp = (value - valueFrom) * coef; //* 16.0 * 16.16 = 32.16 but result will never exceed 16.16
                valueOut = (temp >> 16);
            }
            pixel.data = valueOut;
            imgOutput << pixel;
        }
    }
    histogram[luma_old] = acc;
}

 

----- Please mark the post as an answer "Accept as solution" in case it helped to solve your problem. Give kudos in case the post guided you to the solution.
0 Kudos
1 Solution

Accepted Solutions
Scholar jmcclusk
Scholar
2,566 Views
Registered: ‎02-24-2014

Re: Histogram synthesis

Jump to solution

Isn't this a solved problem, part of the SDSoc toolbox?

 

https://www.xilinx.com/html_docs/xilinx2017_4/sdsoc_doc/unb1504034273879.html

 

This reports a processing rate of 1 pixel per clock,   or even 8 pixels per clock.

 

In particular, the histogram code seems to be available in this code module:

 

imgproc/xf_histogram.hpp

 

I should probably track this down and read it, to see how they get 1 pixel per clock.   They probably are using lookahead in the data pipeline, like I did.

 

EDIT:   And actually, the histogram code has been on Github for some time,  right here.

 

https://github.com/Xilinx/HLx_Examples/blob/master/Vision/img_histEq/src3/img_hist1.cpp

 

void img_hist_equaliz1( uint11 width, uint11 height, uint19 hist[GRAY_LEVELS], uint25 cdf[GRAY_LEVELS],
		RGB_t inp_img[MAX_WIDTH*MAX_HEIGHT], RGB_t out_img[MAX_WIDTH*MAX_HEIGHT])
{
   int i;
   uint11  x, y;
   uint8   Y, U, V;

   uint8   cdf_temp;
   uint25  cdf_mult, gain;

   uint32 acc = 0;
   uint8  old, val;

 // the following line is commented to allow a comparison with the same ref results even if the image is cropped
 // note that as an optimization, gain could be a constant pre-computed depending on the image size
//   gain = 256*1024*(GRAY_LEVELS-1) / (height*width-1);
//   gain = 256*1024*(GRAY_LEVELS-1) / (MAX_HEIGHT*MAX_WIDTH-1); //256*1024 are 18-bit Left Shift
//#ifndef __SYNTHESIS__
//   printf("gain = %f\n", (float) gain);
//#endif

   old = 0;
   Ly: for (y=0; y<height; y++)
   {
	#pragma HLS LOOP_TRIPCOUNT min=480 max=1080 avg=720

	   Lx: for (x=0; x<width; x++)
	   {
		   #pragma HLS LOOP_TRIPCOUNT min=640 max=1920 avg=1280

			Y = inp_img[y*MAX_WIDTH +x].R;
			U = inp_img[y*MAX_WIDTH +x].G;
			V = inp_img[y*MAX_WIDTH +x].B;

			// compute the histogram for current image I(n)
			// hist[Y] = hist[Y] + 1;

			val = Y;

			if (old==val)
				acc++;
			else
			{
				hist[old]=acc;
				acc=hist[val]+1;
			}
			old = val;

			// histogram equalization
			cdf_mult = ( (cdf[Y]-1) *(GRAY_LEVELS-1)) / (height * width -1);
			//cdf_mult =(cdf[Y]-1) * gain;
			//cdf_mult = cdf_mult >> 18;

			// clipping (due to the approssimation done above)
			if (cdf_mult >= (GRAY_LEVELS-1))
			   cdf_temp = (GRAY_LEVELS-1);
			else if (cdf_mult <= 0 )
			   cdf_temp = 0;
			else
			   cdf_temp = (uint8) cdf_mult;

			  Y = cdf_temp;

			  out_img[y*MAX_WIDTH +x].R = Y;
			  out_img[y*MAX_WIDTH +x].G = U;
			  out_img[y*MAX_WIDTH +x].B = V;

		} // end of Lx loop
	} // end of Ly loop
   	hist[old]=acc;
}

 

 

Don't forget to close a thread when possible by accepting a post as a solution.
0 Kudos
9 Replies
Scholar u4223374
Scholar
2,203 Views
Registered: ‎04-26-2015

Re: Histogram synthesis

Jump to solution

I haven't found a good solution to this yet. The easy workaround is to just run the histogram block twice as fast - it's a pretty simple operation so it wouldn't surprise me if you can get 300MHz from it. The harder workaround is to do it in HDL; again, it's a pretty simple operation, and HDL gives you control over what the RAM is doing on each clock cycle.

 

0 Kudos
Scholar jmcclusk
Scholar
2,182 Views
Registered: ‎02-24-2014

Re: Histogram synthesis

Jump to solution

I've solved the histogram problem in VHDL.    See this post for the VHDL code (which needs some polishing, I admit).  It's an ugly problem to solve, because of the read pipelining which will cause data loss when repeated words occur in the input stream if nothing special is done to prevent it.  Other solutions are rather brute force, avoiding the read latency problem by implementing parallel cores and distributing successive samples to the parallel cores.  Data hazard prevention has been done before, but was described rather vaguely, perhaps for patent reasons.

 

If the original poster can explain what they need, maybe I can find some time to adapt the RTL to work in the Zynq.  The challenge is how to glue a VHDL module into the framework that HLS creates..   Or perhaps the OP can use the data hazard prevention method directly in his C code.    Getting an IIS of 1 could still be a challenge, however.

Don't forget to close a thread when possible by accepting a post as a solution.
Highlighted
Participant valato
Participant
2,140 Views
Registered: ‎09-29-2017

Re: Histogram synthesis

Jump to solution

Thank you guys,

unfortunatelly I have no experience with VHDL - only codes generated by Vivado HLS. So I wouldn't be able to adapt or wrap your code to HLS. But still you brought me some interesting ideas.

My goal is to have IP that is able to do histogram based operations - equalization etc. I want to have at the beginning of each frame a histogram of previous frame. Compute CDF or some limits for linear remaping and then apply these parameters on current frame together with computing histogram for the next iteration.

It is really weird that it is so hard to implement it because on PC it is really trivial.

----- Please mark the post as an answer "Accept as solution" in case it helped to solve your problem. Give kudos in case the post guided you to the solution.
0 Kudos
Scholar jmcclusk
Scholar
2,134 Views
Registered: ‎02-24-2014

Re: Histogram synthesis

Jump to solution

Hmm..   The obvious solution is an AXI Stream IP block that connects to a video stream and generates a histogram for each color channel, then computes the CDF at the end of the frame, placing the CDF either into a BRAM buffer with a read port, or shoots it out via an AXI stream to another IP block that handles video equalization.  

 

There are complications, because if the video stream is more or less continuous, then double buffering (ping-pong) will be needed on the histogram RAM.   Another complication is that you probably want to handle video streams with 4 or 8 pixels per clock, so the histogram block needs to handle this as well (although it's not hard, actually).  

Don't forget to close a thread when possible by accepting a post as a solution.
Observer dr_jjrussell
Observer
2,124 Views
Registered: ‎11-02-2015

Re: Histogram synthesis

Jump to solution

One comment/One suggestion:

 

1) Is the optimization of saving up the accumulation until a new input value buying you anything?  This looks like data dependent optimization that may work in a serial computer, but not sure if it buys you anything in the parallel environment of an FPGA.  (And, yeah, this really is a question; I don't know, but my suspicion is that is doesn't, the loop time is determined by the longest path.)

 

2)  Whether this next suggestion works in practice depends on the relative size of the histogram (i.e. number of bins) to the image size.(number of pixels).  Believing the histogram is much smaller than the image, you might consider incrementing every other pixel into two different histograms and then summing the histograms at the end.  Each histogram will take the same number of cycles per bin, but there are only half the pixels. They should be able to run in parallel.  Whether this doubles your through-put is a detailed question.   This is really a common divide and conquer technique, so depending on the relative sizes, you could increase the number of histograms, perhaps buying a little more.  (The number of summations needed should go like log N, and with appropriate partitioning of the histogram memory can be made semi-arbitrarily fast depending on the resources you are willing to commit to it.)

 

 

Scholar u4223374
Scholar
2,082 Views
Registered: ‎04-26-2015

Re: Histogram synthesis

Jump to solution

@dr_jjrussell

 

(1) That's exactly what I found when I did it. I think that in VHDL you'd be able to do such an optimization, but to make it work you need to know exactly what's going to happen on every clock cycle. HLS doesn't allow that.

 

(2) This is the approach I eventually used, and it's straightforward to get 1 pixel per cycle from HLS. The downside is that it doubles the length of your buffer for a 1-bit reduction in width - generally a poor tradeoff, but in this case potentially acceptable. Adding more histograms will help if you're transferring multiple pixels in a single AXI Stream element, but isn't much use if you only transfer one pixel at a time.

 

@valato

>> It is really weird that it is so hard to implement it because on PC it is really trivial.

 

Try implementing it to run on a PC at one pixel per clock cycle! Not going to be easy.

 

Running it at one pixel every two clock cycles is trivial in HLS.

0 Kudos
Scholar jmcclusk
Scholar
2,567 Views
Registered: ‎02-24-2014

Re: Histogram synthesis

Jump to solution

Isn't this a solved problem, part of the SDSoc toolbox?

 

https://www.xilinx.com/html_docs/xilinx2017_4/sdsoc_doc/unb1504034273879.html

 

This reports a processing rate of 1 pixel per clock,   or even 8 pixels per clock.

 

In particular, the histogram code seems to be available in this code module:

 

imgproc/xf_histogram.hpp

 

I should probably track this down and read it, to see how they get 1 pixel per clock.   They probably are using lookahead in the data pipeline, like I did.

 

EDIT:   And actually, the histogram code has been on Github for some time,  right here.

 

https://github.com/Xilinx/HLx_Examples/blob/master/Vision/img_histEq/src3/img_hist1.cpp

 

void img_hist_equaliz1( uint11 width, uint11 height, uint19 hist[GRAY_LEVELS], uint25 cdf[GRAY_LEVELS],
		RGB_t inp_img[MAX_WIDTH*MAX_HEIGHT], RGB_t out_img[MAX_WIDTH*MAX_HEIGHT])
{
   int i;
   uint11  x, y;
   uint8   Y, U, V;

   uint8   cdf_temp;
   uint25  cdf_mult, gain;

   uint32 acc = 0;
   uint8  old, val;

 // the following line is commented to allow a comparison with the same ref results even if the image is cropped
 // note that as an optimization, gain could be a constant pre-computed depending on the image size
//   gain = 256*1024*(GRAY_LEVELS-1) / (height*width-1);
//   gain = 256*1024*(GRAY_LEVELS-1) / (MAX_HEIGHT*MAX_WIDTH-1); //256*1024 are 18-bit Left Shift
//#ifndef __SYNTHESIS__
//   printf("gain = %f\n", (float) gain);
//#endif

   old = 0;
   Ly: for (y=0; y<height; y++)
   {
	#pragma HLS LOOP_TRIPCOUNT min=480 max=1080 avg=720

	   Lx: for (x=0; x<width; x++)
	   {
		   #pragma HLS LOOP_TRIPCOUNT min=640 max=1920 avg=1280

			Y = inp_img[y*MAX_WIDTH +x].R;
			U = inp_img[y*MAX_WIDTH +x].G;
			V = inp_img[y*MAX_WIDTH +x].B;

			// compute the histogram for current image I(n)
			// hist[Y] = hist[Y] + 1;

			val = Y;

			if (old==val)
				acc++;
			else
			{
				hist[old]=acc;
				acc=hist[val]+1;
			}
			old = val;

			// histogram equalization
			cdf_mult = ( (cdf[Y]-1) *(GRAY_LEVELS-1)) / (height * width -1);
			//cdf_mult =(cdf[Y]-1) * gain;
			//cdf_mult = cdf_mult >> 18;

			// clipping (due to the approssimation done above)
			if (cdf_mult >= (GRAY_LEVELS-1))
			   cdf_temp = (GRAY_LEVELS-1);
			else if (cdf_mult <= 0 )
			   cdf_temp = 0;
			else
			   cdf_temp = (uint8) cdf_mult;

			  Y = cdf_temp;

			  out_img[y*MAX_WIDTH +x].R = Y;
			  out_img[y*MAX_WIDTH +x].G = U;
			  out_img[y*MAX_WIDTH +x].B = V;

		} // end of Lx loop
	} // end of Ly loop
   	hist[old]=acc;
}

 

 

Don't forget to close a thread when possible by accepting a post as a solution.
0 Kudos
Participant valato
Participant
2,055 Views
Registered: ‎09-29-2017

Re: Histogram synthesis

Jump to solution

@jmcclusk

This is exactly the code I am using.

 

@dr_jjrussell

As you can see they use the same technique to avoid accessing the same memory address.

 

The problem is with pragmas (which are missing in this code) - especially with DEPENDENCE. When using "DEPENDENCE inter false" cosim fails. When not using it HLS reports timing violation or II=2 with problematic store and load of 'acc'.

----- Please mark the post as an answer "Accept as solution" in case it helped to solve your problem. Give kudos in case the post guided you to the solution.
0 Kudos
Participant valato
Participant
1,978 Views
Registered: ‎09-29-2017

Re: Histogram synthesis

Jump to solution
I finally decided to go to lower frequency because in our latest project we do not need 1080p resolution.
But I confirm that version on GIThub is working. Although Synth reports 7.57ns, but after place and route it is only 5.45ns. So experience for me - do not rely too much on HLS estimates.
----- Please mark the post as an answer "Accept as solution" in case it helped to solve your problem. Give kudos in case the post guided you to the solution.
0 Kudos