Visitor bkochtud
Visitor
4,866 Views
Registered: ‎08-15-2015

Optimize II with DRAM access

Hi,

 

I am currently trying to build an image filter which needs quite a lot of additional data (15 x 32-bit words) per pixel for processing.

Since this is more than the FPGA can store in BRAM, I am trying to use the external DRAM to store the data.

 

This is my code; it reads the image line by line and processes it that way.

void TOP(hls::stream<axiPixel> &in, hls::stream<axiPixel> &out, ap_fixed<32,16> *RAMin, ap_fixed<32,16> *RAMout)
{
	#pragma HLS INTERFACE m_axi port=RAMin
	#pragma HLS INTERFACE m_axi port=RAMout
	#pragma HLS INTERFACE axis port=out
	#pragma HLS INTERFACE axis port=in
	unsigned short y = 0;
	uint8_t pixelIn[640];
	uint8_t pixelOut[640];
	#pragma HLS DATAFLOW
	YLOOP:for ( y = 0; y < 480; y++)
	{
		#pragma HLS DATAFLOW

		bufferLine buf_in, buf_out;
		readLine(in, pixelIn);//read line of the image
		readBuffer(&buf_in,RAMin, 0,y);//read data needed for processing
		image_filter(pixelIn, pixelOut, buf_in, buf_out);//apply filter
		writeLine(out, pixelOut, last);//write back line
		writeBuffer(&buf_out,RAMout, 0,y);//write back data
	}
}
 

The DATAFLOW directive works fine on the function calls, but the overall II of the TOP function is really bad. So I would like to pipeline the YLOOP, but every time I add the PIPELINE directive, synthesis takes really long and my system runs out of RAM sooner or later (I only have 8 GB).

 

Is my approach to the problem wrong or should I give the synthesis another try, maybe on a stronger machine?

 

Thanks in advance!

 

B

4 Replies
Scholar u4223374
Scholar
4,854 Views
Registered: ‎04-26-2015

Re: Optimize II with DRAM access

The pipeline directive is not really suitable here. It inlines every function and unrolls every loop. Instead of readLine reading one pixel at a time, now readLine will attempt to read an entire line of pixels simultaneously - which will not work because you can't read 640 pixels at once from a stream. Similarly, readBuffer will read 15*640 = 9600 32-bit elements simultaneously from RAM. This isn't a question of having a more powerful computer; this system is never going to work well.
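One common alternative, sketched here under assumptions, is to leave YLOOP alone and put the PIPELINE directive on the innermost per-pixel loop inside each helper. The function below is a hypothetical stand-in for readLine (the real one is not shown in the thread), and it reads from a plain array rather than an hls::stream so that it compiles with an ordinary C++ compiler, which simply ignores the HLS pragma.

```cpp
#include <cassert>
#include <cstdint>

const int WIDTH = 640;  // line length taken from the original post

// Hypothetical stand-in for readLine(): copies one image line.
// In the real design the source would be an hls::stream<axiPixel>.
void readLineSketch(const uint8_t *src, uint8_t line[WIDTH]) {
    assert(src != nullptr);
    PIXEL_LOOP: for (int x = 0; x < WIDTH; x++) {
        // Pipelining only this loop asks for one pixel per clock cycle
        // without unrolling the loop into 640 simultaneous reads.
#pragma HLS PIPELINE II=1
        line[x] = src[x];
    }
}
```

With the directive here, the tool is asked to start a new loop iteration every cycle, so a line takes roughly 640 cycles, and nothing above this loop gets unrolled.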

 

 

What II is the top function giving you? It seems likely that the absolute lower limit is 640*480*15*2 = 9,216,000 cycles (assuming one 32-bit word is transferred per clock cycle), because that's how many clock cycles it'll take to read all that data out of RAM and then write it back when you're done. If that's too long, then you need to look at ways of optimizing the reads - either reducing the amount of data needed (do you really need 32-bit values, and are none of them shared between pixels?) or increasing available bandwidth (e.g. a 512-bit AXI Master will take a lot of resources, but it can transfer all 15 32-bit values in one clock cycle). Or you can keep it simple and just turn up the clock speed.
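The arithmetic behind that lower limit can be written down as a small helper. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes one bus transfer per clock cycle, no burst or arbitration overhead, and no packing of data across pixel boundaries.

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to move every pixel's extra data once across the bus,
// assuming one transfer (beat) per clock cycle.
uint64_t transferCycles(uint64_t pixels, unsigned words_per_pixel,
                        unsigned word_bits, unsigned bus_bits) {
    assert(bus_bits > 0);
    uint64_t bits_per_pixel = (uint64_t)words_per_pixel * word_bits;
    // Round up: a partially filled beat still costs a full cycle.
    uint64_t beats_per_pixel = (bits_per_pixel + bus_bits - 1) / bus_bits;
    return pixels * beats_per_pixel;
}
```

For example, transferCycles(640 * 480, 15, 32, 32) gives 4,608,000 cycles per direction, so 9,216,000 for a read followed by a write-back. With a 512-bit bus, or with 8-bit values on a 128-bit bus, the same call gives 307,200 cycles per direction.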

 

If 9,216,000 cycles is acceptable but it's taking much more than that, then you can keep the existing code but optimise it. In particular, have a look at the STREAM directive - it's used with the DATAFLOW one to improve performance and reduce resources.

Visitor bkochtud
Visitor
4,841 Views
Registered: ‎08-15-2015

Re: Optimize II with DRAM access

Thanks for your fast and helpful answer!

 

To give a bit more insight: this is not a traditional image filter but a filter which removes certain oscillations for every pixel from a grayscale video stream. To achieve this, every pixel has to keep a history of its previous values (how many depends on the filter taps), so no sharing between pixels is possible.

 

Currently the II for TOP is 4,625,940, but I would like to reduce it at least a bit so I don't need a large FIFO on the input side.

So first I will try to go down to 8-bit values (instead of 32) and I will also look into increasing the bandwidth of the AXI master - I totally did not think about that, thanks for pointing it out!

Scholar u4223374
Scholar
4,833 Views
Registered: ‎04-26-2015

Re: Optimize II with DRAM access

That's a very annoying problem! It feels like there should be a nice solution using FFTs, but I can't see a good way to do that on a per-pixel basis.

 

What's the interface for your memory? Obviously there's not much point having an AXI master that's wider and faster than the RAM itself, but if you're using something like a Zynq with PS RAM then you can do some tricks. For example, turn the 64-bit AXI Master ports up to maximum speed (250MHz), which gives a maximum bandwidth of 2GB/s. If you can cut the values down to 8-bit (so your AXI masters will be 120-bit/128-bit wide) then you can run that block at 125MHz before saturating the port. AXI interconnects will handle the 128-bit/125MHz to 64-bit/250MHz conversion.
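The bandwidth figures above follow from width times clock rate. The helper below is a hypothetical illustration of that arithmetic only; it gives a theoretical peak and ignores protocol overhead and DRAM efficiency.

```cpp
#include <cassert>
#include <cstdint>

// Theoretical peak bandwidth of a port: bytes per beat times beats per second.
uint64_t peakBytesPerSecond(unsigned bus_bits, uint64_t clock_hz) {
    assert(bus_bits % 8 == 0);
    return (uint64_t)(bus_bits / 8) * clock_hz;
}
```

Here peakBytesPerSecond(64, 250000000) and peakBytesPerSecond(128, 125000000) both come out at 2,000,000,000 bytes per second, which is why a 128-bit block at 125MHz just saturates a 64-bit port at 250MHz. The 15 8-bit values occupy 120 bits, so they fit into one 128-bit beat with 8 bits of padding.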

 

 

 

Teacher muzaffer
Teacher
4,799 Views
Registered: ‎03-31-2012

Re: Optimize II with DRAM access

It seems you are doing interframe filtering on a per-pixel basis. Since you want pixels from N previous frames, this suggests that a different image layout would help, if you can control it. It would reduce memory accesses a lot if you could arrange your images so that these pixels are next to each other in memory, as opposed to having a stride of one full frame.
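To make the layout point concrete, here is a sketch of the two indexing schemes. The names and the choice of 15 history values per pixel are assumptions taken from the numbers earlier in the thread; the original poster's actual memory layout is not shown.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t W = 640, H = 480, TAPS = 15;

// Frame-major: whole frames stored back to back, so the history of one
// pixel is spread out with a stride of W * H elements.
std::size_t frameMajorIndex(std::size_t t, std::size_t y, std::size_t x) {
    assert(t < TAPS && y < H && x < W);
    return t * W * H + y * W + x;
}

// Pixel-major: the TAPS history values of one pixel are adjacent, so they
// can be fetched with a single contiguous burst.
std::size_t pixelMajorIndex(std::size_t t, std::size_t y, std::size_t x) {
    assert(t < TAPS && y < H && x < W);
    return (y * W + x) * TAPS + t;
}
```

In the frame-major case, consecutive history values of the same pixel are 307,200 elements apart; in the pixel-major case they are 1 element apart, and the next pixel's history starts TAPS elements later.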
Also, as @u4223374 suggests, a transform-domain solution would help significantly. You can transform the filter once, transform each delay line of pixels, do pairwise multiplication and invert. There are some other details, but this could help a lot with the computation.
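The transform, multiply, invert sequence can be sketched as follows. This is only an illustration of the principle: it uses a naive O(N^2) DFT where a real design would use an FFT, it computes a circular convolution (a linear filter would need zero padding or an overlap scheme, which is among the "other details"), and all function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive DFT; inverse=true flips the exponent sign and scales by 1/N.
cvec dft(const cvec &in, bool inverse) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    const double sign = inverse ? 1.0 : -1.0;
    cvec out(n);
    for (std::size_t k = 0; k < n; k++) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; j++) {
            double angle = sign * 2.0 * pi * double(k * j) / double(n);
            acc += in[j] * std::polar(1.0, angle);
        }
        out[k] = inverse ? acc / double(n) : acc;
    }
    return out;
}

// Transform both inputs, multiply pairwise, transform back.
cvec circularConvolveDft(const cvec &h, const cvec &x) {
    assert(!h.empty() && h.size() == x.size());
    cvec hf = dft(h, false);  // the filter could be transformed once and reused
    cvec xf = dft(x, false);  // one pixel's delay line
    for (std::size_t k = 0; k < hf.size(); k++) hf[k] *= xf[k];
    return dft(hf, true);
}

// Direct circular convolution, for comparison.
cvec circularConvolveDirect(const cvec &h, const cvec &x) {
    assert(!h.empty() && h.size() == x.size());
    const std::size_t n = h.size();
    cvec y(n);
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            y[i] += h[j] * x[(i + n - j) % n];
    return y;
}
```

Both functions should agree up to rounding error for any pair of equal-length inputs. Whether the transform route actually pays off for a short filter of around 15 taps per pixel would need to be checked for the specific design.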