03-08-2011 12:50 AM
I am a newbie. I enter my designs using Verilog HDL, use ISE 12.4 for synthesis and implementation, and am targeting a Virtex-6 FPGA.
I am looking to design a block which can calculate the pseudo-inverse of a given matrix. Because of a lack of time, I do not wish to code this design on my own (and it's a big design). I was wondering if any of you folks are aware of alternative ways to do this (e.g., IP, open cores, etc.). If you are, kindly share your ideas with me here.
I appreciate your help.
Thanks and Regards,
03-08-2011 06:09 AM
I'm not aware of any IP that does this. I would write the code in software, use the SDK tool with a MicroBlaze soft processor, and run it on a development board.
03-08-2011 05:08 PM
How big of a matrix?
Fixed point or Floating point?
How many bits?
Where's the matrix stored? (Block Ram, FFs, other (offchip?))
Where will the results be stored?
What are the latency requirements?
What are the bandwidth requirements? (i.e., how often would you need to calculate a new inverse)
Is there a CPU available to do some of the work?
03-08-2011 08:46 PM
I am trying to map a software implementation that uses matrix pseudo-inversion into hardware. Over the past few days, I have learnt the hard way that some things are simply too complicated to be dealt with in hardware, and I believe this is one of them.
I read some literature on this topic, and although many of the matrix pseudo-inverse implementations are scalable, they are still only suitable for small matrix sizes. Unfortunately, my matrices are not small, which is why I am not sure I can implement this design in hardware.
@markcurry: Here are the details:
How big of a matrix? -- variable, but between (8 x 4) and (256 x 4)...it's a non-square matrix, hence the pseudo-inverse
Fixed point or Floating point? -- Floating point
How many bits? -- 64 bits...I am full of good news, aren't I?
Where's the matrix stored? (Block Ram, FFs, other (offchip?)) -- Block RAM...simply not possible to do this with FFs
Where will the results be stored? -- Another Block RAM
What are the latency requirements? -- latency is usually the time until the output is available with respect to a change in the input...by that definition, my worst-case latency requirement is around 5 milliseconds
What are the bandwidth requirements (i.e. how often would you need to calculate a new inverse)? -- say, around 1 millisecond...I have a sequential implementation where this same pseudo-inverse code is called four times
Clock speeds? -- The rest of my code has a baseline speed of 106.4 MHz
Is there a CPU available to do some of the work? -- No, but I have a TI DSP chip...can this help?
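[Editorial note: given the shapes described above (m x 4 with m between 8 and 256), the Moore-Penrose pseudo-inverse of a full-column-rank tall matrix reduces to A+ = (A^T A)^-1 A^T, so only a 4 x 4 system ever needs to be inverted, no matter how many rows there are. A minimal NumPy sketch for illustration (the function name is my own, and this assumes full column rank):]

```python
import numpy as np

def tall_pinv(A):
    """Moore-Penrose pseudo-inverse of a tall, full-column-rank matrix.

    For an (m x n) matrix with m >= n and full column rank,
    A+ = (A^T A)^{-1} A^T, so only an (n x n) inverse is needed --
    here 4 x 4, regardless of how many rows A has.
    """
    AtA = A.T @ A                      # (n x n) Gram matrix, e.g. 4 x 4
    return np.linalg.solve(AtA, A.T)   # solves (A^T A) X = A^T

# Example: a 48 x 4 matrix, in the middle of the size range discussed
rng = np.random.default_rng(0)
A = rng.standard_normal((48, 4))
P = tall_pinv(A)                       # shape (4, 48)
```

Because the expensive part is a fixed-size 4 x 4 solve, only the two accumulations A^T A and A^T (whose loop bounds depend on m) grow with the matrix height.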
@francism: Thank you for your input. I have this question for you -- when you have large matrices (like the ones I am talking about) to be manipulated, what is the standard procedure used in FPGAs? Using an embedded processor?
Thanks and Regards,
03-09-2011 08:32 AM
That floating-point requirement and 64 bits sure make things difficult. If you could live with fixed point and something that would fit inside a DSP48, you'd be in better shape (for an FPGA implementation). Those requirements are pushing you towards a software solution. But the 1 ms rate - that's pushing you back to FPGAs...
Sounds like a fun project!
03-09-2011 09:29 AM
Let me fill you in on a few more details of my task. Basically, I am supposed to design hardware that captures an image frame every 36 milliseconds and then processes it completely before the next frame comes in. My module, along with others, implements the "processing", and the nature of this processing is sequential.
I calculated the latency and bandwidth from some crude estimates I made for other parts of my design. I know that sounds wrong, but for a moment, forget the latency and bandwidth figures I gave you. The clock speed I mentioned is also a ballpark, but I suppose it's the best I can do with a floating-point implementation. Just consider the size of the matrices that I am talking about. One more thing to add -- after this inversion operation is completed, I have to do a matrix multiplication as well...if I get a (4 x 48) matrix after inversion, I have to multiply it by a (48 x 1) matrix. Considered alone, I suppose the matrix multiplication is not a big issue; but in the grand scheme of things, it must be considered.
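[Editorial note: one simplification worth illustrating here, since the (4 x 48) pseudo-inverse is immediately multiplied by a (48 x 1) vector: the explicit pseudo-inverse never needs to be formed at all. Assuming full column rank, pinv(A) @ b is the least-squares solution of A x = b, which again comes down to a single 4 x 4 solve. A hedged NumPy sketch (the function name is my own):]

```python
import numpy as np

def pinv_times_vector(A, b):
    """Compute pinv(A) @ b without ever forming pinv(A).

    pinv(A) @ b is the least-squares solution of A x = b. For a tall,
    full-column-rank A (e.g. 48 x 4), this needs only a 4 x 4 solve of
    the normal equations: (A^T A) x = A^T b.
    """
    return np.linalg.solve(A.T @ A, A.T @ b)

rng = np.random.default_rng(1)
A = rng.standard_normal((48, 4))   # matrix to be pseudo-inverted
b = rng.standard_normal(48)        # the (48 x 1) vector it is applied to
x = pinv_times_vector(A, b)        # the (4 x 1) result
```

Folding the multiplication into the solve this way avoids storing the intermediate (4 x 48) result, which may matter for the Block RAM budget.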
After I looked through reference FPGA implementations, the cost and effort of mapping the matrix operations into hardware seemed too great. Looking at the DSP chip, I came up with the following solution --
- Keep the rest of the frame processing logic in hardware
- Put the matrix operations in software and maybe provide some kind of hardware acceleration
I am a newbie, so I am not sure if this is the right way to do things. Do you still think I can do all operations in hardware? And you are more accurate than you know...this is indeed turning out to be a fun project for me :-)
Thanks and Regards,
03-10-2011 01:55 AM
03-10-2011 11:18 AM
Image processing - ok, wasn't expecting that. So, as rcingham said - why floating point? I imagine your pixels are coming in at 8 or 10 bits (per component), and probably exit the same way. I can't imagine what sort of internal image processing would require enough dynamic range to justify floating point. Surely a DSP48 should have enough range to support what you're doing. But then, I don't know your application.
FPGAs are good tools for image processing applications - the processing rates required are too high for all but the fastest/biggest DSPs. FPGAs or GPUs are about all the tools you've got. GPUs may be better for you now, as it sounds like you're still in the "prototype" phase? It's easier/quicker to reconfigure just software.
03-10-2011 09:17 PM
@markcurry, rchingam: Thank you!
My design spec comes from a C++ implementation (written by some very senior engineers in our organization) that runs on a DSP processor. That algorithm uses floating-point variables throughout, for reasons beyond my comprehension - probably because floating-point representation is a lot easier (if slower) to use in a software environment.
I realize that, ideally, I should have started by analyzing the original algorithm, converting the floating-point spec to a fixed-point spec, and then putting it into the FPGA. However, two factors were not in my favor: (1) I have never done this kind of floating- to fixed-point conversion before (remember, I am still a newbie), and (2) I just did not have enough time to begin with. Seeing that Xilinx IP cores support floating-point operations, I began the hardware implementation. Now, I have just about completed everything else, except this matrix pseudo-inversion part.
My current problem is two-fold:
(1) The size of my input matrix is not constant, so I do not know how to implement a pseudo-inversion block (by the way, this will be the Moore-Penrose pseudo-inverse calculation) that can handle it
(2) Even if I somehow manage to convert my floating-point spec to a fixed-point spec, I am still looking at very big matrix sizes. Most of the literature on matrix pseudo-inversion hardware comes from the field of wireless communications, where it is popular in the implementation of MIMO systems. However, even there, the matrix sizes are not as big as those I am supposed to handle.
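[Editorial note: for reference, the textbook Moore-Penrose construction via the SVD handles both of these points: it works for any row count (the column count here stays fixed at 4), and unlike the normal-equations shortcut it also tolerates rank deficiency. A NumPy sketch for illustration (function name and tolerance choice are my own):]

```python
import numpy as np

def moore_penrose_pinv(A, rtol=1e-12):
    """Textbook Moore-Penrose pseudo-inverse via the SVD.

    A+ = V diag(1/s_i) U^T, where singular values below a relative
    tolerance are treated as zero. This also handles rank-deficient
    input, at a higher cost than the normal-equations approach.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cutoff = rtol * s.max()
    s_inv = np.where(s > cutoff, 1.0 / s, 0.0)   # zero out tiny singular values
    return Vt.T @ (s_inv[:, None] * U.T)

# The row count can vary (8 to 256 in this thread); the column count
# stays 4, so the per-column work is fixed and only the accumulations
# over rows change with the matrix height.
for m in (8, 48, 256):
    A = np.random.default_rng(m).standard_normal((m, 4))
    P = moore_penrose_pinv(A)   # shape (4, m)
```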
If either of you knows how to overcome these two issues, please share your ideas with me here.
I appreciate your help and can only say that the next time, I will start by analyzing and converting a floating-point spec to a fixed-point spec for FPGA implementation.
Thanks and Regards,
03-15-2011 04:35 AM
03-18-2011 11:09 AM
You could always outsource the design.
03-23-2011 07:08 AM
04-05-2011 02:01 AM
I have done a few software-to-hardware ports in the computer vision field. After the first one, I came up with the following procedure:
1. Create a test bench that gives you "hard" results for a given algorithm (e.g., for visual feature detectors and descriptors, use images with known homographies relating them, and test how many points your algorithms find and match across both images)
2. Create an early model of your algorithm's FPGA implementation, and identify the critical internal data.
Critical data in this case is data that needs to be stored in Block RAMs (if using line buffers), needs a lot of parallel processing, etc. (any data that would consume many FPGA resources if it had too many bits)
3. For every piece of critical internal data, change your algorithm so that you can specify a maximum accuracy for that data, i.e., the number of bits you want to allow.
4. Measure the impact of changing the bit accuracies on your test scenario and plot the results as curves.
5. Analyse your curves and select an appropriate number of bits for every critical data step!
Then start with the actual FPGA design!
This seems to work out quite well for me!
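[Editorial note: to make steps 3-5 concrete, here is a hedged sketch of such a bit-accuracy sweep. The quantization model, function names, and the 48 x 4 least-squares problem are my own illustrative choices, not anything from this thread:]

```python
import numpy as np

def quantize(x, frac_bits):
    """Round to a fixed-point grid with the given number of fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

# Hypothetical "algorithm" under test: the 4x4 normal-equations solve,
# with the Gram matrix chosen as the critical internal data (step 3).
rng = np.random.default_rng(2)
A = rng.standard_normal((48, 4))
b = rng.standard_normal(48)
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]   # "hard" reference result (step 1)

errors = {}
for frac_bits in range(4, 25, 4):              # sweep the accuracy (step 4)
    G = quantize(A.T @ A, frac_bits)           # quantized critical data
    y = quantize(A.T @ b, frac_bits)
    x_q = np.linalg.solve(G, y)
    errors[frac_bits] = float(np.max(np.abs(x_q - x_ref)))

# Step 5: inspect the error-vs-bits curve and pick the knee.
for bits, err in sorted(errors.items()):
    print(bits, err)
```

In a real port, the error metric would be the algorithm-level test-bench score (matched feature points, in the example above) rather than raw numerical error.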
Can I ask you what kind of algorithm you are porting and what purpose it is going to be used for?
04-24-2012 04:53 AM
That is a tough task for a newbie. In fact, that is a tough task for anyone! The problem you are trying to solve is an active research issue in the FPGA design field. Accelerating the pseudo-inverse computation of matrices in hardware is a very complex and challenging task. I just think they have asked too much of you.