07-30-2013 11:37 AM
I need to do something similar to the following code, and I was thinking of using CoreGenerator for the FPU. However, it only creates a core that operates on two 32-bit vectors, each representing a single FP number. To perform the operation on two arrays of floating-point numbers, should I use a "generate for loop" like the one in my post, or are there better methods or predefined IP/code for a FLOAT_ARRAY?
Note: in this code, both the 1D and 2D arrays are floating point.
for(i=0; (i<100); i=i+1) begin: FPU_unit
for (j = 1; j <= ndelta; j++)
  for (k = 0; k <= nly; k++) begin
    new_dw = ((ETA * delta[j] * ly[k]) + (MOMENTUM * oldw[k][j]));
    w[k][j] += new_dw;
    oldw[k][j] = new_dw;
  end
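For reference, the update in the loop can be modelled in software before deciding how to map it onto hardware. The following is only a minimal Python sketch: the function name `update_weights` is made up for illustration, the loop bounds mirror the post (j from 1 to ndelta, k from 0 to nly), and the array sizes are inferred from the lengths of `delta` and `ly`.

```python
def update_weights(w, oldw, delta, ly, eta, momentum):
    """Apply the momentum weight update from the post, in place.

    w and oldw are indexed [k][j]; delta is indexed [j]; ly is indexed [k].
    Loop bounds mirror the original: j = 1..ndelta, k = 0..nly.
    """
    ndelta = len(delta) - 1  # assumption: delta[0] exists but is unused
    nly = len(ly) - 1
    for j in range(1, ndelta + 1):
        for k in range(0, nly + 1):
            new_dw = eta * delta[j] * ly[k] + momentum * oldw[k][j]
            w[k][j] += new_dw
            oldw[k][j] = new_dw
    return w, oldw
```

Each inner iteration costs three multiplications and two additions, and no iteration reads a value written by another one, so the iterations can be either pipelined through a few shared arithmetic units or spread over several parallel ones.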
07-30-2013 02:01 PM - edited 07-30-2013 02:02 PM
I suggest you spend some quality browser time researching why floating point is particularly slow, and therefore inefficient, in FPGA logic.
After you understand why floating point is so bad for FPGA devices (floating-point operations require floating-point units, which must be built from logic, DFFs, and the hardened fixed-point multiply-accumulators), look at how people solve that problem: they either use fixed point with only the necessary resolution on the multiply-accumulators, or build massively parallel implementations using many of the DSP blocks and BRAMs with multiple floating-point cores (instantiate as many as needed to solve your problem).
You might wish to consider using a Zynq device, as its dual ARM Cortex-A9 processors each include a floating-point unit and a NEON SIMD engine.
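To make the fixed-point option concrete, here is a small software model of a fixed-point multiply-accumulate. It is only a sketch under assumptions: the 12 fractional bits (`FRAC_BITS`) and all function names are chosen for illustration, and the right resolution depends on the value range and accuracy your algorithm actually needs.

```python
FRAC_BITS = 12  # assumed number of fractional bits; choose per accuracy needs


def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize a real value to a signed integer with frac_bits fractional bits."""
    return int(round(x * (1 << frac_bits)))


def to_float(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


def fixed_mul(a, b, frac_bits=FRAC_BITS):
    """Multiply two fixed-point values, then shift right to restore the scale.

    The shift truncates toward negative infinity, like dropping the low
    bits of a two's-complement product.
    """
    return (a * b) >> frac_bits


def fixed_mac(acc, a, b, frac_bits=FRAC_BITS):
    """Multiply-accumulate: acc + a * b, all in the same fixed-point format."""
    return acc + fixed_mul(a, b, frac_bits)
```

For example, `to_float(fixed_mul(to_fixed(0.5), to_fixed(0.25)))` gives 0.125 exactly, while values such as 0.3 pick up a small quantization error that shrinks as `FRAC_BITS` grows. An integer multiply and add of this kind is the sort of operation the hardened multiply-accumulators handle directly.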
07-30-2013 06:33 PM
Thanks for the information. What about using Vivado HLS? Do you think it would be efficient for code that deals with 1D and 2D arrays of floating-point numbers?
"Vivado High-Level Synthesis  is a HLS tool (for FPGAs and ASICs) which accelerates design implementation. It accepts a code written in C/C++ and SystemC and converts it directly to an RTL level to be downloaded on an FPGA."
07-30-2013 11:15 PM
Without knowledge of how an HDL description is implemented as hardware, even the best tool will not be useful.
Let's start from a point you probably already know about.
Take a simple microprocessor system (with an FPU, if you like).
So, if you want to do something complex like a matrix operation with it, what happens?
Would you buy NxM computers to have a CPU/FPU for each element of the matrix?
I doubt it.
No, you will be happy with the one CPU/FPU you have and rely on the system's speed while your software moves the data one element at a time between memory, the CPU and the FPU registers.
It's similar with hardware like FPGAs.
Yes, it is possible to create multiple instances of identical hardware (like arithmetic cores) to some extent, but there are limits.
One is the limited size of the FPGA's logic fabric. So you have to think economically when planning your design.
A compromise between design size, speed and complexity has to be found.
And then it is similar to the CPU-based system:
Feed the data in one by one. (How does the data reach the FPGA anyway?)
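The size/speed compromise can be estimated with simple arithmetic before writing any HDL. The sketch below is an idealized model, not a timing prediction: it assumes each arithmetic unit accepts one operation per clock, and it ignores pipeline fill latency and memory bandwidth. The function name `cycles_needed` is hypothetical.

```python
def cycles_needed(n_elements, n_units, ops_per_element=1):
    """Rough clock-cycle count when n_units identical units share the work.

    Assumes every unit accepts one operation per clock; ignores pipeline
    latency and how fast the data can actually be fetched.
    """
    total_ops = n_elements * ops_per_element
    return -(-total_ops // n_units)  # integer ceiling division


# A 100 x 100 matrix, one operation per element:
print(cycles_needed(100 * 100, 1))  # prints 10000 (one shared unit)
print(cycles_needed(100 * 100, 8))  # prints 1250  (eight parallel units)
```

Going from one unit to eight cuts the cycle count by eight but costs eight times the arithmetic logic, and only helps if the memory can deliver eight operands per clock.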
Another point: just think about the interfacing.
If you assume a 32-bit data width, multiplying that by NxM elements results in an insanely high number of wires (even for quite small Ns and Ms). That is another physical limitation.
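A quick calculation shows how fast the wire count grows if every element is presented in parallel. The helper name `parallel_bus_width` is made up for this illustration.

```python
def parallel_bus_width(n, m, bits_per_element=32):
    """Number of signal wires needed to present all N x M elements at once."""
    return n * m * bits_per_element


print(parallel_bus_width(10, 10))    # prints 3200
print(parallel_bus_width(100, 100))  # prints 320000
```

Even a 10 x 10 matrix would need 3200 parallel signals, which is why the data normally lives in BRAM and is streamed through a narrow port instead.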
Have a nice synthesis