I would like to consider a design where a computation core broadcasts data from one SLR to the others, as represented in the following figure:
In the figure, the color of each wire (a broadcast of 30 vectors of 12-bit data) corresponds to the color of the computation core that drives it. The blue diamonds are not used here but represent clock inputs. The green diamond is a clock input that spans the entire design, and the red diamond is the UART input, which also broadcasts data across the entire device. The size of each core's rectangle does not represent the amount of resources used on each SLR (it should be around 60% LUT, 70% BRAM, 40% DSP and 60% URAM per SLR).
The purpose of this design is to achieve high-speed matrix-vector multiplication by spreading the computation cores across the whole device.
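To give a rough sense of how many signals would cross each SLR boundary, here is a back-of-the-envelope sketch. The 30 vectors and 12-bit width come from the description above; everything else is an assumption made only for illustration: one broadcasting core per SLR, SLRs stacked in a line so that a broadcast passes through every SLR between source and destination, and a placeholder count of 3 SLRs (`NUM_SLRS` is hypothetical, not a property of any specific device).

```python
# Back-of-the-envelope count of signals crossing each SLR boundary.
VECTORS_PER_BUS = 30  # from the design: 30 vectors per broadcast
BITS_PER_VECTOR = 12  # from the design: 12-bit data
NUM_SLRS = 3          # ASSUMPTION: placeholder, adjust to the target device


def bus_width_bits(vectors=VECTORS_PER_BUS, bits=BITS_PER_VECTOR):
    """Width in bits of one broadcast bus."""
    return vectors * bits


def crossings_per_boundary(num_slrs=NUM_SLRS):
    """Return (up, down) bus counts for each boundary between adjacent SLRs.

    ASSUMPTION: one broadcasting core per SLR, SLRs in a linear stack, and
    every core broadcasts to all other SLRs.
    """
    result = []
    for boundary in range(num_slrs - 1):
        up = boundary + 1      # cores at or below this boundary send upward
        down = num_slrs - up   # cores above this boundary send downward
        result.append((up, down))
    return result


if __name__ == "__main__":
    width = bus_width_bits()
    print(f"one broadcast bus: {width} bits")
    for i, (up, down) in enumerate(crossings_per_boundary()):
        total = (up + down) * width
        print(f"boundary SLR{i}/SLR{i + 1}: {up} up, {down} down, {total} signals")
```

Under these assumptions every boundary carries one bus per core, so the per-boundary signal count grows linearly with the number of SLRs. The result would then need to be compared against the number of SLLs the target device actually provides between adjacent SLRs.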
My questions are:
* Is it feasible? In other words, should I reconsider the design because the challenge may not be worth it?
* If it is feasible, what would be the best strategy? Should I use an AXI interconnect, would pipelining the data that crosses the SLRs through the SLLs be enough, or is there a better way?