Copying my arrays into internal BRAMs before starting computation

I'm trying to implement an inference engine on an FPGA. The processing elements (PEs) need to perform sparse matrix-vector multiplication (SpMV) on different sparse matrices. A control unit is in charge of distributing these matrices to the PEs. Each PE should receive its matrix and store it in an internal BRAM.

1. How should I copy all the matrices into the BRAMs before starting the computation? Should I use memcpy?
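To make the first question concrete, a memcpy-based load phase might look roughly like the sketch below. It assumes an HLS-style C++ flow where a fixed-size local array ends up in BRAM. The names `PE`, `load_pe`, `load_all`, `NUM_PES`, and `MAX_NNZ`, as well as the flat values/column-index layout with per-PE offsets, are hypothetical and only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical sizes, chosen only for illustration.
constexpr std::size_t NUM_PES = 4;
constexpr std::size_t MAX_NNZ = 256;  // capacity of each PE's local buffers

// Hypothetical per-PE storage. In an HLS flow, fixed-size arrays like these
// would typically be the ones mapped to BRAM.
struct PE {
    float       values[MAX_NNZ];   // nonzero values of this PE's matrix
    int         col_idx[MAX_NNZ];  // column index of each nonzero
    std::size_t nnz;               // number of valid entries
};

// Copy one PE's nonzeros from external memory into its local buffers.
void load_pe(PE &pe, const float *ext_values, const int *ext_cols,
             std::size_t nnz) {
    assert(nnz <= MAX_NNZ);  // the matrix must fit in the local buffers
    std::memcpy(pe.values, ext_values, nnz * sizeof(float));
    std::memcpy(pe.col_idx, ext_cols, nnz * sizeof(int));
    pe.nnz = nnz;
}

// Load phase run by the control unit: PE p receives the entries in the
// range [offsets[p], offsets[p + 1]) of the flat external arrays.
void load_all(PE pes[NUM_PES], const float *ext_values, const int *ext_cols,
              const std::size_t offsets[NUM_PES + 1]) {
    for (std::size_t p = 0; p < NUM_PES; ++p) {
        const std::size_t begin = offsets[p];
        const std::size_t count = offsets[p + 1] - begin;
        load_pe(pes[p], ext_values + begin, ext_cols + begin, count);
    }
}
```

In this sketch the computation would start only after `load_all` returns, so every PE already holds its matrix locally.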

2. What kind of interface between the control unit and the PEs would work best?

