I am currently using a VCU9P device to implement my accelerator design. The design should contain a total of 22 identical calculation cores, and the physical location of each core is constrained by a pblock. If the design contains only one core, placed at any one of the 22 physical locations, the maximum clock frequency reaches 750 MHz. However, when all 22 cores are instantiated, the clock frequency drops to about 600 MHz. Therefore, I have to consider a hierarchical design (HD) flow.

My earlier understanding of this flow was that each sub-module is first synthesized and implemented independently, and the routed output files are then added to the top-level design to form the complete design. After reading UG947, however, I do not understand the flow. According to the Xilinx recommendation, Labs 2 and 3 in UG947 are the HD tutorial for UltraScale devices. But in those labs it seems that only the out-of-context (OOC) synthesis of the sub-module is done independently, while the implementation step is run together with the top-level design.

With this PR design flow, can Vivado implement each partition independently? My question is: why can this design flow meet the high timing requirement?