07-27-2021 11:02 AM
I know there have been a number of discussions on this topic, but I was wondering if someone could summarize the current state of affairs. I have a big design in an RFSoC, which is mostly a BD with lots of IP cores. I have a fast machine with 48 CPU cores available, but it seems that Vivado doesn't make good use of them, especially during preparation for OOC runs, whatever this means. I set general.maxThreads to 32 (the maximum accepted) but still see only 4 or 5 CPU cores being used. This phase takes many hours for my design after every small change in the BD.
07-27-2021 12:10 PM
Threads are a software-level division of work that doesn't necessarily translate to cores. The "multi-core adventure" that started when they couldn't turn up the clocks any more didn't work well because of the extra investment needed from software manufacturers, which they didn't make. So that's the sad reality: many cores, only a few in use, and lots of people believing they have something powerful, with just an educated minority spotting the trick.
07-27-2021 12:30 PM
@joancab - Have you just called me uneducated?
07-27-2021 12:32 PM
> The "multi-core adventure" that started when they couldn't turn up the clocks any more didn't work well because of the extra investment needed from software manufacturers, which they didn't make.
To be fair - only certain sets of problems can be helped by parallel threads. (Amdahl's law). It's not just a case of "software manufacturers" (i.e. Xilinx) not doing something. It's just that only rare parts of the build process can actually benefit from using multiple cores. Last I remember, place, route and timing can make use of 8 cores, Synthesis only 2.
If one can figure out a better algorithm to "parallelize" the FPGA build process, you'll be a rich person.
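To put rough numbers on the Amdahl's law point, here is a minimal sketch in plain Tcl (it runs in tclsh or the Vivado Tcl console). The parallel fraction p = 0.5 is purely an assumed figure for illustration, not a measured property of any Vivado phase, and amdahl_speedup is a hypothetical helper name.

```tcl
# Amdahl's law: speedup = 1 / ((1 - p) + p / n)
#   p = fraction of the runtime that can run in parallel (assumed here)
#   n = number of cores working on the parallel part
proc amdahl_speedup {p n} {
    return [expr {1.0 / ((1.0 - $p) + double($p) / $n)}]
}

# Assumed parallel fraction of 50%; substitute your own estimate.
set p 0.5
foreach n {2 8 32 48} {
    puts [format "%2d cores: %.2fx speedup" $n [amdahl_speedup $p $n]]
}
```

With half of the work serial, the speedup can never exceed 1 / (1 - p) = 2x, no matter how many cores are available, which is why adding cores beyond the first handful changes so little.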
07-27-2021 01:27 PM
Totally agree with the above,
The tools are inherently single core,
some bits can use two cores, but that's about it,
An interesting aside,
not done this for a few years, but might be worthwhile,
try a virtual machine,
the last experiments we did,
a 16 core processor could "simulate" the 2 core processor faster than the real processor !
The monitor software we had showed that more cores were being thrashed than when we had no VM, and the results came out quicker,
Then if you use Linux on the VM, things get even quicker,
07-27-2021 01:45 PM
> If one can figure out a better algorithm to "parallelize" the FPGA build process, you'll be a rich person.
I think what I am asking for at the moment doesn't require much magic. As I've said, the phase that prepares the OOC runs should be able to take advantage of at least as many cores as there are separate sub-designs.
07-27-2021 01:49 PM
07-27-2021 03:51 PM - edited 07-27-2021 03:54 PM
You need to increase the max jobs attribute. This defines the maximum number of OOC runs that compile in parallel. Each job can then use up to maxThreads threads.
While loads of cores are great for running the OOC runs in parallel, these aren't run very often (unless you do a fresh rebuild every time). Otherwise, having maxThreads higher than 4 doesn't really give much benefit.
Synthesis doesn't use much more than 1 thread. The only thing that really uses all the threads is the router. Placement generally uses fewer.
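As a minimal sketch of the jobs-versus-threads distinction in project mode: the values 8 and 16 below are assumptions for a machine like the one described in this thread, and synth_1 is simply the default run name, so adjust all of them to your project.

```tcl
# Threads used inside each individual job (32 is the maximum Vivado accepts).
set_param general.maxThreads 8

# Only needed if synth_1 has already been run and must be redone.
reset_run synth_1

# -jobs caps how many runs (including the OOC IP runs that synth_1
# depends on) are launched in parallel.
launch_runs synth_1 -jobs 16
wait_on_run synth_1
```

The point is that -jobs controls how many OOC runs compile side by side, while general.maxThreads only affects the multithreading inside each one.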
07-28-2021 03:56 AM
Re different environments / PCs
Certainly there has been lots of experimentation over the decades,
but it's a moving target,
and actual numbers from yesterday are irrelevant.
It costs nothing but your time,
memory speed / amount of memory, GPU card or built-in GPU, motherboard type and in particular its chipset / bus speed
all have an effect,
"gamers" PC performance rather than "desktop" PCs is what you're looking for.
07-28-2021 05:24 AM - edited 07-28-2021 05:26 AM
In older Vivado versions, there used to be an init.tcl file (or one had to create it, I do not remember) under C:\Xilinx\Vivado\<vivado-version>\scripts
where something like
set_param general.maxThreads 8
could be added.
For recent versions I am not sure how it is or can be done. I did not bother to find out because, as explained by someone else above, it affects only OOC and routing runs (which are not run very often).
07-28-2021 05:37 AM
07-28-2021 08:30 AM
07-28-2021 08:45 AM
> "gamers" PC performance rather than "desktop" PCs is what you're looking for.
That's what I have:
AMD Ryzen Threadripper 3960X 24-core CPU
ASUS ROG STRIX TRX40-E Gaming motherboard
128 GB DDR4 3600 MHz
SSD hard drive
07-28-2021 08:57 AM
> In older Vivado versions, there used to be an init.tcl file (or one had to create it, I do not remember) under C:\Xilinx\Vivado\<vivado-version>\scripts --@dpaul24
Nice tip. BTW, it's now called Vivado_init.tcl. Same location (on Windows).
There are other tricks to speed up builds (e.g., HD) but I find them more useful in non-embedded design flows.
BTW, PetaLinux tools are much better if you have a lot of cores.
When a Vivado build from scratch takes more than 2-4 hours, that is when I start employing other tactics. A 45 minute build is not worth it.
I just dug up some stuff I found helpful on RFSoC builds, that may or may not help @mmatusov (specifically, look into create_ip_run and generate_target commands...which you may need to refine and more selectively target):
set MY_IP_1 whatever_it_is1
set MY_IP_2 whatever_it_is2
set GENERATED_BD_FILE whatever_it_is3 ; # often design_1.bd, may need full path
set NUM_THREADS 8 ; # placeholder value, not in the original post (Vivado accepts at most 32)
set NUM_JOBS 4 ; # placeholder value, not in the original post

# Create OOC synthesis runs without starting OOC synthesis
# in order to set properties of some of the OOC runs
create_ip_run [get_files -of_objects [get_fileset sources_1] $GENERATED_BD_FILE]

# Turn on retiming and performance optimization during OOC synth for these two high performance IPs
set_property strategy Flow_PerfOptimized_high [get_runs $MY_IP_1]
set_property STEPS.SYNTH_DESIGN.ARGS.RETIMING true [get_runs $MY_IP_1]
set_property strategy Flow_PerfOptimized_high [get_runs $MY_IP_2]
set_property STEPS.SYNTH_DESIGN.ARGS.RETIMING true [get_runs $MY_IP_2]

generate_target all [get_files $GENERATED_BD_FILE]

set_param general.maxThreads $NUM_THREADS
launch_runs synth_1 -jobs $NUM_JOBS
wait_on_run synth_1
There are other issues (not shown) I had to solve to get reasonable builds, such as applying placement properties via tcl script for a tcl script determined number of primitives (DSP48s). However, most projects won't need that level of optimization. (I needed it because I used up to 100% of the DSP48s at max frequency, and without the attributes, Vivado implementation takes a long time finding an optimal solution to place 100% of DSP48s).
There may be syntax errors above, I had to do some editing to remove project identifying information.
07-28-2021 09:49 AM
For reference, big builds can easily take overnight.
07-28-2021 10:34 AM
> For reference, big builds can easily take overnight.
Your builds sound far less than optimal. Could be:
1. Poorly written constraints.
2. Too small a part for design.
3. Project mode.
For reference, you want to target 2-4 builds per day or have lots of room for milestone slips...meaning, you're working for government.
07-28-2021 11:27 AM
> Your builds sound far less than optimal.
Not in my experience. We regularly have 6-8 hour jobs (xcvu7p-flva2104). For the bigger multi-die parts it can easily go longer.
This is just an unfortunate side effect of these larger, deeply embedded parts. Is it ideal? No way! Can it be better? I hope so! But for these designs, getting only 1 spin a day (or two if I'm lucky, get up early for the first spin, and stay up late for the second) is just the way things are.
07-28-2021 11:42 AM
@maps-mpls - Thank you for the tips! I must be doing something wrong, probably at multiple levels, but my worst nightmare right now is how long it takes to actually start synthesis after a small change to the BD design, such as disabling TREADY on a few AXIS_SWITCH blocks. I am doing this in project mode, and I believe Vivado is essentially doing BD validation after I launch a synthesis run. I don't know if the sheer size of my BD design is the problem, but it takes hours before the few required OOC runs actually start. After that it is not too bad. Implementation takes about 6.5 hours, and that's where my multi-core machine really shines, as I can run multiple strategies in parallel in the same amount of time.
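For anyone wanting to script the "multiple strategies in parallel" part, a rough sketch might look like the following. It assumes an open project with a completed synth_1 and an existing impl_1; the run names impl_2 onward and the three strategy names are only examples, so check which strategies your Vivado version offers.

```tcl
# Reuse the implementation flow of the existing impl_1 run so the
# version-specific flow name does not have to be hard-coded.
set flow [get_property FLOW [get_runs impl_1]]

# Example strategies (assumed choice, not a recommendation).
set strategies {Performance_Explore Congestion_SpreadLogic_high Area_Explore}

set runs {}
set i 2
foreach s $strategies {
    set name impl_$i
    create_run $name -parent_run synth_1 -flow $flow -strategy $s
    lappend runs $name
    incr i
}

# Threads per run (assumed value), then one job per strategy.
set_param general.maxThreads 8
launch_runs $runs -jobs [llength $runs]

# Block until every strategy has finished.
foreach r $runs { wait_on_run $r }
```

Each implementation run is an independent job, so this is one place where a machine with many cores pays off directly.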
07-28-2021 12:38 PM
That's not entirely fair.
1. 2019.2 has a known issue processing constraints. Moving to 2020.2 chopped an hour off our builds (and looking through the logs, a LOT of this time is in constraint processing).
2. This is usually not a variable but a constant. It's pretty normal to add features over time.
3. I don't know about the difference. But I really don't want to handle the threading for building 100 OOC Xilinx IPs in non-project mode (launch_runs synth_1 handles it for you in project mode).
Our builds are currently 4 hours, and this has been a fair benchmark everywhere I've worked. A previous job required me to run 40 builds a night at 8 hours per build (we had plenty of horsepower) to get maybe 2-3 builds that met timing. And that had plenty of area constraints. I never want to have to do that again.
07-28-2021 06:27 PM
I did not mean to be unfair.
4 hour build is okay for 2 builds / long day during debug.