puv97
Adventurer
614 Views
Registered: ‎01-28-2021

Measure PL time execution

Hi !

 

I'm using the ZCU102 and I'd like to execute some HLS code in the PL. I have built the IP block with all the FIFOs but I'd like to measure the time that it takes.

 

How can I do that? I'm using Vivado 2020.1

 

Thanks !

0 Kudos
16 Replies
joancab
Teacher
587 Views
Registered: ‎05-11-2015

HLS tells you the latency in the summary report. Then, depending on how your data is transferred, you need to add the transfer time in and out.

In the worst case, set an ILA up and measure it.
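
To turn the latency from the report into a time, multiply the cycle count by the PL clock period. A minimal sketch; the function name is made up for illustration, and the clock period has to come from your own design:

```cpp
#include <cassert>
#include <cstdint>

// Convert a latency in clock cycles (as listed in the HLS synthesis report)
// into nanoseconds, given the PL clock period in nanoseconds.
// cycles_to_ns is a hypothetical helper, not part of any Xilinx tool or library.
double cycles_to_ns(std::uint64_t latency_cycles, double clock_period_ns) {
    assert(clock_period_ns > 0.0);
    return static_cast<double>(latency_cycles) * clock_period_ns;
}
```

For example, 1000 cycles at an assumed 10 ns period (100 MHz) is 10000 ns. The transfer time in and out still has to be added on top.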

0 Kudos
puv97
Adventurer
576 Views
Registered: ‎01-28-2021

Hi @joancab

Oh thanks! I thought that the latency was not a real measurement.

Regarding the ILA, I'm not sure how to use it: I tried adding it and validating the design, but then nothing happens.

 

Thanks! 

0 Kudos
joancab
Teacher
515 Views
Registered: ‎05-11-2015

Well, latency is the 'good' measure if consecutive processes don't overlap. If, for example, you send new data while the output of the previous run is still being read, then a more appropriate figure could be the II (initiation interval), the time between consecutive inputs coming in or consecutive outputs coming out. II <= latency, and it is strictly less when there is overlap.
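
As a rough illustration of how latency and II combine, here is a sketch of the usual ideal-pipeline estimate: the first result takes `latency` cycles and every further one arrives `ii` cycles later. It assumes no stalls on the input or output side, and the function name is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Cycles needed to process n back-to-back inputs in an ideal pipeline
// without stalls: the first result appears after `latency` cycles, and each
// later result follows `ii` cycles after the previous one.
std::uint64_t pipelined_cycles(std::uint64_t n, std::uint64_t latency, std::uint64_t ii) {
    assert(n > 0 && ii > 0);
    return latency + (n - 1) * ii;
}
```

With a latency of 20 cycles, 100 inputs take 119 cycles at II = 1 but 2000 cycles at II = 20 (no overlap), which is why II matters more than latency once the runs overlap.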

For the ILA, you drop it there, configure it, connect it, synthesize, make your bitstream, and nothing happens, yes. Then you program the FPGA, either over JTAG or from the SD card, and in Vivado (with the project open and JTAG connected) click on Open Hardware Manager. It will connect to the FPGA and open panels showing the signals you set up in the ILA; then you can trigger and capture, almost like a signal analyzer.

joancab_0-1624953749043.png

 

 

The 'plain' ILA is good for discrete signals; the System ILA is better for interfaces like AXI, UART, SPI, etc.

0 Kudos
puv97
Adventurer
504 Views
Registered: ‎01-28-2021

Hi @joancab

I'd like to use the PL to unroll some loops on data that the PS sends using AXI (I don't know if the communication is working well). So in this case, does the latency represent the correct measure? Because with the unroll the execution is parallel.

 

For the ILA, I booted from the SD card with the PetaLinux project created from the XSA without the ILA, and then connected the JTAG to my computer running Vivado. I'll try the project built from the XSA that includes the ILA; maybe that is the problem.

 

Thanks! 

0 Kudos
dpaul24
Scholar
494 Views
Registered: ‎08-07-2014

@puv97 ,

I can describe a concept for the PL logic block that can exactly tell how many clock cycles a task needs.

So there is a logic block at the PL which is doing a "certain thing" at a certain clock rate.

My assumption - You want to measure how fast is this "certain thing" done, in terms of the number of clock cycles for that logic block.

This can be achieved by using two flags (two single-bit signals) and an up-counter of appropriate width. This additional measuring logic needs to be added to your existing PL design. Let's call the flags task_start and task_end. task_start has to be asserted when the logic block starts execution and de-asserted after 1 clock cycle. When the execution of the logic block is finished, task_end has to be asserted for 1 clock cycle. What you have to do is make the counter count up once task_start is asserted and stop it when task_end is asserted. The counter value then gives you the number of clock cycles the logic needs to complete a task.

Now since you have an AXI4 full interface, this counter value needs to be stored in a register that is AXI compliant. Zynq will read out this reg and you know how long it has taken.
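
The counter idea can be sketched as a small behavioral model in C++. This is only a software model of the concept, not the RTL and not HLS code. The struct and its names are made up for illustration, and it assumes the counter restarts from zero on every task_start pulse:

```cpp
#include <cassert>
#include <cstdint>

// Behavioral model of the measuring logic: call tick() once per clock cycle.
// CycleCounter is a hypothetical name used only for this sketch.
struct CycleCounter {
    std::uint32_t count = 0;  // value the PS would read back over AXI
    bool running = false;     // true between the start pulse and the end pulse

    void tick(bool task_start, bool task_end) {
        // Assumption: the two one-cycle pulses never occur in the same cycle.
        assert(!(task_start && task_end));
        if (task_start) {
            count = 0;        // start pulse: clear and begin counting
            running = true;
        } else if (running) {
            ++count;          // one more cycle elapsed since the start pulse
            if (task_end) {
                running = false;  // end pulse: freeze the value
            }
        }
    }
};
```

After the end pulse, `count` holds the number of clock cycles between the start pulse and the end pulse; multiplied by the clock period it gives the execution time.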

------------FPGA enthusiast------------
Consider giving "Kudos" if you like my answer. Please mark my post "Accept as solution" if my answer has solved your problem
Asking for solutions to problems via PM will be ignored.

0 Kudos
dpaul24
Scholar
491 Views
Registered: ‎08-07-2014

I'd like to use the PL to unroll some loops on data that the PS sends using AXI (I don't know if the communication is working well). So in this case, does the latency represent the correct measure? Because with the unroll the execution is parallel.

Latency measurement can also be achieved if you simulate your design. AXI is a handshake-based protocol. You ask for something from an AXI slave, after a certain time you get an answer back, and then the communication cycle is closed. You can basically count the number of clock cycles in the simulation waveform from the time a request is placed until the time the slave closes the response. But you have to keep in mind that this figure also includes the AXI communication latency (communication over the AXI Interconnect, register stages, etc.).
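
If the waveform is exported as one sample per clock cycle, the counting can be automated. The sketch below assumes a single AXI read transaction in the trace and uses the standard AXI read-channel signal names; the struct and function are hypothetical helpers, not part of any simulator API:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sampled clock cycle of an AXI read transaction.
struct AxiSample {
    bool arvalid, arready;       // read address (request) channel
    bool rvalid, rready, rlast;  // read data (response) channel
};

// Returns the number of cycles from the cycle in which the request is first
// placed (ARVALID high) to the cycle in which the last data beat is accepted
// (RVALID, RREADY and RLAST all high). Returns -1 if the trace never completes.
long read_latency_cycles(const std::vector<AxiSample>& trace) {
    assert(!trace.empty());
    long start = -1;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        const AxiSample& s = trace[i];
        if (start < 0 && s.arvalid) {
            start = static_cast<long>(i);
        }
        if (start >= 0 && s.rvalid && s.rready && s.rlast) {
            return static_cast<long>(i) - start;
        }
    }
    return -1;
}
```

The result includes the time spent in the interconnect and any register stages, exactly as noted above.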


0 Kudos
puv97
Adventurer
464 Views
Registered: ‎01-28-2021

Hi @dpaul24 

 

But if I use the HLS synthesis to obtain the latency, is the AXI transfer time included in it? Because it is not a real run: I'm not using the PS in this synthesis, only the test.cpp and the main.cpp to send and process the data.

Sorry for my stupid doubts, I'm very very new to this stuff and I'm a bit lost.

 

Thanks !

0 Kudos
puv97
Adventurer
460 Views
Registered: ‎01-28-2021

Hi !

Maybe this is a stupid question, but is it the same as using a library like chrono to measure the time that the code takes?

Thanks !

0 Kudos
dpaul24
Scholar
459 Views
Registered: ‎08-07-2014

But if I use the HLS synthesis to obtain the latency, is the AXI transfer time included in it?

I did not understand this question. How can someone obtain latency from HLS synthesis?


0 Kudos
dpaul24
Scholar
459 Views
Registered: ‎08-07-2014

Is it the same as using a library like chrono to measure the time that the code takes?

I did not get the use of the word "library" in this sentence.

But yes, it is similar to measuring the number of clock cycles taken by a uP to execute a certain task. You start the chrono at the start of the task and stop it at the end of task execution. Then you look at the chrono value.

Your original Q was - I have built the IP block with all the FIFOs but I'd like to measure the time that it takes.

Putting a timer there and reading out its value will give you a precise time measurement.
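
On the software side, the start/stop idea looks like this with the C++ standard `<chrono>` header. A minimal sketch for timing a task on a CPU as a baseline; the helper name is made up for illustration:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Runs `task` once and returns the elapsed wall-clock time in nanoseconds.
// time_task_ns is a hypothetical helper, not a standard library function.
template <typename Task>
std::int64_t time_task_ns(Task&& task) {
    const auto start = std::chrono::steady_clock::now();  // start of task
    task();
    const auto stop = std::chrono::steady_clock::now();   // end of task
    const std::int64_t ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    assert(ns >= 0);  // steady_clock never goes backwards
    return ns;
}
```

Note that this measures wall-clock time on the processor, so it also includes whatever else the OS does in between. The PL counter, by contrast, counts only the cycles between its two flags.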


0 Kudos
joancab
Teacher
421 Views
Registered: ‎05-11-2015

There is a fundamental difference between timing things in software and in hardware.

In software, all operations queue to be executed by the CPU. Any timing measures exactly each and every operation between the start and the end of the chrono. Defining what to measure and where the start and stop points are in the code is usually straightforward.

In hardware (FPGA), data can flow not only in parallel but also pipelined through different stages. On top of that, an FPGA process typically involves a data transfer commanded by a processor, which takes time and may or may not be included in the measurement. So what to measure, and where and when to start and stop, is most of the time something to be defined by the designer or tester. It's not that it is difficult, it's just subjective. Like 'what's the Earth's diameter?', a simple question with no simple answer. Diameter between the poles? Across the equator? Including the atmosphere? If so, up to what height?

0 Kudos
puv97
Adventurer
417 Views
Registered: ‎01-28-2021

HLS tells you the latency in the summary report. Then, depending on how your data is transferred, you need to add the transfer time in and out.

In the first reply you told me this, so I thought that the summary report was the one obtained from the Vivado HLS synthesis! So, were you referring to the Vivado IP block synthesis?

 

Sorry for the misunderstanding.

0 Kudos
puv97
Adventurer
411 Views
Registered: ‎01-28-2021

Hi !

I'd like to measure the time that the whole PL processing takes (FIFOs and the HLS code), and I'd also like to measure only the time that the HLS code takes in the PL. I'd like these measurements to compare them with an execution on a normal processor.

 

Thanks !

0 Kudos
joancab
Teacher
388 Views
Registered: ‎05-11-2015

@puv97 I meant the HLS synthesis, sorry. Vivado only checks RTL timing, which has nothing to do with execution time.

I get your goal, you want to see how much you can accelerate in hardware. Depending on the HLS interface for data, the transfer time will be included in the latency or II calculation. What are your input and output interfaces? Do you transfer an array that has to be buffered before starting a calculation?

0 Kudos
puv97
Adventurer
367 Views
Registered: ‎01-28-2021

Hi @joancab 

Exactly ! That's the goal !

Well, I have 2 options. The first one uses AXI4-Stream interfaces, but it works with buses, so I think my code is not OK. The second one passes the array that I want to operate on as a parameter of the function; this option does not have communication with the PS.

Thanks!

0 Kudos
Rmccarty
Explorer
342 Views
Registered: ‎09-05-2020

I have done this very thing with an Arty Z7, setting/clearing a 1-bit AXI GPIO connected to one of the digital I/O pins on the Arduino header and measuring the time with an oscilloscope. I started with my routines running in the PS and then migrated the bottlenecks to HLS with AXI-Stream in/out and DMAs connected to the HP slave ports. Setting the pin and then clearing it when the DMA is finished provides a good rough estimate of the PL IP execution time. I found that the total process (streaming data in and out of OCM using all 4 HP slave ports) takes about 10-15% longer than the total stated by the HLS timing report, which is about what I expected.
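
A sketch of that pin-toggling approach, under a few assumptions: `gpio_data` stands for a pointer to the memory-mapped GPIO data register (how you obtain it depends on your setup), bit 0 drives the scope pin, and `work` starts the DMA and waits until it is finished. Both function names are made up for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Raise bit 0 of the GPIO data register, run the work, then lower the bit.
// The pulse width seen on the oscilloscope approximates the execution time.
// gpio_data is a hypothetical pointer to the mapped register.
template <typename Work>
void run_with_marker(volatile std::uint32_t* gpio_data, Work&& work) {
    assert(gpio_data != nullptr);
    *gpio_data = *gpio_data | 1u;   // pin high: start of measured region
    work();                         // e.g. start the DMA and wait for completion
    *gpio_data = *gpio_data & ~1u;  // pin low: end of measured region
}

// Relative overhead of the measured time versus the HLS report estimate,
// in percent. Both times must use the same unit.
double overhead_percent(double measured, double reported) {
    assert(reported > 0.0);
    return 100.0 * (measured - reported) / reported;
}
```

For example, a measured 115 us against a reported 100 us gives 15%, the upper end of the range mentioned above. The GPIO writes themselves go over AXI, so the pulse is a rough estimate rather than an exact cycle count.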

0 Kudos