kjgreenwood (Adventurer)

Performance Degradation from DNNDK v3.0_190430 to DNNDK v3.0_190531

I'm running YOLOv3 with 2 DPU cores and 4 YOLO threads. I compiled using DNNDK v3.0_190430 and got a consistent frame rate between 28 and 29 fps. This is fairly close to the max I could expect since dpuGetTaskProfile() gives me a value of about 67000 usec.

But then I upgraded to DNNDK v3.0_190531 and re-ran DECENT and DNNC. deploy.prototxt and deploy.caffemodel were identical after the run. Only my .elf changed. But my frame rate dropped to 24 fps. dpuGetTaskProfile() gives me only a slightly higher value of around 68000 usec.

I'm pretty sure the increased time is in the DPU execution because no other code has changed. Also, I can increase the number of YOLO threads and I don't see any improvement. When I switch back to my 190430 .elf file, I go right back to 28/29 fps.

I am using dpuSetInputImageWithScale() to set my input image. Is there any possibility that this function's code changed and that its processing time isn't reflected in the dpuGetTaskProfile() output?
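
For reference, the per-frame path in each of my YOLO threads looks roughly like this (a trimmed sketch; the input node name and the mean/scale values are placeholders for my model's actual parameters):

/* Trimmed per-frame sketch. "yolov3_input" and the mean/scale values are
   placeholders for my model's real parameters; task setup/teardown and
   error checks are omitted. */
#include <dnndk/dnndk.h>       // DNNDK v3.0 runtime API
#include <opencv2/opencv.hpp>
#include <cstdio>

void process_frame(DPUTask* task, const cv::Mat& frame) {
    float mean[3] = {0.0f, 0.0f, 0.0f};
    // The call in question -- does its CPU-side cost show up in any profile?
    dpuSetInputImageWithScale(task, "yolov3_input", frame, mean, 1.0f / 255);
    dpuRunTask(task);
    // Device-side execution time in microseconds (~67000 for me on 190430):
    long long dpu_usec = dpuGetTaskProfile(task);
    printf("DPU task time: %lld usec\n", dpu_usec);
}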

linqiang (Xilinx Employee)

Which platform are you using?

For YOLOv3, I tested the performance of two different versions of the Xilinx AI SDK, and the results are as follows.

./test_performance_yolov3_adas_512x256 test_performance_yolov3_adas_512x256.list -t 8 -s 60

 

[Screenshots: yolov3_adas_512x256_190430.jpg, yolov3_adas_512x256_190531.jpg]
kjgreenwood (Adventurer)

I installed for the ZCU102. But I'm running on a custom platform with a ZU9. I've enabled 2 B4096 cores @ 300 MHz.

I converted a trained YOLOv3 from Keras to Caffe before running it through the tools, so there are probably some differences between my implementation and yours (at the up_sample layers in particular). Also, I'm using a 416x416 input, so I would expect some difference.

I'm having some difficulty trying to make sense of your numbers. I assume that DPU_MEAN is the average time it takes for 1 core to finish its execution. If that's the case, then with 3 cores, I can't see how you could possibly get more than about 88 fps, which would require keeping each core continually running at 34 msec per execution.

The 215 fps you have looks like it's just your end-to-end mean multiplied by the number of software threads. My understanding is that the number of SW threads is only chosen to make sure that each DPU core is continually fed with input data. If each of your SW threads is running continually to give you 215 fps, how are you preventing blocking on the DPU resource? Each DPU can only process one command at a time, right?

I see the performance degradation when I test the actual number of fps I'm getting through the system instead of looking at the mean processing time for the DPU. That's what leads me to believe that the difference is coming from the dpuSetInputImageWithScale() function or somewhere else that may not be tracked by dpuGetTaskProfile(). Are you using dpuSetInputImageWithScale() with your model?

kjgreenwood (Adventurer)

Sorry, for clarity: the question at the end of the second-to-last paragraph should be "each DPU CORE can only process one command at a time, right?"

kjgreenwood (Adventurer)

I did some further testing to try to isolate where the increased time is happening. It isn't in the dpuSetInputImageWithScale() function. I tried replacing that with dpuSetInputImage() and got the same results. I also timed the function and got similar numbers with both DNNDK versions.

I then tried timing the call to dpuRunTask(). It was a fairly sloppy timing test with all interrupts still enabled, so there is a lot of fluctuation expected in the timed value. However, the minimum value should be a good indication of the time for the DPU execution.

With DNNDK v3.0_190430, I got a minimum value of 67 msec, which lined up very well with what I got from dpuGetTaskProfile() and with the 28-29 fps I'm seeing from the system with 2 cores enabled.

With DNNDK v3.0_190531, I got a minimum value of 83 msec, which did not line up well at all with the 68 msec I got from dpuGetTaskProfile(). But it does line up really well with the 23-24 fps I'm seeing from the system.

At this point, it really appears that compilation yields very different results between the two versions. I'm not sure how you got your end-to-end timing values, but I'd be very interested in seeing whether you get similar results to what you've posted above if you just measure the number of frames you process over a several-second period.

linqiang (Xilinx Employee)

Multithreading is meant to increase DPU utilization and to overlap the pre- and post-processing with DPU execution.

The DPU running time depends on the amount of computation, the complexity, and the type of the model.

Would you please test it again with the same model? For example, run the 0531 app with the 0430 model.

[Diagram: multithread example.jpg]
kjgreenwood (Adventurer)

I have been running with the same app. The only thing I've been changing between the two runs is the model. That's why I'm convinced there is a difference.

Your diagram confirms how I thought things should be calculated. Based on the numbers you provided and the diagram, it shouldn't be possible for you to get more than 89 fps.

Each core takes about 34 msec to process one network execution. That means each core can run 29.4 executions per second. You have 3 cores. So if you have enough software threads to keep the cores 100% busy all the time, you should only be able to run a maximum of 3 x 29.4 = 88.2 network executions per second. There is no way to get 215 fps.

Increasing the number of threads beyond what it takes to keep the cores busy doesn't add fps. Your example masks this because the number of frames completed in the time in question just happens to be equal to the number of threads. But if you had 9 threads instead of 6, your fps would still be 6*1000/zz.

It's easy to see if you plug your actual numbers into your diagram:

DPU_process = 34 msec

zz = 2*(DPU_process) + pre_process + post_process = 74 msec (this gives your 37 msec e2e)

fps = 6*1000/74 = 81 fps
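
To make the ceiling explicit, here's the back-of-envelope model I'm applying (plain arithmetic, nothing DNNDK-specific):

#include <cstdio>

// Upper bound on throughput: N cores at t msec per execution can never
// exceed N * (1000 / t) fps, no matter how many feeder threads you add.
static double max_fps(int cores, double msec_per_exec) {
    return cores * (1000.0 / msec_per_exec);
}

int main() {
    printf("%.1f\n", max_fps(3, 34.0));  // ~88.2 fps ceiling for your numbers
}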

linqiang (Xilinx Employee)

With 1 thread, the DPU running time is 9.2 ms (high efficiency, but low throughput) and the performance is 84 fps.

With multiple threads, the 3 DPUs are fully utilized to improve throughput, but the efficiency of a single DPU is slightly reduced. One DPU running time is 34.3 ms / 3 = 11.4 ms, and the performance is 215 fps (less than 84*3).

So in your reply, the DPU time should be 11.4 ms, and pre- and post-process time = 37.1 - 34.3 = 2.8 ms.

zz = 2*(DPU process) + pre_process + post_process = 25.6 msec (taking the 37.1 ms e2e as an example)

fps = 6*1000/25.6 = 234.4 fps (This is the ideal value; the actual value is a little lower. Everything fits the theory, and it works in practice, doesn't it?)

 

 

[Diagram: one thread and multithread.jpg]
kjgreenwood (Adventurer)

According to the name of your output and the way it was presented in your diagram, your DPU_MEAN is the average time it takes for a DPU core to process one network execution. It's already been divided by the total number of frames processed, just as you showed in the diagram. So one core's running time is, on average, 34.3 ms - not 11.43 ms. The fact that you have 3 cores doesn't change the running time of a single core. Once you have the average processing time for a single core, you can take the reciprocal to find that each core can process 29.2 fps. Multiplying by your 3 cores gives 87.5 fps for the entire system (assuming no time for pre- and post-processing).

I would also disagree with what you're calling thread performance. In your diagram, the reason for the 6 multiplier in calculating fps was because 6 frames had been processed. It's not because there happened to be 6 threads. If there were 1000 threads, 994 of them would have been idle waiting for a DPU core to free up. You would have processed the same 6 frames and the fps would have remained exactly the same.

But an easy way to check it would be to count the number of frames that you process in 10 seconds. If it's over 2000, your method is confirmed.

linqiang (Xilinx Employee)

We can see the calculation process by looking at the header file, /usr/include/xilinx/benchmark/benchmark.hpp. By modifying it, the number of processed frames is printed as well.

 

static void report(std::ostream* p_out) {
    std::ostream& out = *p_out;
    // FPS is the counted frame total divided by the test duration:
    float fps = ((float)g_total) / ((float)g_num_of_seconds);
    out << "Frames Processed =" << g_total << "\n";
    out << "FPS=" << fps << "\n";
    out << "E2E_MEAN=" << g_e2e_mean << "\n";
    out << "DPU_MEAN=" << g_dpu_mean << "\n";
    out << std::flush;
    return;
}

StatSamples e2eStatSamples(0);
StatSamples dpuStatSamples(0);
// Each worker thread returns its frame count plus its timing samples;
// the per-thread counts are summed and the samples merged before the means:
for (auto& r : results) {
    auto result = r.get();
    total = total + result.ret;
    e2eStatSamples.merge(result.e2eSamples);
    dpuStatSamples.merge(result.dpuSamples);
}

g_e2e_mean = e2eStatSamples.getMean();
g_dpu_mean = dpuStatSamples.getMean();

g_total = total;

 

 

[Screenshot: benchmask_test.jpg]
kjgreenwood (Adventurer)

I stand corrected. If your g_total is counted rather than calculated (it appears to be, based on a quick review of benchmark.hpp), then you're clearly getting 215 fps.

So then it must be your DPU_MEAN that's off. As you showed in your diagram, it's the total DPU processing time divided by the number of frames processed. Assuming the cores are 100% utilized and treating pre- and post-processing times as 0, the total DPU processing time is 30 seconds (10 seconds/core * 3 cores). Your frames processed is 2166.

DPU_MEAN = 30 / 2166 = 0.0139 seconds, or 13.9 msec - rather than the 34.3 msec reported.

So something's still off. I'll have to spend some more time looking through benchmark.hpp to see if I can find where the discrepancy comes in.
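
For reference, the same calculation in code form:

#include <cstdio>

// DPU_MEAN per the diagram's definition: total core-busy time divided by
// the number of frames processed, using the numbers from your 10-second run.
static double expected_dpu_mean_msec(int cores, double seconds, long frames) {
    return cores * seconds * 1000.0 / frames;
}

int main() {
    printf("%.1f\n", expected_dpu_mean_msec(3, 10.0, 2166));  // ~13.9, not 34.3
}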

 

In the meantime - going back to the original issue - would it be possible for you to run a video file with at least 2000 frames through a 190430-compiled version and a 190531-compiled version of YOLOv3 and verify that they take roughly the same time to complete? I'm seeing roughly a 20% slowdown between the two.

kjgreenwood (Adventurer)

After looking at benchmark.hpp a little closer, there is only one way I can make sense of the numbers you're getting. The code would give numbers that make sense together, but only if each thread blocks in the call to model.run() until it has processed a frame. If it instead sleeps when it's blocked on an available DPU core and then loops around for another try, that would explain the numbers. From the numbers, I would guess the sleep time is approximately 40 msec.

With 8 threads you should get nearly 100% utilization out of your DPU cores. For a 10 second run, you'd get 80 total seconds of thread time. 3 cores would be 100% active for 10 seconds. That means 30 of the 80 seconds would be tied up running actual frames through the DPU and would result in about 875 total frames processed. The other 50 seconds would be filled with artificial frame counts every 40 msec resulting in about 1250 erroneous frame counts. That would give a total frame count of 1250 + 875 = 2125 frames.
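
Here's a toy simulation of that hypothesis (the constants all come from the numbers above; this is purely my speculation about the benchmark's internals, not actual SDK code):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothesis: 8 threads, 3 cores, 34 ms per real frame, and a 40 ms
// sleep-and-retry when no core is free, with every loop iteration
// counted as a "frame" either way.
int main() {
    std::atomic<int>  free_cores{3};
    std::atomic<long> counted{0}, real_frames{0};
    std::atomic<bool> stop{false};

    auto worker = [&] {
        while (!stop) {
            int c = free_cores.load();
            if (c > 0 && free_cores.compare_exchange_strong(c, c - 1)) {
                std::this_thread::sleep_for(std::chrono::milliseconds(34));
                ++free_cores;
                ++real_frames;                 // a frame actually ran
            } else {
                std::this_thread::sleep_for(std::chrono::milliseconds(40));
            }                                  // the retry is counted anyway
            ++counted;
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < 8; ++i) pool.emplace_back(worker);
    std::this_thread::sleep_for(std::chrono::seconds(10));
    stop = true;
    for (auto& t : pool) t.join();

    // Expect roughly 875 real frames but ~2100 counted "frames" -- close
    // to the 2166 the benchmark reported.
    printf("real=%ld counted=%ld\n", real_frames.load(), counted.load());
}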

Since the DPU time is just queried from the device, I assume it will return the last frame's processing time whether or not a new frame was processed. Therefore, the DPU average time will still be a good measure. It will just have about 1250 samples that are repeats of other samples, but when averaged over the number of samples, the result will still be good.

The way the E2E time is coded, you'll always end up with the total thread time divided by the total frame count: 80 sec / 2125 is about 37 msec.

If this is correct, then I would expect that using 10 threads for 10 seconds would give a value of about 2625 total frames - which would be a huge jump, given that the total frames processed should already be maxed out with 8 threads.

linqiang (Xilinx Employee)

We tested the performance with thread counts from 1 to 16. When the number of threads increases to a certain level, the performance reaches an upper limit and is difficult to improve further.

I need to correct an error: DPU_MEAN is the DPU running time per thread. When a single thread is running, the DPU time is 9.2 ms, while with two threads it is 10.4 ms. As threads increase and compete for resources, the DPU time becomes longer (due to limits such as DDR bandwidth). So the 6-thread diagram I drew before might be too simple; it's more complicated than that, so let me draw a 2-thread diagram. In addition, the actual results of running 2 threads can also be seen.

As for the phenomenon that the 0531 version is slower than the 0430 version when you test video: I also found a reduction of about 10% in my test. The code in the demo.h header file has changed a lot, and we are still analyzing the details.

[Screenshots: 16 thread-2thread.JPG, 0531.jpg, 0430.jpg]
linqiang (Xilinx Employee)

Here is the 2-thread diagram:

[Diagram: 2 threads.jpg]
kjgreenwood (Adventurer)

Thanks for the update on the slowdown you saw between versions. I'm not using any of the demo software. My software is identical for both versions, and I'm seeing a 20% slowdown.

The diagram you have shown above is exactly how I've been saying it should work. The DPU time is the average time it takes for one core to process one frame. The number of threads should have absolutely no effect on this number.

The E2E time should also be fairly constant until the number of threads increases to the point where DPU cores can be kept busy during pre- and post-processing. At that point, the E2E time should DECREASE with the number of threads until it saturates when all DPU cores are kept busy the entire time. It definitely should not increase with thread count, as the numbers you've given show.

I can see in benchmark.hpp what is being measured for E2E time, and I see why it doesn't match the diagram or the description above. In the code, the sum of the E2E time samples will just be the number of threads multiplied by the duration of the test.

 

inline BenchMarkResult thread_main_for_performance(
    const ImageList* image_list,
    std::unique_ptr<T>&& model) {
    long ret = 0;
    StatSamples e2e_stat_samples(10000);
    StatSamples dpu_stat_samples(10000);
    for (ret = 0; !g_stop; ++ret) {
        xilinx::base::TimeMeasure::getThreadLocalForDpu().reset();
        auto start = std::chrono::steady_clock::now();
        model->run((*image_list)[ret]);
        auto end = std::chrono::steady_clock::now();
        // Everything between successive run() calls lands in (end - start),
        // including any time spent waiting for a free DPU core:
        auto end2endtime = int(std::chrono::duration_cast<std::chrono::microseconds>(end - start).count());
        auto dputime = xilinx::base::TimeMeasure::getThreadLocalForDpu().get();

        e2e_stat_samples.addSample(end2endtime);
        dpu_stat_samples.addSample(dputime);
    }
    return BenchMarkResult{ret, std::move(e2e_stat_samples), std::move(dpu_stat_samples)};
}

So the start time for the next frame will essentially be the end time of the previous frame. No matter how long the thread waits to execute (whether it's waiting for a DPU core or actually doing pre- and post-processing), that entire time is captured by the (end - start) calculation in end2endtime.
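
In other words, the reported E2E mean is fixed entirely by the thread count and the frame count. A quick sanity check with your numbers:

#include <cstdio>

// Every instant of a thread's loop falls inside some (end - start) window,
// so each thread's samples sum to the test duration, giving:
//   E2E_MEAN ~= threads * test_seconds / total_frames
static double predicted_e2e_msec(int threads, double seconds, long frames) {
    return threads * seconds * 1000.0 / frames;
}

int main() {
    printf("%.0f\n", predicted_e2e_msec(8, 10.0, 2160));  // ~37 msec, as reported
}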

The results you provided confirm this:

16 threads for 10 seconds = 160 thread-seconds; divided by 2160 frames = 73 msec.

10 threads for 10 seconds = 100 thread-seconds; divided by 2150 frames = 47 msec.

8 threads for 10 seconds = 80 thread-seconds; divided by 2160 frames = 37 msec.

2 threads for 10 seconds = 20 thread-seconds; divided by 1540 frames = 13 msec.

 

What I can't figure out is how the DPU time calculation could behave the way it does. The real functionality is in the call to xilinx::base::TimeMeasure::getThreadLocalForDpu().get(), which isn't available. I had assumed it was just calling dpuGetTaskProfile(), but that's clearly not the case, since the average DPU processing time somehow INCREASES dramatically with the number of threads - totally contrary to the diagram above. Even when the number of frames processed stays essentially the same, adding threads gives a sizeable increase in the average DPU frame processing time. This makes no sense.

If it's easy to do, I'd be really interested in the value you get from dpuGetTaskProfile() for processing one frame. This value is extremely consistent for me (regardless of the number of threads). It also corresponds very well with the actual fps I'm measuring out of my system - as would be expected if the diagram you provided were correct.

linqiang (Xilinx Employee)

dpuGetTaskProfile() measures the time from the initiation of DPU processing to the end of DPU processing. In multithreaded operation, the DPU mean processing time will include system waiting, kernel resource scheduling, and the real DPU processing time. To some extent, the DPU processing time becomes insufficient to accurately reflect the actual DPU running time when multithreading.

 

To get back to your question: the maximum frame rate for your system running YOLOv3 is only 28/29 fps, which is strange. On the ZCU102, a single thread runs YOLOv3 at a maximum frame rate of 84 fps, and multiple threads reach 215 fps (YOLOv3 model, 5.46 Gops, 512x256).

1. Could you test your program with the SDK model? (YOLOv3 model, 5.46 Gops, 512x256)

2. How does your board's DDR perform compared to the ZCU102's?

3. Can you take the model generated by the 0430 DNNDK (DECENT and DNNC) and re-compile the program with the 0531 SDK? We need to rule out whether the differences come from the models themselves, compiled with different versions of DNNDK, or from the programs themselves.

kjgreenwood (Adventurer)

I make the dpuGetTaskProfile() call immediately after running the model, and I get a very consistent 66 +/- 1 msec. It doesn't wander outside this range at all. I get it when I run with 1, 2, 4, or 6 threads, and I get the same value with either 1 or 2 cores enabled. I haven't enabled 3 cores yet. I added a CPU time measurement around my call to run the model. It fluctuates greatly (as would be expected, because it captures system waits, kernel scheduling, etc.). However, the minimum values always line up with what I get from dpuGetTaskProfile().

When I run with 1 core, I saturate my fps rate with 2 threads and I get about 14.5 fps. When I run with 2 cores, I come close to saturation with 3 threads, but definitely get there with 4 and I get close to 29 fps. I measure the processed frames by individually tagging and counting each frame after it has been processed by one of my YOLO threads. That way I can verify that every frame is processed and each one is handled in-order. And my video output qualitatively looks accurate for the fps rate I'm getting relative to the fps rate of the source video. So I'm fairly confident that my fps measure is correct. Everything I'm seeing says that the numbers I'm getting from dpuGetTaskProfile accurately reflect the time it takes to run the model and that it's what is limiting me to roughly 15 fps per DPU core.

But running the model involves a lot of things, including moving a ton of memory around. At this point, I can't say whether my system is bottleneck-free and optimized to run the DPU efficiently. But I'm extremely confident that, without changing anything except the version of DNNC the model was compiled with, I take a 20% hit in performance.

1. I can look into running the SDK model. I'll see if I can get that running tomorrow.

2. I can look into this also.

3. I can try this too.

 Thank you for looking into this. I know it can be time consuming and my responses are rarely short.

 

linqiang (Xilinx Employee)

Hi,

[Table: yolov3 performance 2DPU.jpg]
kjgreenwood (Adventurer)

My model has the same structure as YOLOV3_VOC, with the possible exception of the 2 up_sample layers, since there are at least two different ways to implement those.

My performance matches the table you provided very well. It also matches the performance of the ZCU102, other than the fact that the ZCU102 gets 41 fps with 3 cores enabled. But that's exactly the rate my analysis shows I should get if I enable the third DPU core and add 2 processing threads.

So I assume that means that you now trust that I'm measuring things correctly and we can focus on the original issue.

kjgreenwood (Adventurer)

Now that it looks like we agree on the performance I'm measuring, and since I'm perfectly happy with the 29 fps I'm getting (I can always enable the third core if I want more), steps 1 and 2 that you recommended above no longer seem necessary.

I thought I knew what you were asking for in step 3, but looking back at it now, I don't think I do.

When you said "re-compile the program by the 0531 SDK," isn't that just PetaLinux 2018.2? That's what I've already compiled with. I don't have the AI SDK installed. I'm currently pushing my space limits in my development VM, and this would be a fairly big time investment. I'd like to know exactly what you're looking for before taking that step - especially if the information can be obtained more easily some other way.

You also said, "we need to rule out whether the differences are due to differences in the models themselves compiled from different versions of DNNDK or in the programs themselves." Which programs are you referring to? I only have one program. I just re-build it to link in the .elf generated by either 190430 or 190531. When I link in the 190430 .elf, I get 29 fps. When I link in the 190531 .elf, I get 24 fps. This would seem to rule out the software. By the way, I think the issue can be isolated to DNNC, since the output of DECENT is identical between the versions.

You should be able to easily reproduce my results. Take your YOLOV3_VOC and run it through both versions of DNNDK. Then build the associated C++ application. When you build it with the 190531 .elf linked in, you should see the same performance hit I do. You'll be able to re-build to switch back and forth between the two. At that point, you'll have everything I have.

You can also look back at the first few posts I made to this thread. They have all of the relevant information I've found.

kjgreenwood (Adventurer)

It would also be good to know what changes went into DNNDK v3.0 190531 (I haven't been able to find any information calling that out). I assume it's only bug fixes, since the minor revision didn't change. If I roll back to DNNDK v3.0 190430, am I likely to run into one of those bugs? If not, I'd rather use 190430 until the reason for the performance hit in 190531 is found.

linqiang (Xilinx Employee)

Thank you for the information. All verified cases indicate that the difference in performance is due to the two different versions of DNNC. Unfortunately, when I used the old and new versions of DNNC to regenerate the YOLOv3 model .elf files and tested them, there was no difference in performance. (Only a slight difference between debug mode and normal mode: 39 fps vs 40.1 fps.)

I asked my colleagues who develop DNNC: there might be differences in the compiler back end, so models with different internal structures can get different generated instructions, which would affect performance. This is possible for user-defined models. If you want to accurately locate the reason, you need to contact the relevant FAE and provide deploy.prototxt and deploy.caffemodel for our further analysis.

Here is my script for generating the .elf files:

~/dnndk_test/xilinx_dnndk_v3.0_190521/host_x86/pkgs/ubuntu16.04/dnnc-dpu1.4.0 --prototxt=deploy.prototxt --caffemodel=deploy.caffemodel --output_dir=. --net_name=yolov3_voc_416 --dpu=4096FA --cpu_arch=arm64 --mode=normal

Here is the runtime driver version:

root@zcu102:~# dexplorer -v
DNNDK version 3.0
Copyright @ 2018-2019 Xilinx Inc. All Rights Reserved.

DExplorer version 1.5
Build Label: Apr 25 2019 09:50:06

DSight version 1.4
Build Label: Apr 25 2019 09:50:06

N2Cube Core library version 2.3
Build Label: Apr 25 2019 09:50:20

DPU Driver version 2.1.0
Build Label: Apr 25 2019 09:50:03

kjgreenwood (Adventurer)

Thank you for looking into it. I'll work through an FAE like you recommended. In the meantime, I'll just roll back to 190430 until I run into something that forces me to upgrade.

kjgreenwood (Adventurer)

I'm just adding this post because I did some further testing and wanted to document my observations in case this becomes an issue for other users and is taken up in the future.

I added a CPU time measurement around my call to dpuRunTask(). I know this measurement is mostly meaningless when there are many threads, so I dropped down to a single thread so that there would never be any blocking while waiting for a DPU core. I got the same behavior from both versions, and the CPU timing lined up very closely with the 66 msec I got from dpuGetTaskProfile().
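
The measurement wrapper is just a steady_clock pair around the launch (a sketch; task setup lives elsewhere in my code):

#include <dnndk/dnndk.h>   // DNNDK v3.0 runtime API
#include <chrono>
#include <cstdio>

// Wall-clock time around dpuRunTask() versus the device-reported time.
// With one thread these agree for me (~66 msec); with two threads on
// 190531 the wall-clock number drifts well above the device number.
long long timed_run(DPUTask* task) {
    auto t0 = std::chrono::steady_clock::now();
    dpuRunTask(task);
    auto t1 = std::chrono::steady_clock::now();
    long long wall_usec =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    long long dpu_usec = dpuGetTaskProfile(task);
    printf("wall=%lld usec, dpu=%lld usec\n", wall_usec, dpu_usec);
    return wall_usec;
}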

Then I added a second thread. Since I have 2 DPU cores enabled, both threads should still be able to execute without being blocked on a DPU core. This was confirmed with the 190430 version, where again I got a solid 66 msec for both the CPU-measured time and dpuGetTaskProfile(). However, the 190531 version behaved very differently: dpuGetTaskProfile() gave a steady 67 msec, but the CPU-measured time increased to well over 70 msec and had a much greater variance (probably +/- 8 msec or so).

Because the DPU processing time stays consistent, it leads me to believe that with 190531 I'm somehow having to wait for DPU core availability when that shouldn't be the case. I haven't yet been able to figure out a way to isolate whether the problem only occurs when trying to execute on the second core, or whether it has to do with trying to execute on either core while the other is running.
