10-18-2019 10:22 AM
Hi,
I have a question regarding scheduling of two tasks on two dpu cores of ZCU104 board.
For example, I have tasks A and B, which each take about 60 ms to run inference.
Right now I have found that if I first set the input node images for the tasks and then launch two threads to call dpuRunTask, the two tasks go to core0 and core1 with no problem.
However, if I move setting the input node images into the threads, then, perhaps because of the time difference in preprocessing (resize?), the two tasks tend to run on only one core, which causes a significant delay for the latter task.
My question is whether it is possible to pin a task to a specific DPU core, or whether there is any other trick I could try to avoid this scheduling latency.
Thanks
10-20-2019 06:25 PM
Hi jiansheng@baidu.com ,
I agree with you that you should avoid combining the pre-processing and the DPU run task in the same thread.
I would suggest referring to our DNNDK examples, e.g. this face detection example: https://github.com/Xilinx/Edge-AI-Platform-Tutorials/blob/3.1/docs/DPU-Integration/reference-files/files/face_detection/face_detection.cc
It separates the whole flow into 3 steps:
1. Reader thread: reads images from the camera and puts them into the input queue.
2. Worker thread: each worker thread repeats the following 3 steps until there are no more images: (1) get an image from the input queue; (2) process it using the DenseBox model; (3) put the processed image into the display queue.
3. Display thread: gets the output image from queueShow and displays it.
And connect them with queues.
It would be more efficient and have lower latency compared with mixing everything together.
Hope this can help.
10-21-2019 01:56 PM
Hi Jason,
Thanks for your reply. I just checked the face_detection code you referred to. One difficulty is that our application is not really throughput-oriented like your example; we want more deterministic behavior.
As the picture below shows, assuming a camera provides an image every 100 ms, we want the high-priority tasks (0 and 1) to run on the two cores ASAP (using the same image), followed by some lower-priority light models.
And just to clarify: in previous posts, by preprocessing I simply mean resizing the image to fit the model's input size; the models' input sizes are different.
Do you have any further suggestions for tackling our application?
Thanks,
Jian
10-21-2019 08:16 PM
Hi jiansheng@baidu.com ,
Yes, I agree that you may need to do more modification here to suit your design.
I think you can do some code profiling to check which parts of the code take a heavy CPU load (time). And I would suggest NOT putting this CPU-heavy code and the DPU task into the same thread, so that the CPU and DPU can work in parallel.
In my opinion, if core0 is working on a calculation and you successfully send a new DPU task, it should go to core1.
10-23-2019 01:53 PM
Hi Jason,
You're right. I added a barrier on the preprocessing and can see the tasks scheduled on the two cores at the same time.
Thanks
10-23-2019 05:48 PM
Hi jiansheng@baidu.com ,
Good to know that and thanks for sharing your test experience. :-)