Visitor
Registered: ‎01-16-2020

RPU0 slower than RPU1 when running Matrix Multiplication Demo (UG1186 v2018.3)

Problem: When running the Matrix Multiplication Demo, RPU0 is ~5x slower than RPU1.

Question: Is RPU0 expected to be slower than RPU1? If not, is there a way to increase the speed of RPU0 to match RPU1?

Logs:

RPU0 

 # ./mat_mul_demo -d /dev/rpmsg0 -n 10000
 Matrix multiplication demo start 
 Open rpmsg dev /dev/rpmsg0! 
 Creating ui_thread and compute_thread ... 
 Quitting application .. 
 Matrix multiplication demo end 
Matrix Multiplication
	Rounds: 10000
	Duration: 2921ms
 Quitting application .. 
 Matrix multiply application end 

RPU1 (results)

# ./mat_mul_demo -d /dev/rpmsg1 -n 10000
 Matrix multiplication demo start 
 Open rpmsg dev /dev/rpmsg1! 
 Creating ui_thread and compute_thread ... 
 Quitting application .. 
 Matrix multiplication demo end 
Matrix Multiplication
	Rounds: 10000
	Duration: 585ms
 Quitting application .. 
 Matrix multiply application end 
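
The "~5x" figure follows directly from the two durations reported above. A minimal sketch of that arithmetic in C (the function names are illustrative and are not part of the demo):

```c
#include <stdio.h>

/* Average time per round in microseconds, given a total duration in ms. */
double us_per_round(double duration_ms, unsigned rounds)
{
    return duration_ms * 1000.0 / (double)rounds;
}

/* How many times longer the slow run took compared with the fast run. */
double slowdown(double slow_ms, double fast_ms)
{
    return slow_ms / fast_ms;
}

/* Print the comparison for the two runs logged above (10000 rounds each). */
void print_comparison(void)
{
    printf("RPU0: %.1f us/round\n", us_per_round(2921.0, 10000));
    printf("RPU1: %.1f us/round\n", us_per_round(585.0, 10000));
    printf("RPU0 / RPU1: %.2fx\n", slowdown(2921.0, 585.0));
}
```

With the logged values this works out to roughly 292 us per round on RPU0 against roughly 58.5 us per round on RPU1.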

 

Details:

We modified mat_mul_demo.c to measure the duration of the run. We also removed the printfs and mutexes.

* Original mat_mul_demo.c - https://github.com/Xilinx/meta-openamp/blob/rel-v2018.3/recipes-openamp/rpmsg-examples/rpmsg-mat-mul/mat_mul_demo.c

* Modified mat_mul_demo.c - attached
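
For readers without the attachment, the kind of duration check described above could look roughly like the sketch below. This is not the attached file: `time_rounds`, `round_fn` and `count_round` are hypothetical names, and `count_round` is only a stand-in for one send/receive round over the rpmsg device. It assumes a POSIX system with `clock_gettime`.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC readings. */
int64_t elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
    return ns / 1000000;
}

/* Hypothetical callback type: performs one round of work. */
typedef void (*round_fn)(void *ctx);

/* Run `rounds` rounds back to back and return the wall-clock time in ms. */
int64_t time_rounds(round_fn run_round, void *ctx, unsigned rounds)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned i = 0; i < rounds; i++)
        run_round(ctx);
    clock_gettime(CLOCK_MONOTONIC, &end);

    return elapsed_ms(&start, &end);
}

/* Stand-in workload: only counts calls. In the demo this would be the
 * write of the two matrices and the read of the result (assumption). */
static void count_round(void *ctx)
{
    ++*(unsigned *)ctx;
}
```

A call such as `time_rounds(count_round, &counter, 10000)` returns the total wall-clock time in milliseconds. Measured this way on the Linux side, the figure still includes the rpmsg round trip as well as the compute time on the R5.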

Baremetal firmware running on RPU0 and RPU1:

https://github.com/Xilinx/embeddedsw/tree/release-2018.3/lib/sw_apps/openamp_matrix_multiply

 

Environment: SDK 2018.3, ZCU102

Setup: ZynqMP Linux Master running on APU with RPMsg in kernel space and 2 RPU slaves

References followed:

* https://xilinx-wiki.atlassian.net/wiki/spaces/A/pages/18841990/OpenAMP+2018.2

* https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug1186-zynq-openamp-gsg.pdf

 

Thank you in advance!

3 Replies
Moderator
Registered: ‎05-10-2017

Re: RPU0 slower than RPU1 when running Matrix Multiplication Demo (UG1186 v2018.3)

I didn't do 10000 runs, but for about 5500 I found the results to be pretty comparable for r5-0 and r5-1.

r5-0

real 91m33.500s
user 0m0.814s
sys 0m1.715s

r5-1

real 91m34.505s
user 0m0.791s
sys 0m0.672s

Did you modify the RING_TX, RING_RX and RSC_RPROC_MEM entries in rsc_table.c for RPU-1? The values need to be changed from the defaults for R5-1. Also, the memory nodes in the linker script should match what you have in the device tree.
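
One way to sanity-check the point above is to list the address ranges used by each R5 and confirm that none of them overlap. The sketch below is a generic range check only: `struct mem_region` and both functions are hypothetical helpers, and any addresses passed in must come from your own rsc_table.c, linker scripts and device tree (none are assumed here).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One address range, e.g. a vring or a firmware memory region. */
struct mem_region {
    const char *name;
    uint64_t base;
    uint64_t size;
};

/* Half-open ranges [base, base + size) overlap when each one starts
 * before the other one ends. */
bool regions_overlap(const struct mem_region *a, const struct mem_region *b)
{
    return a->base < b->base + b->size && b->base < a->base + a->size;
}

/* Count the overlapping pairs in a list of regions. */
size_t count_overlaps(const struct mem_region *r, size_t n)
{
    size_t hits = 0;

    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (regions_overlap(&r[i], &r[j]))
                hits++;
    return hits;
}
```

A result of 0 from `count_overlaps` means no two listed regions share an address. It does not say anything about whether the regions match the device tree, which still has to be compared by hand.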

 

 

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Observer
Registered: ‎07-02-2019

Re: RPU0 slower than RPU1 when running Matrix Multiplication Demo (UG1186 v2018.3)

Hello @jovitac ,

Thank you for checking our query.

We have several clarifications.


@jovitac wrote:

I didn't do 10000 runs, but for about 5500 I found the results to be pretty comparable for r5-0 and r5-1.

r5-0

real 91m33.500s
user 0m0.814s
sys 0m1.715s

r5-1

real 91m34.505s
user 0m0.791s
sys 0m0.672s

 

We would like to ask how you obtained the values above.
Were the commands executed like this?

 

[r5-0] time mat_mul_demo -n 5500
[r5-1] time mat_mul_demo -d /dev/rpmsg1 -n 5500

 

 

 

If so, we would like to confirm whether this measures the R5's processing time.

Our understanding is that the `time` tool measures the resource usage of the master application on A53 side, not the remote app on R5.
Is this understanding correct?
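
To illustrate the distinction we are asking about: on Linux, a process that is blocked waiting accumulates wall-clock ("real") time but almost no CPU ("user"/"sys") time. The sketch below shows this with a plain sleep standing in for the wait on the remote. `measure_blocked_wait` and `struct split_time` are hypothetical names, and this does not reproduce the measurement in the previous reply.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Seconds between two timespec readings of the same clock. */
double timespec_diff_s(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec)
         + (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Wall-clock time and process CPU time for one measured interval. */
struct split_time {
    double wall_s;
    double cpu_s;
};

/* Sleep for sleep_ms (a stand-in for blocking on the remote core) and
 * report how much wall-clock and CPU time the calling process used. */
struct split_time measure_blocked_wait(long sleep_ms)
{
    struct timespec w0, w1, c0, c1, delay;
    struct split_time out;

    delay.tv_sec = sleep_ms / 1000;
    delay.tv_nsec = (sleep_ms % 1000) * 1000000L;

    clock_gettime(CLOCK_MONOTONIC, &w0);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
    nanosleep(&delay, NULL);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
    clock_gettime(CLOCK_MONOTONIC, &w1);

    out.wall_s = timespec_diff_s(&w0, &w1);
    out.cpu_s = timespec_diff_s(&c0, &c1);
    return out;
}
```

For a 100 ms sleep, `wall_s` comes out at about 0.1 while `cpu_s` stays close to zero, which is the same pattern as a large `real` value next to small `user` and `sys` values.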

Our goal was to compare R5-0's performance against R5-1's.
So we modified the demo code slightly to remove the wait time from the A53 master app and to isolate the processing time on the R5 cores. We mentioned the following in the first query:

We modified mat_mul_demo.c to measure the duration of the run. We also removed the printfs and mutexes.

* Original mat_mul_demo.c - https://github.com/Xilinx/meta-openamp/blob/rel-v2018.3/recipes-openamp/rpmsg-examples/rpmsg-mat-mul/mat_mul_demo.c
* Modified mat_mul_demo.c - attached

 

Can you tell us if there is a better way to get the processing time in R5?

 

@jovitac wrote:

Did you modify the RING_TX, RING_RX and RSC_RPROC_MEM entries in rsc_table.c for RPU-1? The values need to be changed from the defaults for R5-1. Also, the memory nodes in the linker script should match what you have in the device tree.


Yes, we have already modified the addresses above. We can run the R5-0 and R5-1 firmware concurrently, so I think address conflicts are not the issue.

Thank you again for your time.

Moderator
Registered: ‎05-10-2017

Re: RPU0 slower than RPU1 when running Matrix Multiplication Demo (UG1186 v2018.3)

Yes, I did use `time`. Let me check whether we have a different way to measure the R5 processing time. I'll also try running with the modifications you made.
