Registered: ‎01-24-2014

How to read the PMU correctly from multiple threads on a dual-core Cortex-A9 (Zynq 702) in userspace

Hi,

 

I have enabled userspace PMU access on both Cortex-A9 cores by building a kernel module. I then follow the standard procedure for PMU counting:

 

1. Disable performance counters

2. Set cycle counter tick rate

3. Reset performance counters

4. Enable performance counters

5. Call function to profile

6. Disable performance counters

7. Read out performance counters

8. Check that performance counters did not overflow
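
A minimal sketch of these eight steps for the cycle counter alone is given here. It assumes the ARMv7-A CP15 PMU register encodings and that userspace access has already been enabled by the kernel module. The names `pmu_profile_cycles` and `busy_work` are hypothetical, and on a non-ARM host the register accessors fall back to a fake register file that pretends 1000 cycles elapse per run, purely so the sketch builds off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PMCR control bits (ARMv7-A PMU). */
#define PMCR_E   (1u << 0)   /* enable all counters */
#define PMCR_P   (1u << 1)   /* reset all event counters */
#define PMCR_C   (1u << 2)   /* reset the cycle counter */
#define PMCR_D   (1u << 3)   /* cycle counter ticks once every 64 cycles */
#define PMU_CCNT (1u << 31)  /* cycle-counter bit in enable/overflow registers */

#if defined(__arm__)
/* Real CP15 accesses. They act only on the core that executes them. */
static inline void pmcr_write(uint32_t v) { __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v)); }
static inline uint32_t pmcr_read(void) { uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v)); return v; }
static inline void cnten_set(uint32_t m) { __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(m)); }
static inline void cnten_clr(uint32_t m) { __asm__ volatile("mcr p15, 0, %0, c9, c12, 2" :: "r"(m)); }
static inline uint32_t ovf_read(void) { uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c12, 3" : "=r"(v)); return v; }
static inline void ovf_clear(uint32_t m) { __asm__ volatile("mcr p15, 0, %0, c9, c12, 3" :: "r"(m)); }
static inline uint32_t ccnt_read(void) { uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); return v; }
#else
/* Hypothetical host stand-in (an assumption, not real hardware behaviour):
 * each enabled interval is counted as exactly 1000 cycles. */
static uint32_t sim_pmcr, sim_en, sim_ovf, sim_ccnt;
static inline void pmcr_write(uint32_t v)
{
    if (v & PMCR_C)
        sim_ccnt = 0;
    sim_pmcr = v & ~(PMCR_P | PMCR_C);   /* P and C read back as zero */
}
static inline uint32_t pmcr_read(void) { return sim_pmcr; }
static inline void cnten_set(uint32_t m) { sim_en |= m; }
static inline void cnten_clr(uint32_t m)
{
    if ((sim_pmcr & PMCR_E) && (sim_en & m & PMU_CCNT))
        sim_ccnt += 1000;                /* pretend time passed while counting */
    sim_en &= ~m;
}
static inline uint32_t ovf_read(void) { return sim_ovf; }
static inline void ovf_clear(uint32_t m) { sim_ovf &= ~m; }
static inline uint32_t ccnt_read(void) { return sim_ccnt; }
#endif

/* Example workload to profile (hypothetical). */
void busy_work(void)
{
    volatile uint32_t sink = 0;
    for (uint32_t i = 0; i < 1000; i++)
        sink += i;
}

/* Profiles fn() with the cycle counter of the calling core.
 * Returns 0 on success, -1 if the cycle counter overflowed. */
int pmu_profile_cycles(void (*fn)(void), uint32_t *cycles)
{
    assert(fn != NULL && cycles != NULL);

    cnten_clr(PMU_CCNT);                               /* 1. disable */
    uint32_t ctrl = (pmcr_read() | PMCR_E) & ~PMCR_D;  /* 2. tick every cycle */
    pmcr_write(ctrl | PMCR_C | PMCR_P);                /* 3. reset counters */
    ovf_clear(PMU_CCNT);                               /*    clear stale overflow flag */
    cnten_set(PMU_CCNT);                               /* 4. enable */
    fn();                                              /* 5. code under test */
    cnten_clr(PMU_CCNT);                               /* 6. disable */
    *cycles = ccnt_read();                             /* 7. read out */
    return (ovf_read() & PMU_CCNT) ? -1 : 0;           /* 8. overflow check */
}
```

Calling `pmu_profile_cycles(busy_work, &cycles)` returns 0 and stores the count in `cycles`. Because these CP15 registers exist once per core, the result describes only the core on which the calling thread executed the sequence.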

 

The program reads the PMU counter values successfully, without overflow. The problem is this:

With a single thread on one core, the cycle counter and the programmable counters give the right numbers (they match the PAPI results). But when two separate threads run on both cores, the cycle counter always returns the same value regardless of the code being profiled, and PAPI shows the same behavior. I am using the pthread library to create the two threads and processor affinity to place one on each core; the platform is a Zynq-7000 SoC.

 

The problem occurs with both direct PMU access and PAPI. Did I miss an important step for reading the PMU correctly when multiple threads run on a dual-core system?

 

Thanks.

 

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 

The example profiling code is like this:

     void *matrix_multply1(void *arg)
     {
       cpu_set_t cpuset;
       cpu_set_t cpuget;

       /* Pin this thread to core 1. */
       CPU_ZERO(&cpuset);
       CPU_SET(1, &cpuset);
       /* pthread_*affinity_np() return 0 on success and an error number on failure. */
       if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
            fprintf(stderr, "set thread affinity failed\n");
       }
       if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
            printf("can not get thread affinity!\n");
       }
       if (CPU_ISSET(1, &cpuget)) {
            printf("i am running on processor 1\n");
       }
          ......
       return NULL;
     }

     void *matrix_multply2(void *arg)
     {
       cpu_set_t cpuset;
       cpu_set_t cpuget;

       /* Pin this thread to core 0, so the two workers run on different cores. */
       CPU_ZERO(&cpuset);
       CPU_SET(0, &cpuset);
       /* pthread_*affinity_np() return 0 on success and an error number on failure. */
       if (pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset) != 0) {
            fprintf(stderr, "set thread affinity failed\n");
       }
       if (pthread_getaffinity_np(pthread_self(), sizeof(cpuget), &cpuget) != 0) {
            printf("can not get thread affinity!\n");
       }
       if (CPU_ISSET(0, &cpuget)) {
            printf("i am running on processor 0\n");
       }
          ......
       return NULL;
     }

     int main(void)
     {
         ......
       /* Disable the cycle counter and all six event counters. */
       disable_ccnt();
       disable_pmn(0);
       disable_pmn(1);
       disable_pmn(2);
       disable_pmn(3);
       disable_pmn(4);
       disable_pmn(5);

       /* Reset the counters. */
       reset_ccnt();
       reset_pmn();

       /* Select the event counted by each programmable counter. */
       pmn_config(0, 0x11);   // total cycles
       pmn_config(1, 0x68);   // total instructions
       pmn_config(2, 0x04);   // data cache accesses
       pmn_config(3, 0x03);   // data cache misses
       pmn_config(4, 0x10);   // mispredicted branches
       pmn_config(5, 0x12);   // predictable branches

       /* Enable the counters. */
       enable_ccnt();
       enable_pmn(0);
       enable_pmn(1);
       enable_pmn(2);
       enable_pmn(3);
       enable_pmn(4);
       enable_pmn(5);

       // motion estimation, dual-threaded
       for (i = 0; i < len; i++)
       {
            pthread_create(&thread1, NULL, (void *(*)(void *))matrix_multply1, (void *)&medata1);
            pthread_create(&thread2, NULL, (void *(*)(void *))matrix_multply2, (void *)&medata2);
            pthread_join(thread1, NULL);
            pthread_join(thread2, NULL);
       }

       /* Disable the counters before reading them out. */
       disable_ccnt();
       disable_pmn(0);
       disable_pmn(1);
       disable_pmn(2);
       disable_pmn(3);
       disable_pmn(4);
       disable_pmn(5);

       time_end  = rdtsc32();
       time_end1 = read_pmn(0);
       time_end2 = read_pmn(1);
       time_end3 = read_pmn(2);
       time_end4 = read_pmn(3);
       time_end5 = read_pmn(4);
       time_end6 = read_pmn(5);

       /* Report the per-iteration average of each counter. */
       printf("cycle=%d\ninstruction=%d\ncache access=%d\n"
              "cache miss=%d\nbranch mispredicted=%d\n"
              "predictable branches=%d\n",
              (time_end1 - time_start1)/len,
              (time_end2 - time_start2)/len,
              (time_end3 - time_start3)/len,
              (time_end4 - time_start4)/len,
              (time_end5 - time_start5)/len,
              (time_end6 - time_start6)/len);
     }

 

For example, when I change the matrix size or the loop length, the cycle count stays the same, although the time measured with clock() changes.

 

(This thread is also posted in the ARM community: http://community.arm.com/message/20183#20183)
