aymanhendawy
Visitor
674 Views
Registered: ‎11-11-2018

How many instruction cycles do Cortex-A53 NEON intrinsic functions take to execute?

Hi, 

I am working on a ZCU102 board (ARM Cortex-A53) running a Linux image built with PetaLinux 2018.2. I wrote some C code using NEON intrinsics, such as the adders and multipliers in the snippet below, but when I profiled the intrinsic calls I got implausible execution times of about 600 cycles each. That doesn't make sense to me, and I don't know whether the problem is in my code or in the default compiler.

Referring to this link, in the section "NEON floating-point instructions timing", my measured cycle counts contradict what is listed there. Also, is there a working link that lists the timing of all NEON instructions for the Cortex-A53?

Compiler options added:

-c -fmessage-length=0 -MT"$@" -mcpu=cortex-a53 -march=armv8-a -mno-lint -mdev-no-llvm -O3 -Ofast

void add32fn(void * a, void * b)
{
	int32x4_t a32, b32, res32;

	a32 = vld1q_s32((int32_t *) a);
	b32 = vld1q_s32((int32_t *) b);

	reset(&prof);
	start(&prof);
	res32 = vaddq_s32(a32, b32);
	stop(&prof);
	volatile uint32_t cycles = avg_cpu_cycles(&prof);
	// cycles = 600

	vst1q_s32((int32_t *)(res), res32);	// res is not declared in this snippet
}
void cmplxfloatmul(void * a, void * b)
{
	float32x4x2_t 	aNeon, bNeon, resNeon;
	float32x4_t 	RXR, RXI;

	aNeon = vld2q_f32((float32_t const *) (a));	// val[0] = real parts, val[1] = imaginary parts
	bNeon = vld2q_f32((float32_t const *) (b));

	reset(&prof);
	start(&prof);
	RXR = vmulq_f32(aNeon.val[0], bNeon.val[0]);			// aR*bR
	stop(&prof);
	volatile uint32_t cycles = avg_cpu_cycles(&prof);
	// cycles are about 600 for this and any of the functions below

	resNeon.val[0] = vfmsq_f32(RXR, aNeon.val[1], bNeon.val[1]);	// aR*bR - aI*bI
	RXI = vmulq_f32(aNeon.val[0], bNeon.val[1]);			// aR*bI
	resNeon.val[1] = vfmaq_f32(RXI, aNeon.val[1], bNeon.val[0]);	// aR*bI + aI*bR

	vst2q_f32((float32_t *)(res + idx), resNeon);	// res and idx are not declared in this snippet
}
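For checking the output of the NEON version, the complex product described in the comments above (real part aR*bR - aI*bI, imaginary part aR*bI + aI*bR) can also be written as a plain scalar loop. This is only a sketch: the name cmplx_mul_ref and the interleaved re, im, re, im layout (the layout that vld2q_f32 de-interleaves) are assumptions, not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference: res[k] = a[k] * b[k] for n complex floats.
 * Assumed layout: re0, im0, re1, im1, ... in all three arrays. */
static void cmplx_mul_ref(const float *a, const float *b, float *res, size_t n)
{
    assert(a != NULL && b != NULL && res != NULL);
    for (size_t k = 0; k < n; ++k) {
        float aR = a[2 * k], aI = a[2 * k + 1];
        float bR = b[2 * k], bI = b[2 * k + 1];
        res[2 * k]     = aR * bR - aI * bI;  /* real part */
        res[2 * k + 1] = aR * bI + aI * bR;  /* imaginary part */
    }
}
```

Running both versions on the same four complex inputs and comparing the eight output floats would show whether the operand order in the intrinsic calls matches the comments.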

 

 

2 Replies
ibaie
Xilinx Employee
585 Views
Registered: ‎10-06-2016

Hi @aymanhendawy ,

I've never tried using NEON intrinsics in a Linux userspace application, so I'm not sure about their performance and would need to do some testing. As for instruction cycle counts, those are really specific to ARM, and I think there is no public document listing them.
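A possible starting point for such testing, as a rough sketch rather than a definitive method: timing a single intrinsic between start and stop is likely to measure mostly the cost of the two timer reads, and at -O3 the compiler may also move the intrinsic out of the timed region. The C sketch below instead times a long chain of dependent adds and subtracts the cost of an empty timer pair. Everything in it is an assumption and not part of the original code: the helper names (now_ns, timer_overhead_ns, time_adds_ns, cycles_per_op), the use of POSIX clock_gettime in place of the poster's prof helpers, the GCC/Clang inline-asm barrier, and the core clock frequency, which the caller has to supply. On non-AArch64 hosts it falls back to a scalar loop so that it still compiles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Monotonic time in nanoseconds (POSIX clock_gettime). */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Smallest observed cost of two back-to-back timer reads. */
static uint64_t timer_overhead_ns(void)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < 1000; ++i) {
        uint64_t t0 = now_ns();
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Time `iters` dependent 4-lane int32 adds: out = a + iters * b.
 * The empty asm statement keeps the compiler from collapsing the loop. */
static uint64_t time_adds_ns(const int32_t a[4], const int32_t b[4],
                             uint32_t iters, int32_t out[4])
{
    uint64_t t0, t1;
#if defined(__aarch64__)
    int32x4_t acc = vld1q_s32(a);
    const int32x4_t vb = vld1q_s32(b);
    t0 = now_ns();
    for (uint32_t i = 0; i < iters; ++i) {
        acc = vaddq_s32(acc, vb);
        __asm__ volatile("" : "+w"(acc) : : "memory");
    }
    t1 = now_ns();
    vst1q_s32(out, acc);
#else
    /* Scalar fallback so the sketch builds on other hosts. */
    int32_t acc[4] = { a[0], a[1], a[2], a[3] };
    t0 = now_ns();
    for (uint32_t i = 0; i < iters; ++i) {
        for (int l = 0; l < 4; ++l)
            acc[l] += b[l];
        __asm__ volatile("" : : "r"(acc) : "memory");
    }
    t1 = now_ns();
    for (int l = 0; l < 4; ++l)
        out[l] = acc[l];
#endif
    return t1 - t0;
}

/* Convert an elapsed time to cycles per operation.
 * cpu_hz is supplied by the caller; it is not measured here. */
static double cycles_per_op(uint64_t elapsed_ns, uint64_t overhead_ns,
                            uint32_t iters, double cpu_hz)
{
    assert(iters > 0);
    if (elapsed_ns <= overhead_ns)
        return 0.0;
    return (double)(elapsed_ns - overhead_ns) * cpu_hz / 1e9 / (double)iters;
}
```

To use it, call time_adds_ns with a large iteration count (a million or more), then pass the result, timer_overhead_ns() and the core clock in Hz to cycles_per_op. Because each add depends on the previous one, the figure approximates the latency of the add plus loop overhead, not its throughput; several independent accumulators would be needed to estimate throughput.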

Regards


Ibai
Don’t forget to reply, kudo, and accept as solution.
aymanhendawy
Visitor
504 Views
Registered: ‎11-11-2018

Are there any updates on this, please?
