Registered: ‎03-11-2020

Trouble understanding aarch64-none-elf-gcc output for NEON intrinsics on ARM Cortex-A53

Hi,

I am using Xilinx SDK 2019.1 for my application, which runs on an ARM Cortex-A53 processor with NEON and floating-point support available.

My problem is that I cannot understand the disassembly of the NEON intrinsic functions in my code at the highest optimization level, i.e. -O3.

The code is simple. It takes two floating-point arrays of 16 elements each as input, multiplies each 4-element chunk of array A element-wise with the corresponding chunk of array B, and stores the result in array C.

The C version of my code is:

// initialized arrays

float A[16]= {1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0
};

float B[16] = {1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0,
1.0,2.0,3.0,4.0
};
float C[16];

//function definition

#include <arm_neon.h>

void multiply_4x4_neon(float *A, float *B, float *C) {
    // these are the columns of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);

    A0 = vld1q_f32(A);
    B0 = vld1q_f32(B);
    C0 = vmlaq_f32(C0, A0, B0);
    vst1q_f32(C, C0);

    A1 = vld1q_f32(A + 4);
    B1 = vld1q_f32(B + 4);
    C1 = vmlaq_f32(C1, A1, B1);
    vst1q_f32(C + 4, C1);

    A2 = vld1q_f32(A + 8);
    B2 = vld1q_f32(B + 8);
    C2 = vmlaq_f32(C2, A2, B2);
    vst1q_f32(C + 8, C2);

    A3 = vld1q_f32(A + 12);
    B3 = vld1q_f32(B + 12);
    C3 = vmlaq_f32(C3, A3, B3);
    vst1q_f32(C + 12, C3);
}
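
For comparison, here is a minimal loop-based sketch of the same element-wise computation. The name multiply_chunks and the length parameter n are hypothetical, not part of my project. Because the accumulators above start at zero, a plain vmulq_f32 should give numerically equal results to vmlaq_f32 here. The scalar branch is only a fallback so that the file also builds on a host without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: c[i] = a[i] * b[i] for i in [0, n).
   Assumption: n is a multiple of 4 (n = 16 for the arrays above). */
void multiply_chunks(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
#if defined(__ARM_NEON)
        /* One 128-bit load per input, one multiply, one store. */
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(c + i, vmulq_f32(va, vb));
#else
        /* Scalar fallback for hosts without NEON. */
        for (size_t k = 0; k < 4; ++k) {
            c[i + k] = a[i + k] * b[i + k];
        }
#endif
    }
}
```

Calling multiply_chunks(A, B, C, 16) with the arrays defined above should fill C with the same values as multiply_4x4_neon.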

The assembly of the above code at optimization level zero (-O0) and at -O3 is attached in text format.

I do not understand why there are so many load/store and other seemingly redundant instructions at optimization level zero.

The compiler settings are attached. I am not using any compiler option for optimization. I suspect that I need to tell the compiler to use hardware floating-point linkage to avoid this loading and storing of data, for example by passing -mfloat-abi=hard in the compiler's optimization settings, but I cannot set this option because my compiler does not recognize it.

All of my variables are local as well.
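
As far as I can tell, -mfloat-abi belongs to the 32-bit ARM GCC back end, while on AArch64 floating-point and vector arguments are passed in the SIMD/FP registers by default, which may be why aarch64-none-elf-gcc rejects the flag. A minimal sketch for checking what the compiler actually targets, using the predefined macros __aarch64__ and __ARM_NEON (the function names are hypothetical):

```c
#include <stdio.h>

/* Returns 1 when this file is compiled for AArch64, otherwise 0. */
int built_for_aarch64(void) {
#if defined(__aarch64__)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when the compiler reports Advanced SIMD (NEON) support. */
int neon_enabled(void) {
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Prints a one-line summary of the two checks above. */
void print_target_summary(void) {
    printf("aarch64=%d neon=%d\n", built_for_aarch64(), neon_enabled());
}
```

If both values come out as 1 on the target build, the hardware floating-point and NEON registers are already in use, and the extra loads and stores at -O0 would more likely come from the unoptimized build keeping every local variable on the stack.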

For -O3, I cannot follow the disassembly at all.

I do not understand why the function body of the intrinsic vst1q_f32 (float32_t *a, float32x4_t b) appears in the middle of the assembly code.

I know that at the highest optimization level the compiler reorders instructions in some way.

Could someone please help me with these points of confusion?
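
One possible explanation, offered as a guess: GCC's arm_neon.h defines the intrinsics as always-inline functions, so a disassembly with interleaved source would print the header's source lines next to the instructions that were inlined from them, without any real call taking place. The sketch below imitates that pattern in plain C. The names store4 and copy16 are hypothetical stand-ins, not NEON intrinsics.

```c
/* Hypothetical stand-in for an intrinsic wrapper such as vst1q_f32:
   a tiny function that the compiler is forced to inline. */
static inline __attribute__((always_inline))
void store4(float *dst, const float *src) {
    for (int i = 0; i < 4; ++i) {
        dst[i] = src[i];
    }
}

/* Each store4 call is expanded in place, so the generated code contains no
   branch to store4. A source-annotated listing may still show the body of
   store4 at every point where it was inlined. */
void copy16(float *dst, const float *src) {
    store4(dst, src);
    store4(dst + 4, src + 4);
    store4(dst + 8, src + 8);
    store4(dst + 12, src + 12);
}
```

Disassembling copy16 with source interleaving should show the same effect as in my listing: lines from store4 appearing inside the body of the caller.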

 

toolChainSetting.PNG
1 Reply
Registered: ‎03-11-2020

Is there anyone who can help with this question?
