04-24-2018 08:45 AM
Hello,
I'm using SDK version 2016.4 (I could also use 2017.2) and I want to enable hard floating point option in a bare metal project for the Zynq UltraScale+ MPSoC device.
As far as I know, I have to modify the BSP settings, but I'm not very sure about which compiler to use and which extra flags I have to add. I have tried different combinations but I'm always getting errors when I build the project. My questions are the following:
> Do I have to use aarch64-none-elf-gcc or arm-none-eabi-gcc compiler? Or maybe another one (armv8a-arm-none-eabi for example)??
> If I use the arm-none-eabi-gcc compiler, does it mean that the ARM Cortex A53 is operating in 32-bit mode?
> Which extra compiler flags should I use?
-mcpu=cortex-a53
-mfpu=vfpv4?? or fp-armv8? or maybe neon-fp-armv8?
-mfloat-abi=hard
Any more flags?
Any suggestion or help will be welcome :)
Thanks in advance,
Ane
04-25-2018 02:37 AM
Hi @airazusta
1. aarch64-none-elf-gcc is the right compiler for MPSoC if you intend to use the 64-bit toolchain. That's actually the default toolchain when a 64-bit application is generated from SDK for MPSoC devices.
2. Yes, that's it. Actually you can select the arm-none-eabi- toolchain in the new application wizard if you select 32-bit mode.
3. The aarch64 toolchain uses hard floating point by default, so it's not required to modify anything. Actually, if I'm not wrong, aarch64 currently does not have any soft-float ABI available (see the ARM docs), so the only options are to use hard FP or to disable FP entirely (which does not allow you to use any FP operations in the code).
As a quick example you can use the helloworld application template and add a few lines with FP variables:

int main()
{
    volatile double a, b, c;

    init_platform();

    a = 1.2;
    b = 2.4;
    c = a + b;
    printf("Hello World %lf\n\r", c);

    cleanup_platform();
    return 0;
}
If you compile it with the default settings and double-click on the ELF file (which opens it in a binary editor), you can check the disassembly for the FP operations and see how FP instructions have been used.
a = 1.2;
 dc4:	b200e7e0 	mov	x0, #0x3333333333333333	// #3689348814741910323
 dc8:	f2e7fe60 	movk	x0, #0x3ff3, lsl #48
 dcc:	9e670000 	fmov	d0, x0
 dd0:	fd0017a0 	str	d0, [x29, #40]
b = 2.4;
 dd4:	b200e7e0 	mov	x0, #0x3333333333333333	// #3689348814741910323
 dd8:	f2e80060 	movk	x0, #0x4003, lsl #48
 ddc:	9e670000 	fmov	d0, x0
 de0:	fd0013a0 	str	d0, [x29, #32]
c = a + b;
 de4:	fd4017a1 	ldr	d1, [x29, #40]
 de8:	fd4013a0 	ldr	d0, [x29, #32]
 dec:	1e602820 	fadd	d0, d1, d0
 df0:	fd000fa0 	str	d0, [x29, #24]
Best Regards,
Ibai Erkiaga
04-25-2018 03:21 AM
Hi @airazusta,
So just adding more information about 32-bit toolchain usage (which is also compatible with the Zynq A9 cores) in case it might be useful for other forum users looking into floating point usage.
The 32-bit toolchain supports the soft float ABIs, and this can be configured through the -mfloat-abi option as documented in the GCC documentation. If you generate a 32-bit target application on the A53, you will see that -mfloat-abi is set to hard both in the BSP settings and in the application toolchain settings.
Taking a look at the ELF output generated with the same example used in the previous post, you can see how FP instructions have been used (hard floating point).
a = 1.2;
 550:	e3032333 	movw	r2, #13107	; 0x3333
 554:	e3432333 	movt	r2, #13107	; 0x3333
 558:	e3033333 	movw	r3, #13107	; 0x3333
 55c:	e3433ff3 	movt	r3, #16371	; 0x3ff3
 560:	e14b20fc 	strd	r2, [fp, #-12]
b = 2.4;
 564:	e3032333 	movw	r2, #13107	; 0x3333
 568:	e3432333 	movt	r2, #13107	; 0x3333
 56c:	e3033333 	movw	r3, #13107	; 0x3333
 570:	e3443003 	movt	r3, #16387	; 0x4003
 574:	e14b21f4 	strd	r2, [fp, #-20]	; 0xffffffec
c = a + b;
 578:	ed5b1b03 	vldr	d17, [fp, #-12]
 57c:	ed5b0b05 	vldr	d16, [fp, #-20]	; 0xffffffec
 580:	ee710ba0 	vadd.f64	d16, d17, d16
 584:	ed4b0b07 	vstr	d16, [fp, #-28]	; 0xffffffe4
printf("Hello World %lf\n\r", c);
 588:	e14b21dc 	ldrd	r2, [fp, #-28]	; 0xffffffe4
 58c:	e30c0e38 	movw	r0, #52792	; 0xce38
 590:	e3400000 	movt	r0, #0
 594:	eb0000fd 	bl	990 <printf>
Changing -mfloat-abi to soft will force the toolchain to use soft-float library calls for floating point operations rather than FP instructions, so the disassembly will look something like:
a = 1.2;
 550:	e3032333 	movw	r2, #13107	; 0x3333
 554:	e3432333 	movt	r2, #13107	; 0x3333
 558:	e3033333 	movw	r3, #13107	; 0x3333
 55c:	e3433ff3 	movt	r3, #16371	; 0x3ff3
 560:	e14b20fc 	strd	r2, [fp, #-12]
b = 2.4;
 564:	e3032333 	movw	r2, #13107	; 0x3333
 568:	e3432333 	movt	r2, #13107	; 0x3333
 56c:	e3033333 	movw	r3, #13107	; 0x3333
 570:	e3443003 	movt	r3, #16387	; 0x4003
 574:	e14b21f4 	strd	r2, [fp, #-20]	; 0xffffffec
c = a + b;
 578:	e14b00dc 	ldrd	r0, [fp, #-12]
 57c:	e14b21d4 	ldrd	r2, [fp, #-20]	; 0xffffffec
 580:	eb000077 	bl	764 <__adddf3>
 584:	e1a02000 	mov	r2, r0
 588:	e1a03001 	mov	r3, r1
 58c:	e14b21fc 	strd	r2, [fp, #-28]	; 0xffffffe4
printf("Hello World %lf\n\r", c);
Regards,
Ibai
05-02-2018 07:45 AM
Hi @ibaie,
Thank you for your answer, it was really helpful.
However, I've started trying the three possible compilation options (64-bit hard FP, 32-bit hard FP and 32-bit soft FP) with a very simple application and I got some inconsistencies regarding the time they take for the FP operations. My application basically calculates the time needed to perform an FP addition:
#include <stdio.h>
#include "xtime_l.h"

int main()
{
    XTime time_init, time_end, time_total;
    float USecsElapsed;
    float a, b, c;

    XTime_GetTime(&time_init);
    a = 1.2245;
    b = 2.436898;
    c = a + b;
    XTime_GetTime(&time_end);

    time_total = time_end - time_init;
    USecsElapsed = ((float) (time_total * 1000000) / (float) COUNTS_PER_SECOND);
    printf("Time required= %.3f us\r\n Cycles = %d\r\n", USecsElapsed, (int)time_total);
    return 0;
}
I compiled the application in the three possible modes, I attach screenshots of the disassembly just to see it is done correctly:
And the times I obtained are the following ones:
I also tried multiplying 'a' and 'b' instead of adding them and I got these times:
Maybe I'm mistaken, but shouldn't 64-bit hard FP be as fast as 32-bit hard FP (or even faster), and not the slowest one? In the three cases the time the XTime_GetTime() function takes to execute is the same (0.31 us).
Best Regards,
Eskerrik asko :)
Ane
05-02-2018 05:56 PM
Hello @airazusta
The hard-fp / soft-fp option defines the way floating point values/variables are passed when calling a function.
That's part of what ARM calls the EABI (the procedure call standard).
With soft-fp, the caller puts the floating point arguments in the integer registers (R#) and the function copies these registers into the FPU for processing.
With hard-fp, the caller puts the floating point arguments in the FPU registers and the function does not have to do a copy to use them.
The measurements you are doing show in-line code, so hard-fp or soft-fp doesn't matter.
05-03-2018 03:05 AM
Hi @airazusta,
So now this is more of an ARM architecture question than really a toolchain one :)
I would also expect the ARMv8 FP operation to be as fast as the ARMv7 one, so your measurement results seem a bit confusing to me as well. Nevertheless, measuring single operations is not really meaningful, as some of the microarchitectural features of modern processors (branch prediction, pipelining...) may impact the execution.
That said, I just made a quick change in your code to measure only the FP operation rather than the operation + variable assignment, and that makes a bit more sense.
Regards,
Ondo Izan ;)
Ibai
05-03-2018 09:26 AM
Hi @airazusta
your measurements are most probably affected by the cache; i.e. whether the code is in cache or not, and whether the section of code spans one cache line or two.
The most precise way to perform your measurement is to iterate at least twice (more is better), skipping the time taken by the first iteration.
e.g.:
for (i = 0; i < N; i++) {
    if (i == 1)
        XTime_GetTime(&time_init);  /* skip the first (cold-cache) iteration */
    c = a + b;                      /* code to test */
}
XTime_GetTime(&time_end);