Visitor
Visitor
2,863 Views
Registered: ‎06-22-2017

Compiler and compiler flags for enabling hard floating point in bare metal Zynq UltraScale+

Hello, 

 

I'm using SDK version 2016.4 (I could also use 2017.2) and I want to enable the hard floating-point option in a bare-metal project for the Zynq UltraScale+ MPSoC device.

 

As far as I know, I have to modify the BSP settings, but I'm not sure which compiler to use or which extra flags to add. I have tried different combinations, but I always get errors when I build the project. My questions are:

 

     > Do I have to use aarch64-none-elf-gcc or arm-none-eabi-gcc compiler? Or maybe another one (armv8a-arm-none-eabi for example)?? 

     > If I use the arm-none-eabi-gcc compiler, does it mean that the ARM Cortex A53 is operating in 32-bit mode?

     > Which extra compiler flags should I use?

                   -mcpu=cortex-a53

                   -mfpu=vfpv4?? or fp-armv8? or maybe neon-fp-armv8?

                   -mfloat-abi=hard

        Any more flags?

 

Any suggestion or help will be welcome :) 

 

Thanks in advance,

Ane

                   

 

7 Replies
Xilinx Employee
2,819 Views
Registered: ‎10-06-2016

Hi @airazusta

 

1. aarch64-none-elf-gcc is the right compiler for MPSoC if you intend to build 64-bit code. It is actually the default toolchain when a 64-bit application is generated from SDK for MPSoC devices.

2. Yes, that's right. You can select the arm-none-eabi- toolchain in the new application wizard if you select 32-bit mode.

3. The aarch64 toolchain uses hard floating point by default, so nothing needs to be modified. In fact, if I'm not mistaken, aarch64 currently has no soft-float ABI available (see the ARM docs), so the only options are to use hard FP or to disable FP entirely (which does not allow any FP operations in the code).

 

As a quick example, you can use the helloworld application template and add a few lines with FP variables:

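#include <stdio.h>
#include "platform.h"   /* init_platform() / cleanup_platform() from the template */

int main()
{
    /* volatile keeps the compiler from folding the FP arithmetic away */
    volatile double a, b, c;

    init_platform();

    a = 1.2;
    b = 2.4;
    c = a + b;
    printf("Hello World %lf\n\r", c);

    cleanup_platform();
    return 0;
}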

If you compile it with the default settings and double-click the ELF file (which opens it in a binary editor), you can inspect the disassembly and see that FP instructions are used for the FP operations:

    a = 1.2;
     dc4:	b200e7e0 	mov	x0, #0x3333333333333333    	// #3689348814741910323
     dc8:	f2e7fe60 	movk	x0, #0x3ff3, lsl #48
     dcc:	9e670000 	fmov	d0, x0
     dd0:	fd0017a0 	str	d0, [x29, #40]
    b = 2.4;
     dd4:	b200e7e0 	mov	x0, #0x3333333333333333    	// #3689348814741910323
     dd8:	f2e80060 	movk	x0, #0x4003, lsl #48
     ddc:	9e670000 	fmov	d0, x0
     de0:	fd0013a0 	str	d0, [x29, #32]
    c = a + b;
     de4:	fd4017a1 	ldr	d1, [x29, #40]
     de8:	fd4013a0 	ldr	d0, [x29, #32]
     dec:	1e602820 	fadd	d0, d1, d0
     df0:	fd000fa0 	str	d0, [x29, #24]

Best Regards,

Ibai Erkiaga


Ibai
Don’t forget to reply, kudo, and accept as solution.
Xilinx Employee
2,813 Views
Registered: ‎10-06-2016

Hi @airazusta,

 

Just adding some more information about 32-bit toolchain usage (which is compatible with the Zynq A9 cores), in case it is useful for other forum users looking into floating-point usage.

 

The 32-bit toolchain supports the soft-float ABIs, which can be selected through the -mfloat-abi option as described in the GCC documentation. If you generate a 32-bit target application on the A53, you will see that -mfloat-abi is set to hard in both the BSP settings and the application toolchain settings.

 

[Screenshots: capture 1.JPG, capture 2.JPG]

 

Looking at the ELF output generated from the same example used in the previous post, you can see that FP instructions have been used (hard floating point).

a = 1.2;
     550:	e3032333 	movw	r2, #13107	; 0x3333
     554:	e3432333 	movt	r2, #13107	; 0x3333
     558:	e3033333 	movw	r3, #13107	; 0x3333
     55c:	e3433ff3 	movt	r3, #16371	; 0x3ff3
     560:	e14b20fc 	strd	r2, [fp, #-12]
    b = 2.4;
     564:	e3032333 	movw	r2, #13107	; 0x3333
     568:	e3432333 	movt	r2, #13107	; 0x3333
     56c:	e3033333 	movw	r3, #13107	; 0x3333
     570:	e3443003 	movt	r3, #16387	; 0x4003
     574:	e14b21f4 	strd	r2, [fp, #-20]	; 0xffffffec
    c = a + b;
     578:	ed5b1b03 	vldr	d17, [fp, #-12]
     57c:	ed5b0b05 	vldr	d16, [fp, #-20]	; 0xffffffec
     580:	ee710ba0 	vadd.f64	d16, d17, d16
     584:	ed4b0b07 	vstr	d16, [fp, #-28]	; 0xffffffe4
    printf("Hello World %lf\n\r", c);
     588:	e14b21dc 	ldrd	r2, [fp, #-28]	; 0xffffffe4
     58c:	e30c0e38 	movw	r0, #52792	; 0xce38
     590:	e3400000 	movt	r0, #0
     594:	eb0000fd 	bl	990 <printf>

Changing -mfloat-abi to soft forces the toolchain to use the soft ABI for floating-point operations rather than FP instructions, so the disassembly will look something like this:

 

a = 1.2;
     550:	e3032333 	movw	r2, #13107	; 0x3333
     554:	e3432333 	movt	r2, #13107	; 0x3333
     558:	e3033333 	movw	r3, #13107	; 0x3333
     55c:	e3433ff3 	movt	r3, #16371	; 0x3ff3
     560:	e14b20fc 	strd	r2, [fp, #-12]
    b = 2.4;
     564:	e3032333 	movw	r2, #13107	; 0x3333
     568:	e3432333 	movt	r2, #13107	; 0x3333
     56c:	e3033333 	movw	r3, #13107	; 0x3333
     570:	e3443003 	movt	r3, #16387	; 0x4003
     574:	e14b21f4 	strd	r2, [fp, #-20]	; 0xffffffec
    c = a + b;
     578:	e14b00dc 	ldrd	r0, [fp, #-12]
     57c:	e14b21d4 	ldrd	r2, [fp, #-20]	; 0xffffffec
     580:	eb000077 	bl	764 <__adddf3>
     584:	e1a02000 	mov	r2, r0
     588:	e1a03001 	mov	r3, r1
     58c:	e14b21fc 	strd	r2, [fp, #-28]	; 0xffffffe4
    printf("Hello World %lf\n\r", c);

Regards,

Ibai


Visitor
2,761 Views
Registered: ‎06-22-2017

Hi @ibaie,

 

Thank you for your answer, it was really helpful.

 

However, I have started trying the three possible compilation options (64-bit hard FP, 32-bit hard FP and 32-bit soft FP) with a very simple application, and I got some inconsistent results for the time the FP operations take. My application basically measures the time needed for a single FP addition.

 

#include <stdio.h>
#include "xtime_l.h"   /* XTime, XTime_GetTime() */

int main()
{
      XTime time_init, time_end, time_total;
      float USecsElapsed;
      float a, b, c;

      XTime_GetTime(&time_init);

      /* Timed region: two assignments plus one FP addition */
      a = 1.2245;
      b = 2.436898;
      c = a + b;

      XTime_GetTime(&time_end);

      /* Convert timer counts to microseconds */
      time_total = time_end - time_init;
      USecsElapsed = ((float) (time_total * 1000000) / (float) COUNTS_PER_SECOND);

      printf("Time required= %.3f us\r\n Cycles = %d\r\n", USecsElapsed, (int)time_total);

      return 0;
}

 

I compiled the application in the three possible modes. I attach screenshots of the disassembly to show that each build is as expected:

 

  • 64 bit architecture - Hard FP

          64bit_hard.JPG

 

  • 32 bit architecture - Hard FP

         32bit_hard.JPG

 

  • 32 bit architecture - Soft FP 

         32bit_soft.JPG

And the times I obtained are the following ones:

  • 64bit hard FP --> 0.94 us
  • 32bit hard FP --> 0.32 us
  • 32bit soft FP --> 0.7 us

I also tried multiplying 'a' and 'b' instead of adding them and I got these times:

  • 64bit hard FP --> 0.94 us
  • 32bit hard FP --> 0.32 us
  • 32bit soft FP --> 0.56 us

Maybe I'm mistaken, but shouldn't 64-bit hard FP be as fast as 32-bit hard FP (or even faster), and not the slowest one? In all three cases the XTime_GetTime() function itself takes the same time to execute (0.31 us).

 

Best Regards,

 

Thank you very much :)

Ane

Scholar
2,751 Views
Registered: ‎04-13-2015

Hello @airazusta

 

The hard-fp / soft-fp option defines how floating-point values are passed when calling a function.

That's what ARM calls the EABI.

With soft-fp, the caller puts the floating-point arguments in the integer registers (R#), and the function copies them into the FPU registers for processing.

With hard-fp, the caller puts the floating-point arguments directly in the FPU registers, so the function does not have to copy them before using them.

Your measurements time inline code, so hard-fp versus soft-fp doesn't matter there.

 

 

 

 

Xilinx Employee
2,734 Views
Registered: ‎10-06-2016

Hi @ericv,

You are right in your statements, but if I'm not mistaken, when @airazusta says soft FP that means softfloat: in the code snapshot it is easy to see that the soft-float library functions are being called.

I just realized that my last post was not really precise: I said that setting -mfloat-abi to soft forces the use of the soft-float ABI, when I should have said that it makes the toolchain use library calls.

Ibai
Xilinx Employee
2,729 Views
Registered: ‎10-06-2016

Hi @airazusta,

 

So now this is more of an ARM architecture question than a toolchain one :)

 

I would also expect ARMv8 FP operations to be as fast as ARMv7 FP, so your measurement results are a bit confusing to me as well. That said, I don't think measuring a single operation is really meaningful, since some of the microcode and features of today's processors (branch prediction, pipelining...) can affect the execution time.

 

I made a quick change to your code so that it "measures" only the FP operation rather than the operation plus the variable assignments, and the results make a bit more sense.

 

Regards,

Take care ;)

Ibai


Scholar
2,720 Views
Registered: ‎04-13-2015

Hi @airazusta

 

Your measurements are very likely affected by the cache, i.e. whether the code is already in the cache and whether the section of code spans one or two cache lines.

The most precise way to perform your measurement is to iterate at least twice (the more the better) and skip the time taken by the first iteration.

e.g.:

for (i=0 ; i<N ; i++) {
    if (i == 1) {start timer}
    code to test
}
stop timer

 
