by_sauka
Visitor
3,781 Views
Registered: ‎12-20-2016

AXI4-Lite register read is too slow

Hi everyone,

 

I'm using an AXI4-Lite peripheral generated from the Vivado template, connected to the GP0 port. I want to have a register whose value I read from Linux (I never write to it from the PS side).

 

The program logic is as follows:

- part of the HDL design writes data into a memory buffer, then writes the size of the data to register slave_reg0 of the AXI4-Lite peripheral; after this it generates an IRQ for the processor.

- in the interrupt service routine in Linux I read the value of slave_reg0 using the ioread32() function

- then I configure the AXI DMA according to the size read from slave_reg0 to transfer the data from the memory buffer to DDR.

 

All logic in PL is running at 200MHz.

 

My problem is that the HDL design generates data every 1 ms, yet the ioread32() call alone takes almost 2 ms!

According to this thread, AXI-Lite should be much faster.

Could someone advise where the problem might be?
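To put these numbers in perspective, here is a rough back-of-the-envelope sketch. The helper name `cycles_in` is made up for illustration; the only inputs are the 200 MHz PL clock and the 1 ms / 2 ms figures above.

```c
#include <stdint.h>

/* Number of whole clock cycles that fit into a time interval.
 * freq_hz:     clock frequency in Hz
 * interval_us: length of the interval in microseconds */
uint64_t cycles_in(uint64_t freq_hz, uint64_t interval_us)
{
    return freq_hz * interval_us / 1000000u;
}
```

With a 200 MHz clock, cycles_in(200000000, 2000) gives 400,000 PL cycles for the 2 ms read, and the 1 ms data period is 200,000 cycles. As far as I understand, a single AXI4-Lite read normally needs far fewer cycles than that, so the delay probably comes from somewhere other than the bus transfer itself.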

 

Pavel

 

 

 

 

6 Replies
by_sauka
Visitor
3,772 Views
Registered: ‎12-20-2016

Below is the VHDL code for the AXI4-Lite peripheral; I slightly modified the template.

https://gist.github.com/sauka/f1d4ce839922b1930904f99e39b2f391

Notes:

  the input port available_data_cnt is driven by another part of the HDL design (the part in charge of writing slave_reg0 and generating the IRQ);

  in the "write logic generation" process I have commented out the part related to slave_reg0, since I only ever write to it from the PL;

 

 

br,

Pavel

johnmcd
Xilinx Employee
3,726 Views
Registered: ‎02-01-2008

If you are using Zynq with baremetal, check the MMU configuration. I believe the default translation table in the baremetal BSP file translation_table.S marks the address range 0x40000000-0x7fffffff as strongly ordered. You could try changing that range to 'shareable device'. If I recall correctly, this change gives roughly 10x the throughput of 'strongly ordered'.

Something else you could do is change the MMU translation tables so that the address range is tagged as memory, non-cached. Then, if you read from incrementing addresses, the SCU will coalesce the accesses into an AXI3 burst transfer. But bursts may not be what you want.

by_sauka
Visitor
3,668 Views
Registered: ‎12-20-2016

Hi johnmcd,

 

Thanks for your answer.

Yes, I'm running Zynq (ZC706 board) with Linux from the Analog Devices repo. Is your answer still relevant? I think that at the bootloader stage it doesn't matter whether I run baremetal or Linux, or am I wrong here?

 

Now the settings in translation_table.S are:

 

.rept	0x0400			/* 0x40000000 - 0x7fffffff (FPGA slave0) */
.word	SECT + 0xc02		/* S=b0 TEX=b000 AP=b11, Domain=b0, C=b0, B=b1 */
.set	SECT, SECT+0x100000
.endr

I need to change them like this, right?

 

.word SECT + 0xc06
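As a sanity check on the two values, here is a minimal sketch that decodes the attribute bits of an ARMv7-A short-descriptor section entry. It assumes TEX remap is disabled, and the names `section_attrs`, `decode_section` and `memory_type` are made up for illustration. Only the encodings discussed in this thread are named.

```c
#include <stdint.h>
#include <string.h>

/* Attribute fields of an ARMv7-A short-descriptor "section" entry (1 MB). */
struct section_attrs {
    unsigned b;      /* bit 2       */
    unsigned c;      /* bit 3       */
    unsigned domain; /* bits [8:5]  */
    unsigned ap;     /* bits [11:10] */
    unsigned tex;    /* bits [14:12] */
    unsigned s;      /* bit 16      */
};

struct section_attrs decode_section(uint32_t entry)
{
    struct section_attrs a;
    a.b      = (entry >> 2)  & 0x1u;
    a.c      = (entry >> 3)  & 0x1u;
    a.domain = (entry >> 5)  & 0xfu;
    a.ap     = (entry >> 10) & 0x3u;
    a.tex    = (entry >> 12) & 0x7u;
    a.s      = (entry >> 16) & 0x1u;
    return a;
}

/* Memory type for the TEX/C/B combinations relevant to this thread. */
const char *memory_type(struct section_attrs a)
{
    if (a.tex == 0 && a.c == 0 && a.b == 0) return "strongly-ordered";
    if (a.tex == 0 && a.c == 0 && a.b == 1) return "shareable device";
    if (a.tex == 1 && a.c == 0 && a.b == 0) return "normal, non-cacheable";
    return "other";
}
```

Decoded this way, 0xc02 has C=0 and B=0 (strongly ordered), and 0xc06 has C=0 and B=1 (shareable device). If that is right, the "B=b1" in the comment next to 0xc02 looks inaccurate, but the ARM Architecture Reference Manual is the authority here.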

 

A big part of my design comes from Analog Devices, and their IP blocks also use AXI4 to access their register space. How will the above change affect the other IP blocks, apart from my custom IP?

P.S. I haven't had a chance to test your suggestion in hardware yet, but I'll let you know soon how it works.

 

Kind Regards,

Pavel

johnmcd
Xilinx Employee
3,656 Views
Registered: ‎02-01-2008

My answer was specific to baremetal, not Linux. I'm not sure what takes place when using ioread32(). In the past I have used mmap() and a simple UIO driver to interact with custom logic.
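For reference, a minimal user-space sketch of the mmap() plus UIO approach. This is an illustration under assumptions, not the driver from this thread: the device path (for example /dev/uio0), the region length and the helper names `map_uio_regs` and `reg_read32` are all hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a UIO device's first memory region (mmap offset 0).
 * Returns NULL if the device cannot be opened or mapped. */
volatile uint32_t *map_uio_regs(const char *dev_path, size_t len)
{
    int fd = open(dev_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */

    return (p == MAP_FAILED) ? NULL : (volatile uint32_t *)p;
}

/* Read one 32-bit register at a word-aligned byte offset from the base. */
uint32_t reg_read32(const volatile uint32_t *base, size_t byte_offset)
{
    assert((byte_offset & 3u) == 0);
    return base[byte_offset / 4];
}
```

Assuming slave_reg0 sits at offset 0 of the peripheral, the size register would then be read as reg_read32(regs, 0), where regs is the pointer returned by map_uio_regs().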

 

You could try using a baremetal app to test throughput with the different translation table entries, to compare shared device against strongly ordered. It is possible that your Linux ioread32() path already maps the address range as shared device, but you will have to check that part on your own.

 

Your translation table entry is correct for changing the address range from strongly ordered to shared device. The entry repeats 0x0400 times and therefore affects all peripherals within the 0x40000000-0x7fffffff address range. You could adjust the repeat value and add additional lines to narrow the affected address range, but that isn't necessary for a simple test.
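If you do want to narrow the range, the arithmetic can be sketched as below. Each first-level section entry covers 1 MB; the helper names and the example peripheral base address 0x43C00000 are assumptions for illustration, so use the address from your own address map.

```c
#include <stdint.h>

#define SECTION_SIZE 0x100000u /* 1 MB per first-level section entry */

/* Number of 1 MB section entries covering [start, end_inclusive]. */
uint32_t section_count(uint32_t start, uint32_t end_inclusive)
{
    return (end_inclusive - start) / SECTION_SIZE + 1u;
}

/* Index of the section containing `addr`, counted from `range_start`. */
uint32_t section_index(uint32_t range_start, uint32_t addr)
{
    return (addr - range_start) / SECTION_SIZE;
}
```

section_count(0x40000000, 0x7fffffff) gives 0x400, which matches the .rept value. For the hypothetical base 0x43C00000, section_index(0x40000000, 0x43C00000) is 60, so the block could be split into a .rept of 60 unchanged entries, one entry with the new attributes, and a .rept of the remaining 963 entries.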

 

 

by_sauka
Visitor
3,594 Views
Registered: ‎12-20-2016

Thanks for your help.

 

I played with the MMU translation table, and it seems to have almost no effect. I suspect the issue might be related to the design itself, since it is quite big and occupies almost the entire FPGA (and I have strict clock-frequency requirements, so I use many place-and-route optimization strategies).

 

I have a new UltraScale+ device (ZCU102), so I'm trying to implement my project on it.

 

In the meantime, could you provide a working example of measuring execution time in a bare-metal application? I know I could use an AXI Timer, but I would strongly prefer not to modify the HDL design by adding this timer and re-implementing the whole project.

 

br,

Pavel

johnmcd
Xilinx Employee
3,564 Views
Registered: ‎02-01-2008

There are triple timer counters available in the PS so a timer in the PL is not necessary.
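Whichever PS counter you sample, the measurement comes down to two reads of a free-running counter around the code under test. A minimal sketch of the arithmetic follows; the helper names are made up, and the counter width and tick rate are assumptions that you would have to take from the TRM and your clock configuration.

```c
#include <stdint.h>

/* Ticks elapsed between two samples of a free-running up-counter that is
 * `width_bits` wide, allowing for at most one wraparound in between. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end, unsigned width_bits)
{
    uint32_t mask = (width_bits >= 32) ? 0xffffffffu
                                       : ((1u << width_bits) - 1u);
    return (end - start) & mask;
}

/* Convert ticks to nanoseconds for a counter incrementing at tick_hz. */
uint64_t ticks_to_ns(uint32_t ticks, uint32_t tick_hz)
{
    return (uint64_t)ticks * 1000000000ull / tick_hz;
}
```

For example, with a 16-bit counter, samples of 0xfff0 and 0x0010 are 0x20 ticks apart despite the wraparound. A narrow counter wraps quickly, so the measured interval has to be shorter than one full counter period (or the counter prescaled) for the result to be meaningful.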

 

There are also AXI Performance Monitors (APM) available in the PS, and I see that the baremetal driver includes a few examples.

 
