martin6314
Contributor
Registered: ‎02-16-2018

ERR_MEM on tcp_write from a corrupt memory table (standalone BSP)

We use LWIP202_V1_1 on an UltraScale+ standalone platform. Since the relevant code is the same in LWIP211_v1_2 from Vitis 2020.1, we think this information may also be valuable for the current release.

Our project listens on two ports for incoming connections. Port 80 transmits mostly standard-length TCP frames, and another port transmits short chunks of 32 to 96 bytes. Up to a medium traffic load, we observe no errors. If we stress the firmware by sending a lot of data to the PC on both ports, tcp_write returns the error ERR_MEM = -1 ("Out of memory error") after a while, and the connection must be closed to recover partially. The project uses the raw mode of LWIP (no sockets).

We think the problem lies in the Xilinx board support package, and we show a solution below.

Accepted Solutions
martin6314
Contributor
Registered: ‎02-16-2018

The problem depends strongly on timing (speed of the network and optimization level of the code). Our code runs in main (standalone BSP). We do not call any LWIP function in an interrupt. However, the Xilinx BSP implementation of LWIP uses interrupts internally.

We analysed the memory handler in mem.c (ram_heap). Normally, ram_heap contains a chain of memory slots. Each slot records whether it is occupied and holds pointers to the slots above and below it. The last slot points to the end of the heap and is always empty (not occupied). mem_malloc can take new memory from this last slot when all lower slots are occupied. In the error situation, this last slot is suddenly marked as occupied while still pointing to the end of the heap. This is surprising, because the "legal" slots fill only 6% of the available space. It looks as if someone had acquired all the remaining memory (94%) in one shot.
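The check described above can be sketched as a small consistency walker. This is only an illustration under stated assumptions: slot_t, heap_check and heap_percent_below_tail are hypothetical names, and the slot layout is a simplified model of the chain described in this post, not the real header used in mem.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified slot header (not lwIP's real layout).
 * All offsets are bytes from the start of the heap. */
typedef struct {
    uint32_t start; /* offset of this slot */
    uint32_t next;  /* offset of the slot above; heap size for the last slot */
    uint32_t prev;  /* offset of the slot below; own offset for the first slot */
    uint8_t  used;  /* 1 = occupied, 0 = free */
} slot_t;

typedef enum { HEAP_OK, HEAP_BROKEN_CHAIN, HEAP_TAIL_OCCUPIED } heap_state_t;

/* Walk the chain bottom-up, verify the links, then inspect the last slot. */
heap_state_t heap_check(const slot_t *slots, size_t count, uint32_t heap_size)
{
    assert(slots != NULL && count > 0);
    for (size_t i = 0; i < count; i++) {
        uint32_t want_prev = (i == 0) ? slots[0].start : slots[i - 1].start;
        uint32_t want_next = (i + 1 < count) ? slots[i + 1].start : heap_size;
        if (slots[i].prev != want_prev || slots[i].next != want_next) {
            return HEAP_BROKEN_CHAIN;
        }
    }
    /* In the failure described here, the last slot is marked occupied. */
    return slots[count - 1].used ? HEAP_TAIL_OCCUPIED : HEAP_OK;
}

/* Percentage of the heap covered by the slots below the last one. */
uint32_t heap_percent_below_tail(const slot_t *slots, size_t count,
                                 uint32_t heap_size)
{
    assert(slots != NULL && count > 0 && heap_size > 0);
    return (uint32_t)(((uint64_t)slots[count - 1].start * 100u) / heap_size);
}
```

For a model heap of 1000 bytes with slots at offsets 0, 20 and 60, heap_check reports HEAP_OK while the last slot is free and HEAP_TAIL_OCCUPIED once it is marked used, with the lower slots covering 6% of the heap. Running such a check after each tcp_write could help narrow down when the corruption happens.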

We then analysed the next higher layer in LWIP that uses mem.c. This is the pbuf.c layer (PBUF_RAM), which buffers the tcp_write data until it is acknowledged by the PC. Because the problem is timing dependent, we suspected an interrupt problem. Calling LWIP functions from an interrupt is not allowed, so we do not call any LWIP function in interrupt context. However, the Xilinx BSP implementation still uses the interrupt handler emacps_send_handler at driver level. This means that an interrupt can still cause a conflicting access to the mem.c layer.

We observe that after a sent frame is acknowledged by the PC, the following functions are executed to release the buffers sent:
emacps_send_handler -> process_sent_bds -> pbuf_free -> mem_free
Please note that this is called in an interrupt context within the Xilinx board support package.

On the other side, there is the tcp_write command, which fills the data into the buffer with the following functions:
tcp_write (PBUF_RAM) -> pbuf_alloc -> mem_malloc
Please note that this is called from main. This means it can be interrupted by emacps_send_handler as discussed above.

Looking into mem_malloc, we see that there is a stub, LWIP_MEM_ALLOC_PROTECT, intended to protect against interrupts.
However, this stub is empty, so there is no protection.

Our solution was to fill these originally empty stubs in mem.c with
#define LWIP_MEM_ALLOC_DECL_PROTECT() SYS_ARCH_DECL_PROTECT(lev_alloc)
#define LWIP_MEM_ALLOC_PROTECT() SYS_ARCH_PROTECT(lev_alloc)
#define LWIP_MEM_ALLOC_UNPROTECT() SYS_ARCH_UNPROTECT(lev_alloc)

After this change, the problem disappeared.

We conclude that tcp_write was acquiring memory via mem_malloc when it was interrupted by emacps_send_handler, and that this corrupted the memory structure in mem.c. With the solution above, the memory no longer gets corrupted.
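The effect of the protection can be sketched with a small single-threaded simulation. Everything here is a mock written for illustration: the mock_* and demo_* functions, the counter free_bytes and the latched "interrupt" are assumptions, and the three macros are redefined locally only to mirror the shape of the ones used in the fix. It does not reproduce the real heap corruption, only the lost-update pattern behind it.

```c
#include <assert.h>
#include <stdint.h>

typedef int sys_prot_t;        /* saved interrupt state (mock) */

static int      irq_masked;    /* 1 while simulated interrupts are disabled */
static int      irq_pending;   /* a simulated IRQ arrived while masked */
static uint32_t free_bytes;    /* bookkeeping shared by main and the "ISR" */

void demo_reset(void) { irq_masked = 0; irq_pending = 0; free_bytes = 1000; }

/* Simulated ISR: stands in for emacps_send_handler ending in mem_free. */
static void mock_isr_free(void) { free_bytes += 100; }

/* Deliver the simulated IRQ now, or latch it while interrupts are masked. */
static void mock_raise_irq(void)
{
    if (irq_masked) { irq_pending = 1; } else { mock_isr_free(); }
}

sys_prot_t mock_arch_protect(void)
{
    sys_prot_t old = irq_masked;
    irq_masked = 1;
    return old;
}

void mock_arch_unprotect(sys_prot_t old)
{
    irq_masked = old;
    if (!irq_masked && irq_pending) { irq_pending = 0; mock_isr_free(); }
}

/* Mock definitions with the same shape as the macros used in the fix. */
#define SYS_ARCH_DECL_PROTECT(lev) sys_prot_t lev
#define SYS_ARCH_PROTECT(lev)      lev = mock_arch_protect()
#define SYS_ARCH_UNPROTECT(lev)    mock_arch_unprotect(lev)

/* Allocation modelled as read-modify-write on the shared bookkeeping. */
uint32_t demo_alloc(uint32_t size, int protect)
{
    SYS_ARCH_DECL_PROTECT(lev_alloc);
    assert(size <= free_bytes);
    if (protect) { SYS_ARCH_PROTECT(lev_alloc); }
    uint32_t snapshot = free_bytes;  /* read */
    mock_raise_irq();                /* IRQ lands between read and write */
    free_bytes = snapshot - size;    /* write back */
    if (protect) { SYS_ARCH_UNPROTECT(lev_alloc); }
    return free_bytes;
}
```

Starting from 1000 free bytes, demo_alloc(200, 0) ends at 800 because the write-back overwrites the 100 bytes released by the simulated ISR, while demo_alloc(200, 1) ends at 900 because the ISR is deferred until the critical section is left.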

3 Replies
liuhb
Observer
Registered: ‎08-20-2014

I've checked the code, and I think this is already implemented?

/* Allow mem_free from other (e.g. interrupt) context */
#define LWIP_MEM_FREE_DECL_PROTECT() SYS_ARCH_DECL_PROTECT(lev_free)
#define LWIP_MEM_FREE_PROTECT() SYS_ARCH_PROTECT(lev_free)
#define LWIP_MEM_FREE_UNPROTECT() SYS_ARCH_UNPROTECT(lev_free)
#define LWIP_MEM_ALLOC_DECL_PROTECT() SYS_ARCH_DECL_PROTECT(lev_alloc)
#define LWIP_MEM_ALLOC_PROTECT() SYS_ARCH_PROTECT(lev_alloc)
#define LWIP_MEM_ALLOC_UNPROTECT() SYS_ARCH_UNPROTECT(lev_alloc)

Version: lwip140_v2_1

Thanks!

martin6314
Contributor
Registered: ‎02-16-2018

Hi liuhb

With the default settings, this is not enabled when using raw mode instead of socket mode.
I observed this in the LWIP202_V1_1 version and found the same in the LWIP211_v1_2 version.

But if you set LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT to 1
in lwip-2.1.1\src\include\lwip\opt.h,
the protection is enabled.
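As a config fragment, the setting could look like the sketch below. The #error guard is an optional addition of this example, not something taken from the BSP; it is only meant to make a build fail early if the option is ever reset to 0, for instance after the BSP sources are regenerated.

```c
/* Enable the interrupt-safe variants of the mem.c protection macros. */
#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1

/* Optional build-time guard (an assumption of this sketch). */
#if !LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT
#error "mem_malloc/mem_free are not protected against the EMAC send interrupt"
#endif
```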

Thanks also to greg for pointing me to LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT instead of hard-coding:
https://forums.xilinx.com/t5/Ethernet/LWIP-transmit-stalls-on-Zynq-Ultrascale/m-p/1141381/highlight/true#M20070 

Best regards
Martin
