gbredthauer

LWIP transmit stalls on Zynq UltraScale+


Setup: I'm developing a bare-metal embedded system on a Zynq UltraScale+ MPSoC that communicates with a host PC over TCP/IP using the LWIP stack provided in Vitis 2019.2. The host PC sends commands/queries to the Zynq, which replies.

Problem: The replies occasionally (after several seconds of 100 Hz messaging) get stuck in the LWIP transmit buffers, even though I call tcp_write followed by tcp_output. Using Wireshark and a managed switch, I've confirmed that the reply data never gets onto the wire. After getting stuck, a subsequent TCP write will "unstick" the LWIP stack, and both the stuck and subsequent data show up at the host; without that nudge it stays stuck indefinitely, regardless of calls to tcp_output. tcp_sndbuf() shows that the LWIP stack still holds the data in its buffers, since it reports n bytes less buffer space than normal (e.g. 8188 vs. 8192 if 4 bytes are stuck).

Clues: I've reduced this phenomenon to a minimal case, using a slightly modified version of the LWIP echo example Vitis generates (listed below).  If I reply to the host in the recv_callback, this problem does not manifest.  If I reply to the host from the main loop, this problem occurs.  You can switch the code below between the two cases by switching the IN_RCV define on or off.  I don't want to generate all replies in the receive callback because some commands require a long operation before replying, and I want to keep calling the LWIP tcp_fasttmr/tcp_slowtmr routines during those operations.  To get a stall to happen, I run the code below on the Zynq without IN_RCV defined, and the PC has a simple Python script that sends ">xxSTATUS\n" messages (where xx counts up), and the Zynq replies with "<xxSTATUS\n" (transmitted from the main loop).  If IN_RCV is defined, replies are transmitted from recv_callback(), and no stall ever happens.

//#define IN_RCV	/* define to reply directly from recv_callback; leave undefined to reply from the main loop */

#include <stdio.h>
#include <string.h>

#include "lwip/err.h"
#include "lwip/tcp.h"
#include "xil_printf.h"

int rxlen;
char rxbuf[100];
struct tcp_pcb *rxpcb;

int transfer_data() {
#ifndef IN_RCV
	int i;

	if (rxlen)
	{
		rxbuf[0] = '<';
		if (tcp_sndbuf(rxpcb) > rxlen) {
			for (i = 0; i < rxlen; i++)
				if (rxbuf[i] == '>')
					rxbuf[i] = '<';
			tcp_write(rxpcb, rxbuf, rxlen, TCP_WRITE_FLAG_COPY);
			if (tcp_output(rxpcb) != ERR_OK)
				printf("tcp_output error\r\n");
		} else
			xil_printf("no space in tcp_sndbuf\n\r");
		rxlen = 0;
	}
#endif
	return 0;
}

void print_app_header()
{
#if (LWIP_IPV6==0)
	xil_printf("\n\r\n\r-----lwIP TCP echo server ------\n\r");
#else
	xil_printf("\n\r\n\r-----lwIPv6 TCP echo server ------\n\r");
#endif
	xil_printf("TCP packets sent to port 4242 will be echoed back\n\r");
}

err_t poll_callback(void *arg, struct tcp_pcb *tpcb)
{
	printf("sendbuf: %d\r\n", tcp_sndbuf(tpcb));
	return ERR_OK;
}

err_t recv_callback(void *arg, struct tcp_pcb *tpcb,
                               struct pbuf *p, err_t err)
{
	int i;
	char *s;

	if (!p) {
		tcp_close(tpcb);
		tcp_recv(tpcb, NULL);
		return ERR_OK;
	}

#ifdef IN_RCV
	if (tcp_sndbuf(tpcb) > p->len) {
		s = (char *)p->payload;
		for (i = 0; i < p->len; i++)
			if (s[i] == '>')
				s[i] = '<';
		err = tcp_write(tpcb, p->payload, p->len, TCP_WRITE_FLAG_COPY);
	} else
		xil_printf("no space in tcp_sndbuf\n\r");
#else
	memcpy(rxbuf, p->payload, p->len);
	rxlen = p->len;
	rxpcb = tpcb;
#endif

	tcp_recved(tpcb, p->len);
	pbuf_free(p);
	return ERR_OK;
}

err_t accept_callback(void *arg, struct tcp_pcb *newpcb, err_t err)
{
	static int connection = 1;

	tcp_recv(newpcb, recv_callback);
	tcp_poll(newpcb, poll_callback, 1);
	tcp_arg(newpcb, (void*)(UINTPTR)connection);
	connection++;

	return ERR_OK;
}

int start_application()
{
	struct tcp_pcb *pcb;
	err_t err;
	unsigned port = 4242;
	rxlen = 0;

	pcb = tcp_new_ip_type(IPADDR_TYPE_ANY);
	err = tcp_bind(pcb, IP_ANY_TYPE, port);
	tcp_arg(pcb, NULL);
	pcb = tcp_listen(pcb);
	tcp_accept(pcb, accept_callback);

	xil_printf("TCP echo server started @ port %d\n\r", port);

	return 0;
}

 

Guesses: My first thought was that calling tcp_write from the main loop was somehow in the wrong context. However, a breakpoint in recv_callback shows that it is called from the main loop's context, not from an interrupt, so I believe calling tcp_write from the main loop is legal; the echo example even has a transfer_data() routine called from the main loop. My only other guess was that LWIP disables interrupts, or does something else around the recv_callback trigger, that avoids this stalled transmit, but after stepping through the LWIP code I only see a tcp_output() call after recv_callback().

Request: Has anyone else seen similar behavior, and found a solution?

Thanks!

-Greg

Accepted Solution
martin6314

Hi Greg,

You might want to consider this solution to a similar problem:

https://forums.xilinx.com/t5/Ethernet/ERR-MEM-on-tcp-write-from-a-corrupt-memory-table-standalone-BSP/m-p/1141250 

The solution was to disable interrupts in ram.c to prevent the Xilinx layer from interrupting tcp_write

Best regards
Martin


gbredthauer

Update: I built the same echo server on the ZCU102, and saw the same bug when using Vitis 2019.2.  However, the bug goes away for both the ZCU102 and my hardware when compiled with Vivado 2017.3.  I'm not sure if this is a bug that got introduced in the 2019.2 LWIP code or the emacps driver.  I'll stick with 2017.3 for now, but this will be worth tracking down and correcting in the future.

-Greg 

shabbirk

Hi Greg

If you have 2018.1, I would suggest checking with that version to see if you still see this issue.

The reason is that 2017.3 ships lwip141, while starting from 2018.1 the echo server targets lwIP 2.0.2.

 

Best Regards

Shabbir

gbredthauer

Another update:

I've spent a while debugging this issue in 2017.3. The root problem is that the emacps driver does not check whether a DMA is already in progress before trying to start a new one, and fails silently if they overlap. So, if my code calls tcp_output() from the main loop while LWIP has just started a packet (even just an ACK), there's a chance that the second send gets swallowed. An ugly hack to fix the problem is to create a global flag (I called it "tx_in_progress") that gets set when a DMA is started (at the bottom of emacps_sgsend() in xemacpsif_dma.c) and cleared when the DMA completes (in XEmacPs_IntrHandler() in xemacps_intr.c). I then check tx_in_progress and don't call tcp_output() if it's set, and the stalling issue goes away.

A clean fix would require correcting the Xilinx driver code, but I don't know if having it return an error or waiting and retrying would be best. 
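In case it helps, the flag logic can be sketched as a minimal, self-contained model. This is only a sketch of the idea, not the actual driver patch: in the real code the set/clear live in emacps_sgsend() and XEmacPs_IntrHandler() as described above, and the names model_start_dma/model_dma_complete_isr/try_tcp_output here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the "tx_in_progress" workaround.  In the real patch the
 * flag is set at the bottom of emacps_sgsend() when a TX DMA starts and
 * cleared in XEmacPs_IntrHandler() on TX-complete; both are stubbed here. */
static volatile bool tx_in_progress = false;

/* Stand-in for starting a TX DMA (end of emacps_sgsend()). */
static void model_start_dma(void) { tx_in_progress = true; }

/* Stand-in for the TX-complete interrupt (XEmacPs_IntrHandler()). */
static void model_dma_complete_isr(void) { tx_in_progress = false; }

/* Main-loop guard: only kick a send when no DMA is in flight.
 * Returns false when the send is deferred to the next loop pass. */
static bool try_tcp_output(void)
{
	if (tx_in_progress)
		return false;
	model_start_dma(); /* in the real code: tcp_output(pcb) starts the DMA */
	return true;
}
```

Nothing is lost when try_tcp_output() returns false; the data stays queued in the lwIP send buffer and the main loop simply retries on its next pass.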

--Greg

nanz

Hi @gbredthauer,

Thanks for the follow-up and this is great info.

Is there a test case and steps you can share so I could try to reproduce the issue on ZCU102 or ZCU106?

If so, I can go ahead and file a change request to clean up the Xilinx driver.


gbredthauer

I've done additional testing on the ZCU102. I used Vivado 2019.1 so I could use the example design from XAPP1305, which provides an SFP network connection via AXI Ethernet instead of the PS GEM.

SFP test:

  1. Install an RJ45 SFP module into the SFP0 slot on the ZCU102, power on
  2. Build the pl_eth_1g example from XAPP1305 in Vivado 2019.1
  3. Program the PL from the hardware manager in Vivado (after setting the option to allow programming ES2 silicon, since my ZCU102 has an ES2 part and 2019.1 has dropped support for it)
  4. Export hardware to the SDK
  5. Generate the LWIP echo example in the SDK (I disable DHCP in the BSP, set the link rate to 1000 instead of autonegotiate, and set the IP to 10.0.0.2 in main.c)
  6. Replace the contents of echo.c with the echo code I posted earlier
  7. Run the program on the PS
  8. Run a python 3 script on a Windows PC with IP 10.0.0.1 to hammer the ZCU102 with network queries (attached)

The script ran overnight, and exchanged 27M messages without an error.

GEM test:

  1. Create a ZCU102 design in Vivado 2019.1
  2. Add a block design
  3. Add the Zynq Ultrascale+ IP
  4. Run board automation
  5. Customize the Zynq, remove the HP AXI ports
  6. Validate and save the design
  7. Export hardware to the SDK
  8. Generate the LWIP echo example in the SDK (I disable DHCP in the BSP, and set the IP to 10.0.0.2 in main.c)
  9. Replace the contents of echo.c with the echo code I posted earlier
  10. Run the program on the PS
  11. Run a python 3 script on a Windows PC with IP 10.0.0.1 to hammer the ZCU102 with network queries (attached)

The script will typically run for 10-60 seconds, and then give an error (as a response from the ZCU102 was not received).

I think this is a pretty conclusive apples-to-apples test that shows there's a bug in the GEM driver or hardware.  Short term, I'll respin my board to route the RGMII interface to the PL instead of the PS so I can bypass the GEM.

--Greg


gbredthauer

Nice catch!  I reran my ZCU102/SDK2019.1/GEM test. Unchanged, it failed as before; I then altered opt.h to set:

#define LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT 1

This define appears to enable the mutexes mentioned in your thread, protecting against interrupt-context races that could corrupt the packet buffers. I changed this in the Xilinx LWIP source folder so it would propagate to the BSP (I know it's also possible to copy those sources elsewhere and point the BSP at the modified copy). It's been running without an error for about an hour; I'll let it go overnight just to make sure it's stable.
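For anyone curious, my (hedged) understanding is that this switch makes lwIP wrap its heap free path in a SYS_ARCH_PROTECT-style critical section, so a pbuf free from interrupt context can't race a main-loop allocation. A toy, self-contained model of that guard pattern follows; all names are illustrative stand-ins, not the real lwIP symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the critical-section guard that
 * LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT enables around lwIP's heap.
 * "Interrupts" are modeled by a simple enable flag. */
static bool irq_enabled = true;
static int heap_free_count = 0;

/* Mask interrupts, returning the previous state (like SYS_ARCH_PROTECT). */
static bool model_protect(void)
{
	bool was = irq_enabled;
	irq_enabled = false;
	return was;
}

/* Restore the previous interrupt state (like SYS_ARCH_UNPROTECT). */
static void model_unprotect(bool was) { irq_enabled = was; }

/* Guarded free: the heap bookkeeping update cannot be interrupted,
 * so an ISR-side alloc/free never sees a half-updated free list. */
static void model_guarded_free(void)
{
	bool saved = model_protect();
	heap_free_count++; /* stand-in for unlinking a block from the heap */
	model_unprotect(saved);
}
```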

Thanks!

-Greg

martin6314

Hi Greg,

I am glad if it worked.

However: Please note that protecting mem_free helps when you exchange a lot of data from the PC to your hardware. In that case your receive-callback code (running in main) uses pbuf_free, and your solution protects mem_free against the interrupting emacps_recv_handler.

My solution aims at protecting the other way around: if you have a lot of tcp_write calls (running in main), these use mem_malloc, and mem_malloc must be protected against the interrupting emacps_send_handler. This requires changing the definition of LWIP_MEM_ALLOC_PROTECT in the BSP code.

Best regards
Martin

 

martin6314

Hi Greg,

Please discard my last post. I misunderstood LWIP_ALLOW_MEM_FREE_FROM_OTHER_CONTEXT.
Yes, you are right: This switch will turn on the required protections.

Sorry, and best regards
Martin
