LWIP transmit stalls on Zynq Ultrascale+

Setup: I'm developing a bare-metal embedded system on a Zynq Ultrascale+ MPSoC that communicates with a host PC over TCP/IP using the LWIP stack provided in Vitis 2019.2. The host PC sends commands/queries to the Zynq, which replies.

Problem: The replies occasionally (after several seconds of 100 Hz messaging) get stuck in the LWIP transmit buffers, even though I call tcp_write followed by tcp_output. Using Wireshark and a managed switch, I've confirmed that the reply data never gets onto the wire. Once the data is stuck, a subsequent TCP write will "unstick" the LWIP stack and both the stuck and subsequent data show up at the host, but without that nudge it stays stuck indefinitely, regardless of further calls to tcp_output. tcp_sndbuf() shows that the LWIP stack still has the data in its buffers, since it reports n bytes less buffer space than normal (e.g., 8188 vs 8192 if 4 bytes are stuck).

Clues: I've reduced this phenomenon to a minimal case, using a slightly modified version of the LWIP echo example Vitis generates (listed below). If I reply to the host in recv_callback, the problem does not manifest. If I reply to the host from the main loop, the problem occurs. You can switch the code below between the two cases by turning the IN_RCV define on or off. I don't want to generate all replies in the receive callback because some commands require a long operation before replying, and I want to keep calling the LWIP tcp_fasttmr/tcp_slowtmr routines during those operations. To make a stall happen, I run the code below on the Zynq without IN_RCV defined, while the PC runs a simple Python script that sends ">xxSTATUS\n" messages (where xx counts up); the Zynq replies with "<xxSTATUS\n", transmitted from the main loop. If IN_RCV is defined, replies are transmitted from recv_callback(), and no stall ever happens.

//#define IN_RCV	/* define to send replies from recv_callback(); leave undefined to reply from the main loop */

#include <stdio.h>
#include <string.h>

#include "lwip/err.h"
#include "lwip/tcp.h"
#include "xil_printf.h"

/* latest command captured by recv_callback() for the main loop to answer */
int rxlen;
char rxbuf[100];
struct tcp_pcb *rxpcb;

/* called from the main loop; sends the reply queued by recv_callback() */
int transfer_data() {
#ifndef IN_RCV
	int i;

	if (rxlen)
	{
		rxbuf[0] = '<';
		if (tcp_sndbuf(rxpcb) > rxlen) {
			/* turn ">xxSTATUS..." into "<xxSTATUS..." and send it back */
			for (i = 0; i < rxlen; i++)
				if (rxbuf[i] == '>')
					rxbuf[i] = '<';
			tcp_write(rxpcb, rxbuf, rxlen, TCP_WRITE_FLAG_COPY);
			if (tcp_output(rxpcb) != ERR_OK)
				printf("tcp_output error\r\n");
		} else
			xil_printf("no space in tcp_sndbuf\n\r");
		rxlen = 0;
	}
#endif
	return 0;
}

void print_app_header()
{
#if (LWIP_IPV6==0)
	xil_printf("\n\r\n\r-----lwIP TCP echo server ------\n\r");
#else
	xil_printf("\n\r\n\r-----lwIPv6 TCP echo server ------\n\r");
#endif
	xil_printf("TCP packets sent to port 4242 will be echoed back\n\r");
}

/* periodic poll callback: report free send-buffer space so a stall is visible */
err_t poll_callback(void *arg, struct tcp_pcb *tpcb)
{
	printf("sendbuf: %d\r\n", tcp_sndbuf(tpcb));
	return ERR_OK;
}

err_t recv_callback(void *arg, struct tcp_pcb *tpcb,
                               struct pbuf *p, err_t err)
{
	int i;
	char *s;

	if (!p) {
		/* remote host closed the connection */
		tcp_close(tpcb);
		tcp_recv(tpcb, NULL);
		return ERR_OK;
	}

#ifdef IN_RCV
	/* reply immediately from the receive callback */
	if (tcp_sndbuf(tpcb) > p->len) {
		s = (char *)p->payload;
		for (i = 0; i < p->len; i++)
			if (s[i] == '>')
				s[i] = '<';
		/* no explicit tcp_output(): lwIP flushes after this callback returns */
		err = tcp_write(tpcb, p->payload, p->len, TCP_WRITE_FLAG_COPY);
	} else
		xil_printf("no space in tcp_sndbuf\n\r");
#else
	/* queue the message for transfer_data() to answer from the main loop */
	rxlen = (p->len < sizeof(rxbuf)) ? p->len : sizeof(rxbuf);	/* guard against overrunning rxbuf */
	memcpy(rxbuf, p->payload, rxlen);
	rxpcb = tpcb;
#endif

	tcp_recved(tpcb, p->len);
	pbuf_free(p);
	return ERR_OK;
}

err_t accept_callback(void *arg, struct tcp_pcb *newpcb, err_t err)
{
	static int connection = 1;

	/* register callbacks for the new connection; the poll fires every ~500 ms */
	tcp_recv(newpcb, recv_callback);
	tcp_poll(newpcb, poll_callback, 1);
	tcp_arg(newpcb, (void*)(UINTPTR)connection);
	connection++;

	return ERR_OK;
}

int start_application()
{
	struct tcp_pcb *pcb;
	err_t err;
	unsigned port = 4242;
	rxlen = 0;

	pcb = tcp_new_ip_type(IPADDR_TYPE_ANY);
	err = tcp_bind(pcb, IP_ANY_TYPE, port);
	if (err != ERR_OK)
		xil_printf("tcp_bind failed: %d\n\r", err);
	tcp_arg(pcb, NULL);
	pcb = tcp_listen(pcb);
	tcp_accept(pcb, accept_callback);

	xil_printf("TCP echo server started @ port %d\n\r", port);

	return 0;
}

 

Guesses: My first thought was that calling tcp_write from the main loop was somehow in the wrong context. Putting a breakpoint in recv_callback shows that it is indeed called from the main loop's context, not from an interrupt, so I believe calling tcp_write from the main loop is legal. The echo example even has a transfer_data() routine called from the main loop. My only other guess was that LWIP disables interrupts, or does something else before or after recv_callback is triggered, that avoids this stalled transmit, but after stepping through the LWIP code I only see a tcp_output() call after recv_callback() returns.
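For reference, the main loop in question follows the structure of the generated echo example's main.c, roughly as sketched below. The flag and netif names (TcpFastTmrFlag, TcpSlowTmrFlag, echo_netif) come from the generated example and may differ between versions; treat this as a sketch of the calling context, not the exact code. transfer_data() is the routine listed above.

#include "lwip/netif.h"
#include "netif/xadapter.h"		/* xemacif_input() */

/* lwIP raw-API timer functions (declared in lwip/priv/tcp_priv.h in lwIP 2.x) */
extern void tcp_fasttmr(void);
extern void tcp_slowtmr(void);

extern volatile int TcpFastTmrFlag;	/* set every 250 ms by the timer ISR */
extern volatile int TcpSlowTmrFlag;	/* set every 500 ms by the timer ISR */
extern struct netif *echo_netif;	/* brought up during lwIP init */
int transfer_data();			/* the routine listed above */

void main_loop(void)
{
	while (1) {
		/* run the TCP timers from thread context */
		if (TcpFastTmrFlag) {
			tcp_fasttmr();
			TcpFastTmrFlag = 0;
		}
		if (TcpSlowTmrFlag) {
			tcp_slowtmr();
			TcpSlowTmrFlag = 0;
		}
		/* pull received frames into LWIP; recv_callback() runs inside this call */
		xemacif_input(echo_netif);
		/* application work: transmit any queued reply */
		transfer_data();
	}
}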

Request: Has anyone else seen similar behavior, and found a solution?

Thanks!

-Greg


Re: LWIP transmit stalls on Zynq Ultrascale+

Update: I built the same echo server on the ZCU102, and saw the same bug when using Vitis 2019.2.  However, the bug goes away for both the ZCU102 and my hardware when compiled with Vivado 2017.3.  I'm not sure if this is a bug that got introduced in the 2019.2 LWIP code or the emacps driver.  I'll stick with 2017.3 for now, but this will be worth tracking down and correcting in the future.

-Greg 


Re: LWIP transmit stalls on Zynq Ultrascale+

Hi Greg

If you have 2018.1, I would suggest checking with that version to see if you still see this issue.

The reason is that 2017.3 has lwip141, while starting from 2018.1 the echo server targets lwip 2.0.2.

 

Best Regards

Shabbir


Re: LWIP transmit stalls on Zynq Ultrascale+

Another update:

I've spent a while debugging this issue in 2017.3. The root problem is that the emacps driver does not check whether a DMA is already in progress before trying to start a new one, and fails silently if they overlap. So, if my code calls tcp_output() from the main loop just as LWIP has started a packet of its own (even just an ACK), there's a chance the second send gets swallowed. An ugly hack to fix the problem is to create a global flag (I called it "tx_in_progress") that gets set when a DMA is started (at the bottom of emacps_sgsend() in xemacpsif_dma.c) and cleared when the DMA completes (in XEmacPs_IntrHandler() in xemacps_intr.c). I then check tx_in_progress and don't call tcp_output() if it's set, and the stalling issue goes away.
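For illustration only, the application side of that workaround looks roughly like the sketch below. The flag is set and cleared in the two driver files as described above; flush_if_idle is a hypothetical helper name, not part of LWIP or the BSP.

#include "lwip/tcp.h"
#include "xil_printf.h"

/* Set at the bottom of emacps_sgsend() in xemacpsif_dma.c when a TX DMA is
 * started, cleared in XEmacPs_IntrHandler() in xemacps_intr.c when the
 * TX-complete interrupt fires. */
volatile int tx_in_progress = 0;

/* Only flush explicitly when the GEM is idle; otherwise leave the data queued
 * and let it go out with the next transmit LWIP starts on its own. */
static void flush_if_idle(struct tcp_pcb *pcb)
{
	if (!tx_in_progress) {
		if (tcp_output(pcb) != ERR_OK)
			xil_printf("tcp_output error\n\r");
	}
}

In transfer_data() above, the tcp_output() call would then be replaced by something like flush_if_idle(rxpcb).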

A clean fix would require correcting the Xilinx driver code, but I don't know whether having it return an error or wait and retry would be better.

--Greg


Re: LWIP transmit stalls on Zynq Ultrascale+

Hi @gbredthauer,

Thanks for the follow-up; this is great info.

Is there a test case and steps you can share so I can try to reproduce the issue on a ZCU102 or ZCU106?

If so, I can go ahead and file a change request to clean up the Xilinx driver.

"Don't forget to reply, kudo and accept as solution."
0 Kudos
Highlighted
Explorer
Explorer
28 Views
Registered: ‎02-27-2008

Re: LWIP transmit stalls on Zynq Ultrascale+

I've done additional testing on the ZCU102. I used Vivado 2019.1 so I could build the example design from XAPP1305, which uses an SFP network connection via AXI Ethernet in the PL instead of the PS GEM.

SFP test:

  1. Install an RJ45 SFP module into the SFP0 slot on the ZCU102, power on
  2. Build the pl_eth_1g example from XAPP1305 in Vivado 2019.1
  3. Program the PL from the hardware manager in Vivado (after setting the option to allow programming ES2 silicon, since my ZCU102 has an ES2 part and 2019.1 has dropped support for it)
  4. Export hardware to the SDK
  5. Generate the LWIP echo example in the SDK (I disable DHCP in the BSP, set the link rate to 1000 instead of autonegotiate, and set the IP to 10.0.0.2 in main.c; a sketch of the static IP setup follows this list)
  6. Replace the contents of echo.c with the echo code I posted earlier
  7. Run the program on the PS
  8. Run a Python 3 script on a Windows PC with IP 10.0.0.1 to hammer the ZCU102 with network queries (attached)
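For step 5, the static address is assigned with the usual LWIP netif calls, roughly as below. This is a sketch assuming an IPv4-only build; the generated main.c already contains an equivalent block, and its variable names may differ.

#include "lwip/ip_addr.h"
#include "lwip/netif.h"

/* assign the static address used in these tests instead of running DHCP */
static void set_static_ip(struct netif *echo_netif)
{
	ip_addr_t ipaddr, netmask, gw;

	IP4_ADDR(&ipaddr,  10, 0, 0, 2);	/* board */
	IP4_ADDR(&netmask, 255, 255, 255, 0);
	IP4_ADDR(&gw,      10, 0, 0, 1);	/* host PC */

	netif_set_addr(echo_netif, &ipaddr, &netmask, &gw);
}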

The script ran overnight, and exchanged 27M messages without an error.

GEM test:

  1. Create a ZCU102 design in Vivado 2019.1
  2. Add a block design
  3. Add the Zynq Ultrascale+ IP
  4. Run board automation
  5. Customize the Zynq, remove the HP AXI ports
  6. Validate and save the design
  7. Export hardware to the SDK
  8. Generate the LWIP echo example in the SDK (I disable DHCP in the BSP, and set the IP to 10.0.0.2 in main.c)
  9. Replace the contents of echo.c with the echo code I posted earlier
  10. Run the program on the PS
  11. Run a Python 3 script on a Windows PC with IP 10.0.0.1 to hammer the ZCU102 with network queries (attached)

The script typically runs for 10-60 seconds and then reports an error, because a response from the ZCU102 was not received.
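(The attached Python 3 script is not reproduced here. Purely as an illustration of the traffic pattern, an equivalent host-side client in C using POSIX sockets might look like the sketch below; the port, address, and ">xxSTATUS\n"/"<xxSTATUS\n" message format come from the posts above, everything else is an assumption.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr = { 0 };
	char query[32], reply[32];
	unsigned long n = 0;

	int fd = socket(AF_INET, SOCK_STREAM, 0);
	addr.sin_family = AF_INET;
	addr.sin_port = htons(4242);			/* echo server port */
	inet_pton(AF_INET, "10.0.0.2", &addr.sin_addr);
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("connect");
		return 1;
	}

	for (;;) {
		/* send ">nnSTATUS\n" and expect "<nnSTATUS\n" back */
		int len = snprintf(query, sizeof(query), ">%02luSTATUS\n", n % 100);
		if (write(fd, query, len) != len)
			break;

		/* read until the full reply arrives; a receive timeout (e.g. SO_RCVTIMEO)
		 * would be needed to detect a stalled reply, omitted here for brevity */
		int got = 0;
		while (got < len) {
			int r = read(fd, reply + got, len - got);
			if (r <= 0) {
				perror("read");
				return 1;
			}
			got += r;
		}
		if (reply[0] != '<' || memcmp(reply + 1, query + 1, len - 1) != 0) {
			fprintf(stderr, "bad reply at message %lu\n", n);
			return 1;
		}
		n++;
	}
	return 0;
}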

I think this is a pretty conclusive apples-to-apples test that shows there's a bug in the GEM driver or hardware.  Short term, I'll respin my board to route the RGMII interface to the PL instead of the PS so I can bypass the GEM.

--Greg
