09-14-2018 04:53 PM - edited 09-14-2018 04:58 PM
Data received via TCP/IP often gets corrupted. I can reproduce the corruption with a few lines of code, which I can post here. I'm using lwIP 2.0.2 in a bare-metal app for the Zynq 7010 (Zybo board). The app is compiled using XSDK version 2018.2.1. I'm using the Ethernet controller of the Zynq's PS, not an AXI Ethernet in the PL.
I simply send 1 MB of random data preceded by an Adler-32 checksum. If I send the data slowly, almost no corruption occurs. However, if I send the data without any throttling, it almost always gets corrupted. The Zybo board is connected directly to a 100 Mbit/s USB adapter.
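For readers who want to reproduce the test, here is a minimal sketch of the receive-side check in C. The framing is an assumption on my part: the post only says the checksum precedes the data, so the 4-byte big-endian header and the helper name payload_ok are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADLER_MOD 65521u  /* largest prime below 2^16, as used by Adler-32 */

/* Plain (unoptimized) Adler-32 over a byte buffer. */
uint32_t adler32(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + data[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/* Hypothetical framing: 4-byte big-endian Adler-32, then the payload.
 * Returns 1 if the payload matches the checksum in the header, else 0. */
int payload_ok(const uint8_t *msg, size_t len)
{
    if (len < 4) {
        return 0;
    }
    uint32_t expected = ((uint32_t)msg[0] << 24) | ((uint32_t)msg[1] << 16) |
                        ((uint32_t)msg[2] << 8)  |  (uint32_t)msg[3];
    return adler32(msg + 4, len - 4) == expected;
}
```

On the board, payload_ok would be called once the full message has been collected from the lwIP receive callback. A mismatch there means the payload was altered somewhere between the sender and the application buffer.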
I believe the issue might be related to the checksum offloading. Or maybe the handling of the DMA buffers has an error?
In lwipopts.h we have the following:
#define CHECKSUM_GEN_TCP 0
#define CHECKSUM_GEN_UDP 0
#define CHECKSUM_GEN_IP 0
#define CHECKSUM_CHECK_TCP 0
#define CHECKSUM_CHECK_UDP 0
#define CHECKSUM_CHECK_IP 0
#define LWIP_FULL_CSUM_OFFLOAD_RX 1
#define LWIP_FULL_CSUM_OFFLOAD_TX 1
If I set CHECKSUM_CHECK_TCP to 1 by hand, the errors disappear completely. So this seems to be a fairly good workaround, even though it might cause a lot of CPU load.
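For reference, the manual change amounts to the following edit in the generated lwipopts.h. This is only a sketch of the workaround: the other checksum macros stay as generated, and the edit is lost whenever the BSP regenerates the file.

```c
/* lwipopts.h (generated by the BSP; a manual edit is overwritten on regeneration) */
#undef  CHECKSUM_CHECK_TCP
#define CHECKSUM_CHECK_TCP 1  /* have lwIP verify the TCP checksum in software on receive */
```

With this setting lwIP checks the TCP checksum of every incoming segment itself and drops the ones that fail, so the sender retransmits them instead of corrupted data reaching the application.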
However, lwipopts.h is frequently overwritten by the XSDK (cleaning the project, modifying the BSP settings). As long as the problem persists, I need a reliable way to enable the workaround. When I review the BSP's settings, all the checksum offloading settings are disabled. But the options seem to apply to AXI Ethernet only. I'm using the controller in the Zynq's PS, not an AXI controller in the PL. Unfortunately, as was pointed out here before, the editor of the lwip options is quite limited. So I don't seem to have any say in the value of the CHECKSUM_CHECK_* values above.
Is Xilinx aware of the problem? If not here, where can I file a proper bug report?
This is quite a severe problem, since it affects most users of lwIP+TCP/IP on the Zynq 7000.
09-14-2018 05:05 PM
I made sure that I have -O0 in the compiler options of the BSP. There was no improvement. The data still gets corrupted.
09-15-2018 01:33 PM - edited 09-16-2018 06:16 AM
Turns out that the problem is related to the size of the TCP segments.
With the default TCP settings, the tcp_wnd setting is pretty small: 2048. This results in TCP segments not being any larger than 1024, because the host (a Linux PC in my case) won't send TCP segments larger than half the window size. I don't like such a small TCP window, so I had changed the setting to 0x8000. After all, such a small window will impact performance.
Since I changed the tcp_wnd parameter to 0x8000 and the default tcp_mss setting is 1460, the TCP segments became larger than 1024 bytes and I saw lots and lots of checksum errors.
If I set tcp_mss to 1024 to artificially limit the size of TCP segments, then the checksum problems disappear - even if I use the larger tcp_wnd value. So setting the tcp_mss to something small (like 1024) seems to be a good workaround.
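The relationship I observed between tcp_wnd, tcp_mss and the resulting segment size can be sketched in C as follows. The "half the window" cap is what I saw from my Linux host, not something I verified in a specification, and the function name max_segment_size is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Largest TCP segment the host is expected to send, assuming it caps
 * segments at half the advertised window (observed behavior, see above). */
uint32_t max_segment_size(uint32_t tcp_wnd, uint32_t tcp_mss)
{
    assert(tcp_mss > 0);
    uint32_t half_wnd = tcp_wnd / 2;
    return (tcp_mss < half_wnd) ? tcp_mss : half_wnd;
}
```

With the defaults, max_segment_size(2048, 1460) gives 1024. With the larger window, max_segment_size(0x8000, 1460) gives 1460, which is the case where the checksum errors showed up. Setting tcp_mss to 1024 brings it back to 1024 even with the 0x8000 window.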
Update: The problem persists. Data corruption still occurs about once every 600 MB, even with an MSS of 1024.
Update 2: An even smaller MSS value (768) seems to reduce the likelihood of data corruption even further. I was able to transfer gigabytes without data corruption. However, the underlying bug is still there, and this is only a workaround.
09-18-2018 09:06 AM
09-18-2018 09:33 AM - edited 09-18-2018 09:33 AM
I will try that fix within a few hours. Thanks for pointing that thread out.
How can I keep the XSDK from overwriting the fixed version of the files?
09-18-2018 12:38 PM - edited 09-18-2018 01:00 PM
The solution discussed in https://github.com/Xilinx/embeddedsw/issues/53 fixes the problem described above as well.
I edited the file SDK/2018.2/data/embeddedsw/ThirdParty/sw_services/lwip202_v1_1/src/contrib/ports/xilinx/netif/xemacpsif_dma.c of my Xilinx installation to permanently apply the fix.