Observer
2,334 Views
Registered: ‎05-27-2016

FreeRTOS lwip driver cache issues in 2018.2


Hello everyone,

 

We are seeing an odd issue with FreeRTOS and the Xilinx port of lwIP 2.0.2 in XSDK 2018.2 under high throughput. Older versions of XSDK may also be affected.

We started seeing corrupted incoming TCP/IP data and, after a week of debugging, traced it to cache incoherency.

Once we knew what to look for, we found a forum post from 2016 where other people complained about the same thing:

https://forums.xilinx.com/t5/Embedded-Development-Tools/SDK-2016-1-and-standalone-v5-4-with-LWIP-possible-cache-problems/td-p/698789

 

Unfortunately, their solution didn't work for us. Maybe things changed in later versions of Vivado.

 

We are using a Z7020 and the hard PS EMAC for our Ethernet needs, with a 1000BASE-T link.

Disabling data caches makes the problem go away and all symptoms seem to match with cache issues.

 

The main issue is that when hardware checksum offload is in use, lwip skips the software check, so every frame received from DMA is treated as correct. The TCP/IP socket interface never reports an error, and the data is corrupted without the user realizing it.
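For readers unfamiliar with what the software check computes, here is a minimal sketch of the RFC 1071 ones'-complement checksum used by IP, TCP and UDP. It is illustrative only: `inet_checksum_sketch` is a hypothetical name, this is not lwIP's own implementation, and the TCP pseudo-header that a real TCP check also covers is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative RFC 1071 checksum; hypothetical helper, not lwIP code.
 * Intended for frame-sized buffers (the 32-bit accumulator is not
 * guarded against overflow on very large inputs). */
static uint16_t inet_checksum_sketch(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    /* A trailing odd byte is treated as the high byte of a final word. */
    if (len == 1) {
        sum += (uint32_t)data[0] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

Running the function over data that already includes a correct checksum field yields 0, and a single corrupted byte makes the result non-zero. That is the kind of mismatch a software check can catch when the hardware has only validated the frame as it arrived on the wire.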

 

Here is the easiest way to replicate the problem that we could find:

1) Create a fresh BSP with FreeRTOS, lwip202 and libmetal.

 

2) In lwip202 settings change tcp_options->tcp_wnd from the default 2048 to 4096. The lwip documentation says this value should be at least 2 times tcp_mss anyway, so the default is wrong. With the default value the problem does not show up.
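The window rule in step 2 can be written as a one-line check. This is only a sketch: `tcp_wnd_is_sane` is a hypothetical helper, and the MSS of 1460 in the usage note is an assumed value that the post does not state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: lwIP's guidance is that the TCP window
 * should be at least 2 * TCP_MSS. */
static bool tcp_wnd_is_sane(uint32_t tcp_wnd, uint32_t tcp_mss)
{
    return tcp_wnd >= 2u * tcp_mss;
}
```

Assuming an MSS of 1460, `tcp_wnd_is_sane(2048, 1460)` is false and `tcp_wnd_is_sane(4096, 1460)` is true, which matches the change made in this step.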

 

3) Still in lwip202 settings change the api_mode to SOCKET API.

 

4) Trickiest part of the process: the easiest way to detect the problem is to have lwip do software checksums and drop bad frames, which lets us break in the code or count the dropped frames with lwip statistics.

By default, if the PS EMAC is being used, the generated lwip options header has software checksums disabled and there is no way to force that option in the BSP settings. This might not be the case for PL MACs.

The reason is that checksums are computed in the PS EMAC and the software assumes all received frames have already been checked, which makes sense.

To force software checks, I modified line 745 in the lwip202.tcl script that generates the lwip options:

https://github.com/Xilinx/embeddedsw/blob/master/ThirdParty/sw_services/lwip202/data/lwip202.tcl#L745

This file can be found in the local installation folder, which in my case (Linux) is:

/opt/Xilinx/SDK/2018.2/data/embeddedsw/ThirdParty/sw_services/lwip202_v1_1/data/lwip202.tcl

Change the default value from 0 to 1 and regenerate the BSP. lwip will then start verifying the checksums of incoming frames.

WARNING: Don't forget to change this value back, since it will affect all other projects too if their BSPs get regenerated.

 

5) Create a new Application using this BSP and pick the Xilinx TCP iperf demo. Everything should compile without issues.

 

6) Debug it on the target hardware and set a breakpoint on line 150 of the tcp_in.c file found in the generated BSP folder:

[bsp_name]/ps7_cortexa9_0/libsrc/lwip202_v1_1/src/lwip-2.0.2/src/core/tcp_in.c

This part of the code should NOT be greyed out in XSDK. If it is, software checksum is still disabled.

https://github.com/Xilinx/embeddedsw/blob/master/ThirdParty/sw_services/lwip202/src/lwip-2.0.2/src/core/tcp_in.c#L150

 

7) Finally, run iperf from the host machine and watch execution stop at that breakpoint.

e.g.: iperf -c 192.168.250.243 -i 5 -t 60 -w 64k

 

We believe that this problem is affecting other users without them being aware.

 

Thanks in advance for any help in fixing it!

 

Adventurer
2,303 Views
Registered: ‎09-19-2016

Hello,

 

We also had some corrupted data on the receive end, but we have been using lwIP RAW in a bare-metal application. lwIP's pbuf_copy_partial function copies data from the TCP stack buffers to application buffers. It is also used inside lwip_recv (that is, lwip_recvfrom), which is what a FreeRTOS + lwIP socket application calls. We modified pbuf_copy_partial by adding this line of code:

Xil_DCacheInvalidateRange((UINTPTR)p->payload, buf_copy_len);

immediately before the memcpy line. Once we added this, we didn't have any problems with data corruption.
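The ordering described here (invalidate the source range, then copy) can be sketched in a host-runnable form. Everything named below is hypothetical: `cache_invalidate_fn`, `copy_after_invalidate` and `recording_invalidate` are illustration only, and on target the hook would be a thin wrapper around Xil_DCacheInvalidateRange rather than the recording stub.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook type; on target it would wrap Xil_DCacheInvalidateRange(). */
typedef void (*cache_invalidate_fn)(uintptr_t addr, size_t len);

/* Invalidate the source range first, so the copy that follows reads
 * what DMA wrote to RAM rather than stale cache lines. */
static void copy_after_invalidate(void *dst, const void *src, size_t len,
                                  cache_invalidate_fn invalidate)
{
    if (invalidate != NULL) {
        invalidate((uintptr_t)src, len);
    }
    memcpy(dst, src, len);
}

/* Host-side stand-in that only records what it was asked to invalidate. */
static uintptr_t last_addr;
static size_t last_len;
static unsigned call_count;

static void recording_invalidate(uintptr_t addr, size_t len)
{
    last_addr = addr;
    last_len = len;
    call_count++;
}
```

Passing the invalidate routine in as a parameter keeps the sketch testable off-target. It does not change the point of the modification, which is that the invalidate must come before the memcpy, not after it.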

 

Even though we modified an lwIP function, this still looks more like a problem in the Xilinx Ethernet drivers.

 

Best regards,
Nenad

Observer
2,291 Views
Registered: ‎05-27-2016

Hi Nenad,

 

Thank you for your suggestion. Unfortunately that didn't work for us.

In a 60-second iperf test we still see lwip dropping frames due to bad checksums, while the PS EMAC, as expected, continues to report no errors:

 

Bytes recv: 1908840353 (1.78 GiB)
Checksum Err (emac): 0
Checksum Err (lwip): 1630

 

I agree with your observation that this seems to be a driver issue.

 

Cheers

Adventurer
2,286 Views
Registered: ‎09-19-2016

We have disabled all lwIP checksums, by the way.

How do you count those lwIP checksum errors?

Observer
2,278 Views
Registered: ‎05-27-2016

Hi Nenad,

 

If you are using lwip through the BSP with the PS EMAC, you will need to follow step 4 of my first post. If you are not using the PS EMAC, you can enable software checksums in BSP settings -> lwip202 -> temac_adaptor_options by setting all the [*]checksum_offload options to false.

If you are not using lwip through the BSP, open lwipopts.h and set CHECKSUM_CHECK_TCP to 1, adding the define if it is missing.

This enables the software checksum check.

 

To enable statistics you can also do it on the BSP: lwip202->stats_options->lwip_stats to true.

Or set LWIP_STATS and TCP_STATS to 1 in lwipopts.h.
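For the case where lwIP is not configured through the BSP, the three options named above would look roughly like this in lwipopts.h. This is a fragment only; the rest of the file and any conflicting defines generated elsewhere are not shown.

```c
/* lwipopts.h fragment: options named in this post, nothing else. */
#define CHECKSUM_CHECK_TCP 1  /* verify incoming TCP checksums in software */
#define LWIP_STATS         1  /* enable the lwIP statistics module */
#define TCP_STATS          1  /* keep the TCP counters, including chkerr */
```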

 

If you do both, you just need to include <lwip/stats.h> in your code and periodically check lwip_stats.tcp.chkerr.
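A minimal sketch of the periodic check follows, written so it can run off-target. It assumes the counter is 16 bits wide, which is lwIP's STAT_COUNTER width unless LWIP_STATS_LARGE is set as far as I know. `stat_counter_t`, `chkerr_monitor` and `chkerr_sample` are hypothetical names, not lwIP API.

```c
#include <stdint.h>

/* Assumption: 16-bit counter (lwIP's default STAT_COUNTER width). */
typedef uint16_t stat_counter_t;

struct chkerr_monitor {
    stat_counter_t last;  /* counter value at the previous sample */
    uint32_t total;       /* errors accumulated across all samples */
};

/* Returns the number of new errors since the previous sample.
 * The modulo-2^16 subtraction stays correct across one counter wrap. */
static uint16_t chkerr_sample(struct chkerr_monitor *m, stat_counter_t now)
{
    uint16_t delta = (uint16_t)(now - m->last);
    m->last = now;
    m->total += delta;
    return delta;
}
```

On target, a periodic task would call something like `chkerr_sample(&mon, lwip_stats.tcp.chkerr)` and log any non-zero return value. Sampling more often than the counter can wrap keeps the accumulated total meaningful.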

 

You won't see lwip_stats.tcp.chkerr increment if software checks didn't get enabled, so I recommend checking whether this code in tcp_in.c is being compiled:

https://github.com/Xilinx/embeddedsw/blob/master/ThirdParty/sw_services/lwip202/src/lwip-2.0.2/src/core/tcp_in.c#L150

 

Best

Observer
2,220 Views
Registered: ‎05-27-2016

Sorry for bumping the thread, but we are still facing this issue, and it is likely to affect other users who are not aware of it. Thanks!

Xilinx Employee
2,129 Views
Registered: ‎02-20-2014

The cache issue could be due to prefetching. As a quick experiment:

a) Try adding DCacheInvalidate() after the DMA transaction has completed, before the buffer is passed to the upper layer.
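To make the suspected mechanism concrete, here is a toy host-side model of a non-coherent cache: a line that is pulled into the cache before DMA rewrites the buffer keeps serving the old bytes until it is invalidated. This is a simplified illustration under stated assumptions (32-byte lines, no write path, no bounds checks), not a model of the actual Cortex-A9 cache or of the Xilinx driver. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u
#define NUM_LINES 4u
#define MEM_SIZE  (LINE_SIZE * NUM_LINES)

/* Toy model: RAM plus a cached copy with one valid bit per line.
 * Callers must keep addresses within MEM_SIZE. */
struct toy_system {
    uint8_t ram[MEM_SIZE];
    uint8_t cache[MEM_SIZE];
    bool valid[NUM_LINES];
};

/* CPU read: served from the cache when the line is valid,
 * otherwise the line is filled from RAM first. */
static uint8_t cpu_read(struct toy_system *s, uint32_t addr)
{
    uint32_t line = addr / LINE_SIZE;
    if (!s->valid[line]) {
        memcpy(&s->cache[line * LINE_SIZE], &s->ram[line * LINE_SIZE], LINE_SIZE);
        s->valid[line] = true;
    }
    return s->cache[addr];
}

/* DMA write: goes straight to RAM and never touches the cache. */
static void dma_write(struct toy_system *s, uint32_t addr,
                      const uint8_t *src, uint32_t len)
{
    memcpy(&s->ram[addr], src, len);
}

/* Invalidate every line that overlaps [addr, addr + len). */
static void cache_invalidate(struct toy_system *s, uint32_t addr, uint32_t len)
{
    if (len == 0) {
        return;
    }
    for (uint32_t line = addr / LINE_SIZE;
         line <= (addr + len - 1u) / LINE_SIZE; line++) {
        s->valid[line] = false;
    }
}
```

In this model, a `cpu_read` that happens before a second `dma_write` to the same buffer plays the role of the prefetch: later reads return the first frame's bytes until `cache_invalidate` is called on that range, after which they return what DMA actually wrote.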

 

Please keep us posted on the results.

 

Thanks,

Radhey

Observer
2,183 Views
Registered: ‎05-27-2016

Radhey,

 

I opened an issue on Github (https://github.com/Xilinx/embeddedsw/issues/53) and Harini suggested a code addition that seems to have solved the problem. I've opened a pull request for the fix after testing it.

 

Thanks for your help.

Xilinx Employee
2,069 Views
Registered: ‎02-20-2014

Thanks for the update. Good to hear that it fixed the cache issue.

 

Adventurer
741 Views
Registered: ‎04-27-2011

I've encountered this in bare metal on an UltraScale+ using a single Cortex-R5 and 2018.2. Enabling lwIP software TCP checksumming, which I hoped would trigger a re-send of any packet containing locally corrupted data, didn't fix the problem. I _think_ this is because the checksumming code is the first to pull the payload through the data cache, so the data is clean and checksums OK that first time. The data then gets corrupted in cache later, before it is presented to my callback.

On a 1G ethernet link the corrupted payload occurs roughly once every 250MBytes of data RX'ed. It occurs more frequently on slower links.

This forum post had the fix that worked for me: call Xil_DCacheFlushRange() directly before accessing the payload in the callback.

Stacey
