Observer
11,450 Views
Registered: 10-20-2014

LWIP flips bytes when receiving large amounts of data via TCP.

 
Hello,
 
I'm investigating a strange problem when receiving large amounts of data (about 3 GB) with LWIP over TCP. The problem: LWIP sends TCP RST packets with byte-flipped port numbers, plus duplicate ACKs, roughly every 500-1000 ms.
This is what I caught with Wireshark while receiving data with LWIP; it happens about every half a second or so:
 


No.     Time     Source                Destination           Protocol Length Info
 150333 10.955   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140273001 Ack=1 Win=64240 Len=1400
 150334 10.956   192.168.1.10          192.168.1.200         TCP      60     7788 > 55210 [ACK] Seq=1 Ack=140267401 Win=65535 Len=0
 150335 10.956   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140274401 Ack=1 Win=64240 Len=1400
 150336 10.956   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140275801 Ack=1 Win=64240 Len=1400
 150337 10.956   192.168.1.10          192.168.1.200         TCP      60     7788 > 55210 [ACK] Seq=1 Ack=140270201 Win=65535 Len=0
#150338 10.956   192.168.1.10          192.168.1.200         TCP      60     27678 > 43735 [RST, ACK] Seq=1 Ack=3367293192 Win=65535 Len=0
 150339 10.956   192.168.1.10          192.168.1.200         TCP      60     7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
#150340 10.956   192.168.1.10          192.168.1.200         TCP      60     [TCP Dup ACK 150339#1] 7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
 150341 10.956   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140277201 Ack=1 Win=64240 Len=1400
 150342 10.956   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140278601 Ack=1 Win=64240 Len=1400
 150343 10.956   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140280001 Ack=1 Win=64240 Len=1400
#150344 10.956   192.168.1.10          192.168.1.200         TCP      60     [TCP Dup ACK 150339#2] 7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
#150345 10.956   192.168.1.10          192.168.1.200         TCP      60     [TCP Dup ACK 150339#3] 7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
#150346 10.956   192.168.1.10          192.168.1.200         TCP      60     [TCP Dup ACK 150339#4] 7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
 150347 10.956   192.168.1.200         192.168.1.10          TCP      1514   [TCP Fast Retransmission] 55210 > 7788 [PSH, ACK] Seq=140271601 Ack=1 Win=64240 Len=1460
 150348 10.956   192.168.1.10          192.168.1.200         TCP      60     [TCP Dup ACK 150339#5] 7788 > 55210 [ACK] Seq=1 Ack=140271601 Win=64135 Len=0
 150349 10.994   192.168.1.10          192.168.1.200         TCP      60     7788 > 55210 [ACK] Seq=1 Ack=140281401 Win=65535 Len=0
 150350 10.994   192.168.1.200         192.168.1.10          TCP      1454   55210 > 7788 [PSH, ACK] Seq=140281401 Ack=1 Win=64240 Len=1400
 
Please note that port number 27678 is the byte-flipped version of 7788, and 43735 is the byte-flipped version of 55210. While this bug doesn't break the TCP flow, I'd still rather have it fixed, because it might cause other problems and it has a negative impact on latency.
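For reference, both pairs are consistent with a plain 16-bit byte swap; a quick stand-alone check (nothing lwip-specific, just illustrating the swap):

#include <stdio.h>
#include <stdint.h>

/* swap the two bytes of a 16-bit value, like htons() does on a little-endian CPU */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

int main(void)
{
    printf("%u -> %u\n", 7788u,  (unsigned)swap16(7788));   /* 7788  -> 27678 */
    printf("%u -> %u\n", 55210u, (unsigned)swap16(55210));  /* 55210 -> 43735 */
    return 0;
}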
 
Test Setup:

  • ZC702 evaluation board
  • Xilinx SDK 14.4
  • LWIP as generated by the SDK BSP, configured as in xapp1026 for high-speed data transfer (e.g. a large TCP window)
  • Software (Zynq): standalone echo server from the SDK example with the following modifications:
    • port changed from 7 to 7788
    • in recv_callback(): the received buffer is copied to some location in DDR3 RAM instead of being echoed back to the client
  • Software (PC): a TCP sender (Winsock) that connects to the port and continuously spams the server with data (a minimal sketch is shown after this list)

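Roughly what the PC-side sender does, as a minimal sketch (plain BSD sockets here instead of the actual Winsock code, which only differs in setup/teardown details; IP address and port match the capture above):

/* Minimal sketch of the PC-side sender: connect to the board and stream data forever. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char buf[1400];                              /* same payload size as in the capture */
    memset(buf, 0xA5, sizeof(buf));

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(7788);          /* server port used in the test */
    addr.sin_addr.s_addr = inet_addr("192.168.1.10");

    if (s < 0 || connect(s, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        return 1;

    for (;;)                                     /* continuously spam the server */
        if (send(s, buf, sizeof(buf), 0) < 0)
            break;

    close(s);
    return 0;
}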
   
In the echo server, only recv_callback() was changed (and the port number was set to 7788):
 

static char* rxbuf = (char*)(310*1024*1024); // use buffer at 310 MB
err_t recv_callback(void *arg, struct tcp_pcb *tpcb,
                               struct pbuf *p, err_t err)
{
    /* do not read the packet if we are not in ESTABLISHED state */
    if (!p) {
        tcp_close(tpcb);
        tcp_recv(tpcb, NULL);
        return ERR_OK;
    }
 
    if( !p->tot_len ) {
        for(;;);
    }
 
    /* copy the whole pbuf chain */
    int copy_len = pbuf_copy_partial(p, rxbuf, p->tot_len, 0);
    //int copy_len = p->tot_len;
 
    if( copy_len != p->tot_len ) {
        for(;;);
    }
 
    /* indicate that the packet has been received */
    tcp_recved(tpcb, p->tot_len);
 
    /* free the received pbuf */
    pbuf_free(p);
 
    return ERR_OK;
}

 
Strangely, when I comment out the call to pbuf_copy_partial(), everything works fine: no byte-flipped RST packets. But as soon as I actually do something with the received data (i.e. copy it), LWIP generates these wrong RST packets.
 
The problem seems to be correlated with recv_callback() being called with a chained pbuf, i.e. one where p->len != p->tot_len. It took me some time to figure out that I can't just memcpy p->payload up to length p->len like the echo server example suggests, because from time to time, depending on memory conditions, recv_callback() gets multiple chained pbufs.
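For illustration, handling the chain by hand would look roughly like this (a sketch of what pbuf_copy_partial() already does for you, with a hypothetical helper name):

#include <string.h>
#include "lwip/pbuf.h"

/* Sketch: copy a possibly chained pbuf into a flat buffer by walking p->next.
 * Each element q carries q->len bytes; p->tot_len is the length of the whole chain. */
static u16_t copy_pbuf_chain(const struct pbuf *p, char *dst)
{
    u16_t copied = 0;
    const struct pbuf *q;

    for (q = p; q != NULL; q = q->next) {
        memcpy(dst + copied, q->payload, q->len);
        copied += q->len;
    }
    return copied;   /* should equal p->tot_len */
}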
 
Now I wonder: has anyone ever run a TCP-receiving application that receives more than just a few bytes from the client (unlike, say, a web server) and does not discard the data (unlike rxperf in xapp1026)?
 
And aren't the echo server and rxperf wrong in acknowledging only p->len bytes instead of p->tot_len, or is there a reason why no chained pbufs are to be expected in an echo server or rxperf application?
 

23 Replies
Visitor
11,125 Views
Registered: 12-14-2013

Hi, did you get your TCP/IP data transmission working? If so, can you please post your code so I can compare it with mine? I want to send data to an FPGA Ethernet port and store that data in DDR RAM. The problem is that I do not receive all the data, and I do not know why, because I think I did everything right. I used xapp1026 as a source of inspiration for my code.
Observer
10,991 Views
Registered: 10-20-2014

The code is right there in my first post. Everything else is the same as in any other example, e.g. the Webserver Example in xapp1026. The trick is to use pbuf_copy_partial() instead of using pbuf->payload and pbuf->len.

 

However, the real problem is still not solved:

After about 400 MB have been received, I still get TCP RST packets with byte-flipped port numbers and a 250 ms hiccup until TCP recovers the connection. All this on a short, direct Ethernet connection where no transmission errors are to be expected.

 

I traced the problem a bit from the netif function into lwip, with these results:

  • Suddenly a partly corrupt pbuf is received. The problem looks like this (XX marks data corruption):

 

|--ETH--|---IP---|-----TCP----|------DATA-------------------|
___________________________XXXXXXXXXX________________________
  • Part of the TCP header (the ports) and some payload data is corrupted: in the TCP header it's byte-flipped port numbers, while the first few data bytes are 'normally' corrupted, not just byte-flipped.
  • LWIP code responds to this erroneous packet by sending TCP-RST.

What is very strange is that the problem seems to be caused by a _READ_ access to the pbuf->payload memory in the TCP receive callback, because:

  • If I copy the pbuf with pbuf_copy_partial(), or with memcpy of pbuf->payload, packets get corrupted.
  • If I just read pbuf->payload (to calculate the sum of the bytes, see the sketch below), packets get corrupted.
  • If I copy some other, unrelated 1400-byte buffer to another location in recv_callback (to create a similar delay and memory access pattern), then no error occurs.

All this with a very bare application that does only the basic initialization to set up a listening TCP server for receiving data.
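The read-only test from the second bullet is essentially just this, called from recv_callback() instead of the copy (sketch):

#include "lwip/pbuf.h"

/* Sketch: sum the payload bytes of the first pbuf, without copying anything.
 * Even this read-only access was enough to trigger the corruption. */
static u32_t sum_payload_bytes(const struct pbuf *p)
{
    const u8_t *d = (const u8_t *)p->payload;
    u32_t sum = 0;
    u16_t i;

    for (i = 0; i < p->len; i++)
        sum += d[i];
    return sum;
}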

 

So does anyone have an idea why _reading_ from the pbuf->payload buffer could cause corruption?

 

Later I might post the full sdk test project to trigger this problem...

 

Observer
10,976 Views
Registered: 10-20-2014

Update: when I disable all memory caches, the problem disappears (and the data rate drops to 130 Mbit/s).

This could be a problem in the lwip driver; maybe it's not invalidating the caches as required.
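If that is the case, the missing piece would be an invalidate of each RX buffer before it is handed to the DMA, roughly like this (a sketch of the kind of call that would be needed, not the actual driver code):

#include "xil_cache.h"

/* Sketch: before giving a freshly allocated RX pbuf to the EMAC DMA, make sure
 * no stale cache lines cover its payload, then program the buffer descriptor. */
Xil_DCacheInvalidateRange((unsigned int)p->payload, XEMACPS_MAX_FRAME_SIZE);
XEmacPs_BdSetAddressRx(rxbd, (unsigned int)p->payload);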

Explorer
9,306 Views
Registered: 08-21-2013

bpelger.astyx,

 

Not sure if you are still having an issue, but you may want to look at my posts in response to this message:

 

https://forums.xilinx.com/t5/Embedded-Development-Tools/Zynq-Ethernet-driver-Sending-several-BdRings/td-p/450544/highlight/true

 

There were indeed issues with the cache flushing in the Xilinx drivers both for lwip and their core drivers. Nobody seemed very interested in fixing them at the time. I stopped using the tools and have no idea if the bugs are still present.

Observer
9,232 Views
Registered: 10-20-2014

Thanks for the link, I'll check that thread out.

 

As for the problem: no, I didn't solve it, I just accepted it. TCP recovers from this error with a small delay; at least in 2014.4 it does.

 

I've also tested it in Vivado 2015.1 with the test setup I described above, and the problem *seems* to be gone, with the emphasis on *seems*. With a different application, the problem is actually worse, with corrupt data arriving in the TCP stream. It is obviously dependent on the application's memory access pattern.

 

So I just stuck with 2014.4, since I don't have time to hunt Xilinx bugs or to prepare another test case. They would ignore it anyway, like before.

 

 

 

Explorer
9,205 Views
Registered: 08-21-2013

Good luck.

 

This link explains it as well:

 

https://forums.xilinx.com/t5/Embedded-Development-Tools/Bug-s-in-Xil-DCacheInvalidateRange-in-Standalone-v-3-11-a/m-p/464728#M31641

 

also:

 

https://forums.xilinx.com/t5/Embedded-Development-Tools/Why-does-lwip-not-use-proper-TCP-MSS/m-p/461118

 

The usual no response from Xilinx. On the hardware side they seem very responsive, and the SDK is far superior as a tool to the competitor's in terms of usability. But when it comes to the actual quality of the low-level drivers etc., there seems to be very little interest, effort or testing. Putting parentheses around macro arguments is C programming 101.

 

There was clearly no regression testing on the cache invalidate functions. I cut and pasted them into a C program, fed a large range of arguments into them, and the bugs showed themselves very quickly.

 

Again, I stopped using the tools and don't know if these things are still an issue.

 

FreeRTOS is releasing a Xilinx port of their own TCP/IP stack. It might be worth a look.

 

 

Adventurer
9,153 Views
Registered: 11-05-2014

Putting parentheses around macro arguments is C programming 101.

So is not accessing beyond the bounds of an array, and yet LWIP is doing just this...

https://forums.xilinx.com/t5/Embedded-Development-Tools/LWIP-Echo-Server-example-with-2nd-ethernet/m-p/662632#M38305

The usual no response from Xilinx
ditto.
Explorer
8,847 Views
Registered: 08-21-2013

On the hardware side, they are very responsive (even to "dumb" questions). When it comes to embedded tools, even bugs handed to them on a silver platter are ignored. All very odd.

 

I had hoped the new Zynq MPSoC would have cache-coherent peripherals (as Altera does), which would greatly improve performance and simplify drivers (i.e. no cache flushing), but no such luck.

Visitor
7,906 Views
Registered: 03-11-2016

To fix this bug, just flush the cache:

#include "xil_cache.h" //Added

static char* rxbuf = (char*)(310*1024*1024); // use buffer at 310 MB
err_t recv_callback(void *arg, struct tcp_pcb *tpcb,
                               struct pbuf *p, err_t err)
{
    /* do not read the packet if we are not in ESTABLISHED state */
    if (!p) {
        tcp_close(tpcb);
        tcp_recv(tpcb, NULL);
        return ERR_OK;
    }
 
    if( !p->tot_len ) {
        for(;;);
    }
   
    Xil_DCacheFlushRange((unsigned int)p->payload, p->len); // Added

    /* copy the whole pbuf chain */
    int copy_len = pbuf_copy_partial(p, rxbuf, p->tot_len, 0);
    //int copy_len = p->tot_len;
 
    if( copy_len != p->tot_len ) {
        for(;;);
    }
 
    /* indicate that the packet has been received */
    tcp_recved(tpcb, p->tot_len);
 
    /* free the received pbuf */
    pbuf_free(p);
 
    return ERR_OK;
}

Visitor
4,849 Views
Registered: 09-05-2012

minhquang

Thanks! Adding the Xil_DCacheFlushRange() call works great!

Observer
4,835 Views
Registered: 10-20-2014

Thanks, I can also confirm that this fixes the bug.

 

Though is it a complete solution?

The pbuf is a linked list.

If we flush only p->payload, then what about p->next->payload or p->next->next->payload?

Sometimes p->next is not NULL, and that case has to be dealt with (that's why we need pbuf_copy_partial() instead of memcpy).

I did some more thorough tests on this, though, and found that the bug really is fixed by flushing only p->payload.

 

I just want to understand why this is sufficient. Maybe because tcp_in.c only modifies the TCP header, which is normally in the first p->payload?
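If one wanted to be safe for the chained case as well, the flush could simply be applied to every element of the chain (a sketch; in my tests flushing only the first element was already enough):

/* Sketch: flush every pbuf in the chain, not just the first one. */
struct pbuf *q;
for (q = p; q != NULL; q = q->next)
    Xil_DCacheFlushRange((unsigned int)q->payload, q->len);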

4,809 Views
Registered: 07-11-2011

minhquang thank you sooooooooo much!  You saved my day ;)

 

I had been fighting a problem that had been plaguing me for months and was looking deeper today. When using UDP receive, occasionally 32 bytes of a new packet would contain old packet data (and the location would move around randomly). I had been trying all kinds of stuff to figure out the cause. I suspected caching, and was diving into the Xilinx drivers... xemacpsif_dma and such. The Ethernet on this project has been so rough with this SoC: lots of problems when trying to do more than run a single Ethernet port with the given Xilinx software, and lots of hacks needed to get things working (including modifications to the tcl script that generates lwip support).

 

Thanks again minhquang!

Adventurer
4,148 Views
Registered: 11-05-2014

@ewong3 wrote:
So is not accessing beyond the bounds of an array, and yet LWIP is doing just this...

https://forums.xilinx.com/t5/Embedded-Development-Tools/LWIP-Echo-Server-example-with-2nd-ethernet/m-p/662632#M38305

 

So it appears that the message I originally linked to above has been deleted. Note the red message at the top: "The message you are trying to access is permanently deleted." It now makes me look like I wasn't responsive to someone trying to help, but I had actually posted a fix to my problem that showed a glaringly dumb bug in Xilinx's lwip stack. I sure would like an explanation of why it was deleted.

Visitor
2,005 Views
Registered: 08-18-2017

I have had the same problem getting data corruption under heavy rx load. I initially thought it was because of cache problems, but it turned out to have nothing to do with the cache at all.

Many people suggest that adding a call to Xil_DCacheFlushRange() fixes the problem, but all it does is mask the real problem. In fact, I can't work out how it would fix the problem at all, and it adds a lot of unnecessary waiting time to your rx code.

The actual problem is that the Xilinx driver currently hands over control of the rx descriptors to the hardware BEFORE it has told the descriptor where to DMA the data.

To fix the problem, move the call to XEmacPs_BdRingToHw() in the setup_rx_bds() function so that it (and its associated status checking) is called AFTER the call to XEmacPs_BdSetAddressRx(). Also move the dsb() so that it happens just before XEmacPs_BdRingToHw().
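In other words, the intended order inside setup_rx_bds() becomes roughly the following (a sketch of the sequence, not a drop-in patch):

/* Sketch: program the BD completely before handing it to the hardware. */
status = XEmacPs_BdRingAlloc(rxring, 1, &rxbd);
/* ... error handling and clearing/setting the wrap bit as before ... */
XEmacPs_BdSetAddressRx(rxbd, (UINTPTR)p->payload);
rx_pbufs_storage[index + bdindex] = (UINTPTR)p;
dsb();                                  /* make the BD contents visible before the hand-over */
status = XEmacPs_BdRingToHw(rxring, 1, rxbd);
/* ... status checking for XEmacPs_BdRingToHw moves here as well ... */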

There is a rather amusing comment on the XEmacPs_BdRingToHw() function:

"Any changes made to these BDs after this point will corrupt the BD list leading to data corruption and system instability."

Perhaps Xilinx should have paid more attention to their own comments?

Observer
1,976 Views
Registered: 10-20-2014

Hi jonnyd,

If that is correct, then that is an additional issue that also has to be corrected.

The actual issue from this thread is a different one, though:
It seems to boil down to the Xilinx LWIP driver (the "contrib" folder) freeing RX DMA buffers immediately in the RX ISR, before the LWIP stack and the application have had a chance to process the buffer/packet.
Therefore incoming RX packets could at any time overwrite packets that have not yet been processed.

I've discussed this with Simon Goldschmidt (LWIP developer) here:
http://lists.nongnu.org/archive/html/lwip-devel/2018-01/msg00128.html

I was complaining there about LWIP because it modifies received TCP packet headers, but Simon is right that this is actually a Xilinx problem: the application itself could at any time write to the RX packet.

I did the easy fix of flushing the cache at each exit of the tcp_input() function, but this is just a workaround.
An issue can still occur if the application processes rx packets significantly more slowly than they arrive.
In that case the EMAC is supposed to drop new incoming packets, but the Xilinx implementation will actually overwrite old packets instead, which is a lot worse.

I guess the best advice would be not to use Xilinx' LWIP implementation or to switch to Linux.

Visitor
1,970 Views
Registered: 08-18-2017

I saw your post on the lwip forum and have also replied to that but it hasn't been accepted yet.

With all due respect, I think you are mistaken about the Xilinx driver freeing DMA buffers before they are used by lwip. It is the buffer descriptors that are freed in the rx interrupt, not the pbufs themselves. This, as far as I can work out, is entirely correct, because when the rx interrupt is triggered the descriptor has done its job of telling the Ethernet DMA where to put the data.

In the rx interrupt, any free descriptors are assigned new pbufs from the rx pool, so there is no chance of them being assigned pbufs that are already in use.

I have just remembered that there was another, separate issue with the Xilinx driver that I came across a while ago; it caused all kinds of problems. It had to do with pbufs being allocated within an interrupt but freed outside of interrupts. When I get back to the office I will try to give you more details. I believe that if you solve these two problems with the driver, it ends up working pretty well.

Observer
1,947 Views
Registered: 10-20-2014

Hmm, OK, I might be mistaken and the pbufs are not freed in the rx interrupt.

But then I have no idea how to explain the issue I was observing: I caught LWIP at a breakpoint receiving packets with byte-flipped port numbers. In such a packet I also observed corrupt data in the first few bytes.
This looks very much like exactly one cache line (128 bytes?) being corrupted. Now, I know that LWIP byte-flips some TCP header fields in place.

To me it looks like an unlucky sequence of events:
- lwip byte-flips the port number in the TCP header of a received packet. This change is now waiting in the write cache.
- The lwip stack processes and frees the pbuf (probably without invalidating or flushing it?).
- The EMAC DMA overwrites this "free" memory with another rx packet.
- The cache controller now decides to flush the cache line and overwrites the TCP header and the first few bytes of the new packet.
- The RX interrupt is triggered and invalidates the cache lines for the packet, in effect making the corruption permanent.
Needless to say, this issue is hard to reproduce/debug, since it only happens in the rare case of this unlucky event sequence.

Some people suggest calling Xil_DCacheFlushRange() in the LWIP RX callback to flush the pbuf (which by then points to the data part). This works because it also flushes the TCP header, which sits in the same cache line as the first data byte of the packet. Where it fails to work is for packets that never reach a user RX callback, such as pure TCP ACKs.

Now, the problem you describe is about the DMA hardware getting the wrong destination address in the buffer descriptor. I don't see how this is related. How can it cause corruption in only the first cache line of a received packet?

You mention a second problem with pbufs being freed/allocated inside/outside interrupts. If this is also not related, then we would have three critical issues in this lwip driver.
Given that I started this thread a few years ago and Xilinx doesn't seem to care, the lesson for me is that Xilinx's focus is probably on the Linux drivers. Do not use bare metal if you need a working TCP stack.

 

Visitor
1,941 Views
Registered: 08-18-2017

I just spent about half an hour typing out a reply but something went wrong with the forum and I lost it all.

I can't spend another half an hour re-typing it so I'll summarise:

I'm going to make two assumptions:

You have set MEM_ALIGNMENT to 32 or a multiple of 32, so your pbuf payloads will always be aligned to a cache line.

You are using a version of the Xilinx SDK in which they have fixed the bug in Xil_DCacheInvalidateRange() where it flushes/invalidates more cache lines than it is supposed to.

 

I believe your unfortunate sequence of events is not possible, because in setup_rx_bds() the cache is marked invalid for the new pbuf payload. This means that if lwip wants to write to this memory, that area must be fetched from DDR into the cache before the write happens, so the processor's view of memory is correct at this point. There is no need to do any further flushes/invalidates, because the memory is not used by anything outside of the processor.

In step 2 you state that "the lwip stack processes and frees the pbuf", so as far as lwip and your application are concerned you are finished with that data. Therefore, if there were any cache problems after this point, they wouldn't matter.

 

Have you observed any corruption in any other part of the data you receive? My application receives multiple 128 KB chunks of data, and I can see corruption happening in different places for different chunks. I discovered the "Xilinx Tools" -> "Dump/Restore Data File" utility. It let me dump an entire 128 KB chunk of received data straight from memory to a file, which I could diff against the file I originally sent (using vbindiff).

I agree: out of the box the Xilinx driver is so buggy that I'm amazed anyone is able to make a working product from it. However, if you fix this and the other problems, it ends up being pretty stable.

 

Visitor
1,934 Views
Registered: 08-18-2017

OK, so I've just made a load of changes to my application and the problem is back!

I think I have tracked it down to Xilinx's implementation of a packet queue. It completely fails when used in interrupts. None of the fields of struct pq_queue_t are declared volatile, so how on earth this works unmodified is beyond me.

I have changed this (in xpqueue.h) to:

typedef struct {
  volatile void *data[PQ_QUEUE_SIZE];
  volatile int head;
  volatile int tail;
  volatile int len;
} pq_queue_t;

and it seems to work.

This might also explain why adding a call to Xil_DCacheFlushRange() fixed the problem for some people, because there is a memory barrier instruction in that function.

I notice in SDK 2018.3 these bugs are all still present.

Observer
1,927 Views
Registered: 10-20-2014

Hi johnnyd,

You should really type the answers in an editor first, just in case.

It seems that the bug you describe with the BD address is a newly introduced one that was not present in the version I was using (XSDK 2014.4, lwip 1.4.1).

Here is setup_rx_bds() from lwip202_v1_1 from XSDK 2018.2:

 

void setup_rx_bds(xemacpsif_s *xemacpsif, XEmacPs_BdRing *rxring)
{
	XEmacPs_Bd *rxbd;
	XStatus status;
	struct pbuf *p;
	u32_t freebds;
	u32_t bdindex;
	u32 *temp;
	u32_t index;

	index = get_base_index_rxpbufsstorage (xemacpsif);

	freebds = XEmacPs_BdRingGetFreeCnt (rxring);
	while (freebds > 0) {
		freebds--;
#ifdef ZYNQMP_USE_JUMBO
		p = pbuf_alloc(PBUF_RAW, MAX_FRAME_SIZE_JUMBO, PBUF_POOL);
#else
		p = pbuf_alloc(PBUF_RAW, XEMACPS_MAX_FRAME_SIZE, PBUF_POOL);
#endif
		if (!p) {
#if LINK_STATS
			lwip_stats.link.memerr++;
			lwip_stats.link.drop++;
#endif
			printf("unable to alloc pbuf in recv_handler\r\n");
			return;
		}
		status = XEmacPs_BdRingAlloc(rxring, 1, &rxbd);
		if (status != XST_SUCCESS) {
			LWIP_DEBUGF(NETIF_DEBUG, ("setup_rx_bds: Error allocating RxBD\r\n"));
			pbuf_free(p);
			return;
		}
		status = XEmacPs_BdRingToHw(rxring, 1, rxbd);
		if (status != XST_SUCCESS) {
			LWIP_DEBUGF(NETIF_DEBUG, ("Error committing RxBD to hardware: "));
			if (status == XST_DMA_SG_LIST_ERROR) {
				LWIP_DEBUGF(NETIF_DEBUG, ("XST_DMA_SG_LIST_ERROR: this function was called out of sequence with XEmacPs_BdRingAlloc()\r\n"));
			}
			else {
				LWIP_DEBUGF(NETIF_DEBUG, ("set of BDs was rejected because the first BD did not have its start-of-packet bit set, or the last BD did not have its end-of-packet bit set, or any one of the BD set has 0 as length value\r\n"));
			}

			pbuf_free(p);
			XEmacPs_BdRingUnAlloc(rxring, 1, rxbd);
			return;
		}
#ifdef ZYNQMP_USE_JUMBO
		if (xemacpsif->emacps.Config.IsCacheCoherent == 0) {
			Xil_DCacheInvalidateRange((UINTPTR)p->payload, (UINTPTR)MAX_FRAME_SIZE_JUMBO);
		}
#else
		if (xemacpsif->emacps.Config.IsCacheCoherent == 0) {
			Xil_DCacheInvalidateRange((UINTPTR)p->payload, (UINTPTR)XEMACPS_MAX_FRAME_SIZE);
		}
#endif
		bdindex = XEMACPS_BD_TO_INDEX(rxring, rxbd);
		temp = (u32 *)rxbd;
		if (bdindex == (XLWIP_CONFIG_N_RX_DESC - 1)) {
			*temp = 0x00000002;
		} else {
			*temp = 0;
		}
		temp++;
		*temp = 0;
		dsb();

		XEmacPs_BdSetAddressRx(rxbd, (UINTPTR)p->payload);
		rx_pbufs_storage[index + bdindex] = (UINTPTR)p;
	}
}

As you can see, the address bug you describe exists. Also the payload is invalidated correctly.

Now this is the version from 2014.4, where I have the workaround of flushing the TCP header at the tcp_input() exits:

 

void setup_rx_bds(XEmacPs_BdRing *rxring)
{
	XEmacPs_Bd *rxbd;
	XStatus Status;
	struct pbuf *p;
	unsigned int FreeBds;
	unsigned int BdIndex;
	unsigned int *Temp;

	FreeBds = XEmacPs_BdRingGetFreeCnt (rxring);
	while (FreeBds > 0) {
		FreeBds--;
		Status = XEmacPs_BdRingAlloc(rxring, 1, &rxbd);
		if (Status != XST_SUCCESS) {
			LWIP_DEBUGF(NETIF_DEBUG, ("setup_rx_bds: Error allocating RxBD\r\n"));
			return;
		}
		BdIndex = XEMACPS_BD_TO_INDEX(rxring, rxbd);
		Temp = (unsigned int *)rxbd;
		*Temp = 0;
		if (BdIndex == (XLWIP_CONFIG_N_RX_DESC - 1)) {
			*Temp = 0x00000002;
		}
		Temp++;
		*Temp = 0;

		p = pbuf_alloc(PBUF_RAW, XEMACPS_MAX_FRAME_SIZE, PBUF_POOL);
		if (!p) {
#if LINK_STATS
			lwip_stats.link.memerr++;
			lwip_stats.link.drop++;
#endif
			LWIP_DEBUGF(NETIF_DEBUG, ("unable to alloc pbuf in recv_handler\r\n"));
			XEmacPs_BdRingUnAlloc(rxring, 1, rxbd);
			dsb();
			return;
		}
		XEmacPs_BdSetAddressRx(rxbd, (u32)p->payload);
		dsb();

		rx_pbufs_storage[BdIndex] = (int)p;
		Status = XEmacPs_BdRingToHw(rxring, 1, rxbd);
		if (Status != XST_SUCCESS) {
			LWIP_DEBUGF(NETIF_DEBUG, ("Error committing RxBD to hardware: "));
			if (Status == XST_DMA_SG_LIST_ERROR)
				LWIP_DEBUGF(NETIF_DEBUG, ("XST_DMA_SG_LIST_ERROR: this function was called out of sequence with XEmacPs_BdRingAlloc()\r\n"));
			else
				LWIP_DEBUGF(NETIF_DEBUG, ("set of BDs was rejected because the first BD did not have its start-of-packet bit set, or the last BD did not have its end-of-packet bit set, or any one of the BD set has 0 as length value\r\n"));
			return;
		}
	}
}

 

As you can see the address bug is not present here, but the payload is *not* invalidated, so the cache bug is still present.


I vaguely remember that at the time I also tried Xil_DCacheInvalidateRange() at just this place, but the application crashed completely. This might be due to the bug you describe in Xil_DCacheInvalidateRange(); maybe that bug is present in 2014.4.

So you have the choice between the devil and the deep blue sea:
- take the old version 2014.4 and you get the cache invalidate bug;
- take the new version 2018.2 and you get the BD address bug (and probably lower performance due to invalidating the complete pbuf payload).

For now I'm staying with 2014.4 and my workaround, and that way I'm not affected by the new BD address bug.

As a side note, I'm also not happy with LWIP itself patching the headers of received TCP packets in place, for nothing but avoiding a few port-number byte swaps.
I consider that poor code as well, although it is technically not an error (Simon is right here: once the application gets the rx buffer, it can do with it whatever it wants, including overwriting it).

>> None of the fields of the struct pq_queue_t are declared volatile
Well, to my understanding, without "volatile" the compiler is free to optimize/delay/skip/reorder the memory accesses, but it doesn't have to. If most compilers don't do that for your access pattern, then it can work.
But take a compiler with better optimization, and then it breaks.
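A minimal illustration of what the optimizer is allowed to do without volatile (hypothetical code, not the actual driver):

/* 'len' is updated from an ISR. Without volatile the compiler may read it once,
 * keep it in a register, and turn this wait loop into an infinite loop. */
typedef struct {
    void *data[16];
    int head;
    int tail;
    int len;      /* deliberately not volatile, to show the problem */
} queue_t;

int wait_for_entry(queue_t *q)
{
    while (q->len == 0)
        ;                 /* may be compiled as: if (q->len == 0) for (;;); */
    return q->head;
}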

 

Advisor
1,912 Views
Registered: 10-10-2014

In case you're in doubt about cache issues, it can be useful to turn off caching completely and see if the problem disappears; at that point you can be (not completely, but) quite sure it's a caching issue.

I did some low-level Ethernet frame communication (without lwIP, just bare-metal code) a few years ago with a Zynq, and I remember having the same strange issues when I did not flush/invalidate things correctly at the right time. I discovered that by invalidating the cache completely.

 

** kudo if the answer was helpful. Accept as solution if your question is answered **
Observer
1,696 Views
Registered: 10-20-2014

I did try to turn off the cache completely, and then it works.

But that alone is not enough to tell whether it's a cache issue.
That's because without the cache, network throughput drops by a factor of about 10. When you debug something that only happens under full load, reducing the performance by a factor of 10 of course makes it work, even if it is not cache related.

Also, invalidating the cache completely when you only need to invalidate a single packet/buffer is a performance issue.

Visitor
1,647 Views
Registered: 08-18-2017

I thought I'd got rid of this problem but it has reared its ugly head again.

Yes, moving the rx pbuf pool into non-cached DDR makes the problem go away. But so does setting it to write-through, no write-allocate, so you still get most of the speed benefit of the cache without the corruption.

I have given up trying to understand why the corruption happens. I have fixed so many bugs in Xilinx's network interface code and am convinced that the cache invalidation is happening in the correct place, yet it still doesn't work without changing the TLB attributes.
