09-18-2018 01:32 PM
Does anyone ever look at the PCIe AXI interface bus using the Xilinx ILA logic analyzer for PCIe transactions?
I noticed that when I send a packet from the Host to my FPGA, the received transmission is fragmented into multiple PCIe packets instead of one. For example, if my payload is only 8 DW, the transmission consists of 4 small PCIe packets of 2 DW each instead of one packet of 8 DW.
Is this normal? Does PCIe fragment its packets during a transmission?
09-18-2018 02:04 PM
It's allowed by the PCIE spec, and often occurs in my experience.
It's not clear from your description if you're talking about PCIE Completion Packets or MWr Packets, but in any event breaking things up is allowed (and often required) to meet MAX READ REQUEST SIZE and/or MAX PAYLOAD SIZE values.
See this thread also:
09-18-2018 02:20 PM
Look at the assembly code for your host application. If it actually writes to the address assigned to the PCIe BAR with load or store instructions, the best you can hope for from a single PCIe packet is the architectural width of your processor.
For any x86 processor that isn't ancient, that is 64 bits, or 2 DW. For some not-so-old ARM or MIPS, that might only be 32 bits, or 1 DW. For writes to the FPGA you might be able to do Write Combining, but that is not trivial and doesn't work for every use case.
Without some sort of hardware to combine the transactions, it isn't PCIe that is fragmenting; it is the CPU.
That is why for high performance you need to use some sort of DMA engine. For 8 DW payloads it might not be worth the overhead, but if you can group a bunch together you can get much better performance by having the DMA operate on, say, 100 packets of 8 DW at a time.
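To make the store-width point concrete, here is a minimal C sketch. The function name copy_8dw and the bar pointer are hypothetical; on a real Linux system bar would come from mmap()ing the device's BAR through something like a UIO or custom driver. Assuming the compiler emits one 64-bit store per assignment, each store reaches the root complex as a separate 2 DW MWr TLP, so an 8 DW copy becomes four TLPs:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: 'bar' stands in for a pointer to the mmap()ed
 * PCIe BAR. Each 64-bit store below is issued by the CPU as its own
 * bus transaction, so the root complex emits one 2 DW MWr TLP per
 * store: an 8 DW payload becomes four separate TLPs. */
void copy_8dw(volatile uint64_t *bar, const uint32_t *src)
{
    for (int i = 0; i < 4; i++) {
        uint64_t qw;
        memcpy(&qw, &src[2 * i], sizeof qw); /* pack two DWs into one 64-bit word */
        bar[i] = qw;                         /* one 64-bit store -> one 2 DW TLP */
    }
}
```

This is exactly the pattern an optimized memcpy tends to collapse to on x86-64, which matches the four 2 DW packets the original poster observed.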
09-18-2018 03:46 PM
I am referring to PCIe Writes from the Host side.
Back to my example of 8 DW: my FPGA PCIe bus receives 4 fragmented packets, and they actually arrive out of order. Even though each fragmented packet has a corresponding PCIe address in its header, I have to concatenate them in the proper order as they are received.
It makes sense that the transfer could be fragmented, from what you described, but what causes it to be received out of order?
I'm only using the PCIe endpoint, but does the PCIe DMA/Subsystem IP perform all the concatenation and reordering and make all of this transparent to the user?
09-18-2018 03:58 PM
What is your PCIE host device? Another FPGA, or a processor of some sort?
What you're describing is entirely a function of the PCIE host device. If things were set up there to transmit a "single" TLP MWr request of 8 DW (with a single PCIE tag), then nothing at any lower level could cause those packets to be transmitted out of order and still be in spec. I could perhaps see something breaking them up at a lower level, but the PCIE spec would require that different TLPs with the same tag be transmitted in order.
The behavior you are describing would explicitly require (4) 2 DW PCIE TLPs to be built up, each with separate TAGs. This could only be occurring at the host side of things at some level. Is a DMA engine of some sort involved at the host side?
09-19-2018 03:02 PM
My FPGA communicates with a Linux PC containing an Intel core i3.
We wrote our own PCIe driver, and the SW side uses only "memcpy" to transfer the DW payload. I'm using the PCIe endpoint because only a few DW transfers are required in my application.
When I snoop the PCIe AXI bus during an 8 DW payload transfer from PC to FPGA, I do notice fragmentation, out-of-order delivery, and repeated TLPs.
For example, if my payload consists of 8 DW = A, B, C, D, E, F, 1, 2,
I would receive 5 pkts in the following order:
pkt1 = C, D : addr=0x8 (rcv out of order, but correct addr in the pcie header)
pkt2 = A, B : addr=0x0
pkt3 = E, F : addr=0xC
pkt4 = A, B : addr=0x0 (repeat)
pkt5 = 1, 2 : addr=0x18
Is this normal per the PCIe spec? Any idea why this could occur?
Do you know if the DMA for PCIe IP performs the reordering, strips the repeats, and concatenates the fragmented packets before presenting them to the user? I might start looking into this PCIe DMA IP instead of the PCIe endpoint.
09-19-2018 03:14 PM - edited 09-19-2018 03:19 PM
The low level "memcpy" is likely highly optimized. It may very well be doing a lot under the covers.
Since these are MemWr TLP packets, not completion packets, whatever code is underneath memcpy is what is breaking up your packets; it's NOT the low-level PCIE bus that is doing it, I believe.
It would be enlightening to gather the PCIE TAG information for each packet too (edited: I see the addresses).
As to REPEAT TLP packets: those shouldn't occur unless error conditions are present (i.e., hardly ever). That is highly unusual. And this type of error handling lives at the Data Link Layer of the PCIE spec (it's unclear whether your instrumentation is monitoring the TLP layer or one layer lower, at the Data Link Layer).
09-19-2018 03:40 PM
Like I replied before, you need to look at the actual assembly that results from your driver and app code. Compilers do a lot of things to optimize code that don't make sense at first glance, particularly when memcpy gets involved. Compilers expect everything to be in memory or caches, so they optimize for that unless you tell them not to, and what is optimal for memory controllers and caches is rarely optimal for a PCIe device.
You probably have to play with compiler flags and optimization settings. You might have to make software changes, such as using direct pointer assignments rather than a memcpy. You might also need to do special things to prevent the processor from reordering instructions.
The point is, you need to look at the software layers before assuming the AXI-PCIe bridge or the PCIe hardware is to blame. You can look at the PCIe DMA core, but if you keep your "bad" software, it won't help.
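A rough sketch of the "direct pointer assignment" idea, assuming a user-space mapping of the BAR (write_dws is a hypothetical helper, not a real kernel API; in-kernel code would use writel() and wmb() instead). The volatile stores keep the compiler from merging, reordering, or eliding the DW writes; the inline-asm statement is a GCC/Clang compiler barrier, and on weakly ordered architectures a real write barrier such as __sync_synchronize() would be needed as well:

```c
#include <stdint.h>

/* Hypothetical sketch: write 'n' DWs to a BAR mapping one at a time,
 * in program order. The volatile qualifier forbids the compiler from
 * combining or reordering the stores; the empty asm is a compiler
 * barrier (sufficient on x86, where stores are not reordered with
 * other stores). */
static inline void write_dws(volatile uint32_t *bar, const uint32_t *src, int n)
{
    for (int i = 0; i < n; i++) {
        bar[i] = src[i];                       /* one DW per store */
        __asm__ __volatile__("" ::: "memory"); /* compiler barrier */
    }
}
```

Note the trade-off: this guarantees ordering but produces one 1 DW TLP per store, so it is a correctness fix, not a throughput fix; grouping transfers behind a DMA engine remains the high-performance path.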
09-19-2018 03:44 PM
The tag in the PCIe header is the same for all the fragmented pkts; they are all zero.
I understand the repeats are due to errors... but the repeated pkt has the correct DWs, so could it be something else?
I'm using an extender ribbon cable for my PCIe card; I wonder if this is causing intermittent errors.
Have you ever used the DMA/Subsystem for PCIe IP? Does it reorder and concatenate the fragmented pkts?
I'll have the SW team look at what's underneath the memcpy call and determine if it is doing this.
Thanks for all your help.
09-19-2018 03:53 PM
The PCIE TAG being all zeros - I thought that was wrong, but I believe that's a supported "feature" of the Xilinx PCIE endpoint. I think there's a mode you can set it to, where the Xilinx endpoint fully manages PCIE tags - so you never see the actual values. Your instrumentation is likely seeing this "dummy" TAG.
(Checking documentation): AXISTEN_IF_ENABLE_CLIENT_TAG = "FALSE" means the Xilinx core manages PCIE tags and presents you with dummy tag information.
Regarding the Xilinx DMA IP: I can't comment much, as we use our own DMA IP. However, this is a much more complicated solution. Internally, the IP needs to deal with out-of-order Completion Packets, which are entirely possible and common. (But if the IP works, you should never know.)
I'd focus on diagnosing what's occurring now, rather than throwing things out and trying a new solution.