12-28-2020 05:53 AM
We have a legacy design with a Virtex-5 using the Endpoint Block Plus 1.14. The design works with no issues when connected directly to the PCIe bus at x4 Gen1.1 using the slot clock. A few years ago we needed a fiber-optic link and used the Samtec PCIEO hardware to connect our hardware to the PC's PCIe bus, and that worked with no problem. The Samtec PCIEO fiber-optic hardware at the endpoint had its own PCIe clock, which in essence made this an asynchronously clocked system. The Samtec PCIEO hardware is now obsolete, so we changed the design to use the Samtec PCUO PCIe FireFly hardware. At the endpoint end we made an interposer board for the FireFly hardware and placed a PCIe-compliant clock on it. So this design is also asynchronously clocked, not unlike the previous Samtec PCIEO fiber-optic hardware.
The issue is that link training never reaches L0. I started troubleshooting with ChipScope and the signals recommended in the Virtex-5 LTSSM Debug Guide. I can see the endpoint cycling constantly through the Detect, Polling, and Configuration states. I do have VCD dumps of some of the states, but with ChipScope's limited buffer size it is difficult to get to the bottom of what's going on.
I connected a protocol analyzer between the FireFly and the Virtex-5 endpoint to get a better understanding of what is going on. I have attached a document that compares a good trace and the bad FireFly LTSSM trace. It appears that in the Configuration state the host proposes a link number of 2, the endpoint responds with the same link number for a few packets, and then the endpoint goes back to PAD for the link number? It's also strange that the N_FTS numbers differ between the good and bad traces. I have worked with the Samtec support engineers and they see no issue with the implementation of their hardware.
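For readers following the link number and N_FTS discussion, here is a minimal sketch of how the fields of a TS1/TS2 ordered set can be pulled out of 16 captured symbols. It assumes the 8b/10b (Gen1/Gen2) symbol layout as I understand it from the PCIe Base Specification; the names `decode_training_set` and `TrainingSet` are made up for this illustration and are not part of any Xilinx or LeCroy tool.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

COM = 0xBC      # K28.5, first symbol of every TS1/TS2
PAD = 0xF7      # K23.7, meaning "no link/lane number assigned"
TS1_ID = 0x4A   # D10.2, TS1 identifier
TS2_ID = 0x45   # D5.2, TS2 identifier

Symbol = Tuple[int, bool]  # (byte value, is_k_character)


@dataclass
class TrainingSet:
    kind: str                 # "TS1" or "TS2"
    link: Optional[int]       # None means PAD
    lane: Optional[int]       # None means PAD
    n_fts: int
    rates: List[str]          # advertised data rates
    loopback: bool
    disable_scrambling: bool


def _number(sym: Symbol) -> Optional[int]:
    """Return the link/lane number, or None when the symbol is PAD."""
    byte, is_k = sym
    return None if (is_k and byte == PAD) else byte


def decode_training_set(symbols: Sequence[Symbol]) -> TrainingSet:
    """Decode one 16-symbol TS1/TS2 ordered set (8b/10b layout assumed)."""
    if len(symbols) != 16:
        raise ValueError("a TS1/TS2 ordered set is 16 symbols long")
    if tuple(symbols[0]) != (COM, True):
        raise ValueError("ordered set must start with COM (K28.5)")
    # Symbol 6 may carry other information in later spec revisions,
    # so the identifier is read from symbol 7 here.
    ident = symbols[7][0]
    if ident not in (TS1_ID, TS2_ID):
        raise ValueError("identifier symbol is neither TS1 nor TS2")
    rate_bits = symbols[4][0]
    rates = [name for bit, name in ((1, "2.5GT/s"), (2, "5.0GT/s"), (3, "8.0GT/s"))
             if rate_bits & (1 << bit)]
    ctrl = symbols[5][0]
    return TrainingSet(
        kind="TS1" if ident == TS1_ID else "TS2",
        link=_number(symbols[1]),
        lane=_number(symbols[2]),
        n_fts=symbols[3][0],
        rates=rates,
        loopback=bool(ctrl & 0x04),
        disable_scrambling=bool(ctrl & 0x08),
    )
```

Decoding the same symbol positions from the good and the bad trace makes it easy to line up the link number, N_FTS, and advertised rates side by side instead of reading them off screenshots.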
01-12-2021 08:57 AM
It is difficult to tell what is happening. You have a working scenario and a failing scenario. The cause of the failure probably does not lie within the IP, as the passing case works fine. Having said that, the normal course of debug would be to follow the guidelines here: https://www.xilinx.com/Attachment/Xilinx_Answer_42368_Debugging_Guide.pdf which we believe you have already been referring to. Apart from that, please also review the known issues listed at https://www.xilinx.com/support/answers/51597.html There are also some limitations in the IP that are documented in the corresponding user guide. They may not be related here, but they would be good to review. In general, the debug that applies here is checking that the clocks are OK (jitter, etc.) and making sure the power is free of noise. All of these are listed in the debug document.
In your Word document, you mention the link number. It sounds like the issue may be related to the link number. Could you check why the host is sending '2' instead of '0'? It is also odd that the IP appears to be advertising Gen3. This IP doesn't support Gen3. Also, do check with the switch vendor to see if they have any patches. Multiple times in the past, a PCIe issue that was reported as an IP issue ended up being a switch issue.
01-12-2021 09:34 AM
Thank you for replying. I have been following the debugging guide. I do have ChipScope VCD files for the original hardware setup, where the link trains, and for the fiber-optic hardware setup, where the link fails. I will attach two of those VCDs that capture the Reset to Detect waveforms. I am trying to understand the differences between the two. It smells to me like something goes wrong at the very beginning in the failing optical setup, and that is why I have attached the VCDs for the good and bad Reset to Detect portion of startup. Then, whatever that initial failure is, the root (switch) and the endpoint are confused about one another? While the attached VCDs do not show it, I have checked PLLLKDET and it is staying locked. The fact that the endpoint is transitioning through Detect, Polling, and Configuration would seem to indicate there are no lane SI issues? We have a scope with PCIe compliance software and the clock passes at the FPGA. In the failing VCD there seems to be a lot more activity on pipe_rx_status? As I mentioned, I am trying to figure out what this might mean. As for the switch, it is a PLX/Broadcom device and I have tried some setting changes with no luck. I have tried two other FireFly cards from another vendor with a different PLX/Broadcom part number and I still get the same result. It seems to me it's something in our hardware that I can't put my finger on. Let me know if something pops out in the VCDs. I have other VCDs of the Detect to Polling and Polling to Configuration state transitions, or let me know if you have a suggestion for some other signal to monitor.
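For comparing a good and a bad VCD without eyeballing waveforms, a rough sketch like the following can extract the value changes of one signal and report where two captures first diverge. It handles only the basic VCD constructs (`$var`, `#time`, scalar and `b...` vector changes); the signal names used with it are whatever the ChipScope export produced, and `read_vcd_signal` and `first_divergence` are hypothetical helper names, not part of any tool.

```python
import re
from typing import List, Optional, Sequence, Tuple


def read_vcd_signal(text: str, name: str) -> List[Tuple[int, Optional[int]]]:
    """Return (time, value) changes for the first signal called `name`.

    A value of None means the sample contained x or z bits.
    """
    ident = None
    for m in re.finditer(r"\$var\s+\S+\s+\d+\s+(\S+)\s+(\S+)", text):
        # Accept both "name [3:0]" and "name[3:0]" reference styles.
        if m.group(2).split("[")[0] == name:
            ident = m.group(1)
            break
    if ident is None:
        raise KeyError(f"signal {name!r} not found in VCD header")

    body = text.split("$enddefinitions", 1)[1]
    now = 0
    changes: List[Tuple[int, Optional[int]]] = []
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("$"):
            continue                      # keywords such as $dumpvars / $end
        if line.startswith("#"):
            now = int(line[1:])           # new simulation/sample time
            continue
        if line[0] in "rR":
            continue                      # real-valued changes are ignored
        if line[0] in "bB":
            bits, sig = line[1:].split()  # vector: "b0101 <id>"
        else:
            bits, sig = line[0], line[1:].strip()  # scalar: "1<id>"
        if sig != ident:
            continue
        value = int(bits, 2) if set(bits) <= {"0", "1"} else None
        changes.append((now, value))
    return changes


def first_divergence(a: Sequence, b: Sequence) -> Optional[int]:
    """Index of the first position where two value sequences differ."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))
```

Running `first_divergence` on the value sequences (ignoring timestamps) of the LTSSM state signal from a good and a bad capture gives a concrete starting point for where the two startups stop matching.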
Thanks again for the help
01-14-2021 04:51 AM
rx_status is an indication of error. It is mentioned on page-28 of the debug guide and also you will find more details about this in the GTP user guide. Do you see this signal toggling in configuration state too? Is the behaviour same in both working and failing scenarios?
I see you are using an x4 link. Would it be possible to check with x1?
You mentioned that you have tried with other firefly cards. Do you see the same link number proposed as '2' in all cases? Have you checked why there is '2' instead of '0'?
Do you have other endpoint cards that you can check with the same setup? I would be interested to see if the link number proposed is same or different and if it works or not.
I assume you are checking with your own design. Would it be possible to try a Gen1 x1 configuration with the example design that is generated along with the IP?
01-14-2021 06:08 AM
Thank you for the reply. I do have traces through all the states for a "good" and a "bad" system, which I have attached. Once the link leaves the Detect state, rx_status mostly sits at 0 for both good and bad, but in the bad case I can see it randomly(?) changing during the reception of packets. I think what I was missing is that I assumed things were fine in a previous state if the LTSSM had moved on to the next state. I now have an area to focus on, and I was already in the middle of trying to convert the design to x1.
I believe I have seen the 2 with other FireFly cards and a different endpoint card of this same design. The reason for this may be the difference in hardware setup between the good and the bad traces. For the good trace the endpoint is connected directly to the PC's motherboard slot. In that case, for physical reasons, I have to use our LeCroy T1-4 Edge PCIe card protocol analyzer. In the bad case, the FireFly and the FireFly card with the PLX 8733 switch sit between the FPGA endpoint and the PC. In this case I have to have mid-bus probes soldered to the FireFly board to record with a LeCroy T3-8 Summit protocol analyzer. So I wonder whether in the latter case the link number 2 is what's proposed by the 8733 switch? The crazy thing is that I always see the FPGA endpoint reporting that it is Gen1, 2, and 3 capable, which it is not. So my theory is that something gets totally confused from the beginning, for example in the Detect state.
I never thought about the Gen1 x1 example design; that might be easier to load onto our board than modifying our design.
So I have a number of things to try, that will take me a few days, and I will come back when I have some results.
01-14-2021 11:06 AM
Looking a little closer at the Reset to Detect state, rx_status in both the good and the bad captures shows disparity and 8B/10B decoding errors. Since I can only get a ChipScope depth of 2048, it's difficult to look beyond that point in that state. When I check the good and bad captures in the transition to Polling and Configuration, rx_status is 0, and that appears to be the case while TS1 and TS2 sets are being exchanged. So is it possible that at this early stage the 8B/10B decoders, for both the good and the bad, are just getting the disparity established, that this happens later in the Detect state, and that this is why the LTSSM successfully moves on to Polling and Configuration? I have attached a GTKWave screenshot of the good and bad rx_errors. One curious thing is that the bad case, in which the hardware is connected to the switch, has fewer "receptions" of data, whereas the good case, which is connected to the PC's PCIe bus, has more receiver activity? I'm not certain whether this is a clue or just a difference in how the switch and the PC's PCIe bus behave.
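To make the rx_status observations easier to compare across LTSSM states, the 3-bit codes can be tallied per state. The sketch below assumes the generic PIPE RxStatus encoding (000 = data OK, 100 = 8b/10b decode error, 111 = disparity error, and so on); the GTP user guide should be checked for any codes the Virtex-5 transceiver treats differently. The state names and the `errors_by_state` helper are illustrative only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Generic PIPE RxStatus[2:0] encoding (assumed, verify against the GTP guide).
PIPE_RX_STATUS = {
    0b000: "received data OK",
    0b001: "1 SKP added",
    0b010: "1 SKP removed",
    0b011: "receiver detected",
    0b100: "8b/10b decode error",
    0b101: "elastic buffer overflow",
    0b110: "elastic buffer underflow",
    0b111: "receive disparity error",
}

# Codes that indicate a receive problem rather than normal operation.
ERROR_CODES = {0b100, 0b101, 0b110, 0b111}


def errors_by_state(samples: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count rx_status error codes per LTSSM state.

    `samples` is an iterable of (ltssm_state_name, rx_status) pairs, one per
    captured clock. States with no errors do not appear in the result.
    """
    result: Dict[str, Counter] = defaultdict(Counter)
    for state, status in samples:
        code = status & 0b111
        if code in ERROR_CODES:
            result[state][PIPE_RX_STATUS[code]] += 1
    return dict(result)
```

A table like this for the good and the bad capture shows at a glance whether decode or disparity errors are confined to Detect or also appear while TS1/TS2 sets are being exchanged.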
01-18-2021 12:40 PM
I was able to compile the Gen1 x1 example design provided with the Endpoint Block Plus 1.14 that we are using in the design. It has made no real difference to the LTSSM issue I am having. I have attached two screenshots from the protocol analyzer. Please note that the protocol analyzer confirms both the root and the endpoint are connected as x1 at 2.5 GT/s. In this case, when a link number is proposed, a link number of 0 is first sent from the endpoint to the PCIe switch (root), the PCIe switch responds in kind with link number 0, and then after a number of TS1 ordered sets the link number goes back to PAD?
01-18-2021 05:14 PM
Is it correct to say that the Gen1 x1 example design shows exactly the same issue as your own design, except that with the Gen1 x1 design the link number proposed is always '0' and not '2'?
Could you share the following:
1. The entire project directory for Gen1x1 design.
2. Protocol Analyzer Waveform dump file for the working case scenario
3. Protocol Analyzer Waveform dump file for the failing case scenario i.e. when connecting to the switch.
Also, would it be possible to check with v1.15 of the core?
01-19-2021 07:20 AM
Yes, it appears the Gen1 x1 version exhibits exactly the same behavior, except that a link number of 0 is proposed instead of 2. In the protocol screenshot that I sent in the previous reply, it appears the endpoint is the first to propose link number 0. In an earlier failing protocol trace in the x4 configuration a link number of 2 was proposed, but I think it was the root that proposed it first? I'm not certain whether either the endpoint or the switch (root) can propose first. Then, as always, a number of TS1s later the link number goes back to PAD.
Attached is a ZIP of the example design directory generated by Coregen. FYI, the x1 example design did not compile with the .BAT file without errors. Those errors indicated that adjacent MGT blocks needed to be enabled, and I added those to the UCF. My PCIe Reset_N signal comes in on a different pin and needs an inversion (which I have in my design), so I added that modification. I don't think that impacts what is going on.
In terms of the protocol analyzer dump, I have attached the good LeCroy analyzer trace. Please note that because this was taken with the hardware that has the copper cable to a host PCIe card, we used our LeCroy T1-4 Edge protocol analyzer, and that file can be opened with the LeCroy PETracer analyzer software. I tried to update the trace with LeCroy's newer "PCIe Protocol Analysis" software, but it fails for some reason. For the failing system, where the FireFly hardware and the Dolphin PCIe card with the PLX/Broadcom 8733 switch sit between the FPGA's endpoint and the PC's root complex, we are, as mentioned, using the LeCroy Summit T3-8 with solder-in embedded probes to capture the data with the PCIe Protocol Analysis software. That trace was too large to upload, so I will lower the buffer size to get it below 19MB and upload it later today. I will also work on trying Endpoint Block Plus 1.15.
I also attached an image of the port of the switch that is connected to the FPGA when the FPGA was setup with the x1 example design. The connection to the host (port
After I get the bad trace uploaded I will work on trying 1.15.
Thanks again for your help.
01-21-2021 05:17 PM
Thanks for the files and checking with v1.15. I was able to download and open the captures after a bit of juggling with the analysis software.
I had a quick look at the captures but didn't spot any obvious issues.
Since connecting directly to the host works and the issue appears only when the switch is connected, we believe the switch is doing something that the V5 PCIe block is not happy with. This is just a theory at the moment. I will need some time to comb through the captures you sent. What I would suggest is to focus on the LeCroy packet analysis during Polling and Configuration and check whether the flow of packets, i.e. TS1, TS2, and their corresponding fields, is correct according to what is described in the spec. The key here will be to find out who is breaking the protocol. If we can figure that out, it will help narrow down the issue and let us focus only on why there is a deviation from the protocol.
01-22-2021 11:24 AM
Agreed. I have been combing the ChipScope and protocol analyzer captures for anything that might be SI related, but can't find anything. I'll start checking the switch.
01-25-2021 06:06 AM
Below is from the passing case:
Below is from the failing case:
Could you please confirm whether the upstream port is always the first to send a non-PAD link number in the failing case?
Also, you had mentioned the following:
"In the good setup the entire system is powered up at the same time so there is a period of time before the FPGA configures and begins sending packets in the direction of the root complex.
In the bad system the FPGA is powered up before the root complex, so packets between the root complex and the FPGA begin almost immediately."
Would it be possible to check by powering up the system similar to in the working case scenario?
01-25-2021 11:29 AM
I have seen it both ways. In this case the root is the first to propose the link number, the endpoint agrees, and then the endpoint is the first to go back to PAD. I am fairly certain that it is always the endpoint that goes back to PAD first, almost as if it is timing out on something. Some of my traces have a link number of 2 and some have 0, and I think that is because I moved the host card in the PC to try a different slot.
I have confirmed that the issue is the same whether the FPGA powers up with the PC or is powered ahead of the PC. I have also confirmed that, in the case where the FPGA is powered from the PC and thus powers up at the same time, the FPGA is configured before the PCIe Reset_N is released. Green is PC power, yellow is FPGA DONE, and blue is the PCIe_reset_n signal to the endpoint.
I have changed the switch setting to force that port to Gen 1 and it has not made a difference.
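The question of who proposes the link number first and who drops back to PAD first can be answered mechanically from an exported list of TS1/TS2 events. The following is only a sketch: the event layout (timestamp, sender, link number with `None` standing for PAD) is an assumption about how the analyzer data might be exported, and `TsEvent` and `link_number_milestones` are invented names.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class TsEvent(NamedTuple):
    time_ns: int
    sender: str            # e.g. "switch" or "endpoint"
    link: Optional[int]    # None stands for PAD


def link_number_milestones(
    events: Iterable[TsEvent],
) -> Tuple[Optional[TsEvent], Optional[TsEvent]]:
    """Return (first non-PAD proposal, first revert from a number to PAD)."""
    first_proposal: Optional[TsEvent] = None
    first_revert: Optional[TsEvent] = None
    last_link: Dict[str, Optional[int]] = {}
    for ev in sorted(events, key=lambda e: e.time_ns):
        previous = last_link.get(ev.sender)
        if first_proposal is None and ev.link is not None:
            first_proposal = ev
        # A revert is a PAD sent by a side whose previous TS carried a number.
        if (first_revert is None and first_proposal is not None
                and ev.link is None and previous is not None):
            first_revert = ev
        last_link[ev.sender] = ev.link
    return first_proposal, first_revert
```

The time between the two milestones is also worth noting: comparing it against the Configuration substate timeouts in the PCIe Base Specification would show whether the revert to PAD looks like a timeout or happens too quickly for that.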
01-28-2021 11:42 AM
Attached are the files showing the link number proposal from the FPGA and from the switch. One thing I learned is that if I capture the very first link training transition from TS2s to TS1s, it looks like it's always the FPGA that proposes first. After power-up the system is constantly looping through the LTSSM states. If I capture those loops at random, then from time to time I will see the root or switch propose first. I think in all cases the FPGA is the first to drop the link number, as if it's timing out. It is still strange to me how the FPGA and the switch/root always both start sending TS2s at the same time?
A few weeks ago I modified an existing design of ours to work with this system. The original design was a FireFly module that connects to the host PC. The other end of the FireFly module connects to a PEX 8608 PCIe switch. The other 4 lanes on the switch then connected to 4 USB3.0 host controllers, so that design was a USB to FireFly bridge, if you will. I modified the design by taking the USB host controller off and replacing it with a 4-lane cable re-driver that allows me to connect it to this FPGA. I got the boards on Monday, and when I tested them, the FireFly-to-switch-to-PC path links with no problem, but I have exactly the same problem with the switch-to-FPGA link. I can see in ChipScope that the LTSSM is doing the same looping: Detect, Polling, Configuration, and back to Detect. I don't have the protocol analyzer on this hardware, as it is very difficult to fit into an embedded system. The big advantage is that I can access the PLX switch with their tools. I can see that the switch has detected the PCIe lanes to the FPGA, so it has gotten that far. PLX does have an erratum about connecting to older Gen 1 devices when the switch advertises speeds greater than Gen 2, so I disabled the Upconfigure capability as recommended, but it made no difference.
02-03-2021 04:33 PM
Thanks for the capture. In one of the captures, I see Host sending the PAD after it had sent '0' in the link field.
Also, the behaviour in upstream direction cannot be explained with the spec. Could you provide similar captures for Gen1x1 design?
02-05-2021 12:36 PM
I have another clue. In the past couple of weeks we took an existing design from another product, in which a FireFly optical module is connected to a PEX 8608 switch and which connects USB3.0 hubs through a FireFly back to a host. Port 0 is the FireFly and it connects to the upstream root complex. We removed the USB3.0 hubs from port 1 so that the switch's downstream port 1 could connect to the FPGA. When we run this design the FireFly connects to the host with no problem, but the FPGA still exhibits the same symptoms when I look in ChipScope. Because the PCIe traces are on the PCB, it would take a lot of work for me to get the protocol analyzer on it. This hardware allows me to make any PCIe switch setting changes that I would like. I did try a number of things, like setting port 1 of the switch to Gen 1 only, but nothing seemed to work. I then had the idea of putting the switch's port 1 into Master Loopback mode, just to verify that the FPGA was seeing the correct information from the switch and responding in kind. When I tested that, I could see in ChipScope the data test pattern I programmed into the switch, but the endpoint LTSSM was still looping through the Detect, Polling, and Configuration states. It's as if it did not see the bits in the TS1 sets telling it to go into Loopback mode. From what I have read, my assumption is that this endpoint should support this mode. I have attached some VCDs showing the issue. Next, I am going to try to capture in ChipScope, if I can, the point where the switch sets the loopback bits.
02-10-2021 02:29 PM
I see the following in the three waveforms you sent:
I don't see any TS1/TS2 on RXDATA.
If you could send gen1x1 protocol capture, we could check how exactly TS1/TS2s are being exchanged. That will help to narrow down to the potential root cause.
The reason for asking for an x1 capture is to rule out any issue due to lane reversal.
02-12-2021 05:42 AM
Understood. I'll work on setting that up in the next day or two. Unfortunately at this point I had to return the PCIe protocol analyzer rental so I will only have chipscope traces.
02-16-2021 01:47 PM
I converted the design to x1 and updated the ChipScope core for the reduced lane count. I have attached a zip with a number of VCD captures. I hope I have captured the signals of interest. If not, let me know what you would like to see and I can fix that. There is a VCD for the transition from Initial to the Detect state, which I captured by having the board powered up with ChipScope running and then powering up the PC root complex, so you can see the PCIe reset being asserted. The rest of the VCDs capture the transitions between states. I also checked the endpoint side of the PCIe switch, and it indicates that one lane is detected and that it's in the link training process.
Thanks again for your help.
02-18-2021 08:44 AM
Unfortunately I had to return the rental protocol analyzer because of the expense. My most recent approach to this problem is putting the switch port connected to the endpoint into Master Loopback mode to test the round-trip communications, but as I previously mentioned, the endpoint does not appear to respond and continues looping through the Detect, Polling, and Configuration states. In the endpoint's ChipScope I can definitely see the PCIe switch sending the loopback data pattern I programmed, but the endpoint does not appear to enter loopback slave mode. That seems to be a common thread through the entire issue: I can "see" what I consider valid data on RX_DATA and TX_DATA in ChipScope within the endpoint, but it's as if neither side is on the "same page" in terms of processing and responding to the information it is being given. I was going to try to find in ChipScope where the switch sets the TS1 and TS2 bits for loopback, to see which state the endpoint is in at that point and whether it should be responding to the message.
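Finding the point where the switch sets the loopback bit can be done by scanning the captured RX symbols for training sets and checking bit 2 of the Training Control symbol (symbol 5). This is a minimal sketch that assumes a single lane exported as (byte, K-flag) pairs and the 8b/10b TS1/TS2 layout, where training set symbols are not scrambled; `find_training_sets` is a hypothetical helper, and an x4 capture would first need to be split per lane.

```python
from typing import List, Sequence, Tuple

COM = 0xBC      # K28.5
TS1_ID = 0x4A   # D10.2
TS2_ID = 0x45   # D5.2
LOOPBACK_BIT = 0x04  # Training Control bit 2

Symbol = Tuple[int, bool]  # (byte value, is_k_character)


def find_training_sets(stream: Sequence[Symbol]) -> List[Tuple[int, str, int]]:
    """Return (start_index, "TS1"/"TS2", training_control) for each set found."""
    found: List[Tuple[int, str, int]] = []
    i = 0
    while i + 16 <= len(stream):
        byte, is_k = stream[i]
        if is_k and byte == COM:
            # Symbols 7..15 must all be the same data identifier.
            tail = stream[i + 7:i + 16]
            ident = tail[0][0]
            if (ident in (TS1_ID, TS2_ID)
                    and all((b, k) == (ident, False) for b, k in tail)):
                kind = "TS1" if ident == TS1_ID else "TS2"
                found.append((i, kind, stream[i + 5][0]))
                i += 16
                continue
        i += 1  # not a training set here (e.g. a SKP ordered set), keep looking
    return found


def loopback_requests(stream: Sequence[Symbol]) -> List[int]:
    """Start indices of training sets that have the loopback bit set."""
    return [idx for idx, _, ctrl in find_training_sets(stream)
            if ctrl & LOOPBACK_BIT]
```

Lining up the first index returned by `loopback_requests` with the LTSSM state signal in the same capture would show which state the endpoint was in when the loopback request arrived.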
02-24-2021 06:18 AM
I went back to looking at the ChipScope traces for a good and a bad link training in the Configuration state. The one curiosity that I see is with the pipe_rx_data_k[3:0] signal. In the "bad" trace at the top of the attached screen capture, I see it set to Fh when the RX data is BCh, the COM character, and I would think this would be normal? In the good trace at the bottom of the capture I do not see it asserting, and it stays at zero. Yet later on in the "good" trace the FPGA endpoint begins to propose 8h for a link number, and later still (not in the screen capture) the lane numbers are proposed. So while the RX data being received by the endpoint looks identical, for some reason pipe_rx_data_k[3:0] is behaving differently between the two.
02-24-2021 07:15 PM
That looks a little odd. Are those the same captures you posted earlier? I see the below waveform in that capture. I don't see any BC.
Could you post both working and non working versions for x1 design?
02-25-2021 12:06 PM
I have many traces lying around, so to be certain of the data I am sending, I did some re-captures and have attached them. I did a number of captures of the "good" setup, which passes link training, makes it to L0, and seems to have no issues. Here is the crazy part that I don't understand: in some cases I see BC with the K bit asserting, and in other cases I see FC with the K bit asserting, yet in both cases I see entry into L0. I have attached 4 traces, each from a separate power-up of the "good" x1 hardware, that show both cases. Before clouding the issue with bad traces, I just wanted to try to understand why I sometimes see FC and sometimes see BC on both RX and TX. I have not changed the ChipScope setup. My setup has the FPGA powered up before the PC so I can capture the start of link training. I have ChipScope set up and ready to trigger on PCIe_Reset_n, so I am certain to capture the initial LTSSM states. I have ChipScope running off the PCIe clock that feeds the endpoint. So I have no clue what's going on. Any ideas appreciated.
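As a reading aid for the BC versus FC observation, the byte seen alongside an asserted K bit can be translated into its 8b/10b Kx.y name and, where one exists, its PCIe symbol name. The table below is my understanding of the K-code assignments in the PCIe Base Specification (K28.7 was only given the EIE meaning in the 2.0 revision, as far as I recall), so treat it as a sketch to verify rather than a reference; `k_code_name` and `describe_symbol` are invented names.

```python
# PCIe special symbols by byte value (8b/10b K characters).
PCIE_K_SYMBOLS = {
    0xBC: "COM",  # K28.5, start of ordered sets
    0xF7: "PAD",  # K23.7
    0x1C: "SKP",  # K28.0
    0x3C: "FTS",  # K28.1
    0x5C: "SDP",  # K28.2
    0x7C: "IDL",  # K28.3
    0xFC: "EIE",  # K28.7, Electrical Idle Exit (PCIe 2.0 and later)
    0xFB: "STP",  # K27.7
    0xFD: "END",  # K29.7
    0xFE: "EDB",  # K30.7
}


def k_code_name(byte: int) -> str:
    """Return the 8b/10b name Kx.y: x is bits [4:0], y is bits [7:5]."""
    return f"K{byte & 0x1F}.{(byte >> 5) & 0x7}"


def describe_symbol(byte: int, is_k: bool) -> str:
    """Human-readable label for one captured (data, K-flag) sample."""
    if not is_k:
        return f"D 0x{byte:02X}"
    name = PCIE_K_SYMBOLS.get(byte, "unassigned")
    return f"{k_code_name(byte)} ({name})"
```

With this mapping, BCh with K asserted reads as K28.5 (COM) and FCh with K asserted reads as K28.7, so the two captures are showing two different special symbols rather than two renderings of the same one.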