Explorer
15,958 Views
Registered: ‎12-01-2010

Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


I am seeing some serious throughput issues while implementing DMA on the PCIe core.  I am using the VC707 development board, and I have added ChipScope to the design to observe what is happening.  Core implementation details are: 

    7 Series Integrated Block for PCIe (v2.1)

    PCIe Gen2 x8

    MAX_PAYLOAD_SIZE = 256B

    tx_buf_av = 29

    128 bit AXI width

    TLPs are 64-bit-address memory writes (4DW header) with a 176-byte payload, no digest

 

My transactions consist of 3 groups of 12 TLPs every 13 µs, for a total of 36 TLPs.  This is what I'm seeing.

 

[Image: PCIe-Capture-tx_ready_delay-1.jpg]

 

As you can see in the image, the first group executes just fine, and tx_buf_av slowly refills afterwards, with plenty of time, back up to the maximum of 29 available buffers.  This is the expected PCIe core behavior.  After the second group, however, the available buffer count sits at 7 for quite some time.  See below.

 

[Image: PCIe-Capture-tx_ready_delay-2.jpg]

 

This is perplexing.  Why do the buffers not refill immediately, and instead wait 711 clock cycles (at 250 MHz) before they begin to free?  The third group doesn't even finish before tready goes low, blocking further transmissions for periods of time while the buffers become available.  The fourth group is the same.

 

By the 5th group, we've reached a sort of stasis where I am sending single TLPs whenever the PCIe core's AXI interface allows.  See below.

 

[Image: PCIe-Capture-tx_ready_delay-3.jpg]

 

The time between transmissions varies, but averages an atrocious 150 clock cycles (at 250 MHz).  Why is this?  That works out to ~290 MB/s, well below the ~4 GB/s expected.  In addition, tx_buf_av never drops below 5.  Why doesn't it go down to 1 or 0?  It seems like a big waste of buffer space.
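
For reference, a quick back-of-the-envelope check of these numbers (a rough sketch in Python; the 176-byte payloads, the 3 groups of 12 TLPs every 13 µs, and the ~150-cycle gap at 250 MHz come from this post, the rest is plain arithmetic):

    # Rough throughput arithmetic for the numbers quoted above (not a measurement).
    clk_hz = 250e6            # user-clock rate of the core's AXI interface
    payload_bytes = 176       # payload per TLP (from the settings listed above)
    tlps_per_period = 3 * 12  # 3 groups of 12 TLPs...
    period_s = 13e-6          # ...every 13 us

    offered = tlps_per_period * payload_bytes / period_s
    print(f"Offered load: {offered / 1e6:.0f} MB/s")            # ~487 MB/s

    # Observed steady state: roughly one 176-byte TLP every ~150 clock cycles.
    observed = payload_bytes * clk_hz / 150
    print(f"Observed rate: {observed / 1e6:.0f} MB/s")          # ~293 MB/s

    # Raw x8 Gen2 capacity: 8 lanes * 5 GT/s * 8b/10b encoding, in bytes/s.
    link_raw = 8 * 5e9 * 0.8 / 8
    print(f"x8 Gen2 raw bandwidth: {link_raw / 1e9:.0f} GB/s")  # ~4 GB/s

So the offered load sits well below the raw link capacity, yet above the ~290 MB/s actually getting through once the buffers stop refilling, which is why the design falls behind.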

 

So that's where I'm at.  My throughput is totally shot, and I can't meet the expected system speed.  What is really interesting is the "decay" in performance.  If the packets were malformed in any way, the PCIe Root Port would reject them consistently.  I am also clearly not sending too fast, as the first group had no problems getting across with plenty of time to spare.

 

Has anyone seen this before?  Any suggestions?

Thanks in advance.

1 Solution
9 Replies
15,942 Views
Registered: ‎02-28-2011

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


Hi,

 

I don't have much time, so here is the short answer:

 

Buffers are held until the link partner sends an acknowledgement.  The data must be held in case the packet has to be sent again.

Depending on the core settings, you cannot see such ACK messages because the core hides them.

 

Regards Markus

Explorer
15,925 Views
Registered: ‎12-01-2010

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


One thing I forgot to mention is that the PCIe behavior is absolutely NOT consistent.  Sometimes the behavior is as described above.  Sometimes it takes 10 or 20 groups before the performance degrades.  And sometimes I have NO ISSUES whatsoever, and the entire data set of 2,000+ groups gets through without issue in the allotted time.

 

It's this lack of reliability and consistency that has me flummoxed the most.  What do I need to do to ensure some sort of reliable performance baseline for a PCIe design?  Are there any PCIe motherboard/chipset settings that can improve performance?

15,905 Views
Registered: ‎02-28-2011

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


You are using the board in a desktop PC, right?  A few comments:

 

- How is the board connected to the root complex (directly or over a switch)?

- If there is a switch, which other board is connected to the same root complex over the switch?

  -> the other board might be using the connection, and the bus gets "unreliable"

- If there is a switch, what is its max payload? (The switches I use only support 128 bytes, instead of the 176 you are using.)

  -> packets are split at the switch and the connection might get "unreliable"

- Maybe the bus has a lot of errors? Can you check that somehow? That would cause a lot of packet retries, and the buffer count would not free up fast enough.

- 711 clock cycles is about 2.8 µs, which looks quite reasonable to clear a buffer from write to acknowledgement. I just checked my design and it takes about 2.1 µs there (VPX rack, PCIe x1 Gen1).

- Did you check how flow control works? Can you enable flow control messages and check what happens?

 

Regards Markus

 

Explorer
15,892 Views
Registered: ‎12-01-2010

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


I just realized that my Link Status Register shows x1 Gen2.  The core is implemented as x8 Gen2, and this is verified in the Link Capabilities Register.  That means the training process with the Root Port has negotiated down to the minimum of x1!  This would explain my bandwidth issue!
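
For anyone else checking this, here is a minimal sketch of how the link-status value can be decoded (it assumes the standard PCIe Link Status Register layout; the value would come from the core's cfg_lstatus[15:0] output or from lspci, and the example value below is hypothetical):

    # Decode a PCIe Link Status Register value.
    # Standard layout: bits [3:0] = current link speed, bits [9:4] = negotiated link width.
    SPEEDS = {1: "2.5 GT/s (Gen1)", 2: "5.0 GT/s (Gen2)", 3: "8.0 GT/s (Gen3)"}

    def decode_link_status(lstatus: int) -> str:
        speed = lstatus & 0xF          # Current Link Speed field
        width = (lstatus >> 4) & 0x3F  # Negotiated Link Width field
        return f"x{width} @ {SPEEDS.get(speed, 'unknown')}"

    # Hypothetical example: a link that trained down to x1 Gen2 (width = 1, speed = 2).
    print(decode_link_status(0x0012))  # -> x1 @ 5.0 GT/s (Gen2)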

 

I have vindicated the VC707 card and the PC as possible suspects, leaving only the design.  What design/core settings and/or implementation choices could possibly cause link negotiation to drop to x1?  Could implementation timing be an issue?

 

Xilinx has a good document to help me get started: Xilinx Answer 56616 - 7 Series PCIe Link Training Debug Guide.

I will use it to start debugging my link-width train-down issue, but any suggestions would be greatly appreciated.

 

 

Explorer
15,845 Views
Registered: ‎12-01-2010

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


After a few days of playing around with the code and different core settings, I finally gave up.  I had the idea to start a new project, but this time in Vivado 2013.3.  This version also has a new version of the 7 Series Integrated Block for PCI Express - v2.2.

I took my existing code (that consistently trained down to x1) and imported it directly into the new project.  I also instantiated the new v2.2 core, making sure to use the same settings as before.

 

The GOOD News:

Lo and behold, the design trains to x8 consistently!  I have no idea what fixed it, nor do I really care at this point.  I am continuing my development with Vivado 2013.3, as it is working well.  If you have a similar issue, I recommend upgrading to the latest version of the software and the core.

 

The BAD News:

I am still encountering the exact same problem that led me to write this post in the first place: the saxis_tx_tready signal from the core goes low frequently, causing heavy transmission delays.  This behaviour is totally RANDOM.  This means that the link width was NOT the reason for the throughput hit.

 

So when I'm sending a burst of 50 MB, for example, I expect it to be sent in under 200 ms.  Instead, I am getting transmission times upwards of 1 second!  Sometimes it actually gets through without even one saxis_tx_tready deassertion.  Other times, it is deasserted for pretty much the entire transaction.  Mostly, it's somewhere in between, with some portions unaffected and other random portions heavily delayed.
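
To put numbers on that (plain arithmetic from the figures above):

    # Effective rates implied by the transfer times quoted above.
    burst_mb = 50
    print(f"Expected: {burst_mb / 0.2:.0f} MB/s or better")     # 50 MB in under 200 ms
    print(f"Observed worst case: {burst_mb / 1.0:.0f} MB/s")    # 50 MB taking ~1 second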

 

I will continue to play with the core settings to see if anything improves, but as always, if anyone has any experience with this issue, please post it on this forum.  Thanks.

 

15,842 Views
Registered: ‎02-28-2011

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


Hi,

 

Well, your problem seems very odd indeed. I am running a PCIe x1 Gen1 design (Virtex-6) at over 160 MB/s without any issues.

How are you sending the 50 MB? Continuous packets without pauses, or do you have pauses between packets?

Maybe flow control messages cannot be sent because you "flood" the TX path?

 

Regards Markus

Explorer
15,836 Views
Registered: ‎12-01-2010

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


Markus,

  Thanks for your reply.  The data stream is variable, but there are definitely spaces between packets.  They are NOT back-to-back.

A typical scenario would be 60 TLPs, spaced 5 clock cycles apart (20 ns), followed by an 8 µs break.  This is repeated thousands of times.
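
To put a rough number on the offered load of that pattern (a sketch only; it assumes the 176-byte payloads and the 128-bit, 250 MHz interface from my first post, and reads "spaced 5 clock cycles apart" as 5 idle cycles between TLPs):

    # Rough average offered load for: 60 TLPs per burst, 5 idle cycles between TLPs,
    # then an 8 us break, repeated. Assumes a 176 B payload plus a 4DW (16 B) header
    # on a 128-bit (16 B/beat) interface at 250 MHz.
    clk_period_s = 1 / 250e6
    beats_per_tlp = (176 + 16) // 16              # 12 beats of data per TLP
    burst_s = 60 * (beats_per_tlp + 5) * clk_period_s
    period_s = burst_s + 8e-6                     # burst plus the 8 us break
    avg_rate = 60 * 176 / period_s

    print(f"Burst length: {burst_s * 1e6:.1f} us")             # ~4.1 us
    print(f"Average offered load: {avg_rate / 1e6:.0f} MB/s")  # roughly 870 MB/s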

 

This doesn't seem to be a "flood of too much data" issue.  The key thing I am noticing is that you'd think that during the breaks (8 µs is a LONG break) the credits would get fully refreshed and be ready to accept the next group.  But that's not the case.  They just stay low for long periods of time, apparently for no reason.  And like I stated before, it's totally random.  Sometimes I won't get held off even one time, and my 50 MB transmission gets through no problem.

 

I'm not sure if this is an Endpoint core issue, or something going on upstream that is holding off the Endpoint.

 

I'm stuck, and I don't really know how to proceed on this.

15,824 Views
Registered: ‎02-28-2011

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


Hi,

 

I agree with you and doubt that it is an endpoint issue.

Maybe it helps to create a picture of the PCIe connections in your PC. It looks to me like the PCIe x8 link goes through a switch before it reaches the root complex. The x8 connection might be shared between the FPGA board and the graphics card, for example?

The payload size of such a switch might also be an issue. If the payload is larger than the switch's maximum, the packet will be split. Maybe that is the issue? I am assuming that because the switches I have seen only had a max payload of 128 bytes (the minimum required by the PCIe spec), and your packet size is bigger than that. Did you try reducing to 128 bytes (0x20 DW)?

 

Regards Markus

Explorer
24,429 Views
Registered: ‎12-01-2010

Re: Throughput issues with DMA and tx_buf_av in 7 Series Integrated PCIe Block


Markus,
    I decided to explore the payload size a bit more, and pulled the cfg_dcommand[15:0] register out of the core.  I looked at the Max_Payload_Size bits, and they are set to "001", which indicates a 256B max payload.  This is the assumption I have been operating on the entire time, as it is also the maximum the Xilinx core can handle at x8 Gen2.
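
For reference, the field can be picked out of that register value like this (a minimal sketch assuming the standard PCIe Device Control Register layout carried on cfg_dcommand[15:0]; the example value is hypothetical):

    # Decode the Max_Payload_Size field of a PCIe Device Control Register value.
    # Standard layout: bits [7:5] = Max_Payload_Size, bits [14:12] = Max_Read_Request_Size.
    MPS_BYTES = {0b000: 128, 0b001: 256, 0b010: 512, 0b011: 1024, 0b100: 2048, 0b101: 4096}

    def max_payload_bytes(dcommand: int) -> int:
        return MPS_BYTES[(dcommand >> 5) & 0x7]

    # Hypothetical example: a value whose bits [7:5] are 001 decodes to 256 bytes.
    print(max_payload_bytes(0b0000_0000_0010_0000))  # -> 256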
    But I decided to follow your suggestion, despite the register telling me otherwise.  If you are correct, and an upstream switch did in fact limit the payload size to 128, then the onus would be on an intermediary switch to break the packet up into two.  As far as I understand, this would not generate any errors, but it would tie up the buffers for twice the duration.

    Nonetheless, I spent a few days making changes and experimenting with different things.  I put in some registered control bits that let me dynamically change the TLP size between transactions.  One of the settings cut the packet length in half (below 128 bytes), which doubled the number of TLPs.  Unfortunately, this didn't seem to have much effect either way.  It really seems to be system dependent.

 

 

I found a great article talking specifically about this issue.

 

http://www.design-reuse.com/articles/15900/realizing-the-performance-potential-of-a-pci-express-ip.html

 

It discusses how poor response times will cause transmission stalls, which is really the best way to describe what is going on here.  So in the end, you were correct in your numerous posts when you mentioned flow control and ACKs.  There is really nothing inherently wrong with what I'm doing.  It's a whole-system interaction issue caused by a LOT of data being transmitted, slow ACK response times, and possibly even dropped packets due to signal integrity, resulting in many retries.

 

The solution here is really to implement a very large DDR-backed FIFO on the output data AXI-Stream (right before the core) to buffer data through the stalls, and then transmit at full throttle to catch up.  Depending on the data size, this may require tens or hundreds of megabytes, much more than the available block RAM.  Additionally, one can eliminate as many unnecessary background processes in Windows as possible to reduce competing PCIe bus activity.  With these two things together, you should be able to consistently transfer all of the Posted Writes you need.
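
As a rough sizing guide for such a FIFO (plain arithmetic; the stall durations below are illustrative assumptions, not measurements):

    # Rough sizing of a DDR-backed store-and-forward FIFO that must absorb data
    # while the PCIe TX path is stalled: buffer needed ~= offered rate * stall time.
    offered_mb_per_s = 487              # offered load estimated earlier in the thread
    for stall_s in (0.01, 0.1, 0.5):    # assumed worst-case stall durations
        need_mb = offered_mb_per_s * stall_s
        print(f"{stall_s * 1e3:>4.0f} ms stall -> ~{need_mb:.0f} MB of buffering")
    # Even a 100 ms stall already needs ~50 MB, far more than on-chip block RAM,
    # hence buffering in external DDR memory.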

 

 

I am grateful for your help and suggestions, and for the time you spent replying to my posts.

Thanks,
Mark
