Observer
765 Views
Registered: 08-20-2019

QDMA C2H completion entry timing

I am using PCIe-QDMA on a custom hardware and the firmware is developed using Vivado 2019.2. I am using H2C and C2H streaming modes, and C2H mode uses completion entry write back. I am referring to Xilinx example designs using QDMA for my logic development.

I can see in the example code that for C2H, the completion write back is performed with HAS_PLD set and is done before the C2H transfer is initiated. But this doesn't work for my use case as I need to extract some information from the C2H packet and include that in the completion entry. And I want to avoid buffering up the C2H packet just for this purpose. So I redesigned that part of the logic to initiate a C2H transfer first, and then do the completion write back (with HAS_PLD set) after the C2H transfer is finished.
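
Roughly, the reworked ordering is the sketch below (a minimal sketch of the intent, not my exact RTL; it assumes at most one completion outstanding, and back-to-back packets would need a small FIFO on the completion side as in the example design):

// Minimal sketch: queue the HAS_PLD completion only after the last C2H beat
// has been accepted, so fields extracted from the packet can go into it.
logic cmpt_pending;

always_ff @(posedge user_clk) begin
  if (!user_resetn) begin
    cmpt_pending <= 1'b0;
  end else begin
    // completion beat accepted by QDMA -> clear
    if (cmpt_pending && s_axis_c2h_cmpt_tready)
      cmpt_pending <= 1'b0;
    // last beat of the C2H packet accepted -> issue the completion write back
    if (s_axis_c2h_tvalid && s_axis_c2h_tready && s_axis_c2h_tlast)
      cmpt_pending <= 1'b1;
  end
end

assign s_axis_c2h_cmpt_tvalid = cmpt_pending;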

This largely seems to work. However, it seems the host at times does not see some completion entry write backs. I have some counters running in software to track this, and it gets worse when I am forwarding a lot of packets to the host with minimum delay between C2H packets.

So is this supported by QDMA? And is there any timing relation between C2H and the completion entry? I couldn't find anything relevant in PG302. Thanks for the help.

13 Replies
Moderator
733 Views
Registered: 02-16-2010

Hi @xivar 

Your sequence of CMPT and C2H packet is expected to work fine. 

What is the value read in QDMA_C2H_WRB_COAL_CFG register?

What is the CMPT buffer size set in the IP GUI? If it is set to "16", can you try using "32"?

Please share the completion context settings used with the design.

 

------------------------------------------------------------------------------
Don't forget to reply, give kudo and accept as solution
------------------------------------------------------------------------------
Observer
713 Views
Registered: 08-20-2019

Hi @venkata 

Increasing the buffer size to 32 didn't help; it doesn't get any better. The values of some of the registers are below (all values are in hex):

CMPT == 00008000 10001003 f8000000 00d81f80 18000007
QDMA_C2H_WRB_COAL_CFG == 80050010
QDMA_C2H_CMPT_COAL_BUF_DEPTH == 00000020

CMPT is the completion context structure. I am using a completion entry size of 32B.

Observer
699 Views
Registered: 08-20-2019

Another observation. If I add a delay between packets at the input stream - usleep(100) - all seems to work well.

So my theory is, and I could be wrong, that when a lot of packets are forwarded to QDMA, all the descriptors in the C2H descriptor ring get used up. Before software has a chance to repopulate them, more packets arrive at QDMA, and those packets disappear (to somewhere).

What happens when QDMA runs out of C2H descriptors? Does it apply some sort of backpressure to the user logic by dropping c2h_tready (I would have expected so)? Or is there some other mechanism? My observation (by looking at tready) is that QDMA is almost always ready to accept more packets.

Contributor
675 Views
Registered: 01-13-2020

Yes, c2h_tready provides backpressure to the user logic. We also observed that we can cause the C2H pipeline to stall (c2h_tready = 0), but I don't know why.

Long backpressure on h2c_tready can cause driver write problems.

Annotation 2020-04-30 135346.png
Observer
590 Views
Registered: 08-20-2019

Sorry for the delayed response. For some reason, I am not getting email notifications for updates to my forum posts.

Thanks @dmsspb, that's an interesting capture. Have you enabled fetch credit in the driver for this case? For my use case, fcrd_en is set to 0 in software.

I do sometimes see c2h_tready dropping to 0 between transfers. We have kept a small C2H ring size in software, and the packet drops occur exactly when all the C2H descriptors are used up. Software needs some time to repopulate the ring, and I was expecting c2h_tready to drop to 0 during this time so that the user logic won't send any more packets until the ring is populated again. Sadly, I don't see c2h_tready going low for that duration and the packets just disappear.

I haven't tried applying backpressure on H2C, so I can't comment on that. But regarding stalling the C2H pipeline, any tips on reproducing that? I want to avoid getting into that state even accidentally.

Contributor
571 Views
Registered: 01-13-2020

Prefetch enabled.
This is a capture of the Xilinx dmaperf utility run with an input file of approximately this content:
mode=st
dir=c2h
pf_range=0:0
q_range=0:3
flags=
cmpl_status_acc=5
dump_en=0
tmr_idx=5
cntr_idx=6
trig_mode=cntr_tmr
pfetch_en=1
cmptsz=0
rngidx=5
runtime=10
num_threads=1
bidir_en=0
num_pkt=64
pkt_sz=512
pci_bus=81
pci_device=00

FPGA project https://www.xilinx.com/Attachment/Xilinx_Answer_71453_QDMA_Performance_v6.pdf
Adventurer
413 Views
Registered: 05-04-2017

Hi @xivar
Did you figure out how the backpressure works? We are using QDMA with the DPDK driver now. At the beginning we used only the "s_axis_c2h_tready" signal for backpressure control, but in speed tests it seems that this signal is not related to individual queues; for example, in a 100 Gbps flow test with 8 queues, some of the queues underflow while others overflow.
It seems that another port, called "tm_dsc", should fix this issue, but there is very little description of these signals in PG302.
Thanks a lot.
Observer
395 Views
Registered: 08-20-2019

@dlfc No, I haven't figured out how to make QDMA apply backpressure via tready; I am still having this issue. But I do have a workaround. I added logic to monitor the traffic manager interface and stop sending packets to QDMA when credits run out, though I am not convinced why QDMA can't handle this by itself. Refer to page 36 of PG302 v3.0 for the TM interface, and also have a look at the ports on page 109.

I saw that this interface broadcasts the available credits in a queue (via tm_dsc_sts_avl) when a queue is initialized or when the queue PIDX is updated by software. All I did was store this information (available credits) in block RAM, with one entry per queue. Before sending a packet to QDMA I read and check the entry corresponding to the QID, and decrement the credit when a packet is sent to QDMA. When the value reaches 0, stop sending packets to QDMA. You also want to add to the credits when another TM update arrives so that you can send more packets.
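
A rough outline of that bookkeeping (a minimal sketch; the tm_dsc_sts_* decode and widths are my reading of PG302, pkt_send/pkt_qid/pkt_allowed are placeholder signals from my own pipeline, and a TM update and a decrement hitting the same queue in the same cycle is not handled here):

// Per-queue credit bookkeeping fed by the traffic manager status interface.
// Assumptions: up to 2048 queues, 16-bit credit counters.
logic [15:0] credits [0:2047];

assign tm_dsc_sts_rdy = 1'b1;   // always accept TM status updates

always_ff @(posedge user_clk) begin
  // QDMA broadcasts newly available descriptors for a queue when the queue
  // is enabled or when software updates its PIDX.
  if (tm_dsc_sts_vld && tm_dsc_sts_rdy) begin
    if (tm_dsc_sts_qinv)
      credits[tm_dsc_sts_qid] <= '0;                 // queue invalidated
    else if (tm_dsc_sts_dir && !tm_dsc_sts_mm)       // C2H stream queues only (decode assumed)
      credits[tm_dsc_sts_qid] <= credits[tm_dsc_sts_qid] + tm_dsc_sts_avl;
  end
  // one credit consumed per packet handed to QDMA
  if (pkt_send)
    credits[pkt_qid] <= credits[pkt_qid] - 16'd1;
end

// gate the sender: only start a packet when the target queue has credits left
assign pkt_allowed = (credits[pkt_qid] != 16'd0);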

Contributor
376 Views
Registered: 01-13-2020

May be helpful: https://forums.xilinx.com/t5/PCIe-and-CPM/QDMA-tm-dsc-Credits-issued-and-then-cleared-during-queue-start/m-p/1111503#M16569

You are right, the FPGA should count QDMA tm_dsc credits for C2H transfers. One credit covers up to 4096 bytes (1 descriptor) in the QDMA "speed project". PG302 says that the C2H packet size should be <= 7 descriptors (7*4096 bytes) for Vivado 2018.3, and <= 31 descriptors for the newest Vivado.

We don't check that, because C2H packets up to 4096 bytes are OK for us.
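
As a worked example of that arithmetic (assuming the 4096-byte C2H buffer/descriptor size above; the function name is just illustrative):

// Credits (descriptors) consumed by one C2H packet with 4096-byte buffers.
function automatic logic [4:0] c2h_credits_needed(input logic [15:0] pkt_len);
  logic [16:0] tmp;
  tmp = {1'b0, pkt_len} + 17'd4095;
  return tmp[16:12];                 // ceil(pkt_len / 4096)
endfunction
// e.g. c2h_credits_needed(1024) = 1, c2h_credits_needed(4096) = 1,
//      c2h_credits_needed(6000) = 2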

C2H signals during packet transmission:
s_axis_c2h_tdata = our data
s_axis_c2h_dpar = parity for the current beat of tdata; code from the example project
s_axis_c2h_ctrl_len = length of the packet; set at the start of the packet and held during packet transmission. The length is computed using a FIFO in packet mode.
s_axis_c2h_ctrl_qid = queue ID; held during the packet
s_axis_c2h_ctrl_has_cmpt = 1 always
s_axis_c2h_ctrl_marker = 0 always
s_axis_c2h_ctrl_port_id = 0 always. port_id is not documented; we found that H2C port_id = 0 always, for all PFs
s_axis_c2h_mty = empty bytes at the last beat, = 0 otherwise
...
s_axis_c2h_tready used for backpressure
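
In RTL this hookup is roughly the following (a sketch only; the fifo_*, pkt_* and last_beat_empty_bytes names are placeholders for our packet-mode FIFO, and the widths are left implicit):

// C2H data side, driven from a packet-mode FIFO (placeholder names).
assign s_axis_c2h_tdata         = fifo_rd_data;     // payload beat
assign s_axis_c2h_tlast         = fifo_rd_last;
assign s_axis_c2h_ctrl_len      = pkt_len;          // held for the whole packet
assign s_axis_c2h_ctrl_qid      = pkt_qid;          // held for the whole packet
assign s_axis_c2h_ctrl_has_cmpt = 1'b1;
assign s_axis_c2h_ctrl_marker   = 1'b0;
assign s_axis_c2h_ctrl_port_id  = '0;
assign s_axis_c2h_mty           = fifo_rd_last ? last_beat_empty_bytes : '0;
assign s_axis_c2h_tvalid        = fifo_rd_valid;
assign fifo_rd_en               = s_axis_c2h_tvalid && s_axis_c2h_tready;  // backpressure from QDMA
// s_axis_c2h_dpar: per-32-bit parity over tdata, generated as in the example project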

CMPT port:

The example project uses an 8-byte completion for C2H packets. The QDMA completion port has only a 2-packet internal buffer (p. 48 of PG302 v3.0, and we confirmed it through ILA). So the example project uses a 2048-beat FIFO in front of this port (PG302 recommends a FIFO of at least 512 completions).
s_axis_c2h_cmpt_tdata = [63:20] user metadata, [19:4] packet length, [3] = 1 descriptor used (according to the FPGA/driver examples), [2:0] = 0, but in the CMPT descriptor we see [3:0] = 0xA.
s_axis_c2h_cmpt_size = 0 always (8-byte completion)
s_axis_c2h_cmpt_dpar = code from the example project; parity for every 32 bits
s_axis_c2h_cmpt_ctrl_qid = QID == packet QID
s_axis_c2h_cmpt_ctrl_marker = 0 always
s_axis_c2h_cmpt_ctrl_user_trig = 0 always
s_axis_c2h_cmpt_ctrl_cmpt_type = 2'b11 always, HAS_PLD
s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id = counts 1...65535 after reset, then 0...65535 during operation
s_axis_c2h_cmpt_ctrl_port_id = 0 always
s_axis_c2h_cmpt_ctrl_col_idx = 0
s_axis_c2h_cmpt_ctrl_err_idx = 0
s_axis_c2h_cmpt_tvalid = asserted for 1 clock on the clock immediately after c2h_tlast; code with the FIFO taken from the "speed example"
s_axis_c2h_cmpt_tready = used for backpressure, connected to the CMPT FIFO
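
In RTL the interesting part of the completion side is roughly (a sketch; user_metadata, pkt_len, pkt_qid and pkt_id are placeholders, the field packing is just what we observed above, and the port widths are assumptions):

// 8-byte completion beat, packed as described above.
logic [43:0] user_metadata;   // goes to tdata[63:20]
logic [15:0] pkt_len;         // goes to tdata[19:4]

assign s_axis_c2h_cmpt_tdata[63:0] = {user_metadata, pkt_len, 1'b1 /* desc used */, 3'd0};
// upper tdata bits, if the port is wider, are unused for an 8-byte completion
assign s_axis_c2h_cmpt_size                 = '0;       // 8-byte completion
assign s_axis_c2h_cmpt_ctrl_qid             = pkt_qid;  // same QID as the payload packet
assign s_axis_c2h_cmpt_ctrl_cmpt_type       = 2'b11;    // HAS_PLD
assign s_axis_c2h_cmpt_ctrl_marker          = 1'b0;
assign s_axis_c2h_cmpt_ctrl_user_trig       = 1'b0;
assign s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id = pkt_id;   // per-packet ID, counted as described above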

tm_dsc port: get/invalidate descriptors for C2H (dir = C2H, stream, QID, q_en). The port is ignored for H2C.

AXIS status port - not used

We use:
H2C packets <= 1 descriptor (4096 bytes).
C2H packets <= 1 descriptor (4096 bytes).
Each C2H/H2C packet consumes one descriptor (regardless of length: 64 bytes or 1500 bytes).


Adventurer
340 Views
Registered: 05-04-2017

Hi @dmsspb:
Thanks a lot for your reply.
The C2H packet size you mentioned above differs a little from our issue. In our test, we send a continuous 100 Gbps stream with a packet length of 1024 bytes; some time after starting the test we see some packets being lost.
Among the QDMA's AXIS stream status ports there is a signal named "axis_c2h_status_drop"; its definition in PG302 is shown below:

We monitored both "axis_c2h_status_drop" and "s_axis_c2h_cmpt_tready" in ILA, and we did capture "axis_c2h_status_drop" going high while "s_axis_c2h_cmpt_tready" was also high. Given the definition, "PCIe drops the packet if it does not have either sufficient data buffer to store a C2H packet or does not have enough descriptors to transfer the full packet to the host.", we believe that some of the queues had run out of their descriptors.

Also, about the signal "s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id": in our test we think it should start at 0, count up to 65535, then roll back to 0. If we follow the example and start at 1, the first packet is always lost.

axis_s_status_drop.png
Contributor
319 Views
Registered: 01-13-2020

Hi @dlfc!

According to PG302, the packet ID should start from 1 after reset, and then count 1...65535, 0, 1... See the code from the Xilinx "speed" example design. We have no first-packet loss.
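
In code form that is just this (a sketch; c2h_pkt_done is a placeholder for a "one C2H packet finished" pulse):

// wait_pld_pkt_id counter: reset value 1, then a natural 16-bit wrap
// gives 1 ... 65535, 0, 1 ...
logic [15:0] pkt_id;

always_ff @(posedge user_clk) begin
  if (!user_resetn)
    pkt_id <= 16'd1;            // first packet after reset uses ID 1
  else if (c2h_pkt_done)        // one pulse per completed C2H packet
    pkt_id <= pkt_id + 16'd1;
end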

If you lose packets then, IMHO, there might be something wrong with the C2H descriptors and C2H buffers. Try to analyse the C2H descriptor context step by step, packet by packet. It has some info about consumed and available descriptors, prefetched credits, and errors.

Then try to analyse the QDMA bar 0 register dump with "err" in the register names.

c2h_len = 1024 bytes no problem.

I have no idea about ("axis_c2h_status_drop" = 1 & "s_axis_c2h_cmpt_tready" = 1). Did you count the "tm_dsc" port descriptors? What about the available descriptors in the C2H context at that time? Do the descriptors in the C2H context match the descriptors received through "tm_dsc" and the descriptors consumed?

About the pictures: this is the Xilinx "speed example" project. t is the time defined by the CMPT FIFO in-out latency, and it matches the c2h_tlast timing. C2H: 2 packets of 64 bytes each.

P.S. What throughput did you get at 100Gbps on 8 queues via DPDK?

qdma_c2h2_2x64bytes.png
qdma_pkt_id_start_at1.png
qdma_c2h1_2x64bytes.png
qdma_pkt_id_speed_example_code.png
Adventurer
245 Views
Registered: 05-04-2017

Hi @dmsspb:
Sorry for the late reply.
For now we haven't used the "tm_dsc" ports, but we plan to. In our current design we followed a reference design from Xilinx which bypassed the "tm_dsc" ports, but it seems that the packet loss is very frequent.
Honestly, the throughput on 8 queues via DPDK is not clear right now. Following https://www.xilinx.com/Attachment/Xilinx_Answer_71453_QDMA_Performance_v6.pdf, we tested our design with 1024-byte packets, and even after slowing the rate down to 90 Gbps there is still some packet loss.
Contributor
221 Views
Registered: 01-13-2020

Hi @dlfc !

It's easy to try: generate the "speed example design" as described in https://www.xilinx.com/Attachment/Xilinx_Answer_71453_QDMA_Performance_v6.pdf p. 4, then add (* mark_debug = "true" *) in the code of xilinx_qdma_pcie_ep.sv near the QDMA port wires, then synthesize, add the debug core, etc., and run it in hardware. After that you may see the credits on the "tm_dsc" port.
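
For example (illustrative only; the exact wire names and widths in the generated design may differ):

// Tag some TM interface nets so they show up for ILA insertion.
(* mark_debug = "true" *) wire        tm_dsc_sts_vld;
(* mark_debug = "true" *) wire [10:0] tm_dsc_sts_qid;
(* mark_debug = "true" *) wire [15:0] tm_dsc_sts_avl;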
And, IMHO, in this example the desc_cnt_clr_comb signal needs to be corrected.
