04-27-2020 06:03 PM
I am using PCIe-QDMA on custom hardware, and the firmware is developed using Vivado 2019.2. I am using H2C and C2H streaming modes, and C2H mode uses completion entry write back. I am referring to the Xilinx QDMA example designs for my logic development.
I can see in the example code that for C2H, the completion write back is performed with HAS_PLD set and is done before the C2H transfer is initiated. But this doesn't work for my use case as I need to extract some information from the C2H packet and include that in the completion entry. And I want to avoid buffering up the C2H packet just for this purpose. So I redesigned that part of the logic to initiate a C2H transfer first, and then do the completion write back (with HAS_PLD set) after the C2H transfer is finished.
This largely seems to work. However, the host at times does not see some completion entry write backs. I have some counters running in software to track this. It gets worse when I am forwarding a lot of packets to the host with minimum delay between C2H packets.
So is this supported by QDMA? And is there any timing relation between the C2H transfer and the completion entry? I couldn't find anything relevant in pg302. Thanks for the help.
04-28-2020 03:33 PM
Your sequence of CMPT and C2H packet is expected to work fine.
What is the value read in QDMA_C2H_WRB_COAL_CFG register?
What is the CMPT buffer size set in the IP GUI? If it is set to "16", can you try using "32"?
Please share the completion context settings used with the design.
04-28-2020 10:50 PM
Increasing the buffer size to 32 didn't help; it doesn't get any better. Values for some of the registers are below (all values are in hex):
CMPT == 00008000 10001003 f8000000 00d81f80 18000007
QDMA_C2H_WRB_COAL_CFG == 80050010
QDMA_C2H_CMPT_COAL_BUF_DEPTH == 00000020
CMPT is the completion context structure. I am using completion entry size of 32B.
04-28-2020 11:55 PM
Another observation. If I add a delay between packets at the input stream - usleep(100) - all seems to work well.
So my theory is, and I could be wrong, that when a lot of packets are forwarded to QDMA, all descriptors in the C2H descriptor ring get used up. Before software has a chance to populate them again, QDMA sends more packets. And those packets disappear (to somewhere).
What happens when QDMA runs out of C2H descriptors? Does it apply some sort of back pressure to the user logic by dropping c2h_tready (I would have expected so)? Or are there other mechanisms? My observation (by looking at tready) is that QDMA is almost always ready to accept more packets.
04-30-2020 03:54 AM - edited 04-30-2020 06:08 AM
Yes, c2h_tready provides backpressure to user logic. We have also observed that the C2H pipeline can stall (c2h_tready=0), but I don't know why.
Long backpressure on h2c_tready can cause driver write problems.
05-06-2020 10:39 PM
Sorry for the delayed response. For some reason, I am not getting email notifications for updates to my forum posts.
Thanks @dmsspb, that's an interesting capture. Have you enabled fetch credit in the driver for this case? For my use case, fcrd_en is set to 0 in software.
I do sometimes see c2h_tready dropping to 0 between transfers. We have kept a small C2H ring size in software and the packet drops occur exactly when all the c2h descriptors are used up. Software requires some time to repopulate that and I was expecting c2h_tready to drop to 0 during this time. So that user logic won't send any more packets until the ring is populated again. Sadly I don't see c2h_tready going low for that duration and the packets just disappear.
I haven't tried applying back pressure on H2C, so I can't comment on that. But regarding stalling C2H pipeline, any tips on reproducing that? I may want to avoid getting into that state even accidentally.
05-07-2020 03:38 AM
05-27-2020 08:55 PM
05-28-2020 12:40 AM
@dlfcno I haven't figured out how to make QDMA apply backpressure via tready. I am still having this issue. But I do have a workaround. I added logic to monitor the traffic manager interface and stop sending packets to QDMA when credits run out, though I am not convinced why QDMA can't handle this by itself. Refer to page 36 of pg302 v3.0 for the TM interface and also have a look at the ports on page 109.
I saw that this interface broadcasts the available credits in a queue (via tm_dsc_sts_avl) when a queue is initialized or when the queue PIDX is updated by software. All I did was store this information (available credits) in block RAM, with one entry per queue. Before sending packets to QDMA, I read and check the entry corresponding to the QID. I decrement the credit when a packet is sent to QDMA. When the value becomes 0, I stop sending packets to QDMA. You also want to update the credits when another TM update happens so that you can send more packets.
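The credit bookkeeping described above can be sketched as a small behavioral model. This is only an illustration in Python, not RTL and not code from any Xilinx example. The class and method names are hypothetical, and it assumes that each tm_dsc_sts_avl update reports newly available descriptors that should be added to the stored count; check PG302 for the exact semantics, including queue invalidation, which this sketch ignores.

```python
class CreditTracker:
    """Behavioral model of per-queue C2H credit tracking (one entry per QID)."""

    def __init__(self, num_queues):
        # Stands in for the block RAM: one credit counter per queue.
        self.credits = [0] * num_queues

    def on_tm_update(self, qid, avl):
        # Assumption: 'avl' is the number of newly posted descriptors,
        # so it is accumulated rather than overwriting the stored value.
        self.credits[qid] += avl

    def can_send(self, qid, needed=1):
        # A packet may be sent only if enough credits remain for it.
        return self.credits[qid] >= needed

    def send(self, qid, needed=1):
        # Consume credits for one packet; refuse when the queue has run out.
        if not self.can_send(qid, needed):
            return False
        self.credits[qid] -= needed
        return True


if __name__ == "__main__":
    tracker = CreditTracker(num_queues=4)
    tracker.on_tm_update(qid=1, avl=2)  # software posted two descriptors
    print(tracker.send(1), tracker.send(1), tracker.send(1))
```

In hardware the same idea becomes a read-modify-write on the RAM entry for the current QID, with the user logic holding off tvalid while `can_send` would be false.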
05-28-2020 04:19 AM - edited 05-28-2020 04:46 AM
You are right, the FPGA should count QDMA tm_dsc credits for C2H transfers. One credit covers up to 4096 bytes (1 descriptor) in the QDMA "speed project". PG302 says that the C2H packet size should be <= 7 descriptors (7*4096 bytes) for Vivado 2018.3, and <= 31 descriptors for newer Vivado versions.
We don't check that, because C2H packets of up to 4096 bytes are enough for us.
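For designs that do need the check, the arithmetic is simple. The sketch below assumes 4096 bytes per descriptor, as in the "speed project" mentioned above; the function names are hypothetical, and the limits of 7 and 31 descriptors are the ones quoted in this post.

```python
DESC_BYTES = 4096  # assumption: one descriptor (one credit) covers 4096 bytes


def descriptors_needed(pkt_len, desc_bytes=DESC_BYTES):
    """Number of descriptors (credits) a C2H packet of pkt_len bytes consumes."""
    if pkt_len < 0:
        raise ValueError("packet length must be non-negative")
    # Ceiling division; even a very short packet uses one descriptor.
    return max(1, -(-pkt_len // desc_bytes))


def packet_fits(pkt_len, max_desc, desc_bytes=DESC_BYTES):
    """True if the packet stays within the per-packet descriptor limit."""
    return descriptors_needed(pkt_len, desc_bytes) <= max_desc


if __name__ == "__main__":
    for length in (64, 1500, 4096, 4097, 7 * 4096):
        print(length, descriptors_needed(length), packet_fits(length, max_desc=7))
```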
C2H signals during packet transmission:
s_axis_c2h_tdata = our data
s_axis_c2h_dpar = parity check for the current beat of tdata. Code taken from the example project
s_axis_c2h_ctrl_len = length of the packet. Set at the start of the packet and held during packet transmission. The length is computed using a FIFO in packet mode.
s_axis_c2h_ctrl_qid = queue ID. hold during packet
s_axis_c2h_ctrl_has_cmpt = 1 always
s_axis_c2h_ctrl_marker = 0 always
s_axis_c2h_ctrl_port_id = 0 always. port_id is not documented. We found that H2C port_id=0 always, for all PFs
s_axis_c2h_mty = empty bytes in the last beat, = 0 otherwise
s_axis_c2h_tready use for backpressure
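As a quick check of the s_axis_c2h_ctrl_len and s_axis_c2h_mty values listed above, the number of beats and the empty-byte count of the last beat can be computed as follows. This is a sketch that assumes a 64-byte (512-bit) data bus; the function name and the default bus width are illustrative and should be adjusted to the actual IP configuration.

```python
def beats_and_mty(pkt_len, bus_bytes=64):
    """Return (number of beats, empty bytes in the last beat) for one packet."""
    if pkt_len <= 0:
        raise ValueError("packet length must be positive")
    beats = -(-pkt_len // bus_bytes)                 # ceiling division
    mty = (bus_bytes - pkt_len % bus_bytes) % bus_bytes  # 0 if last beat is full
    return beats, mty


if __name__ == "__main__":
    print(beats_and_mty(1024))  # (16, 0)
    print(beats_and_mty(1500))  # (24, 36)
```

mty is driven with this value only on the beat that carries tlast and is 0 on all other beats, as noted in the list.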
The example project uses an 8-byte completion entry for C2H packets. The QDMA completion port has an internal buffer of only 2 packets (p48 PG302 v3.0, and we confirmed it with ILA). So the example project uses a 2048-beat FIFO in front of this port (PG302 recommends a FIFO of at least 512 completion entries).
s_axis_c2h_cmpt_tdata = [63:20] user metadata, [19:4] packet length, =1 descriptor used (according to FPGA/driver examples), [2:0]=0, but in the CMPT descriptor we see [3:0]=0xA.
s_axis_c2h_cmpt_size = 0 always (use 8 byte completer)
s_axis_c2h_cmpt_dpar = code from example project. Parity for every 32 bits
s_axis_c2h_cmpt_ctrl_qid = QID == packet QID
s_axis_c2h_cmpt_ctrl_marker = 0 always
s_axis_c2h_cmpt_ctrl_user_trig =0 always
s_axis_c2h_cmpt_ctrl_cmpt_type = 2'b11 always, HAS_PLD
s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id = 1...65535 after reset, then 0...65535 during operation
s_axis_c2h_cmpt_ctrl_port_id = 0 always
s_axis_c2h_cmpt_ctrl_col_idx = 0
s_axis_c2h_cmpt_ctrl_err_idx = 0
s_axis_c2h_cmpt_tvalid = asserted for 1 clock cycle, on the cycle immediately after c2h_tlast. The code with the FIFO is taken from the "speed example"
s_axis_c2h_cmpt_tready for backpressure, connected to CMPT FIFO
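The 8-byte completion layout given for s_axis_c2h_cmpt_tdata above ([63:20] user metadata, [19:4] packet length, low bits [3:0]) can be illustrated with a small pack/unpack helper. This is a sketch only: the function names are hypothetical, and the meaning of the individual low bits is not spelled out in this thread, so the low nibble is passed through as an opaque value.

```python
def pack_cmpt(user_meta, pkt_len, low_nibble=0):
    """Pack a 64-bit completion word: [63:20] metadata, [19:4] length, [3:0] low bits."""
    if not 0 <= user_meta < (1 << 44):
        raise ValueError("user metadata must fit in 44 bits")
    if not 0 <= pkt_len < (1 << 16):
        raise ValueError("packet length must fit in 16 bits")
    if not 0 <= low_nibble < (1 << 4):
        raise ValueError("low nibble must fit in 4 bits")
    return (user_meta << 20) | (pkt_len << 4) | low_nibble


def unpack_cmpt(word):
    """Split a 64-bit completion word back into (metadata, length, low nibble)."""
    return (word >> 20) & ((1 << 44) - 1), (word >> 4) & 0xFFFF, word & 0xF


if __name__ == "__main__":
    word = pack_cmpt(user_meta=0xABC, pkt_len=1024, low_nibble=0xA)
    print(hex(word), unpack_cmpt(word))
```

A helper like this is mainly useful on the host side, for decoding entries read back from the completion ring and comparing them with what the user logic intended to write.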
tm_dsc port. Get/invalidate descriptors for C2H (dir=C2H, Stream, QID, q_en). Port ignored for H2C.
AXIS status port - not used
H2C packets <= 1 descriptor 4096 bytes.
C2H packets <= 1 descriptor 4096 bytes.
Each C2H/H2C packet consumes one descriptor regardless of length (64 bytes or 1500 bytes).
05-28-2020 07:19 PM
Thanks a lot for your reply.
The C2H packet size you mentioned above is a little different from our case. In our test, we send a continuous 100Gbps stream of 1024-byte packets, and after the test has run for some time we find that some packets are lost.
In the qdma's axis stream status ports, there is a signal named "axis_c2h_status_drop", its definition in pg302 is shown below:
We monitored both "axis_c2h_status_drop" and "s_axis_c2h_cmpt_tready" in ILA, and we did capture "axis_c2h_status_drop" going high while "s_axis_c2h_cmpt_tready" was also high. Given the definition, "PCIe drops the packet if it does not have either sufficient data buffer to store a C2H packet or does not have enough descriptors to transfer the full packet to the host.", we believe that some of the queues had run out of descriptors.
Regarding the signal "s_axis_c2h_cmpt_ctrl_wait_pld_pkt_id": in our test, we think it should start at 0, count up to 65535, and then roll over to 0. If we follow the example and start at 1, the first packet is always lost.
05-29-2020 03:29 AM - edited 05-29-2020 03:53 AM
According to PG302, the packet ID should start from 1 after reset and then count 1...65535, 0, 1 ... See the code from the Xilinx "speed" example design. We see no first-packet loss.
If you are losing packets, then IMHO something may be wrong with the C2H descriptors and C2H buffers. Try to analyse the C2H descriptor context step by step, packet by packet. It contains information about consumed and available descriptors, prefetched credits, and errors.
Then analyse the QDMA BAR 0 register dump, looking at the registers with "err" in their names.
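The counting sequence described here (1 after reset, up to 65535, then 0, 1, ...) is just a 16-bit counter with a reset value of 1. A minimal behavioral sketch, with a hypothetical class name, looks like this:

```python
class PktIdCounter:
    """16-bit packet ID counter that starts at 1 after reset and wraps through 0."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Reset value is 1, per the sequence described in this post.
        self.value = 1

    def next(self):
        # Return the ID for the current packet, then advance modulo 2**16.
        current = self.value
        self.value = (self.value + 1) & 0xFFFF
        return current


if __name__ == "__main__":
    counter = PktIdCounter()
    print([counter.next() for _ in range(3)])  # [1, 2, 3]
```

Only the very first pass skips 0; after the first wrap the counter runs through the full range 0...65535.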
c2h_len = 1024 bytes no problem.
I have no idea about ("axis_c2h_status_drop"=1 & "s_axis_c2h_cmpt_tready"=1). Did you count the "tm_dsc port" descriptors? What about the available descriptors in the C2H context at that time? Does the descriptor count in the C2H context match the descriptors received through "tm_dsc" and the consumed descriptors?
About the pictures: this is the Xilinx "speed example" project. t is the time defined by the CMPT FIFO in-out latency, and it matches the c2h_tlast timings. Two C2H packets of 64 bytes each.
P.S. What throughput did you get at 100Gbps on 8 queues via DPDK?
05-31-2020 07:37 PM
06-01-2020 01:56 AM
Hi @dlfc !
It's easy to generate a "speed example design" as described in https://www.xilinx.com/Attachment/Xilinx_Answer_71453_QDMA_Performance_v6.pdf p.4. Then add (* mark_debug = "true" *) near the QDMA port wires in xilinx_qdma_pcie_ep.sv, synthesize, set up debug, etc., and run it in hardware. After that you can see the credits from the "tm_dsc" port.
And, IMHO, the desc_cnt_clr_comb signal in this example needs to be corrected.
08-04-2020 11:43 PM
I was working on a similar application based on the AXI_MM example design with continuous (back-to-back) transfers, and saw some timeout errors. Recently I tried the same on 2020.1 with the latest driver, and it seems to be working with no timeout errors. You may want to try that.