12-09-2019 03:52 AM
Hi everybody,
I'm working with this testbed:
1) KCU116 board
2) Mellanox 5X on Debian Linux (AN/LT disabled, speed forced to 100G)
3) CMAC IP set with RS-FEC enabled
When I connect the transceiver, alignment on the KCU116 looks correct: stat_rx_status, stat_rx_aligned and synced are all OK, but I also receive stat_rx_remote_fault. (The eyes on the GTs are open; I checked them with the System IBERT IP.)
In this situation the link is down on the Mellanox side.
I read PG157 for the 10G Ethernet IP. It explains the bring-up process of a 10G link, and it seems that the bring-up sequence requires sending idles.
Can you please confirm that the 10G explanation also applies to the 100G CMAC IP?
PG203 v3 has no real reference to remote fault signaling, in particular to its use in the bring-up sequence (there is a description, but it is a tautological one). The bring-up sequence paragraph simply ends once alignment is reached, whereas it seems that exchanging rx_remote_fault and sending idles is a mandatory further step (the game is not over once alignment is reached).
Can you please confirm that the 100G core doesn't implement the MAC layer at all?
I reproduced this "problem" with 3 different designs:
1) Xilinx example design
2) custom design with AXI-Lite and AXI stream
3) Custom design with HDL code to control the CMAC and LBUS
So I think it is related to the protocol.
Could you please provide an in-depth description of the start-up sequence of the CMAC, and point me to a reference guide on how to handle the stat_rx_remote_fault signal?
Thank you very much!
01-16-2020 04:01 AM
I finally solved it!
The silly thing about the KCU116 is that TX on the SFP cages has to be enabled with jumpers on the board.
In particular, I followed the "Default Jumper Settings" section of https://www.xilinx.com/support/answers/69315.html.
Adding jumpers J16, J17, J42 and J54 enables TX, and I am finally able to send packets.
PS: It is still mandatory to control the reconciliation layer signals (RFI, idle, etc.).
12-10-2019 12:10 AM
When you receive a remote fault, it means the link partner (the Mellanox) has a local fault.
So you should keep sending IDLE until the link partner completes alignment and stops sending the remote fault to us.
12-12-2019 08:46 AM
Hi @guozhenp
I'm glad to hear from you!
I implemented an FSM to manage the reconciliation when a remote fault occurs.
Here is what happens at both rising edges of the stat_rx_aligned signal, as debugged with the ILA.
I asserted the signals as suggested in PG203: I asserted CTL_TX_SEND_RFI and CTL_RX_ENABLE and waited for STAT_RX_ALIGNED.
As soon as I get alignment I receive RX_REMOTE_FAULT from the Mellanox, so I assert CTL_TX_SEND_IDLE and deassert CTL_TX_SEND_RFI. (Please see the following ILA traces.)
The problem now is that some time after RX_REMOTE_FAULT goes to zero, I lose alignment.
As you suggested, as soon as RX_REMOTE_FAULT = 0 I stop sending idles and enable TX with CTL_TX_ENABLE; after that I lose RX_ALIGN.
These two situations repeat continuously:
RX_ALIGN = 0 - SEND_RFI on ->
RX_ALIGN = 1 - SEND_RFI off ->
REMOTE_FAULT rise - SEND_IDLE on ->
REMOTE_FAULT falls -> (TX_ENABLE - SEND_IDLE off) ->
RX_ALIGN = 0
and then I restart.
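The loop above can be written down as a small decision function. This is only a minimal behavioural sketch in Python (not HDL), assuming the policy discussed in this thread: send RFI until the receiver is aligned, send idles while the remote fault is present, and enable TX once it clears. The signal names are the ones used in the thread; `TxControls` and `tx_controls` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxControls:
    """Control inputs driven towards the CMAC (names as used in the thread)."""
    ctl_rx_enable: bool
    ctl_tx_send_rfi: bool
    ctl_tx_send_idle: bool
    ctl_tx_enable: bool


def tx_controls(stat_rx_aligned: bool, stat_rx_remote_fault: bool) -> TxControls:
    """Map the two RX status bits to the control signals (assumed policy)."""
    if not stat_rx_aligned:
        # Not aligned yet: keep RX enabled and signal a remote fault.
        return TxControls(True, True, False, False)
    if stat_rx_remote_fault:
        # Aligned, but the link partner still reports a fault: send idles.
        return TxControls(True, False, True, False)
    # Aligned and no remote fault: normal transmission.
    return TxControls(True, False, False, True)


# Walk through the sequence listed above.
for aligned, remote_fault in [(False, False), (True, True), (True, False)]:
    print(aligned, remote_fault, tx_controls(aligned, remote_fault))
```

Because the function depends only on the current status bits, a drop of RX_ALIGN automatically falls back to the SEND_RFI state, which matches the restart described above.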
It seems that I lose alignment as soon as I enable TX. What inside the CMAC could cause this loss of alignment?
I also tried without TX/RX flow control.
Thank you in advance.
12-12-2019 08:55 AM
One more thing:
as soon as I lose RX alignment, I assert GTWIZ_RESET_RX_DATAPATH for one clock cycle.
Thank you
12-12-2019 06:54 PM
Could you try asserting CTL_TX_ENABLE later, to confirm whether the loss of RX alignment is related to it?
Normally, TX and RX work separately. TX should not affect RX alignment.
Can you also see the link status on the link partner? What does it show?
Please add all the CMAC IP core input/output signals to the ILA, especially the status signals, stat_tx/rx_*
12-16-2019 09:43 AM
Thank you for your reply @guozhenp
The Mellanox reports "no link detected".
On the Xilinx side I captured the following ILA traces:
TRACE A
TRACE B
In TRACE A you can see that the stat_rx_aligned signal (yellow trace) rises, rx_remote_fault rises as well, and I send idles.
In TRACE B rx_remote_fault falls, but after a while stat_rx_hi_ber goes HIGH (red traces).
I lose the RS-FEC alignment lock a few cycles before that...
In my CMAC I enabled RS-FEC with the following parameters, i.e. full operation:
ctl_rsfec_ieee_error_indication_mode = 1
ctl_rx_rsfec_enable_indication = 1
ctl_rx_rsfec_enable_correction = 1
After 3 uncorrected_cw_inc pulses I get hi_ber.
According to the documentation, hi_ber means the channel is not well equalized, but in that case I shouldn't get alignment at all. Am I right?
I attached an in-system IBERT and I get the following eyes:
They are very ugly.
I also have a doubt: can I run the eye scan while my design is running?
What is the next step to debug this situation?
If the problem really is the high BER, can you suggest a flow for setting the TX pre/post cursor parameters and the right equalization?
Thank you in advance
Regards
12-16-2019 08:00 PM
When three consecutive uncorrected codeword errors occur, RS-FEC loses alignment, and the CMAC RX can no longer work.
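To make the "three consecutive" condition concrete, here is a minimal Python sketch that scans a per-codeword list of uncorrected flags (for example, reconstructed from an ILA capture) and reports where the third consecutive uncorrected codeword occurs. The function name and the input format are hypothetical and are not part of the CMAC IP; the threshold of 3 is taken from the statement above.

```python
from typing import Iterable, Optional


def first_lock_loss(uncorrected: Iterable[bool], threshold: int = 3) -> Optional[int]:
    """Return the index of the codeword that completes a run of `threshold`
    consecutive uncorrected codewords, or None if no such run exists."""
    run = 0
    for index, flag in enumerate(uncorrected):
        # A corrected (or clean) codeword resets the run counter.
        run = run + 1 if flag else 0
        if run >= threshold:
            return index
    return None


# Isolated uncorrected codewords do not trigger; three in a row do.
print(first_lock_loss([True, False, True, False, True]))
print(first_lock_loss([False, True, True, False, True, True, True]))
```

Isolated uncorrected codewords never reach the threshold in this model; only an unbroken run does.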
When you run in-system IBERT, the design should keep working at the same time.
How long after RS-FEC alignment do you get the uncorrected errors? Very soon? It looks like the link signal integrity (SI) is not good.
In any case, could you try our CMAC IP core example design first? Does the example show the same failure?
Can the link partner send/receive PRBS for testing? If so, you can run IBERT to test the link.
Is this 4x 25 Gbps? I think the GT RX side always uses DFE in auto mode.
12-20-2019 04:26 AM
Hi @guozhenp
"How long after RS-FEC alignment do you get the uncorrected errors? Very soon?"
I managed to measure the time:
from the rising edge of stat_rx_aligned to its falling edge it takes 96 us.
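For scale, the 96 us can be converted into a number of RS-FEC codewords. The sketch below is plain arithmetic in Python and rests on assumptions that are not confirmed in this thread: the standard 100GBASE-R RS(528,514) code with 10-bit symbols, and a line rate of 4 x 25.78125 Gb/s.

```python
from fractions import Fraction

# Assumptions (not confirmed in the thread): RS(528,514), 10-bit symbols,
# four lanes at 25.78125 Gb/s each.
SYMBOLS_PER_CODEWORD = 528
BITS_PER_SYMBOL = 10
LINE_RATE_BPS = 4 * Fraction("25.78125") * 10**9

# Measured in the thread: stat_rx_aligned stays high for 96 us.
ALIGNED_TIME_S = Fraction(96, 10**6)

codeword_bits = SYMBOLS_PER_CODEWORD * BITS_PER_SYMBOL
codeword_time_s = Fraction(codeword_bits) / LINE_RATE_BPS
codewords_while_aligned = ALIGNED_TIME_S / codeword_time_s

print(float(codeword_time_s * 10**9), "ns per codeword")
print(codewords_while_aligned, "codewords while aligned")
```

Under these assumptions a codeword lasts 51.2 ns, so the link stays aligned for 1875 codewords before alignment drops.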
"In any case, could you try our CMAC IP core example design first? Does the example show the same failure?"
I tried the Xilinx example design and I get the same results.
I see that the example design doesn't assert send_idle, yet stat_rx_remote_fault is deasserted anyway. So the reason the Mellanox stops sending the remote fault is not the idles it receives; at this point I think it stops because it has also detected a high BER...
"Can the link partner send/receive PRBS for testing?"
Yes, the Mellanox can send and receive PRBS. Could you please guide me through this test? Which signals do I have to change on the CMAC side?
To perform the test, do I have to stop managing the reconciliation layer signals such as send_idle and send_rfi?
On my side, both the custom designs and the Xilinx example work well in near-end loopback (010), where I get a 66% open eye margin in IBERT.
Thank you
PS: I opened another thread to investigate the compatibility of the FS vendor transceiver with the KCU116.
12-22-2019 01:09 AM
You can create a new IBERT example design in IP Catalog to test the link.
You can also create a new thread on it.
01-08-2020 02:03 AM
Hi @guozhenp ,
Sorry for the lack of communication over the holiday period.
I have a doubt regarding the GT reference clock.
Can you confirm that the clock frequency must be set to 161.1328 MHz?
I route it from the Si570.
I ask because leaving the frequency at 156.25 MHz shows the same hi_ber problem.
Thank you
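As a quick sanity check on the two frequencies mentioned above, the following Python sketch computes the ratio between an assumed per-lane rate of 25.78125 Gb/s and each reference clock. The lane rate is an assumption (the thread only asks whether the link is 4x 25 Gbps), 161.1328 MHz is taken to be the rounded form of 161.1328125 MHz, and `refclk_multiplier` is a hypothetical helper.

```python
from fractions import Fraction

# Assumed per-lane line rate in MHz (25.78125 Gb/s).
LANE_RATE_MHZ = Fraction("25781.25")


def refclk_multiplier(refclk_mhz: str) -> Fraction:
    """Ratio between the lane rate and a candidate reference clock."""
    return LANE_RATE_MHZ / Fraction(refclk_mhz)


for refclk in ("161.1328125", "156.25"):
    print(refclk, "MHz ->", refclk_multiplier(refclk))
```

Both ratios come out as whole numbers (160 and 165), so arithmetically either frequency can be related to the lane rate. What matters in practice is that the frequency programmed into the Si570 matches the reference clock selected when the CMAC IP was generated; that setting should be checked in the IP configuration.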