Adventurer
Registered: ‎07-27-2018

US+ 100G rx_remote_fault in real scenario link down

Hi everybody,

I'm working with this testbed:

1) KCU116 board

2) Mellanox 5X on Debian Linux (AN/LT disabled, speed forced to 100G)

3) CMAC IP set with RS-FEC enabled

When I connect the transceiver, the KCU side aligns correctly: stat_rx_status, stat_rx_aligned and the synced flags are all OK, but I receive stat_rx_remote_fault. (The GT eyes are open; I checked them with the System IBERT IP.)

In this situation the link is down on the Mellanox side.

I read PG157 for the 10G Ethernet IP. It explains the bring-up process of a 10G link, and it seems that the bring-up sequence requires sending idles.

Can you please confirm that the 10G explanation also applies to the 100G CMAC IP?

PG203 v3 has no real reference to remote fault signaling, in particular to its role in the bring-up sequence (there is a description, but it is tautological). The bring-up sequence paragraph simply ends once alignment is reached, whereas it seems that exchanging rx_remote_fault and sending idles is a mandatory further step (the game is not over once alignment is reached).

Can you please confirm that the 100G IP does not implement the MAC layer at all?

I reproduced this problem with 3 different designs:

1) Xilinx example design

2) Custom design with AXI-Lite and AXI-Stream

3) Custom design with HDL code to control the CMAC and LBUS

So I think it is related to the protocol.

Could you please provide an in-depth description of the start-up sequence of the CMAC and point me to a reference guide on how to handle the stat_rx_remote_fault signal?

Thank you very much!

 

 

Accepted Solutions
Adventurer

I finally solved it!

The silly thing about the KCU116 is that the TX side of the SFP cages has to be enabled with jumpers on the board.

Specifically, I followed the section "Default Jumper Settings" of https://www.xilinx.com/support/answers/69315.html:

fitting J16, J17, J42 and J54 enables TX, and I am finally able to send packets.

PS: It is still mandatory to control the reconciliation layer signals (RFI, idle, etc.).

10 Replies
Xilinx Employee
Registered: ‎05-01-2013

When you receive a remote fault, it means that the link partner (the Mellanox card) has a local fault.

So you should keep sending IDLE until the link partner completes its alignment and stops sending remote fault to us.

Adventurer

Hi @guozhenp 

I'm glad to hear from you!

I implemented an FSM to manage the reconciliation signals when a remote fault occurs.

Below I show what happens at both edges of the stat_rx_aligned signal, captured with ILA.

I asserted the signals as suggested in PG203: CTL_TX_SEND_RFI and CTL_RX_ENABLE, then waited for STAT_RX_ALIGNED.

As soon as I get alignment I receive RX_REMOTE_FAULT from the Mellanox, so I assert CTL_TX_SEND_IDLE and deassert CTL_TX_SEND_RFI (please see the following ILA traces).

 

align_rising.png

Now the problem is that, a while after RX_REMOTE_FAULT goes to zero, I lose alignment.

As you suggested, as soon as RX_REMOTE_FAULT = 0 I stop sending idles and enable TX with CTL_TX_ENABLE; after that I lose RX_ALIGN.

 

align_falling.png

 

I get these 2 situations continuously:

RX_ALIGN = 0 - SEND_RFI on ->

RX_ALIGN = 1 - SEND_RFI off ->

REMOTE_FAULT rises - SEND_IDLE on ->

REMOTE_FAULT falls -> (TX_ENABLE - SEND_IDLE off) ->

RX_ALIGN = 0

and then the cycle restarts.
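
The loop described above can be written down as a small decision function. The following is a minimal Python sketch, not the poster's HDL and not code from PG203: the function name `bringup_controls` and the `TxControls` record are hypothetical, and the signal names are the ones used in this thread. It assumes the three TX controls are mutually exclusive and that CTL_RX_ENABLE is held high throughout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TxControls:
    """Hypothetical record of the three TX control inputs discussed in the thread."""
    ctl_tx_send_rfi: bool
    ctl_tx_send_idle: bool
    ctl_tx_enable: bool


def bringup_controls(stat_rx_aligned: bool, stat_rx_remote_fault: bool) -> TxControls:
    """Map the two RX status flags to TX controls (illustrative model only)."""
    if not stat_rx_aligned:
        # Local RX is not aligned: report it to the partner with RFI.
        return TxControls(ctl_tx_send_rfi=True, ctl_tx_send_idle=False, ctl_tx_enable=False)
    if stat_rx_remote_fault:
        # We are aligned but the partner still reports a fault: send idles.
        return TxControls(ctl_tx_send_rfi=False, ctl_tx_send_idle=True, ctl_tx_enable=False)
    # Both directions look healthy: enable normal transmission.
    return TxControls(ctl_tx_send_rfi=False, ctl_tx_send_idle=False, ctl_tx_enable=True)


# Replay a status sequence like the one in the post: (aligned, remote_fault).
observed = [(False, False), (True, True), (True, False), (False, False)]
for aligned, remote_fault in observed:
    print(aligned, remote_fault, bringup_controls(aligned, remote_fault))
```

Because the decision depends only on the current status flags, the restart in the post corresponds to the last entry: once stat_rx_aligned drops, the model falls back to sending RFI.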

It seems I lose alignment as soon as I enable TX. What inside the CMAC could cause this loss of alignment?

I also tried without TX/RX flow control.

Thank you in advance.

 

Adventurer

One more thing: as soon as I lose RX alignment I assert GTWIZ_RESET_RX_DATAPATH for one clock cycle.

Thank you

Xilinx Employee

Could you try asserting CTL_TX_ENABLE later, to confirm that the loss of RX alignment is related to it?

Normally, TX and RX work independently; TX should not affect RX alignment.

And can you see the link status on the link partner? What is it?

Please add all the CMAC IP core input/output signals to the ILA, especially the status signals stat_tx_* and stat_rx_*.

Adventurer

Thank you for your reply  @guozhenp 

The Mellanox reports "no link detected".

From the Xilinx side I captured the following ILA traces:

TRACE A

 

alignment_done.png

TRACE B

 

hi_ber.png

In TRACE A you can see stat_rx_aligned (yellow trace) rise; rx_remote_fault rises as well, and I send idles.

In TRACE B rx_remote_fault falls, but after a while stat_rx_hi_ber goes HIGH (red traces).

I lose the RS-FEC alignment lock a few cycles earlier.

In my CMAC I enabled RS-FEC with the following parameters, i.e. full operation:

ctl_rsfec_ieee_error_indication_mode = 1
ctl_rx_rsfec_enable_indication = 1
ctl_rx_rsfec_enable_correction = 1

After 3 uncorrected_cw_inc pulses I get hi_ber.

According to the documentation, hi_ber means that the channel is not well equalized; but if that were the case I shouldn't get alignment at all, am I right?

I attached an in-system IBERT and got the following eyes:

 

eye_after_hi_ber.png

They are very ugly.

I also have a question: can I run the eye scan while my design is running?

What is the next step to debug this situation?

If the problem really is the high BER, can you suggest a flow for setting the tx/pre/post parameters and the right equalization?

 

Thank you in advance

Regards

Xilinx Employee

When you get 3 consecutive uncorrected codeword errors, RS-FEC loses alignment, and the CMAC RX can no longer work.

When you run in-system IBERT, the design should keep working at the same time.

How soon after RS-FEC aligns do you get the uncorrected errors? Very soon? It looks like the signal integrity (SI) of the link is not good.

In any case, could you try our CMAC IP core example design first? Does the example show the same failure?

Can the link partner send/receive PRBS for testing? If so, you can run IBERT to test the link.

Is this 4x 25 Gbps? I think the GT RX side always enables DFE auto.
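
The rule stated in this reply (three consecutive uncorrected codewords cause RS-FEC to lose alignment) can be modelled with a simple counter. This is a minimal Python sketch for reasoning about ILA captures, not the CMAC implementation; the function name and the list-of-flags input are assumptions, with each entry standing for one decoded codeword and True meaning "uncorrected".

```python
from typing import Iterable, Optional

LOSS_THRESHOLD = 3  # consecutive uncorrected codewords, as stated in the reply


def first_alignment_loss(uncorrected: Iterable[bool],
                         threshold: int = LOSS_THRESHOLD) -> Optional[int]:
    """Return the index of the codeword at which alignment would drop, or None."""
    run = 0
    for index, is_uncorrected in enumerate(uncorrected):
        # A clean or corrected codeword resets the run of failures.
        run = run + 1 if is_uncorrected else 0
        if run >= threshold:
            return index
    return None


# Isolated errors do not trip the rule; three in a row do.
print(first_alignment_loss([True, False, True, True, False]))  # None
print(first_alignment_loss([False, True, True, True, False]))  # 3
```

In this model isolated uncorrected codewords are harmless to alignment, which is why counting consecutive uncorrected_cw_inc pulses in the ILA is more telling than counting them in total.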

Adventurer

Hi @guozhenp 

"How long time do you get the uncorrected error after RSFEC is aligned? Very soon?"

I managed to measure the time:

from the rising edge of stat_rx_aligned to its falling edge it takes 96 µs.
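
To put the 96 µs figure in perspective, it can be converted into a number of RS-FEC codewords. The sketch below assumes the standard 100GBASE-R figures, which are not stated in the thread: four lanes at 25.78125 Gb/s and RS(528,514) codewords of 528 ten-bit symbols. The function name is hypothetical.

```python
from fractions import Fraction

LANES = 4
LANE_RATE_BPS = 25_781_250_000          # assumed: 25.78125 Gb/s per lane
CODEWORD_BITS = 528 * 10                # assumed: RS(528,514), 10-bit symbols

LINE_RATE_BPS = LANES * LANE_RATE_BPS   # 103.125 Gb/s aggregate


def codewords_in(duration_s: Fraction) -> Fraction:
    """Number of RS-FEC codewords transferred in duration_s seconds."""
    return duration_s * LINE_RATE_BPS / CODEWORD_BITS


aligned_time = Fraction(96, 1_000_000)  # 96 us, as measured in the post
print(codewords_in(aligned_time))       # 1875
```

Under these assumptions each codeword lasts 51.2 ns, so the link stays aligned for about 1875 codewords before three consecutive ones fail.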

"But anyhow, could you have a try on our CMAC IP core example design first? Does the example have the same failure?"

I tried the Xilinx example design and I get the same results.

I see that the example design does not assert send_idle; nonetheless stat_rx_remote_fault is deasserted. So the reason the Mellanox stops signaling remote fault is not the idles it received; at this point I think it stops because it has detected a high BER as well.

"Can the link partner send/receive PRBS for testing?"

Yes, the Mellanox can send and receive PRBS. Could you please guide me through this test? Which signals do I have to drive on the CMAC side?

To perform the test, do I have to stop managing the reconciliation layer signals such as send_idle and send_rfi?

On my side, both the custom designs and the Xilinx example work well in near-end loopback (010), where IBERT shows 66% open eye margin.

Thank you

 

PS: I opened another thread to investigate the compatibility of FS transceivers with the KCU116:

https://forums.xilinx.com/t5/Ethernet/KCU116-which-optical-transceiver-use-for-4xzSFP/m-p/1057359#M17660

Xilinx Employee

You can create a new IBERT example design in IP Catalog to test the link.

You can also create a new thread on it. 

Adventurer

Hi @guozhenp ,

Sorry for the lack of communication during the vacation period.

I have a question about the GT reference clock.

Can you confirm that the clock frequency must be set to 161.1328 MHz?

I source it from the Si570.

I ask because leaving the frequency at 156.25 MHz shows the same hi_ber problem.
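
As a sanity check on the two frequencies mentioned, the arithmetic below shows how each relates to the lane rate. It is a sketch under an assumption not stated in the thread: each GT lane runs at 25.78125 Gb/s (4 x 25G for 100G). Both values divide the lane rate by an integer, so which one is correct depends on the reference clock frequency selected when the IP was generated; the helper names are hypothetical.

```python
from fractions import Fraction

LANE_RATE_MHZ = Fraction("25781.25")  # assumed lane rate: 25.78125 Gb/s


def lane_to_refclk_ratio(refclk_mhz: Fraction) -> Fraction:
    """Ratio between the lane rate and a reference clock frequency."""
    return LANE_RATE_MHZ / refclk_mhz


def ppm_error(actual_mhz: Fraction, expected_mhz: Fraction) -> Fraction:
    """Frequency offset of actual versus expected, in parts per million."""
    return (actual_mhz - expected_mhz) / expected_mhz * 1_000_000


print(lane_to_refclk_ratio(Fraction("161.1328125")))  # 160
print(lane_to_refclk_ratio(Fraction("156.25")))       # 165
# Offset if the oscillator outputs 156.25 MHz while 161.1328125 MHz is expected.
print(float(ppm_error(Fraction("156.25"), Fraction("161.1328125"))))
```

Feeding 156.25 MHz to a design generated for 161.1328125 MHz gives an offset of about 3%, while Ethernet links are specified with a clock tolerance on the order of ±100 ppm, so the programmed Si570 frequency is worth checking against the value configured in the IP.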

 

Thank you

 
