jeremy-iress
Visitor
689 Views
Registered: ‎01-29-2021

SolarFlare TX stuck with port_enabled=1: resetting channels


Hi there,

I'm experiencing the same issue as described here: https://www.xilinx.com/support/answers/75094.html but on a Solarflare 6000 series adapter and with only 16 TX queues:

41:00.0 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]
        Subsystem: Solarflare Communications SFN6122F-R7 SFP+ Server Adapter

As far as I can see, I have the latest available firmware:

 

Solarstorm firmware update utility [v7.5.2]
Copyright Solarflare Communications 2006-2018, Level 5 Networks 2002-2005

eth0 - MAC: 00-0F-53-22-6B-98
Firmware version: v7.5.2
Controller type: Solarflare SFC9000 family
Controller version: v3.3.2.1000
Boot ROM version: v5.2.1.1000

The Boot ROM firmware is up to date
The controller firmware is up to date

eth1 - MAC: 02-0F-C5-A4-2E-E4
Firmware version: v7.5.2
Controller type: Solarflare SFC9000 family
Controller version: v3.3.2.1000
Boot ROM version: v5.2.1.1000

The Boot ROM firmware is up to date
The controller firmware is up to date

eth2 - MAC: 02-0F-C5-A4-2E-E4
Firmware version: v7.5.2
Controller type: Solarflare SFC9000 family
Controller version: v3.3.2.1000
Boot ROM version: v5.2.1.1000

The Boot ROM firmware is up to date
The controller firmware is up to date

eth3 - MAC: 00-0F-53-22-71-B9
Firmware version: v7.5.2
Controller type: Solarflare SFC9000 family
Controller version: v3.3.2.1000
Boot ROM version: v5.2.1.1000

The Boot ROM firmware is up to date
The controller firmware is up to date

 

Can I still upgrade the firmware, or should I decrease the number of TX queues to 15 as in the referenced article? (32 logical CPUs => rss_cpus=31)

Thanks in advance for your help.


7 Replies
abrunnin
Xilinx Employee
675 Views
Registered: ‎03-31-2020

Hi Jeremy.

This card is not vulnerable to that specific issue; and the firmware is also too new to be vulnerable to it.

We'd need to see the whole error message to diagnose this - the "TX stuck" message on its own is quite generic.  It just means that the kernel requested a transmit, and timed out waiting for the completion to be acknowledged.

This could be due to a fault on the card, or a firmware problem, or a driver problem; or even just the kernel being too slow to acknowledge the completion event (perhaps due to realtime-priority processes).
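To check the realtime-priority possibility, one rough approach is to list processes running under the SCHED_FIFO or SCHED_RR policies. The sketch below is only an illustration and not a Solarflare or Xilinx tool. The function name `realtime_processes` is made up for this example. It assumes a Linux host with `/proc` mounted, and it inspects only each process's main thread, so realtime worker threads would need a similar walk over `/proc/<pid>/task`.

```python
import os

# Scheduling policies that can starve normal-priority kernel work on a CPU.
REALTIME_POLICIES = {os.SCHED_FIFO: "SCHED_FIFO", os.SCHED_RR: "SCHED_RR"}


def realtime_processes(proc_root="/proc"):
    """Return (pid, name, policy) for processes whose main thread is realtime."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        pid = int(entry)
        try:
            policy = os.sched_getscheduler(pid)
            with open(os.path.join(proc_root, entry, "comm")) as comm_file:
                name = comm_file.read().strip()
        except (ProcessLookupError, PermissionError, FileNotFoundError):
            continue  # process exited or is not accessible
        # The kernel may OR the reset-on-fork flag into the reported policy.
        policy &= ~os.SCHED_RESET_ON_FORK
        if policy in REALTIME_POLICIES:
            found.append((pid, name, REALTIME_POLICIES[policy]))
    return found


if __name__ == "__main__":
    for pid, name, policy in realtime_processes():
        print(f"{pid:>7} {policy:<10} {name}")
```

Kernel threads such as `migration/N` normally appear in this list and are expected. What would be worth a closer look is an application pinned at realtime priority to the same CPUs that service the adapter's interrupts.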

 

Please note, however, that the SFN6122F is end of life and has been out of support since 2019.  We are very limited in the amount of assistance we can give with these cards.

jeremy-iress
Visitor
653 Views
Registered: ‎01-29-2021

Hi abrunnin,

I know this model is old and out of support. If this turns out to be a hardware issue, I'll deal with it and replace the card.

Anyway, here is the full trace:

Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729579] ------------[ cut here ]------------
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729588] WARNING: CPU: 1 PID: 0 at /home/build/linux-4.4/net/sched/sch_generic.c:303 dev_watchdog+0xde/0x144()
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729590] NETDEV WATCHDOG: eth2 (sfc): transmit queue 5 timed out
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729591] Modules linked in: cpufreq_powersave cpufreq_userspace cpufreq_stats cpufreq_conservative 8021q garp mrp stp llc team_mode_loadbalance team onload(O) sfc_char(O) sfc_resource(O) sfc_affinity(O) x86_pkg_temp_thermal intel_powerclamp nls_utf8 nls_cp437 vfat coretemp fat kvm irqbypass crct10dif_pclmul crc32_pclmul aesni_intel snd_pcm aes_x86_64 sfc(O) lrw gf128mul snd_timer glue_helper ablk_helper cryptd snd soundcore joydev ptp efi_pstore pps_core dcdbas efivars mdio iTCO_wdt pcspkr hid_generic mtd iTCO_vendor_support evdev wmi rtc_cmos acpi_power_meter acpi_pad 8250_fintek button tpm_tis tpm sb_edac edac_core processor usbhid hid lpc_ich mei_me mfd_core mei ipmi_watchdog ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler autofs4 ext4 crc16 mbcache jbd2 sg sd_mod sr_mod cdrom crc32c_intel ahci libahci ehci_pci ehci_hcd libata megaraid_sas usbcore usb_common scsi_mod
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729638] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O    4.4.0-1-amd64 #1 Debian 4.4.0-spciq1
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729640] Hardware name: Dell Inc. PowerEdge R620/01W23F, BIOS 2.8.0 06/26/2019
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729641]  0000000000000006 ffffffff811e0f00 ffff88100f203e50 ffffffff810536c1
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729643]  ffffffff8133f1aa ffff881008e90000 ffff88100f203ea8 ffffffff8133f0cc
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729645]  ffff881008e903f8 ffffffff81053719 ffffffff815373ad ffff881000000030
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729647] Call Trace:
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729648]  <IRQ>  [<ffffffff811e0f00>] ? dump_stack+0x40/0x50
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729657]  [<ffffffff810536c1>] ? warn_slowpath_common+0x94/0xa9
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729658]  [<ffffffff8133f1aa>] ? dev_watchdog+0xde/0x144
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729660]  [<ffffffff8133f0cc>] ? netif_tx_unlock+0x42/0x42
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729662]  [<ffffffff81053719>] ? warn_slowpath_fmt+0x43/0x4b
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729666]  [<ffffffff813e3479>] ? _raw_spin_unlock_irqrestore+0x11/0x13
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729668]  [<ffffffff8133f07d>] ? netif_tx_lock+0x6c/0x79
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729670]  [<ffffffff8133f1aa>] ? dev_watchdog+0xde/0x144
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729674]  [<ffffffff8109368b>] ? call_timer_fn+0x30/0xe3
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729676]  [<ffffffff8133f0cc>] ? netif_tx_unlock+0x42/0x42
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729677]  [<ffffffff81093be2>] ? run_timer_softirq+0x193/0x1ba
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729680]  [<ffffffff810569e0>] ? __do_softirq+0xf8/0x279
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729684]  [<ffffffff8109d4d9>] ? clockevents_program_event+0xcd/0xe9
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729685]  [<ffffffff81056ce7>] ? irq_exit+0x52/0xbb
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729689]  [<ffffffff810399c4>] ? smp_apic_timer_interrupt+0x25/0x2f
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729691]  [<ffffffff813e4527>] ? apic_timer_interrupt+0x87/0x90
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729692]  <EOI>  [<ffffffff812fe8cc>] ? cpuidle_enter_state+0x13a/0x18f
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729698]  [<ffffffff812fe885>] ? cpuidle_enter_state+0xf3/0x18f
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729702]  [<ffffffff8107f703>] ? cpu_startup_entry+0x17f/0x1f3
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729703]  [<ffffffff81038025>] ? start_secondary+0xff/0x101
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729705] ---[ end trace 9f5d47a4e326b622 ]---
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729708] sfc 0000:42:00.0 eth2: TX queue timeout: printing stopped queue data
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729710] sfc 0000:42:00.0 eth2: Channel 5: enabled Busy poll 0x0 NAPI state 0x8 Doorbell not held not coalescing
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729712] sfc 0000:42:00.0 eth2: Tx queue: insert 2, write 2, read 2
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.729713] sfc 0000:42:00.0 eth2: Tx queue: insert 1190247, write 1190247, read 1190133
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.730083] [sfc efrm] efrm_dl_reset_suspend:
Jan 27 04:20:01 hkghke-a1715-32 kernel: [ 2916.730087] sfc 0000:42:00.0 eth2: resetting (RECOVER_OR_ALL)
Jan 27 04:20:07 hkghke-a1715-32 kernel: [ 2922.348118] sfc 0000:42:00.0 eth2: link up at 10000Mbps full-duplex (MTU 1500)
Jan 27 04:20:07 hkghke-a1715-32 kernel: [ 2922.359256] [sfc efrm] efrm_dl_reset_resume: ok=1
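
The "Tx queue: insert ..., write ..., read ..." lines in the trace above are the most informative part of it. As a minimal sketch, the counters can be pulled out and the backlog computed as follows. This assumes that `write` counts descriptors handed to the adapter and `read` counts completions the driver has processed, so that `write - read` is the number of transmits still waiting for a completion. The helper name `outstanding_descriptors` is made up for this example, and counter wrap-around is ignored.

```python
import re

# Matches the queue dump that the sfc driver prints on a TX timeout.
TX_QUEUE_RE = re.compile(r"Tx queue: insert (\d+), write (\d+), read (\d+)")


def outstanding_descriptors(log_text):
    """Return (insert, write, read, write - read) for each 'Tx queue' line."""
    rows = []
    for match in TX_QUEUE_RE.finditer(log_text):
        insert, write, read = (int(group) for group in match.groups())
        rows.append((insert, write, read, write - read))
    return rows


# Two lines copied from the trace in this post.
sample = """\
sfc 0000:42:00.0 eth2: Tx queue: insert 2, write 2, read 2
sfc 0000:42:00.0 eth2: Tx queue: insert 1190247, write 1190247, read 1190133
"""

for insert, write, read, pending in outstanding_descriptors(sample):
    status = "idle" if pending == 0 else f"{pending} awaiting completion"
    print(f"insert={insert} write={write} read={read}: {status}")
```

Under that reading, the second queue had 114 transmits posted to the card for which no completion had been processed when the watchdog fired.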

 

abrunnin
Xilinx Employee
638 Views
Registered: ‎03-31-2020

That is still just the backtrace from the kernel's transmit watchdog firing.  Was there firmware output nearby showing a list of register contents ("R00/R01/R02" etc.)?

jeremy-iress
Visitor
634 Views
Registered: ‎01-29-2021

I don't see anything related in the logs.

abrunnin
Xilinx Employee
624 Views
Registered: ‎03-31-2020

That suggests that the issue is probably not coming from the firmware, but rather from the kernel not seeing the completion in time.  That could be a problem with the driver, or it could be that the server was not able to pass on the event in time (which could be due to an overload of serial logging, or to real-time priority processes).

If you have not already, I would suggest updating to the latest version of the driver (which can be downloaded from https://support-nic.xilinx.com/ )

You might also like to raise a support ticket so that we can properly investigate this - but as I previously mentioned, this card is out of support.


jeremy-iress
Visitor
487 Views
Registered: ‎01-29-2021

I have upgraded to the latest onload driver; we were previously using openonload driver version 201811.
I'll check whether the issue persists over the next few days. If it does, I'll consider replacing the card.

Thank you for your help.

jeremy-iress
Visitor
447 Views
Registered: ‎01-29-2021

Sadly, the issue persists, so I'm going to replace the card.

[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: TX queue timeout: printing stopped queue data
[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: Channel 7: enabled Busy poll 0x0 NAPI state 0x8 Doorbell not held not coalescing
[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: Tx queue: insert 34, write 34, read 34
[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: Tx queue: insert 375588, write 375588, read 375503
[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: TX stuck with port_enabled=1: resetting channels
[Tue Feb  2 04:33:44 2021] [onload] oo_dl_reset_suspend:
[Tue Feb  2 04:33:44 2021] [sfc efrm] efrm_dl_reset_suspend:
[Tue Feb  2 04:33:44 2021] sfc 0000:42:00.0 eth2: resetting (RECOVER_OR_ALL)
[Tue Feb  2 04:33:45 2021] [sfc efrm] efrm_dl_reset_resume: ok=1
[Tue Feb  2 04:33:45 2021] [onload] oo_dl_reset_resume:


