Observer
6,942 Views
Registered: ‎08-09-2013

xilinx_emacps problems - losing access to gem registers

We are working on our Zynq-based board ( http://blog.elphel.com ) and I had to get into the network driver - it did not work out of the box with the Atheros AR8035 we have. I spent some time troubleshooting the problems, but then I tried the same test on the Microzed with the factory image and it performed even worse. Here is the script I used:

------------
#!/bin/sh
 IP="192.168.0.15"
 fails=0
 tries=0
 while true
 do
   ifconfig eth0 down
   ifconfig eth0 up
   ping -c 1 -w 10 $IP >/dev/null
   if [ $rslt != 0 ] ; then
       echo "Waiting more..." # Sometimes ping returns "network not available" too early
       sleep 5
       ping -c 1 -w 10 $IP >/dev/null
   fi
   fails=$(($fails+$?))
   tries=$(($tries+1))
   echo "tries: $tries fails: $fails"
 done

Is it the same for the other boards too? I tried it on our board and Microzed, both with dmatest enabled (without dmatest the errors are much less frequent).

When digging into xilinx_emacps.c I first noticed that xemacps_mdio_read() and xemacps_mdio_write() sometimes started a transfer while the shift register was still not idle, but fixing that did not make the Ethernet down/up test work without problems.

Later I noticed something strange that I do not understand - sometimes when starting xemacps_init_hw() and xemacps_probe() the access to gem registers was lost (reading zeros, ignoring writes). I temporarily added a function that waits for access (no actions were needed - just waiting) and called it at the beginning of xemacps_init_hw() and xemacps_probe():
/*
 * Poll until the GEM registers become readable again.
 * Returns 0 if access is (or becomes) available, 1 on timeout.
 */
static int wait_register_access(struct net_local *lp)
{
    int timeout = 1000;
    int tries = 0;
    /* Sampled before polling, for the diagnostic message only. */
    u32 regval_dbg = xemacps_read(lp->baseaddr, XEMACPS_PHYMNTNC_OFFSET);

    /* Both registers reading zero is taken as "register access lost". */
    if ((xemacps_read(lp->baseaddr, XEMACPS_NWCFG_OFFSET) |
         xemacps_read(lp->baseaddr, XEMACPS_NWSR_OFFSET)) == 0) {
        for (tries = 1; tries < timeout; tries++) {
            if ((xemacps_read(lp->baseaddr, XEMACPS_NWCFG_OFFSET) |
                 xemacps_read(lp->baseaddr, XEMACPS_NWSR_OFFSET)) != 0)
                break;
            printk("-");
        }
        dev_warn(&lp->pdev->dev,
                 "Seems I/O register access was lost. Waited %d (of %d), Now [XEMACPS_NWCFG_OFFSET]=0x%08x,  [XEMACPS_NWSR_OFFSET]=0x%08x, old [XEMACPS_PHYMNTNC_OFFSET]=0x%08x\n",
                 tries, timeout,
                 (int) xemacps_read(lp->baseaddr, XEMACPS_NWCFG_OFFSET),
                 (int) xemacps_read(lp->baseaddr, XEMACPS_NWSR_OFFSET),
                 (int) regval_dbg);
        if (tries < timeout)
            return 0;
        return 1;
    }
    return 0;
}

With that extra code the failures reported by the test went away, but I still do not understand what causes the problem. And our code just waits to get access to the registers; there is no guarantee that access is not lost later, during communication with the PHY. Is it something related to the specific GEM hardware? Or are these memory management problems? Something related to SMP?

My current state of the driver mods is in this patch http://sourceforge.net/p/elphel/meta-elphel393/ci/master/tree/recipes-kernel/linux/linux-xlnx/xilinx_emacps_elphel393.patch , but as I wrote above - the problem was reproduced on Microzed with the factory image.

Andrey Filippov
Elphel, Inc.
Salt Lake City, Utah


Observer
6,939 Views
Registered: ‎08-09-2013

Sorry, there was a bug in the script I posted - it should be "$?" instead of "$rslt". And 192.168.0.15 needs to be changed to a host accessible on your network:

#!/bin/sh
 IP="192.168.0.15"
 fails=0
 tries=0
 while true
 do
   ifconfig eth0 down
   ifconfig eth0 up
   ping -c 1 -w 10 $IP >/dev/null
   if [ $? != 0 ] ; then
       echo "Waiting more..." # Sometimes ping returns "network not available" too early
       sleep 5
       ping -c 1 -w 10 $IP >/dev/null
   fi
   fails=$(($fails+$?))
   tries=$(($tries+1))
   echo "tries: $tries fails: $fails"
 done
Xilinx Employee
6,937 Views
Registered: ‎03-13-2012

The observation regarding the register accesses is most likely due to clock gating. The driver uses the CCF to manage the device's clocks. When the clocks are disabled the registers are inaccessible.

Observer
6,933 Views
Registered: ‎08-09-2013

Is the problem that the script tests for present on other Zynq-based hardware (and possibly with newer software)? Maybe it is already fixed?

If not - how can I verify that it is CCF-related (and what is "CCF")? It seems to me that the lost access is caused by unrelated drivers (dmatest, mmc) - how can this problem be solved, or at least from which side should it be approached?

 

We are preparing to do a lot of custom driver development for our Zynq-based hardware, so I would first like to understand the problems/limitations of the platform while using just the factory software that is already running on thousands of systems.


Andrey Filippov

Elphel, Inc.

Salt Lake City, Utah

Xilinx Employee
6,929 Views
Registered: ‎03-13-2012

CCF==common clock framework. The driver uses it to enable and disable the clocks (look for functions containing clk_). When the clocks are disabled you will read only zeros from the GEM registers and writes will have no effect.

The CCF exports information through debugfs if enabled. You can check the state of the gem clocks by reading <debugfs_mountpoint>/clk/clk_summary (requires debugfs and CCF debugfs option to be enabled in the kernel).
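
A minimal sketch of that check, assuming debugfs is mounted at /sys/kernel/debug and that the GEM clocks have "gem" somewhere in their names (both are assumptions, not confirmed in this thread). The helper name gem_clk_summary is made up for illustration:

```shell
#!/bin/sh
# Print the rows of clk_summary that mention the GEM clocks.
# $1: debugfs mount point (assumed default: /sys/kernel/debug)
gem_clk_summary() {
    summary="${1:-/sys/kernel/debug}/clk/clk_summary"
    if [ ! -r "$summary" ]; then
        echo "no readable $summary (old kernel, or debugfs/CCF debug not enabled?)" >&2
        return 1
    fi
    # Case-insensitive match; "gem" in the clock name is an assumption.
    grep -i 'gem' "$summary"
}
```

Calling `gem_clk_summary` with no argument reads the default location; a non-zero return means the file was missing or no row matched.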

Observer
6,922 Views
Registered: ‎08-09-2013

Both CONFIG_COMMON_CLK_DEBUG and CONFIG_COMMON_CLK_VERSATILE are enabled, but I do not see "clk_summary":

root@elphel393:/sys/kernel/debug/clk# ls -all
drwxr-xr-x    4 root     root             0 Jan  1  1970 .
drwx------    9 root     root             0 Jan  1  1970 ..
drwxr-xr-x    2 root     root             0 Jan  1  1970 orphans
drwxr-xr-x    5 root     root             0 Jan  1  1970 ps_clk
root@elphel393:/sys/kernel/debug/clk# cd ps_clk/
root@elphel393:/sys/kernel/debug/clk/ps_clk# ls -all
drwxr-xr-x    5 root     root             0 Jan  1  1970 .
drwxr-xr-x    4 root     root             0 Jan  1  1970 ..
drwxr-xr-x    3 root     root             0 Jan  1  1970 armpll
-r--r--r--    1 root     root             0 Jan  1  1970 clk_enable_count
-r--r--r--    1 root     root             0 Jan  1  1970 clk_flags
-r--r--r--    1 root     root             0 Jan  1  1970 clk_notifier_count
-r--r--r--    1 root     root             0 Jan  1  1970 clk_prepare_count
-r--r--r--    1 root     root             0 Jan  1  1970 clk_rate
drwxr-xr-x    5 root     root             0 Jan  1  1970 ddrpll
drwxr-xr-x   16 root     root             0 Jan  1  1970 iopll

 

And can you please verify that you do not have the same problem on your system? ifconfig eth0 down; ifconfig eth0 up; work flawlessly even with dmatest running?

 

Andrey Filippov

Elphel, Inc.

Salt Lake City, Utah

Xilinx Employee
6,918 Views
Registered: ‎03-13-2012


@elphel wrote:

Both CONFIG_COMMON_CLK_DEBUG and CONFIG_COMMON_CLK_VERSATILE are enabled, but I do not see "clk_summary" [...]

And can you please verify that you do not have the same problem on your system? ifconfig eth0 down; ifconfig eth0 up; work flawlessly even with dmatest running?

You're apparently on an old kernel. The information regarding enable/disable state is present on yours too, but you have to traverse the directories in that debugfs folder to find it.

 

I'll see if I can find time to do some quick tests later. But no promises. What I can say from prior experience: ethernet usually works well for me, but I usually bring it up once and am happy with it.

Also, I don't know how severe your delays are, but when you bring down the ethernet interface the driver gates off the clocks. And they are restarted when you bring the IF back up. And then autonegotiation runs to negotiate the link speed. So, some delay is normal, I think. To prevent the clock gating you could try disabling RUNTIME_PM.
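
On a kernel without clk_summary, the per-clock directories can be walked by hand. The sketch below assumes the layout from the listing earlier in the thread (one directory per clock, each with clk_enable_count and clk_prepare_count files) and that the GEM clock directories contain "gem" in their names, which is a guess. The helper name clk_enable_counts is hypothetical:

```shell
#!/bin/sh
# List enable/prepare counts for clocks whose directory name matches a pattern.
# $1: root of the clk debugfs tree (assumed default: /sys/kernel/debug/clk)
# $2: case-insensitive substring of the clock name (assumed default: gem)
clk_enable_counts() {
    root="${1:-/sys/kernel/debug/clk}"
    pattern="${2:-gem}"
    find "$root" -type f -name clk_enable_count 2>/dev/null |
    while read -r f; do
        dir=$(dirname "$f")
        name=$(basename "$dir")
        # Skip clocks whose name does not contain the pattern
        echo "$name" | grep -qi "$pattern" || continue
        prep=$(cat "$dir/clk_prepare_count" 2>/dev/null)
        echo "$name enable=$(cat "$f") prepare=${prep:-?}"
    done
}
```

An enable count of 0 for a GEM clock at the moment the registers read as zeros would be consistent with the clock gating explanation above.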

Observer
6,911 Views
Registered: ‎08-09-2013

Which version of the kernel (and linux-xlnx) do you recommend using?

I added dev_dbg() and found that xemacps_runtime_resume() is called each time before an attempt to access the registers, whether that attempt succeeds or fails.

 

As for the network working: originally I did not run a stress test either. I found the problem while debugging communication with a different PHY, and then noticed that the link sometimes comes up and sometimes does not, depending on debug prints, the moment when the MMC was detected, etc. That is why on most systems it may always come up, as the boot sequence is rather stable.

 

While troubleshooting these problems we made the down/up test script and tried it on Microzed (factory image) - it survived just over 30 cycles before turning the network LEDs off for good (reporting 3 failures in the process).

 

Andrey Filippov

Elphel, Inc.

Salt Lake City, Utah

Xilinx Employee
6,898 Views
Registered: ‎03-13-2012

How long do you have to run the test to make it fail?


I ran a slightly modified version on a zc706 and don't see any issues (just the last few lines):

tries: 110 fails: 0
[  737.044347] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 111 fails: 0
[  740.584446] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 112 fails: 0
[  745.184431] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 113 fails: 0
[  748.714429] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 114 fails: 0
[  752.244428] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 115 fails: 0
[  756.774297] xemacps e000b000.ps7-ethernet: link up (1000/FULL)
tries: 116 fails: 0

 

My version of the script basically just replaces the IP and uses ip instead of ifconfig:

#!/bin/sh
 IP="10.10.70.101"
 fails=0
 tries=0
 while true
 do
   ip link set eth0 down
   ip link set eth0 up
   ping -c 1 -w 10 $IP >/dev/null
   if [ $? != 0 ] ; then
       echo "Waiting more..." # Sometimes ping returns "network not available" too early
       sleep 5
       ping -c 1 -w 10 $IP >/dev/null
   fi
   fails=$(($fails+$?))
   tries=$(($tries+1))
   echo "tries: $tries fails: $fails"
 done

 

I don't have a microzed though, so this was on a zc706.

Observer
6,893 Views
Registered: ‎08-09-2013

The factory image on the Microzed produced an error about once per 10 cycles, and after 30 cycles it was never able to turn the network on again. And yes - xemacps_runtime_resume() was sometimes actually called later than the access to the gem registers, which is why just waiting helped.

We are still struggling to run linux-xlnx master-next (so far we have problems during pl330 initialization). I'll try the new driver and see what it needs to run on our hardware (with the Atheros AR8035) - it needs a fixup, as the Atheros driver does not support writing registers from DT.

 

Andrey Filippov

Elphel, Inc.

Salt Lake City, Utah

 

Observer
2,146 Views
Registered: ‎08-09-2013

The problem was in the line

 rc = pm_runtime_get(&lp->pdev->dev);

that was later replaced with

 rc = pm_runtime_get_sync(&lp->pdev->dev);

Andrey
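
For context: pm_runtime_get() only queues an asynchronous resume request and can return before the device is actually resumed, while pm_runtime_get_sync() returns after the resume has been carried out. That fits the earlier observation that xemacps_runtime_resume() sometimes ran after the first register access. A rough user-space way to watch the runtime PM state is sketched below. It assumes the kernel exposes power/runtime_status in sysfs for the device and that the device name is the one shown in the driver's log lines (e.g. e000b000.ps7-ethernet). The helper names are made up for illustration:

```shell
#!/bin/sh
# Print the runtime PM state of a platform device.
# $1: device name (e.g. the one shown in the driver's log lines)
# $2: sysfs platform devices directory (assumed default below)
runtime_pm_status() {
    dev="$1"
    base="${2:-/sys/bus/platform/devices}"
    f="$base/$dev/power/runtime_status"
    if [ -r "$f" ]; then
        cat "$f"
    else
        echo "unknown"
        return 1
    fi
}

# Poll until the device reports "active" or the number of tries runs out.
# $1: device name, $2: max tries, $3: sysfs directory (optional)
wait_runtime_active() {
    n=0
    while [ "$n" -lt "$2" ]; do
        [ "$(runtime_pm_status "$1" "$3")" = "active" ] && return 0
        n=$((n + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_runtime_active e000b000.ps7-ethernet 5` run right after bringing the interface up would show whether the device reaches the "active" state within about five seconds.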
