UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Visitor slocke
Visitor
3,978 Views
Registered: ‎01-26-2018

Cadence I2C driver bug

There seems to be a bug with the Cadence I2C driver. 

 

I'm using a custom board, but I suspect it can be reproduced on any board with an I2C device (I have a MicroZed here, but it does not have an I2C slave device on the board, so I can't easily confirm this with another board).  The bug is admittedly quite rare, but with continual I2C reads, and fairly heavy system load, I do see it often enough to be a limitation.  In a board power-cycle test, it also caused boot failures (during I2C device enumeration) about 10 times out of 10000 reboots.

 

Steps to reproduce:

1. Perform many I2C read operations (preferably a single register read, not sequential continuous reads).

2. Load down system with other non-i2c interrupt-heavy tasks.

3. Wait.

 

After a while (usually somewhere between 0 and 24 hours, although it's occasionally taken longer), you'll get an I2C timeout from the cadence driver.  Depending on the source of the read request (slave device driver), you may or may not also get a kernel oops like the one below

cdns-i2c e0004000.i2c: timeout waiting on completion
Unable to handle kernel paging request at virtual address fffffffe
pgd = dd488000
[fffffffe] *pgd=1eff6861, *pte=00000000, *ppte=00000000
Internal error: Oops - BUG: 80000007 [#1] PREEMPT SMP ARM
Modules linked in: array
CPU: 1 PID: 2961 Comm: cat Not tainted 4.9.0-xilinx-g5b20083 #22
Hardware name: Xilinx Zynq Platform
task: de580840 task.stack: dd41c000
PC is at 0xfffffffe
LR is at 0x0
pc : [<fffffffe>] lr : [<00000000>] psr: a00f0033
sp : dd41dbb8 ip : de580840 fp : ffffffff
r10: ffffffff r9 : ffffffff r8 : ffffffff
r7 : ffffffff r6 : ffffffff r5 : ffffffff r4 : ffffffff
r3 : de580840 r2 : de580840 r1 : 00000000 r0 : ffffff92
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA Thumb Segment none
Control: 18c5387d Table: 1d48804a DAC: 00000051
Process cat (pid: 2961, stack limit = 0xdd41c210)
Stack: (0xdd41dbb8 to 0xdd41e000)
dba0: ffffffff ffffffff
dbc0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
dbe0: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
dc00: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
dc20: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
dc40: ffffffff ffffffff ffffffff dddb5000 ddd2e018 ddff7e00 ddd42180 c046328c
dc60: c0463260 dddb5000 c063955c c045a310 00000000 de012c0c 00000018 00000002
dc80: de012c0c 00000018 00000000 dd41dd18 c092d8c0 ddd258c0 c045a288 c0381df8
dca0: 00001000 ddfdc080 c063955c c022268c ddfdc080 00001000 00000000 00000001
dcc0: de579000 dd41dde0 ddd42180 c01ebdfc ddfdc0b0 00000010 00000000 00000000
dce0: dd41dec0 c031ac30 ddd42180 00000000 ddd42180 00000000 00001000 dd41dde0
dd00: dd41dd1c c0221e88 00000000 c01cd4e0 dd41dd18 dd41dd1c dde5b000 00000002
dd20: 00000000 00010000 dde5b000 00000010 00000000 024200ca 024200ca c0903940
dd40: 00000081 00000000 dddcf700 c0198674 00000001 0000000f ddfdce00 00010000
dd60: 00001000 c092d8c0 024200c0 c060c1a8 ddd42180 c03199a0 01000000 dd41dddc
dd80: dd41ddd8 dd41ddec ddfdce00 00010000 dd41b500 00000000 00000010 dd41de00
dda0: ffffe000 bf000000 00000000 dd41dec0 ddd42180 c01cd570 dd41dde0 00000000
ddc0: 00000010 c01f2540 00000000 c0903940 00000081 00001000 dd41b500 00000000
dde0: 00000000 00000000 def8fa60 00000008 00000000 01000000 ddfdce00 00000000
de00: de579000 00001000 dde88000 00001000 dd476000 00001000 dddb7000 00001000
de20: dde21000 00001000 dd483000 00001000 ddebf000 00001000 ddef1000 00001000
de40: de5bf000 00001000 ddff1000 00001000 dd486000 00001000 ddfcc000 00001000
de60: dddfb000 00001000 ddfca000 00001000 de72f000 00001000 dde37000 00001000
de80: 01000000 dd41def0 ddfdce00 00000000 ddd42180 01000000 00000000 c01f2f00
dea0: 00000000 c01f226c 00000000 dddc3d68 00000000 00000000 00000817 00000001
dec0: 00000000 00000000 00000001 ddd42180 dd41df40 00000000 00000000 00000000
dee0: dd41df48 7fffffff 00000000 c01f23d8 01000000 01000000 00000000 dde039c0
df00: 00000000 00000000 dd41df48 00000000 00000000 00000000 dde039c0 dde039c0
df20: ddd42180 00000000 01000000 c01cd948 01000000 00000000 dde039c0 ddd42180
df40: 00000000 00000000 00001150 00000000 ddd42188 00000000 00000000 00000000
df60: 000000ef c0106fa4 dd41c000 00000000 00000000 c01ce23c 7fffffff 00000000
df80: 00000100 00000001 b6f81570 01000000 00000000 00000000 000000ef c0106fa4
dfa0: dd41c000 c0106de0 01000000 00000000 00000001 00000003 00000000 01000000
dfc0: 01000000 00000000 00000000 000000ef 01000000 00000003 00000001 00000000
dfe0: b6eba371 beb63c34 0001b884 b6eba378 800f0030 00000001 00000000 00000000
Code: bad PC value
---[ end trace 3f3757dc5c49e4da ]---

This oops was generating when reading the magnetometer in an ST LSM303AGR.  The corresponding I2C capture is:

I2C Fail mag X read.png

The bug is in regards to the extra transfer of 0xFF's after the single byte read is completed. 

 

I've also reproduced this on another I2C device (TMP102).  On this failure, there is no kernel oops, but the driver is left in an inaccessible state after the failure.

I2C Fail tmp102 read.png

 

The cause of this seems to be a race condition in a section of the cdns_i2c_mrecv() function in the driver.  In the following snippet from that function:

	/* Set the slave address in address register - triggers operation */
	cdns_i2c_writereg(CDNS_I2C_ENABLED_INTR_MASK, CDNS_I2C_IER_OFFSET);
	cdns_i2c_writereg(id->p_msg->addr & CDNS_I2C_ADDR_MASK,
						CDNS_I2C_ADDR_OFFSET);
	/* Clear the bus hold flag if bytes to receive is less than FIFO size */
	if (!id->bus_hold_flag &&
		((id->p_msg->flags & I2C_M_RECV_LEN) != I2C_M_RECV_LEN) &&
		(id->recv_count <= CDNS_I2C_FIFO_DEPTH))
			cdns_i2c_clear_bus_hold(id);

There is a possibility of interruption between the start of the I2C transfer (write to the address register), and the clearing of the HOLD bit in the config register.  During periods of heavy interrupt generation, there is a small chance that an unrelated interrupt will occur, preempting this function.  If the preemption occurs between these two events (write to address register, and clearing of HOLD bit), and the function is preempted for longer than it takes for the I2C peripheral to complete its transfer, the erroneous transfer shown above will occur. The bug seems to happen more often on single-byte read transfers that complete relatively quickly.  I think what is happening is that the I2C Master Invalid Read Transaction Errata is occurring because the HOLD bit is not clear when the transaction completes.

 

I have tested a fix that involves disabling interrupts (using spin_lock_irqsave/irqrestore) around that code section, preventing the consecutive execution from being preempted/delayed.  This appears to be working, however I have to let the system run for an extended period of time to be sure that this rare occurrence has disappeared.  Without more time examining the driver and an actual datasheet for the peripheral, I'm not sure this is the best long-term fix for the issue.  I'm also not certain if there is a better linux kernel construct for only disabling interrupts (there is no real reason for the spin lock, just to disable the interrupts temporarily).

15 Replies
Adventurer
Adventurer
3,904 Views
Registered: ‎12-02-2014

Re: Cadence I2C driver bug

This seems like a reasonable solution. Nice job!
0 Kudos
Visitor fox
Visitor
3,825 Views
Registered: ‎04-24-2018

Re: Cadence I2C driver bug

Hi,

 

I probably have the exact same issue as you have using the realtime linux kernel patch. Using realtime linux, this the timeouts happens very frequently (~ 3/10 reboots) and rarely also the kernel crashes, similar to your debug outputs, appear. Furthermore, I can force this error very easily if I perform endless apache2 restarts within a script and perform I2C access simultaneously.

 

However, I was not able to fix this using spinlocks (raw_spinlocks) and / or  preempt_disable_rt.

 

Do you have seen this issue after a few another tests? Or is it fixed in your case when using spinlocks?

 

Kind regards,

fox

 

0 Kudos
Visitor fox
Visitor
3,803 Views
Registered: ‎04-24-2018

Re: Cadence I2C driver bug

So, I think I am heavily influenced by this bug: https://www.xilinx.com/support/answers/61665.html after debugging the I2C-Cadence driver when receiving I2C frames (<= 252 bytes). I implemented the workaround (tracking the I2C transfer size register within the interrupt service routine) to prevent the bug.

 

In my case, the IXR_COMP (complete bit) is not set in certain cases (I do not know why yet, but it happens).

Further, the SR_RXDV (data valid bit) was set "too long" and the "read-out" loop within the interrupt service routine read subsequently too much data which not existed.

I.e. the "unsigned" typed "recv_count" and "curr_recv_count" decrased to a value < 0 which lead very high values. Furthermore, the p_recv_buf was accessed out of bounce which caused the kernel to crash many times.

 

The affected lines are

 

 

*(id->p_recv_buf)++ = cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
id->recv_count--;
id->curr_recv_count--;

 

 

My fix includes:

1.

Checking if the XFER_SIZE_OFFSET register contains a value < (FIFO - 2) and if yes, forcing the COMP bit to be set within the isr_status variable in the "read-out" loop, i.e.:

 

// Workaround part 1:
if(!(isr_status & CDNS_I2C_IXR_COMP)) {
    // COMPLETE bit is not set currently.
    if(cdns_i2c_readreg(CDNS_I2C_XFER_SIZE_OFFSET) < CDNS_I2C_DATA_INTR_DEPTH) {
        // force the COMPLETE bit to be set
        isr_status |= CDNS_I2C_IXR_COMP;
    }
}

2.

Checking if "id->recv_count" (and curr_recv_count) is > 0 before accessing the buffer and decreasing the values.

 

First quick tests looks good, but I have to test much more yet...

0 Kudos
Visitor cogsy
Visitor
3,624 Views
Registered: ‎09-30-2015

Re: Cadence I2C driver bug

I've just been struggling through the same problem, with the same erroneous transfers observed on the scope (though mine where all 0x00s).

 

I have 15 LEDs that I'm 'twinkling' on PCA953xs which requires a huge amount of I2C traffic in single byte reads/writes.  I could trigger this bug within about 3 minutes with a moderate CPU load.  After it happens the work_queue servicing all the set_brightness calls stalls and the kernel falls over a short time later.

 

I had assumed it was caused by this bug mentioned in i2c-cadence.c --EDIT: which I now see you refer to in your explanation.

 

/*
 * Cadence I2C controller has a bug wherein it generates
 * invalid read transaction after HW timeout in master receiver mode.
 * HW timeout is not used by this driver and the interrupt is disabled.
 * But the feature itself cannot be disabled. Hence maximum value
 * is written to this register to reduce the chances of error.
 */

 

Enabling IXR_TO and adding it to the error mask certainly spat out a few errors before the inevitable

 

cdns-i2c e0005000.i2c: timeout waiting on completion

 

I expect the invalid read transactions mentioned would generate ISRs with SR_RXDV set.

 

My fix has also been to check (curr_)recv_count > 0 in the ISR, which stops the work_queue from stalling, albeit after the 1000ms timeout delay.

 

Feels hacky, but at least my board doesn't hang anymore.

0 Kudos
Adventurer
Adventurer
3,037 Views
Registered: ‎12-02-2014

Re: Cadence I2C driver bug

I'm not sure if this relates to your issues exactly, but this line seems unintentional, or possibly too aggressive with respect to the variable updatetx:

 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/i2c/busses/i2c-cadence.c?h=v4.20-rc1#n231

 

I'm having an issue with read transactions greater than 252 bytes, and if I remove the updatetx check, it works fine.

 

 

The minimum change to resolve the issue for me, is to remove the consideration of 'updatetx' from the following line:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/i2c/busses/i2c-cadence.c?h=v4.20-rc1#n254

0 Kudos
Adventurer
Adventurer
2,995 Views
Registered: ‎12-02-2014

Re: Cadence I2C driver bug

Actually, that change I mentioned didn't fix anything (it was just killing the transaction after the first part of the message)... without the change, stress testing transfers over 252 bytes causes instability after a few runs:

# Unable to handle kernel paging request at virtual address ffffffff
pgd = d8fc4000
[ffffffff] *pgd=1bff6861, *pte=00000000, *ppte=00000000
Internal error: Oops - BUG: 37 [#1] PREEMPT SMP ARM
Modules linked in: usb_f_acm u_serial g_serial libcomposite m25p80
CPU: 0 PID: 2414 Comm: sh Not tainted 4.14.0 #1.00.0002-b018
Hardware name: Xilinx Zynq Platform
task: da733800 task.stack: d8e2e000
PC is at tty_paranoia_check+0x20/0x5c
LR is at tty_ioctl+0x2c/0xa24
pc : [<c0393804>] lr : [<c03945d0>] psr: a00f0013
sp : d8e2fea0 ip : 0000000a fp : 0009c9d0
r10: 00000000 r9 : d8e2e000 r8 : d64403c0
r7 : be9ced3c r6 : 00005410 r5 : ffffffff r4 : be9ced3c
r3 : c081a13e r2 : c081a13e r1 : da6973b0 r0 : ffffffff
Flags: NzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
Control: 18c5387d Table: 18fc404a DAC: 00000051
Process sh (pid: 2414, stack limit = 0xd8e2e210)
Stack: (0xd8e2fea0 to 0xd8e30000)
fea0: be9ced3c c03945d0 00000010 c011f8c8 00d5810c 00000000 00f453c0 00000000
fec0: d8e2ff20 d8e2ff08 da756cc0 d8e2e000 d8e2ff20 da733800 d8e2ff08 da756cc0
fee0: d8e2e000 c011fdb0 00000000 00000000 be9ced00 00000002 00000000 d8e2e000
ff00: 00000000 c012032c 00000003 be9ced3c da6973b0 d64403c0 00005410 be9ced3c
ff20: d8e2e000 c01e7e48 be9ced3c c01e86d4 00000200 fffffff6 000bd2e8 00000000
ff40: 000bd2e8 00000072 c01070c8 c01203b0 0000096e c011bed4 00000000 00000000
ff60: ffffffff 00000000 00000000 00000000 0000000a d64403c0 d64403c0 00005410
ff80: be9ced3c d8e2e000 00000000 c01e8818 000bd2e8 000b9db0 00000000 00000036
ffa0: c01070c8 c0106ec0 000bd2e8 000b9db0 0000000a 00005410 be9ced3c 000b9694
ffc0: 000bd2e8 000b9db0 00000000 00000036 ffffffff 00000000 0009c9f0 0009c9d0
ffe0: 000b9218 be9ced34 b6efa0cf b6efa756 200f0030 0000000a 00000000 00000000
[<c0393804>] (tty_paranoia_check) from [<c03945d0>] (tty_ioctl+0x2c/0xa24)
[<c03945d0>] (tty_ioctl) from [<c01e7e48>] (vfs_ioctl+0x18/0x3c)
[<c01e7e48>] (vfs_ioctl) from [<c01e86d4>] (do_vfs_ioctl+0x744/0x854)
[<c01e86d4>] (do_vfs_ioctl) from [<c01e8818>] (SyS_ioctl+0x34/0x5c)
[<c01e8818>] (SyS_ioctl) from [<c0106ec0>] (ret_fast_syscall+0x0/0x48)
Code: 059f003c 07f32051 01a01a21 0a000007 (e5900000)
---[ end trace 50138908b2cf98ee ]---

 

 

0 Kudos
Adventurer
Adventurer
2,918 Views
Registered: ‎12-02-2014

Re: Cadence I2C driver bug

I've narrowed down my issues to a similar cause as what was listed above.

Inside cdns_i2c_master_isr, the following code is running when id->recv_count is 0:

*(id->p_recv_buf)++ =
cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
id->recv_count--;
id->curr_recv_count--;

This occurs during large I2C transfers (> 252 bytes) and also when other interfaces are being heavily loaded with interrupts.  Can we get an official response from Xilinx please?

0 Kudos
Observer nburkitt
Observer
2,834 Views
Registered: ‎06-16-2016

Re: Cadence I2C driver bug

I've been bitten by this as well. In my case, it started when moving working code from an older version of the kernel.

I'm using a krtkl snickerdoodle board with a Zynq 7010 and a custom baseboard. The manufacturer maintains a kernel repository on github forked from the Xilinx git repo. My company's product has been working for over a year on an older version of the kernel (4.4.0). When I "upgraded" to the newer, 4.14.0 kernel (again from the krtkl repo), every I2C device we have started to fail with the symptoms described by others on this thread.

We make fairly heavy use of I2c, reading an IMU at around 100 Hz, plus other devices at lower rates. The failure generally takes no more than a few minutes to show up.

My solution, admittedly a bit simple-minded, was just to roll back the i2c-cadence.c file to what I had been using before. Et voila, the problems went away.

Here's a link to the git blame page for the commit that seems to be the culprit.

-Nick

0 Kudos
Adventurer
Adventurer
2,803 Views
Registered: ‎12-02-2014

Re: Cadence I2C driver bug

@nburkitt,

Are you sure that was the commit that introduced the issue for you?  That one seems to be related to i2c bus recovery specifically, which is a special bus condition that is handled by i2c-core.  This is a valid, and sometimes necessary change (depending on the types of i2c devices connected).

 

 

0 Kudos
Observer nburkitt
Observer
2,797 Views
Registered: ‎06-16-2016

Re: Cadence I2C driver bug

@jdefields - I think that's the only significant change that's been made to i2c-cadence.c, but I made no further attempt to analyze the failure, nor did I spend any time to verify that that commit was really the point at which the failure was introduced. The comments certainly make it seem as though the changes solve some problem, but it's not one I've ever encountered. With those changes, I2C on the Zynq is effectively unusable. Without them, it works perfectly, at least for me. The cure here is worse than the disease.

We have several I2C devices, including a Bosch BNO055 IMU, a TI BQ25890 battery charger, and a 24LC16B EEPROM. A simple test is to read any of these via sysfs as fast as bash can loop. With the post-change version of i2c-cadence.c, the kernel is hosed within minutes, if not seconds. With the pre-change version, all such tests run indefinitely, without a peep written to syslog.

I'm already weeks behind schedule. Discarding the driver updates is a pretty ham-fisted approach to solving the problem, but with no evidence that the updated driver solves any problem that I've experienced, ham is looking pretty good. :-)

-Nick

0 Kudos
Visitor slocke
Visitor
1,723 Views
Registered: ‎01-26-2018

Re: Cadence I2C driver bug

For those following along, I have not observed the I2C hang since implementing the atomic section.  It appears this fix has finally made it into the Xilinx github repo for the kernel under the following commit:

https://github.com/Xilinx/linux-xlnx/commit/73766581cd48007fd4c45f172fcbd0a9f3d3a2ec

Now I'd just like to know when there will be a 2019.1 release for zynq that includes this (and other) fixes?

0 Kudos
Visitor slocke
Visitor
511 Views
Registered: ‎01-26-2018

Re: Cadence I2C driver bug

So, it appears that even with the atomic section fix, there are still occasions where the cadence I2C timeout is occuring during heavy interrupt loading.  I have (again) managed to trace the error on a scope, and it is a similar symptom to before -- an extra 255 bytes are transfered on an I2C read.  In this particular case, it is during a 128 byte read of an eeprom -- leading to a total of 383 bytes being transfered.

I assume this is (again) related to the 61664 errata.  The documentation for the Cadence I2C master in the Xilinx TRM is only one step removed from useless, so I'm stumbling around blind trying to figure out what the actual issue is and how to fix it.

I dearly hope someone familiar with this IP core can answer some questions:

1. What units of time does the timeout register measure?

2. If the errata occurs, does the core still generate an RX complete interrupt?  Is that interrupt generated at the end of the extra 16 bytes, or is it generated at the end of the original transfer, or both?

3. (I think I know the answer to this one) Assuming the HOLD bit is *not* set, does the IP core hold SCL low during a DATA interrupt if the fifo is not refilled before the last byte(s) are drained?  In other words, does the HOLD bit need to be set if transferring more than 16 bytes (max FIFO size), or should it only be set if transferring more than 255 bytes (max transfer size)?

0 Kudos
Visitor slocke
Visitor
484 Views
Registered: ‎01-26-2018

Re: Cadence I2C driver bug

I think I know what is happening in the driver, but the "correct" way to fix it will depend on the answers (if any) I get to my previous post.

I2C Fault.png

This capture shows the I2C transaction that is causing the issue.  The 128 byte transfer should be completed during the SCL hold (the 0xFF~a shows that the controller properly NACKed the transfer).  A data interrupt is generated, and the driver services the FIFO (red trace is GPIO toggling for every read from the FIFO).  Blue trace is a simple test of the TRANSFER_SIZE register of the controller on each loop, comparing it against zero.  It is quite obvious that the controller (due to the NACK) finished the transfer, and the TRANSFER_SIZE register is properly diminshed to zero when the ISR starts.  Up until this interrupt, 113 of the 128 bytes have already been transfered, leaving 15 bytes remaining in the correct transfer size.  This is where things go bad.

I *believe* what is happening is that since the hold bit is left set from the previous DATA interrupt, the controller does not issue a STOP condition and finish the transaction (which is why no COMP interrupt flag is set).  Instead, it assumes that more data should be transfered.  As soon as the driver reads the first byte from the FIFO, space is opened up, and the controller continues (even though TRANSFER_SIZE is zero???).  What I don't know is if this is expected behaviour from the core, or another errata.  Furthermore, the readout loop reads MORE than the required 15 bytes out of the FIFO...

The reason the HOLD bit is still set is a combination of a few things:

1. A previous data interrupt for the same transaction happened a little late, causing the driver's read-out loop to read 15 bytes instead of the normal 14 (one extra byte becomes "valid" during the read-out loop).  If this didn't happen, the recv_count would have been large enough to trigger the HOLD bit clearing (see part 2).  This is why an interrupt-laden system causes the fault in my case.

I2C extra FIFO read.png

2.  The driver only checks to see if the HOLD bit should be cleared at the top of the readout loop.  During the DATA interrupt just before the fault condition shown above, the recv_count would have been 15 at the bottom of the readout loop, but 16 at the top (and the HOLD clear comparison is a less than comparison):

		while (cdns_i2c_readreg(CDNS_I2C_SR_OFFSET) &
		       CDNS_I2C_SR_RXDV) {
			/*
			 * Clear hold bit that was set for FIFO control if
			 * RX data left is less than FIFO depth, unless
			 * repeated start is selected.
			 */
			if ((id->recv_count < CDNS_I2C_FIFO_DEPTH) &&
			    !id->bus_hold_flag)
				cdns_i2c_clear_bus_hold(id);

			*(id->p_recv_buf)++ =
				cdns_i2c_readreg(CDNS_I2C_DATA_OFFSET);
			id->recv_count--;
			id->curr_recv_count--;

			if (cdns_is_holdquirk(id, hold_quirk))
				break;
		}

Where this whole mess gets really bad is when the extra I2C transfers continue to be put into the p->recv_buf (buffer overrun) which is what @fox in a previous post saw.  Sometimes this causes kernel oops/panic because of corrupt memory:


I believe the best way to fix this is to ensure that the HOLD bit is clear, preventing the continued transaction when the last few bytes are read out.  This is where I need some insight from someone familiar with the I2C core.  The Zynq TRM seems to suggest that the HOLD bit only needs to be set when transactions larger than 255 (max TRANSACITON_SIZE) are needed.  If the controller automatically holds SCL low between FIFO writes, then the HOLD set (in cdns_i2c_mrecv):

	/*
	 * Check for the message size against FIFO depth and set the
	 * 'hold bus' bit if it is greater than FIFO depth.
	 */
	if (id->recv_count > CDNS_I2C_FIFO_DEPTH)
		ctrl_reg |= CDNS_I2C_CR_HOLD;

The HOLD clear check in the ISR loop can be modified to check against CDNS_I2C_TRANSFER_SIZE.  This would mean that the HOLD bit would only ever get set for transfers greater than 252 bytes, and even so, would be cleared as soon as there was less than 252 bytes left to transfer (giving MUCH more headroom for interrupt latency not causing issues).  

If my assumptions about the controller automatically holding SCL low during FIFO refils are incorrect, then the ISR could be made better by checking the recv_count with a <= FIFO_SIZE at the end of the read loop.

In either solution, there should also be some bounds checks added to the ISR read-out loop to prevent bad things from happening in weird operating conditions...

Can someone from Xilinx please check-in on this?

0 Kudos
Explorer
Explorer
448 Views
Registered: ‎02-22-2012

Re: Cadence I2C driver bug

Hi @slocke 

By coincidence I saw this your "year ago problem report" by pure coincidence.

I have basically the same issue with Xilinx AXI IIC 2.0 IP I2C bus controller.

To reproduce the problem I need to:

1. Perform many I2C read operations. (I get this via gpio-pca953x driver, driving 10Gpbs netdev TX/RX LEDs).

2. Load down system with other non-i2c interrupt-heavy tasks. (In my case huge cca 9.9Gbps network transfers)

3. Wait. (couple of minutes in my case).

In my case the problem shows up as I2C bus totally blocked.

I dig this down to that this I2C bus block happens always(!) after timeout completion of wait_event_timeout() in i2c-xiic.c driver (line in code 697, function xiic_xfer, xlnx kernel 4.14.0).

I did not hit kernel segmentation fault, but for the rest my problem symptoms are almost identical.

I don't want to introduce noise in this thread, but the problem is so incredibly similar.

You dig much deeper in Cadence driver to the root of the problem.

What I saw with AXI IIC and i2c-xiic driver is that if I slightly change i2c-xiic.c driver, e.g. add some debug prints or compile optimisations, the problem is not reproducible any more (problem hit time window almost zero). This looks like some coincidence timing issue in sequence of calls the driver is making to I2C controller (looks smilar to what you dig out).

I workaround the problem by soft reset the AXI IIC (SOFTR at 040h) when timeout happens and this workaround works ok for me, but the origin of problem is still there in AXI IIC and/or i2c-xiic driver.

p.s.

I also filed a report of this problem to Xilinx via FAE.

0 Kudos
Visitor slocke
Visitor
402 Views
Registered: ‎01-26-2018

Re: Cadence I2C driver bug

Yes, it does sound familiar.  It seems there might not be a lot of consideration of race conditions when Xilinx is developing drivers.  I had the same issues as you did with debug statements (they throw the timing off too much to track down the problem).  Instead, you could try GPIO writes like I did.  I just hacked in a direct write to the GPIO registers for this purpose (you need at least one GPIO you can toggle that comes out on a pad you can probe, like an LED, an unused pin on a header, etc). I just threw a global in the driver file called:

void __iomem * gpio_remap;

Then in the probe routine I added a line:

gpio_remap = ioremap((resource_size_t)0xE000A044, 4);

Then I could toggle GPIO (in my case MIO48 in the upper GPIO bank) using a couple of macros I put at the top of the file:

#define GPIO_SET() writel(readl(gpio_remap)|(1<<16),gpio_remap)
#define GPIO_CLR() writel(readl(gpio_remap) & ~(1<<16),gpio_remap)

Another method to help you track it down is to reduce the bus speed. If reducing the bus speed makes the problem go away, it's a good indication that the problem is in the interrupt routine (you're essentially giving the ISR a little more time to do its job before the next byte arrives from the bus).  A race condition outside the ISR (like the first bug I tracked down), will probably not be cured this way, just made a little less prevelant (because the number of transactions per second will be reduced).

0 Kudos