04-26-2010 09:15 AM
We're running Linux on PowerPC. We have devices that are hot-swappable on I2C (power supplies, etc.). We're noticing that after several (20-30) attempts to access a non-existent I2C device, the driver hangs. Note that we are not even trying to hot-swap the device; the device doesn't exist when I run my test. This is using the new sysfs driver. The effect on the application is that the read() hangs. It is interruptible, so Ctrl-C gets me out of the application, but subsequent runs cause the same read() to hang again. I hosed down the driver a bit, and the last place I've seen it reach is the end of XIic_MasterSend or XIic_MasterRecv.
Core registers are:
GIE (+1C) 0x8000
IS (+20) 0xD0
IE (+28) 0x27
RST (+40) 0x00
CTL (+100) 0x0D
STS (+104) 0x40
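Clearly the interrupt status and enable registers are disjoint: none of the bits set in IS (0xD0) is enabled in IE (0x27), so nothing that is pending can raise an interrupt.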
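As a quick check of the register dump above, here is a minimal sketch that masks the reported IS value with the reported IE value. The macro and function names are made up for illustration and are not taken from the Xilinx driver; only the two register values come from the post.

```c
#include <stdint.h>

/* Values copied from the register dump in the post. */
#define IIC_IS_VALUE 0xD0u  /* interrupt status, offset +0x20 */
#define IIC_IE_VALUE 0x27u  /* interrupt enable, offset +0x28 */

/* Hypothetical helper: returns the bits that are both pending and enabled.
 * Only these bits can actually raise an interrupt. */
static uint32_t iic_deliverable_irqs(uint32_t status, uint32_t enable)
{
    return status & enable;
}
```

Calling iic_deliverable_irqs(IIC_IS_VALUE, IIC_IE_VALUE) gives 0, which matches the observation that the driver is waiting for an interrupt that cannot arrive.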
I've looked at the bus with an I2C analyzer and the last transaction was ST ADDR NAK SP; no errors.
Has anyone seen this before? I haven't seen any bug fixes related to I2C.
04-27-2010 09:59 AM
I haven't seen that specifically, but I've heard of other issues with I2C on other devices.
It's not clear to me whether it's a h/w IP issue or a s/w driver issue.
Sorry, not much help yet.
04-28-2010 10:28 AM - edited 04-28-2010 10:31 AM
I think I've made some progress in fixing this hang. I haven't hammered on it for days, but at least I can say that I haven't seen it hang and it has run for more than an hour (normally it dies within a minute or two).
First, I'd like to know why it tries a transaction 160 times when it fails... Seems a little too much. Maybe someone was trying to delay in the case of an EEPROM write in progress making the part not ACK. That should probably be done at a higher layer (the app).
In the original code (i2c-algo-xilinx.c) the retry code reads:
Status = XIic_MasterRecv(&dev->Iic, pmsg->buf, pmsg->len);
dev->Iic.Stats.TxErrors = 0;
and
Status = XIic_MasterSend(&dev->Iic, pmsg->buf, pmsg->len);
dev->Iic.Stats.TxErrors = 0;
I think that sometimes the interrupt is coming in prior to the clearing of TxErrors and is getting lost. I've changed the code to say:
dev->Iic.Stats.TxErrors = 0;
Status = XIic_MasterRecv(&dev->Iic, pmsg->buf, pmsg->len);
and
dev->Iic.Stats.TxErrors = 0;
Status = XIic_MasterSend(&dev->Iic, pmsg->buf, pmsg->len);
You might want to consider rolling this into the distribution. It does look like the problem is gone. Or at least the window has been made much, much smaller.
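The suspected race can be illustrated with a small, deterministic simulation. This is only a sketch under assumptions: struct fake_iic and fake_master_send are hypothetical stand-ins, not the Xilinx driver code, and the "interrupt" is modeled as firing before the send call returns, which is the case I believe gets lost.

```c
/* Hypothetical stand-in for the driver instance; illustration only. */
struct fake_iic {
    unsigned tx_errors;
};

/* Simulates a send to an absent device where the NAK interrupt fires
 * before the call returns and bumps the error counter. */
static void fake_master_send(struct fake_iic *iic)
{
    iic->tx_errors++;   /* the simulated ISR runs here */
}

/* Original order: start the transfer, then clear the counter.
 * The error recorded by the early interrupt is wiped out. */
static unsigned send_then_clear(struct fake_iic *iic)
{
    fake_master_send(iic);
    iic->tx_errors = 0;
    return iic->tx_errors;
}

/* Changed order: clear the counter first, then start the transfer.
 * The error recorded by the interrupt survives. */
static unsigned clear_then_send(struct fake_iic *iic)
{
    iic->tx_errors = 0;
    fake_master_send(iic);
    return iic->tx_errors;
}
```

With the original ordering the caller sees zero errors even though one occurred; with the changed ordering the error is still visible after the call.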
-D i c k
04-28-2010 10:43 AM
Thanks for the update. Keep us up to date on any other findings.
Yes, we'll try to roll this feedback in, but want to wait long enough for your testing.
Yes, the # of retries is large, and that was because of EEPROM operations.
Thanks.
05-04-2010 07:53 AM
John,
The above fix has definitely been shown to fix the problem. You would also see this in the case of the EEPROM write being busy; any venture into that retry code could potentially cause a hang.
I also put a return -ENODEV after the printfs when the retry count is exhausted. It's not good to indicate success on the read when no data was actually placed in the buffer; someone might try to use it.
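Roughly, the error return looks like the sketch below. This is not the actual i2c-algo-xilinx.c code: xfer_fn, xfer_with_retries, and the two demo callbacks are hypothetical names, and MAX_RETRIES only reflects the 160 attempts mentioned earlier in the thread.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_RETRIES 160  /* retry count mentioned earlier in the thread */

/* Hypothetical transfer callback: returns 0 on ACK, nonzero on NAK. */
typedef int (*xfer_fn)(void *ctx);

/* Retries the transfer and reports failure instead of success
 * once the retry count is exhausted. */
static int xfer_with_retries(xfer_fn xfer, void *ctx)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (xfer(ctx) == 0)
            return 0;        /* device answered; buffer is valid */
    }
    return -ENODEV;          /* nothing answered; buffer was never filled */
}

/* Demo callback: a device that is not present and always NAKs. */
static int always_nak(void *ctx)
{
    (void)ctx;
    return 1;
}

/* Demo callback: a busy device that ACKs on the third attempt. */
static int ack_on_third(void *ctx)
{
    int *calls = ctx;
    return ++(*calls) < 3;
}
```

With always_nak the caller gets -ENODEV rather than a successful read of an unfilled buffer; with ack_on_third the call still succeeds after a few retries, so the EEPROM-busy case keeps working.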
-D i c k
05-04-2010 08:32 AM
Thanks, appreciate the feedback.
I'll try to queue that up for testing and incorporation.