The Ghost(busters) in the Machine

Wednesday March 21, 2007

How do you troubleshoot completely random problems?

My home desktop machine has been suffering from a Linux kernel "Oops" approximately once every two days for the last few weeks. I would really like it to stop doing that. When I get a stack trace in my logs, it's consistently in the "kswapd" process, even though I disabled all swap weeks ago.
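To check that "consistently" against the logs rather than memory, something like the following could tally which process each trace names. This is only a minimal sketch: it assumes the oops output includes a line of the form "Process NAME (pid: N, ...", and the function names and the log path in the usage note are my own illustrative choices, not anything the kernel or distribution defines.

```python
import re
from collections import Counter

# Assumption: each oops trace names the faulting task on a line shaped like
# "Process kswapd0 (pid: 150, ...". Adjust the pattern if your logs differ.
PROCESS_RE = re.compile(r"Process (\S+) \(pid: (\d+)")


def tally_oops_processes(lines):
    """Count how often each process name appears in oops 'Process' lines."""
    counts = Counter()
    for line in lines:
        match = PROCESS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def tally_log_file(path):
    """Run the tally over a log file (hypothetical helper for convenience)."""
    with open(path, errors="replace") as log:
        return tally_oops_processes(log)
```

Calling `tally_log_file("/var/log/kern.log")` (assuming that is where the kernel messages end up on this install) returns a `Counter`, and `.most_common()` on it would show whether kswapd really accounts for every trace or only most of them.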

I'm running Edgy on this machine, the same release I ran on my laptop and still run on my work desktop. Both of those machines have been completely stable (modulo occasional ndiswrapper issues) running the exact same kernel.

It doesn't seem like it's a hardware issue. At least, the same machine has never exhibited any problems under Windows.

It isn't deterministically reproducible. It always seems to be in response to a click or some kind of user-input event during heavy disk I/O, but flogging the disks and mashing the keyboard, even for hours at a time, doesn't cause it to happen.

I am considering a fresh re-install to try to fix this, but besides the inelegance of that solution, it seems likely to leave me in the same place.

Does anyone have a suggestion for tracking this down so that I'll actually know that it's fixed?