Kernel Hotspots and Problems
Much of this discussion came out of OLS 2005 and the PARISC Mini-bof we held on Thursday, July 21st, 2005.
Problem the first - More Crash Info -
When the kernel fails we need more information to debug the problem. Possible solutions include:
* Linux kernel crash dump (dump crash data to the network).
* Kexec (launch a second kernel to debug in-core data).
* Netdump?
Problem the second - Floating Point Registers -
We re-enabled floating point registers around the 2.6.9 era, and this introduced a bug in our code. The bug was rather subtle and was fixed by James Bottomley. The issue was that our _switch_to and _switch_to_ret were not saving the callee-saves floating point registers. The result was that one of the floating point registers was getting trashed in the pa_memcpy code, which usually meant something died with an invalid copy.
There are still some stability issues and crashes in pa_memcpy. It seems that Joel Soete has had some limited success using GCC 4.0 to limit the use of floating point registers.
There may still be some call paths that trash floating point registers. How do we find them?
* Create a consistency checking kernel thread.
* The kernel thread compares fp state every X seconds.
* The kernel thread could do other things?
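The comparison step of such a checker could look roughly like the sketch below. This is a userspace model only, not kernel code: the names fp_snapshot, fp_snapshot_take, fp_snapshot_check and the NUM_CHECKED_FP_REGS count are all hypothetical, and the "live" array stands in for whatever mechanism would actually read the floating point registers. The idea is to record the raw register bits once and later report the first register whose bits have changed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: how many FP registers the checker watches. */
#define NUM_CHECKED_FP_REGS 10

/* Saved copy of the watched registers, stored as raw 64-bit patterns. */
struct fp_snapshot {
    uint64_t regs[NUM_CHECKED_FP_REGS];
};

/* Record the current register contents. "live" is a stand-in for the
 * real register file; a kernel version would read the registers itself. */
static void fp_snapshot_take(struct fp_snapshot *snap, const uint64_t *live)
{
    assert(snap != NULL && live != NULL);
    memcpy(snap->regs, live, sizeof(snap->regs));
}

/* Compare the saved copy with the current contents.
 * Returns the index of the first register that differs, or -1 if
 * every watched register still holds the saved bit pattern. */
static int fp_snapshot_check(const struct fp_snapshot *snap,
                             const uint64_t *live)
{
    assert(snap != NULL && live != NULL);
    for (int i = 0; i < NUM_CHECKED_FP_REGS; i++) {
        if (snap->regs[i] != live[i])
            return i;
    }
    return -1;
}
```

Comparing raw bit patterns rather than floating point values avoids false alarms from NaNs, which never compare equal to themselves. A checking thread would call fp_snapshot_take once, sleep for X seconds, then call fp_snapshot_check and log the returned index if it is not -1.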