What is it we are looking for? Well, below you can find some examples of messages that are related to kernel problems. Usually, such messages appear on the system console, but sometimes (eg. when the problem is not sufficiently serious to make the kernel crash immediately) you can also see them in the logs.
=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
idle/1 is trying to acquire lock:
(lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
but task is already holding lock:
(lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
other info that might help us debug this:
1 lock held by idle/1:
#0: (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
stack backtrace:
[<c0103e89>] show_trace+0xd/0x10
[<c0104483>] dump_stack+0x19/0x1b
[<c01395fa>] __lock_acquire+0x7d9/0xa50
[<c0139a98>] lock_acquire+0x71/0x91
[<c02f0beb>] _spin_lock_irqsave+0x2c/0x3c
[<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
[<c0222d95>] acpi_ev_gpe_detect+0x4d/0x10e
[<c02215c3>] acpi_ev_sci_xrupt_handler+0x15/0x1d
[<c021c8b1>] acpi_irq+0xe/0x18
[<c014d36e>] request_irq+0xbe/0x10c
[<c021cf33>] acpi_os_install_interrupt_handler+0x59/0x87
[<c02215e7>] acpi_ev_install_sci_handler+0x1c/0x21
[<c0220d41>] acpi_ev_install_xrupt_handlers+0x9/0x50
[<c0231772>] acpi_enable_subsystem+0x7d/0x9a
[<c0416656>] acpi_init+0x3f/0x170
[<c01003ae>] _stext+0x116/0x26c
[<c0101005>] kernel_thread_helper+0x5/0xb
The above message indicates that the kernel's runtime locking correctness
validator (often referred to as ''lockdep'') has detected a possible locking
error. The errors detected by lockdep need not be critical, so if you have
enabled lockdep in the kernel configuration, which is recommended for testing
(see Subsection 1.6.4), from time to time you can
see them in the system logs or in the output of dmesg. They are always
worth reporting, although sometimes lockdep may think that there is a problem
even if the locking is used in a correct way. Still, in such a case your report
will tell the kernel developers that they should teach lockdep not to trigger
in this particular place any more.
BUG: sleeping function called from invalid context at /usr/src/linux-mm/sound/core/info.c:117 in_atomic():1, irqs_disabled():0 <c1003ef9> show_trace+0xd/0xf <c100440c> dump_stack+0x17/0x19 <c10178ce> __might_sleep+0x93/0x9d <f988eeb5> snd_iprintf+0x1b/0x84 [snd] <f988d808> snd_card_module_info_read+0x34/0x4e [snd] <f988f197> snd_info_entry_open+0x20f/0x2cc [snd] <c1067a17> __dentry_open+0x133/0x260 <c1067bb7> nameidata_to_filp+0x1c/0x2e <c1067bf7> do_filp_open+0x2e/0x35 <c1068bf2> do_sys_open+0x54/0xd7 <c1068ca1> sys_open+0x16/0x18 <c11dab67> sysenter_past_esp+0x54/0x75 BUG: using smp_processor_id() in preemptible [00000001] code: init/1 caller is __handle_mm_fault+0x2b/0x20d [<c0103ba8>] show_trace+0xd/0xf [<c0103c7a>] dump_stack+0x17/0x19 [<c0203bcc>] debug_smp_processor_id+0x8c/0xa0 [<c0160e60>] __handle_mm_fault+0x2b/0x20d [<c0116f7b>] do_page_fault+0x226/0x61f [<c0103959>] error_code+0x39/0x40 [<c019d4c1>] padzero+0x19/0x28 [<c019e716>] load_elf_binary+0x836/0xc02 [<c017db53>] search_binary_handler+0x123/0x35a [<c019d3b9>] load_script+0x221/0x230 [<c017db53>] search_binary_handler+0x123/0x35a [<c017deee>] do_execve+0x164/0x215 [<c0101e7a>] sys_execve+0x3b/0x7e [<c02fabc3>] syscall_call+0x7/0xbThe above message means that one of the kernel's functions has been called from a wrong place. In this particular case the execution of the function
snd_iprintf() might be suspended until certain condition is satisfied
(in such cases the kernel developers say that the function might sleep), so it
should not be called from the portions of code that have to be executed
atomically, such as interrupt handlers (ie. from atomic context).
However, apparently snd_iprintf() has been called from atomic context and
the kernel reports this as a potential problem. Such problems need not cause
the kernel to crash and the related messages, similar to the above one, can
appear in the system logs or in the output of dmesg, but they are
serious and should always be reported.
The next message is a so-called Oops, which means that it represents
a problem causing the kernel to stop working. In other words, it means that
something really bad has happened and the kernel cannot continue running,
since your hardware might be damaged or your data might be corrupted otherwise.
Such messages are often accompanied by so-called kernel panics (the
origin of the term ''kernel panic'' as well as its possible meanings are
explained in the OSWeekly.com article by Puru Govind which is available
at http://www.osweekly.com/index.php?option=com_content&task=view&id=2241&Itemid=449).
BUG: unable to handle kernel paging request at virtual address 6b6b6c07
printing eip:
c0138722
*pde = 00000000
Oops: 0002 [#1]
4K_STACKS PREEMPT SMP
last sysfs file: /devices/pci0000:00/0000:00:1d.7/uevent
Modules linked in: snd_timer snd soundcore snd_page_alloc intel_agp agpgart
ide_cd cdrom ipv6 w83627hf hwmon_vid hwmon i2c_isa i2c_i801 skge af_packet
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
iptable_filter ip_tables x_tables cpufreq_userspace p4_clockmod speedstep_lib
binfmt_misc thermal processor fan container rtc unix
CPU: 0
EIP: 0060:[<c0138722>] Not tainted VLI
EFLAGS: 00010046 (2.6.18-rc2-mm1 #78)
EIP is at __lock_acquire+0x362/0xaea
eax: 00000000 ebx: 6b6b6b6b ecx: c0360358 edx: 00000000
esi: 00000000 edi: 00000000 ebp: f544ddf4 esp: f544ddc0
ds: 007b es: 007b ss: 0068
Process udevd (pid: 1353, ti=f544d000 task=f6fce8f0 task.ti=f544d000)
Stack: 00000000 00000000 00000000 c7749ea4 f6fce8f0 c0138e74 000001e8 00000000
00000000 f6653fa4 00000246 00000000 00000000 f544de1c c0139214 00000000
00000002 00000000 c014fe3a c7749ea4 c7749e90 f6fce8f0 f5b19b04 f544de34
Call Trace:
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c0104318>] show_stack_log_lvl+0x8c/0x97
[<c010447f>] show_registers+0x15c/0x1ed
[<c01046c2>] die+0x1b2/0x2b7
[<c0116f5f>] do_page_fault+0x410/0x4f0
[<c0103d1d>] error_code+0x39/0x40
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
Code: 68 4b 75 2f c0 68 d5 04 00 00 68 b9 75 31 c0 68 e3 06 31 c0 e8 ce 7e fe ff
e8 87 c2 fc ff 83 c4 10 eb 08 85 db 0f 84 6b 07 00 00 <f0> ff 83 9c 00 00 00 8b
55 dc 8b 92 5c 05 00 00 89 55 e4 83 fa
EIP: [<c0138722>] __lock_acquire+0x362/0xaea SS:ESP 0068:f544ddc0
The following message represents an error resulting from a situation that, according to the kernel developers, cannot happen:
KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148) KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148) KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147) KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147) BUG: warning at /usr/src/linux-mm/kernel/cpu.c:56/unlock_cpu_hotplug() [<c0103e41>] dump_trace+0x70/0x176 [<c0103fc1>] show_trace_log_lvl+0x12/0x22 [<c0103fde>] show_trace+0xd/0xf [<c01040b0>] dump_stack+0x17/0x19 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace] [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace] [<c02ac9f2>] store+0x38/0x49 [<c01aec16>] flush_write_buffer+0x23/0x2b [<c01aec69>] sysfs_write_file+0x4b/0x6c [<c01770af>] vfs_write+0xcb/0x173 [<c0177203>] sys_write+0x3b/0x71 [<c010312d>] sysenter_past_esp+0x56/0x8d [<b7fbe410>] 0xb7fbe410 [<c0103fc1>] show_trace_log_lvl+0x12/0x22 [<c0103fde>] show_trace+0xd/0xf [<c01040b0>] dump_stack+0x17/0x19 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace] [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace] [<c02ac9f2>] store+0x38/0x49 [<c01aec16>] flush_write_buffer+0x23/0x2b [<c01aec69>] sysfs_write_file+0x4b/0x6c [<c01770af>] vfs_write+0xcb/0x173 [<c0177203>] sys_write+0x3b/0x71 [<c010312d>] sysenter_past_esp+0x56/0x8d
Apart from the messages that can appear on the console, in the output of
dmesg or in the system logs, some problems can be reported in a less
direct way. For example, if the kernel memory leak detector is used (so far, it
has not been included in stable kernels), there is the file
/sys/kernel/debug/memleak, in which possible kernel memory leaks are
registered, eg.
orphan pointer 0xf5a6fd60 (size 39): c0173822: <__kmalloc> c01df500: <context_struct_to_string> c01df679: <security_sid_to_context> c01d7eee: <selinux_socket_getpeersec_dgram> f884f019: <unix_get_peersec_dgram> f8850698: <unix_dgram_sendmsg> c02a88c2: <sock_sendmsg> c02a9c7a: <sys_sendto>This information, supplemented with the kernel configuration file (see Section 1.6), may allow the kernel developers to fix the bug that you have managed to find.
The examples shown above are related to problems that occur when the kernel is running, called run-time errors. Obviously, to get a run-time error you need to build, install and boot the kernel. Surprisingly, however, it is possible that you will not be able to build the kernel due to a compilation error. This is not a frequent problem and it usually indicates that the author of certain piece of kernel code was not careful enough. Still, this happens to many developers, including us, and if you find a compilation error, report it immediately (you can even try to fix it if you are good at programming in C). To find examples of what happens after someone finds a compilation problem in the kernel, you can look at one of the discussions taking place on the LKML after Andrew Morton announces a new -mm kernel (for more information about the -mm tree see Section 1.5).
It should be stressed that some kernel bugs are not immediately visible. Some of them show up only in specific situations and may manifest themselves, for example, in hanging random processes or dropping random data into files that are written to. For instance, there is a whole category of kernel problems that appear only when the system is suspended to RAM or hibernated (ie. suspended to disk), either during the suspend, or while the kernel is resuming normal operations. All in all, you will never know what surprises the kernel has got for you, so you should better be prepared.
Generally, kernel run-time errors can be divided into three categories: