next up previous contents
Next: Binary drivers and distribution Up: Testing Previous: Measuring performance   Contents

Hello world! or What exactly are we looking for?

What is it we are looking for? Well, below you can find some examples of messages that are related to kernel problems. Usually, such messages appear on the system console, but sometimes (eg. when the problem is not sufficiently serious to make the kernel crash immediately) you can also see them in the logs.

=============================================
[ INFO: possible recursive locking detected ]
---------------------------------------------
idle/1 is trying to acquire lock:
 (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa

but task is already holding lock:
 (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa

other info that might help us debug this:
1 lock held by idle/1:
 #0:  (lock_ptr){....}, at: [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa

stack backtrace:
 [<c0103e89>] show_trace+0xd/0x10
 [<c0104483>] dump_stack+0x19/0x1b
 [<c01395fa>] __lock_acquire+0x7d9/0xa50
 [<c0139a98>] lock_acquire+0x71/0x91
 [<c02f0beb>] _spin_lock_irqsave+0x2c/0x3c
 [<c021cbd2>] acpi_os_acquire_lock+0x8/0xa
 [<c0222d95>] acpi_ev_gpe_detect+0x4d/0x10e
 [<c02215c3>] acpi_ev_sci_xrupt_handler+0x15/0x1d
 [<c021c8b1>] acpi_irq+0xe/0x18
 [<c014d36e>] request_irq+0xbe/0x10c
 [<c021cf33>] acpi_os_install_interrupt_handler+0x59/0x87
 [<c02215e7>] acpi_ev_install_sci_handler+0x1c/0x21
 [<c0220d41>] acpi_ev_install_xrupt_handlers+0x9/0x50
 [<c0231772>] acpi_enable_subsystem+0x7d/0x9a
 [<c0416656>] acpi_init+0x3f/0x170
 [<c01003ae>] _stext+0x116/0x26c
 [<c0101005>] kernel_thread_helper+0x5/0xb
The above message indicates that the kernel's runtime locking correctness validator (often referred to as ''lockdep'') has detected a possible locking error. The errors detected by lockdep need not be critical, so if you have enabled lockdep in the kernel configuration, which is recommended for testing (see Subsection 1.6.4), from time to time you can see them in the system logs or in the output of dmesg. They are always worth reporting, although sometimes lockdep may think that there is a problem even if the locking is used in a correct way. Still, in such a case your report will tell the kernel developers that they should teach lockdep not to trigger in this particular place any more.

BUG: sleeping function called from invalid context at /usr/src/linux-mm/sound/core/info.c:117
in_atomic():1, irqs_disabled():0
 <c1003ef9> show_trace+0xd/0xf
 <c100440c> dump_stack+0x17/0x19
 <c10178ce> __might_sleep+0x93/0x9d
 <f988eeb5> snd_iprintf+0x1b/0x84 [snd]
 <f988d808> snd_card_module_info_read+0x34/0x4e [snd] 
 <f988f197> snd_info_entry_open+0x20f/0x2cc [snd]
 <c1067a17> __dentry_open+0x133/0x260
 <c1067bb7> nameidata_to_filp+0x1c/0x2e
 <c1067bf7> do_filp_open+0x2e/0x35
 <c1068bf2> do_sys_open+0x54/0xd7
 <c1068ca1> sys_open+0x16/0x18
 <c11dab67> sysenter_past_esp+0x54/0x75
BUG: using smp_processor_id() in preemptible [00000001] code: init/1
caller is __handle_mm_fault+0x2b/0x20d
 [<c0103ba8>] show_trace+0xd/0xf
 [<c0103c7a>] dump_stack+0x17/0x19
 [<c0203bcc>] debug_smp_processor_id+0x8c/0xa0
 [<c0160e60>] __handle_mm_fault+0x2b/0x20d
 [<c0116f7b>] do_page_fault+0x226/0x61f
 [<c0103959>] error_code+0x39/0x40
 [<c019d4c1>] padzero+0x19/0x28
 [<c019e716>] load_elf_binary+0x836/0xc02
 [<c017db53>] search_binary_handler+0x123/0x35a
 [<c019d3b9>] load_script+0x221/0x230
 [<c017db53>] search_binary_handler+0x123/0x35a
 [<c017deee>] do_execve+0x164/0x215
 [<c0101e7a>] sys_execve+0x3b/0x7e
 [<c02fabc3>] syscall_call+0x7/0xb
The above message means that one of the kernel's functions has been called from a wrong place. In this particular case the execution of the function snd_iprintf() might be suspended until certain condition is satisfied (in such cases the kernel developers say that the function might sleep), so it should not be called from the portions of code that have to be executed atomically, such as interrupt handlers (ie. from atomic context). However, apparently snd_iprintf() has been called from atomic context and the kernel reports this as a potential problem. Such problems need not cause the kernel to crash and the related messages, similar to the above one, can appear in the system logs or in the output of dmesg, but they are serious and should always be reported.

The next message is a so-called Oops, which means that it represents a problem causing the kernel to stop working. In other words, it means that something really bad has happened and the kernel cannot continue running, since your hardware might be damaged or your data might be corrupted otherwise. Such messages are often accompanied by so-called kernel panics (the origin of the term ''kernel panic'' as well as its possible meanings are explained in the OSWeekly.com article by Puru Govind which is available at http://www.osweekly.com/index.php?option=com_content&task=view&id=2241&Itemid=449).

BUG: unable to handle kernel paging request at virtual address 6b6b6c07
 printing eip:
c0138722
*pde = 00000000
Oops: 0002 [#1]
4K_STACKS PREEMPT SMP 
last sysfs file: /devices/pci0000:00/0000:00:1d.7/uevent
Modules linked in: snd_timer snd soundcore snd_page_alloc intel_agp agpgart
ide_cd cdrom ipv6 w83627hf hwmon_vid hwmon i2c_isa i2c_i801 skge af_packet
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink xt_tcpudp
iptable_filter ip_tables x_tables cpufreq_userspace p4_clockmod speedstep_lib
binfmt_misc thermal processor fan container rtc unix
CPU:    0
EIP:    0060:[<c0138722>]    Not tainted VLI
EFLAGS: 00010046   (2.6.18-rc2-mm1 #78) 
EIP is at __lock_acquire+0x362/0xaea
eax: 00000000   ebx: 6b6b6b6b   ecx: c0360358   edx: 00000000
esi: 00000000   edi: 00000000   ebp: f544ddf4   esp: f544ddc0
ds: 007b   es: 007b   ss: 0068
Process udevd (pid: 1353, ti=f544d000 task=f6fce8f0 task.ti=f544d000)
Stack: 00000000 00000000 00000000 c7749ea4 f6fce8f0 c0138e74 000001e8 00000000 
        00000000 f6653fa4 00000246 00000000 00000000 f544de1c c0139214 00000000 
        00000002 00000000 c014fe3a c7749ea4 c7749e90 f6fce8f0 f5b19b04 f544de34 
Call Trace:
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x8d
Leftover inexact backtrace:
[<c0104318>] show_stack_log_lvl+0x8c/0x97
[<c010447f>] show_registers+0x15c/0x1ed
[<c01046c2>] die+0x1b2/0x2b7
[<c0116f5f>] do_page_fault+0x410/0x4f0
[<c0103d1d>] error_code+0x39/0x40
[<c0139214>] lock_acquire+0x71/0x91
[<c02f2bfb>] _spin_lock+0x23/0x32
[<c014fe3a>] __delayacct_blkio_ticks+0x16/0x67
[<c01a4f76>] do_task_stat+0x3df/0x6c1
[<c01a5265>] proc_tgid_stat+0xd/0xf
[<c01a29dd>] proc_info_read+0x50/0xb3
[<c0171cbb>] vfs_read+0xcb/0x177
[<c017217c>] sys_read+0x3b/0x71
[<c0103119>] sysenter_past_esp+0x56/0x8d
Code: 68 4b 75 2f c0 68 d5 04 00 00 68 b9 75 31 c0 68 e3 06 31 c0 e8 ce 7e fe ff
e8 87 c2 fc ff 83 c4 10 eb 08 85 db 0f 84 6b 07 00 00 <f0> ff 83 9c 00 00 00 8b
55 dc 8b 92 5c 05 00 00 89 55 e4 83 fa 
EIP: [<c0138722>] __lock_acquire+0x362/0xaea SS:ESP 0068:f544ddc0

The following message represents an error resulting from a situation that, according to the kernel developers, cannot happen:

KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->lost_out >= 0) failed at net/ipv4/tcp_input.c (2148)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)
KERNEL: assertion ((int)tp->sacked_out >= 0) failed at net/ipv4/tcp_input.c (2147)



BUG: warning at /usr/src/linux-mm/kernel/cpu.c:56/unlock_cpu_hotplug()
 [<c0103e41>] dump_trace+0x70/0x176
 [<c0103fc1>] show_trace_log_lvl+0x12/0x22
 [<c0103fde>] show_trace+0xd/0xf
 [<c01040b0>] dump_stack+0x17/0x19
 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c
 [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace]
 [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace]
 [<c02ac9f2>] store+0x38/0x49
 [<c01aec16>] flush_write_buffer+0x23/0x2b
 [<c01aec69>] sysfs_write_file+0x4b/0x6c
 [<c01770af>] vfs_write+0xcb/0x173
 [<c0177203>] sys_write+0x3b/0x71
 [<c010312d>] sysenter_past_esp+0x56/0x8d
 [<b7fbe410>] 0xb7fbe410
 [<c0103fc1>] show_trace_log_lvl+0x12/0x22
 [<c0103fde>] show_trace+0xd/0xf
 [<c01040b0>] dump_stack+0x17/0x19
 [<c0140e19>] unlock_cpu_hotplug+0x46/0x7c
 [<fd9560b0>] cpufreq_set+0x81/0x8b [cpufreq_userspace]
 [<fd956109>] store_speed+0x35/0x40 [cpufreq_userspace]
 [<c02ac9f2>] store+0x38/0x49
 [<c01aec16>] flush_write_buffer+0x23/0x2b
 [<c01aec69>] sysfs_write_file+0x4b/0x6c
 [<c01770af>] vfs_write+0xcb/0x173
 [<c0177203>] sys_write+0x3b/0x71
 [<c010312d>] sysenter_past_esp+0x56/0x8d

Apart from the messages that can appear on the console, in the output of dmesg or in the system logs, some problems can be reported in a less direct way. For example, if the kernel memory leak detector is used (so far, it has not been included in stable kernels), there is the file /sys/kernel/debug/memleak, in which possible kernel memory leaks are registered, eg.

orphan pointer 0xf5a6fd60 (size 39):
c0173822: <__kmalloc>
c01df500: <context_struct_to_string>
c01df679: <security_sid_to_context>
c01d7eee: <selinux_socket_getpeersec_dgram>
f884f019: <unix_get_peersec_dgram>
f8850698: <unix_dgram_sendmsg>
c02a88c2: <sock_sendmsg>
c02a9c7a: <sys_sendto>
This information, supplemented with the kernel configuration file (see Section 1.6), may allow the kernel developers to fix the bug that you have managed to find.

The examples shown above are related to problems that occur when the kernel is running, called run-time errors. Obviously, to get a run-time error you need to build, install and boot the kernel. Surprisingly, however, it is possible that you will not be able to build the kernel due to a compilation error. This is not a frequent problem and it usually indicates that the author of certain piece of kernel code was not careful enough. Still, this happens to many developers, including us, and if you find a compilation error, report it immediately (you can even try to fix it if you are good at programming in C). To find examples of what happens after someone finds a compilation problem in the kernel, you can look at one of the discussions taking place on the LKML after Andrew Morton announces a new -mm kernel (for more information about the -mm tree see Section 1.5).

It should be stressed that some kernel bugs are not immediately visible. Some of them show up only in specific situations and may manifest themselves, for example, in hanging random processes or dropping random data into files that are written to. For instance, there is a whole category of kernel problems that appear only when the system is suspended to RAM or hibernated (ie. suspended to disk), either during the suspend, or while the kernel is resuming normal operations. All in all, you will never know what surprises the kernel has got for you, so you should better be prepared.

Generally, kernel run-time errors can be divided into three categories:

The easily reproducible bugs are the easiest to fix, since in these cases it is quite easy, albeit often quite time-consuming, to find a patch that has introduced the problem. For this reason, the easily reproducible bugs are usually fixed relatively quickly. In turn, the bugs that are difficult to reproduce usually take a lot of time to get fixed, since in these cases the source of the problem cannot be easily identified. If you encounter such a bug, you will probably need help of the developers knowing the relevant kernel subsystem and you will have to be very patient.


next up previous contents
Next: Binary drivers and distribution Up: Testing Previous: Measuring performance   Contents
MichaƂ Piotrowski 2007-06-21