KernelDebuggingTricks

Kernel Debugging Tricks

Debugging the kernel is not necessarily rocket science; in fact it can be achieved using very simple and straight forward techniques and some time, patience and perseverance. This page describes some tricks and techniques to help debug the kernel.

printk is your friend

The simplest, and probably most effective way to debug the kernel is via printk(). This enables one to print messages to the console, and it very similar to printf(). Note that printk() can slow down the execution of code which can alter the way code runs, for example, changing the way race conditions occur.

Changing the ring buffer size

The internal kernel console message buffer can sometimes be too small to capture all of the printk messages, especially when debug code generates a lot of printk messages. If the buffer fills up, it wraps around and one can lose valueable debug messages.

To increase the internal buffer, use the kernel boot parameter:

log_buf_len=N

where N is the size of the buffer in bytes, and must be a power of 2.

Changing debug levels

One can specify the type of printk() log level by pre-pending the 1st printk() argument with one of the following:

KERN_EMERG    /* system is unusable                   */
KERN_ALERT    /* action must be taken immediately     */
KERN_CRIT     /* critical conditions                  */
KERN_ERR      /* error conditions                     */
KERN_WARNING  /* warning conditions                   */
KERN_NOTICE   /* normal but significant condition     */
KERN_INFO     /* informational                        */
KERN_DEBUG    /* debug-level messages                 */

e.g. printk(KERN_DEBUG "example debug message\n");

If one does not specify the log level then the default log level of KERN_WARNING is used. For example, enable all levels of console message:

echo 7 > /proc/sys/kernel/printk

To view console messages at boot, remove the quite and splash boot parameters from the kernel boot line in grub. This will disable the usplash splash screen and re-enable console messages.

Serial Console

Serial console enables one to dump out console messages over a serial cable. Most modern PCs do not have legacy serial ports, so instead, one can use a USB serial dongle instead. A "null serial cable" or "universal file transfer cable" is needed to connect the target computer with the host. Most commonly this will be a DB9 female to DB9 female null serial cable. In addition, one needs to enable USB serial support as a kernel build configuration:

CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL=y

and enable the appropriate driver, e.g.:

CONFIG_USB_SERIAL_PL2303=y

and boot this kernel with

console=ttyUSB0,9600n8

one may need to adjust the baud rate appropriately.

Note: Generally, there is NO hardware or software flow control on serial console drivers, which means one may get dropped characters when running very high speed tty baud rates, such as 115200 baud.

Console Messages

Kernel Oops messages general contain a fair amount of information, ranging from register and process state dump and a stack dump too. Unfortunately the stack dump can be more than 25 lines and can scroll off the top of the 25 line Virtual Console. Hence to capture more of a Oops, try the following:

chvt 1
setfont /usr/share/consolefonts/Uni1-VGA8.psf.gz

Of course, one may still have a stack dump that scrolls the top of the Oops message off the console, so one trick is to rebuild the kernel with the stack dump removed, just to capture the initial Oops information. To do this, modify dump_stack in arch/x86/kernel/dumpstack_*.c and comment out the call to show_trace()

Slowing down kernel messages on boot

One may find a machine hangs during the kernel boot process and one would like to be able to see all the kernel messages but unfortunately they scroll off the console too quickly. One can slow down kernel console messages at boot time using by building the kernel with the following option enabled:

CONFIG_BOOT_PRINTK_DELAY=y

And boot the machine with the following kernel boot parameter:

boot_delay=N

where N = msecs delay between each console message.

Kernel panic during suspend

Debugging suspend/resume issues can be difficult if the kernel panics during suspend, especially late in the suspend because console messages are disabled. One can stop console messages from being suspended by using the kernel parameter no_console_suspend:

no_console_suspend=1

This will force the console not to suspend. Boot with this option, chvt 1 (to console #1), and suspend using pm-suspend

Serial Console in VirtualBox

In some debug scenerios it can be helpful to debug the kernel running inside a virtual machine. This is useful for some classes of non-hardware specific bugs, for example generic kernel core problems or debugging file system drivers.

One can capture Linux console messages running inside VirtualBox by setting it the VirtualBox serial log to /tmp/vbox and running a serial tty communications program such as minicom, and configure it to communicate with a named pipe tty called unix#/tmp/vbox

Boot with virtualised kernel boot line:

console=ttyS0,9600 

and minicom will capture the console messages

Network Console

One can route console messages over a network using netconsole. Note that it's not useful for capturing kernel panics as kernel halts before the messages can be transmitted over the network. However it can be useful to monitor systems without the need of message serial console cabling.

see Documentation/networking/netconsole.txt and also Kernel/Netconsole.

netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]

        where
             src-port      source for UDP packets (defaults to 6665)
             src-ip        source IP to use (interface address)
             dev           network interface (eth0)
             tgt-port      port for logging agent (6666)
             tgt-ip        IP address for logging agent
             tgt-macaddr   ethernet MAC address for logging agent (broadcast)

Examples:

linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc

The remote host can run either 'netcat -u -l -p <port>' or syslogd.

gdb on vmlinux

One can disassemble a built kernel using gdb on the vmlinux image. This is useful when one gets a kernel Oops message and a stack dump - one can then disassemble the object code and see where the Oops is occuring. For example:

gdb  debian/build/build-generic/vmlinux
(gdb) disassemble printk
Dump of assembler code for function printk:
0xffffffff8023dce0 <printk+0>:  sub    $0xd8,%rsp
0xffffffff8023dce7 <printk+7>:  lea    0xe0(%rsp),%rax
0xffffffff8023dcef <printk+15>: mov    %rsi,0x28(%rsp)
0xffffffff8023dcf4 <printk+20>: mov    %rsp,%rsi
0xffffffff8023dcf7 <printk+23>: mov    %rdx,0x30(%rsp)
0xffffffff8023dcfc <printk+28>: mov    %rcx,0x38(%rsp)
0xffffffff8023dd01 <printk+33>: mov    %rax,0x8(%rsp)
0xffffffff8023dd06 <printk+38>: lea    0x20(%rsp),%rax
0xffffffff8023dd0b <printk+43>: mov    %r8,0x40(%rsp)
0xffffffff8023dd10 <printk+48>: mov    %r9,0x48(%rsp)
0xffffffff8023dd15 <printk+53>: movl   $0x8,(%rsp)
0xffffffff8023dd1c <printk+60>: movl   $0x30,0x4(%rsp)
0xffffffff8023dd24 <printk+68>: mov    %rax,0x10(%rsp)
0xffffffff8023dd29 <printk+73>: callq  0xffffffff8023d980 <vprintk>
0xffffffff8023dd2e <printk+78>: add    $0xd8,%rsp
0xffffffff8023dd35 <printk+85>: retq   
End of assembler dump.

Objdump

If one has the built object code at hand, one can disassemble the object using objdump as follows:

objdump -SdCg debian/build/build-generic/fs/dcache.o

Using GDB to find the location where your kernel panicked or oopsed.

A quick and easy way to find the line of code where your kernel panicked or oopsed is to use GDB list command. You can do this as follows.

Lets assume your panic/oops message says something like:

[  174.507084] Stack:
[  174.507163]  ce0bd8ac 00000008 00000000 ce4a7e90 c039ce30 ce0bd8ac c0718b04 c07185a0
[  174.507380]  ce4a7ea0 c0398f22 ce0bd8ac c0718b04 ce4a7eb0 c037deee ce0bd8e0 ce0bd8ac
[  174.507597]  ce4a7ec0 c037dfe0 c07185a0 ce0bd8ac ce4a7ed4 c037d353 ce0bd8ac ce0bd8ac
[  174.507888] Call Trace:
[  174.508125]  [<c039ce30>] ? sd_remove+0x20/0x70
[  174.508235]  [<c0398f22>] ? scsi_bus_remove+0x32/0x40
[  174.508326]  [<c037deee>] ? __device_release_driver+0x3e/0x70
[  174.508421]  [<c037dfe0>] ? device_release_driver+0x20/0x40
[  174.508514]  [<c037d353>] ? bus_remove_device+0x73/0x90
[  174.508606]  [<c037bccf>] ? device_del+0xef/0x150
[  174.508693]  [<c0399207>] ? __scsi_remove_device+0x47/0x80
[  174.508786]  [<c0399262>] ? scsi_remove_device+0x22/0x40
[  174.508877]  [<c0399324>] ? __scsi_remove_target+0x94/0xd0
[  174.508969]  [<c03993c0>] ? __remove_child+0x0/0x20
[  174.509060]  [<c03993d7>] ? __remove_child+0x17/0x20
[  174.509148]  [<c037b868>] ? device_for_each_child+0x38/0x60
[  174.509241]  [<c039938f>] ? scsi_remove_target+0x2f/0x60
[  174.509393]  [<d0c38907>] ? __iscsi_unbind_session+0x77/0xa0 [scsi_transport_iscsi]
[  174.509699]  [<c015272e>] ? run_workqueue+0x6e/0x140
[  174.509801]  [<d0c38890>] ? __iscsi_unbind_session+0x0/0xa0 [scsi_transport_iscsi]
[  174.509977]  [<c0152888>] ? worker_thread+0x88/0xe0
[  174.510047]  [<c01566a0>] ? autoremove_wake_function+0x0/0x40

Lets say you want to know what line of code represents sd_remove+0x20/0x70. cd to the ubuntu debian/build/build-generic directory in your kernel tree and run gdb on the ".o" file which has the function sd_remove() in this case in sd.o, and use the gdb "list" command, (gdb) list *(function+0xoffset), in this case function is sd_remove() and offset is 0x20, and gdb should tell you the line number where you hit the panic or oops. This has worked for me very reliably for most cases.

manjo@hungry:~/devel/ubuntu/kernel/ubuntu-karmic-397906/debian/build/build-generic/drivers/scsi$ gdb sd.o
GNU gdb (GDB) 6.8.50.20090628-cvs-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
(gdb) list *(sd_remove+0x20)
0x1650 is in sd_remove (/home/manjo/devel/ubuntu/kernel/ubuntu-karmic-397906/drivers/scsi/sd.c:2125).
2120    static int sd_remove(struct device *dev)
2121    {
2122            struct scsi_disk *sdkp;
2123    
2124            async_synchronize_full();
2125            sdkp = dev_get_drvdata(dev);
2126            blk_queue_prep_rq(sdkp->device->request_queue, scsi_prep_fn);
2127            device_del(&sdkp->dev);
2128            del_gendisk(sdkp->disk);
2129            sd_shutdown(dev);
(gdb)

Kernel/KernelDebuggingTricks (last edited 2015-09-21 12:04:28 by 78)