= Introduction =

This page shows how to debug kernel bugs inside a QEMU/KVM instance.

= Kernel debugging using QEMU/KVM instances =

== Step #1: create a memory dump of the VM ==

=== libvirt: virsh dump ===

If the VM is running in a libvirt environment we can use "virsh dump" to generate a memory dump:
{{{
$ virsh dump --memory-only bionic_amd64 /tmp/vmcore.img

Domain bionic_amd64 dumped to /tmp/vmcore.img
}}}

NOTE: the --memory-only argument is required to dump only the VM's memory, so that the resulting vmcore.img can be analyzed later with tools such as gdb or crash:
{{{
$ crash /tmp/vmcore.img ~/canonical/bcache/bad-kernel/vmlinux
...
crash>
}}}

=== QMP ===

If the VM is not running in a libvirt environment, it is possible to attach directly to the QEMU instance and generate the memory dump (similarly to the `virsh dump` method) using the QEMU Machine Protocol (QMP). QMP is a JSON-based protocol that allows external applications to control QEMU/KVM instances.

To generate a memory dump, make sure to start the QEMU/KVM instance with the -qmp option:
{{{
$ qemu-system-x86_64 -qmp tcp:localhost:4444,server,nowait ...
}}}

QEMU now listens on port 4444 for QMP commands. Connect to the QMP port and send the following commands to generate a crash dump in /tmp/vmcore.img:
{{{
{"execute": "qmp_capabilities"}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
}}}

Example:
{{{
$ telnet localhost 4444
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
{"QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 4}, "package": "Debian 1:4.0+dfsg-0ubuntu5"}, "capabilities": ["oob"]}}
{ "execute": "qmp_capabilities" }
{"return": {}}
{"execute":"dump-guest-memory","arguments":{"paging":false,"protocol":"file:/tmp/vmcore.img"}}
...
{"return": {}}
^]
telnet> quit
Connection closed.
$ file /tmp/vmcore.img
/tmp/vmcore.img: ELF 64-bit LSB core file, x86-64, version 1 (SYSV), SVR4-style
}}}

== Step #2: analyze the memory dump using crash ==

Once the memory dump has been generated, you need the vmlinux of the running kernel; then you can open the dump with `crash`. Make sure to install `crash`:
{{{
$ apt install crash
}}}

Examine the memory dump, reading the symbols from vmlinux:
{{{
$ crash /tmp/vmcore.img vmlinux
}}}

= Example: track down a bcache deadlock using a memory dump =

{{{
$ crash /tmp/vmcore.img vmlinux

      KERNEL: vmlinux
    DUMPFILE: vmcore.img
        CPUS: 2
        DATE: Tue Aug 6 12:15:22 2019
      UPTIME: 00:06:27
LOAD AVERAGE: 2.00, 1.81, 1.06
       TASKS: 152
    NODENAME: ubuntu
     RELEASE: 5.3.0-050300rc3-generic
     VERSION: #201908042232 SMP Tue Aug 6 09:32:11 UTC 2019
     MACHINE: x86_64  (1992 Mhz)
      MEMORY: 1 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper/0"
        TASK: ffffffffa6413780  (1 of 2)  [THREAD_INFO: ffffffffa6413780]
         CPU: 0
       STATE: TASK_RUNNING (ACTIVE)
     WARNING: panic task not found

crash> dmesg
...
[  133.771761] INFO: task python3:4714 blocked for more than 10 seconds.
[  133.773836]       Not tainted 5.3.0-050300rc3-generic #201908042232
[  133.776172] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  133.780275] python3         D    0  4714   1589 0x00004000
[  133.780283] Call Trace:
[  133.780361]  __schedule+0x2a8/0x670
[  133.780370]  ? __switch_to_asm+0x40/0x70
[  133.780375]  schedule+0x2d/0x90
[  133.780398]  bch_bucket_alloc+0xe5/0x370 [bcache]
[  133.780440]  ? wait_woken+0x80/0x80
[  133.780454]  __bch_bucket_alloc_set+0x10d/0x160 [bcache]
[  133.780467]  bch_bucket_alloc_set+0x4e/0x70 [bcache]
[  133.780486]  __uuid_write+0x61/0x180 [bcache]
[  133.780504]  ? __write_super+0x154/0x190 [bcache]
[  133.780519]  bch_uuid_write+0x16/0x40 [bcache]
[  133.780536]  __cached_dev_store+0x668/0x8c0 [bcache]
[  133.780552]  bch_cached_dev_store+0x46/0x110 [bcache]
[  133.780590]  sysfs_kf_write+0x3c/0x50
[  133.780596]  kernfs_fop_write+0x125/0x1a0
[  133.780618]  __vfs_write+0x1b/0x40
[  133.780622]  vfs_write+0xb1/0x1a0
[  133.780626]  ksys_write+0xa7/0xe0
[  133.780630]  __x64_sys_write+0x1a/0x20
[  133.780652]  do_syscall_64+0x5a/0x130
[  133.780659]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  133.780715] INFO: task bcache_allocato:7289 blocked for more than 10 seconds.
[  133.784130]       Not tainted 5.3.0-050300rc3-generic #201908042232
[  133.787863] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  133.791743] bcache_allocato D    0  7289      2 0x80004000
[  133.791751] Call Trace:
[  133.791765]  __schedule+0x2a8/0x670
[  133.791792]  ? kvm_sched_clock_read+0x11/0x20
[  133.791798]  schedule+0x2d/0x90
[  133.791821]  bch_bucket_alloc+0xe5/0x370 [bcache]
[  133.791827]  ? wait_woken+0x80/0x80
[  133.791848]  bch_prio_write+0x1dc/0x390 [bcache]
[  133.791862]  bch_allocator_thread+0x233/0x490 [bcache]
[  133.791881]  kthread+0x121/0x140
[  133.791894]  ? invalidate_buckets+0x890/0x890 [bcache]
[  133.791899]  ? kthread_park+0xb0/0xb0
[  133.791906]  ret_from_fork+0x35/0x40
}}}

The system doesn't show any panic, but in dmesg we can see the usual hung task timeout traces. In this particular case a "python3" process and the kernel thread "bcache_allocator" are both stuck in D (uninterruptible sleep) state, apparently waiting for an event that never happens.
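These messages come from the kernel's hung task detector, which periodically reports tasks that have been stuck in uninterruptible (D) state for too long. On a live guest the same information can also be requested on demand, which can be handy before taking the memory dump. Two well-known knobs, run as root inside the VM (note that the 10-second threshold seen in the trace above is lower than the kernel default):
{{{
# Dump the stack of every blocked (D state) task to the kernel log right now
$ echo w > /proc/sysrq-trigger

# Lower the hung task detection window (the default is 120 seconds)
$ sysctl kernel.hung_task_timeout_secs=10
}}}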
Let's take a look at the stack trace of those PIDs:
{{{
crash> bt 4714
PID: 4714   TASK: ffff94dafa634140  CPU: 0   COMMAND: "python3"
 #0 [ffffafd540543b50] __schedule at ffffffffa5886ba8
 #1 [ffffafd540543be0] schedule at ffffffffa5886f9d
 #2 [ffffafd540543bf8] bch_bucket_alloc at ffffffffc06e6105 [bcache]
 #3 [ffffafd540543c60] __bch_bucket_alloc_set at ffffffffc06e657d [bcache]
 #4 [ffffafd540543cb0] bch_bucket_alloc_set at ffffffffc06e661e [bcache]
 #5 [ffffafd540543cf0] __uuid_write at ffffffffc06fb241 [bcache]
 #6 [ffffafd540543d98] bch_uuid_write at ffffffffc06fb646 [bcache]
 #7 [ffffafd540543db8] __cached_dev_store at ffffffffc0701eb8 [bcache]
 #8 [ffffafd540543e10] bch_cached_dev_store at ffffffffc0702156 [bcache]
 #9 [ffffafd540543e40] sysfs_kf_write at ffffffffa516bf3c
#10 [ffffafd540543e50] kernfs_fop_write at ffffffffa516b895
#11 [ffffafd540543e90] __vfs_write at ffffffffa50c7c1b
#12 [ffffafd540543ea0] vfs_write at ffffffffa50caaf1
#13 [ffffafd540543ed8] ksys_write at ffffffffa50cadf7
#14 [ffffafd540543f20] __x64_sys_write at ffffffffa50cae4a
#15 [ffffafd540543f30] do_syscall_64 at ffffffffa4e0430a
#16 [ffffafd540543f50] entry_SYSCALL_64_after_hwframe at ffffffffa5a0008c

crash> bt 7289
PID: 7289   TASK: ffff94daf8c9ab80  CPU: 0   COMMAND: "bcache_allocato"
 #0 [ffffafd542fafd10] __schedule at ffffffffa5886ba8
 #1 [ffffafd542fafda0] schedule at ffffffffa5886f9d
 #2 [ffffafd542fafdb8] bch_bucket_alloc at ffffffffc06e6105 [bcache]
 #3 [ffffafd542fafe20] bch_prio_write at ffffffffc06fbd5c [bcache]
 #4 [ffffafd542fafeb0] bch_allocator_thread at ffffffffc06e5dc3 [bcache]
 #5 [ffffafd542faff08] kthread at ffffffffa4ec04f1
 #6 [ffffafd542faff50] ret_from_fork at ffffffffa5a00215
}}}
This confirms what we saw in dmesg: both tasks are stuck in the same spot.
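Here the PIDs were taken from the dmesg output. When the interesting PIDs are not known in advance, crash can enumerate the blocked tasks itself; a few useful commands (output omitted):
{{{
crash> ps | grep UN      # list all tasks in uninterruptible (D) state
crash> foreach UN bt     # stack trace of every task in D state
crash> bt -a             # stack trace of the active task on each CPU
}}}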
If we inspect the source code of bch_bucket_alloc() we can see the following:
{{{
long bch_bucket_alloc(struct cache *ca, unsigned int reserve, bool wait)
{
	DEFINE_WAIT(w);
	struct bucket *b;
	long r;

	/* No allocation if CACHE_SET_IO_DISABLE bit is set */
	if (unlikely(test_bit(CACHE_SET_IO_DISABLE, &ca->set->flags)))
		return -1;

	/* fastpath */
	if (fifo_pop(&ca->free[RESERVE_NONE], r) ||
	    fifo_pop(&ca->free[reserve], r))
		goto out;

	if (!wait) {
		trace_bcache_alloc_fail(ca, reserve);
		return -1;
	}

	do {
		prepare_to_wait(&ca->set->bucket_wait, &w,
				TASK_UNINTERRUPTIBLE);

		mutex_unlock(&ca->set->bucket_lock);
		schedule();
		mutex_lock(&ca->set->bucket_lock);
	} while (!fifo_pop(&ca->free[RESERVE_NONE], r) &&
		 !fifo_pop(&ca->free[reserve], r));

	finish_wait(&ca->set->bucket_wait, &w);
	...
}}}

The "wait" path is taken when we can't pop any item from ca->free, and the particular wait queue where our tasks are sleeping is called bucket_wait. Now we can search the code for the place where the wake_up event for this particular wait queue is sent:
{{{
$ git grep "wake_up.*bucket_wait"
drivers/md/bcache/alloc.c:		wake_up(&ca->set->bucket_wait);
}}}

A simple search in that file shows that the wake up can only be sent by bch_allocator_thread():
{{{
static int bch_allocator_thread(void *arg)
{
	struct cache *ca = arg;

	mutex_lock(&ca->set->bucket_lock);

	while (1) {
		/*
		 * First, we pull buckets off of the unused and free_inc lists,
		 * possibly issue discards to them, then we add the bucket to
		 * the free list:
		 */
		while (1) {
			long bucket;

			if (!fifo_pop(&ca->free_inc, bucket))
				break;

			if (ca->discard) {
				mutex_unlock(&ca->set->bucket_lock);
				blkdev_issue_discard(ca->bdev,
					bucket_to_sector(ca->set, bucket),
					ca->sb.bucket_size, GFP_KERNEL, 0);
				mutex_lock(&ca->set->bucket_lock);
			}

			allocator_wait(ca, bch_allocator_push(ca, bucket));
			wake_up(&ca->set->btree_cache_wait);
			wake_up(&ca->set->bucket_wait);
			^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
		}
		...
}}}

However, looking back at the stack trace of bcache_allocator, we can see that bch_bucket_alloc() was called by bch_allocator_thread() itself: the allocator thread is waiting for a wake-up event that only it can send, and it cannot send that event while it is asleep. This is a deadlock.

= Conclusion =

In this tutorial we have seen how to generate a memory dump of a running QEMU/KVM instance and how to use the debugging tool "crash" to inspect the system and track down a deadlock condition.

= See also =

https://wiki.qemu.org/Documentation/QMP