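The system is openmediavault 5 running as a virtual machine; the guest kernel is Debian 5.10.70-1~bpo10. The host is PVE 7, installed on an NVMe drive.
A day or two after the VM boots, errors like the one below start to appear. They always say "kswapd0 Not tainted", but the stack is not always the same.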
Jan 25 22:02:16 omv5 kernel: [78356.750086] general protection fault, probably for non-canonical address 0x100000000000000: 0000 [#1] SMP PTI
Jan 25 22:02:16 omv5 kernel: [78356.750165] CPU: 0 PID: 68 Comm: kswapd0 Not tainted 5.10.0-0.bpo.9-amd64 #1 Debian 5.10.70-1~bpo10+1
Jan 25 22:02:16 omv5 kernel: [78356.750224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Jan 25 22:02:16 omv5 kernel: [78356.750316] RIP: 0010:fsverity_free_info.part.3+0x9/0x30
Jan 25 22:02:16 omv5 kernel: [78356.750365] Code: ff ff c3 48 8b 52 40 48 c7 c6 d8 75 90 ba 48 c7 c7 98 5c 05 bb e8 07 ce 15 00 b8 ff ff ff ff c3 90 0f 1f 44 00 00 53 48 89 fb <48> 8b 7f 08 e8 de d2 f4 ff 48 8b 3d 97 8f c8 01 48 89 de 5b e9 ce
Jan 25 22:02:16 omv5 kernel: [78356.750481] RSP: 0018:ffffacd6c01a7b28 EFLAGS: 00010206
Jan 25 22:02:16 omv5 kernel: [78356.750523] RAX: ffff94d90b2f2038 RBX: 0100000000000000 RCX: 0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.750576] RDX: 00000000fffffffe RSI: 0000000000000000 RDI: 0100000000000000
Jan 25 22:02:16 omv5 kernel: [78356.750626] RBP: ffff94d90b2f1e78 R08: 0000000000000000 R09: 0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.750682] R10: ffff94d90b6d50c0 R11: 0000000000000001 R12: ffffffffc0762f80
Jan 25 22:02:16 omv5 kernel: [78356.750730] R13: ffff94d935214000 R14: 0000000000000000 R15: 00000000000002a9
Jan 25 22:02:16 omv5 kernel: [78356.750779] FS: 0000000000000000(0000) GS:ffff94d97dc00000(0000) knlGS:0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.750827] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 25 22:02:16 omv5 kernel: [78356.750867] CR2: 00007f131a86d000 CR3: 00000000086ae000 CR4: 00000000000006f0
Jan 25 22:02:16 omv5 kernel: [78356.750913] Call Trace:
Jan 25 22:02:16 omv5 kernel: [78356.750953] fsverity_cleanup_inode+0x1a/0x30
Jan 25 22:02:16 omv5 kernel: [78356.751074] ext4_evict_inode+0x7a/0x640 [ext4]
Jan 25 22:02:16 omv5 kernel: [78356.751124] evict+0xd2/0x1a0
Jan 25 22:02:16 omv5 kernel: [78356.751161] dispose_list+0x48/0x60
Jan 25 22:02:16 omv5 kernel: [78356.751198] prune_icache_sb+0x52/0x70
Jan 25 22:02:16 omv5 kernel: [78356.751236] super_cache_scan+0x123/0x1a0
Jan 25 22:02:16 omv5 kernel: [78356.751276] do_shrink_slab+0x11f/0x250
Jan 25 22:02:16 omv5 kernel: [78356.751313] shrink_slab+0x20f/0x2c0
Jan 25 22:02:16 omv5 kernel: [78356.751352] shrink_node+0x24b/0x6d0
Jan 25 22:02:16 omv5 kernel: [78356.751382] balance_pgdat+0x2d1/0x550
Jan 25 22:02:16 omv5 kernel: [78356.752424] kswapd+0x201/0x390
Jan 25 22:02:16 omv5 kernel: [78356.753231] ? finish_wait+0x80/0x80
Jan 25 22:02:16 omv5 kernel: [78356.753946] ? balance_pgdat+0x550/0x550
Jan 25 22:02:16 omv5 kernel: [78356.754706] kthread+0x116/0x130
Jan 25 22:02:16 omv5 kernel: [78356.755452] ? __kthread_cancel_work+0x40/0x40
Jan 25 22:02:16 omv5 kernel: [78356.756248] ret_from_fork+0x22/0x30
Jan 25 22:02:16 omv5 kernel: [78356.757030] Modules linked in: xt_nat xt_tcpudp veth xt_conntrack xt_MASQUERADE nf_conntrack_netlink xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink br_netfilter bridge stp llc overlay softdog watchdog cpufreq_conservative cpufreq_userspace cpufreq_ondemand cpufreq_powersave bochs_drm drm_vram_helper drm_ttm_helper ttm drm_kms_helper cec pcspkr evdev serio_raw drm virtio_console joydev sg virtio_balloon qemu_fw_cfg button sunrpc ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod t10_pi crc_t10dif crct10dif_generic sr_mod crct10dif_common cdrom ata_generic hid_generic usbhid hid virtio_net net_failover failover virtio_scsi psmouse ahci libahci ata_piix uhci_hcd libata ehci_hcd usbcore virtio_pci scsi_mod virtio_ring virtio
Jan 25 22:02:16 omv5 kernel: [78356.757137] i2c_piix4 usb_common
Jan 25 22:02:16 omv5 kernel: [78356.765162] ---[ end trace 6b6382944e461dee ]---
Jan 25 22:02:16 omv5 kernel: [78356.766097] RIP: 0010:fsverity_free_info.part.3+0x9/0x30
Jan 25 22:02:16 omv5 kernel: [78356.767038] Code: ff ff c3 48 8b 52 40 48 c7 c6 d8 75 90 ba 48 c7 c7 98 5c 05 bb e8 07 ce 15 00 b8 ff ff ff ff c3 90 0f 1f 44 00 00 53 48 89 fb <48> 8b 7f 08 e8 de d2 f4 ff 48 8b 3d 97 8f c8 01 48 89 de 5b e9 ce
Jan 25 22:02:16 omv5 kernel: [78356.769021] RSP: 0018:ffffacd6c01a7b28 EFLAGS: 00010206
Jan 25 22:02:16 omv5 kernel: [78356.770001] RAX: ffff94d90b2f2038 RBX: 0100000000000000 RCX: 0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.771038] RDX: 00000000fffffffe RSI: 0000000000000000 RDI: 0100000000000000
Jan 25 22:02:16 omv5 kernel: [78356.771959] RBP: ffff94d90b2f1e78 R08: 0000000000000000 R09: 0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.772897] R10: ffff94d90b6d50c0 R11: 0000000000000001 R12: ffffffffc0762f80
Jan 25 22:02:16 omv5 kernel: [78356.773755] R13: ffff94d935214000 R14: 0000000000000000 R15: 00000000000002a9
Jan 25 22:02:16 omv5 kernel: [78356.774678] FS: 0000000000000000(0000) GS:ffff94d97dc00000(0000) knlGS:0000000000000000
Jan 25 22:02:16 omv5 kernel: [78356.775346] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 25 22:02:16 omv5 kernel: [78356.775879] CR2: 00007f131a86d000 CR3: 00000000086ae000 CR4: 00000000000006f0
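I also tried installing omv6 and hit the same problem. My current guess is that the host is at fault, but I can't tell whether it is a software problem or a hardware one (RAM, disk). I've been fighting this for several days and have read a lot online, and I'm out of ideas. I'd appreciate it if anyone could take a look 🙏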
1
liuxu 2022-01-26 21:56:16 +08:00 1
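I took a look. A kernel bug like this should be reported to the Debian bug tracker: https://www.debian.org/Bugs/Reporting
I searched for bugs similar to yours, and the advice was always to upgrade the kernel to 5.11-5.13 or later. I'd suggest trying an even newer one.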
2
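jiayouniu OP @liuxu I installed omv6, whose kernel is 5.15, and a similar problem still occurs. I now suspect the RAM and am going to run a memory test. If that doesn't help either, I'll file a bug.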
3
liuxu 2022-01-27 12:17:45 +08:00 1
@jiayouniu
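Judging from the kernel call trace you posted, kswapd is the kernel's background memory-reclaim thread. When free memory runs low, it reclaims page cache and shrinks kernel caches, and swaps out pages if it has to.

It called shrink_slab. slab is the kernel's object allocator: the kernel does not hand freed objects straight back but keeps them in per-type slab caches, and shrink_slab asks the registered caches to give some of that memory back.

From there, super_cache_scan called prune_icache_sb. As far as I can tell, this prunes the in-memory inode cache (icache) that belongs to one superblock (sb), evicting unused inodes through dispose_list and evict. For ext4 that ends up in ext4_evict_inode.

ext4_evict_inode then called fsverity_cleanup_inode, which frees the fs-verity info attached to the in-memory inode:
https://elixir.bootlin.com/linux/v5.10.70/source/fs/verity/open.c#L346

    void fsverity_cleanup_inode(struct inode *inode)
    {
        fsverity_free_info(inode->i_verity_info);
        inode->i_verity_info = NULL;
    }
    EXPORT_SYMBOL_GPL(fsverity_cleanup_inode);

That function in turn calls fsverity_free_info, which is where the fault was raised:
https://elixir.bootlin.com/linux/v5.10.70/source/fs/verity/open.c#L240

    void fsverity_free_info(struct fsverity_info *vi)
    {
        if (!vi)
            return;
        kfree(vi->tree_params.hashstate);
        kmem_cache_free(fsverity_info_cachep, vi);
    }

At this point the kernel reported "general protection fault, probably for non-canonical address". RDI and RBX in your dump both hold 0x0100000000000000, so vi was not NULL, got past the !vi check, and the fault hit when the function dereferenced it. In other words, inode->i_verity_info had already been corrupted and no longer held a valid memory address.

My guess is that on an SMP (multi-core) system the cleanup path may not have locked this structure, so another CPU core had already cleaned it up, and CPU 0 then tried to free it again through a bad pointer. Note that 0x0100000000000000 is not C's NULL: NULL is 0 here, and the function returns early for it. The value has only bit 56 set, so it is one bit away from NULL, which would also fit the memory problem you suspect.

This is only my non-expert speculation; what matters is whether what you try actually works. Once it's solved, please @ me so I can see what the problem really was and how you fixed it.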