一台服务器在多次“Out of memory”消息后崩溃,我正试图找出罪魁祸首.如果它在userland中 – 哪个进程.如果它在内核中 – 哪个内核模块.
细节
我试图找出如何使用crash实用程序来调查在服务器上触发OOM的原因.
作为安装一对新服务器的一部分,我开始初始化14TB DRBD设备.大约在那个时候,在玩DRBD同步器速率配置,并且一些绑定的网络接口上下移动时,其中一个服务器崩溃了.在30秒的时间内,它产生了39个内存不足:杀死进程####消息.然后它崩溃了:
Kernel panic - not syncing: Out of memory and no killable processes...
系统崩溃触发了kdump.现在我有一个很好的vmcore.flat文件应该可以直接用于调查问题,但我很难找到所有内存的去向.
我所知道的唯一资源是Dedoimedo’s site,它有nice instructions和Kernel Crash Book.这些也恰好是答案中建议的唯一资源,所以我认为崩溃是唯一的调查方法.
如果有另一种方法对事件进行验尸,我愿意接受.只是崩溃才是我所知道的唯一实用工具.我现在所拥有的只是vmcore.flat文件,我需要知道的是哪个组件占用了所有内存.我怀疑是内核模块问题,更具体地说是一个绑定模块(因为我带来了一个接口时触发),DRBD模块(版本8.3.15由CentOS 6.3上的树构建),或者其中一个10G以太网模块(mlnx_en由树构建,这是我关闭的接口,或树内bnx2x,它是保持活动的接口).我需要知道的是,是否有办法证实我的怀疑.
检查使用了多少内存
$crash /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux vmcore.flat .... crash> kmem -i PAGES TOTAL PERCENTAGE TOTAL MEM 16482587 62.9 GB ---- FREE 54610 213.3 MB 0% of TOTAL MEM USED 16427977 62.7 GB 99% of TOTAL MEM SHARED 4683 18.3 MB 0% of TOTAL MEM BUFFERS 118 472 KB 0% of TOTAL MEM CACHED 82 328 KB 0% of TOTAL MEM SLAB 46635 182.2 MB 0% of TOTAL MEM TOTAL SWAP 0 0 ---- SWAP USED 0 0 100% of TOTAL SWAP SWAP FREE 0 0 0% of TOTAL SWAP
显然,它耗尽了内存.所有的64G都不见了……但是在哪里?
试图查看是否有任何进程泄漏内存
唯一似乎相关的命令是ps(这是崩溃的ps子命令).它没有显示任何异常,但它也没有显示内核线程.
crash> ps PID PPID cpu TASK ST %MEM VSZ RSS COMM 0 0 0 ffffffff81a8d020 RU 0.0 0 0 [swapper] > 0 0 1 ffff88102c456040 RU 0.0 0 0 [swapper] > 0 0 2 ffff88082c772aa0 RU 0.0 0 0 [swapper] > 0 0 3 ffff88102c456aa0 RU 0.0 0 0 [swapper] 0 0 4 ffff88082c7b8ae0 RU 0.0 0 0 [swapper] > 0 0 5 ffff88102c457500 RU 0.0 0 0 [swapper] > 0 0 6 ffff88082c7d6aa0 RU 0.0 0 0 [swapper] > 0 0 7 ffff88102c506080 RU 0.0 0 0 [swapper] > 0 0 8 ffff88082c016ae0 RU 0.0 0 0 [swapper] > 0 0 9 ffff88102c506ae0 RU 0.0 0 0 [swapper] > 0 0 10 ffff88082c05caa0 RU 0.0 0 0 [swapper] > 0 0 11 ffff88102c507540 RU 0.0 0 0 [swapper] > 0 0 12 ffff88082c09cae0 RU 0.0 0 0 [swapper] ..... 4926 1 5 ffff880828a38ae0 ?? 0.0 0 0 mingetty 4928 1 1 ffff88102a4e8040 ?? 0.0 0 0 mingetty 4930 1 19 ffff880827af4080 ?? 0.0 0 0 mingetty 4932 1 2 ffff88100f122040 ?? 0.0 0 0 mingetty 4934 1 18 ffff8810296ea080 ?? 0.0 0 0 mingetty 4936 1047 4 ffff880ff342d540 IN 0.0 11184 948 udevd 4937 1047 5 ffff88082a240080 IN 0.0 11184 948 udevd 5060 3772 2 ffff88082881d540 ?? 0.0 0 0 sshd 5078 1 1 ffff88100f060ae0 ?? 0.0 0 0 sshd 5079 1 1 ffff88082b882ae0 ?? 0.0 0 0 bash
如果我取出内核线程(无论如何都显示%MEM的零),我们可以看到我最后几乎没有运行:
crash> ps -u PID PPID cpu TASK ST %MEM VSZ RSS COMM 1 0 1 ffff88082c41b500 ?? 0.0 19348 348 init 1047 1 2 ffff881029524040 IN 0.0 11188 948 udevd 3171 1 3 ffff880826ccaaa0 IN 0.0 27636 240 auditd 3172 1 17 ffff881029d1b500 IN 0.0 27636 240 auditd > 3772 1 0 ffff88102b257500 RU 0.0 64072 668 sshd 4800 1 0 ffff88100f061540 ?? 0.0 0 0 dsm_om_shrsvcd 4842 1 16 ffff88100f012ae0 ?? 0.0 0 0 cmcld 4854 1 17 ffff88082a241540 ?? 0.0 0 0 cmlogd 4855 1 3 ffff88082796cae0 ?? 0.0 0 0 cmfileassistd 4856 1 18 ffff88082809d500 ?? 0.0 0 0 cmnetd 4860 1 0 ffff88082705aae0 ?? 0.0 0 0 cmresourced 4924 1 9 ffff88102a4e8aa0 ?? 0.0 0 0 mingetty 4926 1 5 ffff880828a38ae0 ?? 0.0 0 0 mingetty 4928 1 1 ffff88102a4e8040 ?? 0.0 0 0 mingetty 4930 1 19 ffff880827af4080 ?? 0.0 0 0 mingetty 4932 1 2 ffff88100f122040 ?? 0.0 0 0 mingetty 4934 1 18 ffff8810296ea080 ?? 0.0 0 0 mingetty 4936 1047 4 ffff880ff342d540 IN 0.0 11184 948 udevd 4937 1047 5 ffff88082a240080 IN 0.0 11184 948 udevd 5060 3772 2 ffff88082881d540 ?? 0.0 0 0 sshd 5078 1 1 ffff88100f060ae0 ?? 0.0 0 0 sshd 5079 1 1 ffff88082b882ae0 ?? 0.0 0 0 bash 5257 1 1 ffff8808279e6aa0 ?? 0.0 0 0 jnx_mlnxsnmp_da
更新:
包括Soham建议的更多输出.不幸的是,我无法从中得出任何进一步的结论.我能做的最好的事情就是怀疑内核漏掉了内存,因为用户进程几乎全都死了.
The (almost full) output of log -m
is here
crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}' total 0.00391006GB
请注意,此时几乎所有用户空间进程都已停止,因此预计使用率较低.
内存不足消息:
正如我上面提到的,有39个“Out of memory:”消息,这里它们是:
crash> log -m | grep Out <3>[ 223.556616] Out of memory: Kill process 3189 (portreserve) score 1 or sacrifice child <3>[ 223.787234] Out of memory: Kill process 3196 (rsyslogd) score 1 or sacrifice child <3>[ 224.237119] Out of memory: Kill process 3728 (dbus-daemon) score 1 or sacrifice child <3>[ 228.771770] Out of memory: Kill process 3758 (snmpd) score 1 or sacrifice child <3>[ 229.033466] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child <3>[ 229.257710] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child <3>[ 229.484321] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child <3>[ 229.711169] Out of memory: Kill process 3782 (xinetd) score 1 or sacrifice child <3>[ 229.934955] Out of memory: Kill process 3801 (cmproxyd) score 1 or sacrifice child <3>[ 230.159542] Out of memory: Kill process 3812 (ntpd) score 1 or sacrifice child <3>[ 230.382083] Out of memory: Kill process 3953 (master) score 1 or sacrifice child <3>[ 230.606613] Out of memory: Kill process 3953 (master) score 1 or sacrifice child <3>[ 230.829515] Out of memory: Kill process 3953 (master) score 1 or sacrifice child <3>[ 230.832105] Out of memory: Kill process 3961 (crond) score 1 or sacrifice child <3>[ 236.749746] Out of memory: Kill process 3974 (atd) score 1 or sacrifice child <3>[ 236.969421] Out of memory: Kill process 4272 (dsm_sa_datamgrd) score 1 or sacrifice child <3>[ 237.192102] Out of memory: Kill process 4492 (dsm_sa_datamgrd) score 1 or sacrifice child <3>[ 237.746301] Out of memory: Kill process 4552 (dsm_sa_eventmgr) score 1 or sacrifice child <3>[ 237.968308] Out of memory: Kill process 4613 (dsm_sa_snmpd) score 1 or sacrifice child <3>[ 238.190550] Out of memory: Kill process 4614 (dsm_sa_snmpd) score 1 or sacrifice child <3>[ 238.644020] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child <3>[ 238.865658] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child <3>[ 251.285450] Out of memory: Kill process 4643 (dsm_om_connsvcd) score 1 or sacrifice child <3>[ 251.506601] Out of memory: Kill process 4800 (dsm_om_shrsvcd) score 1 or sacrifice child <3>[ 251.727570] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child <3>[ 251.947085] Out of memory: Kill process 4842 (cmcld) score 1 or sacrifice child <3>[ 252.167096] Out of memory: Kill process 4854 (cmlogd) score 1 or sacrifice child <3>[ 252.384090] Out of memory: Kill process 4855 (cmfileassistd) score 1 or sacrifice child <3>[ 252.603324] Out of memory: Kill process 4924 (mingetty) score 1 or sacrifice child <3>[ 252.820757] Out of memory: Kill process 4926 (mingetty) score 1 or sacrifice child <3>[ 253.037558] Out of memory: Kill process 4928 (mingetty) score 1 or sacrifice child <3>[ 253.254908] Out of memory: Kill process 4930 (mingetty) score 1 or sacrifice child <3>[ 253.257391] Out of memory: Kill process 4932 (mingetty) score 1 or sacrifice child <3>[ 253.259357] Out of memory: Kill process 4934 (mingetty) score 1 or sacrifice child <3>[ 253.261353] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child <3>[ 253.263365] Out of memory: Kill process 5060 (sshd) score 1 or sacrifice child <3>[ 253.264392] Out of memory: Kill process 5079 (bash) score 1 or sacrifice child <3>[ 253.266352] Out of memory: Kill process 5257 (jnx_mlnxsnmp_da) score 1 or sacrifice child <0>[ 253.529344] Kernel panic - not syncing: Out of memory and no killable processes...
sys输出:
crash> sys KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux DUMPFILE: pcdata03.vmcore.flat [PARTIAL DUMP] cpuS: 32 DATE: Wed Feb 6 02:11:52 2013 UPTIME: 00:04:12 LOAD AVERAGE: 3.03,0.95,0.34 TASKS: 578 NODENAME: .... RELEASE: 2.6.32-279.5.2.el6.x86_64 VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012 MACHINE: x86_64 (2700 Mhz) MEMORY: 64 GB PANIC: "[ 253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."
kmem -z
crash> kmem -z NODE: 0 ZONE: 0 ADDR: ffff88000000a0c0 NAME: "DMA" SIZE: 4095 PRESENT: 3839 MIN/LOW/HIGH: 5/6/7 VM_STAT: NR_FREE_PAGES: 3936 NR_INACTIVE_ANON: 0 NR_ACTIVE_ANON: 0 NR_INACTIVE_FILE: 0 NR_ACTIVE_FILE: 0 NR_UNEVICTABLE: 0 NR_MLOCK: 0 NR_ANON_PAGES: 0 NR_FILE_MAPPED: 0 NR_FILE_PAGES: 0 NR_FILE_DIRTY: 0 NR_WRITEBACK: 0 NR_SLAB_RECLAIMABLE: 0 NR_SLAB_UNRECLAIMABLE: 0 NR_PAGETABLE: 0 NR_KERNEL_STACK: 0 NR_UNSTABLE_NFS: 0 NR_BOUNCE: 0 NR_VMSCAN_WRITE: 0 NR_VMSCAN_IMMEDIATE: 0 NR_WRITEBACK_TEMP: 0 NR_ISOLATED_ANON: 0 NR_ISOLATED_FILE: 0 NR_SHMEM: 0 NUMA_HIT: 0 NUMA_MISS: 0 NUMA_FOREIGN: 0 NUMA_INTERLEAVE_HIT: 0 NUMA_LOCAL: 0 NUMA_OTHER: 0 NR_ANON_TRANSPARENT_HUGEPAGES: 0 NODE: 0 ZONE: 1 ADDR: ffff880000012780 NAME: "DMA32" SIZE: 1044480 PRESENT: 756520 MIN/LOW/HIGH: 1030/1287/1545 VM_STAT: NR_FREE_PAGES: 30117 NR_INACTIVE_ANON: 0 NR_ACTIVE_ANON: 0 NR_INACTIVE_FILE: 1 NR_ACTIVE_FILE: 0 NR_UNEVICTABLE: 0 NR_MLOCK: 0 NR_ANON_PAGES: 0 NR_FILE_MAPPED: 0 NR_FILE_PAGES: 1 NR_FILE_DIRTY: 0 NR_WRITEBACK: 0 NR_SLAB_RECLAIMABLE: 4 NR_SLAB_UNRECLAIMABLE: 4150 NR_PAGETABLE: 0 NR_KERNEL_STACK: 0 NR_UNSTABLE_NFS: 0 NR_BOUNCE: 0 NR_VMSCAN_WRITE: 0 NR_VMSCAN_IMMEDIATE: 0 NR_WRITEBACK_TEMP: 0 NR_ISOLATED_ANON: 0 NR_ISOLATED_FILE: 0 NR_SHMEM: 0 NUMA_HIT: 575606 NUMA_MISS: 3 NUMA_FOREIGN: 0 NUMA_INTERLEAVE_HIT: 0 NUMA_LOCAL: 575598 NUMA_OTHER: 11 NR_ANON_TRANSPARENT_HUGEPAGES: 0 NODE: 0 ZONE: 2 ADDR: ffff88000001ae40 NAME: "Normal" SIZE: 7602176 PRESENT: 7498240 MIN/LOW/HIGH: 10217/12771/15325 VM_STAT: NR_FREE_PAGES: 10443 NR_INACTIVE_ANON: 134 NR_ACTIVE_ANON: 197 NR_INACTIVE_FILE: -47 NR_ACTIVE_FILE: 42 NR_UNEVICTABLE: 0 NR_MLOCK: 0 NR_ANON_PAGES: 219 NR_FILE_MAPPED: 115 NR_FILE_PAGES: 45 NR_FILE_DIRTY: 0 NR_WRITEBACK: 0 NR_SLAB_RECLAIMABLE: 908 NR_SLAB_UNRECLAIMABLE: 18771 NR_PAGETABLE: 91 NR_KERNEL_STACK: 556 NR_UNSTABLE_NFS: 0 NR_BOUNCE: 0 NR_VMSCAN_WRITE: 0 NR_VMSCAN_IMMEDIATE: 0 NR_WRITEBACK_TEMP: 0 NR_ISOLATED_ANON: 0 NR_ISOLATED_FILE: 0 NR_SHMEM: 34 NUMA_HIT: 8243991 NUMA_MISS: 648 NUMA_FOREIGN: 4593726 NUMA_INTERLEAVE_HIT: 20066 NUMA_LOCAL: 8243829 NUMA_OTHER: 810 NR_ANON_TRANSPARENT_HUGEPAGES: 0 NODE: 0 ZONE: 3 ADDR: ffff880000023500 NAME: "Movable" [unpopulated] NODE: 1 ZONE: 0 ADDR: ffff880840000040 NAME: "DMA" [unpopulated] NODE: 1 ZONE: 1 ADDR: ffff880840008700 NAME: "DMA32" [unpopulated] NODE: 1 ZONE: 2 ADDR: ffff880840010dc0 NAME: "Normal" SIZE: 8388608 PRESENT: 8273920 MIN/LOW/HIGH: 11274/14092/16911 VM_STAT: NR_FREE_PAGES: 10114 NR_INACTIVE_ANON: 417 NR_ACTIVE_ANON: 83 NR_INACTIVE_FILE: 47 NR_ACTIVE_FILE: 32 NR_UNEVICTABLE: 0 NR_MLOCK: 0 NR_ANON_PAGES: 436 NR_FILE_MAPPED: 22 NR_FILE_PAGES: 154 NR_FILE_DIRTY: 0 NR_WRITEBACK: 0 NR_SLAB_RECLAIMABLE: 863 NR_SLAB_UNRECLAIMABLE: 21939 NR_PAGETABLE: 134 NR_KERNEL_STACK: 27 NR_UNSTABLE_NFS: 0 NR_BOUNCE: 0 NR_VMSCAN_WRITE: 3 NR_VMSCAN_IMMEDIATE: 5 NR_WRITEBACK_TEMP: 0 NR_ISOLATED_ANON: 0 NR_ISOLATED_FILE: 23 NR_SHMEM: 20 NUMA_HIT: 4332488 NUMA_MISS: 4593726 NUMA_FOREIGN: 665 NUMA_INTERLEAVE_HIT: 20007 NUMA_LOCAL: 4309300 NUMA_OTHER: 4616914 NR_ANON_TRANSPARENT_HUGEPAGES: 0 NODE: 1 ZONE: 3 ADDR: ffff880840019480 NAME: "Movable" [unpopulated]
kmem -f
crash> kmem -f NODE 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 0 DMA 4095 3936 ffffea0000000038 1000 0 AREA SIZE FREE_AREA_STRUCT BLOCKS PAGES 0 4k ffff880000012128 2 2 0 4k ffff880000012138 0 0 0 4k ffff880000012148 0 0 0 4k ffff880000012158 0 0 0 4k ffff880000012168 0 0 1 8k ffff880000012180 1 2 1 8k ffff880000012190 0 0 1 8k ffff8800000121a0 0 0 1 8k ffff8800000121b0 0 0 1 8k ffff8800000121c0 0 0 2 16k ffff8800000121d8 1 4 2 16k ffff8800000121e8 0 0 2 16k ffff8800000121f8 0 0 2 16k ffff880000012208 0 0 2 16k ffff880000012218 0 0 3 32k ffff880000012230 1 8 3 32k ffff880000012240 0 0 3 32k ffff880000012250 0 0 3 32k ffff880000012260 0 0 3 32k ffff880000012270 0 0 4 64k ffff880000012288 1 16 4 64k ffff880000012298 0 0 4 64k ffff8800000122a8 0 0 4 64k ffff8800000122b8 0 0 4 64k ffff8800000122c8 0 0 5 128k ffff8800000122e0 0 0 5 128k ffff8800000122f0 0 0 5 128k ffff880000012300 0 0 5 128k ffff880000012310 0 0 5 128k ffff880000012320 0 0 6 256k ffff880000012338 1 64 6 256k ffff880000012348 0 0 6 256k ffff880000012358 0 0 6 256k ffff880000012368 0 0 6 256k ffff880000012378 0 0 7 512k ffff880000012390 0 0 7 512k ffff8800000123a0 0 0 7 512k ffff8800000123b0 0 0 7 512k ffff8800000123c0 0 0 7 512k ffff8800000123d0 0 0 8 1024k ffff8800000123e8 1 256 8 1024k ffff8800000123f8 0 0 8 1024k ffff880000012408 0 0 8 1024k ffff880000012418 0 0 8 1024k ffff880000012428 0 0 9 2048k ffff880000012440 0 0 9 2048k ffff880000012450 0 0 9 2048k ffff880000012460 0 0 9 2048k ffff880000012470 1 512 9 2048k ffff880000012480 0 0 10 4096k ffff880000012498 0 0 10 4096k ffff8800000124a8 0 0 10 4096k ffff8800000124b8 3 3072 10 4096k ffff8800000124c8 0 0 10 4096k ffff8800000124d8 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 1 DMA32 1044480 30117 ffffea0000038000 1000000 4095 AREA SIZE FREE_AREA_STRUCT BLOCKS PAGES 0 4k ffff88000001a7e8 24 24 0 4k ffff88000001a7f8 4 4 0 4k ffff88000001a808 13 13 0 4k ffff88000001a818 0 0 0 4k ffff88000001a828 0 0 1 8k ffff88000001a840 2 4 1 8k ffff88000001a850 2 4 1 8k ffff88000001a860 4 8 1 8k ffff88000001a870 0 0 1 8k ffff88000001a880 0 0 2 16k ffff88000001a898 0 0 2 16k ffff88000001a8a8 3 12 2 16k ffff88000001a8b8 4 16 2 16k ffff88000001a8c8 0 0 2 16k ffff88000001a8d8 0 0 3 32k ffff88000001a8f0 0 0 3 32k ffff88000001a900 3 24 3 32k ffff88000001a910 3 24 3 32k ffff88000001a920 0 0 3 32k ffff88000001a930 0 0 4 64k ffff88000001a948 1 16 4 64k ffff88000001a958 3 48 4 64k ffff88000001a968 6 96 4 64k ffff88000001a978 0 0 4 64k ffff88000001a988 0 0 5 128k ffff88000001a9a0 0 0 5 128k ffff88000001a9b0 3 96 5 128k ffff88000001a9c0 7 224 5 128k ffff88000001a9d0 0 0 5 128k ffff88000001a9e0 0 0 6 256k ffff88000001a9f8 0 0 6 256k ffff88000001aa08 1 64 6 256k ffff88000001aa18 6 384 6 256k ffff88000001aa28 0 0 6 256k ffff88000001aa38 0 0 7 512k ffff88000001aa50 1 128 7 512k ffff88000001aa60 0 0 7 512k ffff88000001aa70 8 1024 7 512k ffff88000001aa80 0 0 7 512k ffff88000001aa90 0 0 8 1024k ffff88000001aaa8 1 256 8 1024k ffff88000001aab8 1 256 8 1024k ffff88000001aac8 5 1280 8 1024k ffff88000001aad8 0 0 8 1024k ffff88000001aae8 0 0 9 2048k ffff88000001ab00 0 0 9 2048k ffff88000001ab10 1 512 9 2048k ffff88000001ab20 3 1536 9 2048k ffff88000001ab30 1 512 9 2048k ffff88000001ab40 0 0 10 4096k ffff88000001ab58 0 0 10 4096k ffff88000001ab68 0 0 10 4096k ffff88000001ab78 22 22528 10 4096k ffff88000001ab88 1 1024 10 4096k ffff88000001ab98 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 2 Normal 7602176 10443 ffffea0003800000 100000000 1048575 AREA SIZE FREE_AREA_STRUCT BLOCKS PAGES 0 4k ffff880000022ea8 365 365 0 4k ffff880000022eb8 274 274 0 4k ffff880000022ec8 274 274 0 4k ffff880000022ed8 0 0 0 4k ffff880000022ee8 0 0 1 8k ffff880000022f00 99 198 1 8k ffff880000022f10 94 188 1 8k ffff880000022f20 360 720 1 8k ffff880000022f30 0 0 1 8k ffff880000022f40 0 0 2 16k ffff880000022f58 30 120 2 16k ffff880000022f68 41 164 2 16k ffff880000022f78 204 816 2 16k ffff880000022f88 0 0 2 16k ffff880000022f98 0 0 3 32k ffff880000022fb0 9 72 3 32k ffff880000022fc0 19 152 3 32k ffff880000022fd0 138 1104 3 32k ffff880000022fe0 0 0 3 32k ffff880000022ff0 0 0 4 64k ffff880000023008 7 112 4 64k ffff880000023018 4 64 4 64k ffff880000023028 77 1232 4 64k ffff880000023038 0 0 4 64k ffff880000023048 0 0 5 128k ffff880000023060 3 96 5 128k ffff880000023070 3 96 5 128k ffff880000023080 43 1376 5 128k ffff880000023090 0 0 5 128k ffff8800000230a0 0 0 6 256k ffff8800000230b8 0 0 6 256k ffff8800000230c8 0 0 6 256k ffff8800000230d8 13 832 6 256k ffff8800000230e8 0 0 6 256k ffff8800000230f8 0 0 7 512k ffff880000023110 0 0 7 512k ffff880000023120 0 0 7 512k ffff880000023130 5 640 7 512k ffff880000023140 0 0 7 512k ffff880000023150 0 0 8 1024k ffff880000023168 0 0 8 1024k ffff880000023178 0 0 8 1024k ffff880000023188 0 0 8 1024k ffff880000023198 0 0 8 1024k ffff8800000231a8 0 0 9 2048k ffff8800000231c0 0 0 9 2048k ffff8800000231d0 0 0 9 2048k ffff8800000231e0 1 512 9 2048k ffff8800000231f0 0 0 9 2048k ffff880000023200 0 0 10 4096k ffff880000023218 0 0 10 4096k ffff880000023228 0 0 10 4096k ffff880000023238 0 0 10 4096k ffff880000023248 1 1024 10 4096k ffff880000023258 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 3 Movable 0 0 0 0 0 -------------------------------------------------------------------------- NODE 1 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 0 DMA 0 0 0 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 1 DMA32 0 0 0 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 2 Normal 8388608 10114 ffffea001ce00000 840000000 0 AREA SIZE FREE_AREA_STRUCT BLOCKS PAGES 0 4k ffff880840018e28 405 405 0 4k ffff880840018e38 162 162 0 4k ffff880840018e48 317 317 0 4k ffff880840018e58 0 0 0 4k ffff880840018e68 0 0 1 8k ffff880840018e80 106 212 1 8k ffff880840018e90 70 140 1 8k ffff880840018ea0 269 538 1 8k ffff880840018eb0 0 0 1 8k ffff880840018ec0 0 0 2 16k ffff880840018ed8 24 96 2 16k ffff880840018ee8 18 72 2 16k ffff880840018ef8 207 828 2 16k ffff880840018f08 0 0 2 16k ffff880840018f18 0 0 3 32k ffff880840018f30 20 160 3 32k ffff880840018f40 4 32 3 32k ffff880840018f50 148 1184 3 32k ffff880840018f60 0 0 3 32k ffff880840018f70 0 0 4 64k ffff880840018f88 17 272 4 64k ffff880840018f98 2 32 4 64k ffff880840018fa8 95 1520 4 64k ffff880840018fb8 0 0 4 64k ffff880840018fc8 0 0 5 128k ffff880840018fe0 4 128 5 128k ffff880840018ff0 1 32 5 128k ffff880840019000 37 1184 5 128k ffff880840019010 0 0 5 128k ffff880840019020 0 0 6 256k ffff880840019038 0 0 6 256k ffff880840019048 0 0 6 256k ffff880840019058 8 512 6 256k ffff880840019068 0 0 6 256k ffff880840019078 0 0 7 512k ffff880840019090 0 0 7 512k ffff8808400190a0 0 0 7 512k ffff8808400190b0 1 128 7 512k ffff8808400190c0 0 0 7 512k ffff8808400190d0 0 0 8 1024k ffff8808400190e8 0 0 8 1024k ffff8808400190f8 0 0 8 1024k ffff880840019108 1 256 8 1024k ffff880840019118 0 0 8 1024k ffff880840019128 0 0 9 2048k ffff880840019140 0 0 9 2048k ffff880840019150 0 0 9 2048k ffff880840019160 1 512 9 2048k ffff880840019170 1 512 9 2048k ffff880840019180 0 0 10 4096k ffff880840019198 0 0 10 4096k ffff8808400191a8 0 0 10 4096k ffff8808400191b8 0 0 10 4096k ffff8808400191c8 1 1024 10 4096k ffff8808400191d8 0 0 ZONE NAME SIZE FREE MEM_MAP START_PADDR START_MAPNR 3 Movable 0 0 0 0 0 nr_free_pages: 54610 (found 54742)
与foreach bt的一点运动
crash> foreach bt | awk '$1 == "#0" { $2 = ""; print }' | sort | uniq -c 31 #0 crash_nmi_callback at ffffffff81029df6 1 #0 machine_kexec at ffffffff8103281b 546 #0 schedule at ffffffff814fda62
实际上,它们要么已经崩溃,要么正在等待记忆(或者我没有正确阅读它).
crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20
检查largepages的数量.
crash> p -d nr_huge_pages
更新:
A)从以下内核版本捕获崩溃转储.
$crash --osrelease vmcore.flat 2.6.32-279.5.2.el6.x86_64
B)让我们从kernel-debug-debuginfo包中提取vmlinux文件.
$rpm2cpio kernel-debug-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm | \ cpio -idv ./usr/lib/debug/lib/modules/*/vmlinux
C)使用崩溃实用程序打开vmcore文件.
$bunzip2 vmcore.flat.bz2 $crash vmcore.flat ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
D)系统信息.
crash> sys KERNEL: ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux DUMPFILE: vmcore.flat [PARTIAL DUMP] cpuS: 32 DATE: Tue Feb 5 12:11:52 2013 UPTIME: 00:04:12 LOAD AVERAGE: 3.03,0.34 TASKS: 578 NODENAME: ... RELEASE: 2.6.32-279.5.2.el6.x86_64 VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012 MACHINE: x86_64 (2700 Mhz) MEMORY: 64 GB PANIC: "[ 253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."
a)由于内存不足而发生恐慌,但系统上禁用了“panic_on_oom”参数.
crash> p -d sysctl_panic_on_oom sysctl_panic_on_oom = $6 = 0
此参数启用或禁用内存不足功能的混乱.如果将其设置为0,内核将杀死一些名为oom_killer的流氓进程.通常,oom_killer可以杀死恶意进程并且系统将继续存在.如果将其设置为1,则内存发生时内核会发生混乱.
b)那么,我们是如何在oom事件发生时捕获vmcore的?
好吧,让我们检查mm / oom_kill.c源代码.它说如果系统上没有任何东西可以杀死那么只是挂起或恐慌.
++++++ 499 /* Found nothing?!?! Either we hang forever,or we panic. */ 500 if (!p) { 501 read_unlock(&tasklist_lock); 502 cpuset_unlock(); 503 panic("Out of memory and no killable processes...\n"); <<<------ 504 } 505 ++++++
因此我们达到了恐慌状态,并且在此系统上启用了kdump服务时,捕获了vmcore.
E)让我们检查内核环缓冲区,
crash> log [..] [ 253.351427] Node 0 DMA free:15744kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes [ 253.352234] lowmem_reserve[]: 0 2955 32245 32245 [ 253.352812] Node 0 DMA32 free:120436kB min:4120kB low:5148kB high:6180kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3026080kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:16600kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? no [ 253.353637] lowmem_reserve[]: 0 0 29290 29290 [ 253.354216] Node 0 Normal free:40580kB min:40868kB low:51084kB high:61300kB active_anon:956kB inactive_anon:536kB active_file:260kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:0kB writeback:0kB mapped:460kB shmem:136kB slab_reclaimable:3640kB slab_unreclaimable:75128kB kernel_stack:4448kB pagetables:428kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no [ 253.355047] lowmem_reserve[]: 0 0 0 0 [ 253.355624] Node 1 Normal free:39896kB min:45096kB low:56368kB high:67644kB active_anon:412kB inactive_anon:1668kB active_file:288kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):220kB present:33095680kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:80kB slab_reclaimable:3496kB slab_unreclaimable:87864kB kernel_stack:216kB pagetables:564kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no [ 253.356457] lowmem_reserve[]: 0 0 0 0 [ 253.357034] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15744kB [ 253.358351] Node 0 DMA32: 41*4kB 8*8kB 7*16kB 6*32kB 10*64kB 10*128kB 7*256kB 9*512kB 7*1024kB 5*2048kB 23*4096kB = 120468kB [ 253.359674] Node 0 Normal: 718*4kB 558*8kB 278*16kB 169*32kB 88*64kB 47*128kB 13*256kB 5*512kB 0*1024kB 1*2048kB 1*4096kB = 40872kB [ 253.360995] Node 1 Normal: 876*4kB 447*8kB 249*16kB 174*32kB 116*64kB 40*128kB 8*256kB 1*512kB 1*1024kB 2*2048kB 1*4096kB = 40952kB [ 253.362319] 154 total pagecache pages [ 253.362502] 0 pages in swap cache [ 253.362684] Swap cache stats: add 0,delete 0,find 0/0 [ 253.362869] Free swap = 0kB [ 253.363050] Total swap = 0kB [ 253.526814] 16777215 pages RAM [ 253.526999] 294628 pages reserved [ 253.527190] 114911 pages shared [ 253.527372] 16392561 pages non-shared [..]
F)让我们在崩溃时检查系统上的内存状态.
crash> kmem -i PAGES TOTAL PERCENTAGE TOTAL MEM 16482587 62.9 GB ---- -------------------------------+ FREE 54610 213.3 MB 0% of TOTAL MEM | USED 16427977 62.7 GB 99% of TOTAL MEM | SHARED 4683 18.3 MB 0% of TOTAL MEM | BUFFERS 118 472 KB 0% of TOTAL MEM | CACHED 82 328 KB 0% of TOTAL MEM | SLAB 46635 182.2 MB 0% of TOTAL MEM | | TOTAL SWAP 0 0 ---- ----------------------+ | SWAP USED 0 0 100% of TOTAL SWAP | | SWAP FREE 0 0 0% of TOTAL SWAP | | | | | | crash> p -d totalram_pages | | totalram_pages = $5 = 16482587 | | | | crash> !echo "scale=5;(16482587*4096)/2^30"|bc -q | | 62.87607 <<<-----[ Total physical memory is 62.9 GB ] <<<--|--------+ | crash> p -d total_swap_pages | total_swap_pages = $6 = 0 <<<------[ No Swap on the system ] <<<-----------+
>我们总共有~63GiB的物理内存.
>交换分区或文件未在系统上创建,因此我们没有在此服务器上交换.
>用于缓存的内存非常少,328KB,缓冲区的内存为472KB.
>平板中使用的内存也非常少,只有182.2 MB.
G)分配给进程的总内存为0.00391006GiB.
crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}' total 0.00391006GB
H)应用程序进程没有在系统上使用内存.
crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20 965 2 21 ffff8808292f1500 IN 0.0 0 0 MB [ext4-dio-unwrit] 966 2 22 ffff8808292d4080 IN 0.0 0 0 MB [ext4-dio-unwrit] 967 2 23 ffff8808292ce040 IN 0.0 0 0 MB [ext4-dio-unwrit] 968 2 24 ffff8808299b5540 IN 0.0 0 0 MB [ext4-dio-unwrit] 969 2 25 ffff880829aa6040 IN 0.0 0 0 MB [ext4-dio-unwrit] 970 2 26 ffff880827367500 IN 0.0 0 0 MB [ext4-dio-unwrit] 971 2 27 ffff880827366aa0 IN 0.0 0 0 MB [ext4-dio-unwrit] 972 2 28 ffff880827366040 IN 0.0 0 0 MB [ext4-dio-unwrit] 97 2 23 ffff88082c1ac080 IN 0.0 0 0 MB [ksoftirqd/23] 973 2 29 ffff880827371540 IN 0.0 0 0 MB [ext4-dio-unwrit] 974 2 30 ffff880827370ae0 IN 0.0 0 0 MB [ext4-dio-unwrit] 975 2 31 ffff880827370080 IN 0.0 0 0 MB [ext4-dio-unwrit] 98 2 23 ffff88082c1bb500 IN 0.0 0 0 MB [watchdog/23] 99 2 24 ffff88082c1baaa0 IN 0.0 0 0 MB [migration/24] 3171 1 3 ffff880826ccaaa0 IN 0.0 27636 0.234375 MB auditd 1 0 1 ffff88082c41b500 UN 0.0 19348 0.339844 MB init 3772 1 0 ffff88102b257500 RU 0.0 64072 0.652344 MB sshd 1047 1 2 ffff881029524040 IN 0.0 11188 0.925781 MB udevd 4936 1047 4 ffff880ff342d540 IN 0.0 11184 0.925781 MB udevd 4937 1047 5 ffff88082a240080 IN 0.0 11184 0.925781 MB udevd
I)让我们验证系统上的内存调整参数.
crash> p -d sysctl_overcommit_memory sysctl_overcommit_memory = $7 = 0
此值包含一个启用内存过量使用的标志.当此标志为0时,内核尝试估计当用户空间请求更多内存时剩余的可用内存量.
crash> p -d sysctl_overcommit_ratio sysctl_overcommit_ratio = $8 = 50
当overcommit_memory设置为2时,不允许提交的地址空间超过交换加上物理RAM的这个百分比.
crash> p -d zone_reclaim_mode zone_reclaim_mode = $4 = 0
Zone_reclaim_mode允许某人在区域内存不足时设置或多或少的积极方法来回收内存.如果设置为零,则不会发生区域回收.
crash> p -d min_free_kbytes min_free_kbytes = $3 = 90112 <<<--------[ 88 MB ]
在系统中保持空闲的最小千字节数.该值用于计算每个低存储区的水印值,然后为其分配与其大小成比例的多个保留空闲页.设置此参数时,太低和太高的值都可能会造成损坏.
换句话说,将min_free_kbytes设置得太低会阻止系统回收内存.这可能导致系统挂起和OOM杀死多个进程.但是,将此参数设置为过高的值(占系统内存总量的5-10%)将导致系统立即变为内存不足. Linux旨在使用所有可用的RAM来缓存文件系统数据.设置高min_free_kbytes值会导致系统花费太多时间回收内存.
上面参数的值看起来还可以,那么我的记忆在哪里?
假设:
>主要罪犯不在用户空间.根据我的经验,不负责任的记忆是由Mellanox和DRBD模块引起的,但我不确定你的情况.>由于大多数页面都是从vmcore文件中丢弃的,以减小vmcore文件的大小(core_collector makedumpfile -d 31 -c).我无法检查巨页大小.