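After booting, I noticed that one of the CPUs is always pegged at 100%. It is usually CPU 1 (the second CPU), but roughly one boot in ten it is CPU 3 (the fourth CPU) instead.
The process causing the high load is events/1 (or events/3 when it happens on core 3). I have looked through dmesg and found nothing unusual. Does anyone have a suggestion for how I can find out what is actually causing the CPU usage?
I have also noticed that when a monitor is plugged in during boot, the progress bar on the CentOS loading screen gets about halfway and then the screen goes black (no login screen is shown). Otherwise everything boots and runs normally.
Server information: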
CentOS release 6.9 (Final)
CPU information:
processor       : 1
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 3.00GHz
stepping        : 3
microcode       : 5
cpu MHz         : 3000.000
cache size      : 2048 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts pni dtes64 monitor ds_cpl cid cx16 xtpr
bogomips        : 5985.27
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:
Update 1:
cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        133          0          0          1   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0          2   IO-APIC-edge
  8:          0          0          0          1   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          4   IO-APIC-edge      i8042
 14:          0          0          0        147   IO-APIC-edge      ata_piix
 15:          0          0          0          0   IO-APIC-edge      ata_piix
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0          0        301   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0         49   IO-APIC-fasteoi   ehci_hcd:usb1
 46:          0          0       3804       4767   IO-APIC-fasteoi   megaraid
 64:          0        288          0        104   IO-APIC-fasteoi   eth0
NMI:          0          1          0          0   Non-maskable interrupts
LOC:      24325      76909      25269      31039   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          1          0          0   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RES:       2295        703       1357        886   Rescheduling interrupts
CAL:       3986        421        156        175   Function call interrupts
TLB:        526         95        803       3519   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          1          1          1          1   Machine check polls
ERR:          0
MIS:          0
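A single /proc/interrupts snapshot only shows totals since boot, so one rough way to check whether an interrupt source is behind the load is to take two snapshots a few seconds apart and see which counters grew on the busy CPU. The Python sketch below illustrates the idea. The function names are made up for this example, SNAP_A reuses a few rows from the output above, and SNAP_B is invented purely for illustration; it is not a measurement from this server.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts-style text into {label: (per_cpu_counts, description)}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:
            if not field.isdigit():
                break                       # reached the description early
            counts.append(int(field))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def busiest_sources(before, after, cpu, top=3):
    """Return (growth, label, description) for the fastest-growing rows on one CPU."""
    growth = []
    for label, (counts_after, desc) in after.items():
        if cpu >= len(counts_after):
            continue                        # rows such as ERR/MIS have one column
        counts_before = before.get(label, ([], ""))[0]
        prev = counts_before[cpu] if cpu < len(counts_before) else 0
        growth.append((counts_after[cpu] - prev, label, desc))
    growth.sort(reverse=True)
    return growth[:top]


# A few rows copied from the snapshot above.
SNAP_A = """
           CPU0       CPU1       CPU2       CPU3
 18:          0          0          0        301   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
 46:          0          0       3804       4767   IO-APIC-fasteoi   megaraid
LOC:      24325      76909      25269      31039   Local timer interrupts
ERR:          0
"""

# Hypothetical later snapshot, invented for illustration only.
SNAP_B = """
           CPU0       CPU1       CPU2       CPU3
 18:          0          0          0        301   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
 46:          0          0       3810       4800   IO-APIC-fasteoi   megaraid
LOC:      24400      77000      25300      91039   Local timer interrupts
ERR:          0
"""

if __name__ == "__main__":
    before = parse_interrupts(SNAP_A)
    after = parse_interrupts(SNAP_B)
    for delta, label, desc in busiest_sources(before, after, cpu=3):
        print(f"{label:>4} +{delta:<8} {desc}")
```

On a live system the two strings would come from reading /proc/interrupts twice with a short sleep in between. If no device row grows noticeably on the busy CPU, interrupts are probably not the direct cause and the time is being spent inside the kernel thread itself.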
sar
Linux 2.6.32-696.16.1.el6.x86_64 (HOSTNAME)   12/30/2017   _x86_64_   (4 CPU)

09:57:37 AM       LINUX RESTART

10:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:10:01 AM     all      0.10      0.07     21.09      1.49      0.00     77.25
10:20:01 AM     all      0.15      0.00     21.00      0.00      0.00     78.85
10:30:01 AM     all      0.11      0.00     20.92      0.00      0.00     78.97
10:40:01 AM     all      0.09      0.00     20.81      0.01      0.00     79.09
Average:        all      0.11      0.02     20.96      0.37      0.00     78.54

12:35:32 PM       LINUX RESTART
top
Tasks: 164 total, 2 running, 162 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.2%us, 20.8%sy, 0.0%ni, 78.9%id, 0.0%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 8058904k total, 453272k used, 7605632k free, 22240k buffers
Swap: 8191996k total, 0k used, 8191996k free, 174064k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
   20 root      20   0     0    0    0 R 99.9  0.0   5:50.67 events/1
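Update 2:
Once I regained physical access to the box, I swapped out the PERC controller entirely for one from a parts server. I also reseated the memory card and battery. Because the new hardware caused a RAID configuration mismatch, I restored the configuration from disk. After booting, I got the same 100% CPU usage.
I reset the BIOS/CMOS by pulling the CMOS battery and holding the power button for 10 seconds. I rebooted and set the RAID to read its configuration from the hard drives again. The CPU was still at 100%.
I ran yum update and rebooted. Still 100%. Below is top showing the individual CPUs.
top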
top - 11:59:19 up 21 min, 1 user, load average: 1.00, 0.97, 0.72
Tasks: 164 total, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 100.0%id, 0.0%si, 0.0%st
Cpu1 : 0.3%us, 0.3%sy, 99.3%id, 0.0%st
Cpu2 : 0.3%us, 99.7%id, 0.0%st
Cpu3 : 0.0%us, 100.0%sy, 0.0%id
456996k used, 7601908k free, 22480k buffers
Swap: 8191996k total, 173792k cached
sar
Linux 2.6.32-696.16.1.el6.x86_64 (HOSTNAME)   01/04/2018   _x86_64_   (4 CPU)

10:40:45 AM       LINUX RESTART

10:50:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:00:01 AM     all      0.08      0.00     20.86      0.00      0.00     79.06
11:40:01 AM     all      0.00      0.00      0.00      0.00      0.00      0.00
11:50:01 AM     all      0.08      0.00     20.87      0.02      0.00     79.03
12:00:01 PM     all      0.08      0.00     20.89      0.00      0.00     79.02
Average:        all      0.00      0.00     20.83      0.00      0.00     79.78
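The "all" row in sar averages over all four CPUs, which is why a single saturated core shows up as only about 21% %system. A small Python sketch of that arithmetic, using the Average %system value from the sar output above (the helper name is made up for this example):

```python
def per_core_equivalent(avg_percent, n_cpus):
    """Convert an all-CPU average percentage into 'percent of one core'."""
    return avg_percent * n_cpus


# Values from the sar output above: 4 CPUs, Average %system of 20.83.
busy = per_core_equivalent(20.83, 4)

if __name__ == "__main__":
    print(f"{busy:.2f}% of a single core")
```

That works out to 83.32% of one core, which is in the same range as, though not identical to, the single CPU that top reports as fully busy in system time. Running sar with -P ALL would show the per-CPU breakdown directly instead of the average.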
cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:        133          0          0          6   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0          2   IO-APIC-edge
  8:          0          0          0          1   IO-APIC-edge      rtc0
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          4   IO-APIC-edge      i8042
 14:          0          0          0        147   IO-APIC-edge      ata_piix
 15:          0          0          0          0   IO-APIC-edge      ata_piix
 16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
 18:          0          0        302        302   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
 19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
 23:          0          0          0         53   IO-APIC-fasteoi   ehci_hcd:usb1
 46:          0          0       4074       4912   IO-APIC-fasteoi   megaraid
 64:          0       4917          0        108   IO-APIC-fasteoi   eth0
NMI:          0          0          0         28   Non-maskable interrupts
LOC:     197497     401002     148354    1361329   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:          0          0          0         28   Performance monitoring interrupts
IWI:          0          0          0          0   IRQ work interrupts
RES:       5891       1183       2828       8249   Rescheduling interrupts
CAL:       3641       1441        156        184   Function call interrupts
TLB:        837       3324        833        202   TLB shootdowns
TRM:          0          0          0          0   Thermal event interrupts
THR:          0          0          0          0   Threshold APIC interrupts
MCE:          0          0          0          0   Machine check exceptions
MCP:          6          6          6          6   Machine check polls
ERR:          0
MIS:          0
Update 3:
I added the noapic and nolapic parameters to the kernel command line in GRUB. Here are the results of top and cat /proc/interrupts:
top
top - 14:55:01 up 5 min, load average: 1.76, 1.27, 0.58
Tasks: 111 total, 109 sleeping, 0 zombie
Cpu(s): 0.4%us, 99.6%sy, 0.0%st
Mem: 8059152k total, 442016k used, 7617136k free, 22252k buffers
Swap: 8191996k total, 173556k cached
cat /proc/interrupts
           CPU0
  0:     447518   XT-PIC-XT-PIC    timer
  1:          2   XT-PIC-XT-PIC    i8042
  2:          0   XT-PIC-XT-PIC    cascade
  3:          1   XT-PIC-XT-PIC
  4:          4   XT-PIC-XT-PIC
  5:         50   XT-PIC-XT-PIC    ehci_hcd:usb1
  7:       8825   XT-PIC-XT-PIC    uhci_hcd:usb4, radeon, megaraid
  8:          1   XT-PIC-XT-PIC    rtc0
  9:          0   XT-PIC-XT-PIC    acpi
 10:          0   XT-PIC-XT-PIC    uhci_hcd:usb3
 11:       1586   XT-PIC-XT-PIC    uhci_hcd:usb2, eth0
 12:          4   XT-PIC-XT-PIC    i8042
 14:        148   XT-PIC-XT-PIC    ata_piix
 15:          0   XT-PIC-XT-PIC    ata_piix
NMI:          0   Non-maskable interrupts
LOC:          0   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:          0   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
MCE:          0   Machine check exceptions
MCP:          2   Machine check polls
ERR:          0
MIS:          0
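I also tried booting into another, older kernel version (CentOS 6.7), which produced the same result as before: 100% CPU usage on a random core.
Update 4:
I got distracted by another project and left the server on for a few hours. I checked top before shutting it down and noticed that CPU usage had dropped to normal levels (under 1% per core). I rebooted to see whether the problem would come back, but it did not. I would like to know what caused this, and I am willing to keep trying different things if anyone has suggestions. The only unusual thing I noticed was a message in /var/spool/mail/root: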
Invalid system activity file: /var/log/sa//sa04
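This was generated before I checked top.
Update 5:
I found the source of the problem! When I took a break to work on my other project, I unplugged the monitor and took it with me. When I logged back in (over SSH), CPU usage was normal. Thinking back over what could have changed, the only thing I could come up with was the monitor. To test the theory, I rebooted with the monitor plugged in. Voila! 100% CPU usage. I unplugged the monitor and CPU usage dropped immediately.
So now I am wondering: what causes the CPU usage when the monitor is plugged in?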
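To line up the CPU load with what the graphics driver thinks is attached, it can help to log the connector state the kernel reports. The Python sketch below assumes the driver exposes DRM connectors under /sys/class/drm (this generally requires kernel mode setting to be active, and the connector names vary by driver and card). The function name and the drm_root parameter are made up for this example.

```python
import glob
import os


def connector_status(drm_root="/sys/class/drm"):
    """Return {connector_name: status} for each DRM connector with a status file."""
    result = {}
    pattern = os.path.join(drm_root, "card*-*", "status")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(os.path.dirname(path))
        with open(path) as fh:
            result[name] = fh.read().strip()   # e.g. "connected" or "disconnected"
    return result


if __name__ == "__main__":
    statuses = connector_status()
    if not statuses:
        print("no DRM connectors found (KMS may be disabled)")
    for name, status in statuses.items():
        print(f"{name}: {status}")
```

Running this once with the monitor attached and once without, alongside top, would show whether the load follows the reported connector state. It only records the correlation; it does not by itself explain why the events thread is busy.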
Update 6:
lspci
00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 09)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 09)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 09)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 09)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 09)
00:1d.0 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 80332 [Dobson] I/O processor (A-Segment Bridge) (rev 06)
01:00.2 PCI bridge: Intel Corporation 80332 [Dobson] I/O processor (B-Segment Bridge) (rev 06)
02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 4 (rev 06)
05:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
05:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
06:07.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
09:0d.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV100 [Radeon 7000 / Radeon VE]
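Update 7:
Adding noacpi and nomodeset to the boot options makes the CPU usage problem go away. CentOS also boots to the login screen instead of blanking the monitor halfway through the loading screen. What does this indicate?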
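When testing boot options like these, it is worth confirming that they actually reached the running kernel, since an edit to the wrong GRUB entry is easy to miss. A minimal Python sketch that checks a kernel command line for the expected parameters follows. The helper names and the SAMPLE string are hypothetical; on a live system the text would come from /proc/cmdline.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else True
    return params


def missing_params(text, wanted):
    """Return the entries of `wanted` that do not appear on the command line."""
    params = parse_cmdline(text)
    return [name for name in wanted if name not in params]


# Hypothetical command line, for illustration only.
SAMPLE = "ro root=/dev/mapper/vg-root rd_NO_LUKS quiet nomodeset"

if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as fh:
            text = fh.read()
    except OSError:
        text = SAMPLE                      # fall back to the illustrative string
    print("missing:", missing_params(text, ["noacpi", "nomodeset"]))
```

The parser is deliberately simple and does not handle quoted values containing spaces. Checking one parameter at a time this way also makes it easier to tell which of the two options is the one that changes the behavior.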
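You could try:
- Updating the Dell PERC driver and firmware
- Using a different (older or newer) kernel version
- Resetting the server's CMOS/BIOS and/or updating its firmware
- Replacing the affected hardware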