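One day I noticed that a Linux host (Dell R720, Red Hat 6.2) was responding extremely slowly, even though it was running almost no applications. What was the cause?
First, a look at the system log: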
shell> vi /var/log/messages
…
Sep 24 06:17:20 ch13 kernel: CPU6: Package power limit notification (total events = 368993)
Sep 24 06:17:20 ch13 kernel: CPU3: Package power limit notification (total events = 370933)
Sep 24 06:17:20 ch13 kernel: CPU9: Package power limit notification (total events = 370893)
Sep 24 06:17:20 ch13 kernel: CPU10: Package power limit notification (total events = 370938)
Sep 24 06:17:20 ch13 kernel: CPU6: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU0: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU1: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU7: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU8: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU2: Package power limit normal
Sep 24 06:17:20 ch13 kernel: CPU3: Package power limit normal
…
Sep 24 06:17:20 ch13 kernel: CPU3: Core power limit normal
Sep 24 06:17:20 ch13 kernel: CPU9: Core power limit normal
Sep 24 06:17:20 ch13 kernel: CPU4: Core power limit normal
Sep 24 06:17:20 ch13 kernel: CPU10: Core power limit normal
Sep 24 06:17:20 ch13 kernel: CPU5: Package power limit notification (total events = 371505)
Sep 24 06:17:20 ch13 kernel: CPU11: Package power limit notification (total events = 371497)
Sep 24 06:17:51 ch13 kernel: [Hardware Error]: Machine check events logged
Sep 24 06:17:51 ch13 mcelog: Processor 7 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 1 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 2 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 8 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 0 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 6 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 3 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 9 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 10 below trip temperature. Throttling disabled
Sep 24 06:17:51 ch13 mcelog: Processor 4 below trip temperature. Throttling disabled
…
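To get a feel for how the notifications are spread across CPUs, the log can be tallied instead of read by eye. The following is a minimal sketch, assuming the standard syslog layout shown above (month, day, time, host, "kernel:", then the CPU label); the function name count_power_limit_events is made up for this illustration.

```shell
# Count "Package power limit notification" lines per CPU in a syslog file.
# Assumes the field layout seen above, where the 6th field is e.g. "CPU6:".
# Usage: count_power_limit_events /var/log/messages
count_power_limit_events() {
    awk '/Package power limit notification/ {
             sub(/:$/, "", $6)      # "CPU6:" -> "CPU6"
             n[$6]++
         }
         END { for (c in n) print c, n[c] }' "$1" | sort
}
```

Each output line is a CPU label followed by the number of notification lines logged for it.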
Next, a rough look at the overall state of the system:
shell> vmstat 1 10
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 12171416 251176 2000188 0 0 0 1 1 1 1 0 99 0 0
0 0 0 12171400 251176 2000188 0 0 0 0 15936 236 1 0 99 0 0
0 0 0 12171408 251176 2000188 0 0 0 4 16208 259 0 0 100 0 0
0 0 0 12171408 251176 2000188 0 0 0 12 15119 193 0 0 100 0 0
0 0 0 12171408 251176 2000188 0 0 0 0 16047 237 0 0 100 0 0
0 0 0 12171408 251176 2000188 0 0 0 0 14348 187 0 0 100 0 0
0 0 0 12171408 251176 2000188 0 0 0 0 14977 239 0 0 99 0 0
1 0 0 12171400 251176 2000188 0 0 0 0 16226 216 0 0 100 0 0
0 0 0 12171268 251176 2000188 0 0 0 0 8233 277 1 3 96 0 0
0 0 0 12171392 251176 2000188 0 0 0 0 16465 193 0 0 100 0 0
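The samples above are easier to compare once the "in" (interrupts per second) and "cs" (context switches per second) columns are averaged. This is a small sketch, assuming the 17-column vmstat layout shown above, where "in" is field 11 and "cs" is field 12; the helper name vmstat_avg_in_cs is hypothetical.

```shell
# Average the "in" and "cs" columns of vmstat output read from stdin.
# The first data row is skipped because vmstat reports averages since
# boot there rather than a one-second sample.
# Usage: vmstat 1 10 | vmstat_avg_in_cs
vmstat_avg_in_cs() {
    awk '$1 ~ /^[0-9]+$/ {
             if (++row == 1) next          # skip the since-boot row
             in_sum += $11; cs_sum += $12; n++
         }
         END { if (n) printf "in=%d cs=%d\n", in_sum / n, cs_sum / n }'
}
```

Header lines are ignored because their first field is not a number.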
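The "in" column shows an abnormally high interrupt rate, roughly 15,000 per second on an otherwise idle machine, while context switches ("cs") stay in the low hundreds. Checking the interrupt counters shows that some of them are abnormally large, which confirms that the system is handling a huge volume of interrupts: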
shell> watch -n1 "cat /proc/interrupts"
CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11
TRM: 613877 596324 616331 613775 614013 614117 618081 617479 618912 616294 616631 616726 Thermal event interrupts
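Watching the raw counters shows that they are large, but not how fast they grow. The sketch below sums the per-CPU values on the TRM line and samples that total twice to estimate thermal event interrupts per second. The function names trm_total and trm_rate are made up for this example, and the 5 second default interval is an arbitrary choice.

```shell
# Sum the per-CPU counters on the TRM (thermal event) line.
# $1: file to read, defaults to /proc/interrupts. Prints 0 if no TRM line.
trm_total() {
    awk '$1 == "TRM:" {
             for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
         }
         END { print sum + 0 }' "${1:-/proc/interrupts}"
}

# Print thermal event interrupts per second over an interval in whole
# seconds (default 5). Runs in a subshell so no variables leak out.
trm_rate() (
    interval=${1:-5}
    before=$(trm_total)
    sleep "$interval"
    after=$(trm_total)
    echo $(( (after - before) / interval ))
)
```

Running trm_rate before and after a BIOS change gives a quick way to compare the two states: on a healthy system the value should stay at or near zero.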
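Based on the information above, a reasonable guess is that some hardware-level limit or fault is at work, so that the operating system cannot keep up and spends its time servicing the resulting flood of interrupts.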
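Googling did not turn up a definitive fix. Some users with a similar problem attributed it to the processor performance optimization and power limit settings in the BIOS, and said that changing those settings and rebooting resolves it. I decided to give that a try, but ran into a new problem: the host is colocated in a remote IDC with no remote technical support service. What to do?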
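Dell's full systems management suite, Dell OpenManage Server Administrator (OMSA), provides a remote web interface for changing BIOS settings. We can use it to set the relevant BIOS options and then reboot the host remotely.
For installing OMSA, see "Installing Dell OpenManage Server Administrator (OMSA) on Linux" (Linux下安装Dell OpenManage Server Administrator(OMSA)).
For recommended BIOS settings, see: Configuring Low-Latency Environments on Dell PowerEdge 12th Generation Servers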
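From the settings recommended in that article, we update the System Profile Settings section here, following the recommendations in its last column. The main changes are:
- C1E: Disabled
- C States: Disabled
- Monitor/MWait: Disabled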
With the configuration applied, the system indeed no longer shows warnings or errors of this kind. OK!
Reference:
http://www.sulabs.net/?p=405
The post "Using OMSA to resolve 'Package power limit notification' on Linux" appeared first on SQLParty.