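Hard Disk SMART Health Check Guide
To monitor a disk's SMART attributes (overall health, power cycle count, power-on hours, temperature, and so on), install the smartmontools package on each monitored host.
If a disk fails its health check, an alert notification is sent.
SMART monitoring currently supports physical machines only, including machines with RAID; virtual machines are not supported yet.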
1. Enable SMART in agent/config/application.properties
2. Install smartmontools on Linux
(1) With network access: install online with yum, for example: yum install -y smartmontools
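(2) Without network access: install offline
If the host cannot reach the network, download the package on another machine, copy it to the host, and install it locally: smartmontools-7.2.tar.gz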
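To install, extract the archive and build it from source in the usual way (./configure, make, then make install as root); a C++ compiler must be available on the host.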
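3. Install smartmontools on Windows
Download: smartmontools-7.2-1.win32-setup.zip
The Windows installer adds C:\Program Files\smartmontools\bin to the PATH environment variable automatically. If this does not take effect, the entry was probably added as a user variable; add it manually under [System variables] instead.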
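4. Check disk health with smartmontools. The example checks drive C: on Windows; the command works the same way on Linux with a device such as /dev/sda. A result of PASSED means the disk is healthy.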
C:\Users\ethan>smartctl -H C:
smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Warning: Limited functionality due to missing admin rights
Read SMART Thresholds failed: Function not implemented

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
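The verdict line in the transcript above can also be read by a script. The following is a minimal sketch, not the agent's actual implementation: the function names parse_health and check_health are hypothetical, and it assumes the ATA-style "self-assessment test result" line shown above.

```python
import re
import subprocess
from typing import Optional


def parse_health(output: str) -> Optional[str]:
    """Extract the overall-health verdict (e.g. PASSED) from `smartctl -H` text."""
    match = re.search(r"self-assessment test result:\s*(\w+)", output)
    return match.group(1) if match else None


def check_health(device: str) -> Optional[str]:
    """Run `smartctl -H <device>` and return the verdict, or None if absent.

    Hypothetical helper: requires smartctl on the search path.
    """
    # check=False: smartctl may exit non-zero even when it prints a verdict.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    )
    return parse_health(result.stdout)


# Excerpt of the transcript shown in this guide.
SAMPLE = """\
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
"""

if __name__ == "__main__":
    print(parse_health(SAMPLE))  # prints PASSED
```

On a host with smartmontools installed, check_health("/dev/sda") (or check_health("C:") on Windows) would run the same command as above.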
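5. After installation, restart the agent. The disk's SMART information then appears on the server's host list page when you click the [System] button.
Tip: when monitoring a Windows host, run the agent as administrator; otherwise data collection sometimes fails.
The disk SMART status has three possible values: PASSED (healthy), FAILED (failed), and Disabled (SMART is disabled). If FAILED is shown, check the disk for problems.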
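6. If an agent deployed on Windows cannot obtain the disk SMART information and a smartctl error appears in the agent log
Fix: add the smartmontools directory (C:\Program Files\smartmontools\bin by default) to the [System environment variables].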
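Before editing environment variables, it can help to confirm whether smartctl is actually reachable from the process's search path. The sketch below is illustrative only: find_smartctl and dir_on_path are hypothetical helper names, and the directory is the default install location mentioned in step 3.

```python
import shutil
from typing import Optional

# Default install directory named in this guide (step 3).
DEFAULT_WIN_DIR = r"C:\Program Files\smartmontools\bin"


def find_smartctl(path: Optional[str] = None) -> Optional[str]:
    """Return the full path of smartctl if it is on the search path, else None.

    With path=None the current process's PATH is used, which is what a
    program launched from the same environment would see.
    """
    return shutil.which("smartctl", path=path)


def dir_on_path(directory: str, path_value: str, sep: str = ";") -> bool:
    """True if `directory` is one entry of a PATH-style string.

    `sep` defaults to the Windows separator; comparison ignores case and
    trailing slashes, as Windows does.
    """
    def norm(entry: str) -> str:
        return entry.strip().rstrip("\\/").lower()

    entries = [norm(e) for e in path_value.split(sep) if e.strip()]
    return norm(directory) in entries


if __name__ == "__main__":
    location = find_smartctl()
    print(location if location else "smartctl not found on PATH")
```

Running this from the same account and shell that starts the agent shows whether the PATH change has reached that environment.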
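7. Additional notes
smartctl --scan. Explanation: lists all disks.
exec smartctl -H /dev/sda exit status 4. Explanation: normal output on Windows, not an error; it can be ignored.
exec error exec: "smartctl": executable file not found in %PATH%. Explanation: smartmontools is not installed on the host; it can be ignored and does not affect operation.
SMART support is: Unavailable. Explanation: SMART is disabled in the BIOS; enable it there. This can also appear on a virtual machine, where SMART is not supported.
SMART support is: Disabled. Explanation: SMART is not enabled; enable it with: smartctl -s on /dev/sda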
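The "exit status 4" note above makes more sense once the exit status is read as a bit mask, which is how the smartctl manual page defines it. The decoder below is a small sketch: the descriptions are paraphrased from that manual page, and the names EXIT_BITS, decode_exit_status, and disk_failing are hypothetical.

```python
from typing import List

# Bit meanings paraphrased from the EXIT STATUS section of the smartctl man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed, no IDENTIFY data, or device in low-power mode",
    2: "a SMART or other ATA command failed, or a SMART data checksum error",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "status OK, but some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(status: int) -> List[str]:
    """Return the description of every bit set in a smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if status & (1 << bit)]


def disk_failing(status: int) -> bool:
    """True only when bit 3 (DISK FAILING) is set."""
    return bool(status & (1 << 3))


if __name__ == "__main__":
    for line in decode_exit_status(4):
        print(line)
```

Status 4 sets only bit 2, which matches the "Read SMART Thresholds failed" line in the transcripts: one SMART command failed, but the disk was not reported as failing.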
Check whether a disk supports SMART: smartctl -i /dev/sda
C:\Users\ethan>smartctl -i /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Warning: Limited functionality due to missing admin rights
=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Black Mobile
Device Model:     WDC WD5000LPLX-08ZNTT0
Serial Number:    WD-WX41A479VEH4
Firmware Version: 04.01A04
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   [No Information Found]
Local Time is:    Wed Aug 24 15:48:37 2022
SMART support is: Available - device has SMART capability.
                  Enabled status cached by OS, trying SMART RETURN STATUS cmd.
SMART support is: Enabled
# Enabled means SMART is turned on
# Available means the disk supports SMART
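The two "SMART support is:" lines above can be reduced to a pair of flags. This is a rough sketch under the assumption that the output follows the layout of the transcript; parse_smart_support is a hypothetical name.

```python
from typing import Dict


def parse_smart_support(output: str) -> Dict[str, bool]:
    """Summarize the 'SMART support is:' lines of `smartctl -i` output."""
    prefix = "SMART support is:"
    values = [
        line[len(prefix):].strip()
        for line in output.splitlines()
        if line.startswith(prefix)
    ]
    return {
        # "Unavailable" does not start with "Available", so it maps to False.
        "available": any(v.startswith("Available") for v in values),
        "enabled": any(v.startswith("Enabled") for v in values),
    }


# Excerpt of the transcript shown in this guide.
SAMPLE = """\
SMART support is: Available - device has SMART capability.
                  Enabled status cached by OS, trying SMART RETURN STATUS cmd.
SMART support is: Enabled
"""

if __name__ == "__main__":
    print(parse_smart_support(SAMPLE))
    # prints {'available': True, 'enabled': True}
```

A result with available True and enabled False corresponds to the "Disabled" case, where SMART still has to be switched on for the disk.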
Enable SMART: smartctl --smart=on --offlineauto=on --saveauto=on /dev/sda
View the disk's other attributes: smartctl -A /dev/sda
C:\Users\ethan>smartctl -A /dev/sda
smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Warning: Limited functionality due to missing admin rights
Read SMART Thresholds failed: Function not implemented

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   195   ---    Pre-fail  Always       -       19
  3 Spin_Up_Time            0x0027   152   142   ---    Pre-fail  Always       -       1366
  4 Start_Stop_Count        0x0032   001   001   ---    Old_age   Always       -       133038
  5 Reallocated_Sector_Ct   0x0033   200   200   ---    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002f   200   200   ---    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   ---    Old_age   Always       -       8828
 10 Spin_Retry_Count        0x0032   100   100   ---    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   ---    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   095   095   ---    Old_age   Always       -       5573
192 Power-Off_Retract_Count 0x0032   200   200   ---    Old_age   Always       -       20
193 Load_Cycle_Count        0x0032   150   150   ---    Old_age   Always       -       150173
194 Temperature_Celsius     0x0022   101   090   ---    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   ---    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   ---    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0030   100   253   ---    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   ---    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   ---    Old_age   Offline      -       0
240 Head_Flying_Hours       0x0032   089   089   ---    Old_age   Always       -       8624
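The attribute table above is whitespace-separated with ten columns, so it can be turned into a dictionary with a simple split. The sketch below assumes that layout; parse_attributes, nonzero_watched, and the WATCHED list are hypothetical, and the choice of watched attributes is only an illustration of counters that are commonly reviewed.

```python
from typing import Dict

# Illustrative choice of sector-related counters; adjust as needed.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(output: str) -> Dict[str, dict]:
    """Parse the attribute table of `smartctl -A` output, keyed by name."""
    attrs: Dict[str, dict] = {}
    in_table = False
    for line in output.splitlines():
        if line.startswith("ID#"):
            in_table = True          # header row: the table starts next
            continue
        if not in_table:
            continue
        if not line.strip():
            break                    # a blank line ends the table
        parts = line.split(None, 9)  # RAW_VALUE may itself contain spaces
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        attrs[parts[1]] = {
            "id": int(parts[0]),
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": parts[5],      # kept as text: may be '---'
            "raw": parts[9].strip(),
        }
    return attrs


def nonzero_watched(attrs: Dict[str, dict]) -> Dict[str, int]:
    """Return watched attributes whose raw value is a positive integer."""
    found = {}
    for name in WATCHED:
        raw = attrs.get(name, {}).get("raw", "").split()
        if raw and raw[0].isdigit() and int(raw[0]) > 0:
            found[name] = int(raw[0])
    return found


# Rows copied from the transcript shown in this guide.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   ---    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   088   088   ---    Old_age   Always       -       8828
194 Temperature_Celsius     0x0022   101   090   ---    Old_age   Always       -       42
197 Current_Pending_Sector  0x0032   200   200   ---    Old_age   Always       -       8
"""

if __name__ == "__main__":
    table = parse_attributes(SAMPLE)
    print(table["Power_On_Hours"]["raw"], table["Temperature_Celsius"]["raw"])
    print(nonzero_watched(table))
```

For the sample rows, the power-on hours and temperature come out as 8828 and 42, and Current_Pending_Sector is the only watched counter with a non-zero raw value.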