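Disk SMART Health Check: Usage Guide
To monitor a disk's SMART data (overall health, power cycle count, power-on hours, temperature, and so on), install the smartmontools package on each monitored host.
When a disk fails its health check, an alert notification is sent.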
1. Enable SMART in agent/config/application.properties.
2. On Linux, install smartmontools with yum, for example: yum install -y smartmontools
3. On Windows, install smartmontools.

Download: smartmontools-7.2-1.win32-setup.zip

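The Windows installer adds C:\Program Files\smartmontools\bin to the PATH environment variable automatically. If smartctl still cannot be found afterwards, the entry was probably added as a user variable; add it manually under [System variables].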
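To confirm that the PATH change took effect, a small check can be run in the same account the agent uses. This is a minimal sketch, not part of the agent; the function names locate_smartctl and path_contains are hypothetical.

```python
import os
import shutil


def locate_smartctl(tool_name="smartctl"):
    """Return the full path of the tool if it is reachable via PATH, else None."""
    return shutil.which(tool_name)


def path_contains(directory, path_value=None):
    """Check whether `directory` is one of the entries of a PATH-style string.

    When `path_value` is omitted, the PATH of the current process is used.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    return any(
        os.path.normcase(os.path.normpath(entry)) == wanted for entry in entries
    )
```

Calling locate_smartctl() returns None when smartctl is not visible to the current process, and path_contains(r"C:\Program Files\smartmontools\bin") shows whether the install directory is on that process's PATH.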
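4. Check disk health with smartctl. The example below checks the C: drive on Windows; on Linux the command is the same, but you pass the device path (for example /dev/sda). A result of PASSED means the disk is healthy.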
C:\Users\ethan>smartctl -H C:
smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Warning: Limited functionality due to missing admin rights
Read SMART Thresholds failed: Function not implemented

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
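The same check can be scripted. The sketch below runs smartctl -H and decodes its exit status, which the smartctl man page documents as a bitmask. The helper names are hypothetical, and this is not necessarily how the agent invokes smartctl.

```python
import subprocess

# Meaning of each bit of smartctl's exit status, summarised from its man page.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device did not identify itself",
    2: "a SMART or other ATA command failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains errors",
    7: "device self-test log contains errors",
}


def decode_exit_status(status):
    """Return the descriptions of all bits set in a smartctl exit status."""
    return [text for bit, text in sorted(EXIT_BITS.items()) if status & (1 << bit)]


def run_health_check(device, executable="smartctl"):
    """Run `smartctl -H <device>` and return (exit_status, stdout).

    check=False is deliberate: a non-zero status is a bitmask, not
    necessarily a fatal error. FileNotFoundError is raised if the
    executable is not on PATH.
    """
    completed = subprocess.run(
        [executable, "-H", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return completed.returncode, completed.stdout
```

For example, decode_exit_status(4) reports only the bit 2 message, which is the status mentioned in the notes under step 6.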
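5. After installation, restart the agent. The disk's SMART information then appears on the server's host list page when you click the [System] button.
Note: when monitoring a Windows host, run the agent as administrator; otherwise data collection sometimes fails.
A disk's SMART status is one of three values: PASSED (healthy), FAILED, or Disabled. If FAILED is shown, check the disk for problems.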
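As an illustration of how smartctl output could be mapped to these three states, here is a minimal sketch. The mapping rules and the function name classify_smart_status are assumptions for illustration, not the agent's actual logic; an extra "Unknown" value covers output that matches neither pattern.

```python
import re

# Result line printed by `smartctl -H` for ATA disks.
HEALTH_RE = re.compile(
    r"^SMART overall-health self-assessment test result:\s*(\S+)",
    re.MULTILINE,
)
# Support lines printed by `smartctl -i`; there may be more than one.
SUPPORT_RE = re.compile(r"^\s*SMART support is:\s*(\w+)", re.MULTILINE)


def classify_smart_status(output):
    """Map smartctl text output to PASSED, FAILED, Disabled or Unknown."""
    health = HEALTH_RE.search(output)
    if health:
        # Assumption: any result other than PASSED is treated as a failure.
        return "PASSED" if health.group(1) == "PASSED" else "FAILED"
    if "Disabled" in SUPPORT_RE.findall(output):
        return "Disabled"
    return "Unknown"
```

The function takes the combined text of smartctl -H and smartctl -i, so it can be fed the output captured from either command.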
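6. Additional notes
  • smartctl --scan: lists all disks.
  • exec smartctl -H /dev/sda exit status 4: normal on Windows and not an error; ignore it. smartctl's exit status is a bitmask, and the value 4 (bit 2) means that some SMART command to the disk failed, which typically goes with the "Read SMART Thresholds failed" warning shown when smartctl runs without administrator rights.
  • exec error exec: "smartctl": executable file not found in %PATH%: smartmontools is not installed on the host, or smartctl is not on the PATH. This does not affect the agent and can be ignored.
  • SMART support is: Unavailable: SMART is disabled in the BIOS; enable it there. It can also mean the host is a virtual machine, where SMART is not supported.
  • SMART support is: Disabled: SMART is not enabled on the disk. Enable it with: smartctl -s on /dev/sda
  • To check whether a disk supports SMART: smartctl -i /dev/sda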
  • C:\Users\ethan>smartctl -i /dev/sda
    smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
    
    Warning: Limited functionality due to missing admin rights
    === START OF INFORMATION SECTION ===
    Model Family:     Western Digital Black Mobile
    Device Model:     WDC WD5000LPLX-08ZNTT0
    Serial Number:    WD-WX41A479VEH4
    Firmware Version: 04.01A04
    Device is:        In smartctl database [for details use: -P show]
    ATA Version is:   [No Information Found]
    Local Time is:    Wed Aug 24 15:48:37 2022
    SMART support is: Available - device has SMART capability.
                      Enabled status cached by OS, trying SMART RETURN STATUS cmd.
    SMART support is: Enabled
    # Enabled means SMART is turned on
    # Available means the disk supports SMART
  • Enable SMART (this also turns on automatic offline testing and attribute autosave): smartctl --smart=on --offlineauto=on --saveauto=on /dev/sda
  • View the disk's other attributes: smartctl -A /dev/sda
  • C:\Users\ethan>smartctl -A /dev/sda
    smartctl 7.2 2020-12-30 r5155 [x86_64-w64-mingw32-w10-b19044] (sf-7.2-1)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
    
    Warning: Limited functionality due to missing admin rights
    Read SMART Thresholds failed: Function not implemented
    
    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 16
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x002f   200   195   ---    Pre-fail  Always       -       19
      3 Spin_Up_Time            0x0027   152   142   ---    Pre-fail  Always       -       1366
      4 Start_Stop_Count        0x0032   001   001   ---    Old_age   Always       -       133038
      5 Reallocated_Sector_Ct   0x0033   200   200   ---    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x002f   200   200   ---    Pre-fail  Always       -       0
      9 Power_On_Hours          0x0032   088   088   ---    Old_age   Always       -       8828
     10 Spin_Retry_Count        0x0032   100   100   ---    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0032   100   100   ---    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   095   095   ---    Old_age   Always       -       5573
    192 Power-Off_Retract_Count 0x0032   200   200   ---    Old_age   Always       -       20
    193 Load_Cycle_Count        0x0032   150   150   ---    Old_age   Always       -       150173
    194 Temperature_Celsius     0x0022   101   090   ---    Old_age   Always       -       42
    196 Reallocated_Event_Count 0x0032   200   200   ---    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0032   200   200   ---    Old_age   Always       -       8
    198 Offline_Uncorrectable   0x0030   100   253   ---    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x0032   200   200   ---    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   100   253   ---    Old_age   Offline      -       0
    240 Head_Flying_Hours       0x0032   089   089   ---    Old_age   Always       -       8624
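
The attribute table above carries the values mentioned at the top of this guide: Power_Cycle_Count, Power_On_Hours and Temperature_Celsius. The following sketch shows one way to pull them out of smartctl -A output. It assumes the ten-column ATA layout shown above, and the function names are hypothetical.

```python
def parse_attributes(output):
    """Parse the `smartctl -A` attribute table into a dict keyed by name."""
    attributes = {}
    in_table = False
    for line in output.splitlines():
        fields = line.split()
        if fields[:2] == ["ID#", "ATTRIBUTE_NAME"]:
            in_table = True  # header row found; data rows follow
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the table
        # A data row has at least 10 columns and starts with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            # RAW_VALUE may contain extra text on some disks, so keep it as a string.
            "raw": " ".join(fields[9:]),
        }
    return attributes


def raw_as_int(attributes, name):
    """Return the leading integer of an attribute's raw value, or None."""
    entry = attributes.get(name)
    if entry is None:
        return None
    first = entry["raw"].split()[0] if entry["raw"] else ""
    return int(first) if first.isdigit() else None
```

With the output above, raw_as_int(attrs, "Power_On_Hours") would give the power-on hours and raw_as_int(attrs, "Temperature_Celsius") the temperature in degrees Celsius, where attrs is the result of parse_attributes.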