ASUS P9D-M, ASMB7-IKVM and false vCenter alarms

ESXi host with fan alert

I was puzzled when I saw that the ESXi hosts in my new lab were generating critical CPU fan errors. Strange, because I was sure I had mounted the fan correctly - otherwise the system would have shut down automatically to prevent serious damage. 🙂

I had a look at the IPMI web interface and saw that a fan is detected - but why is it called "FRNT_FAN1"?

IPMI fan and temperature sensors

The corresponding vCenter Server view showed the same information:

vCenter fan and temperature sensors

All disconnected fans and temperature sensors are listed as alerts. At first, I thought this behavior was triggering the alarm, so I analyzed the thresholds using ipmitool on a Linux host:

# ipmitool -H myesxi.domain.loc -U admin sdr type Fan
Password:
CPU_FAN1         | A0h | lnr |  0.0 | 0 RPM
FRNT_FAN1        | A2h | lnr |  0.0 | 0 RPM
FRNT_FAN2        | A3h | lnr |  0.0 | 0 RPM
FRNT_FAN3        | A4h | ok  |  0.0 | 800 RPM
FRNT_FAN4        | A5h | lnr |  0.0 | 0 RPM

# ipmitool -H myesxi.domain.loc -U admin sensor get "FRNT_FAN1"
Password:
Locating sensor record...
Sensor ID              : FRNT_FAN1 (0xa2)
 Entity ID             : 0.0 (Unspecified)
 Sensor Type (Threshold)  : Fan (0x04)
 Sensor Reading        : 0 (+/- 0) RPM
 Status                : Lower Non-Recoverable
 Nominal Reading       : 4480.000
 Normal Minimum        : 1040.000
 Normal Maximum        : 17920.000
 Upper non-recoverable : 20000.000
 Upper critical        : 18960.000
 Upper non-critical    : 18000.000
 Lower non-recoverable : 0.000
 Lower critical        : 0.000
 Lower non-critical    : 0.000
 Positive Hysteresis   : 80.000
 Negative Hysteresis   : 80.000
 Minimum sensor range  : Unspecified
 Maximum sensor range  : Unspecified
 Event Message Control : Per-threshold
 Readable Thresholds   : lnr lcr lnc unc ucr unr
 Settable Thresholds   : lnr lcr lnc unc ucr unr
 Threshold Read Mask   : lnr lcr lnc unc ucr unr
 Assertion Events      : lnc- lcr-
 Assertions Enabled    : lnc- lcr-
 Deassertions Enabled  : lnc- lcr-
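
With more than a handful of sensors it can help to filter the sdr listing down to the entries that are not "ok". The following is only a minimal sketch: the function name failed_fans is made up for illustration, and the sample lines are copied from the listing above so the snippet runs without a BMC. Against real hardware you would pipe the ipmitool output into the function instead.

```shell
# Print the names of fan sensors whose status column is not "ok".
# Expects `ipmitool sdr type Fan`-style lines on stdin
# (name | id | status | entity | reading).
failed_fans() {
    awk -F'|' 'NF >= 5 {
        name = $1; status = $3
        gsub(/^[ \t]+|[ \t]+$/, "", name)     # trim padding around the name
        gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim padding around the status
        if (status != "ok") print name
    }'
}

# Sample data taken from the listing in this post (not a live query).
sample='CPU_FAN1         | A0h | lnr |  0.0 | 0 RPM
FRNT_FAN1        | A2h | lnr |  0.0 | 0 RPM
FRNT_FAN2        | A3h | lnr |  0.0 | 0 RPM
FRNT_FAN3        | A4h | ok  |  0.0 | 800 RPM
FRNT_FAN4        | A5h | lnr |  0.0 | 0 RPM'

printf '%s\n' "$sample" | failed_fans
```

For the sample data this lists every fan except FRNT_FAN3, the only sensor that reports a speed.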

Disconnected fans generate an alarm with lnr (lower non-recoverable) severity. ipmitool can read and set thresholds, so I had the idea to set negative lower thresholds for the fan sensors to stop the alert - but negative values are not accepted:

# ipmitool -U admin -H myesxi.domain.loc sensor thres "FRNT_FAN1" -- "-1" "-1" "-1"
Password:
Valid threshold '-1' for sensor 'FRNT_FAN1' not specified!
...
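
For comparison, the ipmitool man page documents the form "sensor thresh <id> lower <lnr> <lcr> <lnc>" for setting all three lower thresholds at once. The sketch below only builds such a command and prints it by default. The function name set_lower_thresholds and the BMC_HOST, BMC_USER and DRY_RUN variables are hypothetical helpers for this example, and whether a given BMC accepts the values is a separate question that I have not tested here.

```shell
# Placeholders: adjust to your environment.
BMC_HOST="${BMC_HOST:-myesxi.domain.loc}"
BMC_USER="${BMC_USER:-admin}"

# Set the lower thresholds (lnr, lcr, lnc) of one sensor.
# Values containing anything other than digits and dots are rejected up front,
# so negative numbers never reach ipmitool.
set_lower_thresholds() {
    sensor="$1"; lnr="$2"; lcr="$3"; lnc="$4"
    for value in "$lnr" "$lcr" "$lnc"; do
        case "$value" in
            ''|*[!0-9.]*)
                echo "rejecting threshold '$value'" >&2
                return 1 ;;
        esac
    done
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run (default): only print the command.
        echo ipmitool -H "$BMC_HOST" -U "$BMC_USER" sensor thresh "$sensor" lower "$lnr" "$lcr" "$lnc"
    else
        ipmitool -H "$BMC_HOST" -U "$BMC_USER" sensor thresh "$sensor" lower "$lnr" "$lcr" "$lnc"
    fi
}

set_lower_thresholds FRNT_FAN1 0 0 0
```

Setting DRY_RUN=0 would actually contact the BMC, so it is worth checking the printed command first.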

It is also not possible to disable unneeded sensors in my setup (ASUS P9D-M with ASMB7-IKVM).

Let's focus on the other anomaly - the wrong fan naming. I had a look at the mainboard and verified that the CPU fan was plugged into the "CPU_FAN1" port. After some trial and error I figured out that a fan plugged into the "FRNT_FAN1" port is detected as "CPU_FAN1" by IPMI:

IPMI fan and temperature sensors

Using that port for the CPU fan also stopped the host hardware alert in ESXi / vCenter.
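
The trial-and-error mapping of mainboard ports to sensor names goes faster if, after each re-plug, you only look at the sensors that actually report a speed. A small sketch of that check follows. The function name spinning_fans is an invented helper, and the input line is taken from the earlier sdr listing in this post rather than from a live query.

```shell
# Print "name rpm" for every fan sensor that reports a speed above 0 RPM.
# Expects `ipmitool sdr type Fan`-style lines on stdin.
spinning_fans() {
    awk -F'|' 'NF >= 5 {
        name = $1
        gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim padding around the name
        rpm = $5 + 0                        # numeric prefix of e.g. " 800 RPM"
        if (rpm > 0) print name, rpm
    }'
}

# Two sample lines from the listing earlier in this post.
printf '%s\n' \
    'FRNT_FAN2        | A3h | lnr |  0.0 | 0 RPM' \
    'FRNT_FAN3        | A4h | ok  |  0.0 | 800 RPM' | spinning_fans
```

Running this once per port shows which sensor name the BMC assigns to the port the fan is currently plugged into.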

In the end, this was sufficient to fix the issue. The disconnected fans and sensors are still listed as errors, but no alarm is generated. It seems that the fan port labels on the mainboard are wrong - but this might also be the result of a firmware update of the ASMB7-IKVM card. I think the issue first occurred right after upgrading the firmware, but I'm not 100% sure.
