Pythian Blog: Technical Track

Exadata: 7 Useful Commands to Check Port/Sensor Alarms

A few days ago I was investigating the alarm below, which carries a very generic message:
Message=The aggregate sensor /SYS/CABLE_CONN_STAT has a fault.
So I decided to document some useful commands I used to verify all ports and sensors in my Exadata cluster. The actions used:
1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository.
2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL).
3) Display all host nodes with ibhosts.
4) Use ibcheckstate to scan the InfiniBand fabric and validate the logical and physical state of each port.
5) Use ibcheckerrors to scan the InfiniBand fabric and validate the connectivity as described in the topology file.
6) Check sensor health from the switch.
7) Check the overall health of the InfiniBand switch, on the Exadata switch itself.
The commands used:
1) ipmitool sdr
 2) ipmitool sel list | tail -60
 3) ibhosts
 4) ibcheckstate -v
 5) ibcheckerrors
 6) showunhealthy
 7) env_test
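To run the host-side commands in one pass, a small wrapper can help. The sketch below is only an illustration: `run_if_present` is a hypothetical helper name, not part of any Exadata tooling, and it simply skips a command that is not installed on the node where it runs.

```shell
# Hypothetical helper: run a command with a header line if it exists on this
# node, otherwise report that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@"
    else
        echo "skip: $1 not found"
    fi
}
```

On a database node this could be called as `run_if_present ipmitool sdr`, `run_if_present ibhosts`, `run_if_present ibcheckstate -v` and `run_if_present ibcheckerrors`. In this post, showunhealthy and env_test are run on the InfiniBand switch itself, so on a database node they would likely be reported as skipped.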
Below is the complete list, with each command's execution and example output.
1) Use Intelligent Platform Management Interface (IPMI) to read the Sensor Data Record (SDR) repository:
[root@exa01db01 ~]# ipmitool sdr
 ACPI | 0x01 | ok
 HDD0/PRSNT | 0x02 | ok
 HDD0/STATE | 0x04 | ok
 HDD1/PRSNT | 0x02 | ok
 HDD1/STATE | 0x04 | ok
 HDD2/PRSNT | 0x02 | ok
 HDD2/STATE | 0x04 | ok
 HDD3/PRSNT | 0x02 | ok
 HDD3/STATE | 0x04 | ok
 HDD4/PRSNT | 0x01 | ok
 HDD4/STATE | 0x00 | ok
 HDD5/PRSNT | 0x01 | ok
 HDD5/STATE | 0x00 | ok
 HDD6/PRSNT | 0x01 | ok
 HDD6/STATE | 0x00 | ok
 HDD7/PRSNT | 0x01 | ok
 HDD7/STATE | 0x00 | ok
 NVME0/PRSNT | 0x01 | ok
 NVME1/PRSNT | 0x01 | ok
 NVME2/PRSNT | 0x01 | ok
 NVME3/PRSNT | 0x01 | ok
 HOST_ERR | 0x01 | ok
 INTSW | 0x01 | ok
 FM0/F0/TACH | 4800 RPM | ok
 FM0/F1/TACH | 4300 RPM | ok
 FM0/F2/TACH | 8300 RPM | ok
 FM0/F3/TACH | 6900 RPM | ok
 FM0/PRSNT | 0x02 | ok
 FM1/F0/TACH | 8100 RPM | ok
 FM1/F1/TACH | 7300 RPM | ok
 FM1/F2/TACH | 3900 RPM | ok
 FM1/F3/TACH | 3600 RPM | ok
 FM1/PRSNT | 0x02 | ok
 FM2/F0/TACH | 4200 RPM | ok
 FM2/F1/TACH | 3700 RPM | ok
 FM2/F2/TACH | 6900 RPM | ok
 FM2/F3/TACH | 6100 RPM | ok
 FM2/PRSNT | 0x02 | ok
 FM3/F0/TACH | 6900 RPM | ok
 FM3/F1/TACH | 5700 RPM | ok
 FM3/F2/TACH | 4600 RPM | ok
 FM3/F3/TACH | 4000 RPM | ok
 FM3/PRSNT | 0x02 | ok
 P0/D0/PRSNT | 0x02 | ok
 P0/D1/PRSNT | 0x02 | ok
 P0/D2/PRSNT | 0x01 | ok
 P0/D3/PRSNT | 0x02 | ok
 P0/D4/PRSNT | 0x02 | ok
 P0/D5/PRSNT | 0x01 | ok
 P0/D6/PRSNT | 0x01 | ok
 P0/D7/PRSNT | 0x02 | ok
 P0/D8/PRSNT | 0x02 | ok
 P0/D9/PRSNT | 0x01 | ok
 P0/D10/PRSNT | 0x02 | ok
 P0/D11/PRSNT | 0x02 | ok
 P0/PRSNT | 0x01 | ok
 P0/V_DIMM | 1.23 Volts | ok
 P1/D0/PRSNT | 0x02 | ok
 P1/D1/PRSNT | 0x02 | ok
 P1/D2/PRSNT | 0x01 | ok
 P1/D3/PRSNT | 0x02 | ok
 P1/D4/PRSNT | 0x02 | ok
 P1/D5/PRSNT | 0x01 | ok
 P1/D6/PRSNT | 0x01 | ok
 P1/D7/PRSNT | 0x02 | ok
 P1/D8/PRSNT | 0x02 | ok
 P1/D9/PRSNT | 0x01 | ok
 P1/D10/PRSNT | 0x02 | ok
 P1/D11/PRSNT | 0x02 | ok
 P1/PRSNT | 0x01 | ok
 P1/V_DIMM | 1.24 Volts | ok
 R1/PCIE1/PRSNT | 0x01 | ok
 MB/RISER1/PRSNT | 0x02 | ok
 R2/PCIE2/PRSNT | 0x02 | ok
 MB/RISER2/PRSNT | 0x02 | ok
 R3/PCIE3/PRSNT | 0x02 | ok
 R3/PCIE4/PRSNT | 0x02 | ok
 MB/RISER3/PRSNT | 0x02 | ok
 T_CORE_NET01 | 70 degrees C | ok
 T_CORE_NET23 | 71 degrees C | ok
 T_IN_PS | 31 degrees C | ok
 T_IN_SLOT1 | 43 degrees C | ok
 T_IN_SLOT2 | 50 degrees C | ok
 T_IN_SLOT3 | 38 degrees C | ok
 T_OUT_SLOT1 | 49 degrees C | ok
 T_OUT_SLOT2 | 50 degrees C | ok
 T_OUT_SLOT3 | 48 degrees C | ok
 PS0/PRSNT | 0x02 | ok
 PS0/P_IN | 180 Watts | ok
 PS0/P_OUT | 170 Watts | ok
 PS0/STATE | 0x01 | ok
 PS0/T_OUT | 36 degrees C | ok
 PS0/V_12V | 12 Volts | ok
 PS0/V_12V_STBY | 12 Volts | ok
 PS0/V_IN | 206 Volts | ok
 PS1/PRSNT | 0x02 | ok
 PS1/P_IN | 180 Watts | ok
 PS1/P_OUT | 160 Watts | ok
 PS1/STATE | 0x01 | ok
 PS1/T_OUT | 38 degrees C | ok
 PS1/V_12V | 12 Volts | ok
 PS1/V_12V_STBY | 12 Volts | ok
 PS1/V_IN | 204 Volts | ok
 PWRBS | no reading | ns
 T_AMB | 16 degrees C | ok
 /SYS/VPS | 360 Watts | ok
 VPS_CPUS | 180 Watts | ok
 VPS_FANS | 10 Watts | ok
 VPS_MEMORY | 45 Watts | ok
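With this many sensors, the rows worth reading are the ones whose status is not ok (in the output above, only PWRBS, which reports no reading). A minimal sketch of such a filter follows. The function name `sdr_not_ok` is hypothetical, and it assumes the three-column `name | reading | status` layout shown above.

```shell
# Print only SDR rows whose status column (third field) is not "ok".
# Reads `ipmitool sdr`-style text on stdin: name | reading | status
sdr_not_ok() {
    awk -F'|' '
        NF >= 3 {
            status = $3
            gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
            if (status != "ok") print
        }'
}
```

Usage: `ipmitool sdr | sdr_not_ok`.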
  2) Use Intelligent Platform Management Interface (IPMI) to view the ILOM SP System Event Log (SEL):
[root@exa01db01 ~]# ipmitool sel list | tail -60
  20d | 08/14/2019 | 20:51:01 | System ACPI Power State #0x26 | S0/G0: working | Deasserted
  20e | 08/14/2019 | 20:51:01 | System ACPI Power State #0x26 | S5/G2: soft-off | Asserted
  20f | 08/14/2019 | 20:53:57 | System Boot Initiated | System Restart | Asserted
  210 | 08/14/2019 | 20:53:57 | System Firmware Progress | Management controller initialization | Asserted
  211 | 08/14/2019 | 20:53:58 | System Firmware Progress | SMBus initialization | Asserted
  212 | 08/14/2019 | 20:53:59 | System Firmware Progress | Primary CPU initialization | Asserted
  213 | 08/14/2019 | 20:54:00 | System Firmware Progress | Memory initialization | Asserted
  214 | 08/14/2019 | 20:54:00 | System Boot Initiated | Initiated by warm reset | Asserted
  215 | 08/14/2019 | 20:54:00 | System Firmware Progress | Management controller initialization | Asserted
  216 | 08/14/2019 | 20:54:01 | System Firmware Progress | SMBus initialization | Asserted
  217 | 08/14/2019 | 20:54:02 | System Firmware Progress | Primary CPU initialization | Asserted
  218 | 08/14/2019 | 20:54:02 | System Firmware Progress | Memory initialization | Asserted
  219 | 08/14/2019 | 20:54:04 | System ACPI Power State #0x26 | S0/G0: working | Asserted
  21a | 08/14/2019 | 20:54:04 | System ACPI Power State #0x26 | S5/G2: soft-off | Deasserted
  21b | 08/14/2019 | 20:54:43 | System Firmware Progress | Cache initialization | Asserted
  21c | 08/14/2019 | 20:54:46 | System Firmware Progress | Secondary CPU Initialization | Asserted
  21d | 08/14/2019 | 20:55:01 | System Firmware Progress | PCI resource configuration | Asserted
  21e | 08/14/2019 | 20:55:06 | System Firmware Progress | PCI resource configuration | Asserted
  21f | 08/14/2019 | 20:55:15 | System ACPI Power State #0x26 | S0/G0: working | Deasserted
  220 | 08/14/2019 | 20:55:15 | System ACPI Power State #0x26 | S5/G2: soft-off | Asserted
  221 | 08/14/2019 | 20:55:19 | System Boot Initiated | System Restart | Asserted
  222 | 08/14/2019 | 20:55:19 | System Firmware Progress | Management controller initialization | Asserted
  223 | 08/14/2019 | 20:55:20 | System Firmware Progress | SMBus initialization | Asserted
  224 | 08/14/2019 | 20:55:21 | System Firmware Progress | Primary CPU initialization | Asserted
  225 | 08/14/2019 | 20:55:21 | System Firmware Progress | Memory initialization | Asserted
  226 | 08/14/2019 | 20:55:22 | System Boot Initiated | Initiated by warm reset | Asserted
  227 | 08/14/2019 | 20:55:22 | System Firmware Progress | Management controller initialization | Asserted
  228 | 08/14/2019 | 20:55:22 | System Firmware Progress | SMBus initialization | Asserted
  229 | 08/14/2019 | 20:55:24 | System Firmware Progress | Primary CPU initialization | Asserted
  22a | 08/14/2019 | 20:55:24 | System Firmware Progress | Memory initialization | Asserted
  22b | 08/14/2019 | 20:55:30 | System ACPI Power State #0x26 | S0/G0: working | Asserted
  22c | 08/14/2019 | 20:55:30 | System ACPI Power State #0x26 | S5/G2: soft-off | Deasserted
  22d | 08/14/2019 | 20:56:04 | System Firmware Progress | Cache initialization | Asserted
  22e | 08/14/2019 | 20:56:08 | System Firmware Progress | Secondary CPU Initialization | Asserted
  22f | 08/14/2019 | 20:56:23 | System Firmware Progress | PCI resource configuration | Asserted
  230 | 08/14/2019 | 20:56:28 | System Firmware Progress | PCI resource configuration | Asserted
  231 | 08/14/2019 | 20:56:35 | System Firmware Progress | Video initialization | Asserted
  232 | 08/14/2019 | 20:56:35 | System Firmware Progress | Option ROM initialization | Asserted
  233 | 08/14/2019 | 20:56:38 | System Firmware Progress | Keyboard controller initialization | Asserted
  234 | 08/14/2019 | 20:56:41 | System Firmware Progress | Option ROM initialization | Asserted
  235 | 08/14/2019 | 20:57:20 | System Firmware Progress | Hard-disk initialization | Asserted
  236 | 08/14/2019 | 20:57:20 | System Firmware Progress | Option ROM initialization | Asserted
  237 | 08/14/2019 | 20:57:29 | System Firmware Progress | System boot initiated | Asserted
  238 | 08/14/2019 | 20:57:29 | System Firmware Progress | System boot initiated | Asserted
  239 | 08/14/2019 | 20:59:13 | System Firmware Progress | Management controller initialization | Asserted
  23a | 08/14/2019 | 20:59:13 | System Firmware Progress | SMBus initialization | Asserted
  23b | 08/14/2019 | 20:59:14 | System Firmware Progress | Primary CPU initialization | Asserted
  23c | 08/14/2019 | 20:59:14 | System Firmware Progress | Memory initialization | Asserted
  23d | 08/14/2019 | 20:59:55 | System Firmware Progress | Cache initialization | Asserted
  23e | 08/14/2019 | 20:59:58 | System Firmware Progress | Secondary CPU Initialization | Asserted
  23f | 08/14/2019 | 21:00:13 | System Firmware Progress | PCI resource configuration | Asserted
  240 | 08/14/2019 | 21:00:18 | System Firmware Progress | PCI resource configuration | Asserted
  241 | 08/14/2019 | 21:00:25 | System Firmware Progress | Video initialization | Asserted
  242 | 08/14/2019 | 21:00:25 | System Firmware Progress | Option ROM initialization | Asserted
  243 | 08/14/2019 | 21:00:28 | System Firmware Progress | Keyboard controller initialization | Asserted
  244 | 08/14/2019 | 21:00:32 | System Firmware Progress | Option ROM initialization | Asserted
  245 | 08/14/2019 | 21:01:10 | System Firmware Progress | Hard-disk initialization | Asserted
  246 | 08/14/2019 | 21:01:10 | System Firmware Progress | Option ROM initialization | Asserted
  247 | 08/14/2019 | 21:01:18 | System Firmware Progress | System boot initiated | Asserted
  248 | 08/14/2019 | 21:01:18 | System Firmware Progress | System boot initiated | Asserted
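Most of the entries above are routine firmware progress messages logged during a reboot. To focus on the other events, the log can be filtered. This is a sketch only: the function name `sel_without_firmware_progress` is hypothetical, and it matches on the literal sensor description shown in the sample output.

```shell
# Drop routine boot-progress rows from `ipmitool sel list`-style input on stdin,
# leaving power-state changes, boot initiations and any other events.
sel_without_firmware_progress() {
    grep -v 'System Firmware Progress'
}
```

Usage: `ipmitool sel list | sel_without_firmware_progress | tail -60`.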
  3) Display all host nodes with ibhosts:
[root@exa01db01 ~]# ibhosts
 Ca : 0x0021280001cf2dea ports 2 "exa01nas01 PCIe 1"
 Ca : 0x0021280001cf7d6a ports 2 "exa01db04 S 10.10.10.7 HCA-1"
 Ca : 0x0021280001a13cbc ports 2 "exa01db02 S 10.10.10.2 HCA-1"
 Ca : 0x0021280001cf798e ports 2 "exa01db03 S 10.10.10.6 HCA-1"
 Ca : 0x0021280001a0b038 ports 2 "exa01cel07 C 10.10.10.11 HCA-1"
 Ca : 0x0021280001a135d0 ports 2 "exa01db01 S 10.10.10.1 HCA-1"
 Ca : 0x0021280001cedbca ports 2 "exa01cel05 C 10.10.10.9 HCA-1"
 Ca : 0x0021280001cf6006 ports 2 "exa01cel06 C 10.10.10.10 HCA-1"
 Ca : 0x0021280001a151c0 ports 2 "exa01cel03 C 10.10.10.5 HCA-1"
 Ca : 0x0021280001a16a12 ports 2 "exa01cel04 C 10.10.10.8 HCA-1"
 Ca : 0x0021280001a15364 ports 2 "exa01cel01 C 10.10.10.3 HCA-1"
 Ca : 0x0021280001a1590a ports 2 "exa01cel02 C 10.10.10.4 HCA-1"
 Ca : 0x0010e00001333318 ports 2 "exa01nas03 PCIe 6"
 Ca : 0x0010e00001333704 ports 2 "exa01nas03 PCIe 5"
 Ca : 0x0010e00001330f3c ports 2 "exa01nas04 PCIe 6"
 Ca : 0x0010e00001332f30 ports 2 "exa01nas04 PCIe 5"
 Ca : 0x0010e0000128e3e4 ports 2 "exa02db02 S 10.10.10.24 HCA-1"
 Ca : 0x0010e0000128e18c ports 2 "exa02db01 S 10.10.10.28 HCA-1"
 Ca : 0x0010e00001289c6c ports 2 "exa02cel03 C 10.10.10.27 HCA-1"
 Ca : 0x0010e0000128bea4 ports 2 "exa02cel01 C 10.10.10.25 HCA-1"
 Ca : 0x0010e00001289d90 ports 2 "exa02cel02 C 10.10.10.26 HCA-1"
 Ca : 0x0010e00001868f40 ports 2 "exa03db04 S 10.10.10.35,10.10.10.36 HCA-1"
 Ca : 0x0010e00001859cd0 ports 2 "exa03db03 S 10.10.10.33,10.10.10.34 HCA-1"
 Ca : 0x0010e000018640a0 ports 2 "exa03db01 S 10.10.10.29,10.10.10.30 HCA-1"
 Ca : 0x0010e00001887928 ports 2 "exa03cel07 C 10.10.10.49,10.10.10.50 HCA-1"
 Ca : 0x0010e0000185cd00 ports 2 "exa03cel05 C 10.10.10.45,10.10.10.46 HCA-1"
 Ca : 0x0010e00001868e80 ports 2 "exa03cel06 C 10.10.10.47,10.10.10.48 HCA-1"
 Ca : 0x0010e0000185e5a0 ports 2 "exa03cel04 C 10.10.10.43,10.10.10.44 HCA-1"
 Ca : 0x0010e0000185e5f0 ports 2 "exa03cel03 C 10.10.10.41,10.10.10.42 HCA-1"
 Ca : 0x0010e00001638be4 ports 2 "exa03cel01 C 10.10.10.37,10.10.10.38 HCA-1"
 Ca : 0x0010e0000187c658 ports 2 "exa03cel02 C 10.10.10.39,10.10.10.40 HCA-1"
 Ca : 0x0010e0000185c130 ports 2 "exa03db02 S 10.10.10.31,10.10.10.32 HCA-1"
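To compare this list against the nodes expected on the fabric, it can be reduced to unique host names. The sketch below assumes that the quoted node description starts with the host name, as it does in the sample output. `ib_host_names` is a hypothetical name.

```shell
# Print the unique host names from `ibhosts`-style input on stdin.
# Assumes the quoted node description begins with the host name.
ib_host_names() {
    awk -F'"' 'NF >= 2 { split($2, words, " "); print words[1] }' | sort -u
}
```

Usage: `ibhosts | ib_host_names`. Hosts with more than one HCA, such as exa01nas03 above, are listed once.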
  4) Use ibcheckstate to scan the InfiniBand fabric and validate the logical and physical state of each port. It reports any port whose logical state is anything other than Active, and any port whose physical state is anything other than LinkUp:
[root@exa01db01 ~]# ibcheckstate -v
 
 # Checking Switch: nodeguid 0x0021284692dea0a0
 Node check lid 3: OK 
 Port check lid 3 port 1: OK 
 Port check lid 3 port 2: OK 
 Port check lid 3 port 3: OK 
 Port check lid 3 port 4: OK 
 Port check lid 3 port 5: OK 
 Port check lid 3 port 6: OK 
 
 Port check lid 3 port 7: OK 
 Port check lid 3 port 8: OK 
 Port check lid 3 port 9: OK 
 Port check lid 3 port 10: OK 
 Port check lid 3 port 12: OK 
 Port check lid 3 port 13: OK 
 Port check lid 3 port 14: OK 
 Port check lid 3 port 15: OK 
 Port check lid 3 port 16: OK 
 Port check lid 3 port 17: OK 
 Port check lid 3 port 18: OK 
 Port check lid 3 port 29: OK 
 Port check lid 3 port 31: OK 
 Port check lid 3 port 32: OK 
 
 # Checking Switch: nodeguid 0x0021284692d1a0a0
 Node check lid 1: OK 
 Port check lid 1 port 1: OK 
 Port check lid 1 port 2: OK 
 Port check lid 1 port 3: OK 
 Port check lid 1 port 4: OK 
 Port check lid 1 port 5: OK 
 Port check lid 1 port 6: OK 
 Port check lid 1 port 7: OK 
 Port check lid 1 port 8: OK 
 Port check lid 1 port 9: OK 
 Port check lid 1 port 10: OK 
 Port check lid 1 port 12: OK 
 Port check lid 1 port 13: OK 
 Port check lid 1 port 14: OK 
 Port check lid 1 port 15: OK 
 Port check lid 1 port 16: OK 
 Port check lid 1 port 17: OK 
 Port check lid 1 port 18: OK 
 Port check lid 1 port 29: OK 
 Port check lid 1 port 31: OK 
 Port check lid 1 port 32: OK 
 
 # Checking Switch: nodeguid 0x0010e035c814a0a0
 Node check lid 7: OK 
 Port check lid 7 port 1: OK 
 Port check lid 7 port 2: OK 
 Port check lid 7 port 4: OK 
 Port check lid 7 port 7: OK 
 Port check lid 7 port 10: OK 
 Port check lid 7 port 11: OK 
 Port check lid 7 port 13: OK 
 Port check lid 7 port 14: OK 
 Port check lid 7 port 15: OK 
 Port check lid 7 port 16: OK 
 Port check lid 7 port 17: OK 
 Port check lid 7 port 18: OK 
 Port check lid 7 port 27: OK 
 Port check lid 7 port 29: OK 
 Port check lid 7 port 30: OK 
 Port check lid 7 port 31: OK 
 Port check lid 7 port 33: OK 
 Port check lid 7 port 34: OK 
 Port check lid 7 port 35: OK 
 Port check lid 7 port 36: OK 
 
 # Checking Switch: nodeguid 0x0010e035bed4a0a0
 Node check lid 27: OK 
 Port check lid 27 port 1: OK 
 Port check lid 27 port 2: OK 
 Port check lid 27 port 4: OK 
 Port check lid 27 port 7: OK 
 Port check lid 27 port 10: OK 
 Port check lid 27 port 11: OK 
 Port check lid 27 port 13: OK 
 Port check lid 27 port 14: OK 
 Port check lid 27 port 15: OK 
 Port check lid 27 port 16: OK 
 Port check lid 27 port 17: OK 
 Port check lid 27 port 18: OK 
 Port check lid 27 port 27: OK 
 Port check lid 27 port 29: OK 
 Port check lid 27 port 30: OK 
 Port check lid 27 port 31: OK 
 Port check lid 27 port 33: OK 
 Port check lid 27 port 34: OK 
 Port check lid 27 port 35: OK 
 Port check lid 27 port 36: OK 
 
 # Checking Switch: nodeguid 0x0010e0801a0aa0a0
 Node check lid 62: OK 
 Port check lid 62 port 1: OK 
 Port check lid 62 port 2: OK 
 Port check lid 62 port 3: OK 
 Port check lid 62 port 4: OK 
 Port check lid 62 port 5: OK 
 Port check lid 62 port 6: OK 
 Port check lid 62 port 7: OK 
 Port check lid 62 port 8: OK 
 Port check lid 62 port 9: OK 
 Port check lid 62 port 10: OK 
 Port check lid 62 port 12: OK 
 Port check lid 62 port 13: OK 
 Port check lid 62 port 14: OK 
 Port check lid 62 port 15: OK 
 Port check lid 62 port 16: OK 
 Port check lid 62 port 17: OK 
 Port check lid 62 port 18: OK 
 Port check lid 62 port 31: OK 
 Port check lid 62 port 32: OK 
 
 # Checking Switch: nodeguid 0x00212846965fa0a0
 Node check lid 80: OK 
 Port check lid 80 port 13: OK 
 Port check lid 80 port 14: OK 
 Port check lid 80 port 15: OK 
 Port check lid 80 port 16: OK 
 Port check lid 80 port 19: OK 
 Port check lid 80 port 20: OK 
 Port check lid 80 port 21: OK 
 Port check lid 80 port 22: OK 
 Port check lid 80 port 23: OK 
 Port check lid 80 port 24: OK 
 Port check lid 80 port 25: OK 
 Port check lid 80 port 26: OK 
 Port check lid 80 port 27: OK 
 Port check lid 80 port 28: OK 
 Port check lid 80 port 29: OK 
 Port check lid 80 port 30: OK 
 Port check lid 80 port 31: OK 
 Port check lid 80 port 32: OK 
 Port check lid 80 port 33: OK 
 Port check lid 80 port 34: OK 
 
 # Checking Switch: nodeguid 0x0010e0801ff2a0a0
 Node check lid 54: OK 
 Port check lid 54 port 7: OK 
 Port check lid 54 port 8: OK 
 Port check lid 54 port 9: OK 
 Port check lid 54 port 10: OK 
 Port check lid 54 port 11: OK 
 Port check lid 54 port 12: OK 
 Port check lid 54 port 17: OK 
 Port check lid 54 port 18: OK 
 Port check lid 54 port 19: OK 
 Port check lid 54 port 20: OK 
 Port check lid 54 port 21: OK 
 Port check lid 54 port 22: OK 
 Port check lid 54 port 25: OK 
 Port check lid 54 port 26: OK 
 Port check lid 54 port 27: OK 
 Port check lid 54 port 28: OK 
 Port check lid 54 port 29: OK 
 Port check lid 54 port 30: OK 
 Port check lid 54 port 35: OK 
 Port check lid 54 port 36: OK 
 
 # Checking Switch: nodeguid 0x0010e0801a9da0a0
 Node check lid 53: OK 
 Port check lid 53 port 1: OK 
 Port check lid 53 port 2: OK 
 Port check lid 53 port 3: OK 
 Port check lid 53 port 4: OK 
 Port check lid 53 port 5: OK 
 Port check lid 53 port 6: OK 
 Port check lid 53 port 7: OK 
 Port check lid 53 port 8: OK 
 Port check lid 53 port 9: OK 
 Port check lid 53 port 10: OK 
 Port check lid 53 port 12: OK 
 Port check lid 53 port 13: OK 
 Port check lid 53 port 14: OK 
 Port check lid 53 port 15: OK 
 Port check lid 53 port 16: OK 
 Port check lid 53 port 17: OK 
 Port check lid 53 port 18: OK 
 Port check lid 53 port 31: OK 
 Port check lid 53 port 32: OK 
 
 # Checking Ca: nodeguid 0x0021280001cf2dea
 Node check lid 13: OK 
 Port check lid 13 port 1: OK 
 Port check lid 13 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001cf7d6a
 Node check lid 25: OK 
 Port check lid 25 port 1: OK 
 Port check lid 25 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a13cbc
 Node check lid 5: OK 
 Port check lid 5 port 1: OK 
 Port check lid 5 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001cf798e
 Node check lid 23: OK 
 Port check lid 23 port 1: OK 
 Port check lid 23 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a0b038
 Node check lid 21: OK 
 Port check lid 21 port 1: OK 
 Port check lid 21 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a135d0
 Node check lid 2: OK 
 Port check lid 2 port 1: OK 
 Port check lid 2 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001cedbca
 Node check lid 17: OK 
 Port check lid 17 port 1: OK 
 Port check lid 17 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001cf6006
 Node check lid 19: OK 
 Port check lid 19 port 1: OK 
 Port check lid 19 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a151c0
 Node check lid 11: OK 
 Port check lid 11 port 1: OK 
 Port check lid 11 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a16a12
 Node check lid 15: OK 
 Port check lid 15 port 1: OK 
 Port check lid 15 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a15364
 Node check lid 38: OK 
 Port check lid 38 port 1: OK 
 Port check lid 38 port 2: OK 
 
 # Checking Ca: nodeguid 0x0021280001a1590a
 Node check lid 9: OK 
 Port check lid 9 port 1: OK 
 Port check lid 9 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001333704
 Node check lid 42: OK 
 Port check lid 42 port 1: OK 
 Port check lid 42 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001333318
 Node check lid 41: OK 
 Port check lid 41 port 1: OK 
 Port check lid 41 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001330f3c
 Node check lid 40: OK 
 Port check lid 40 port 1: OK 
 Port check lid 40 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001332f30
 Node check lid 39: OK 
 Port check lid 39 port 1: OK 
 Port check lid 39 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000128e3e4
 Node check lid 36: OK 
 Port check lid 36 port 1: OK 
 Port check lid 36 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000128e18c
 Node check lid 34: OK 
 Port check lid 34 port 1: OK 
 Port check lid 34 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001289c6c
 Node check lid 32: OK 
 Port check lid 32 port 1: OK 
 Port check lid 32 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000128bea4
 Node check lid 28: OK 
 Port check lid 28 port 1: OK 
 Port check lid 28 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001289d90
 Node check lid 30: OK 
 Port check lid 30 port 1: OK 
 Port check lid 30 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001868f40
 Node check lid 50: OK 
 Port check lid 50 port 1: OK 
 Port check lid 50 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001859cd0
 Node check lid 57: OK 
 Port check lid 57 port 1: OK 
 Port check lid 57 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001887928
 Node check lid 48: OK 
 Port check lid 48 port 1: OK 
 Port check lid 48 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e000018640a0
 Node check lid 55: OK 
 Port check lid 55 port 1: OK 
 Port check lid 55 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000185cd00
 Node check lid 47: OK 
 Port check lid 47 port 1: OK 
 Port check lid 47 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001868e80
 Node check lid 52: OK 
 Port check lid 52 port 1: OK 
 Port check lid 52 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000185e5a0
 Node check lid 56: OK 
 Port check lid 56 port 1: OK 
 Port check lid 56 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000185e5f0
 Node check lid 59: OK 
 Port check lid 59 port 1: OK 
 Port check lid 59 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e00001638be4
 Node check lid 58: OK 
 Port check lid 58 port 1: OK 
 Port check lid 58 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000187c658
 Node check lid 51: OK 
 Port check lid 51 port 1: OK 
 Port check lid 51 port 2: OK 
 
 # Checking Ca: nodeguid 0x0010e0000185c130
 Node check lid 49: OK 
 Port check lid 49 port 1: OK 
 Port check lid 49 port 2: OK 
 
 ## Summary: 40 nodes checked, 0 bad nodes found
 ## 222 ports checked, 0 ports with bad state found
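For scripted checks, the summary line is usually enough. The sketch below pulls the bad-port count out of it. It assumes the summary wording shown above, and `bad_state_ports` is a hypothetical name.

```shell
# Extract the number of ports with a bad state from the ibcheckstate
# summary line read on stdin.
bad_state_ports() {
    sed -n 's/.*ports checked, \([0-9][0-9]*\) ports with bad state.*/\1/p'
}
```

Usage: `ibcheckstate -v | bad_state_ports`, which for the run above would print 0.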
  5) Use ibcheckerrors to scan the InfiniBand fabric, validate the connectivity as described in the topology file, and report errors as indicated by the port counters:
[root@exa01db01 ~]# ibcheckerrors
 #warn: counter PortRcvRemotePhysicalErrors = 100 (threshold 100) lid 1 port 255
 Error check on lid 1 (SUN DCS 36P QDR exa01sw-ib2 10.20.10.24) port all: FAILED 
 #warn: counter PortRcvRemotePhysicalErrors = 100 (threshold 100) lid 1 port 15
 Error check on lid 1 (SUN DCS 36P QDR exa01sw-ib2 10.20.10.24) port 15: FAILED 
 #warn: counter LinkErrorRecoveryCounter = 12 (threshold 10) lid 7 port 255
 Error check on lid 7 (SUN DCS 36P QDR exa02sw-ibb0 10.20.10.50) port all: FAILED 
 #warn: counter LinkErrorRecoveryCounter = 12 (threshold 10) lid 7 port 34
 Error check on lid 7 (SUN DCS 36P QDR exa02sw-ibb0 10.20.10.50) port 34: FAILED 
 #warn: counter SymbolErrorCounter = 156 (threshold 10) lid 27 port 255
 Error check on lid 27 (SUN DCS 36P QDR exa02sw-iba0 10.20.10.49) port all: FAILED 
 #warn: counter SymbolErrorCounter = 156 (threshold 10) lid 27 port 35
 Error check on lid 27 (SUN DCS 36P QDR exa02sw-iba0 10.20.10.49) port 35: FAILED 
 #warn: counter SymbolErrorCounter = 13896 (threshold 10) lid 80 port 255
 #warn: counter LinkErrorRecoveryCounter = 12 (threshold 10) lid 80 port 255
 #warn: counter PortRcvErrors = 145 (threshold 10) lid 80 port 255
 Error check on lid 80 (SUN DCS 36P QDR exa01sw-ib1 10.20.10.23) port all: FAILED 
 #warn: counter SymbolErrorCounter = 118 (threshold 10) lid 80 port 19
 #warn: counter PortRcvErrors = 115 (threshold 10) lid 80 port 19
 Error check on lid 80 (SUN DCS 36P QDR exa01sw-ib1 10.20.10.23) port 19: FAILED 
 #warn: counter SymbolErrorCounter = 13777 (threshold 10) lid 80 port 22
 #warn: counter LinkErrorRecoveryCounter = 12 (threshold 10) lid 80 port 22
 #warn: counter PortRcvErrors = 29 (threshold 10) lid 80 port 22
 Error check on lid 80 (SUN DCS 36P QDR exa01sw-ib1 10.20.10.23) port 22: FAILED 
 #warn: counter PortRcvErrors = 40 (threshold 10) lid 53 port 255
 Error check on lid 53 (SUN DCS 36P QDR exa03sw-iba01 10.20.10.76) port all: FAILED 
 #warn: counter PortRcvErrors = 38 (threshold 10) lid 53 port 8
 Error check on lid 53 (SUN DCS 36P QDR exa03sw-iba01 10.20.10.76) port 8: FAILED 
 
 ## Summary: 40 nodes checked, 0 bad nodes found
 ## 222 ports checked, 6 ports have errors beyond threshold
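In the output above, each counter appears twice: once for the specific port and once under port 255, which the matching "port all" line shows is the switch-wide aggregate. A sketch that keeps only the per-port counters, one per line, follows. It assumes the `#warn:` line layout shown above, and `ib_error_counters` is a hypothetical name.

```shell
# Summarise per-port counters from ibcheckerrors-style input on stdin.
# Rows for port 255 (the "port all" aggregate in the sample) are skipped.
ib_error_counters() {
    awk '$1 == "#warn:" && $2 == "counter" && $11 != 255 {
        printf "lid=%s port=%s %s=%s\n", $9, $11, $3, $5
    }'
}
```

Usage: `ibcheckerrors | ib_error_counters`. For the run above, one of the resulting lines would be `lid=80 port=22 SymbolErrorCounter=13777`.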
  6) Check sensor health from the switch (run here from a leaf switch):
[root@exa01sw-ibb01 ~]# showunhealthy
 OK - No unhealthy sensors
  7) Check the overall health of the InfiniBand switch, on the Exadata switch itself:
[root@exa01sw-ibb01 ~]# env_test
 Environment test started:
 Starting Environment Daemon test:
 Environment daemon running
 Environment Daemon test returned OK
 Starting Voltage test:
 Voltage ECB OK
 Measured 3.3V Main = 3.25 V
 Measured 3.3V Standby = 3.35 V
 Measured 12V = 12.03 V
 Measured 5V = 4.99 V
 Measured VBAT = 3.09 V
 Measured 2.5V = 2.50 V
 Measured 1.8V = 1.78 V
 Measured I4 1.2V = 1.21 V
 Voltage test returned OK
 Starting PSU test:
 PSU 0 present OK
 PSU 1 present OK
 PSU test returned OK
 Starting Temperature test:
 Back temperature 29
 Front temperature 31
 SP temperature 48
 Switch temperature 49, maxtemperature 57
 Temperature test returned OK
 Starting FAN test:
 Fan 0 not present
 Fan 1 running at rpm 11445
 Fan 2 running at rpm 11445
 Fan 3 running at rpm 11445
 Fan 4 not present
 FAN test returned OK
 Starting Connector test:
 Connector test returned OK
 Starting Onboard ibdevice test:
 Switch OK
 All Internal ibdevices OK
 Onboard ibdevice test returned OK
 Starting SSD test:
 SSD test returned OK
 Environment test PASSED
 [root@exa01sw-ibb01 ~]#
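When env_test runs regularly, the output can be reduced to a pass/fail exit status. The sketch below relies only on the strings visible in the passing run above ("test returned OK" and "Environment test PASSED"). How a failing run is worded is an assumption here, and `env_test_failures` is a hypothetical name.

```shell
# Read env_test-style output on stdin. Print any "<name> test returned ..."
# line that is not OK, and exit 0 only if the overall PASSED line is present
# and no such line was printed.
env_test_failures() {
    awk '
        / test returned / && !/ test returned OK/ { print; bad = 1 }
        /Environment test PASSED/ { passed = 1 }
        END { if (passed && !bad) exit 0; exit 1 }'
}
```

Usage on the switch: `env_test | env_test_failures && echo "switch environment OK"`.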
Some references:
https://docs.oracle.com/cd/E19464-01/820-6850-11/IPMItool.html
https://docs.oracle.com/cd/E24707_01/html/E24528/z400000c1016683.html
https://docs.oracle.com/cd/E19654-01/820-7752-12/z400014c1567639.html
https://docs.oracle.com/cd/E19654-01/820-7751-12/z400014e1393674.html
Infiniband Switch in Exadata incorrectly reporting WARNING/FAILURE when Performing 'showunhealthy' (Doc ID 1578284.1)
