Pythian Blog: Technical Track

Exadata's InfiniBand Switch: Incorrect NTP settings Leading to Evictions, and Patching Recommendations

Overview

This post describes a recent outage during routine patching of an Exadata (X7-2) to the 19.2.12.0.0.200317 release using "patchmgr." An incorrect NTP (Network Time Protocol) configuration on the InfiniBand (IB) switches placed all switches in the "discover" state after the patching. Even though we were using the "-rolling" flag, "patchmgr" continued to patch all three IB switches present in the cluster, which led to the outage. Please read on for best practices and recommendations to avoid outages while patching InfiniBand switches.

This NTP issue and some of the steps are also listed in a document recently published on My Oracle Support (MOS): Infiniband Switch rebooted as a part of Patching can caused multiple node evictions in the Cluster (Doc ID 2703311.1).

First, let's look at the whole Exadata. Note: The details in this post are from a live production system, with the names, IPs, domains, etc., changed. Below are all the nodes with their respective versions:
[root@exa01db01 pythian]# ./exa-versions.sh
 
  Cluster is a X7-2 Elastic Rack HC 10TB
 
  -- Database Servers
 
  exa01db01 exa01db02 exa01db03
 ------------------------------------------------------------
  19.2.12.0.0.200317 19.2.12.0.0.200317 19.2.12.0.0.200317
 ------------------------------------------------------------
 
 
  -- Cells
 
  exa01celadm01 exa01celadm02 exa01celadm03 exa01celadm04 exa01celadm05 exa01celadm06
 ------------------------------------------------------------------------------------------------------------------------
  19.2.12.0.0.200317 19.2.12.0.0.200317 19.2.12.0.0.200317 19.2.12.0.0.200317 19.2.12.0.0.200317 19.2.12.0.0.200317
 ------------------------------------------------------------------------------------------------------------------------
 
 
  -- Infiniband Switches
 
  exa01sw-iba01 exa01sw-ibb01 exa01sw-ibs01
 ------------------------------------------------------------
  2.2.12-2 2.2.12-2 2.2.12-2
 ------------------------------------------------------------
 

Patching prerequisites

As usual, before patching any Exadata component, it's important to review the health of the entire cluster, so we ran the "exachk" utility, which reported no issues on the IB switches.

[Figure: Exachk IB switch checks]

The "exachk" utility checks the NTP configuration, as highlighted in the figure above, only to ensure that it is not the default. It checks neither the current timestamps nor whether NTP is running properly.

[Figure: IB switch check details]

Another mandatory step before patching any Exadata component is to execute patchmgr's pre-upgrade checks ("precheck"). This precheck runs a series of tests to make sure the device is ready for the upgrade. Below are the details of these checks, which reported no issues on the IB switches:
[root@exa01db01 patch_19.2.12.0.0.200317]# ./patchmgr -ibswitches ~/ib_group -ibswitch_precheck -upgrade
 
 2020-08-15 15:45:31 -0500 1 of 1 :Working: DO: Initiate pre-upgrade validation check on InfiniBand switch(es).
  ----- InfiniBand switch update process started 2020-08-15 15:45:32 -0500 -----
 [NOTE ] Log file at /u01/patches/exadata_patches/IB_PATCHING/patch_19.2.12.0.0.200317/upgradeIBSwitch.log
 
 [INFO ] List of InfiniBand switches for upgrade: ( exa01sw-iba01 exa01sw-ibb01 exa01sw-ibs01 )
 [SUCCESS ] Verifying Network connectivity to exa01sw-iba01
 [SUCCESS ] Verifying Network connectivity to exa01sw-ibb01
 [SUCCESS ] Verifying Network connectivity to exa01sw-ibs01
 [SUCCESS ] Validating verify-topology output
 [INFO ] Master Subnet Manager is set to "exa01sw-ibs01" in all Switches
 
 [INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibs01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-ibs01.
 [INFO ] Starting pre-update validation on exa01sw-ibs01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-ibs01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-ibs01, found 26M
 [SUCCESS ] NTP daemon is running on exa01sw-ibs01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 15:55:04
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibs01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-ibs01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-ibs01
 [INFO ] Finished pre-update validation on exa01sw-ibs01
 [SUCCESS ] Pre-update validation on exa01sw-ibs01
 [SUCCESS ] Prereq check on exa01sw-ibs01
 
 [INFO ] ---------- Starting with InfiniBand Switch exa01sw-iba01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-iba01.
 [INFO ] Starting pre-update validation on exa01sw-iba01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-iba01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-iba01, found 26M
 [SUCCESS ] NTP daemon is running on exa01sw-iba01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 17:34:39
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-iba01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-iba01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-iba01
 [INFO ] Finished pre-update validation on exa01sw-iba01
 [SUCCESS ] Pre-update validation on exa01sw-iba01
 [SUCCESS ] Prereq check on exa01sw-iba01
 
 [INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibb01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-ibb01.
 [INFO ] Starting pre-update validation on exa01sw-ibb01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-ibb01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-ibb01, found 26M
 [SUCCESS ] NTP daemon is running on exa01sw-ibb01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:24:46
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibb01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-ibb01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-ibb01
 [INFO ] Finished pre-update validation on exa01sw-ibb01
 [SUCCESS ] Pre-update validation on exa01sw-ibb01
 [SUCCESS ] Prereq check on exa01sw-ibb01
 [SUCCESS ] Overall status
 
  ----- InfiniBand switch update process ended 2020-08-15 15:47:41 -0500 -----
 2020-08-15 15:47:41 -0500 1 of 1 :SUCCESS: DONE: Initiate pre-upgrade validation check on InfiniBand switch(es).
 2020-08-15 15:47:41 -0500 :SUCCESS: Completed run of command: ./patchmgr -ibswitches /root/ib_group -ibswitch_precheck -upgrade
 2020-08-15 15:47:41 -0500 :INFO : upgrade attempted on nodes in file /root/ib_group: [exa01sw-iba01 exa01sw-ibb01 exa01sw-ibs01]
 2020-08-15 15:47:41 -0500 :INFO : For details, check the following files in /u01/patches/exadata_patches/IB_PATCHING/patch_19.2.12.0.0.200317:
 2020-08-15 15:47:41 -0500 :INFO : - upgradeIBSwitch.log
 2020-08-15 15:47:41 -0500 :INFO : - upgradeIBSwitch.trc
 2020-08-15 15:47:41 -0500 :INFO : - patchmgr.stdout
 2020-08-15 15:47:41 -0500 :INFO : - patchmgr.stderr
 2020-08-15 15:47:41 -0500 :INFO : - patchmgr.log
 2020-08-15 15:47:41 -0500 :INFO : - patchmgr.trc
 2020-08-15 15:47:41 -0500 :INFO : Exit status:0
 2020-08-15 15:47:41 -0500 :INFO : Exiting.
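
The precheck output above prints each switch's own date and time on the "Manually validate the following entries" line, and those readings are easy to overlook. The following is a minimal Python sketch, assuming the log format shown in this post, that extracts the readings and reports the spread between them. The function names and the idea of comparing the readings to each other are illustrative assumptions, not part of "patchmgr".

```python
import re
from datetime import datetime

# Patterns match the patchmgr lines as they appear in this post's output.
SWITCH_RE = re.compile(r"Starting with InfiniBand Switch (\S+)")
CLOCK_RE = re.compile(
    r"Date:\(YYYY-MM-DD\) (\d{4}-\d{2}-\d{2}) Time:\(HH:MM:SS\) (\d{2}:\d{2}:\d{2})"
)


def switch_clocks(log_text):
    """Map each switch name to the first clock reading printed for it."""
    clocks = {}
    current = None
    for line in log_text.splitlines():
        match = SWITCH_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = CLOCK_RE.search(line)
        if match and current and current not in clocks:
            stamp = f"{match.group(1)} {match.group(2)}"
            clocks[current] = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return clocks


def max_skew_seconds(clocks):
    """Difference in seconds between the latest and earliest reading."""
    readings = sorted(clocks.values())
    if not readings:
        return 0.0
    return (readings[-1] - readings[0]).total_seconds()


# Lines copied from the precheck output in this post.
SAMPLE = """\
[INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibs01
[INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 15:55:04
[INFO ] ---------- Starting with InfiniBand Switch exa01sw-iba01
[INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 17:34:39
[INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibb01
[INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:24:46
"""

clocks = switch_clocks(SAMPLE)
skew = max_skew_seconds(clocks)
for name, reading in clocks.items():
    print(name, reading.time())
print("max skew (s):", skew)
```

The readings are not taken at the same instant, so a small spread is expected. In the precheck above, however, the three readings span roughly 100 minutes although the whole precheck ran in about two minutes, which is a strong hint that the switch clocks disagree.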

Patching the switches / full outage

After validating all prerequisites, "patchmgr" successfully applied the patch to the switches in "standby" state:
[root@exa01db01 patch_19.2.12.0.0.200317]# ./patchmgr -ibswitches ~/ib_group -upgrade 
 
  ----- InfiniBand switch update process started 2020-08-15 15:49:39 -0500 -----
 [NOTE ] Log file at /u01/patches/exadata_patches/IB_PATCHING/patch_19.2.12.0.0.200317/upgradeIBSwitch.log
 
 [INFO ] List of InfiniBand switches for upgrade: ( exa01sw-iba01 exa01sw-ibb01 exa01sw-ibs01 )
 [SUCCESS ] Verifying Network connectivity to exa01sw-iba01
 [SUCCESS ] Verifying Network connectivity to exa01sw-ibb01
 [SUCCESS ] Verifying Network connectivity to exa01sw-ibs01
 [SUCCESS ] Validating verify-topology output
 [INFO ] Proceeding with upgrade of InfiniBand switches to version 2.2.14_1
 [INFO ] Master Subnet Manager is set to "exa01sw-ibs01" in all Switches
 
 [INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibs01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-ibs01.
 [INFO ] Starting pre-update validation on exa01sw-ibs01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-ibs01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-ibs01, found 26M
 [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibs01
 [SUCCESS ] NTP daemon is running on exa01sw-ibs01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 15:59:12
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibs01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-ibs01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-ibs01
 [INFO ] Finished pre-update validation on exa01sw-ibs01
 [SUCCESS ] Pre-update validation on exa01sw-ibs01
 [INFO ] Package will be downloaded at firmware update time via scp
 [SUCCESS ] Execute plugin check for Patching on exa01sw-ibs01
 [INFO ] Starting upgrade on exa01sw-ibs01 to 2.2.14_1. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
 [INFO ] Rebooting exa01sw-ibs01 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
 [SUCCESS ] Load firmware 2.2.14_1 onto exa01sw-ibs01
 [SUCCESS ] Verify that /conf/configvalid is set to 1 on exa01sw-ibs01
 [INFO ] Set SMPriority to 8 on exa01sw-ibs01
 [INFO ] Starting post-update validation on exa01sw-ibs01
 [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibs01
 [SUCCESS ] NTP daemon is running on exa01sw-ibs01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:07:02
 [INFO ] /conf/configvalid is 1
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibs01
 [SUCCESS ] Execute plugin check for Post Patch on exa01sw-ibs01
 [INFO ] Finished post-update validation on exa01sw-ibs01
 [SUCCESS ] Post-update validation on exa01sw-ibs01
 [SUCCESS ] Update InfiniBand switch exa01sw-ibs01 to 2.2.14_1
 
 [INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibb01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-ibb01.
 [INFO ] Starting pre-update validation on exa01sw-ibb01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-ibb01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-ibb01, found 26M
 [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibb01
 [SUCCESS ] NTP daemon is running on exa01sw-ibb01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:45:01
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibb01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-ibb01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-ibb01
 [INFO ] Finished pre-update validation on exa01sw-ibb01
 [SUCCESS ] Pre-update validation on exa01sw-ibb01
 [INFO ] Package will be downloaded at firmware update time via scp
 [SUCCESS ] Execute plugin check for Patching on exa01sw-ibb01
 [INFO ] Starting upgrade on exa01sw-ibb01 to 2.2.14_1. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
 [INFO ] Rebooting exa01sw-ibb01 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
 [SUCCESS ] Load firmware 2.2.14_1 onto exa01sw-ibb01
 [SUCCESS ] Verify that /conf/configvalid is set to 1 on exa01sw-ibb01
 [INFO ] Set SMPriority to 5 on exa01sw-ibb01
 [INFO ] Starting post-update validation on exa01sw-ibb01
 [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibb01
 [SUCCESS ] NTP daemon is running on exa01sw-ibb01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:24:14
 [INFO ] /conf/configvalid is 1
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibb01
 [SUCCESS ] Execute plugin check for Post Patch on exa01sw-ibb01
 [INFO ] Finished post-update validation on exa01sw-ibb01
 [SUCCESS ] Post-update validation on exa01sw-ibb01
 [SUCCESS ] Update InfiniBand switch exa01sw-ibb01 to 2.2.14_1
 
After successfully patching the two "standby" switches, "patchmgr" proceeded to patch the "master" switch (note that "patchmgr" doesn't check for the actual statuses of the IB switches after the patching). When it rebooted, all nodes lost connectivity, as there were no surviving switches in the "standby" state to manage the connections. This abruptly interrupted "patchmgr":
[INFO ] ---------- Starting with InfiniBand Switch exa01sw-iba01
 [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
  To downgrade to other versions:
  - Manually download the InfiniBand switch firmware package to the patch directory
  - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
  - Run patchmgr command to initiate downgrade.
 [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-iba01.
 [INFO ] Starting pre-update validation on exa01sw-iba01
 [SUCCESS ] Verifying that /tmp has 150M in exa01sw-iba01, found 492M
 [SUCCESS ] Verifying that / has 20M in exa01sw-iba01, found 26M
 [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-iba01
 [SUCCESS ] NTP daemon is running on exa01sw-iba01.
 [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 18:12:35
 [INFO ] Validating the current firmware on the InfiniBand Switch
 [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-iba01
 [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-iba01 through getHostByName
 [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-iba01
 [INFO ] Finished pre-update validation on exa01sw-iba01
 [SUCCESS ] Pre-update validation on exa01sw-iba01
 [INFO ] Package will be downloaded at firmware update time via scp
 [SUCCESS ] Execute plugin check for Patching on exa01sw-iba01
 [INFO ] Starting upgrade on exa01sw-iba01 to 2.2.14_1. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
 [INFO ] Rebooting exa01sw-iba01 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
 
As seen above, the last operation executed by "patchmgr" was the reboot of exa01sw-iba01. As previously explained, since there were no switches in the "standby" state, we lost connectivity to all the nodes and had to log into the DB nodes using the ILOM:
[root@test01 ~]$ ssh root@exa01dbadm01-ilom.example.com
 Password:
 
 Oracle(R) Integrated Lights Out Manager
 
 Version 4.0.4.52 r133103
 
 Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
 
 Warning: HTTPS certificate is set to factory default.
 
 Hostname: exa01dbadm01-ilom
 
 -> start /SP/Console
 Are you sure you want to start /SP/console (y/n)? y
 
 Serial console started. To stop, type ESC (
 
 
 exa01db01 login: root
 Password:
 Last login: Sat Aug 15 16:26:07 CDT 2020 from exa01db01.example.com on ssh
 Last login: Sat Aug 15 16:31:59 on ttyS0
 [root@exa01db01 ~]#

Troubleshooting

We lost all connectivity because all three switches were in the "discover" state. Given the complexity of this issue, we engaged Oracle to assist in the analysis. All IB switches returned to the "standby" state once their clocks reached the time they had shown before the reboot. By the time we logged back into the system, ibs01 had already become the "master." The other two switches remained in the "boot in progress" state for some time:
[root@exa01sw-ibs01 ~]# ibswitches
 Switch : 0x0010e0dc11fba0a0 ports 36 "SUN DCS 36P QDR exa01sw-ibs01 10.XX.YY.ZZ" enhanced port 0 lid 3 lmc 0
 Switch : 0x0010e0dc1ceda0a0 ports 36 "BOOT IN PROGRESS" enhanced port 0 lid 0 lmc 0
 Switch : 0x0010e0dc0f6ea0a0 ports 36 "BOOT IN PROGRESS" enhanced port 0 lid 0 lmc 0
 
 [root@exa01sw-ibs01 ~]# getmaster
 Local SM enabled and running, state MASTER
 Last change in Master SubnetManager status detected at: Sat Aug 15 16:25:50 CDT 2020
 Master SubnetManager on sm lid 3 sm guid 0x10e0dc11fba0a0 : SUN DCS 36P QDR exa01sw-ibs01 10.XX.YY.ZZ
 Master SubnetManager Activity Count: 11637 Priority: 14
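
A post-patch check can look for switches that are still reported as "BOOT IN PROGRESS" before touching the next switch. The following is a minimal Python sketch, assuming the "ibswitches" output format shown above. The function name is hypothetical.

```python
import re

# Captures the switch GUID and the quoted node description from each line.
IBSWITCH_RE = re.compile(r'Switch\s*:\s*(0x[0-9a-f]+)\s+ports\s+\d+\s+"([^"]*)"')


def booting_switches(ibswitches_output):
    """Return GUIDs of switches whose description is still BOOT IN PROGRESS."""
    return [
        guid
        for guid, description in IBSWITCH_RE.findall(ibswitches_output)
        if description == "BOOT IN PROGRESS"
    ]


# Lines copied from the ibswitches output in this post.
SAMPLE = """\
Switch : 0x0010e0dc11fba0a0 ports 36 "SUN DCS 36P QDR exa01sw-ibs01 10.XX.YY.ZZ" enhanced port 0 lid 3 lmc 0
Switch : 0x0010e0dc1ceda0a0 ports 36 "BOOT IN PROGRESS" enhanced port 0 lid 0 lmc 0
Switch : 0x0010e0dc0f6ea0a0 ports 36 "BOOT IN PROGRESS" enhanced port 0 lid 0 lmc 0
"""

pending = booting_switches(SAMPLE)
print("still booting:", pending)
```

An empty list means every switch visible on the fabric has reported its real node description.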

The Oracle engineer identified the root cause of this issue as the incorrect configuration of NTP. This configuration error led to a clock difference on the switches, which Linux corrected after the reboot by moving the clock backward. This caused the switch to enter the "discover" state and become unusable. He pinpointed the issue by detecting a backward time jump in the /var/log/messages file of the IB switch:
Aug 15 18:16:50 exa01sw-iba01 envd[2532]: Connector 13B present
 Aug 15 16:29:05 exa01sw-iba01 envd[2532]: Connector 14B present
 
The switches reverted to "standby" state when the system clock reached the same time it had before the reboot.
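
The same backward jump can be searched for automatically. The following is a minimal Python sketch, assuming the syslog timestamp layout shown above (month, day, time, with no year). The function name and the "year" parameter are hypothetical, and entries that span a year boundary are not handled.

```python
from datetime import datetime


def backward_jumps(lines, year=2020):
    """Return (previous, current) timestamp pairs where the log clock goes backward."""
    jumps = []
    previous = None
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        try:
            # Syslog lines carry no year, so the caller supplies one.
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a timestamped line
        if previous is not None and stamp < previous:
            jumps.append((previous, stamp))
        previous = stamp
    return jumps


# Lines copied from /var/log/messages in this post.
SAMPLE = [
    "Aug 15 18:16:50 exa01sw-iba01 envd[2532]: Connector 13B present",
    "Aug 15 16:29:05 exa01sw-iba01 envd[2532]: Connector 14B present",
]

jumps = backward_jumps(SAMPLE)
for before, after in jumps:
    print("clock moved back by", before - after)
```

For the two lines shown in this post, the sketch reports a single backward jump of 1 hour, 47 minutes, and 45 seconds.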

Solution

To avoid running into an outage, it's preferable to patch switches one at a time and execute a series of checks on each switch after patching, before proceeding to the next one.

NOTE: Only perform changes using the "ilom-admin" (ILOM shell) account. Do not perform any changes on the IB switches with the "root" (regular shell) account.

Configure and enable NTP

  1. Carefully review the NTP settings on each switch:
    Hostname: exa01sw-iba01.example.com
     
     -> show /SP/clock
     
      /SP/clock
      Targets:
     
      Properties:
      datetime = Mon Aug 17 08:58:58 2020
      timezone = CDT (America/Chicago)
      uptime = 0 days, 23:06:02
      usentpserver = disabled
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clients/ntp/server/1
     
      /SP/clients/ntp/server/1
      Targets:
     
      Properties:
      address = 0.0.0.0
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clients/ntp/server/2
     
      /SP/clients/ntp/server/2
      Targets:
     
      Properties:
      address = 0.0.0.0
     
      Commands:
      cd
      set
      show
     
     -> exit
     [root@exa01sw-iba01 ~]# ls -l /etc/ntp.conf
     lrwxrwxrwx 1 root root 21 2015-09-10 11:37 /etc/ntp.conf -> /config/conf/ntp.conf
     [root@exa01sw-iba01 ~]# ls -l /etc/default/ntpdate
     lrwxrwxrwx 1 root root 20 2016-03-31 07:39 /etc/default/ntpdate -> /config/conf/ntpdate
     [root@exa01sw-iba01 ~]# cat /config/conf/ntp.conf
     ## DO NOT EDIT THIS FILE ##
     #server none
     #server none
     driftfile /conf/ntp-drift
     pidfile /var/run/ntpd.pid
     [root@exa01sw-iba01 ~]# cat /config/conf/ntpdate
     ## DO NOT EDIT THIS FILE ##
     # servers to check. (Separate multiple servers with spaces.)
     NTPSERVERS="XXX.YYY.ZZZ.AAA XXX.BBB.CCC.AAA"
     #
     # additional options for ntpdate
     NTPOPTIONS="-u"
    The NTP settings had been changed from the "root" account (regular shell), which updated them only at the OS level, while the ILOM still showed NTP as disabled with no servers configured. During the reboot, Linux invoked "ntpdate" and adjusted the system clock (backward) using the OS configuration, causing a significant time jump that placed the switches in the "discover" state.
  2. Change to the "ilom-admin" (ILOM shell) account and adjust the configurations, before enabling NTP.
    Hostname: exa01sw-ibb01.example.com
     
     -> show /SP/clients/ntp/server/1
     
      /SP/clients/ntp/server/1
      Targets:
     
      Properties:
      address = 0.0.0.0
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clients/ntp/server/2
     
      /SP/clients/ntp/server/2
      Targets:
     
      Properties:
      address = 0.0.0.0
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clock
     
      /SP/clock
      Targets:
     
      Properties:
      datetime = Sat Aug 22 08:25:29 2020
      timezone = CDT (US/Central)
      uptime = 89 days, 13:42:39
      usentpserver = disabled
     
      Commands:
      cd
      set
      show
     
     -> set /sp/clock/ usentpserver=disabled
     Set 'usentpserver' to 'disabled'
     
     -> set /sp/clients/ntp/server/1/ address=XXX.YYY.ZZZ.AAA
     Set 'address' to 'XXX.YYY.ZZZ.AAA'
     
     -> set /sp/clients/ntp/server/2/ address=XXX.BBB.CCC.AAA
     Set 'address' to 'XXX.BBB.CCC.AAA'
     
     -> set /sp/clock/ usentpserver=enabled
     Set 'usentpserver' to 'enabled'
     
     -> show /SP/clock
     
      /SP/clock
      Targets:
     
      Properties:
      datetime = Sat Aug 22 08:31:05 2020
      timezone = CDT (US/Central)
      uptime = 89 days, 13:45:48
      usentpserver = enabled
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clients/ntp/server/1
     
      /SP/clients/ntp/server/1
      Targets:
     
      Properties:
      address = XXX.YYY.ZZZ.AAA
     
      Commands:
      cd
      set
      show
     
     -> show /SP/clients/ntp/server/2
     
      /SP/clients/ntp/server/2
      Targets:
     
      Properties:
      address = XXX.BBB.CCC.AAA
     
      Commands:
      cd
      set
      show
     
     -> exit
     [root@exa01sw-ibb01 ~]# date
     Sat Aug 22 08:31:13 CDT 2020
  3. After enabling NTP, make sure the switch is in the "standby" state before proceeding to the next switch. Do not reboot the "master" switch if there are no remaining switches in the "standby" state. Check the current state with the "getmaster" command in the regular shell of the IB switch:
    [root@exa01sw-ibb01 ~]# getmaster
     Local SM enabled and running, state STANDBY
     Last change in Master SubnetManager status detected at: Sun May 24 08:20:16 CDT 2020
     Master SubnetManager on sm lid 1 sm guid 0x10e0dc218da0a0 : SUN DCS 36P QDR exa01sw-iba01 
     Master SubnetManager Activity Count: 57235558 Priority: 14
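
The mismatch described in step 1, where the ILOM and the OS-level files disagree about NTP, can be detected by comparing the two views. The following is a minimal Python sketch, assuming the "show" output and the /config/conf/ntpdate format shown above. All function names are hypothetical, and the sample reuses the masked addresses from this post.

```python
import re


def ilom_properties(show_output):
    """Parse 'name = value' property lines from ILOM 'show' output."""
    properties = {}
    for line in show_output.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(.*\S)\s*$", line)
        if match:
            properties[match.group(1)] = match.group(2)
    return properties


def ntpdate_servers(conf_text):
    """Return the servers listed in the NTPSERVERS line of the ntpdate file."""
    match = re.search(r'^\s*NTPSERVERS="([^"]*)"', conf_text, re.MULTILINE)
    return match.group(1).split() if match else []


def ntp_mismatch(clock_output, server_outputs, ntpdate_conf):
    """Return findings as strings; an empty list means both views agree."""
    findings = []
    if ilom_properties(clock_output).get("usentpserver") != "enabled":
        findings.append("ILOM: usentpserver is not enabled")
    ilom_servers = [ilom_properties(out).get("address") for out in server_outputs]
    # 0.0.0.0 is what the ILOM shows for an unset server in this post.
    ilom_servers = [s for s in ilom_servers if s and s != "0.0.0.0"]
    os_servers = ntpdate_servers(ntpdate_conf)
    if sorted(ilom_servers) != sorted(os_servers):
        findings.append(f"server lists differ: ILOM={ilom_servers} OS={os_servers}")
    return findings


CLOCK = """\
 /SP/clock
  Properties:
  datetime = Mon Aug 17 08:58:58 2020
  usentpserver = disabled
"""
SERVER = """\
 /SP/clients/ntp/server/1
  Properties:
  address = 0.0.0.0
"""
NTPDATE = 'NTPSERVERS="XXX.YYY.ZZZ.AAA XXX.BBB.CCC.AAA"\nNTPOPTIONS="-u"\n'

findings = ntp_mismatch(CLOCK, [SERVER, SERVER], NTPDATE)
for finding in findings:
    print(finding)
```

With the values captured before the fix, the sketch reports both problems: NTP is disabled in the ILOM, and the OS-level file lists servers the ILOM does not know about.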

Adjusting the Subnet Manager (SM) if the clock moved forward

In our scenario, the clock moved forward, causing multiple master IB switches on the fabric. This led to connectivity issues on one node. Rotating the SM to a switch that has the correct timestamp will fix this:
  1. Check the state of all switches in the cluster:
    [root@exa01db01 pythian]# dcli -l root -g ~/ibs_group: getmaster | grep state
     exa01sw-iba01: Local SM enabled and running, state STANDBY
     exa01sw-ibb01: Local SM enabled and running, state STANDBY
     exa01sw-ibs01: Local SM enabled and running, state MASTER
  2. Disable SM on the current "master" switch:
    [root@exa01db01 pythian]# ssh exa01sw-ibs01
     You are now logged in to the root shell.
     It is recommended to use ILOM shell instead of root shell.
     All usage should be restricted to documented commands and documented
     config files.
     To view the list of documented commands, use "help" at linux prompt.
     [root@exa01sw-ibs01 ~]# disablesm
     
     Stopping partcfgchk [ OK ]
     Stopping partitiond-daemon [ OK ]
     Stopping IB Subnet Manager..-. [ OK ]
     
     [root@exa01sw-ibs01 ~]# getmaster
     Local SM not enabled
     
  3. Re-enable SM:
    [root@exa01sw-ibs01 ~]# enablesm
     
     Starting IB Subnet Manager. [ OK ]
     Starting partitiond-daemon [ OK ]
     Starting partcfgchk [ OK ]
     
     [root@exa01sw-ibs01 ~]# getmaster
     Local SM enabled and running, state STANDBY
     Last change in Master SubnetManager status detected at: Sun May 24 08:20:16 CDT 2020
     Master SubnetManager on sm lid 1 sm guid 0x10e0dc218da0a0 : SUN DCS 36P QDR exa01sw-iba01 XX.YY.ZZ.ZZ
     Master SubnetManager Activity Count: 615361 Priority: 14
  4. Check the state of all switches in the cluster again. (When SM was disabled on ibs01, iba01 took over as "master"):
    [root@exa01db01 pythian]# dcli -l root -g ~/ibs_group: getmaster | grep state
     exa01sw-iba01: Local SM enabled and running, state MASTER
     exa01sw-ibb01: Local SM enabled and running, state STANDBY
     exa01sw-ibs01: Local SM enabled and running, state STANDBY
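
The rule that the "master" must not be rebooted unless another switch is in "standby" can be turned into a gate that runs before each switch is patched. The following is a minimal Python sketch, assuming the "dcli ... getmaster | grep state" output format shown above. The function names and the "DISABLED" label are hypothetical, and any state name other than MASTER and STANDBY is an assumption about what "getmaster" prints.

```python
def sm_states(dcli_output):
    """Parse 'host: Local SM enabled and running, state X' lines into {host: state}."""
    states = {}
    for line in dcli_output.splitlines():
        host, separator, rest = line.strip().partition(":")
        if not separator:
            continue
        if "state" in rest:
            states[host] = rest.rsplit("state", 1)[1].strip()
        elif "not enabled" in rest:
            states[host] = "DISABLED"  # hypothetical label for 'Local SM not enabled'
    return states


def safe_to_reboot(switch, states):
    """True only if the fabric has one master and a standby can cover for 'switch'."""
    masters = [host for host, state in states.items() if state == "MASTER"]
    standbys = [
        host for host, state in states.items() if state == "STANDBY" and host != switch
    ]
    if len(masters) != 1:
        return False  # zero or multiple masters: fix the fabric first
    if states.get(switch) == "MASTER":
        return bool(standbys)  # another switch must be able to take over
    return True


# Lines copied from the dcli output in this post.
SAMPLE = """\
exa01sw-iba01: Local SM enabled and running, state STANDBY
exa01sw-ibb01: Local SM enabled and running, state STANDBY
exa01sw-ibs01: Local SM enabled and running, state MASTER
"""

states = sm_states(SAMPLE)
print(states)
print("ok to reboot master:", safe_to_reboot("exa01sw-ibs01", states))
```

The gate refuses the reboot in the situation that caused the outage, where the "master" is the only switch in a usable state, and also when more than one switch claims to be "master".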

Patching the switches with "patchmgr"

  1. It's preferable to patch one switch at a time since "patchmgr" doesn't check the state of the IB switch. Specify a single switch in the group's file as follows:
    [root@exa01db01 ~]# cat ~/ib_group
     exa01sw-ibs01
     
     [root@exa01db01 ~]# cd /u01/patches/exadata_patches/IB_PATCHING/patch_19.2.12.0.0.200317/
     [root@exa01db01 patch_19.2.12.0.0.200317]# ./patchmgr -ibswitches ~/ib_group -upgrade
     2020-08-15 15:49:38 -0500 1 of 1 :Working: DO: Initiate upgrade of InfiniBand switches to 2.2.14-1. Expect up to 40 minutes for each switch
     
      ----- InfiniBand switch update process started 2020-08-15 15:49:39 -0500 -----
     [NOTE ] Log file at /u01/patches/exadata_patches/IB_PATCHING/patch_19.2.12.0.0.200317/upgradeIBSwitch.log
     
     [INFO ] List of InfiniBand switches for upgrade: ( exa01sw-ibs01 )
     [SUCCESS ] Verifying Network connectivity to exa01sw-ibs01
     [SUCCESS ] Validating verify-topology output
     [INFO ] Proceeding with upgrade of InfiniBand switches to version 2.2.14_1
     [INFO ] Master Subnet Manager is set to "exa01sw-iba01" in all Switches
     
     [INFO ] ---------- Starting with InfiniBand Switch exa01sw-ibs01
     [WARNING ] Infiniband switch meets minimal version requirements, but downgrade is only available to 2.2.13-2 with the current package.
      To downgrade to other versions:
      - Manually download the InfiniBand switch firmware package to the patch directory
      - Set export variable "EXADATA_IMAGE_IBSWITCH_DOWNGRADE_VERSION" to the appropriate version
      - Run patchmgr command to initiate downgrade.
     [SUCCESS ] Verify SSH access to the patchmgr host exa01db01.example.com from the InfiniBand Switch exa01sw-ibs01.
     [INFO ] Starting pre-update validation on exa01sw-ibs01
     [SUCCESS ] Verifying that /tmp has 150M in exa01sw-ibs01, found 492M
     [SUCCESS ] Verifying that / has 20M in exa01sw-ibs01, found 26M
     [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibs01
     [SUCCESS ] NTP daemon is running on exa01sw-ibs01.
     [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 15:59:12
     [INFO ] Validating the current firmware on the InfiniBand Switch
     [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibs01
     [SUCCESS ] Verifying that the patchmgr host exa01db01.example.com is recognized on the InfiniBand Switch exa01sw-ibs01 through getHostByName
     [SUCCESS ] Execute plugin check for Patch Check Prereq on exa01sw-ibs01
     [INFO ] Finished pre-update validation on exa01sw-ibs01
     [SUCCESS ] Pre-update validation on exa01sw-ibs01
     [INFO ] Package will be downloaded at firmware update time via scp
     [SUCCESS ] Execute plugin check for Patching on exa01sw-ibs01
     [INFO ] Starting upgrade on exa01sw-ibs01 to 2.2.14_1. Please give upto 15 mins for the process to complete. DO NOT INTERRUPT or HIT CTRL+C during the upgrade
     [INFO ] Rebooting exa01sw-ibs01 to complete the firmware update. Wait for 15 minutes before continuing. DO NOT MANUALLY REBOOT THE INFINIBAND SWITCH
     [SUCCESS ] Load firmware 2.2.14_1 onto exa01sw-ibs01
     [SUCCESS ] Verify that /conf/configvalid is set to 1 on exa01sw-ibs01
     [INFO ] Set SMPriority to 8 on exa01sw-ibs01
     [INFO ] Starting post-update validation on exa01sw-ibs01
     [SUCCESS ] Service opensmd is running on InfiniBand Switch exa01sw-ibs01
     [SUCCESS ] NTP daemon is running on exa01sw-ibs01.
     [INFO ] Manually validate the following entries Date:(YYYY-MM-DD) 2020-08-15 Time:(HH:MM:SS) 16:07:02
     [INFO ] /conf/configvalid is 1
     [INFO ] Validating the current firmware on the InfiniBand Switch
     [SUCCESS ] Firmware verification on InfiniBand switch exa01sw-ibs01
     [SUCCESS ] Execute plugin check for Post Patch on exa01sw-ibs01
     [INFO ] Finished post-update validation on exa01sw-ibs01
     [SUCCESS ] Post-update validatio
