RAC – Clusterware Intermittently Fails to Start

While working on a new Grid Infrastructure installation on Red Hat Enterprise Linux 8.4, we ran into a problem while testing the new cluster: the clusterware failed to come back up after a node reboot… but sometimes it worked.

We had the latest 19c software patched to 19.12 and were using ASMLib to configure the ASM disks.
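
For reference, the ASM disks had been stamped with ASMLib in the usual way. The disk name and device below are illustrative, not the actual values from this cluster:

# oracleasm createdisk DATA01 /dev/sdb1
# oracleasm scandisks
# oracleasm listdisks
DATA01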

When it failed, the messages in the CRS alert.log showed that the clusterware could not find any voting disks; see the last line of the output:

2021-09-23 08:12:21.292 [OSYSMOND(28810)]CRS-8500: Oracle Clusterware OSYSMOND process is starting with operating system process ID 28810
2021-09-23 08:12:21.288 [CSSDMONITOR(28808)]CRS-8500: Oracle Clusterware CSSDMONITOR process is starting with operating system process ID 28808
2021-09-23 08:12:21.651 [CSSDAGENT(28847)]CRS-8500: Oracle Clusterware CSSDAGENT process is starting with operating system process ID 28847
2021-09-23 08:12:27.533 [OCSSD(28867)]CRS-8500: Oracle Clusterware OCSSD process is starting with operating system process ID 28867
2021-09-23 08:12:28.592 [OCSSD(28867)]CRS-1713: CSSD daemon is started in hub mode
2021-09-23 08:12:28.703 [OCSSD(28867)]CRS-1714: Unable to discover any voting files, retrying discovery in 15 seconds; Details at (:CSSNM00070:) in /racd01/app/oracle/diag/crs/l-04296-d5c/crs/trace/ocssd.trc

And in the ocssd.trc file:

2021-09-23 08:12:28.703 :   SKGFD:3462854400: ERROR: -14(asmlib /opt/oracle/extapi/64/asm/orcl/1/libasm.so version failed with 2
)
2021-09-23 08:12:28.703 :    CLSF:3462854400: Allocated CLSF context
2021-09-23 08:12:28.703 :    CSSD:3462854400: [     INFO] clssnmvDDiscThread: using discovery string ORCL:* for initial discovery
2021-09-23 08:12:28.703 :   SKGFD:3462854400: Discovery with str:ORCL:*:

2021-09-23 08:12:28.703 :   SKGFD:3462854400: UFS discovery with :ORCL:*:

2021-09-23 08:12:28.703 :   SKGFD:3462854400: Execute glob on the string /racd01/app/19.3.0.0.0/grid/dbs/ORCL:*

2021-09-23 08:12:28.703 :   SKGFD:3462854400: OSS discovery with :ORCL:*:

2021-09-23 08:12:28.703 :   SKGFD:3462854400: Discovery skipping bad asmlib :ASM::

2021-09-23 08:12:28.703 :    CSSD:3462854400: [     INFO] clssnmvDiskVerify: Successful discovery of 0 disks
2021-09-23 08:12:28.703 :    CSSD:3462854400: [     INFO] clssnmCompleteInitVFDiscovery: Completing initial voting file discovery
2021-09-23 08:12:28.703 :    CSSD:3462854400: [     INFO] clssnmvFindInitialConfigs: No voting files found
2021-09-23 08:12:28.703 :    CSSD:3462854400: [     INFO] (:CSSNM00070:)clssnmCompleteInitVFDiscovery: Voting file not found. Retrying discovery in 15 seconds

Note the 'libasm.so version failed with 2' error on the first line of the trace.

We checked the disks using the discovery tool:

$ /usr/sbin/oracleasm-discover
Using ASMLib from /opt/oracle/extapi/64/asm/orcl/1/libasm.so
asm_version() failed with code 2

It gave the same error! We traced the discovery with strace /usr/sbin/oracleasm-discover, and the output clearly shows that we do not have permission on the ASMLib device files:

write(1, "Using ASMLib from /opt/oracle/ex"..., 61) = 61
openat(AT_FDCWD, "/dev/oracleasm/.query_version", O_RDWR) = -1 EACCES (Permission denied)
openat(AT_FDCWD, "/dev/oracleasm/iid", O_RDONLY) = -1 EACCES (Permission denied)
write(2, "asm_version() failed with code 2"..., 33) = 33
exit_group(2)                           = ?
+++ exited with 2 +++

We checked the ASMLib files under /dev/oracleasm and indeed found that they have the wrong user and group:

$ ls -lasi
total 0
18446638755664080616 0 drwxr-xr-x.  4 root root    0 Sep 23 06:50 .
                   3 0 drwxr-xr-x. 23 root root 5260 Sep 23 06:51 ..
18446638755664098920 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .check_iid
18446638755664086952 0 drwxr-xr-x.  1 root root    0 Sep 23 06:50 disks
18446638755664083432 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .get_iid
18446638755664094696 0 drwxrwx---.  1 root root    0 Sep 23 06:50 iid
18446638755664099624 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_disk
18446638755664071464 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_handle
18446638755664090472 0 -rwxrwx---.  1 root root    0 Sep 23 06:50 .query_version

They should be owned by the Oracle software owner and the DBA group. After checking the oracleasm log /var/log/oracleasm, we found the following error:

Thu Sep 23 08:12:28 Creating /dev/oracleasm mount point: /dev/oracleasm
Thu Sep 23 08:12:28 Creating /dev/oracleasm mount point: /dev/oracleasm
Thu Sep 23 08:12:28 Loading module "oracleasm": oracleasm
Thu Sep 23 08:12:28 Loading module "oracleasm": oracleasm
Thu Sep 23 08:12:28 Configuring "oracleasm" to use device physical block size
Thu Sep 23 08:12:28 Configuring "oracleasm" to use device physical block size
Thu Sep 23 08:12:28 chown: invalid user: 'oracle:dba'
Thu Sep 23 08:12:28 Mounting ASMlib driver filesystem: failed
Thu Sep 23 08:12:28 Mounting ASMlib driver filesystem: failed
Thu Sep 23 08:12:28 Unable to fix ASM driver permissions
Thu Sep 23 08:12:28 Unable to fix ASM driver permissions

Note the 'chown: invalid user' line: it seems the username is not valid! But I am logged in as oracle?
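
The 'oracle:dba' value comes from the ASMLib configuration file, which on a typical install is /etc/sysconfig/oracleasm. The values below are what we would expect on this system rather than captured output:

# grep -E '^ORACLEASM_(UID|GID)' /etc/sysconfig/oracleasm
ORACLEASM_UID=oracle
ORACLEASM_GID=dba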

Checking the users, we see that oracle is not in the /etc/passwd file, so we must be getting it from somewhere else.
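
This is easy to confirm with the standard name-service tools:

$ grep '^oracle:' /etc/passwd
$ getent passwd oracle

The grep returns nothing, while getent, which consults every configured name service, still resolves the user.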

So we checked /etc/nsswitch.conf and found that we also get user information from System Security Services:

passwd:      files sss

Checking the services, we see that they started at around the same time:

# systemctl status sssd
- sssd.service - System Security Services Daemon
   Loaded: loaded (/usr/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-09-23 08:12:28 EDT; 46min ago
...

# systemctl status oracleasm
- oracleasm.service - Load oracleasm Modules
   Loaded: loaded (/usr/lib/systemd/system/oracleasm.service; enabled; vendor preset: disabled)
   Active: active (exited) since Thu 2021-09-23 08:12:28 EDT; 46min ago
...

This could be our cause, and it would explain why the problem is intermittent. If oracleasm starts before the sssd service, it cannot resolve the oracle user when it tries to change the ownership of the device files. If oracleasm happens to start after sssd, the user details are available and the clusterware starts normally.
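
One way to confirm the start order on a given boot is to pull both units out of the journal and compare the timestamps (assuming journald keeps the boot log):

# journalctl -b -u oracleasm.service -u sssd.service --no-pager

On a failing boot, we would expect the oracleasm start messages to appear before sssd reports that it is up.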

So we added an After= dependency on sssd.service to the [Unit] section of the /usr/lib/systemd/system/oracleasm.service file:

[Unit]
Description=Load oracleasm Modules
After=sssd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/oracleasm.init start_sysctl
ExecStop=/usr/sbin/oracleasm.init stop_sysctl
ExecReload=/usr/sbin/oracleasm.init restart_sysctl

[Install]
WantedBy=multi-user.target

This makes sure that the sssd service starts before oracleasm.
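
A note on where to make the change: /usr/lib/systemd/system/oracleasm.service belongs to the ASMLib support package and can be overwritten by an update, so a drop-in override is the more durable way to add the same dependency. This is generic systemd behaviour, not anything specific to oracleasm:

# systemctl edit oracleasm.service

[Unit]
After=sssd.service

systemctl edit stores these lines in /etc/systemd/system/oracleasm.service.d/override.conf and merges them with the packaged unit. If you edit the packaged file directly, as we did, run systemctl daemon-reload afterwards so systemd picks up the change.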

This resolved the issue, and the clusterware now comes up normally every time.

The files under /dev/oracleasm now look like this:

$ ls -lasi
total 0
18446638755664080616 0 drwxr-xr-x.  4 root   root    0 Sep 23 09:50 .
                   3 0 drwxr-xr-x. 23 root   root 5260 Sep 23 09:51 ..
18446638755664098920 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .check_iid
18446638755664086952 0 drwxr-xr-x.  1 root   root    0 Sep 23 09:50 disks
18446638755664083432 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .get_iid
18446638755664094696 0 drwxrwx---.  1 oracle dba     0 Sep 23 09:50 iid
18446638755664099624 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_disk
18446638755664071464 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_handle
18446638755664090472 0 -rw-rw----.  1 oracle dba     0 Sep 23 09:50 .query_version
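
As a final check, the voting files can be queried directly; crsctl query css votedisk prints one line per voting disk that CSS can see:

$ crsctl query css votedisk

After the fix, this should list the voting disks on every reboot.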

Thank you for reading my post!
