Pythian Blog: Technical Track

Solving an uncommon Oracle error code - ORA-01041

So, here is the thing. We have a customer running an 11.2 single-instance Oracle database on Windows. I know the version is a bit old and Windows may not be the most common platform for an Oracle database, but that is not the point of this post.

This customer asked us to back up the database directly to a CIFS volume using RMAN. This is a common and wise practice: keep your database backups off the server where the database runs. Piece of cake. The only caveat is that the Oracle Windows service must run under a domain account that has permissions on the CIFS volume. Some basic research on MOS and voilà: How to Change Oracle Owner from Local System Account to Domain User Account in Windows (Doc ID 2035714.1).

After obtaining a maintenance window and executing the action plan to move the services from the local SYSTEM account to a domain account, a simple RMAN backup writing directly to the CIFS volume works just fine:
RMAN> backup current controlfile format '\\backup-nas.domain.com\oracle\backups\oracle\ctl_file_text.bkp';
 Starting backup at 16-08-2018 11:50:24
 using channel ORA_DISK_1
 channel ORA_DISK_1: starting full datafile backup set
 channel ORA_DISK_1: specifying datafile(s) in backup set
 including current control file in backup set
 channel ORA_DISK_1: starting piece 1 at 16-08-2018 11:50:25
 channel ORA_DISK_1: finished piece 1 at 16-08-2018 11:50:40
 piece handle=\\backup-nas.domain.com\oracle\backups\oracle\CTL_FILE_TEXT.BKP tag=TAG20180816T115024 comment=NONE
 channel ORA_DISK_1: backup set complete, elapsed time: 00:00:15
 Finished backup at 16-08-2018 11:50:40
 
 Starting Control File and SPFILE Autobackup at 16-08-2018 11:50:40
 piece handle=G:\ORACLE\FAST_RECOVERY_AREA\PROTECT\AUTOBACKUP\2018_08_16\O1_MF_S_984311440_FQBOR0Y3_.BKP comment=NONE
 Finished Control File and SPFILE Autobackup at 16-08-2018 11:50:41
 
We now reboot the server to make sure that everything comes up fine. Oops!
PS C:\Users\pythian.admin> sqlplus / as sysdba
 
 SQL*Plus: Release 11.2.0.4.0 Production on Thu Aug 16 11:01:38 2018
 
 Copyright (c) 1982, 2017, Oracle. All rights reserved.
 
 ERROR:
 ORA-01041: internal error. hostdef extension doesn't exist
 

What happened?

Good question, indeed. The OERR information is not life-saving:
Error: ORA 1041
 Text: internal error. hostdef extension doesn't exist
 ---------------------------------------------------------
 Cause: Pointer to hstdef extension in hstdef is null.
 Action: Likely a known or new bug.
 
 Explanation: This is usually reported when a connection has broken for some reason.
 
 Diagnosis:
 
 1) Check the same operation for any ORA 3113 or ORA 3114 type errors. The ORA 1041 error usually results from an unexpected disconnection.
 2) Follow the same steps as you would to progress an ORA 3113
 
Obviously, there are no ORA-3113 or ORA-3114 errors to be seen. There is also not much information about this error in MOS, and Google does not help much either. One MOS note looked promising, though: ORA-1041 When Trying to Connect as Sysdba to Startup Database (Doc ID 552218.1). From the note:
The system time was set incorrectly.
 The local system time did not match the time on the domain server causing the authentication to fail.
Why does this look promising? Because the database shows an uptime of one hour right after starting up, so something is changing the time when it shouldn't.

Windows time service is fidgeting with my database

After reviewing the Windows event log, I noticed a sudden jump in the time during the startup of the server.

[Figure: screenshot showing the jump in time from 9:23 to 10:23. Note the jump in the time.]

This shows the Windows kernel updating the system time after the Windows Time service connects to the domain controller and synchronizes the server time with it. The problem is that all of this happens after the Oracle services have started, hence the startup issue. Supporting this hypothesis, manually restarting the services once the server is up and running makes everything work as expected.
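To put a number on that jump, here is a minimal Python sketch that compares the two clock readings against a tolerance. It assumes the Active Directory default Kerberos policy of a 5-minute maximum clock difference; the post does not say which policy this domain uses, or that Kerberos is the mechanism involved. The function names are made up for illustration, and the date is arbitrary because only the time of day is visible in the screenshot.

```python
from datetime import datetime, timedelta

# Assumption: Active Directory's default Kerberos policy tolerates up to
# 5 minutes of clock difference between a host and the domain controller.
# The actual policy in this customer's domain is not known.
MAX_SKEW = timedelta(minutes=5)


def clock_skew(local_time, domain_time):
    """Return the absolute difference between two clock readings."""
    return abs(domain_time - local_time)


def within_tolerance(local_time, domain_time, max_skew=MAX_SKEW):
    """Return True if the two clocks differ by no more than max_skew."""
    return clock_skew(local_time, domain_time) <= max_skew


if __name__ == "__main__":
    # Times of day taken from the event log screenshot (9:23 -> 10:23).
    # The date itself is arbitrary and used only to build datetime objects.
    before_sync = datetime(2018, 8, 16, 9, 23)
    after_sync = datetime(2018, 8, 16, 10, 23)

    print(clock_skew(before_sync, after_sync))        # 1:00:00
    print(within_tolerance(before_sync, after_sync))  # False
```

Under that assumption, a one-hour difference is far outside the tolerance, which fits the authentication failure described in the MOS note.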

Diagnosis and prognosis

So, the situation is as follows:
  • Oracle Windows services work fine after being set up with a domain account.
  • RMAN backups are working after the above setup.
  • On Windows boot, Oracle services start before the Windows Time service can set the proper system time.
  • After the system is up and running, restarting the Oracle services gets them up and running.
The conclusion is that the system time differs from the Active Directory time, which prevents the credentials used for the Oracle services from being validated and causes the database startup to fail.

Final fix

There may be more, but we analyzed two different approaches here. One is to set the BIOS RTC to the proper time, in sync with the domain server, using an NTP configuration. The other is to have the Oracle services start after the Windows Time service.

On its own, though, the second approach does not ensure that the system time has already been changed, so an additional setting has to be put in place: declaring the Oracle services as automatic-delayed ones. This ensures that the Oracle services will start only after all the automatic services are up and running, with the caveat that it makes the Oracle services dependent on others that may not be directly related.

So, working with our customer and Oracle Support, we decided to go for the automatic-delayed service startup, and it did the trick. While this is not the ideal situation, we don't expect this server to go down frequently, or at all actually, so this configuration should be enough to bring the database back up and running in case of an unexpected server failure.
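For reference, a rough Python sketch of how the two service settings could be applied with the Windows sc.exe tool. The service name OracleServicePROTECT is hypothetical (a guess based on the path in the RMAN output above), the helper names are made up, and the post does not show the exact commands that were run. By default the script only prints the commands. It executes them only on Windows when called with --apply from an elevated prompt.

```python
import subprocess
import sys

# Hypothetical service name: Oracle names the instance service
# OracleService<SID>; "PROTECT" is only a guess taken from the RMAN output.
ORACLE_SERVICE = "OracleServicePROTECT"


def delayed_start_command(service):
    """Build the sc.exe call that sets a service to Automatic (Delayed Start)."""
    # sc.exe expects the "start=" option and its value as separate arguments.
    return ["sc.exe", "config", service, "start=", "delayed-auto"]


def dependency_command(service, depends_on="W32Time"):
    """Build the sc.exe call that makes a service depend on another service.

    W32Time is the Windows Time service. Note that depend= replaces the
    existing dependency list of the service, so check it first.
    """
    return ["sc.exe", "config", service, "depend=", depends_on]


if __name__ == "__main__":
    commands = [
        dependency_command(ORACLE_SERVICE),
        delayed_start_command(ORACLE_SERVICE),
    ]
    apply_changes = sys.platform == "win32" and "--apply" in sys.argv
    for cmd in commands:
        if apply_changes:
            subprocess.run(cmd, check=True)  # requires administrator rights
        else:
            print(" ".join(cmd))  # dry run: only show the command
```

The same change can be made by hand in the Services console by setting the startup type to Automatic (Delayed Start). Any other Oracle services on the server, such as the listener, would need the same treatment under their own names.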
