
Diagnosing Orchestrator Crashes Due to Time Drift Events

We were recently paged by a client because one node of their Orchestrator setup had crashed. The cluster was configured as an Orchestrator/raft consensus cluster with a SQLite backend. Looking in the log files, this is what I saw:

2021-01-08 03:01:35 ERROR Heath test is failing for over 5 seconds. raft yielding
2021-01-08 03:01:35 FATAL Node is unable to register health. Please check database connnectivity.

What Orchestrator was hinting to me was that it had somehow lost connectivity to its database. I found that strange since, as mentioned above, this instance was using SQLite on a local disk. The OS logs showed no indication of problems with the local disk, so I had to look further for other possible causes.
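For context, a raft setup with a SQLite backend boils down to a handful of settings in orchestrator.conf.json. The snippet below is a minimal sketch, not the client's actual configuration: the addresses and paths are placeholders, but the keys are the standard Orchestrator settings for raft and the SQLite backend.

{
  "RaftEnabled": true,
  "RaftDataDir": "/var/lib/orchestrator",
  "RaftBind": "192.168.0.11",
  "RaftNodes": ["192.168.0.11", "192.168.0.12", "192.168.0.13"],
  "BackendDB": "sqlite",
  "SQLite3DataFile": "/var/lib/orchestrator/orchestrator.sqlite3"
}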

Looking through the log file, I found this was not the only occurrence of this type of issue; it had already been happening for several days. On the two days leading up to the client call it got worse, and Orchestrator began crashing with a FATAL error.

2021-01-04 03:01:27 ERROR Heath test is failing for over 5 seconds. raft yielding
--
2021-01-05 03:01:37 ERROR Heath test is failing for over 5 seconds. raft yielding
--
2021-01-06 03:01:21 ERROR Heath test is failing for over 5 seconds. raft yielding
--
2021-01-07 03:02:03 ERROR Heath test is failing for over 5 seconds. raft yielding
2021-01-07 03:02:03 FATAL Node is unable to register health. Please check database connnectivity.
--
2021-01-08 03:01:35 ERROR Heath test is failing for over 5 seconds. raft yielding
2021-01-08 03:01:35 FATAL Node is unable to register health. Please check database connnectivity.

From the looks of it, this server had been having issues just after 3 a.m. for a few days in a row. That called for some deeper digging to find out what was happening.

Looking at the syslog, I observed the following. A few minutes after 3 a.m. (and, verified against the syslog, only at that time of day), systemd reported that the system time had been changed. A few minutes later, chronyd, the default time synchronization tool on this CentOS 7 machine, resynchronized the time with an NTP server. On the final two days, right after the time change was detected, Orchestrator crashed itself with a FATAL error.

Jan  4 03:01:27 orchestrator1 systemd: Time has been changed
Jan  4 03:04:58 orchestrator1 chronyd[789]: Selected source <ntp-server-ip>
Jan  4 03:04:58 orchestrator1 chronyd[789]: System clock wrong by -20.225029 seconds, adjustment started
--
Jan  5 03:01:37 orchestrator1 systemd: Time has been changed
Jan  5 03:04:59 orchestrator1 chronyd[789]: Selected source <ntp-server-ip>
Jan  5 03:04:59 orchestrator1 chronyd[789]: System clock wrong by -24.226423 seconds, adjustment started
--
Jan  6 03:01:21 orchestrator1 systemd: Time has been changed
Jan  6 03:15:20 orchestrator1 chronyd[789]: Selected source <ntp-server-ip>
Jan  6 03:15:20 orchestrator1 chronyd[789]: System clock wrong by -28.219162 seconds, adjustment started
--
Jan  7 03:02:03 orchestrator1 systemd: Time has been changed
Jan  7 03:02:03 orchestrator1 systemd: orchestrator.service: main process exited, code=exited, status=1/FAILURE
Jan  7 03:02:03 orchestrator1 systemd: Unit orchestrator.service entered failed state.
Jan  7 03:02:03 orchestrator1 systemd: orchestrator.service failed.
Jan  7 03:10:49 orchestrator1 chronyd[789]: Selected source <ntp-server-ip>
Jan  7 03:10:49 orchestrator1 chronyd[789]: System clock wrong by -32.215915 seconds, adjustment started
--
Jan  8 03:01:35 orchestrator1 systemd: Time has been changed
Jan  8 03:01:35 orchestrator1 systemd: orchestrator.service: main process exited, code=exited, status=1/FAILURE
Jan  8 03:01:35 orchestrator1 systemd: Unit orchestrator.service entered failed state.
Jan  8 03:01:35 orchestrator1 systemd: orchestrator.service failed.
Jan  8 03:05:52 orchestrator1 chronyd[789]: Selected source <ntp-server-ip>
Jan  8 03:05:52 orchestrator1 chronyd[789]: System clock wrong by -36.220819 seconds, adjustment started

To the source code

The nice thing about working with open source software is that you can check the source code to find out what it is doing. Looking at these lines of the Orchestrator source code, we see that the errors logged in our log file are emitted right after one another, in the same function call.

if process.SinceLastGoodHealthCheck() > yieldAfterUnhealthyDuration {
    log.Errorf("Heath test is failing for over %+v seconds. raft yielding", yieldAfterUnhealthyDuration.Seconds())
    orcraft.Yield()
}
if process.SinceLastGoodHealthCheck() > fatalAfterUnhealthyDuration {
    orcraft.FatalRaftError(fmt.Errorf("Node is unable to register health. Please check database connnectivity."))
}

The difference is the threshold duration at which each is triggered. Looking at these lines, we can see the values:

const (
    discoveryMetricsName        = "DISCOVERY_METRICS"
    yieldAfterUnhealthyDuration = 5 * config.HealthPollSeconds * time.Second
    fatalAfterUnhealthyDuration = 30 * config.HealthPollSeconds * time.Second
)

HealthPollSeconds defaults to 1, so yieldAfterUnhealthyDuration works out to 5 seconds and fatalAfterUnhealthyDuration to 30 seconds. When more than five seconds have passed since the last good health check, you get the raft yielding error but no crash. When more than 30 seconds have passed, you get both the raft yielding error and the fatal error about database connectivity. In other words, the fatal error has nothing to do with database connectivity as such; it is simply about health checks failing for too long.

Normally, the most likely reason for failing health checks would indeed be database connectivity, as Orchestrator persists the information about the last health check in its database. In this case, however, it turned out the database had nothing to do with it.

Taking a closer look at the syslog messages, we see that on the days Orchestrator crashed, the time drift was larger than 30 seconds, while on the days it didn't crash, the drift was smaller than 30 seconds. Putting this together with what we found in the source code, we have good reason to believe we've found the root cause of our crashes. Yay!
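To make the mechanism concrete, here is a minimal Go sketch, not Orchestrator's actual code, of how an elapsed-time check against a persisted wall-clock timestamp reacts to a clock jump of the size we saw in the syslog. A test program can't move the system clock, so the jump is simulated by shifting the stored timestamp instead; the thresholds mirror the defaults derived above.

package main

import (
	"fmt"
	"time"
)

func main() {
	// Thresholds as derived above, with HealthPollSeconds = 1.
	yieldAfter := 5 * time.Second
	fatalAfter := 30 * time.Second

	// Timestamp of the last good health check as it would come back from
	// the backend database: a plain wall-clock value, no monotonic reading.
	lastGoodHealthCheck := time.Now().Round(0) // Round(0) strips the monotonic clock

	// Simulate a snapshot-triggered sync jumping the wall clock forward by
	// 36 seconds (here approximated by moving the stored timestamp back).
	lastGoodHealthCheck = lastGoodHealthCheck.Add(-36 * time.Second)

	elapsed := time.Since(lastGoodHealthCheck)
	fmt.Printf("apparent time since last good health check: %v\n", elapsed)

	switch {
	case elapsed > fatalAfter:
		fmt.Println("-> exceeds 30s: raft yield AND fatal error, Orchestrator exits")
	case elapsed > yieldAfter:
		fmt.Println("-> exceeds 5s: raft yield only, no crash")
	default:
		fmt.Println("-> healthy")
	}
}

With a jump in the 20-to-28-second range, as on January 4th to 6th, only the yield branch fires; at 32 to 36 seconds, as on the 7th and 8th, the fatal branch fires as well, which matches the pattern in the logs.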

How VMware handles hardware clocks

As is the case with many Orchestrator installations I know of, this cluster runs on virtual machines (VMs), in this case in a VMware environment. The VMs are configured not to sync time with the physical host underneath; however, this knowledge base article from VMware suggests that certain VM events still trigger a synchronization. One of those events is the creation of a snapshot. After some back and forth with the client's VMware admins, we learned that the VM backups kick off at 3 a.m. Each backup creates a snapshot which, according to the article, triggers a time synchronization. Looking at the vmx file, we saw that the changes listed in the article were not there, so the VM was indeed synchronizing with the host on snapshot creation.
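For reference, the knowledge base article describes switching this behavior off by adding settings along the following lines to the VM's vmx file; treat this as an approximation, as the exact list varies between revisions of the article:

tools.syncTime = "0"
time.synchronize.continue = "0"
time.synchronize.restore = "0"
time.synchronize.resume.disk = "0"
time.synchronize.shrink = "0"
time.synchronize.tools.startup = "0"

None of these were present in the client's vmx file, which is why snapshot creation was still syncing the guest clock to the host.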

These servers had been running well for several months, so I was reluctant to make any changes to address the synchronization problem before finding out what had changed in the last few days.

The most likely explanation was that a recent change had caused the physical servers' clocks to start drifting, a little more each day, until the drift grew large enough to crash Orchestrator. The VMware team investigated and found exactly that: a network security change had blocked time synchronization on the physical servers. Once the network team re-enabled NTP traffic for those machines, their clocks were synced up again and the problem went away.

Conclusion

The resolution for this problem turned out to be fairly easy to implement, but it shows that any change in your environment can have nasty, unanticipated side effects on other parts of your infrastructure. Keeping good track of changes, however small they may seem, makes it much easier to trace an issue back to its root cause. To help people recognize this particular problem more quickly, I have created a pull request to update Orchestrator's error messages to mention time synchronization as a possible cause of the FATAL error.
