I “came of age” for data center operations well before cloud native was a thing. Hardware was (comparatively) slow, expensive, and often customized: database servers would have specialized SCSI arrays and caching RAID controllers; compute servers would have extra memory and CPU cores – but there would be only two of these machines per datacenter, because they were expensive and precious. (One outage on an expensive Sun Microsystems server was caused by firmware version mismatches in the drives of a RAID array, which prevented the drives from failing over to a SCSI controller in another node. This took the entire database offline, due to a single failure that should have been tolerated without impact. Luckily the database was replicated to an identical array in another datacenter, so the impact was brief.)
Because of the scarcity and expense of compute and storage resources, my inclination is to understand problems deeply, to try to ensure they can be prevented. So it was quite eye-opening for me to see the “cattle-not-pets” philosophy in real life in our Slack the other day.
There was an issue with image pulls being very slow in our Azure-hosted development cluster – but they were only slow when served from a specific machine, and only from a pod IP (not from host networking IPs). Rather than running packet captures, analyzing disk queues, and checking CPU I/O wait times – which would have been my inclination – one of our operations engineers just … shot the cow and added a new one to the herd.
From our internal Slack, two engineers (T and U) were chatting:
T: I can just try rebuilding the instance
U: maybe we try to get to the bottom of it first? is there some obvious places/config to check? Like on Azure etc?
T: Cattle, not pets dude
If replacing it doesn't fix the issue, then we can spend more time trying to solve it, but if it solves the issue then we save a bunch of time. Reducing the issue down to a single node is great work because a single node can be replaced super easily. For all we know the physical host in the Azure data center has issues and we can't do much about that but we can replace the instance easily and then check if the problem persists or not
U: ok, if replacing is fine, let's do it!
T: OK, it's fully removed now
And coming back...
and it's green, test away!
U: all good, now it's very fast
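For anyone curious what “shooting the cow” involves mechanically on the Kubernetes side, here is a minimal sketch using the Kubernetes Python client. The node name is made up, the actual rebuild of the underlying Azure instance is left to whatever provisions your cluster, and the drain logic skips the DaemonSet, mirror-pod, and PodDisruptionBudget handling that `kubectl drain` does for you (and on older client versions the eviction body class differs) – so treat it as an illustration of the idea, not the exact procedure T used.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

NODE = "dev-worker-3"  # hypothetical name for the misbehaving node

# Cordon: mark the node unschedulable so nothing new lands on it.
v1.patch_node(NODE, {"spec": {"unschedulable": True}})

# Evict the pods running there (a simplistic drain).
pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={NODE}")
for pod in pods.items:
    v1.create_namespaced_pod_eviction(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
        body=client.V1Eviction(
            metadata=client.V1ObjectMeta(
                name=pod.metadata.name, namespace=pod.metadata.namespace
            )
        ),
    )

# Delete the Node object; the instance itself is rebuilt by your
# provisioning tooling, and the fresh machine boots from the same
# declarative Talos configuration.
v1.delete_node(NODE)
```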
That was educational for me, demonstrating the power of cloud-native technologies and thinking. I knew all this in theory (Talos Linux is a Kubernetes operating system, which brings Kubernetes principles to the OS: machine state is reconciled to a declarative configuration, and immutable file systems mean a reboot returns the node to a pristine state, for example). Yet it was informative to see the practical effect in real operations – quite a different way of thinking from what I was used to.
The takeaways for me at least were:
- You still need good monitoring to identify issues.
- You need to ensure the issue is isolated to a single node (or small subset of nodes). It’s no good shooting your herd one by one if they all suffer from the same issue.
- You need monitoring, or some other means of validation, so that once you have destroyed and replaced a node you can confirm the replacement does not suffer from the same issue (see the sketch after this list).
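To make the last two points concrete, here is a small sketch (plain Python, with made-up node names and numbers, not real data from this incident) of the kind of check you could run against per-node image-pull timings from your monitoring system: once to confirm the slowness is confined to a single node, and again after the rebuild to confirm the replacement is healthy.

```python
from statistics import median

def outlier_nodes(pull_seconds_by_node, factor=5.0):
    """Flag nodes whose median image-pull time is far above the fleet median.

    pull_seconds_by_node: dict of node name -> list of observed pull
    durations in seconds (e.g. scraped from your monitoring system).
    """
    per_node = {
        node: median(samples)
        for node, samples in pull_seconds_by_node.items()
        if samples
    }
    fleet = median(per_node.values())
    return {node: m for node, m in per_node.items() if m > factor * fleet}

# Illustrative numbers only.
samples = {
    "worker-1": [2.1, 1.8, 2.4],
    "worker-2": [1.9, 2.2, 2.0],
    "worker-3": [95.0, 120.0, 88.0],   # the suspect node
}
print(outlier_nodes(samples))          # -> {'worker-3': 95.0}
# Re-run the same check after the node is rebuilt: an empty result means
# the replacement does not suffer from the same issue.
```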
Eliminating unnecessary troubleshooting and diagnosis can be a significant time saver and a real boost to efficiency. Make sure you take advantage of the boost that the cloud-native mindset enables! And make sure you are running Talos Linux, so you can truly treat your nodes’ operating system as cattle.