When it’s good to shoot your cattle

I “came of age” in data center operations well before cloud native was a thing. Hardware was (comparatively) slow, expensive, and often customized. Database servers would have specialized SCSI arrays and caching RAID controllers; compute servers would have extra memory and CPU cores. But there would only be two of these machines per data center: they were expensive, and precious. One outage on an expensive Sun Microsystems server was caused by firmware version mismatches in the drives in the RAID arrays, which prevented the drives from failing over to a SCSI controller in another node. A single failure that should have been tolerated without impact took the entire database offline. Luckily the database was replicated to an identical array in another data center, so the impact was brief.

Because of the scarcity and expense of compute and storage resources back then, my inclination is to understand problems deeply so that they can be prevented. So it was quite eye-opening for me to see the “cattle-not-pets” philosophy in real life in our Slack the other day.

There was an issue with image pulls being very slow in our Azure-hosted development cluster, but they were only slow when served from a specific machine, and only from a pod IP (not from host networking IPs). Rather than running packet captures and analyzing disk queues and CPU I/O wait times, which would have been my inclination, one of our operations engineers just … shot the cow and added a new one to the herd.

From our internal Slack, two engineers (T and U) were chatting:

T: I can just try rebuilding the instance

U: maybe we try to get to the bottom of it first? is there some obvious places/config to check? Like on Azure etc?

T: Cattle, not pets dude :smile:

If replacing it doesn't fix the issue, then we can spend more time trying to solve it, but if it solves the issue then we save a bunch of time. Reducing the issue down to a single node is great work because a single node can be replaced super easily. For all we know the physical host in the Azure data center has issues and we can't do much about that but we can replace the instance easily and then check if the problem persists or not

U: ok, if replacing is fine, let's do it! :+1:

T: OK, it's fully removed now :pray:

And coming back...

and it's green, test away!

U: all good, now it's very fast

That was educational for me, demonstrating the power of cloud-native technologies and thinking. I knew all this in theory: Talos Linux is a Kubernetes operating system that brings Kubernetes principles to the OS. The machine state is reconciled to a declarative configuration, and immutable file systems mean a reboot ensures a pristine state, for example. Yet it was informative to see the practical effect in real operations, which is quite a different way of thinking from what I was used to.

The takeaways for me at least were:

  • You still need good monitoring to identify issues.
  • You need to ensure the issue is isolated to a single node (or small subset of nodes). It’s no good shooting your herd one by one if they all suffer from the same issue.
  • You need monitoring, or another way to validate, that once you have destroyed and replaced a node, the replacement does not suffer from the same issue.
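
As a rough illustration of those three checks, here is a minimal Python sketch. It is not what our engineers ran: the `triage` function, the node names, the 30-second threshold, and the 25% cutoff are all hypothetical, and it assumes your monitoring already gives you a per-node image pull time.

```python
def triage(pull_seconds, slow_threshold_s=30.0, max_fraction=0.25):
    """Decide whether slow nodes are worth replacing before diagnosing.

    pull_seconds: dict of node name -> observed image pull time in seconds
    (assumed to come from whatever monitoring you already have).
    slow_threshold_s and max_fraction are illustrative values, not
    recommendations.
    """
    # Takeaway 1: monitoring tells us which nodes are slow.
    slow = sorted(n for n, s in pull_seconds.items() if s > slow_threshold_s)
    if not slow:
        return "healthy", slow

    # Takeaway 2: only replace-first when the problem is isolated to one
    # node or a small subset; a herd-wide problem needs real diagnosis.
    limit = max(1, int(len(pull_seconds) * max_fraction))
    if len(slow) <= limit:
        return "replace", slow
    return "investigate", slow


if __name__ == "__main__":
    before = {"node-a": 12.0, "node-b": 11.5, "node-c": 95.0, "node-d": 13.0}
    print(triage(before))

    # Takeaway 3: measure again after the replacement joins the cluster.
    after = {"node-a": 12.0, "node-b": 11.5, "node-e": 12.5, "node-d": 13.0}
    print(triage(after))
```

Running the same check before and after the rebuild is the point: if the replacement is still slow, or most of the herd is slow, it is time to reach for the packet captures after all.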

Eliminating unnecessary troubleshooting and diagnosis can be a significant time saver and boost to efficiency. Make sure you take advantage of this boost that the cloud-native mindset enables! And make sure you are running Talos Linux, so you can truly treat your nodes’ operating system as cattle.
