How Roche is on the way to managing thousands of Kubernetes clusters in hospitals and labs worldwide.

The following is a summary of a talk given by Alexander Hungenberg, Technology Lead for Edge Infrastructure at Roche, at Kubernetes Community Days Zurich.

The full video of the talk is available on YouTube.

Roche, one of the world’s largest biotech companies, has been around for 128 years. Companies with such a long history are not usually regarded as leaders in innovation (that is a lot of time for technical debt to build up!), yet Roche is a leader in healthcare R&D and technology innovation.

To give you an idea of the scale of the company, there were 29 billion tests conducted with Roche diagnostic products in 2023.

As part of its innovation mindset, Roche is moving towards targeted medicines and personalized health care, both of which require far more data and advanced analytics. Software is therefore increasingly important to the company, and to medicine in general.

Edge Computing and Roche

The cloud is amazing, but there are limitations and use cases where it is not a good fit – edge computing is one such case.

For Roche, edge computing usually means a laboratory. These laboratories can be found in hospitals or other locations around the world, and are not owned or operated by Roche. Edge computing brings computation and data storage closer to the sources of data. Some tests produce gigabytes of data, which is inefficient to upload to or process in the cloud, especially as bandwidth is often constrained. Privacy is another factor that precludes use of the cloud: the labs hold individual patient data, so it is often not legally possible to send this data to another company’s infrastructure and give up control over what happens to it. Finally, the labs are critical infrastructure: tests have to run and results have to be processed in the middle of a surgery, so compute has to be available locally.

Why is edge hard?

Today, the typical way to deploy software in a lab is to build hardware and bundle the software with the device. This leads to incompatibilities between devices, and makes updating the software hard (often requiring a technician to drive to the equipment with a USB stick).

As a result, many interesting software applications that could advance medical technology are simply not practical.

This paradigm is hard to change, because:

  • The labs are run by hospitals or independent laboratories. They have their own IT staff, or contractors, and all have a unique network setup.
  • Laboratories are deployed worldwide. Deployments require site visits by local staff, often in countries where English is not widely spoken.
  • Connectivity constraints – bandwidth is often limited, and networks are heavily (and inconsistently) firewalled.
  • Cost constraints – small labs in less developed countries cannot afford to pay tens of thousands of dollars for compute.
  • Remote operations – remote access is required for debugging, but it still has to comply with local laws and regulations, which often require data to stay in the country or region.

So how does Roche run modern software at the edge? Kubernetes.

Roche has chosen to deploy Kubernetes at the edge. Why? Kubernetes is now the operating system that applications are built for – applications are no longer built for Windows or Linux, but for Kubernetes. Kubernetes provides horizontal scalability, can run highly available deployments, and has an amazing ecosystem with great documentation.

So Roche built an internal platform team to roll out Kubernetes to labs, hospitals, pharmacies, etc, so that application developers can focus on code.

However – Kubernetes itself is not enough!

Other issues that need to be addressed are:

  • Which hardware to run on
  • Which Operating System to run on
  • How to keep the OS up to date
  • How to prevent configuration drift
  • How to bootstrap new deployments
  • Which Kubernetes distribution to use
  • Which CNI to use
  • How to monitor the cluster and workloads
  • How to remotely connect to debug
  • How to maintain an overview of all clusters out there

Roche uses many open source tools to help address these issues – Thanos, OpenTelemetry, Cilium, MetalLB, etc – and it is the goal of the platform team to hide all this complexity from the end user field service engineers.

One of the key components of the Roche edge deployment is the Talos Linux operating system, “a very awesome operating system.”

Talos Linux is a purpose-built operating system for Kubernetes. It consists of a kernel, an API server binary, and just the minimum needed to run Kubernetes; there is not even a shell. This is excellent from a security perspective. Because the entire operating system is immutable, it also eliminates configuration drift entirely, which greatly increases reliability at remote locations.

Talos Linux and secure boot

Unlike a proper datacenter, labs may not have much physical security – what happens if someone walks into a lab and takes a machine with patient data on it?

Secure Boot with TPM-based full disk encryption solves this. Roche sponsored the implementation of Secure Boot support by Sidero Labs, the company behind Talos Linux.

With Secure Boot, only a signed operating system can load the decryption key from the TPM chip built into the computer. The OS image and kernel are signed, and the BIOS is configured to boot only signed images, so the OS is loaded only if its signature validates. Only that signed OS can then load the keys from the TPM, and only then can it decrypt the data.

This means that even if someone steals the device, all they can do is run the original OS, which will not allow them any access to the data due to the security inherent in Talos Linux (there is no shell; no user accounts; etc).
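The chain described above (measure what boots, release the disk key only if the measurements match) can be illustrated with a toy model. This is a minimal sketch, not how Talos Linux or any real TPM is implemented: `ToyTPM`, `measure_boot`, and the component names are hypothetical, and a real TPM enforces the policy in hardware with a far richer policy language. The only part borrowed from real TPMs is the extend rule, where a register is updated as the hash of its old value concatenated with the hash of the new component.

```python
import hashlib
import hmac
from typing import Iterable, Optional, Tuple

PCR_SIZE = 32  # size of a SHA-256 measurement register


def extend(pcr: bytes, component: bytes) -> bytes:
    """TPM-style extend: new value = SHA-256(old value || SHA-256(component))."""
    return hashlib.sha256(pcr + hashlib.sha256(component).digest()).digest()


def measure_boot(components: Iterable[bytes]) -> bytes:
    """Fold every boot component, in order, into one measurement."""
    pcr = bytes(PCR_SIZE)  # registers start at all zeros
    for component in components:
        pcr = extend(pcr, component)
    return pcr


class ToyTPM:
    """Hypothetical stand-in for a TPM: releases a secret only for one measurement."""

    def __init__(self) -> None:
        self._sealed: Optional[Tuple[bytes, bytes]] = None

    def seal(self, secret: bytes, expected_pcr: bytes) -> None:
        self._sealed = (expected_pcr, secret)

    def unseal(self, current_pcr: bytes) -> bytes:
        if self._sealed is None:
            raise RuntimeError("nothing has been sealed")
        expected_pcr, secret = self._sealed
        if not hmac.compare_digest(expected_pcr, current_pcr):
            raise PermissionError("boot measurements do not match the sealing policy")
        return secret


# Illustrative component names only; these are assumptions, not real artifacts.
signed_boot = [b"bootloader", b"signed-kernel", b"signed-os-image"]
tampered_boot = [b"bootloader", b"signed-kernel", b"attacker-os-image"]

tpm = ToyTPM()
tpm.seal(b"disk-encryption-key", measure_boot(signed_boot))

# The original boot chain reproduces the measurement, so the key is released.
key = tpm.unseal(measure_boot(signed_boot))

# Any change to a component changes the measurement, so the key stays sealed.
try:
    tpm.unseal(measure_boot(tampered_boot))
except PermissionError as err:
    print(f"unseal refused: {err}")
```

In this model a thief who boots a different operating system never sees the key, while booting the original system yields the key but only inside an OS that offers no shell or user accounts to reach it.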

Answers to some of the other questions listed above, and to questions that arose in the talk:

  • Roche is using the Kubernetes that comes with Talos Linux – Talos Linux installs vanilla, upstream, certified Kubernetes. This was an improvement over their prior system, which required separate management, updating and maintenance of both the OS and Kubernetes.
  • Roche uses a lot of open source technologies. For critical components they get commercial support, and this is what they did for Talos Linux.
  • How does Talos help with updating configurations and upgrading? Talos Linux is configured via a declarative YAML file, so they just update the configuration file via an API (for example, to specify a new Kubernetes version), and the system updates itself.
  • How does immutability help? The root file system and operating system are read-only. So for upgrades, Talos boots off a new image that is written to disk, with the ability to fall back if the upgrade is not successful.
  • Backups at the edge: not solved yet. Roche is currently running only stateless workloads, but backups still need to be solved. Options include using a persistent volume manager on Kubernetes, or uploading to external storage (on-prem or in the cloud).
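The declarative update and fallback behavior described in the list above can be sketched as a small reconcile function. This is an illustration under stated assumptions, not the Talos Linux API: the `Node` class, the two image slots, the `os_version` key, and the `boots_ok` health check are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Node:
    """Toy edge node with two image slots; only the active slot is booted."""

    slots: Dict[str, str] = field(default_factory=lambda: {"A": "v1.0", "B": ""})
    active: str = "A"

    @property
    def running_version(self) -> str:
        return self.slots[self.active]


def apply_config(node: Node, desired: dict, boots_ok: Callable[[str], bool]) -> str:
    """Move the node towards the desired state and report what happened."""
    target = desired["os_version"]  # hypothetical config key
    if node.running_version == target:
        return "no-op"  # already in the desired state, nothing to change

    standby = "B" if node.active == "A" else "A"
    node.slots[standby] = target  # write the new image beside the running one

    if boots_ok(target):  # hypothetical post-reboot health check
        node.active = standby
        return "upgraded"
    return "rolled-back"  # the previously active slot was never modified


node = Node()

# A failed upgrade leaves the node on the version it was already running.
print(apply_config(node, {"os_version": "v1.1"}, lambda version: False))
print(node.running_version)

# A healthy upgrade switches the active slot to the new image.
print(apply_config(node, {"os_version": "v1.1"}, lambda version: True))
print(node.running_version)
```

The design point is that the operator only states the desired version. The running image is never modified in place, so a bad upgrade costs a reboot rather than a site visit.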