Deploying Talos on AWS with CDK

This post was first published at Steve Yackey’s blog, but we thought it was such a good article, and a great example of how the community is using and building on Talos OS, that with Steve’s permission we are reposting it here.

As someone who has found a lot of uses for Go in both my professional and personal projects, I’ve been excited to explore the dev preview of AWS’s CDK in Go, to put it to the test. In my home lab, I use Talos to run my Kubernetes cluster. It has proven to be the most secure and fun operating system for running Kubernetes that I’ve ever tried, and has an incredibly helpful Slack community too. I’ve enjoyed having no SSH access to manage, no unnecessary packages to patch for vulnerabilities, and I love the ability to control nodes entirely through an API. Recently, I became curious as to how AWS’s CDK in Go could combine with Talos to create a Kubernetes cluster that is scalable, secure, and streamlined.

As with many newer tools, the number of existing libraries is still developing, and I had yet to find one that would enable me to easily test Talos and CDK together, so I decided to construct one of my own: taloscdk. This post will walk you through my small library of flexible CDK constructs, which handle some of the AWS-specific configuration, including what is needed to use AWS load balancers. Many of the needed labels, tags, and configuration options are built into the constructs or included in the README files of example stacks.

The taloscdk construct library has the following example CDK stacks:

  • a single-node cluster that’s great for quickly trying out Talos on AWS
  • a public cluster (public endpoints, publicly available nodes) with a single control plane node (in an autoscaling group) and a single worker node (in an autoscaling group), a Network Load Balancer for the control plane, with easy configuration options to scale up
  • a private cluster (private subnet endpoints, private subnet nodes, private subnet NLB) with 3 control plane nodes and 3 worker nodes (all in autoscaling groups), a private Network Load Balancer, and a bastion host to access the cluster via SSM Session Manager

To view the examples along with the rest of the construct library, you can find them on GitHub at steveyackey/taloscdk.

A Quick Primer on CDK Terms

For those that are new to using CDK, there are a handful of building blocks used to write and deploy Infrastructure as Code (IaC) that may be helpful for you to understand:

  • Resources: the actual underlying AWS resources created from the code
  • Constructs: the basic building blocks of CDK made up of one or more resources, which AWS describes as “cloud components”
  • Stacks: the smallest deployable unit, using one or more constructs
  • Apps: an app is made up of one or more stacks which can be deployed all together or stack-by-stack

We’ll be working with the taloscdk constructs from the repository found at steveyackey/taloscdk to create stacks/apps.

If you haven’t yet worked with CDK in Go, you’ll notice jsii.String(), jsii.Bool(), and jsii.Number() being used frequently. This is because most AWS CDK constructs are created using a properties struct made up primarily of pointers. Each of these functions returns a pointer to the specified value of that type. For more on that, visit https://docs.aws.amazon.com/cdk/latest/guide/work-with-cdk-go.html#go-cdk-idioms.
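To see why pointer fields are useful, here is a minimal sketch that uses only the standard library. The strPtr and boolPtr helpers and the nodeProps struct are hypothetical stand-ins written for this illustration, not the jsii or taloscdk API. The default of true for CreateEIP is also an assumption made for the example. The point is that a nil pointer can mean "not set, use the default", which a plain bool or string cannot express.

```go
package main

import "fmt"

// strPtr is a hypothetical stand-in for jsii.String: it returns a pointer
// to a copy of its argument.
func strPtr(v string) *string { return &v }

// boolPtr is a hypothetical stand-in for jsii.Bool.
func boolPtr(v bool) *bool { return &v }

// nodeProps is an illustrative props struct (not a taloscdk type).
// Pointer fields allow nil to mean "use the default".
type nodeProps struct {
	NodeName  *string
	CreateEIP *bool
}

// createEIP resolves the optional field, assuming a default of true.
func createEIP(p nodeProps) bool {
	if p.CreateEIP == nil {
		return true
	}
	return *p.CreateEIP
}

func main() {
	withDefault := nodeProps{NodeName: strPtr("talos-cp")}
	explicit := nodeProps{NodeName: strPtr("talos-worker"), CreateEIP: boolPtr(false)}
	fmt.Println(createEIP(withDefault), createEIP(explicit))
}
```

With plain value fields, an explicit false and an omitted field would look identical, which is why helper functions that return pointers show up on nearly every line of CDK Go code.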

Key Helpers/Constructs We’ll Be Using

There are a handful of constructs available to deploy Talos, which all build upon AWS constructs. These are just a few of the available ones that I’ve used in my examples, and they should provide a good foundation to build upon.

taloscdk.LoadConfig(fileName *string)

Once you’ve generated configs for your cluster, LoadConfig() can be used to load the config as a *string. Each of the following constructs uses this as a helper for making configuration loading easy.

NewSingleNode(scope constructs.Construct, id *string, props *SingleNodeProps)

NewSingleNode() creates a new single EC2 instance to run Talos. By default, it also creates an Elastic IP address (though it can be disabled). This enables the construct library to create the EIP, get the IP address, and then replace the endpoint in the loaded config with that IP. When using this, you may have to delete the Elastic IP manually afterward, as destroying the stack unassigns it but does not delete it.

NewControlPlane(scope constructs.Construct, id *string, props *ControlPlaneProps)

NewControlPlane() creates a new autoscaling group, and registers the group’s instances to a Network Load Balancer. This is set up to create the NLB first, get the DNS name of the load balancer, and replace the endpoint in the loaded controlplane.yaml file.

NewWorkerASG(scope constructs.Construct, id *string, props *WorkerASGProps)

NewWorkerASG() creates a new autoscaling group to use for your worker nodes. It also has the ability to take in an endpoint to replace the control plane endpoint within its config with the control plane’s NLB.

NewSecurityGroup(scope constructs.Construct, id *string, props *SecurityGroupProps)

NewSecurityGroup() creates a new security group that by default allows access to ports 6443, 50000, 50001, and all internal traffic within the security group. This is typically used for the control plane or a single node.

Requirements

Before we dive in, here are the key requirements that you will need in order to use the taloscdk constructs:

Getting Started

Once you have confirmed those requirements, you are ready to get started on your own CDK project. From within a new directory (I’ll be using one called my-talos-cluster), run:

cdk init --language=go
go get github.com/steveyackey/taloscdk

Congrats! You’ve now got a working CDK project. Now, let’s deploy a simple two-node Kubernetes cluster, running Talos on AWS! We’ll start with one control plane node and one worker, deploying them to the default VPC in the public subnets.

To start, let’s generate some configs for our cluster. At the command line, run:

talosctl gen config talos https://talos.cluster:6443 \
    --with-examples=false --with-docs=false \
    --config-patch='[{"op":"replace", "path":"/machine/kubelet", "value": {"registerWithFQDN": true}},
        {"op":"replace", "path":"/cluster/externalCloudProvider", "value": {
            "enabled": true,
            "manifests": [
                "https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/rbac.yaml", 
                "https://raw.githubusercontent.com/kubernetes/cloud-provider-aws/v1.20.0-alpha.0/manifests/aws-cloud-controller-manager-daemonset.yaml"
            ]
        }}]'

This generates new configs for a cluster named “talos” with the control plane endpoint “https://talos.cluster:6443”. talos.cluster is a placeholder endpoint that we’ll replace with the Elastic IP from our control plane node later.

These configuration files support creating AWS load balancers as Kubernetes resources. To make that work, the config patch adds:

  • the needed manifests
  • registerWithFQDN: true so that AWS can recognize the node
  • externalCloudProvider with enabled: true, which sets the needed kubelet, apiserver, and kube-controller-manager flags.

If you’d like to run a single node cluster instead, you can add this flag to your talosctl gen config command:

    --config-patch-control-plane='[{"op":"replace", "path":"/cluster/allowSchedulingOnMasters", "value":true}]'

Note: When running a single node cluster, the aws-cloud-controller-manager will not be able to create load balancers unless you remove the node-role.kubernetes.io/master role label from the node, as well as the nodeAffinity from the aws-cloud-controller-manager daemonset.

Let’s Launch a Two Node Cluster!

Now that you’ve generated your configuration files, let’s dive into some CDK. At this point, open the .go file with the same name as your current directory. In my case, it’s called my-talos-cluster.go.

We can start by removing the example SNS topic:

// Remove this section:
// as an example, here's how you would define an AWS SNS topic:
	awssns.NewTopic(stack, jsii.String("MyTopic"), &awssns.TopicProps{
		DisplayName: jsii.String("MyCoolTopic"),
	})

You’ll also want to remove the SNS import and add the taloscdk import, resulting in a section that looks like this:

import (
	"github.com/aws/aws-cdk-go/awscdk"
	"github.com/aws/constructs-go/constructs/v3"
	"github.com/aws/jsii-runtime-go"
	"github.com/steveyackey/taloscdk"
)

Now that we’ve got all the necessary imports, we’ll load our control plane config into a variable we can use with other constructs. If you saved your config files into a different directory, you can include the path in the file name.

// The code that defines your stack goes here
	cpConfig, err := taloscdk.LoadConfig("controlplane.yaml")
	if err != nil {
		panic("Could not load talos config")
	}

We’ll then create the control plane node. This will use our config file, and let the construct know to transform the config. The construct will take the endpoint given (which we also used in the talosctl gen config command) and replace it with the Elastic IP it will generate.

cp := taloscdk.NewSingleNode(stack, jsii.String("TalosSingleNodeCluster"), &taloscdk.SingleNodeProps{
		ClusterName:         jsii.String("talos"),
		NodeName:            jsii.String("talos-cp"),
		TalosNodeConfig:     cpConfig,
		TransformConfig:     jsii.Bool(true),
		EndpointToOverwrite: jsii.String("talos.cluster"),
	})

You can find all of the options for SingleNodeProps in single_node.go in the taloscdk repository. In this example, we are taking mostly default settings and transforming the config to replace talos.cluster with the default OverwriteValue (the Elastic IP of the node). Doing so will allow us to configure the instance using the controlplane.yaml as the EC2 UserData (which is loaded on boot).
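Conceptually, the transform is a find-and-replace over the loaded config text. Here is a minimal sketch of that idea using only the standard library. The replaceEndpoint function is hypothetical and is not taloscdk's implementation. In a real stack the Elastic IP is not known until deploy time, so the library presumably substitutes a CloudFormation reference rather than a literal address; the literal IP below (from the documentation range 203.0.113.0/24) is only there to make the example concrete.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceEndpoint swaps every occurrence of the placeholder host in a
// config document for the real address. Hypothetical helper for
// illustration only.
func replaceEndpoint(config, placeholder, value string) string {
	return strings.ReplaceAll(config, placeholder, value)
}

func main() {
	// A tiny fragment standing in for the generated controlplane.yaml.
	cfg := "cluster:\n  controlPlane:\n    endpoint: https://talos.cluster:6443\n"
	fmt.Print(replaceEndpoint(cfg, "talos.cluster", "203.0.113.10"))
}
```

This is also why the placeholder in talosctl gen config should be a string that appears nowhere else in the config: every occurrence gets replaced.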

NewSingleNode() also takes care of subnet and EC2 instance tagging. The ClusterName is used to tag resources with kubernetes.io/cluster/<ClusterName>=owned and the NodeName is used as the name of the instance. As for the subnets, the following tags are added:

For public subnets: kubernetes.io/role/elb=1

For private subnets: kubernetes.io/role/internal-elb=1
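The tagging rules above are simple enough to write down as two small functions. This is a sketch for illustration, with function names of my own choosing; it is not the code taloscdk uses to apply the tags.

```go
package main

import "fmt"

// clusterTag returns the ownership tag applied to cluster resources.
func clusterTag(clusterName string) (key, value string) {
	return "kubernetes.io/cluster/" + clusterName, "owned"
}

// subnetRoleTag returns the load balancer discovery tag for a subnet:
// public subnets host internet-facing load balancers, private subnets
// host internal ones.
func subnetRoleTag(public bool) (key, value string) {
	if public {
		return "kubernetes.io/role/elb", "1"
	}
	return "kubernetes.io/role/internal-elb", "1"
}

func main() {
	k, v := clusterTag("talos")
	fmt.Printf("%s=%s\n", k, v)
	k, v = subnetRoleTag(true)
	fmt.Printf("%s=%s\n", k, v)
	k, v = subnetRoleTag(false)
	fmt.Printf("%s=%s\n", k, v)
}
```

If a load balancer never appears for your service later on, these tags are a good first thing to check on your subnets and instances.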

After creating the control plane node, let’s add a single worker node:

workerConfig, err := taloscdk.LoadConfig("./join.yaml")
	if err != nil {
		panic("Could not load talos config")
	}

	taloscdk.NewSingleNode(stack, jsii.String("TalosWorker"), &taloscdk.SingleNodeProps{
		ClusterName:         jsii.String("talos"),
		NodeName:            jsii.String("talos-worker"),
		TalosNodeConfig:     workerConfig,
		TransformConfig:     jsii.Bool(true),
		EndpointToOverwrite: jsii.String("talos.cluster"),
		OverwriteValue:      cp.GetEIPAddress(),
		SecurityGroup:       cp.SecurityGroup,
		CreateEIP:           jsii.Bool(false),
		IAMRole:             taloscdk.NewWorkerIAMRole(stack, jsii.String("WorkerRole")),
	})

The worker’s SingleNodeProps struct uses the output of the control plane node to get the new endpoint (which is then used to transform the join.yaml file before loading to EC2 UserData). Both nodes produce IAM roles with the permissions needed to create AWS load balancers via the aws-cloud-controller-manager (which we installed as an additional manifest within the externalCloudProvider values).

Now that we’ve got our nodes, you’ll want to include a CloudFormation output to make it easy to get the Elastic IP address that’s been associated with our control plane node:

awscdk.NewCfnOutput(stack, jsii.String("TalosSingleNodeClusterEndpoint"), &awscdk.CfnOutputProps{
		Value:       cp.GetEIPAddress(),
		Description: jsii.String("Use this IP address in your talosconfig as the endpoint and node."),
	})

Lastly, we’ll comment out the return nil within func env() at the bottom, and uncomment the last option to use the account and region associated with your current local AWS credentials. Because this option calls os.Getenv, make sure "os" is in your import list:

	return &awscdk.Environment{
		Account: jsii.String(os.Getenv("CDK_DEFAULT_ACCOUNT")),
		Region:  jsii.String(os.Getenv("CDK_DEFAULT_REGION")),
	}
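The CDK CLI exports CDK_DEFAULT_ACCOUNT and CDK_DEFAULT_REGION when it runs your app, so these values are empty if you run the Go program directly. The sketch below is a small standalone check you could adapt to fail early with a clearer message. The resolveEnv function is a hypothetical helper, not part of the CDK.

```go
package main

import (
	"fmt"
	"os"
)

// resolveEnv reads the account and region variables and reports whether
// both are present. The lookup function is injected so the logic can be
// exercised without touching the real environment.
func resolveEnv(getenv func(string) string) (account, region string, ok bool) {
	account = getenv("CDK_DEFAULT_ACCOUNT")
	region = getenv("CDK_DEFAULT_REGION")
	return account, region, account != "" && region != ""
}

func main() {
	account, region, ok := resolveEnv(os.Getenv)
	if !ok {
		fmt.Println("CDK_DEFAULT_ACCOUNT or CDK_DEFAULT_REGION is not set; run the app through the cdk CLI")
		return
	}
	fmt.Printf("deploying to account %s in %s\n", account, region)
}
```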

Synth and Deploy

Now that we’ve got a CDK app with a stack containing two EC2 instances running Talos, we’re ready to synthesize and deploy our code.

CDK synthesizes down to CloudFormation. To view the CloudFormation output, run:

cdk synth

Next, to deploy our CDK stack, run:

cdk deploy

This will print the IAM and Security Group changes, prompting you as to whether or not you’d like to proceed. After answering y, your stack is created and deployed. If all goes well, at the end you should receive the IP for your control plane node:

 ✅  MyTalosClusterStack

Outputs:
MyTalosClusterStack.TalosSingleNodeClusterEndpoint = <ELASTIC_IP>

arn:aws:cloudformation:us-east-1:12345678910:stack/MyTalosClusterStack/....

Bootstrapping and Testing Your Cluster

You can now replace the endpoint in your talosconfig with the Elastic IP output from the previous step.

After that is complete, check on your control plane node to verify that the Talos API is accessible and that you’re ready to begin bootstrapping:

talosctl --talosconfig talosconfig -n <ELASTIC_IP> dmesg

If this runs successfully, you’re ready to bootstrap your cluster using the following commands:

talosctl --talosconfig talosconfig -n <ELASTIC_IP> bootstrap
talosctl --talosconfig talosconfig -n <ELASTIC_IP> dmesg -f

The -f flag follows the logs so that we can watch until the bootstrap process is complete. Once the boot sequence has finished, you will have a functioning Kubernetes cluster with etcd ready to go. To get your new kubeconfig file, run:

# To create a separate kubeconfig file
talosctl --talosconfig talosconfig -n <ELASTIC_IP> kubeconfig .

# To merge with your existing kubeconfig
talosctl --talosconfig talosconfig -n <ELASTIC_IP> kubeconfig

Next, let’s test out our cluster by deploying nginx and exposing it with a load balancer:

kubectl --kubeconfig=./kubeconfig get nodes
kubectl --kubeconfig=./kubeconfig get pods -A
kubectl --kubeconfig=./kubeconfig create deploy nginx --image=nginx
kubectl --kubeconfig=./kubeconfig expose deploy/nginx --port 80 --type LoadBalancer
kubectl --kubeconfig=./kubeconfig get svc

After getting the service, you will see a load balancer being created. If not, check the logs of the aws-cloud-controller-manager within the kube-system namespace for errors.

Once the instance has registered with the load balancer, you are able to visit the address that ends with elb.amazonaws.com. You can also check the status within the AWS console/CLI.

Cleaning Up

If you’d like to delete the resources you’ve created you can run the following commands:

kubectl --kubeconfig=./kubeconfig delete svc nginx
cdk destroy

After CloudFormation has deleted your stack, it’s best to make sure the Elastic IP has been released. You can view your existing Elastic IPs in the EC2 console under Elastic IPs. At the time of writing, you only pay for EIPs that aren’t associated, and each account can only have 5 per Region by default.

Exploring the Examples

Awesome job! You’ve created a two-node cluster. Now that you’re comfortable with taloscdk, you can explore examples of more complex clusters. At the time of writing, within the /examples directory of the taloscdk repository, there are three example CDK projects to explore:

  • single-node – a similar cluster to the one we just created
  • public-cluster – a new VPC, an autoscaling group for the control plane and workers, an NLB as the control plane endpoint, all within the public subnets
  • private-cluster – a cluster similar to the public cluster, except all resources/endpoints are within the private subnets, with the exception of a new bastion host (for using SSM Session Manager to reach your cluster)

Both the public and private clusters use NewControlPlane() and NewWorkerASG() to create the autoscaling groups for the control plane and worker nodes, along with the NLB for the control plane endpoint. You’ll want to use the control plane nodes’ IP addresses as your talosctl endpoints, and the NLB will be used for any kubectl commands. All of the examples default to a t3.small instance type (matching the minimum specs for Talos) and Talos v0.11.2.

If you’d like to see the full list of properties available, visit:

single_node.go – SingleNodeProps

cluster.go – ControlPlaneProps, WorkerASGProps

Wrapping Up

I really enjoyed working with Talos and CDK in Go, and see many opportunities of how this can be leveraged in the future. The taloscdk constructs are flexible and can help you deploy clusters of many shapes and sizes. I hope you enjoy exploring them!
