I currently have multiple AWS accounts, each with its own Kubernetes cluster. Unfortunately, when the clusters were initially deployed with kops, the VPCs were created with overlapping CIDR blocks. That normally wouldn't be a problem, as each cluster essentially existed in its own universe.
Things have changed a bit, and now we want to implement cross-account VPC peering. The idea is that users connect over the VPN and have access to all resources through that peering. My understanding is that the CIDR block overlap is going to be a major problem when peering is implemented.
It doesn't seem like you can just change the CIDR block of an existing cluster. Is my only option to back up and restore the cluster in a new VPC with something like Ark? Has anyone gone through a full cluster migration? I'd be curious whether there is a better answer.
Your understanding is correct: with kops, you can't change the CIDR blocks of an existing cluster; it's stuck in the VPC in which it was created, and you can't change the primary CIDR block of a VPC:
The IP address range of a VPC is made up of the CIDR blocks associated with it. You select one CIDR block when you create the VPC, and you can add or remove secondary CIDR blocks later. The CIDR block that you add when you create the VPC cannot be changed, but you can add and remove secondary CIDR blocks to change the IP address range of the VPC. (emphasis mine)
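Before planning the peering, it's worth confirming exactly which CIDR blocks each VPC carries. A quick, read-only check with the AWS CLI could look like the following (the VPC IDs and profile names are placeholders):

# Account/cluster A
aws ec2 describe-vpcs --profile account-a --vpc-ids vpc-abc123 \
    --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock' --output text
# Account/cluster B
aws ec2 describe-vpcs --profile account-b --vpc-ids vpc-def456 \
    --query 'Vpcs[].CidrBlockAssociationSet[].CidrBlock' --output text

If the ranges overlap, AWS will refuse to activate a peering connection between the two VPCs, which is why the cluster has to be rebuilt in a VPC with a distinct range.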
That leads us to the second point: migrating your cluster. This can be broken down into two phases:
1. Migrating the infrastructure managed by kops
You will need to migrate (i.e. recreate) the kops cluster itself: the EC2 instances, the kops InstanceGroup and Cluster objects, the various pieces of AWS infrastructure, etc. For that, you can use the kops toolbox template command:
kops toolbox template --values /path/to/values.yaml --template /path/to/cluster/template.yaml > /path/to/output/cluster.yaml
kops create -f /path/to/output/cluster.yaml
This is a Helm-like tool that lets you templatize your kops cluster configuration and pass in different values.yaml files. You might want to wrap this command in a small shell script or a Makefile for 1-click cluster deployments, so the k8s cluster infrastructure can be stood up easily and repeatably; a rough sketch of such a wrapper follows.
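A minimal sketch of that wrapper, assuming one values file per cluster and that KOPS_STATE_STORE is already exported (the file layout and names here are hypothetical):

#!/usr/bin/env bash
# create-cluster.sh -- hypothetical wrapper around kops toolbox template
set -euo pipefail

CLUSTER_SUBDOMAIN="$1"                       # e.g. "staging"
SUBNET_CIDR="$2"                             # e.g. "172.24.32.0/19" -- must not overlap with peered VPCs
VALUES_FILE="values/${CLUSTER_SUBDOMAIN}.yaml"

# The template reads these via (env "CLUSTER_SUBDOMAIN") and (env "SUBNET_CIDR")
export CLUSTER_SUBDOMAIN SUBNET_CIDR

# Render the Cluster/InstanceGroup specs and register them in the kops state store
kops toolbox template --values "${VALUES_FILE}" --template template.yaml > "rendered-${CLUSTER_SUBDOMAIN}.yaml"
kops create -f "rendered-${CLUSTER_SUBDOMAIN}.yaml"

# Add an SSH key and actually build the AWS resources
kops create secret --name "${CLUSTER_SUBDOMAIN}.k8s.example.io" sshpublickey admin -i ~/.ssh/id_rsa.pub
kops update cluster "${CLUSTER_SUBDOMAIN}.k8s.example.io" --yes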
A sample cluster template.yaml file and values.yaml file might look like the following, which include the specs for the Cluster and for the master, worker, and autoscale InstanceGroups.
# template.yaml
{{ $clusterSubdomain := (env "CLUSTER_SUBDOMAIN") }}
{{ $subnetCidr := (env "SUBNET_CIDR") }}
apiVersion: kops/v1alpha2
kind: Cluster
metadata:
  name: {{ $clusterSubdomain }}.k8s.example.io
spec:
  hooks:
  - manifest: |
      [Unit]
      Description=Create example user
      ConditionPathExists=!/home/example/.ssh/authorized_keys
      [Service]
      Type=oneshot
      ExecStart=/bin/sh -c 'useradd example && echo "{{ .examplePublicKey }}" > /home/example/.ssh/authorized_keys'
    name: useradd-example.service
    roles:
    - Node
    - Master
  - manifest: |
      [Service]
      Type=oneshot
      ExecStart=/usr/bin/coreos-cloudinit --from-file=/home/core/cloud-config.yaml
    name: reboot-window.service
    roles:
    - Node
    - Master
  kubeAPIServer:
    authorizationRbacSuperUser: admin
    featureGates:
      TaintBasedEvictions: "true"
  kubeControllerManager:
    featureGates:
      TaintBasedEvictions: "true"
    horizontalPodAutoscalerUseRestClients: false
  kubeScheduler:
    featureGates:
      TaintBasedEvictions: "true"
  kubelet:
    featureGates:
      TaintBasedEvictions: "true"
  fileAssets:
  - content: |
      yes
    name: docker-1.12
    path: /etc/coreos/docker-1.12
    roles:
    - Node
    - Master
  - content: |
      #cloud-config
      coreos:
        update:
          reboot-strategy: "etcd-lock"
        locksmith:
          window-start: {{ .locksmith.windowStart }}
          window-length: {{ .locksmith.windowLength }}
    name: cloud-config.yaml
    path: /home/core/cloud-config.yaml
    roles:
    - Node
    - Master
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: s3://my-bucket.example.io/{{ $clusterSubdomain }}.k8s.example.io
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-{{ .zone }}
      name: a
    name: main
  - etcdMembers:
    - instanceGroup: master-{{ .zone }}
      name: a
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubernetesApiAccess:
  - {{ .apiAccessCidr }}
  kubernetesVersion: {{ .k8sVersion }}
  masterPublicName: api.{{ $clusterSubdomain }}.k8s.example.io
  networkCIDR: {{ .vpcCidr }}
  networkID: {{ .vpcId }}
  networking:
    canal: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  sshAccess:
  - {{ .sshAccessCidr }}
  subnets:
  - cidr: {{ $subnetCidr }}
    name: {{ .zone }}
    type: Public
    zone: {{ .zone }}
  topology:
    dns:
      type: Public
    masters: public
    nodes: public
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: {{ $clusterSubdomain }}.k8s.example.io
  name: master-{{ .zone }}
spec:
{{- if .additionalSecurityGroups }}
  additionalSecurityGroups:
  {{- range .additionalSecurityGroups }}
  - {{ . }}
  {{- end }}
{{- end }}
  image: {{ .image }}
  machineType: {{ .awsMachineTypeMaster }}
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-{{ .zone }}
  role: Master
  subnets:
  - {{ .zone }}
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: {{ $clusterSubdomain }}.k8s.example.io
  name: nodes
spec:
{{- if .additionalSecurityGroups }}
  additionalSecurityGroups:
  {{- range .additionalSecurityGroups }}
  - {{ . }}
  {{- end }}
{{- end }}
  image: {{ .image }}
  machineType: {{ .awsMachineTypeNode }}
  maxSize: {{ .nodeCount }}
  minSize: {{ .nodeCount }}
  nodeLabels:
    kops.k8s.io/instancegroup: nodes
  role: Node
  subnets:
  - {{ .zone }}
---
apiVersion: kops/v1alpha2
kind: InstanceGroup
metadata:
  name: ag.{{ $clusterSubdomain }}.k8s.example.io
  labels:
    kops.k8s.io/cluster: {{ $clusterSubdomain }}.k8s.example.io
spec:
{{- if .additionalSecurityGroups }}
  additionalSecurityGroups:
  {{- range .additionalSecurityGroups }}
  - {{ . }}
  {{- end }}
{{- end }}
  image: {{ .image }}
  machineType: {{ .awsMachineTypeAg }}
  maxSize: 10
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: ag.{{ $clusterSubdomain }}.k8s.example.io
  role: Node
  subnets:
  - {{ .zone }}
And the values.yaml file:
# values.yaml:
region: us-west-2
zone: us-west-2a
environment: staging
image: ami-abc123
awsMachineTypeNode: c5.large
awsMachineTypeMaster: m5.xlarge
awsMachineTypeAg: c5.large
nodeCount: "2"
k8sVersion: "1.9.3"
vpcId: vpc-abc123
vpcCidr: 172.23.0.0/16
apiAccessCidr: <e.g. office ip>
sshAccessCidr: <e.g. office ip>
additionalSecurityGroups:
- sg-def234 # kubernetes-standard
- sg-abc123 # example scan engine targets
examplePublicKey: "ssh-rsa ..."
locksmith:
  windowStart: Mon 16:00 # 8am Monday PST
  windowLength: 4h
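For the migration itself, the values that matter most are the network ones: the new VPC and its subnets need CIDR ranges that don't collide with the VPCs you intend to peer with. Purely as an illustration (these IDs and ranges are made up), the overrides for the rebuilt cluster might be:

# values override for the new, non-overlapping VPC (illustrative values only)
vpcId: vpc-def456           # the replacement VPC
vpcCidr: 172.24.0.0/16      # distinct from the other account's 172.23.0.0/16
# pass a matching subnet range, e.g. SUBNET_CIDR=172.24.32.0/19, when rendering the template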
2. Migrating the workloads on the cluster
I don't have any hands-on experience with Ark, but it does seem to fit your use case well:
Cluster migration
Using Backups and Restores
Heptio Ark can help you port your resources from one cluster to another, as long as you point each Ark Config to the same cloud object storage. In this scenario, we are also assuming that your clusters are hosted by the same cloud provider. Note that Heptio Ark does not support the migration of persistent volumes across cloud providers.
1. (Cluster 1) Assuming you haven't already been checkpointing your data with the Ark schedule operation, you need to first back up your entire cluster (replacing <BACKUP-NAME> as desired):
ark backup create <BACKUP-NAME>
The default TTL is 30 days (720 hours); you can use the --ttl flag to change this as necessary.
2. (Cluster 2) Make sure that the persistentVolumeProvider and backupStorageProvider fields in the Ark Config match the ones from Cluster 1, so that your new Ark server instance is pointing to the same bucket.
3. (Cluster 2) Make sure that the Ark Backup object has been created. Ark resources are synced with the backup files available in cloud storage.
4. (Cluster 2) Once you have confirmed that the right Backup (<BACKUP-NAME>) is now present, you can restore everything with:
ark restore create --from-backup <BACKUP-NAME>
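Once the second cluster's Ark server is pointing at the same bucket, you can sanity-check that the backup has synced before restoring; with the standard Ark CLI that's roughly:

# On Cluster 2: the backup created on Cluster 1 should appear once the sync runs
ark backup get
# After kicking off the restore, watch its status
ark restore get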
Configuring Ark on AWS clusters seems straightforward enough: https://github.com/heptio/ark/blob/master/docs/aws-config.md.
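For reference, the piece that ties the two clusters together is the Ark Config object mentioned in the quote above; a minimal sketch for AWS, applied to both clusters with a placeholder bucket and region (adjust per the linked doc), might look like:

# ark-config.yaml -- same bucket on Cluster 1 and Cluster 2 (sketch only)
apiVersion: ark.heptio.com/v1
kind: Config
metadata:
  namespace: heptio-ark
  name: default
persistentVolumeProvider:
  name: aws
  config:
    region: us-west-2
backupStorageProvider:
  name: aws
  bucket: my-ark-backups      # placeholder -- shared by both Ark installs
  config:
    region: us-west-2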
With some initial setup of the kops toolbox template wrapper and the Ark configuration, you should have a clean, repeatable way to migrate your cluster into the new VPC and turn your pets into cattle, as the meme goes.