Autoscale an AKS cluster with Cluster Autoscaler (CA) using multiple node pools (VMSS)
Set environment variables.
$ export AKS_REGION=southeastasia
$ export AKS_CLUSTER_RG=vmssResourceGroup
$ export AKS_CLUSTER_NAME=vmssAKSCluster
# First create a resource group
$ az group create --name $AKS_CLUSTER_RG --location $AKS_REGION
Create an AKS cluster with VMSS-based node pools (preview) and enable the cluster autoscaler [1]
# Now create the AKS cluster with VMSS-backed node pools
# (the cluster autoscaler itself is deployed in the next section)
$ az aks create \
--resource-group $AKS_CLUSTER_RG \
--name $AKS_CLUSTER_NAME \
--enable-vmss \
--node-count 1 \
--generate-ssh-keys \
--kubernetes-version 1.13.5
$ az aks get-credentials --resource-group $AKS_CLUSTER_RG --name $AKS_CLUSTER_NAME
# Add second node pool.
# see detail commands at https://docs.microsoft.com/en-us/cli/azure/ext/aks-preview/aks/nodepool?view=azure-cli-latest
$ az aks nodepool add \
--resource-group $AKS_CLUSTER_RG \
--cluster-name $AKS_CLUSTER_NAME \
--name mynodepool \
--node-count 1 \
--kubernetes-version 1.13.5
# Specify a VM size for a node pool, see VM size in Azure at https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes
# --node-vm-size Standard_NC6
# List node pools.
$ az aks nodepool list --resource-group $AKS_CLUSTER_RG --cluster-name $AKS_CLUSTER_NAME -o table
# Scale a node pool
$ az aks nodepool scale \
--resource-group $AKS_CLUSTER_RG \
--cluster-name $AKS_CLUSTER_NAME \
--name mynodepool \
--node-count 2 \
--no-wait
# Delete a node pool
$ az aks nodepool delete --resource-group $AKS_CLUSTER_RG --cluster-name $AKS_CLUSTER_NAME --name mynodepool --no-wait
Autoscale VMSS instances [2]
# Create a new service principal with the "Contributor" role scoped to the subscription.
$ export SUBSCRIPTION_ID=$(az account show --query id | tr -d '"')
$ export PERMISSIONS=$(az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/$SUBSCRIPTION_ID")
$ export VMSS_RESOURCE_GROUP=$(az aks show --name $AKS_CLUSTER_NAME --resource-group $AKS_CLUSTER_RG -o tsv --query 'nodeResourceGroup')
$ export CLIENT_ID=$(echo $PERMISSIONS | jq -r .appId | tr -d '\n' | base64)
$ export CLIENT_SECRET=$(echo $PERMISSIONS | jq -r .password | tr -d '\n' | base64)
$ export VMSS_RESOURCE_GROUP_BASE64=$(az aks show --name $AKS_CLUSTER_NAME --resource-group $AKS_CLUSTER_RG -o tsv --query 'nodeResourceGroup' | tr -d '\n' | base64)
$ export SUBSCRIPTION_ID_BASE64=$(echo -n $SUBSCRIPTION_ID | base64)
$ export TENANT_ID=$(echo $PERMISSIONS | jq -r .tenant | tr -d '\n' | base64)
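The Secret values consumed by the cluster autoscaler must be base64-encoded without a trailing newline, which is what the exports above produce. As a sanity check of the encoding convention, note that the VMType value dm1zcw== used in the Secret below is simply the string "vmss" encoded:

```shell
# Encode the same way the exports do (-n: suppress the trailing newline)
echo -n "vmss" | base64        # prints dm1zcw== -- the VMType value in the Secret
# Decode to verify the round-trip
echo -n "dm1zcw==" | base64 -d # prints vmss
```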
Modify the following sections to match your environment:
- Fill in the placeholder values for the cluster-autoscaler-azure secret data by base64-encoding each of your Azure credential fields (the exports above do this for you).
- In the cluster-autoscaler spec, find the image: field and replace the placeholder with a specific cluster autoscaler release.
- In the command: section, update the --nodes= arguments to reference your node limits and VMSS name. For example, if node pool "k8s-nodepool-1-vmss" should scale from 1 to 10 nodes:
  - --nodes=1:10:k8s-nodepool-1-vmss
  or, to autoscale multiple VM scale sets:
  - --nodes=1:10:k8s-nodepool-1-vmss
  - --nodes=1:10:k8s-nodepool-2-vmss
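Each --nodes argument uses the min:max:vmss-name format. A small bash sketch (reusing the hypothetical scale-set name from the example above) shows how the three fields split apart:

```shell
# Split a --nodes spec (min:max:vmss-name) into its three fields
spec="1:10:k8s-nodepool-1-vmss"
IFS=: read -r min max vmss <<< "$spec"
echo "min=$min max=$max vmss=$vmss"  # min=1 max=10 vmss=k8s-nodepool-1-vmss
```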
# Make a copy of cluster-autoscaler-vmss.yaml (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/examples/cluster-autoscaler-vmss.yaml)
$ cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["watch", "list", "get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create", "list", "watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames:
      - "cluster-autoscaler-status"
      - "cluster-autoscaler-priority-expander"
    verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system
---
apiVersion: v1
data:
  ClientID: $CLIENT_ID
  ClientSecret: $CLIENT_SECRET
  ResourceGroup: $VMSS_RESOURCE_GROUP_BASE64
  SubscriptionID: $SUBSCRIPTION_ID_BASE64
  TenantID: $TENANT_ID
  VMType: dm1zcw==
kind: Secret
metadata:
  name: cluster-autoscaler-azure
  namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.13.5
          imagePullPolicy: Always
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=3
            - --vmodule=static_autoscaler*=10,azure_*=10
            - --logtostderr=true
            - --cloud-provider=azure
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:aks-nodepool1-41521705-vmss
          env:
            - name: ARM_SUBSCRIPTION_ID
              valueFrom:
                secretKeyRef:
                  key: SubscriptionID
                  name: cluster-autoscaler-azure
            - name: ARM_RESOURCE_GROUP
              valueFrom:
                secretKeyRef:
                  key: ResourceGroup
                  name: cluster-autoscaler-azure
            - name: ARM_TENANT_ID
              valueFrom:
                secretKeyRef:
                  key: TenantID
                  name: cluster-autoscaler-azure
            - name: ARM_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  key: ClientID
                  name: cluster-autoscaler-azure
            - name: ARM_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  key: ClientSecret
                  name: cluster-autoscaler-azure
            - name: ARM_VM_TYPE
              valueFrom:
                secretKeyRef:
                  key: VMType
                  name: cluster-autoscaler-azure
          volumeMounts:
            - mountPath: /etc/ssl/certs/ca-certificates.crt
              name: ssl-certs
              readOnly: true
      restartPolicy: Always
      volumes:
        - hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
            type: ""
          name: ssl-certs
EOF
## CA creates a Kubernetes configMap object to report the actual state of the CA and the AKS cluster.
$ kubectl -n kube-system describe configmap cluster-autoscaler-status
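One way to exercise the autoscaler is to apply a workload whose aggregate CPU requests exceed the current node capacity, leaving pods Pending until CA adds nodes. A hedged sketch (the scale-test name, replica count, and request size are arbitrary assumptions, not from the original walkthrough):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test          # hypothetical name
spec:
  replicas: 20              # enough replicas to overflow a small node pool
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
        - name: nginx
          image: nginx:1.15.9
          resources:
            requests:
              cpu: 500m     # pending pods with unmet requests trigger a scale-up
```

Apply it with kubectl apply -f, then re-run the cluster-autoscaler-status describe command above (or kubectl get nodes -w) to watch the scale-up.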
Helm chart for cluster-autoscaler
Refer to https://github.com/helm/charts/tree/master/stable/cluster-autoscaler
- For Kubernetes PodDisruptionBudget, refer to the Disruptions documentation (https://kubernetes.io/docs/concepts/workloads/pods/disruptions/) and its YAML example.
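A minimal PodDisruptionBudget sketch (the nginx-pdb name and app: nginx selector are hypothetical); with such a budget in place, CA respects it when evicting pods during scale-down:

```yaml
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb           # hypothetical name
spec:
  minAvailable: 1           # keep at least one matching pod during voluntary disruptions
  selector:
    matchLabels:
      app: nginx            # hypothetical label; match your workload
```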
Schedule pods using taints and tolerations
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
aks-gpunodepool-28993262-vmss000000 Ready agent 4m22s v1.12.6
aks-nodepool1-28993262-vmss000000 Ready agent 115m v1.12.6
$ kubectl taint node aks-gpunodepool-28993262-vmss000000 sku=gpu:NoSchedule
The Kubernetes scheduler can use taints and tolerations to restrict what workloads can run on nodes.
- A taint is applied to a node to indicate that only specific pods can be scheduled on it.
- A toleration is then applied to a pod to allow it to tolerate a node's taint.
The following basic example YAML manifest uses a toleration to allow the Kubernetes scheduler to run an NGINX pod on the GPU-based node.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - image: nginx:1.15.9
      name: mypod
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 1
          memory: 2G
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
EOF
$ kubectl describe pod mypod
[...]
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
sku=gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m48s default-scheduler Successfully assigned default/mypod to aks-gpunodepool-28993262-vmss000000
Normal Pulling 4m47s kubelet, aks-gpunodepool-28993262-vmss000000 pulling image "nginx:1.15.9"
Normal Pulled 4m43s kubelet, aks-gpunodepool-28993262-vmss000000 Successfully pulled image "nginx:1.15.9"
Normal Created 4m40s kubelet, aks-gpunodepool-28993262-vmss000000 Created container
Normal Started 4m40s kubelet, aks-gpunodepool-28993262-vmss000000 Started container
Only pods that have this toleration applied can be scheduled on nodes in gpunodepool. Any other pod is scheduled in the nodepool1 node pool.
Reference
[1] Create and manage multiple node pools (vmss) for a cluster in Azure Kubernetes Service (AKS), https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools
[2] Cluster Autoscaler on Azure, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md
[3] What are the parameters to CA, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca