Autoscale an AKS cluster with the Cluster Autoscaler (CA) using multiple node pools (VMSS)

Set environment variables.

$ export AKS_REGION=southeastasia
$ export AKS_CLUSTER_RG=vmssResourceGroup
$ export AKS_CLUSTER_NAME=vmssAKSCluster

# First create a resource group
$ az group create --name $AKS_CLUSTER_RG --location $AKS_REGION

Create an AKS cluster with VMSS-based node pools (VMSSPreview) [1]
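
If the VMSS node pool feature is still in preview for your subscription, the aks-preview CLI extension and the VMSSPreview feature flag are likely prerequisites (a sketch; the extension and flag names were current at the time of writing):

$ az extension add --name aks-preview
$ az feature register --name VMSSPreview --namespace Microsoft.ContainerService
# Once the feature shows "Registered", refresh the resource provider
$ az provider register --namespace Microsoft.ContainerService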

# Now create the AKS cluster with a VMSS-backed node pool
$ az aks create \
  --resource-group $AKS_CLUSTER_RG \
  --name $AKS_CLUSTER_NAME \
  --enable-vmss \
  --node-count 1 \
  --generate-ssh-keys \
  --kubernetes-version 1.13.5

$ az aks get-credentials --resource-group $AKS_CLUSTER_RG --name $AKS_CLUSTER_NAME
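
As a quick check, the initial node should now be visible:

$ kubectl get nodes -o wide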

# Add second node pool.
# See detailed commands at https://docs.microsoft.com/en-us/cli/azure/ext/aks-preview/aks/nodepool?view=azure-cli-latest
$ az aks nodepool add \
  --resource-group $AKS_CLUSTER_RG \
  --cluster-name $AKS_CLUSTER_NAME \
  --name mynodepool \
  --node-count 1 \
  --kubernetes-version 1.13.5

# To specify a VM size for the node pool, add e.g.:
#   --node-vm-size Standard_NC6
# See the list of VM sizes in Azure at https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes
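
For example, the GPU node pool that appears later in this document could be added like this (a sketch; the pool name gpunodepool and the Standard_NC6 size are assumptions consistent with the kubectl output shown below):

$ az aks nodepool add \
  --resource-group $AKS_CLUSTER_RG \
  --cluster-name $AKS_CLUSTER_NAME \
  --name gpunodepool \
  --node-count 1 \
  --node-vm-size Standard_NC6 \
  --kubernetes-version 1.13.5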

# List node pools.
$ az aks nodepool list --resource-group $AKS_CLUSTER_RG --cluster-name $AKS_CLUSTER_NAME -o table

# Scale a node pool
$ az aks nodepool scale \
  --resource-group $AKS_CLUSTER_RG \
  --cluster-name $AKS_CLUSTER_NAME \
  --name mynodepool \
  --node-count 2 \
  --no-wait
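
Because --no-wait returns immediately, you can poll the pool's provisioning state:

$ az aks nodepool show \
  --resource-group $AKS_CLUSTER_RG \
  --cluster-name $AKS_CLUSTER_NAME \
  --name mynodepool \
  --query provisioningState -o tsv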

# Delete a node pool
$ az aks nodepool delete --resource-group $AKS_CLUSTER_RG --cluster-name $AKS_CLUSTER_NAME --name mynodepool --no-wait

Autoscale VMSS instances [2]

# Create a new service principal with the "Contributor" role, scoped to the subscription.
$ export SUBSCRIPTION_ID=$(az account show --query id | tr -d '"')
$ export PERMISSIONS=$(az ad sp create-for-rbac --role="Contributor" --scopes="/subscriptions/$SUBSCRIPTION_ID")
$ export VMSS_RESOURCE_GROUP=$(az aks show --name $AKS_CLUSTER_NAME  --resource-group $AKS_CLUSTER_RG -o tsv --query 'nodeResourceGroup')

$ export CLIENT_ID=$(echo $PERMISSIONS | jq -r .appId | tr -d '\n' | base64)
$ export CLIENT_SECRET=$(echo $PERMISSIONS | jq -r .password | tr -d '\n' | base64)
$ export VMSS_RESOURCE_GROUP_BASE64=$(echo -n $VMSS_RESOURCE_GROUP | base64)
$ export SUBSCRIPTION_ID_BASE64=$(echo -n $SUBSCRIPTION_ID | base64)
$ export TENANT_ID=$(echo $PERMISSIONS | jq -r .tenant | tr -d '\n' | base64)
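
You can sanity-check the encoded values before templating the secret, e.g. by decoding one back:

$ echo $CLIENT_ID | base64 --decode; echo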

Modify the following sections to match your environment:

  1. Fill in the placeholder values for the cluster-autoscaler-azure secret data by base64-encoding each of your Azure credential fields (the export commands above already produce these).
  2. In the cluster-autoscaler spec, find the image: field and replace the version placeholder with a specific cluster autoscaler release (the manifest below uses v1.13.5).
  3. In the command: section, update the --nodes= arguments to reference your node limits and your VMSS name (see the lookup sketch after this list).

    For example, if node pool "k8s-nodepool-1-vmss" should scale from 1 to 10 nodes:

    • --nodes=1:10:k8s-nodepool-1-vmss

    or, to autoscale multiple VM scale sets:

    • --nodes=1:10:k8s-nodepool-1-vmss
    • --nodes=1:10:k8s-nodepool-2-vmss
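
The actual VMSS names live in the cluster's node resource group; a quick way to look them up (a sketch using the variables defined earlier):

$ az vmss list --resource-group $VMSS_RESOURCE_GROUP --query '[].name' -o tsv
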
# The manifest below is adapted from cluster-autoscaler-vmss.yaml (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/examples/cluster-autoscaler-vmss.yaml)
$ cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["events", "endpoints"]
    verbs: ["create", "patch"]
  - apiGroups: [""]
    resources: ["pods/eviction"]
    verbs: ["create"]
  - apiGroups: [""]
    resources: ["pods/status"]
    verbs: ["update"]
  - apiGroups: [""]
    resources: ["endpoints"]
    resourceNames: ["cluster-autoscaler"]
    verbs: ["get", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["watch", "list", "get", "update"]
  - apiGroups: [""]
    resources:
      - "pods"
      - "services"
      - "replicationcontrollers"
      - "persistentvolumeclaims"
      - "persistentvolumes"
    verbs: ["watch", "list", "get"]
  - apiGroups: ["extensions"]
    resources: ["replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["policy"]
    resources: ["poddisruptionbudgets"]
    verbs: ["watch", "list"]
  - apiGroups: ["apps"]
    resources: ["statefulsets", "replicasets", "daemonsets"]
    verbs: ["watch", "list", "get"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs", "cronjobs"]
    verbs: ["watch", "list", "get"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["create","list","watch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    resourceNames:
      - "cluster-autoscaler-status"
      - "cluster-autoscaler-priority-expander"
    verbs: ["delete", "get", "update", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: cluster-autoscaler
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    k8s-addon: cluster-autoscaler.addons.k8s.io
    k8s-app: cluster-autoscaler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cluster-autoscaler
subjects:
  - kind: ServiceAccount
    name: cluster-autoscaler
    namespace: kube-system

---
apiVersion: v1
data:
  ClientID: $CLIENT_ID
  ClientSecret: $CLIENT_SECRET
  ResourceGroup: $VMSS_RESOURCE_GROUP_BASE64
  SubscriptionID: $SUBSCRIPTION_ID_BASE64
  TenantID: $TENANT_ID
  VMType: dm1zcw== # base64-encoded "vmss"
kind: Secret
metadata:
  name: cluster-autoscaler-azure
  namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: cluster-autoscaler
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - image: k8s.gcr.io/cluster-autoscaler:v1.13.5
          imagePullPolicy: Always
          name: cluster-autoscaler
          resources:
            limits:
              cpu: 100m
              memory: 300Mi
            requests:
              cpu: 100m
              memory: 300Mi
          command:
            - ./cluster-autoscaler
            - --v=3
            - --vmodule=static_autoscaler*=10,azure_*=10
            - --logtostderr=true
            - --cloud-provider=azure
            - --skip-nodes-with-local-storage=false
            - --nodes=1:10:aks-nodepool1-41521705-vmss
          env:
            - name: ARM_SUBSCRIPTION_ID
              valueFrom:
                secretKeyRef:
                  key: SubscriptionID
                  name: cluster-autoscaler-azure
            - name: ARM_RESOURCE_GROUP
              valueFrom:
                secretKeyRef:
                  key: ResourceGroup
                  name: cluster-autoscaler-azure
            - name: ARM_TENANT_ID
              valueFrom:
                secretKeyRef:
                  key: TenantID
                  name: cluster-autoscaler-azure
            - name: ARM_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  key: ClientID
                  name: cluster-autoscaler-azure
            - name: ARM_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  key: ClientSecret
                  name: cluster-autoscaler-azure
            - name: ARM_VM_TYPE
              valueFrom:
                secretKeyRef:
                  key: VMType
                  name: cluster-autoscaler-azure
          volumeMounts:
            - mountPath: /etc/ssl/certs/ca-certificates.crt
              name: ssl-certs
              readOnly: true
      restartPolicy: Always
      volumes:
        - hostPath:
            path: /etc/ssl/certs/ca-certificates.crt
            type: ""
          name: ssl-certs
EOF

# CA creates a Kubernetes configMap object to report the actual state of the CA and the AKS cluster.
$ kubectl -n kube-system describe configmap cluster-autoscaler-status
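
To verify the autoscaler itself is healthy, check its pod and recent logs:

$ kubectl -n kube-system get pods -l app=cluster-autoscaler
$ kubectl -n kube-system logs -l app=cluster-autoscaler --tail=20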

Helm chart for cluster-autoscaler

Refer to https://github.com/helm/charts/tree/master/stable/cluster-autoscaler
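
A minimal install might look like the following (a hedged sketch: value names such as cloudProvider, azureVMType, and autoscalingGroups are assumptions taken from the chart's README at the time and should be verified there; the VMSS name is the example one from the manifest above):

$ helm install stable/cluster-autoscaler \
    --name cluster-autoscaler \
    --namespace kube-system \
    --set cloudProvider=azure \
    --set azureVMType=vmss \
    --set "autoscalingGroups[0].name=aks-nodepool1-41521705-vmss,autoscalingGroups[0].minSize=1,autoscalingGroups[0].maxSize=10"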

Schedule pods using taints and tolerations

$ kubectl get nodes
NAME                                 STATUS   ROLES   AGE     VERSION
aks-gpunodepool-28993262-vmss000000  Ready    agent   4m22s   v1.12.6
aks-nodepool1-28993262-vmss000000    Ready    agent   115m    v1.12.6

$ kubectl taint node aks-gpunodepool-28993262-vmss000000 sku=gpu:NoSchedule
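
Confirm the taint landed on the node:

$ kubectl describe node aks-gpunodepool-28993262-vmss000000 | grep Taints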

The Kubernetes scheduler can use taints and tolerations to restrict which workloads can run on nodes.

  • A taint is applied to a node to indicate that only specific pods can be scheduled on it.
  • A toleration is then applied to a pod, allowing it to tolerate a node's taint.

The following basic example YAML manifest uses a toleration to allow the Kubernetes scheduler to run an NGINX pod on the GPU-based node.

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - image: nginx:1.15.9
    name: mypod
    resources:
      requests:
        cpu: 100m
        memory: 128Mi
      limits:
        cpu: 1
        memory: 2G
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
EOF
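
To confirm the scheduler placed the pod on the GPU node:

$ kubectl get pod mypod -o wide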

$ kubectl describe pod mypod
[...]
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 sku=gpu:NoSchedule
Events:
  Type    Reason     Age    From                                          Message
  ----    ------     ----   ----                                          -------
  Normal  Scheduled  4m48s  default-scheduler                             Successfully assigned default/mypod to aks-gpunodepool-28993262-vmss000000
  Normal  Pulling    4m47s  kubelet, aks-gpunodepool-28993262-vmss000000  pulling image "nginx:1.15.9"
  Normal  Pulled     4m43s  kubelet, aks-gpunodepool-28993262-vmss000000  Successfully pulled image "nginx:1.15.9"
  Normal  Created    4m40s  kubelet, aks-gpunodepool-28993262-vmss000000  Created container
  Normal  Started    4m40s  kubelet, aks-gpunodepool-28993262-vmss000000  Started container

Only pods that have this toleration applied can be scheduled on nodes in gpunodepool. Any other pod is scheduled in the nodepool1 node pool.

Reference

[1] Create and manage multiple node pools (vmss) for a cluster in Azure Kubernetes Service (AKS), https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools
[2] (GOOD) Cluster Autoscaler on Azure, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md
[3] What are the parameters to CA, https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca
