Azure Durable Functions & KEDA on AKS for event-driven compute & auto-scaling

I had a customer ask me about how to improve the performance of an Azure Durable Function-based Python system they had built. Durable Functions provide an orchestration system for you so you can focus on your business logic instead of having to also code orchestration (starts, stops, monitoring, etc.). They were concerned that their code in the Durable Function activity function was too computationally expensive & couldn’t be broken up further.

Therefore, I suggested they move the computationally expensive part of their application out of Azure Functions and put it on something with more horsepower, such as Azure Kubernetes Service, where they could pick a SKU that would meet their requirements.

Check out my GitHub repo that contains all the code needed to deploy this solution. There is also a GitHub repo version that uses Azure Container Instances.

Orchestration

The orchestration function manages the overall work process. Each step reports its status back to the orchestration function. It keeps track of the status of all work.

Here is the overall flow:

  1. An HTTP request is received at the Durable Function endpoint.
  2. The orchestration function kicks off.
  3. The orchestration function generates some input data and begins the orchestration.
  4. For each input data block, a Compute function is called.
  5. Each Compute function writes an input blob to the Azure Blob Storage input container & a message to the Azure Storage Queue with the path to the file.
    • Note that the Compute function doesn’t call any actual computation function; all it does is put an input file in storage & the path to that file in the queue. The Compute function therefore doesn’t have to know anything about how the computation actually occurs. They are loosely coupled.
  6. KEDA, running in the AKS cluster, sees the new messages in the Azure Storage Queue and begins to spin up containers to process them.
  7. The Azure Kubernetes Service containers will pull each message off the queue, process the input data & write output data back to blob storage.
  8. When an output blob is created in the Azure Blob Storage output container, a Microsoft.Storage.BlobCreated event is emitted. Another Azure Function is triggered to process this event.
  9. That Azure Function raises an external event so the orchestration function is notified that a computation is complete.
  10. After all the computations are complete (meaning all of the external events have been raised), the orchestration function reports that its status is Complete.
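
The fan-out/fan-in flow above can be modeled with a plain Python generator. This is only a sketch of the control flow; a real implementation would use the azure-functions-durable package, and the names Compute and ComputationComplete here are illustrative, not taken from the repo.

```python
def compute_orchestrator(context, input_count):
    """A runnable model of steps 3-10: fan out one Compute activity per
    input block, then wait for one completion event per block."""
    # Fan out: each Compute activity only writes an input blob and a queue
    # message with its path; it performs no computation itself.
    for block_id in range(input_count):
        yield ("call_activity", "Compute", block_id)
    # Fan in: the blob-created Azure Function raises one external event
    # per output blob that appears in the output container.
    for _ in range(input_count):
        yield ("wait_for_external_event", "ComputationComplete")
    return "Complete"
```

The orchestrator never does the expensive work itself; it only schedules activities and then blocks until every completion event has arrived.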

Container compute

Each container is self-contained and does not communicate with the other containers. It only reads the next message from the Azure Storage Queue and processes it. The Azure Storage Queue ensures a message is only processed by one container; if that container fails to complete the processing, the message is put back on the queue and processed by another container. Once a container has received a message, it downloads the input blob that the message points to from Azure Blob Storage. It then computes the result and writes it to a different Azure Blob Storage container. Finally, it acknowledges success by deleting the message from the Storage Queue and deletes the original input data.
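
One iteration of that container loop looks roughly like this. Plain dicts and lists stand in for the Azure Storage Queue and Blob containers so the sketch is self-contained; a real worker would use the azure-storage-queue and azure-storage-blob SDKs, where an unacknowledged message becomes visible again after its visibility timeout.

```python
def process_next_message(queue, input_blobs, output_blobs, compute):
    """Take the next queue message, download the input blob it names,
    compute, upload the result, then clean up. Returns False when the
    queue is empty (the container simply polls again)."""
    if not queue:
        return False
    message = queue[0]                       # peek; message stays visible until success
    blob_path = message["blobPath"]
    data = input_blobs[blob_path]            # download the input blob
    output_blobs[blob_path] = compute(data)  # write result to the output container
    queue.pop(0)                             # delete the message only after success
    del input_blobs[blob_path]               # delete the original input data
    return True
```

Because the message is deleted only after the output blob is written, a worker that crashes mid-computation leaves the message for another container to retry.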

[Figure: container compute flow]

Execute

The Azure Durable Function has been set up to respond to an HTTP trigger. Run a curl command to kick it off (note the inputCount query parameter; customize as needed):

curl --request POST --url https://func-funcAksKeda-ussc-demo.azurewebsites.net/api/orchestrators/ComputeOrchestrator?inputCount=1000 --header "Content-Length: 0"

You will get an output with a unique identifier for the orchestration run & some URIs to get the status.

{
    "id": "33842dacb38d4b20b1a04ebc7463325c",
    "statusQueryGetUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c?taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw==",
    "sendEventPostUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c/raiseEvent/{eventName}?taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw==",
    "terminatePostUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c/terminate?reason={text}&taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw==",
    "rewindPostUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c/rewind?reason={text}&taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw==",
    "purgeHistoryDeleteUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c?taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw==",
    "restartPostUri": "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c/restart?taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw=="
}

You can query the status by curling the statusQueryGetUri.

curl "https://func-funcakskeda-ussc-demo.azurewebsites.net/runtime/webhooks/durabletask/instances/33842dacb38d4b20b1a04ebc7463325c?taskHub=funcfuncAksKedausscdemo&connection=Storage&code=OGWAo5TPdjX6PTg6zP1G5Xaxmzr8zwJTPlf2voKRTETmt4ESRb75hw=="
{
  "name":"ComputeOrchestrator",
  "instanceId":"33842dacb38d4b20b1a04ebc7463325c",
  "runtimeStatus":"Running",
  "input":null,
  "customStatus":null,
  "output":null,
  "createdTime":"2021-10-01T13:36:25Z",
  "lastUpdatedTime":"2021-10-01T13:36:26Z"
}
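
If you are polling programmatically, a small helper can decide when to stop. This assumes the standard Durable Functions runtimeStatus values; the JSON shape matches the statusQueryGetUri response above.

```python
import json

# Terminal states from the Durable Functions runtime-status enumeration.
TERMINAL_STATES = {"Completed", "Failed", "Terminated", "Canceled"}

def orchestration_done(status_json: str) -> bool:
    """Parse a statusQueryGetUri response body and report whether the
    orchestration has reached a terminal state."""
    return json.loads(status_json)["runtimeStatus"] in TERMINAL_STATES
```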

Here are the results, shown in Azure Storage Explorer.

[Figure: orchestration events in Azure Storage Explorer]

Scaling

However, this meant they would be paying for the compute to keep their cluster nodes alive even when they weren’t using them. A key advantage of their Azure Functions approach was the cost savings when nothing was running; their system was very batch-like and wouldn’t be running all the time. Therefore, they wanted a system that would scale up as needed (and scale down to zero when not needed).

I asked a fellow CSA (Joseph Masengesho) about what he might do in this situation, and he recommended KEDA. He even wrote a blog post about it!

KEDA

KEDA is a service that gives you event-driven autoscaling in Kubernetes. It listens to an event source you point it at (such as an Azure Storage Queue) and scales pods up and down based upon the configuration options you provide. You can specify the minimum number of pods (including zero), the maximum number of pods & the target queue length (roughly, the number of pending messages each pod should handle).

This is specified in a Helm chart.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: compute-scaled-object
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: compute-deployment
  pollingInterval: {{ .Values.scale.pollingInterval }}
  cooldownPeriod: {{ .Values.scale.cooldownPeriod }}
  minReplicaCount: {{ .Values.scale.minReplicaCount }}
  maxReplicaCount: {{ .Values.scale.maxReplicaCount }}
  triggers:
  - type: azure-queue
    metadata:
      queueName: {{ .Values.storage.inputQueueName }}
      queueLength: '{{ .Values.scale.targetLengthOfQueue }}'
      accountName: {{ .Values.storage.storageAccountName }}
      cloud: AzurePublicCloud
    authenticationRef:
      name: azure-queue-auth
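
The authenticationRef above points at a KEDA TriggerAuthentication. One common shape, sketched below, pulls the storage connection string from a Kubernetes secret; the secret name and key here are illustrative, not taken from the repo.

```yaml
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: azure-queue-auth
spec:
  secretTargetRef:
    - parameter: connection        # the azure-queue scaler's auth parameter
      name: storage-connection     # illustrative Kubernetes secret name
      key: connectionString        # illustrative key within that secret
```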

Used in conjunction with the node pool auto-scaler, this solution scales up as more messages are put on the queue and scales down as messages are processed.
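
The net effect can be approximated in a few lines. This is a simplification of the real KEDA/HPA math, using the same parameters as the ScaledObject above.

```python
import math

def desired_replicas(queue_depth, target_queue_length, min_replicas, max_replicas):
    """Approximate the scaling decision for a queue trigger: aim for
    roughly target_queue_length pending messages per pod, clamped to the
    configured bounds (a minReplicaCount of 0 gives scale-to-zero)."""
    wanted = math.ceil(queue_depth / target_queue_length)
    return max(min_replicas, min(wanted, max_replicas))
```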

This is exciting because I can specify only a few parameters to control the scaling and KEDA does everything else for me.

Web App

I built a demo web app that shows the AKS cluster scaling based upon the length of the queue. It generates dummy input data and displays the number of work items, the number of AKS pods & the number of node pools serving your code.

Submit a web request to start the Azure Durable Function:

curl -X POST https://func-funcAksKeda-ussc-demo.azurewebsites.net/api/orchestrators/ComputeOrchestrator?inputCount=100 -H 'Content-Length: 0'
[Figure: web app scaling chart]
