v1.2.1

@davidporter-id-au davidporter-id-au released this 19 Sep 03:56
· 3 commits to 1.2.x since this release
0e17485

Project release: Zonal isolation

This version introduces a few resiliency concepts into customers' worker task processing such that they can detect deployment or configuration failures earlier. These features are opt-in.

The high-level concept is to provide a means of subdividing work for workers into groups (called 'isolation-groups') along whatever partitioning mechanism your service requires.

By default, the provided partitioning mechanism attempts to keep workflows running in the location where they were started, so that customers can identify broken changes earlier rather than waiting for an entire region to be deployed. However, if no pollers are available in that subdivision, it routes the work elsewhere.

Nomenclature

Partitioning: A means of subdividing the tasks given to workflows. Many schemes are possible, and one default is provided. When a workflow is started, a group of partition keys is provided via request headers. The partition keys are used to determine which isolation-group of workers should process the workflow.
Workflow pinning: A partitioning scheme which emphasizes keeping workflows running in the location where they were started.
Isolation-groups: A division of work within a customer region, in which customers can subdivide their workers and pin workflows. This was originally intended as a synonym for 'zone' in the site-reliability sense, as a subdivision of a region. However, the important point is that an isolation-group is a failure domain for customer workflows, so it may be an arbitrary subdivision of your cluster's traffic.
Isolation-group drain: A means of excluding work from an isolation-group. If an isolation-group is drained, workers in that isolation-group will not receive any tasks, and customers cannot start workflows from that isolation-group.

Default concepts and approaches

The partitioning and isolation concepts are intended to be flexible, general-purpose orchestration concepts, with some basic defaults provided. By default, the following behaviour is given:

  • Partition data is persisted with workflow execution records by the provided middleware if the provided header is passed when workflows are created.
  • The Cadence client and worker Go libraries will pass these as headers if they are provided in client options.

Pinning behaviour

The workflow's original zone is captured when the workflow starts and is used during workflow processing.

The default partitioner provides the following behaviour: it attempts to dispatch work in the zone where the workflow was started. However, workers may not be available in that zone, or may no longer be available for some reason. The partitioner therefore takes a lookback of poller information and uses it to ensure that the workflow can be processed. If the start isolation-group is not available, it picks another healthy one at random.

'Health', here, is determined as the presence of pollers and the absence of drains.
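
The selection rule described above can be sketched in Go as follows. This is an illustrative sketch only, not Cadence's actual partitioner: the `GroupState` type, the `pickIsolationGroup` function and the injected `intn` random source are hypothetical names, and the notes do not say what happens when no group is healthy, so the sketch simply reports that case to the caller.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// GroupState is a hypothetical snapshot of one isolation-group, built from
// the poller lookback and the current drain list.
type GroupState struct {
	HasRecentPollers bool // at least one poller seen in the lookback window
	Drained          bool // the group has been drained
}

// healthy applies the definition given in the notes:
// pollers are present and there is no drain.
func (g GroupState) healthy() bool {
	return g.HasRecentPollers && !g.Drained
}

// pickIsolationGroup prefers the group the workflow was started in and
// falls back to a randomly chosen healthy group. ok is false when no group
// is healthy (an assumption: the notes do not cover this case).
func pickIsolationGroup(start string, groups map[string]GroupState, intn func(int) int) (group string, ok bool) {
	if s, found := groups[start]; found && s.healthy() {
		return start, true
	}
	var candidates []string
	for name, s := range groups {
		if s.healthy() {
			candidates = append(candidates, name)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}
	// Map iteration order is unspecified, so sort to let intn alone decide.
	sort.Strings(candidates)
	return candidates[intn(len(candidates))], true
}

func main() {
	groups := map[string]GroupState{
		"zone-1": {HasRecentPollers: true, Drained: true}, // drained
		"zone-2": {HasRecentPollers: true},                // healthy
		"zone-3": {HasRecentPollers: false},               // no pollers
	}
	// The workflow started in zone-1, which is drained, so it is unpinned.
	group, ok := pickIsolationGroup("zone-1", groups, rand.Intn)
	fmt.Println(group, ok) // prints "zone-2 true"
}
```

Passing the random source in as a function keeps the fallback testable; a caller would normally pass `rand.Intn`.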

The 'unpinning' is important for two main reasons. Firstly, it is quite possible to start a workflow from an isolation-group unrelated to the one in which the pollers are running, and suddenly blackholing that work would likely not be the desired behaviour. Secondly, and probably more importantly, it prevents a head-of-line blocking problem internally for Cadence. At the database level (in this release, anyway) tasks need to be dispatched in order, so if an isolation-group were not being processed, its tasks would block task processing.

Drains

This release also introduces a simple notion of drains, which allow isolation-groups to be excluded from traffic processing should that be required. Drains can be issued via the Admin API or via the CLI:

For example:

cadence admin isolation-groups update-global --set-drains zone-1
cadence admin isolation-groups get-global

This information is stored in the config-store and is not part of dynamic configuration.

Configuration

To use this feature, the following configuration is required:

system.allIsolationGroups: A list of all the possible isolation-groups.
system.enableTasklistIsolation: A boolean flag that enables the feature for a domain.

Implementation

The changes for this feature are largely in Matching and can be (reductively) described as follows: sync-match and async-match in Cadence are made aware of a new dimension, their associated isolation-group. Tasks piped through the Matching service are matched on the appropriate isolation-group channel.
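
As a rough illustration of the 'one channel per isolation-group' idea, the following Go sketch routes each task to the channel of its isolation-group, and lets each poller read only from its own group's channel. It is not the Matching service's code: the `Task` and `matcher` types and the `offer` and `poll` methods are hypothetical, and buffered channels with non-blocking sends stand in for the real sync and async matching paths.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for a task flowing through Matching.
type Task struct {
	ID             string
	IsolationGroup string // group chosen for this task by the partitioner
}

// matcher holds one channel per isolation-group.
type matcher struct {
	channels map[string]chan Task
}

// newMatcher creates a buffered channel for every known isolation-group.
func newMatcher(groups []string, buffer int) *matcher {
	m := &matcher{channels: make(map[string]chan Task, len(groups))}
	for _, g := range groups {
		m.channels[g] = make(chan Task, buffer)
	}
	return m
}

// offer places a task on its group's channel without blocking.
// It returns false if the group is unknown or its channel is full.
func (m *matcher) offer(t Task) bool {
	ch, ok := m.channels[t.IsolationGroup]
	if !ok {
		return false
	}
	select {
	case ch <- t:
		return true
	default:
		return false
	}
}

// poll returns a task waiting on the poller's own group channel, if any.
func (m *matcher) poll(pollerGroup string) (Task, bool) {
	ch, ok := m.channels[pollerGroup]
	if !ok {
		return Task{}, false
	}
	select {
	case t := <-ch:
		return t, true
	default:
		return Task{}, false
	}
}

func main() {
	m := newMatcher([]string{"zone-1", "zone-2"}, 1)
	m.offer(Task{ID: "t1", IsolationGroup: "zone-1"})

	_, ok := m.poll("zone-2")
	fmt.Println(ok) // prints "false": zone-2 pollers do not see zone-1 tasks

	t, ok := m.poll("zone-1")
	fmt.Println(t.ID, ok) // prints "t1 true"
}
```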

What's Changed

New Contributors

Full Changelog: v1.0.0...v1.2.1