Cortex: Zone Aware Replication

The Cortex metrics project has, from the beginning, supported data replication within its ingesters, the component that handles data processing. With the default configuration of 3 replicas, an ingester could go offline due to a fault or maintenance without risking data loss. However, this replication considered only the ID of the ingester when assigning replicas. All replicas for a given time series could therefore land in the same availability zone, which increased the risk of data loss if multiple ingesters in that zone went offline.

A recent update to the Cortex codebase now allows an availability zone value to be assigned to each ingester via config. The value is used during replication to ensure that replicas of the same time series are not assigned to the same zone. The zone is purely a string, and Cortex tries not to be prescriptive about its meaning outside of replication. It is left to the operator of the system to decide what defines a particular zone.
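To make the idea concrete, here is a minimal sketch of zone-aware replica selection on a hash ring. It is not the Cortex implementation: the `Ingester` class, `hash_key`, and `pick_replicas` are hypothetical names, and the sketch assumes one ring token per ingester for simplicity.

```python
import bisect
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Ingester:
    name: str
    zone: str   # free-form string; the operator decides what a zone means
    token: int  # position on the hash ring (assumed: one token per ingester)


def hash_key(key: str) -> int:
    """Map a series key to a 32-bit ring position."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")


def pick_replicas(ring, series_key, replication_factor=3):
    """Walk the ring clockwise from the key, taking at most one ingester per zone."""
    ordered = sorted(ring, key=lambda ingester: ingester.token)
    tokens = [ingester.token for ingester in ordered]
    start = bisect.bisect_left(tokens, hash_key(series_key))

    chosen, seen_zones = [], set()
    for step in range(len(ordered)):
        candidate = ordered[(start + step) % len(ordered)]
        if candidate.zone in seen_zones:
            continue  # this zone already holds a replica, so skip it
        chosen.append(candidate)
        seen_zones.add(candidate.zone)
        if len(chosen) == replication_factor:
            break
    # May hold fewer than replication_factor entries if zones run out.
    return chosen


if __name__ == "__main__":
    ring = [
        Ingester(f"ingester-{n}", f"zone-{n % 3}", n * 0x10000000)
        for n in range(6)
    ]
    for replica in pick_replicas(ring, "http_requests_total{job='api'}"):
        print(replica.name, replica.zone)
```

With six ingesters spread over three zones, every series gets three replicas in three distinct zones. With only two zones in the ring, the same call returns two replicas, because the sketch never reuses a zone.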

Configuring the availability zone is straightforward in the YAML config format of Cortex:

    availability_zone: "zone-3"

Full details about Cortex ingester configuration can be found in the docs.

An essential point to consider before enabling zone-aware replication is that the system must have at least as many zones as there are replicas. The system does not fall back to reusing a zone when there are too few, so a shortage of zones results in a lost replica or a failed write. The default replication factor for Cortex is 3, so there should be at least 3 distinct availability zones across the ingester instances.
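A quick pre-flight check can catch a zone shortage before the feature is switched on. The sketch below is an illustration only: `check_zone_count` is a hypothetical helper, and the mapping of ingester names to zones is assumed to come from your own inventory or deployment tooling rather than from any Cortex API.

```python
def check_zone_count(ingester_zones, replication_factor=3):
    """Return the distinct zones, or raise if there are fewer than replicas.

    ingester_zones maps an ingester name to its configured zone string.
    """
    zones = {zone for zone in ingester_zones.values() if zone}
    if len(zones) < replication_factor:
        raise ValueError(
            f"found {len(zones)} zone(s) {sorted(zones)}, "
            f"but replication needs at least {replication_factor}"
        )
    return zones


if __name__ == "__main__":
    inventory = {
        "ingester-1": "zone-1",
        "ingester-2": "zone-2",
        "ingester-3": "zone-3",
    }
    print(sorted(check_zone_count(inventory)))
```

Ingesters with an empty zone string are ignored here, on the assumption that an unset zone should not count toward the total.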

Another item to consider before using zone-aware replication is cost. If you currently run in a single availability zone on a cloud provider, spreading Cortex across multiple zones will most likely increase your running costs, as cloud providers charge for cross-zone traffic. Whether the extra cost is worthwhile depends on your risk tolerance.

For Cortex users who require their data to be highly available across multiple zones, this feature ensures just that.

More information about cortex metrics can be found @