We set reasonable retention policies in oximeter's clickhouse tables: ttls are set to 30 days for oximeter tables, and we now expose a configuration option for lowering that ttl via omdb. However, a few gaps remain here. Only support has access to configure the oximeter ttl, so the user would have to monitor clickhouse disk use and potentially ask support to adjust the ttl if necessary, or wait for an outage. It's also challenging for users to tune the ttl to manage clickhouse disk use, since disk use is a function of the cardinality of multiple resource types, and not straightforward to estimate in advance.
Rather than requiring users to tune ttls or risk an outage, we should teach oximeter to proactively manage disk use. Prometheus offers a good example here, providing a `--storage.tsdb.retention.size` configuration flag that automatically drops old samples if the tsdb hits its disk quota. Clickhouse, as a generic sql database, doesn't have an analogous flag, but we can build this logic in oximeter or clickhouse-admin: configure a clickhouse disk quota, then run a background job that periodically checks if we're over the quota and drops old records if we are. The quota doesn't have to be user-configurable to begin with, but could eventually be exposed either via omdb or nexus, and the job should expose metrics about how much and how often it's trimming old records.
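As a rough illustration of the check-and-trim loop body, here is a minimal sketch of a single iteration, assuming the job already knows current disk use and the quota in bytes. `TrimMetrics`, `bytes_over_quota`, and `check_once` are hypothetical names for this issue, not existing oximeter or clickhouse-admin APIs, and the sketch only decides how much to reclaim; it does not talk to clickhouse.

```rust
/// Running totals a background trim job could export as metrics.
/// Hypothetical type for illustration; not an existing oximeter type.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct TrimMetrics {
    /// Number of checks that found disk use above the quota ("how often").
    pub trims: u64,
    /// Total bytes the job asked to reclaim across all trims ("how much").
    /// This is the requested amount, not a measurement of what was freed.
    pub bytes_trimmed: u64,
}

/// Bytes that must be freed to get back to the quota.
/// Returns zero when disk use is at or under the quota.
pub fn bytes_over_quota(used_bytes: u64, quota_bytes: u64) -> u64 {
    used_bytes.saturating_sub(quota_bytes)
}

/// One iteration of the periodic check: compute the excess, and if there
/// is any, record it in the metrics. The caller would then drop old
/// records totaling at least the returned number of bytes.
pub fn check_once(
    used_bytes: u64,
    quota_bytes: u64,
    metrics: &mut TrimMetrics,
) -> u64 {
    let excess = bytes_over_quota(used_bytes, quota_bytes);
    if excess > 0 {
        metrics.trims += 1;
        metrics.bytes_trimmed += excess;
    }
    excess
}
```

A real job would call something like `check_once` on a timer and feed the result into whatever mechanism actually drops data; keeping the decision separate from the clickhouse calls makes it easy to unit test.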
This might be easiest to implement by dropping old partitions of measurement tables, but since we don't currently partition those tables, I believe the next easiest option would be to drop the oldest clickhouse parts until below quota.
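To make the "drop the oldest parts until below quota" idea concrete, here is a small sketch of the selection step only. The `Part` struct and `parts_to_drop` function are hypothetical; the fields are modeled on columns that clickhouse's `system.parts` table reports (`name`, `bytes_on_disk`, `modification_time`). Ordering by modification time is an assumption made for illustration.

```rust
/// Simplified view of one data part. Hypothetical struct, loosely modeled
/// on the `name`, `bytes_on_disk`, and `modification_time` columns of
/// clickhouse's `system.parts` table.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Part {
    pub name: String,
    pub bytes_on_disk: u64,
    /// Seconds since the epoch; used only to order parts oldest-first.
    pub modified_secs: u64,
}

/// Pick the oldest parts to drop until the remaining total is at or
/// below `quota_bytes`. Returns the selected parts, oldest first.
/// Returns an empty vector if the parts already fit within the quota.
pub fn parts_to_drop(mut parts: Vec<Part>, quota_bytes: u64) -> Vec<Part> {
    // Current disk use is taken to be the sum of all part sizes.
    let mut used: u64 = parts.iter().map(|p| p.bytes_on_disk).sum();

    // Oldest parts first.
    parts.sort_by_key(|p| p.modified_secs);

    let mut selected = Vec::new();
    for part in parts {
        if used <= quota_bytes {
            break;
        }
        // `used` is the sum of the remaining sizes, so this cannot underflow.
        used -= part.bytes_on_disk;
        selected.push(part);
    }
    selected
}
```

Each selected part could then be removed with clickhouse's `ALTER TABLE ... DROP PART` statement. One caveat to check before relying on this ordering: a merged part may carry a newer timestamp than the rows it contains, so modification time is only a proxy for data age.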
Note: labeling as a postmortem issue, since this relates to a recent incident in which clickhouse filled up its disk.