From 81ea062967152dcf9cbec8527d7d79b9cafbb24b Mon Sep 17 00:00:00 2001 From: John Wilkins Date: Fri, 12 Jun 2026 12:52:48 -0700 Subject: [PATCH 1/5] OBSDOCS-3383: Document Vector metrics cardinality and monitoring impact Addresses the issue where Vector collector metrics exhibit high cardinality in multitenant environments, impacting Prometheus resource consumption and cluster stability. Creates comprehensive guidance covering design, diagnosis, and remediation of high metrics cardinality caused by complex ClusterLogForwarder configurations. New modules: - collector-metrics-cardinality-impact.adoc (CONCEPT) * Explains what metrics cardinality is * How ClusterLogForwarder configuration creates cardinality * Which Vector metrics are affected (component_id label) * Impact on Prometheus (memory, CPU, storage, queries) * When to be concerned about cardinality - best-practices-multitenant-logging.adoc (REFERENCE) * Design patterns to minimize cardinality impact * Consolidate inputs using label selectors * Consolidate outputs to same destination * Minimize pipeline count * Example multitenant architecture (40-60 vs 400-500 components) * Cardinality estimation formulas - troubleshooting-collector-metrics-cardinality.adoc (PROCEDURE) * Diagnostic steps using promtool and PromQL * Identify problematic ClusterLogForwarder instances * Remediation options with examples * Verification steps Content based on: - Red Hat KB article 7137995 (diagnostic procedures) - OBSDA-1341 RFE description (problem statement) - Support case analysis (customer impact) Added new section to configuring/cluster-logging-collector.adoc: "Collector metrics and monitoring impact" This documentation addresses 3 JTBDs: 1. Design multitenant logging with monitoring impact awareness 2. Diagnose if logging causes Prometheus issues 3. 
Remediate high cardinality issues Related Jira: https://redhat.atlassian.net/browse/OBSDOCS-3383 Related RFE: https://redhat.atlassian.net/browse/OBSDA-1341 Signed-off-by: John Wilkins Co-authored-by: Claude Sonnet 4.5 --- configuring/cluster-logging-collector.adoc | 11 + .../best-practices-multitenant-logging.adoc | 315 ++++++++++++++++++ .../collector-metrics-cardinality-impact.adoc | 131 ++++++++ ...hooting-collector-metrics-cardinality.adoc | 232 +++++++++++++ 4 files changed, 689 insertions(+) create mode 100644 modules/best-practices-multitenant-logging.adoc create mode 100644 modules/collector-metrics-cardinality-impact.adoc create mode 100644 modules/troubleshooting-collector-metrics-cardinality.adoc diff --git a/configuring/cluster-logging-collector.adoc b/configuring/cluster-logging-collector.adoc index c7bdfda2f7af..6564d50fd0ae 100644 --- a/configuring/cluster-logging-collector.adoc +++ b/configuring/cluster-logging-collector.adoc @@ -63,4 +63,15 @@ include::modules/configuring-network-policy-rule-set-for-logfilemetricexporter.a include::modules/creating-an-adminnetworkpolicy-rule-for-collector-network-policy.adoc[leveloffset=+2] +[id="collector-metrics-and-monitoring-impact_{context}"] +== Collector metrics and monitoring impact + +The Vector log collector exposes metrics that can affect the {product-title} monitoring stack in environments with complex log forwarding configurations. 
+ +include::modules/collector-metrics-cardinality-impact.adoc[leveloffset=+2] + +include::modules/best-practices-multitenant-logging.adoc[leveloffset=+2] + +include::modules/troubleshooting-collector-metrics-cardinality.adoc[leveloffset=+2] + //include::modules/cluster-logging-collector-tuning.adoc[leveloffset=+1] diff --git a/modules/best-practices-multitenant-logging.adoc b/modules/best-practices-multitenant-logging.adoc new file mode 100644 index 000000000000..744a3e6c36fd --- /dev/null +++ b/modules/best-practices-multitenant-logging.adoc @@ -0,0 +1,315 @@ +// Module included in the following assemblies: +// +// * observability/logging/log_collection_forwarding/cluster-logging-collector.adoc + +:_mod-docs-content-type: REFERENCE +[id="best-practices-multitenant-logging_{context}"] += Best practices for multitenant logging configurations + +[role="_abstract"] +In multitenant {product-title} clusters, you can configure logging to isolate logs between tenants while minimizing the impact on the monitoring stack. The key is to balance tenant isolation requirements with the metrics cardinality that your configuration creates. + +[id="consolidate-inputs-with-selectors_{context}"] +== Consolidate inputs with label selectors + +Instead of creating separate inputs for each tenant or namespace, use label selectors to route logs from multiple sources through a single input. + +.Anti-pattern: Separate input per namespace +[source,yaml] +---- +apiVersion: observability.openshift.io/v1 +kind: ClusterLogForwarder +metadata: + name: tenant-logs + namespace: openshift-logging +spec: + inputs: + - name: tenant-a-logs # <1> + type: application + application: + namespaces: + - tenant-a + - name: tenant-b-logs # <1> + type: application + application: + namespaces: + - tenant-b + # ... 98 more tenant inputs + outputs: + - name: tenant-a-splunk + type: splunk + # ... + - name: tenant-b-splunk + type: splunk + # ... 
+ pipelines: + - name: tenant-a-pipeline + inputRefs: + - tenant-a-logs + outputRefs: + - tenant-a-splunk + - name: tenant-b-pipeline + inputRefs: + - tenant-b-logs + outputRefs: + - tenant-b-splunk +---- +<1> Each tenant input creates multiple Vector components, increasing cardinality. + +This configuration with 100 tenants creates ~400-500 unique `component_id` values. + +.Recommended: Single input with filtering +[source,yaml] +---- +apiVersion: observability.openshift.io/v1 +kind: ClusterLogForwarder +metadata: + name: tenant-logs + namespace: openshift-logging +spec: + inputs: + - name: all-tenants # <1> + type: application + application: + selector: + matchLabels: + tenant-logging: "enabled" # <2> + filters: + - name: tenant-a-filter + type: drop + drop: + - test: + - field: .kubernetes.namespace_name + notMatches: "^tenant-a$" + - name: tenant-b-filter + type: drop + drop: + - test: + - field: .kubernetes.namespace_name + notMatches: "^tenant-b$" + outputs: + - name: tenant-a-splunk + type: splunk + # ... + - name: tenant-b-splunk + type: splunk + # ... + pipelines: + - name: tenant-a-pipeline + inputRefs: + - all-tenants + filterRefs: + - tenant-a-filter + outputRefs: + - tenant-a-splunk + - name: tenant-b-pipeline + inputRefs: + - all-tenants + filterRefs: + - tenant-b-filter + outputRefs: + - tenant-b-splunk +---- +<1> Single input for all tenant logs reduces component count. +<2> Use pod labels to control which pods are included in log collection. + +This configuration creates far fewer Vector components because there is only one input source. + +[id="consolidate-outputs-same-destination_{context}"] +== Consolidate outputs to the same destination + +If multiple tenants send logs to the same destination system (for example, the same Splunk instance or LokiStack), use a single output rather than creating separate outputs per tenant. 
+ +.Anti-pattern: Separate output per tenant to same destination +[source,yaml] +---- +spec: + outputs: + - name: tenant-a-splunk + type: splunk + splunk: + url: https://splunk.example.com:8088 + token: + secretName: tenant-a-splunk-token + - name: tenant-b-splunk # <1> + type: splunk + splunk: + url: https://splunk.example.com:8088 # <1> + token: + secretName: tenant-b-splunk-token +---- +<1> Both outputs point to the same Splunk instance, creating duplicate components. + +.Recommended: Single output with tenant identification +[source,yaml] +---- +spec: + outputs: + - name: shared-splunk # <1> + type: splunk + splunk: + url: https://splunk.example.com:8088 + token: + secretName: splunk-token + pipelines: + - name: tenant-a-pipeline + inputRefs: + - all-tenants + filterRefs: + - tenant-a-filter + outputRefs: + - shared-splunk # <2> + - name: tenant-b-pipeline + inputRefs: + - all-tenants + filterRefs: + - tenant-b-filter + outputRefs: + - shared-splunk # <2> +---- +<1> Single output reduces component count. +<2> Multiple pipelines can share the same output. + +Tenant isolation is maintained by the namespace information in the log records. Use Splunk, Loki, or other destination capabilities to filter and route logs by tenant. + +[id="minimize-pipeline-count_{context}"] +== Minimize the number of pipelines + +Each pipeline creates additional components for routing and processing. Where possible, combine pipelines that share inputs and outputs. 
+ +.When separate pipelines are necessary +* Logs require different transformations before reaching different outputs +* Security or compliance requires strict separation of processing paths +* Different tenants require different delivery guarantees (for example, `AtLeastOnce` versus `AtMostOnce`) + +.When pipelines can be combined +* Logs go through the same filters to reach the same output +* Only difference is the source namespace or labels +* Tenant isolation is handled at the destination + +[id="use-single-clusterlogforwarder_{context}"] +== Use a single ClusterLogForwarder when possible + +Creating multiple `ClusterLogForwarder` custom resources increases the overall component count because each `ClusterLogForwarder` deploys a separate collector pod with its own set of components. + +.When to use multiple ClusterLogForwarders +* Different service accounts are required for different log collection purposes +* Different security or network policies apply +* Logs from different sources require completely different processing pipelines + +.When a single ClusterLogForwarder is sufficient +* All logs can use the same service account +* Tenant isolation is achieved through filtering and routing +* Network policies allow a single collector to reach all destinations + +[id="example-multitenant-architecture_{context}"] +== Example multitenant architecture + +The following example shows a multitenant logging configuration that balances tenant isolation with low metrics cardinality: + +[source,yaml] +---- +apiVersion: observability.openshift.io/v1 +kind: ClusterLogForwarder +metadata: + name: multitenant-logging + namespace: openshift-logging +spec: + serviceAccount: + name: logcollector + inputs: + - name: application-logs # <1> + type: application + application: + selector: + matchExpressions: + - key: logging-enabled + operator: In + values: + - "true" + - name: infrastructure-logs + type: infrastructure + filters: + - name: add-tenant-label # <2> + type: 
detectMultilineException + outputs: + - name: tenant-lokistack # <3> + type: lokiStack + lokiStack: + target: + name: logging-loki + namespace: openshift-logging + - name: compliance-s3 # <4> + type: s3 + s3: + bucketName: audit-logs + # ... + pipelines: + - name: application-to-loki # <5> + inputRefs: + - application-logs + filterRefs: + - add-tenant-label + outputRefs: + - tenant-lokistack + - name: infrastructure-to-loki + inputRefs: + - infrastructure-logs + outputRefs: + - tenant-lokistack + - name: audit-to-s3 + inputRefs: + - audit + outputRefs: + - compliance-s3 +---- +<1> Single input for all application logs, controlled by pod label. +<2> Single filter applies to all tenants. +<3> Single LokiStack output for all tenant application and infrastructure logs. +<4> Separate output for compliance because it goes to a different destination type. +<5> Three pipelines for different log types, sharing inputs and outputs where possible. + +This configuration creates approximately 40-60 `component_id` values, compared to 400-500 values for a per-tenant input/output design. + +Tenant isolation is achieved in LokiStack through the namespace labels in the logs. 
Use LogQL queries to filter logs by tenant: + +[source,logql] +---- +{kubernetes_namespace_name="tenant-a"} +---- + +[id="estimating-cardinality-impact_{context}"] +== Estimating cardinality impact + +Use the following rough estimates to predict the cardinality impact of your `ClusterLogForwarder` configuration: + +* Each input creates approximately 5-10 components (source + metadata + routing) +* Each output creates approximately 3-5 components (routing + sink + buffer management) +* Each filter creates approximately 1-2 components +* Each pipeline adds approximately 2-4 components for routing + +.Example calculation + +A `ClusterLogForwarder` with: + +* 5 inputs → ~40 components +* 3 outputs → ~12 components +* 2 filters → ~4 components +* 5 pipelines → ~15 components + +Estimated total: ~70 `component_id` values + +Assuming 8 buckets plus the `_sum` and `_count` series, each histogram metric creates 10 time series per `component_id`: + +* `vector_component_received_events_count_bucket`: 70 × 10 = 700 series +* `vector_buffer_send_duration_seconds_bucket`: 70 × 10 = 700 series + +These estimates are approximate. Use the diagnostic procedures in "Troubleshooting high collector metrics cardinality" to measure actual cardinality. 
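The rough per-item ranges in "Estimating cardinality impact" can also be turned into a quick low/high calculation. The following Python sketch is illustrative only and is not part of the patch: the function and variable names are hypothetical, the per-item ranges are copied from the estimates listed in that section, and the figure of 10 series per component assumes a histogram with 8 buckets plus `_sum` and `_count`.

```python
# Rough per-item component ranges (low, high), copied from the
# "Estimating cardinality impact" section. These are approximations.
COMPONENTS_PER_ITEM = {
    "inputs": (5, 10),
    "outputs": (3, 5),
    "filters": (1, 2),
    "pipelines": (2, 4),
}

# Assumption: 8 buckets plus _sum and _count per histogram metric.
SERIES_PER_COMPONENT = 10


def estimate_components(counts):
    """Return a (low, high) estimate of unique component_id values."""
    low = sum(COMPONENTS_PER_ITEM[kind][0] * n for kind, n in counts.items())
    high = sum(COMPONENTS_PER_ITEM[kind][1] * n for kind, n in counts.items())
    return low, high


def estimate_series(counts):
    """Return a (low, high) estimate of time series per histogram metric."""
    low, high = estimate_components(counts)
    return low * SERIES_PER_COMPONENT, high * SERIES_PER_COMPONENT


# The example forwarder from the section: 5 inputs, 3 outputs,
# 2 filters, and 5 pipelines.
example = {"inputs": 5, "outputs": 3, "filters": 2, "pipelines": 5}
print(estimate_components(example))  # (46, 89)
print(estimate_series(example))      # (460, 890)
```

The section's estimate of roughly 70 components falls inside this 46-89 range. Reporting a range rather than a single number makes it clearer that these figures are only a starting point for the measurements in the troubleshooting procedure.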
+ +[role="_additional-resources"] +.Additional resources + +* xref:../configuring/cluster-logging-collector.adoc#collector-metrics-cardinality-impact_cluster-logging-collector[Understanding collector metrics cardinality and monitoring impact] +* xref:../configuring/cluster-logging-collector.adoc#troubleshooting-collector-metrics-cardinality_cluster-logging-collector[Troubleshooting high collector metrics cardinality] +* xref:../configuring/configuring-log-forwarding.adoc#configuring-inputs_configuring-log-forwarding[Configuring inputs] +* xref:../configuring/configuring-log-forwarding.adoc#configuring-filters_configuring-log-forwarding[Configuring filters] diff --git a/modules/collector-metrics-cardinality-impact.adoc b/modules/collector-metrics-cardinality-impact.adoc new file mode 100644 index 000000000000..1c44c10d92e0 --- /dev/null +++ b/modules/collector-metrics-cardinality-impact.adoc @@ -0,0 +1,131 @@ +// Module included in the following assemblies: +// +// * observability/logging/log_collection_forwarding/cluster-logging-collector.adoc + +:_mod-docs-content-type: CONCEPT +[id="collector-metrics-cardinality-impact_{context}"] += Understanding collector metrics cardinality and monitoring impact + +[role="_abstract"] +The Vector log collector exposes Prometheus metrics that track the performance and health of log collection. In deployments with many inputs, outputs, and pipelines, these metrics can exhibit high cardinality, which increases resource consumption in the {product-title} monitoring stack. + +Understanding how `ClusterLogForwarder` configuration affects metrics cardinality helps you balance tenant isolation requirements with monitoring stack stability. + +[id="what-is-metrics-cardinality_{context}"] +== What is metrics cardinality + +Metrics cardinality refers to the number of unique time series created by a metric. Each unique combination of metric name and label values creates a separate time series that Prometheus must store and query. 
+ +For example, if a metric has labels `component_id` and `namespace`, and there are 100 unique `component_id` values and 50 unique `namespace` values, the potential cardinality is 100 × 50 = 5,000 time series for that metric. + +High-cardinality metrics have the following costs: + +* More memory in Prometheus pods to store time series +* More CPU to process queries across many series +* More storage to persist historical data +* Longer query response times + +[id="how-clusterlogforwarder-creates-cardinality_{context}"] +== How ClusterLogForwarder configuration creates cardinality + +The Vector log collector creates internal components for each stage of log processing. Each component is identified by a `component_id` label in Vector metrics. The number of components grows with the complexity of your `ClusterLogForwarder` configuration. + +Consider a `ClusterLogForwarder` with the following configuration: + +* 3 inputs (application, infrastructure, audit) +* 2 outputs (LokiStack, Splunk) +* 2 pipelines (one per output) + +This configuration creates the following types of components: + +* Input source components (3) +* Metadata enrichment transforms per input (3) +* Filter transforms if configured (varies) +* Routing and transformation components (varies) +* Output sink components (2) + +Each input typically requires multiple transformation stages before reaching an output, creating 5-10 components per input-output path. In this example, you might have 30-50 unique `component_id` values. + +In multitenant environments with many separate inputs, outputs, or pipelines for tenant isolation, the number of components can grow to hundreds: + +* 100 tenant-specific inputs +* 50 tenant-specific outputs +* 100 tenant-specific pipelines + +This configuration can create 400-500 unique `component_id` values or more. 
+ +[id="which-metrics-are-affected_{context}"] +== Which Vector metrics are affected by high cardinality + +The following Vector metrics use the `component_id` label and are most affected by high cardinality: + +`vector_component_received_events_count_bucket`:: Histogram tracking the number of events received by each component. This is typically the highest cardinality metric. + +`vector_buffer_send_duration_seconds_bucket`:: Histogram tracking how long it takes to send events from buffers. High cardinality because it tracks every component that sends data. + +`vector_source_lag_time_seconds_bucket`:: Histogram tracking lag time for source components. + +`vector_adaptive_concurrency_*_bucket`:: Multiple histogram metrics tracking adaptive concurrency behavior for outputs that support it. Each output creates multiple series. + +Histogram metrics are particularly problematic because they create multiple time series per unique label combination (one per bucket plus sum and count). + +A single `component_id` value in a histogram metric with 8 buckets creates: + +* 8 bucket series (`le="0.001"`, `le="0.01"`, ... `le="+Inf"`) +* 1 sum series (`_sum`) +* 1 count series (`_count`) + +Total: 10 time series per `component_id`. + +With 450 unique `component_id` values: + +* `vector_component_received_events_count_bucket`: 450 × 10 = 4,500 time series +* `vector_buffer_send_duration_seconds_bucket`: 450 × 10 = 4,500 time series +* Additional histogram metrics contribute similarly + +[id="impact-on-monitoring-stack_{context}"] +== Impact on the monitoring stack + +High cardinality from Vector metrics affects the {product-title} monitoring stack in several ways: + +Memory consumption:: Prometheus stores all active time series in memory. High cardinality can cause Prometheus pods to consume significantly more memory, potentially triggering out-of-memory (OOM) conditions. + +CPU usage:: Query processing time increases with the number of time series. 
High cardinality metrics slow down Prometheus queries and dashboard rendering. + +Storage growth:: Prometheus persistently stores time series data. High cardinality increases the rate of storage consumption, potentially exhausting available storage faster than expected. + +Query performance:: Queries that aggregate across all `component_id` values become slower as cardinality increases. This affects monitoring dashboards and alerting evaluation. + +Cluster stability:: The monitoring stack is critical for cluster operations including: ++ +-- +* Cluster upgrades +* Horizontal pod autoscaling (HPA) +* Vertical pod autoscaling (VPA) +* Node management +* Alerting +-- ++ +Monitoring stack instability or resource exhaustion can impact these operations. + +[id="when-to-be-concerned_{context}"] +== When to be concerned about cardinality + +Consider reviewing your `ClusterLogForwarder` configuration if: + +* You have more than 50 inputs, outputs, or pipelines across all `ClusterLogForwarder` instances +* You have separate `ClusterLogForwarder` instances per tenant or namespace +* Prometheus pods are consuming more memory than expected +* Prometheus storage is growing faster than anticipated +* Prometheus queries are slow or timing out +* You observe high cardinality warnings in Prometheus logs + +You can check the current cardinality of Vector metrics using the Prometheus API. See "Troubleshooting high collector metrics cardinality" for diagnostic procedures. 
+ +[role="_additional-resources"] +.Additional resources + +* xref:../configuring/cluster-logging-collector.adoc#best-practices-multitenant-logging_cluster-logging-collector[Best practices for multitenant logging configurations] +* xref:../configuring/cluster-logging-collector.adoc#troubleshooting-collector-metrics-cardinality_cluster-logging-collector[Troubleshooting high collector metrics cardinality] +* link:https://prometheus.io/docs/practices/naming/#labels[Prometheus metric and label naming best practices] +* link:https://www.robustperception.io/cardinality-is-key[Understanding cardinality in Prometheus] diff --git a/modules/troubleshooting-collector-metrics-cardinality.adoc b/modules/troubleshooting-collector-metrics-cardinality.adoc new file mode 100644 index 000000000000..ae40e6ab59bb --- /dev/null +++ b/modules/troubleshooting-collector-metrics-cardinality.adoc @@ -0,0 +1,232 @@ +// Module included in the following assemblies: +// +// * observability/logging/log_collection_forwarding/cluster-logging-collector.adoc + +:_mod-docs-content-type: PROCEDURE +[id="troubleshooting-collector-metrics-cardinality_{context}"] += Troubleshooting high collector metrics cardinality + +[role="_abstract"] +If you suspect that Vector collector metrics are causing high cardinality in Prometheus, you can diagnose the issue by analyzing the Prometheus time series database and identifying which `ClusterLogForwarder` configurations contribute most to cardinality. + +.Prerequisites + +* You have access to the cluster as a user with the `cluster-admin` cluster role. +* You have installed the {oc-first}. + +.Procedure + +. 
Analyze the Prometheus time series database to identify metrics with the highest cardinality: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -c prometheus -- \ + /bin/sh -c "promtool tsdb analyze /prometheus" +---- ++ +.Example output (truncated) +[source,terminal] +---- +Block ID: 01ARZ3NDEKTSV4RRFFQ69G5FAV +Duration: 2h0m0s +Series: 142853 +Label names: 156 +Label pairs: 3642847 + +Most common label pairs: +3642847 endpoint=metrics +3166018 namespace=openshift-logging +3116791 container=collector +3105098 app_kubernetes_io_component=collector + +Highest cardinality metric names: +727740 vector_component_received_events_count_bucket # <1> +719960 vector_buffer_send_duration_seconds_bucket # <1> +166536 etcd_request_duration_seconds_bucket +150240 vector_source_lag_time_seconds_bucket +137720 vector_adaptive_concurrency_limit_bucket +---- +<1> Vector metrics with significantly higher cardinality than other cluster metrics. + +. Count the number of unique `component_id` values for Vector metrics: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ + promtool query instant http://localhost:9090 \ + 'count(sum by (component_id)(rate(vector_component_received_events_count_bucket[1m])))' +---- ++ +.Example output +[source,terminal] +---- +484 # <1> +---- +<1> There are 484 unique `component_id` values contributing to cardinality. + +. Count how many of these components are inputs: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ + promtool query instant http://localhost:9090 \ + 'count(sum by (component_id)(rate(vector_component_received_events_count_bucket{component_id=~"input_.*"}[1m])))' +---- ++ +.Example output +[source,terminal] +---- +188 # <1> +---- +<1> 188 of the 484 components are input sources. + +. 
Identify which `ClusterLogForwarder` instances contribute most to cardinality: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ + promtool query instant http://localhost:9090 \ + 'count(sum by (app_kubernetes_io_instance, component_id)(rate(vector_component_received_events_count_bucket[1m]))) by (app_kubernetes_io_instance)' +---- ++ +This query returns the number of components per `ClusterLogForwarder` instance. ++ +.Example output +[source,terminal] +---- +{app_kubernetes_io_instance="instance"} 31 +{app_kubernetes_io_instance="app-logforward-customers"} 453 # <1> +---- +<1> The `app-logforward-customers` ClusterLogForwarder creates 453 unique components. + +. Retrieve the configuration of the `ClusterLogForwarder` with the high component count: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder app-logforward-customers -n openshift-logging -o yaml +---- + +. Count the total number of inputs, outputs, and pipelines in the configuration: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder app-logforward-customers -n openshift-logging \ + -o jsonpath='{.spec.inputs[*].name} {.spec.outputs[*].name} {.spec.pipelines[*].name}' | wc -w +---- + +. 
Analyze the configuration to identify opportunities for consolidation: ++ +-- +* Count the number of inputs: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder app-logforward-customers -n openshift-logging -o yaml | \ + yq '.spec.inputs | length' +---- + +* Count the number of outputs: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder app-logforward-customers -n openshift-logging -o yaml | \ + yq '.spec.outputs | length' +---- + +* Count the number of pipelines: ++ +[source,terminal] +---- +$ oc get clusterlogforwarder app-logforward-customers -n openshift-logging -o yaml | \ + yq '.spec.pipelines | length' +---- +-- + +.Troubleshooting + +Based on your diagnosis, apply the appropriate remediation: + +Consolidate inputs using label selectors:: If you have many inputs that differ only by namespace or pod labels, replace them with a single input that uses label selectors to include all required pods. ++ +For example, instead of separate inputs for `tenant-a`, `tenant-b`, and so on, create one input with: ++ +[source,yaml] +---- +inputs: + - name: all-tenants + type: application + application: + selector: + matchLabels: + tenant-logging: "enabled" +---- ++ +See "Best practices for multitenant logging configurations" for detailed examples. + +Consolidate outputs to the same destination:: If multiple outputs point to the same destination system, combine them into a single output and use multiple pipelines to route different log streams to it. ++ +For example, instead of `tenant-a-splunk` and `tenant-b-splunk` both pointing to `https://splunk.example.com`, create one `shared-splunk` output. + +Reduce the number of pipelines:: Each pipeline adds components for routing. 
If pipelines differ only in their source namespace or labels, consider: ++ +-- +* Using filters within a single pipeline to route logs +* Relying on destination system capabilities to separate tenant logs +* Combining pipelines that share the same input and output +-- + +Use fewer ClusterLogForwarder instances:: If you have multiple `ClusterLogForwarder` custom resources that could be combined, consider consolidating them into a single instance. Each `ClusterLogForwarder` deploys separate collector pods with their own component sets. ++ +Only use multiple `ClusterLogForwarder` instances when you need: ++ +-- +* Different service accounts for different collection purposes +* Different security or network policies +* Completely different processing pipelines +-- + +Review filter usage:: While filters are necessary for log processing, excessive filtering can add components. Ensure that filters serve a clear purpose and consider whether filtering can be done at the destination instead. + +.Verification + +After making configuration changes to reduce cardinality: + +. Wait approximately 5-10 minutes for the collector to restart and metrics to stabilize. + +. Rerun the diagnostic commands to verify the component count decreased: ++ +[source,terminal] +---- +$ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ + promtool query instant http://localhost:9090 \ + 'count(sum by (component_id)(rate(vector_component_received_events_count_bucket[1m])))' +---- + +. Monitor Prometheus pod resource usage over the next few hours: ++ +[source,terminal] +---- +$ oc adm top pod -n openshift-monitoring -l app.kubernetes.io/name=prometheus +---- ++ +You should see memory usage stabilize or decrease as old time series expire. + +. Check Prometheus storage growth rate using the web console: ++ +-- +.. Navigate to *Observe* -> *Metrics*. +.. 
Run the following query to see storage size: ++ +[source,promql] +---- +prometheus_tsdb_storage_blocks_bytes +---- +-- + +[role="_additional-resources"] +.Additional resources + +* xref:../configuring/cluster-logging-collector.adoc#collector-metrics-cardinality-impact_cluster-logging-collector[Understanding collector metrics cardinality and monitoring impact] +* xref:../configuring/cluster-logging-collector.adoc#best-practices-multitenant-logging_cluster-logging-collector[Best practices for multitenant logging configurations] +* link:https://prometheus.io/docs/prometheus/latest/storage/[Prometheus storage documentation] +* link:https://access.redhat.com/solutions/7137995[Red Hat Knowledgebase: Prometheus storage size impacted by high cardinality of Vector metrics] From 644030be386da001ba4bc7451f8a69f2ce35410a Mon Sep 17 00:00:00 2001 From: John Wilkins Date: Fri, 12 Jun 2026 13:03:27 -0700 Subject: [PATCH 2/5] OBSDOCS-3383: Remove DITA-incompatible callouts from modules Replaces numbered callouts (<1>, <2>, etc.) with DITA-compliant approaches: - Inline explanatory text after code blocks - Definition lists for key-value explanations - Bulleted lists for multi-point explanations Callouts are not supported in DITA and cause build warnings. 
Changes: - best-practices-multitenant-logging.adoc * Removed 10 callouts from YAML examples * Replaced with inline explanations and bulleted lists - troubleshooting-collector-metrics-cardinality.adoc * Removed 4 callouts from terminal output examples * Replaced with inline explanatory sentences Vale validation: 0 errors, 6 warnings (block titles only) Signed-off-by: John Wilkins Co-authored-by: Claude Sonnet 4.5 --- .../best-practices-multitenant-logging.adoc | 65 ++++++++++--------- ...hooting-collector-metrics-cardinality.adoc | 22 ++++--- 2 files changed, 46 insertions(+), 41 deletions(-) diff --git a/modules/best-practices-multitenant-logging.adoc b/modules/best-practices-multitenant-logging.adoc index 744a3e6c36fd..53767b9095b0 100644 --- a/modules/best-practices-multitenant-logging.adoc +++ b/modules/best-practices-multitenant-logging.adoc @@ -24,12 +24,12 @@ metadata: namespace: openshift-logging spec: inputs: - - name: tenant-a-logs # <1> + - name: tenant-a-logs type: application application: namespaces: - tenant-a - - name: tenant-b-logs # <1> + - name: tenant-b-logs type: application application: namespaces: @@ -54,9 +54,8 @@ spec: outputRefs: - tenant-b-splunk ---- -<1> Each tenant input creates multiple Vector components, increasing cardinality. - -This configuration with 100 tenants creates ~400-500 unique `component_id` values. ++ +Each tenant input creates multiple Vector components, increasing cardinality. This configuration with 100 tenants creates approximately 400-500 unique `component_id` values. 
.Recommended: Single input with filtering [source,yaml] @@ -68,12 +67,12 @@ metadata: namespace: openshift-logging spec: inputs: - - name: all-tenants # <1> + - name: all-tenants type: application application: selector: matchLabels: - tenant-logging: "enabled" # <2> + tenant-logging: "enabled" filters: - name: tenant-a-filter type: drop @@ -110,10 +109,8 @@ spec: outputRefs: - tenant-b-splunk ---- -<1> Single input for all tenant logs reduces component count. -<2> Use pod labels to control which pods are included in log collection. - -This configuration creates far fewer Vector components because there is only one input source. ++ +A single input for all tenant logs reduces component count. Use pod labels to control which pods are included in log collection. This configuration creates far fewer Vector components because there is only one input source. [id="consolidate-outputs-same-destination_{context}"] == Consolidate outputs to the same destination @@ -131,21 +128,22 @@ spec: url: https://splunk.example.com:8088 token: secretName: tenant-a-splunk-token - - name: tenant-b-splunk # <1> + - name: tenant-b-splunk type: splunk splunk: - url: https://splunk.example.com:8088 # <1> + url: https://splunk.example.com:8088 token: secretName: tenant-b-splunk-token ---- -<1> Both outputs point to the same Splunk instance, creating duplicate components. ++ +Both outputs point to the same Splunk instance, creating duplicate components. .Recommended: Single output with tenant identification [source,yaml] ---- spec: outputs: - - name: shared-splunk # <1> + - name: shared-splunk type: splunk splunk: url: https://splunk.example.com:8088 @@ -158,19 +156,17 @@ spec: filterRefs: - tenant-a-filter outputRefs: - - shared-splunk # <2> + - shared-splunk - name: tenant-b-pipeline inputRefs: - all-tenants filterRefs: - tenant-b-filter outputRefs: - - shared-splunk # <2> + - shared-splunk ---- -<1> Single output reduces component count. -<2> Multiple pipelines can share the same output. 
- -Tenant isolation is maintained by the namespace information in the log records. Use Splunk, Loki, or other destination capabilities to filter and route logs by tenant. ++ +A single output reduces component count. Multiple pipelines can share the same output. Tenant isolation is maintained by the namespace information in the log records. Use Splunk, Loki, or other destination capabilities to filter and route logs by tenant. [id="minimize-pipeline-count_{context}"] == Minimize the number of pipelines @@ -218,7 +214,7 @@ spec: serviceAccount: name: logcollector inputs: - - name: application-logs # <1> + - name: application-logs type: application application: selector: @@ -230,22 +226,22 @@ spec: - name: infrastructure-logs type: infrastructure filters: - - name: add-tenant-label # <2> + - name: add-tenant-label type: detectMultilineException outputs: - - name: tenant-lokistack # <3> + - name: tenant-lokistack type: lokiStack lokiStack: target: name: logging-loki namespace: openshift-logging - - name: compliance-s3 # <4> + - name: compliance-s3 type: s3 s3: bucketName: audit-logs # ... pipelines: - - name: application-to-loki # <5> + - name: application-to-loki inputRefs: - application-logs filterRefs: @@ -263,12 +259,17 @@ spec: outputRefs: - compliance-s3 ---- -<1> Single input for all application logs, controlled by pod label. -<2> Single filter applies to all tenants. -<3> Single LokiStack output for all tenant application and infrastructure logs. -<4> Separate output for compliance because it goes to a different destination type. -<5> Three pipelines for different log types, sharing inputs and outputs where possible. 
- ++ +This configuration uses: ++ +-- +* A single input for all application logs, controlled by pod label +* A single filter that applies to all tenants +* A single LokiStack output for all tenant application and infrastructure logs +* A separate output for compliance because it goes to a different destination type +* Three pipelines for different log types, sharing inputs and outputs where possible +-- ++ This configuration creates approximately 40-60 `component_id` values, compared to 400-500 values for a per-tenant input/output design. Tenant isolation is achieved in LokiStack through the namespace labels in the logs. Use LogQL queries to filter logs by tenant: diff --git a/modules/troubleshooting-collector-metrics-cardinality.adoc b/modules/troubleshooting-collector-metrics-cardinality.adoc index ae40e6ab59bb..29cacf4503fe 100644 --- a/modules/troubleshooting-collector-metrics-cardinality.adoc +++ b/modules/troubleshooting-collector-metrics-cardinality.adoc @@ -40,13 +40,14 @@ Most common label pairs: 3105098 app_kubernetes_io_component=collector Highest cardinality metric names: -727740 vector_component_received_events_count_bucket # <1> -719960 vector_buffer_send_duration_seconds_bucket # <1> +727740 vector_component_received_events_count_bucket +719960 vector_buffer_send_duration_seconds_bucket 166536 etcd_request_duration_seconds_bucket 150240 vector_source_lag_time_seconds_bucket 137720 vector_adaptive_concurrency_limit_bucket ---- -<1> Vector metrics with significantly higher cardinality than other cluster metrics. ++ +Vector metrics have significantly higher cardinality than other cluster metrics. . Count the number of unique `component_id` values for Vector metrics: + @@ -60,9 +61,10 @@ $ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ .Example output [source,terminal] ---- -484 # <1> +484 ---- -<1> There are 484 unique `component_id` values contributing to cardinality. 
++ +This output shows that there are 484 unique `component_id` values contributing to cardinality. . Count how many of these components are inputs: + @@ -76,9 +78,10 @@ $ oc exec -n openshift-monitoring prometheus-k8s-0 -- \ .Example output [source,terminal] ---- -188 # <1> +188 ---- -<1> 188 of the 484 components are input sources. ++ +This output shows that 188 of the 484 components are input sources. . Identify which `ClusterLogForwarder` instances contribute most to cardinality: + @@ -95,9 +98,10 @@ This query returns the number of components per `ClusterLogForwarder` instance. [source,terminal] ---- {app_kubernetes_io_instance="instance"} 31 -{app_kubernetes_io_instance="app-logforward-customers"} 453 # <1> +{app_kubernetes_io_instance="app-logforward-customers"} 453 ---- -<1> The `app-logforward-customers` ClusterLogForwarder creates 453 unique components. ++ +This output shows that the `app-logforward-customers` `ClusterLogForwarder` instance creates 453 unique components. . Retrieve the configuration of the `ClusterLogForwarder` with high component count: + From 0d57c6bd76fe43d4ac69544e16c3c25259679e81 Mon Sep 17 00:00:00 2001 From: John Wilkins Date: Fri, 12 Jun 2026 13:09:16 -0700 Subject: [PATCH 3/5] OBSDOCS-3383: Fix DITA compliance issues in cardinality modules Removed level 2 subheadings and inappropriate block titles from: - modules/collector-metrics-cardinality-impact.adoc - modules/best-practices-multitenant-logging.adoc Changes: - Converted level 2 subheadings (==) to inline content within main module text - Removed inappropriate block titles (.Anti-pattern, .Recommended, etc.) - Replaced block titles with inline introductory text - Fixed PascalCase terms (LokiStack, LogQL) with backticks DITA compliance verified with vale-check-assembly - 0 errors, 0 warnings on these modules.
Signed-off-by: John Wilkins Co-Authored-By: Claude Sonnet 4.5 --- .../best-practices-multitenant-logging.adoc | 42 +++++-------------- .../collector-metrics-cardinality-impact.adoc | 26 +----------- 2 files changed, 13 insertions(+), 55 deletions(-) diff --git a/modules/best-practices-multitenant-logging.adoc b/modules/best-practices-multitenant-logging.adoc index 53767b9095b0..b573561d93e3 100644 --- a/modules/best-practices-multitenant-logging.adoc +++ b/modules/best-practices-multitenant-logging.adoc @@ -9,12 +9,9 @@ [role="_abstract"] In multitenant {product-title} clusters, you can configure logging to isolate logs between tenants while minimizing the impact on the monitoring stack. The key is to balance tenant isolation requirements with the metrics cardinality that your configuration creates. -[id="consolidate-inputs-with-selectors_{context}"] -== Consolidate inputs with label selectors - Instead of creating separate inputs for each tenant or namespace, use label selectors to route logs from multiple sources through a single input. -.Anti-pattern: Separate input per namespace +The following configuration example shows an anti-pattern with separate inputs per namespace: [source,yaml] ---- apiVersion: observability.openshift.io/v1 @@ -57,7 +54,7 @@ spec: + Each tenant input creates multiple Vector components, increasing cardinality. This configuration with 100 tenants creates approximately 400-500 unique `component_id` values. -.Recommended: Single input with filtering +The following configuration example shows the recommended approach with a single input and filtering: [source,yaml] ---- apiVersion: observability.openshift.io/v1 @@ -112,12 +109,9 @@ spec: + A single input for all tenant logs reduces component count. Use pod labels to control which pods are included in log collection. This configuration creates far fewer Vector components because there is only one input source. 
-[id="consolidate-outputs-same-destination_{context}"] -== Consolidate outputs to the same destination - -If multiple tenants send logs to the same destination system (for example, the same Splunk instance or LokiStack), use a single output rather than creating separate outputs per tenant. +If multiple tenants send logs to the same destination system (for example, the same Splunk instance or `LokiStack`), use a single output rather than creating separate outputs per tenant. -.Anti-pattern: Separate output per tenant to same destination +The following configuration example shows an anti-pattern with separate outputs per tenant to the same destination: [source,yaml] ---- spec: @@ -138,7 +132,7 @@ spec: + Both outputs point to the same Splunk instance, creating duplicate components. -.Recommended: Single output with tenant identification +The following configuration example shows the recommended approach with a single output and tenant identification: [source,yaml] ---- spec: @@ -168,39 +162,30 @@ spec: + A single output reduces component count. Multiple pipelines can share the same output. Tenant isolation is maintained by the namespace information in the log records. Use Splunk, Loki, or other destination capabilities to filter and route logs by tenant. -[id="minimize-pipeline-count_{context}"] -== Minimize the number of pipelines - Each pipeline creates additional components for routing and processing. Where possible, combine pipelines that share inputs and outputs. 
-.When separate pipelines are necessary +Separate pipelines are necessary when: * Logs require different transformations before reaching different outputs * Security or compliance requires strict separation of processing paths * Different tenants require different delivery guarantees (for example, `AtLeastOnce` versus `AtMostOnce`) -.When pipelines can be combined +Pipelines can be combined when: * Logs go through the same filters to reach the same output * Only difference is the source namespace or labels * Tenant isolation is handled at the destination -[id="use-single-clusterlogforwarder_{context}"] -== Use a single ClusterLogForwarder when possible - Creating multiple `ClusterLogForwarder` custom resources increases the overall component count because each `ClusterLogForwarder` deploys a separate collector pod with its own set of components. -.When to use multiple ClusterLogForwarders +Use multiple `ClusterLogForwarder` instances when: * Different service accounts are required for different log collection purposes * Different security or network policies apply * Logs from different sources require completely different processing pipelines -.When a single ClusterLogForwarder is sufficient +A single `ClusterLogForwarder` is sufficient when: * All logs can use the same service account * Tenant isolation is achieved through filtering and routing * Network policies allow a single collector to reach all destinations -[id="example-multitenant-architecture_{context}"] -== Example multitenant architecture - The following example shows a multitenant logging configuration that balances tenant isolation with low metrics cardinality: [source,yaml] @@ -272,16 +257,13 @@ This configuration uses: + This configuration creates approximately 40-60 `component_id` values, compared to 400-500 values for a per-tenant input/output design. -Tenant isolation is achieved in LokiStack through the namespace labels in the logs. 
Use LogQL queries to filter logs by tenant: +Tenant isolation is achieved in `LokiStack` through the namespace labels in the logs. Use `LogQL` queries to filter logs by tenant: [source,logql] ---- {kubernetes_namespace_name="tenant-a"} ---- -[id="estimating-cardinality-impact_{context}"] -== Estimating cardinality impact - Use the following rough estimates to predict the cardinality impact of your `ClusterLogForwarder` configuration: * Each input creates approximately 5-10 components (source + metadata + routing) @@ -289,9 +271,7 @@ Use the following rough estimates to predict the cardinality impact of your `Clu * Each filter creates approximately 1-2 components * Each pipeline adds approximately 2-4 components for routing -.Example calculation - -A `ClusterLogForwarder` with: +For example, a `ClusterLogForwarder` with: * 5 inputs → ~40 components * 3 outputs → ~12 components diff --git a/modules/collector-metrics-cardinality-impact.adoc b/modules/collector-metrics-cardinality-impact.adoc index 1c44c10d92e0..0d629ed4ed3d 100644 --- a/modules/collector-metrics-cardinality-impact.adoc +++ b/modules/collector-metrics-cardinality-impact.adoc @@ -11,22 +11,9 @@ The Vector log collector exposes Prometheus metrics that track the performance a Understanding how `ClusterLogForwarder` configuration affects metrics cardinality helps you balance tenant isolation requirements with monitoring stack stability. -[id="what-is-metrics-cardinality_{context}"] -== What is metrics cardinality +Metrics cardinality refers to the number of unique time series created by a metric. Each unique combination of metric name and label values creates a separate time series that Prometheus must store and query. For example, if a metric has labels `component_id` and `namespace`, and there are 100 unique `component_id` values and 50 unique `namespace` values, the potential cardinality is 100 × 50 = 5,000 time series for that metric. 
-Metrics cardinality refers to the number of unique time series created by a metric. Each unique combination of metric name and label values creates a separate time series that Prometheus must store and query. - -For example, if a metric has labels `component_id` and `namespace`, and there are 100 unique `component_id` values and 50 unique `namespace` values, the potential cardinality is 100 × 50 = 5,000 time series for that metric. - -High cardinality metrics require: - -* More memory in Prometheus pods to store time series -* More CPU to process queries across many series -* More storage to persist historical data -* Longer query response times - -[id="how-clusterlogforwarder-creates-cardinality_{context}"] -== How ClusterLogForwarder configuration creates cardinality +High cardinality metrics require more memory in Prometheus pods to store time series, more CPU to process queries across many series, and more storage to persist historical data. They also result in longer query response times. The Vector log collector creates internal components for each stage of log processing. Each component is identified by a `component_id` label in Vector metrics. The number of components grows with the complexity of your `ClusterLogForwarder` configuration. @@ -54,9 +41,6 @@ In multitenant environments with many separate inputs, outputs, or pipelines for This configuration can create 400-500 unique `component_id` values or more. -[id="which-metrics-are-affected_{context}"] -== Which Vector metrics are affected by high cardinality - The following Vector metrics use the `component_id` label and are most affected by high cardinality: `vector_component_received_events_count_bucket`:: Histogram tracking the number of events received by each component. This is typically the highest cardinality metric.
@@ -83,9 +67,6 @@ With 450 unique `component_id` values: * `vector_buffer_send_duration_seconds_bucket`: 450 × 10 = 4,500 time series * Additional histogram metrics contribute similarly -[id="impact-on-monitoring-stack_{context}"] -== Impact on the monitoring stack - High cardinality from Vector metrics affects the {product-title} monitoring stack in several ways: Memory consumption:: Prometheus stores all active time series in memory. High cardinality can cause Prometheus pods to consume significantly more memory, potentially triggering out-of-memory (OOM) conditions. @@ -108,9 +89,6 @@ Cluster stability:: The monitoring stack is critical for cluster operations incl + Monitoring stack instability or resource exhaustion can impact these operations. -[id="when-to-be-concerned_{context}"] -== When to be concerned about cardinality - Consider reviewing your `ClusterLogForwarder` configuration if: * You have more than 50 inputs, outputs, or pipelines across all `ClusterLogForwarder` instances From dd75f74ca05a6c21f8712ccff203a064c88cc86d Mon Sep 17 00:00:00 2001 From: John Wilkins Date: Fri, 12 Jun 2026 13:26:29 -0700 Subject: [PATCH 4/5] OBSDOCS-3383: Remove incorrect continuation markers from best practices module Removed standalone '+' continuation markers that were causing visible plus signs in HTML output. These markers were left over from removed block titles and are not needed for standalone paragraphs after code blocks. 
Signed-off-by: John Wilkins Co-Authored-By: Claude Sonnet 4.5 --- modules/best-practices-multitenant-logging.adoc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/modules/best-practices-multitenant-logging.adoc b/modules/best-practices-multitenant-logging.adoc index b573561d93e3..b9f5b6ececd6 100644 --- a/modules/best-practices-multitenant-logging.adoc +++ b/modules/best-practices-multitenant-logging.adoc @@ -51,7 +51,7 @@ spec: outputRefs: - tenant-b-splunk ---- -+ + Each tenant input creates multiple Vector components, increasing cardinality. This configuration with 100 tenants creates approximately 400-500 unique `component_id` values. The following configuration example shows the recommended approach with a single input and filtering: @@ -106,7 +106,7 @@ spec: outputRefs: - tenant-b-splunk ---- -+ + A single input for all tenant logs reduces component count. Use pod labels to control which pods are included in log collection. This configuration creates far fewer Vector components because there is only one input source. If multiple tenants send logs to the same destination system (for example, the same Splunk instance or `LokiStack`), use a single output rather than creating separate outputs per tenant. @@ -129,7 +129,7 @@ spec: token: secretName: tenant-b-splunk-token ---- -+ + Both outputs point to the same Splunk instance, creating duplicate components. The following configuration example shows the recommended approach with a single output and tenant identification: @@ -159,7 +159,7 @@ spec: outputRefs: - shared-splunk ---- -+ + A single output reduces component count. Multiple pipelines can share the same output. Tenant isolation is maintained by the namespace information in the log records. Use Splunk, Loki, or other destination capabilities to filter and route logs by tenant. Each pipeline creates additional components for routing and processing. Where possible, combine pipelines that share inputs and outputs. 
From e8b6da0afc885bcb22a144fe9291cf98c6543ca1 Mon Sep 17 00:00:00 2001 From: John Wilkins Date: Fri, 12 Jun 2026 15:52:14 -0700 Subject: [PATCH 5/5] OBSDOCS-3383: Fix vale warnings - replace 'using' with 'by using' Fixed RedHat.Using warnings: - modules/collector-metrics-cardinality-impact.adoc:101 - modules/troubleshooting-collector-metrics-cardinality.adoc:153 All three modules now have 0 errors, 0 warnings. Signed-off-by: John Wilkins Co-Authored-By: Claude Sonnet 4.5 --- modules/collector-metrics-cardinality-impact.adoc | 2 +- modules/troubleshooting-collector-metrics-cardinality.adoc | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/modules/collector-metrics-cardinality-impact.adoc b/modules/collector-metrics-cardinality-impact.adoc index 0d629ed4ed3d..c908a9283c99 100644 --- a/modules/collector-metrics-cardinality-impact.adoc +++ b/modules/collector-metrics-cardinality-impact.adoc @@ -98,7 +98,7 @@ Consider reviewing your `ClusterLogForwarder` configuration if: * Prometheus queries are slow or timing out * You observe high cardinality warnings in Prometheus logs -You can check the current cardinality of Vector metrics using the Prometheus API. See "Troubleshooting high collector metrics cardinality" for diagnostic procedures. +You can check the current cardinality of Vector metrics by using the Prometheus API. See "Troubleshooting high collector metrics cardinality" for diagnostic procedures. 
[role="_additional-resources"] .Additional resources diff --git a/modules/troubleshooting-collector-metrics-cardinality.adoc b/modules/troubleshooting-collector-metrics-cardinality.adoc index 29cacf4503fe..60aad02a3862 100644 --- a/modules/troubleshooting-collector-metrics-cardinality.adoc +++ b/modules/troubleshooting-collector-metrics-cardinality.adoc @@ -150,7 +150,7 @@ $ oc get clusterlogforwarder app-logforward-customers -n openshift-logging -o ya Based on your diagnosis, apply the appropriate remediation: -Consolidate inputs using label selectors:: If you have many inputs that differ only by namespace or pod labels, replace them with a single input that uses label selectors to include all required pods. +Consolidate inputs by using label selectors:: If you have many inputs that differ only by namespace or pod labels, replace them with a single input that uses label selectors to include all required pods. + For example, instead of separate inputs for `tenant-a`, `tenant-b`, and so on, create one input with: +