4 changes: 4 additions & 0 deletions .gitattributes
@@ -4,3 +4,7 @@ versions.lock text eol=lf

# Gradle files are always in LF.
*.gradle text eol=lf

# Monitoring mixin generated outputs — always LF to ensure make regeneration is idempotent.
solr/monitoring/grafana-solr-dashboard.json text eol=lf
solr/monitoring/prometheus-solr-alerts.yml text eol=lf
8 changes: 8 additions & 0 deletions changelog/unreleased/SOLR-18147-grafana-dashboard-solr10.yml
@@ -0,0 +1,8 @@
title: Add a new Grafana dashboard for Solr 10. This uses the new OTel-based metrics introduced in 10.0 and does not rely on solr-exporter
type: added
authors:
- name: Jan Høydahl
url: https://home.apache.org/phonebook.html?uid=janhoy
links:
- name: SOLR-18147
url: https://issues.apache.org/jira/browse/SOLR-18147
2 changes: 1 addition & 1 deletion dev-tools/scripts/smokeTestRelease.py
@@ -636,7 +636,7 @@ def verifyUnpacked(java, artifact, unpackPath, gitRevision, version, testArgs):
expected_src_root_folders = ['build-tools', 'changelog', 'dev-docs', 'dev-tools', 'gradle', 'solr']
expected_src_root_files = ['build.gradle', 'gradlew', 'gradlew.bat', 'settings.gradle', 'settings-gradle.lockfile', 'versions.lock']
expected_src_solr_files = ['build.gradle']
expected_src_solr_folders = ['benchmark', 'bin', 'modules', 'api', 'core', 'cross-dc-manager', 'docker', 'documentation', 'example', 'licenses', 'packaging', 'distribution', 'server', 'solr-ref-guide', 'solrj', 'solrj-jetty', 'solrj-streaming', 'solrj-zookeeper', 'test-framework', 'webapp', '.gitignore', '.gitattributes']
expected_src_solr_folders = ['benchmark', 'bin', 'modules', 'api', 'core', 'cross-dc-manager', 'docker', 'documentation', 'example', 'licenses', 'monitoring', 'packaging', 'distribution', 'server', 'solr-ref-guide', 'solrj', 'solrj-jetty', 'solrj-streaming', 'solrj-zookeeper', 'test-framework', 'webapp', '.gitignore', '.gitattributes']
is_in_list(in_root_folder, expected_src_root_folders)
is_in_list(in_root_folder, expected_src_root_files)
is_in_list(in_solr_folder, expected_src_solr_folders)
249 changes: 249 additions & 0 deletions solr/monitoring/README.md
@@ -0,0 +1,249 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Solr 10.x Monitoring Artifacts

This directory provides two ready-to-use monitoring artifacts for Solr 10.x that you can
drop into your existing Prometheus + Grafana installation:

| File | Description |
|---|---|
| **`grafana-solr-dashboard.json`** | Grafana dashboard — import directly into Grafana |
| **`prometheus-solr-alerts.yml`** | Prometheus alert rules — reference from your `prometheus.yml` |

These are the **main artifacts**. Everything else in this directory supports their creation
or testing.

| File / Directory | Description |
|---|---|
| `otel-collector-solr.yml` | OTel Collector config snippet for the OTLP push path |
| `mixin/` | Jsonnet source (single source of truth used to regenerate the artifacts above) |
| `dev/` | **Developer convenience only** — docker-compose stack for testing changes to the artifacts |

For the full integration guide see the Solr Reference Guide:
**[Monitoring with Prometheus and Grafana](../solr-ref-guide/modules/deployment-guide/pages/monitoring-with-prometheus-and-grafana.adoc)**

---

## Importing the Dashboard

1. Open Grafana → **Dashboards → Import**
2. Upload `grafana-solr-dashboard.json` or paste its contents
3. Select your Prometheus datasource when prompted
4. Click **Import**
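
If you manage Grafana declaratively, the same JSON file can be loaded through Grafana's
file-based dashboard provisioning instead of the UI. The paths below are examples —
adjust them to wherever you place the dashboard file:

```yaml
# /etc/grafana/provisioning/dashboards/solr.yml
apiVersion: 1
providers:
  - name: solr
    type: file
    options:
      # Directory containing grafana-solr-dashboard.json
      path: /var/lib/grafana/dashboards
```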

The dashboard has 5 sections:
- **Node Overview** — query/indexing rates, search latency, active cores, disk space
- **JVM** — heap usage, GC pauses and rates, threads, CPU utilization
- **SolrCloud** *(collapsed)* — Overseer queues, ZooKeeper ops, shard leaders
- **Index Health** *(collapsed)* — segment counts, index size, merge rates, MMap efficiency
- **Cache Efficiency** *(collapsed)* — filter/query/document cache hit rates and evictions

---

## Loading Alert Rules into Prometheus

Add the alert rules file to your Prometheus configuration:

```yaml
# prometheus.yml
rule_files:
  - /path/to/prometheus-solr-alerts.yml
```

Validate locally before deploying:
```bash
promtool check rules prometheus-solr-alerts.yml
```

Seven alert rules are included (3 critical, 4 warning). See the reference guide for details.
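
The generated rules follow the standard Prometheus rule-file shape. As a sketch only —
the metric name and expression below are illustrative assumptions, not the shipped
`SolrHighHeapUsage` rule, which is generated from the mixin:

```yaml
groups:
  - name: solr-example
    rules:
      - alert: SolrHighHeapUsage
        # Illustrative expression; the real rule is generated from mixin/config.libsonnet
        # (default heapUsageThreshold: 0.9).
        expr: solr_jvm_memory_heap_usage_ratio > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          description: 'Heap usage on {{ $labels.instance }} has been above 90% for 5 minutes.'
```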

---

## Prometheus Scrape Configuration

Configure Prometheus to scrape Solr's metrics endpoint:

```yaml
scrape_configs:
  - job_name: solr
    metrics_path: /api/metrics
    static_configs:
      - targets: ['solr-host:8983']
    relabel_configs:
      - target_label: environment
        replacement: prod # change per environment: dev, staging, prod
      - target_label: cluster
        replacement: cluster-1 # unique name for this Solr cluster
```

---

## Adding `environment` and `cluster` Labels

The dashboard includes `environment` and `cluster` dropdown variables that let you
scope panels to a specific deployment environment (dev/staging/prod) and SolrCloud cluster.
These labels are **operator-supplied** — Solr does not emit them natively.

If labels are absent, both dropdowns default to "All" and panels match all series —
the dashboard works correctly without these labels.
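
Panel queries template the two dropdown variables into label matchers, so series without
the labels still match the default "All" (`.*`) value. A representative pattern — the
metric name here is illustrative, not taken from the dashboard:

```promql
sum by (instance) (
  rate(solr_query_requests_total{environment=~"$environment", cluster=~"$cluster"}[5m])
)
```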

### OTel Collector path

Map OTel resource attributes in `otel-collector-solr.yml`:

```yaml
processors:
  transform/promote_resource_attrs:
    metric_statements:
      - context: datapoint
        statements:
          - set(attributes["environment"], resource.attributes["deployment.environment"]) where resource.attributes["deployment.environment"] != nil
          - set(attributes["cluster"], resource.attributes["service.namespace"]) where resource.attributes["service.namespace"] != nil
```

---

## Customizing Label Names

If your organization uses different Prometheus label names (e.g., `deployment_environment`
instead of `environment`), edit `mixin/config.libsonnet`:

```jsonnet
{
  _config:: {
    environmentLabel: 'deployment_environment',  // change to match your label
    clusterLabel: 'service_namespace',           // change to match your label
    // ... other thresholds ...
  },
}
```

Then regenerate the dashboard:

```bash
cd mixin
make all
```

---

## Customizing Alert Thresholds

All thresholds are in `mixin/config.libsonnet`. For example, to lower the heap alert
threshold from 90% to 85%:

```jsonnet
{
  _config:: {
    heapUsageThreshold: 0.85,  // default: 0.9
  },
}
```

Regenerate after editing:

```bash
cd mixin
make all # regenerates grafana-solr-dashboard.json and prometheus-solr-alerts.yml
make check # validates alert rules with promtool (optional)
```

---

## Mixin Build

Only contributors who need to **regenerate** the dashboard or alert rules need build tooling.
Operators can use the pre-generated files directly.

There are two equivalent ways to build:

### Option A — Docker (no local tool installation required)

```bash
cd mixin
./make.sh install # fetch grafonnet dependency into vendor/
./make.sh all # regenerate both artifacts
./make.sh check # validate alert rules
```

`make.sh` automatically builds the `solr-mixin-make:latest` Docker image on first run.
Pass `--rebuild` to force a fresh image build (e.g. after a Dockerfile update):

```bash
./make.sh --rebuild all
```

### Option B — Local tools

Install the required tools once:

| Tool | Purpose | Install |
|---|---|---|
| `jsonnet` (go-jsonnet) | Evaluate jsonnet source | `brew install go-jsonnet` or `go install github.com/google/go-jsonnet/cmd/jsonnet@latest` |
| `jb` (jsonnet-bundler) | Manage dependencies | `brew install jsonnet-bundler` or `go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest` |
| `gojsontoyaml` | Convert JSON→YAML for alert rules | `go install github.com/brancz/gojsontoyaml@latest` |
| `promtool` *(optional)* | Validate alert rules | `brew install prometheus` or bundled with Prometheus binary download |

Then build:

```bash
cd mixin
make install # fetch grafonnet dependency into vendor/
make all # regenerate both artifacts
```

### Makefile targets

Both `./make.sh <target>` and `make <target>` accept the same targets:

| Target | Action |
|---|---|
| `install` | Download jsonnet dependencies (grafonnet) into `vendor/` |
| `dashboards` | Regenerate `grafana-solr-dashboard.json` |
| `alerts` | Regenerate `prometheus-solr-alerts.yml` |
| `all` | Regenerate both outputs |
| `check` | Run `promtool check rules` on alert rules |
| `fmt` | Auto-format all `.libsonnet` source files |

---

## Developer Convenience Stack

The `dev/` directory contains a docker-compose stack that starts a local Solr + Prometheus
+ Grafana + Alertmanager environment. **This is solely for contributors who want to
visually test changes to the two artifacts above.** It is not a reference deployment
or a recommended production setup.

```bash
cd dev
docker-compose up -d
```

Services: Solr (`:8983`), Prometheus (`:9090`), Grafana (`:3000`, admin/admin), Alertmanager (`:9093`)

---

## Grafana Marketplace (Post-Merge)

After merging to the Solr main branch, a committer can publish the dashboard to the
Grafana marketplace at https://grafana.com/grafana/dashboards/ for wider discoverability.
Steps:
1. Log in to grafana.com with a Solr/Apache account
2. Upload `grafana-solr-dashboard.json`
3. Set tags: `solr`, `prometheus`, `solr10`
4. Reference the dashboard ID in this README for easy import via dashboard ID lookup
86 changes: 86 additions & 0 deletions solr/monitoring/dev/alertmanager/alertmanager.yml
@@ -0,0 +1,86 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# alertmanager.yml — Example Alertmanager routing configuration for Solr alerts.
#
# Two-tier severity routing:
# severity: critical → critical-receiver (intended for: PagerDuty, OpsGenie, paging on-call)
# severity: warning → warning-receiver (intended for: Slack, email, non-urgent notification)
#
# Replace the stub "null" receivers with your real notification integrations.
# See https://prometheus.io/docs/alerting/latest/configuration/ for receiver examples.

global:
  resolve_timeout: 5m

route:
  # Default receiver for any alert not matched by child routes.
  receiver: "default"
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts — page the on-call engineer immediately.
    # Solr critical alerts: SolrHighHeapUsage, SolrJvmGcThrashing, SolrLowDiskSpace
    - matchers:
        - severity = "critical"
      receiver: "critical-receiver"
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts — notify the team but do not page.
    # Solr warning alerts: SolrHighSearchLatency, SolrHighErrorRate,
    # SolrOverseerQueueBuildup, SolrHighMmapRatio
    - matchers:
        - severity = "warning"
      receiver: "warning-receiver"
      repeat_interval: 6h

receivers:
  # Default catch-all receiver (no-op stub).
  - name: "default"

  # Critical receiver — replace with PagerDuty, OpsGenie, VictorOps, or similar.
  # PagerDuty example:
  # - name: "critical-receiver"
  #   pagerduty_configs:
  #     - service_key: YOUR_PAGERDUTY_INTEGRATION_KEY
  - name: "critical-receiver"

  # Warning receiver — replace with Slack, email, or similar.
  # Slack example:
  # - name: "warning-receiver"
  #   slack_configs:
  #     - api_url: https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK
  #       channel: '#solr-alerts'
  #       title: '{{ .GroupLabels.alertname }}'
  #       text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  # Email example:
  # - name: "warning-receiver"
  #   email_configs:
  #     - to: 'solr-team@example.com'
  #       from: 'alertmanager@example.com'
  #       smarthost: 'smtp.example.com:587'
  - name: "warning-receiver"

inhibit_rules:
  # Suppress warning alerts on an instance when a critical alert is already firing for it.
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["instance"]