
Allow Fleet Server to reload TLS certificates without restarting #6838

Open
ycombinator wants to merge 19 commits into elastic:main from ycombinator:pr/6835


@ycombinator
Contributor

@ycombinator ycombinator commented Apr 15, 2026

What is the problem this PR solves?

Fleet Server currently requires an explicit restart to reload the TLS certificates it uses to serve HTTPS requests from Elastic Agents. In environments with frequent certificate rotation, such as Kubernetes (where Secrets are remounted into pods), serverless (where cert-manager manages per-project certificates), and general operations, this adds operational burden and requires external orchestration to coordinate certificate rotation with server restarts.

How does this PR solve the problem?

Adds an opt-in ssl.certificate_reload.enabled config option (disabled by default) that, when enabled, watches the configured cert/key files on disk using fsnotify and hot-reloads them into the running TLS listener without restart.

Key design decisions:

  • Detection: Watches parent directories of cert/key files (not the files directly) to handle atomic rename/replace used by cert-manager and Kubernetes secret mounts
  • Debounce: Waits after the first file-change event for writes to settle, handling non-atomic writes of cert and key as separate operations. The debounce period defaults to 5 seconds and is configurable via ssl.certificate_reload.debounce. Automated tooling (cert-manager, K8s secret mounts) can use the default or a shorter value, while manual rotation workflows may benefit from a longer window to allow time between replacing the cert and key files
  • Validation: Validates the new cert/key pair with tls.LoadX509KeyPair before swapping; invalid pairs keep the old cert active and log an error
  • Serving: Uses tls.Config.GetCertificate callback with atomic.Pointer[tls.Certificate] for lock-free concurrent reads on every TLS handshake
  • Existing connections: Unaffected — only new TLS handshakes see the new cert

New files:

  • internal/pkg/reload/tls/cert_reloader.go: CertReloader type with New(), GetCertificate(), and Run() methods
  • internal/pkg/reload/tls/cert_reloader_test.go: unit tests

Modified files:

  • internal/pkg/config/input.go: ServerTLSConfig wrapper around tlscommon.ServerConfig with CertificateReload field
  • internal/pkg/api/server.go: wires CertReloader into TLS setup when feature is enabled
  • internal/pkg/api/server_test.go: updates existing tests for new type, adds end-to-end cert reload test
  • fleet-server.reference.yml: documents the new config option

How to test this PR locally

1. Build Fleet Server

mage build:local

2. Generate a CA and server cert/key pair

mkdir -p /tmp/tls-reload-test
cd /tmp/tls-reload-test

# Generate CA
openssl req -x509 -newkey rsa:2048 -keyout ca-key.pem -out ca.pem -days 1 -nodes -subj "/CN=Test CA"

# Generate server cert signed by the CA
openssl req -newkey rsa:2048 -keyout server-key.pem -out server.csr -nodes \
  -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
openssl x509 -req -in server.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial \
  -out server-cert.pem -days 1 -copy_extensions copyall

3. Create a config file

cat > /tmp/tls-reload-test/fleet-server.yml << 'EOF'
output:
  elasticsearch:
    hosts: ['localhost:9200']
    service_token: 'fake-token'

fleet:
  agent:
    id: test-agent-id

inputs:
  - type: fleet-server
    server:
      host: localhost
      port: 8220
      ssl:
        enabled: true
        certificate: /tmp/tls-reload-test/server-cert.pem
        key: /tmp/tls-reload-test/server-key.pem
        certificate_reload:
          enabled: true
EOF

4. Start Fleet Server

bin/fleet-server -c /tmp/tls-reload-test/fleet-server.yml \
  -E logging.to_files=true \
  -E logging.to_stderr=false \
  -E logging.files.path=/tmp/tls-reload-test \
  -E logging.files.name=fleet-server.log &

Wait for the "Listening on localhost:8220" log line:

grep "Listening" /tmp/tls-reload-test/fleet-server.log-*.ndjson

5. Make a TLS connection and note the server cert fingerprint

openssl s_client -connect localhost:8220 -CAfile /tmp/tls-reload-test/ca.pem \
  -servername localhost </dev/null 2>&1 | openssl x509 -noout -serial -fingerprint

6. Replace the cert/key files with a new pair (same CA)

# Generate the new cert/key outside the watched directory
openssl req -newkey rsa:2048 -keyout /tmp/server-key2.pem -out /tmp/server2.csr -nodes \
  -subj "/CN=localhost" -addext "subjectAltName=DNS:localhost,IP:127.0.0.1"
openssl x509 -req -in /tmp/server2.csr -CA /tmp/tls-reload-test/ca.pem \
  -CAkey /tmp/tls-reload-test/ca-key.pem -CAcreateserial \
  -out /tmp/server-cert2.pem -days 1 -copy_extensions copyall

# Move the new files into the watched directory
mv /tmp/server-cert2.pem /tmp/tls-reload-test/server-cert.pem
mv /tmp/server-key2.pem /tmp/tls-reload-test/server-key.pem

7. Wait ~5 seconds for the debounce period, then check logs

grep -E "detected change|reloaded" /tmp/tls-reload-test/fleet-server.log-*.ndjson

You should see "TLS certificate reloaded successfully".

8. Make another TLS connection — the server should present the new cert

openssl s_client -connect localhost:8220 -CAfile /tmp/tls-reload-test/ca.pem \
  -servername localhost </dev/null 2>&1 | openssl x509 -noout -serial -fingerprint

The fingerprint should differ from step 5.

9. Stop Fleet Server

kill %1

Automated tests:

  • go test ./internal/pkg/reload/tls/... — unit tests for CertReloader (valid/invalid/missing cert pairs, cert change reload, invalid cert keeps old, debounce timer reset, context cancellation)
  • go test -run Test_server_TLSCertReload ./internal/pkg/api/... — end-to-end integration test

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@ycombinator ycombinator added the enhancement New feature or request label Apr 15, 2026
ycombinator and others added 11 commits April 15, 2026 11:25
Add github.com/fsnotify/fsnotify v1.9.0 as a direct dependency.
This will be used to watch TLS certificate and key files for changes,
enabling hot-reload without server restart (issue elastic#6433).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce ServerTLSConfig that wraps tlscommon.ServerConfig with an
additional CertificateReload field, enabling opt-in TLS certificate
hot-reload. The new config lives under ssl.certificate_reload.enabled
and defaults to false (disabled).

Update Server.TLS field type from *tlscommon.ServerConfig to
*ServerTLSConfig. Promoted methods (IsEnabled, Validate, DiagCerts)
continue to work through embedding. Update call sites in server.go
and server_test.go to use the new wrapper type.

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new reload/tls package with a CertReloader type that watches TLS
certificate and key files on disk using fsnotify and atomically reloads
them when changes are detected.

Key design decisions:
- Watches parent directories (not files directly) to handle atomic
  rename/replace used by cert-manager and Kubernetes secret mounts
- Debounces file change events (default 5s) to handle non-atomic writes
  of cert and key as separate operations
- Validates new cert/key pair with tls.LoadX509KeyPair before swapping;
  invalid pairs keep old cert active and log an error
- Uses atomic.Pointer[tls.Certificate] for lock-free concurrent reads
- Exposes GetCertificate callback for tls.Config.GetCertificate

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover:
- Valid and invalid cert pair loading
- Missing files and empty paths
- Certificate change detection and reload after debounce
- Invalid new cert keeps old cert active
- Debounce timer reset on additional file changes
- Clean shutdown on context cancellation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ssl.certificate_reload.enabled is true, the server now creates a
CertReloader that watches cert/key files for changes. The reloader's
GetCertificate callback is set on the tls.Config so that every new TLS
handshake serves the latest certificate. The static Certificates slice
is cleared to ensure Go's TLS library always uses GetCertificate.

The reloader goroutine is tied to the server's context and shuts down
cleanly when the server stops.

When the feature is disabled (default), the code path is unchanged.

Part of elastic#6433.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Test_server_TLSCertReload that verifies end-to-end certificate
rotation through a running server:
1. Starts server with certificate_reload enabled
2. Makes HTTPS request, captures server cert from TLS handshake
3. Writes a new cert/key pair to disk
4. Waits for debounce period
5. Makes another HTTPS request and asserts the cert has changed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new ssl.certificate_reload.enabled setting in the
reference configuration file.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run mage check:imports to fix struct field alignment in input.go.
Run mage check:notice to regenerate NOTICE files after adding
fsnotify as a direct dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improve readability of test cases by adding comments that explain
the setup, the action being tested, and what each assertion verifies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mergify
Contributor

mergify bot commented Apr 15, 2026

This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

ycombinator and others added 5 commits April 15, 2026 11:36
…tests

Replace fixed-duration sleeps with polling assertions to make tests
less flaky and faster. Also reduces the debounce period in
TestReload_Debounce from 500ms to 200ms for quicker test execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the debounce=0 sentinel value with a WithDebounce option,
making the default debounce implicit rather than relying on a magic
zero value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ycombinator ycombinator added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Apr 15, 2026
@ycombinator ycombinator requested a review from swiatekm April 15, 2026 18:48
@ycombinator ycombinator marked this pull request as ready for review April 15, 2026 18:48
@ycombinator ycombinator requested a review from a team as a code owner April 15, 2026 18:48
@ycombinator ycombinator requested a review from blakerouse April 15, 2026 18:48
ycombinator and others added 2 commits April 15, 2026 12:09
Add a `debounce` duration field to `ssl.certificate_reload` so users
can tune the delay between detecting a file change and reloading the
cert/key pair. Defaults to 5s when not specified.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>


Development

Successfully merging this pull request may close these issues.

Allow Fleet Server to reload TLS certificates without restarting

1 participant