Allow Fleet Server to reload TLS certificates without restarting #6838
Open
ycombinator wants to merge 19 commits into elastic:main
Add github.com/fsnotify/fsnotify v1.9.0 as a direct dependency. This will be used to watch TLS certificate and key files for changes, enabling hot-reload without server restart (issue elastic#6433). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Introduce ServerTLSConfig that wraps tlscommon.ServerConfig with an additional CertificateReload field, enabling opt-in TLS certificate hot-reload. The new config lives under ssl.certificate_reload.enabled and defaults to false (disabled). Update Server.TLS field type from *tlscommon.ServerConfig to *ServerTLSConfig. Promoted methods (IsEnabled, Validate, DiagCerts) continue to work through embedding. Update call sites in server.go and server_test.go to use the new wrapper type. Part of elastic#6433. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
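As a rough illustration of why the promoted methods keep working, here is a minimal sketch of the embedding pattern. `BaseServerConfig` is a hypothetical stand-in for `tlscommon.ServerConfig`, and its fields and `IsEnabled` logic are assumptions made for this example, not the real library code.

```go
package main

import "fmt"

// BaseServerConfig is a hypothetical stand-in for tlscommon.ServerConfig.
type BaseServerConfig struct {
	Enabled *bool
}

// IsEnabled treats an unset flag as enabled (an assumption of this sketch).
func (c *BaseServerConfig) IsEnabled() bool {
	return c != nil && (c.Enabled == nil || *c.Enabled)
}

// CertificateReload holds the opt-in hot-reload settings.
type CertificateReload struct {
	Enabled bool // zero value false, so the feature is off by default
}

// ServerTLSConfig embeds the base config, so its methods are promoted
// and existing callers of IsEnabled need no changes.
type ServerTLSConfig struct {
	BaseServerConfig
	CertificateReload CertificateReload
}

func main() {
	cfg := &ServerTLSConfig{}
	fmt.Println("tls enabled:", cfg.IsEnabled())
	fmt.Println("reload enabled:", cfg.CertificateReload.Enabled)
}
```

Because the wrapper only adds a field, call sites change their type but not how they call the existing methods.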
Add a new reload/tls package with a CertReloader type that watches TLS certificate and key files on disk using fsnotify and atomically reloads them when changes are detected.

Key design decisions:
- Watches parent directories (not files directly) to handle atomic rename/replace used by cert-manager and Kubernetes secret mounts
- Debounces file change events (default 5s) to handle non-atomic writes of cert and key as separate operations
- Validates new cert/key pair with tls.LoadX509KeyPair before swapping; invalid pairs keep old cert active and log an error
- Uses atomic.Pointer[tls.Certificate] for lock-free concurrent reads
- Exposes GetCertificate callback for tls.Config.GetCertificate

Part of elastic#6433.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests cover:
- Valid and invalid cert pair loading
- Missing files and empty paths
- Certificate change detection and reload after debounce
- Invalid new cert keeps old cert active
- Debounce timer reset on additional file changes
- Clean shutdown on context cancellation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When ssl.certificate_reload.enabled is true, the server now creates a CertReloader that watches cert/key files for changes. The reloader's GetCertificate callback is set on the tls.Config so that every new TLS handshake serves the latest certificate. The static Certificates slice is cleared to ensure Go's TLS library always uses GetCertificate. The reloader goroutine is tied to the server's context and shuts down cleanly when the server stops. When the feature is disabled (default), the code path is unchanged. Part of elastic#6433. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds Test_server_TLSCertReload that verifies end-to-end certificate rotation through a running server:
1. Starts server with certificate_reload enabled
2. Makes HTTPS request, captures server cert from TLS handshake
3. Writes a new cert/key pair to disk
4. Waits for debounce period
5. Makes another HTTPS request and asserts the cert has changed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document the new ssl.certificate_reload.enabled setting in the reference configuration file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Run mage check:imports to fix struct field alignment in input.go. Run mage check:notice to regenerate NOTICE files after adding fsnotify as a direct dependency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Improve readability of test cases by adding comments that explain the setup, the action being tested, and what each assertion verifies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
…tests Replace fixed-duration sleeps with polling assertions to make tests less flaky and faster. Also reduces the debounce period in TestReload_Debounce from 500ms to 200ms for quicker test execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the debounce=0 sentinel value with a WithDebounce option, making the default debounce implicit rather than relying on a magic zero value. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a `debounce` duration field to `ssl.certificate_reload` so users can tune the delay between detecting a file change and reloading the cert/key pair. Defaults to 5s when not specified. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What is the problem this PR solves?
Fleet Server currently requires an explicit restart to reload TLS certificates used for serving HTTPS requests to Elastic Agents. In environments with frequent certificate rotation — Kubernetes (where Secrets are remounted into pods), serverless (where cert-manager manages per-project certificates), and general ops — this creates operational burden and requires external orchestration to coordinate cert rotation with server restarts.
How does this PR solve the problem?
Adds an opt-in `ssl.certificate_reload.enabled` config option (disabled by default) that, when enabled, watches the configured cert/key files on disk using `fsnotify` and hot-reloads them into the running TLS listener without a restart.

Key design decisions:
- Debounces file change events (default 5s); the period is configurable via `ssl.certificate_reload.debounce`. Automated tooling (cert-manager, K8s secret mounts) can use the default or a shorter value, while manual rotation workflows may benefit from a longer window to allow time between replacing the cert and key files
- Validates the new cert/key pair with `tls.LoadX509KeyPair` before swapping; invalid pairs keep the old cert active and log an error
- Uses a `tls.Config.GetCertificate` callback with `atomic.Pointer[tls.Certificate]` for lock-free concurrent reads on every TLS handshake

New files:
- `internal/pkg/reload/tls/cert_reloader.go`: `CertReloader` type with `New()`, `GetCertificate()`, and `Run()` methods
- `internal/pkg/reload/tls/cert_reloader_test.go`: unit tests

Modified files:
- `internal/pkg/config/input.go`: `ServerTLSConfig` wrapper around `tlscommon.ServerConfig` with `CertificateReload` field
- `internal/pkg/api/server.go`: wires `CertReloader` into TLS setup when the feature is enabled
- `internal/pkg/api/server_test.go`: updates existing tests for the new type, adds end-to-end cert reload test
- `fleet-server.reference.yml`: documents the new config option

How to test this PR locally
1. Build Fleet Server
2. Generate a CA and server cert/key pair
3. Create a config file
4. Start Fleet Server
    bin/fleet-server -c /tmp/tls-reload-test/fleet-server.yml \
      -E logging.to_files=true \
      -E logging.to_stderr=false \
      -E logging.files.path=/tmp/tls-reload-test \
      -E logging.files.name=fleet-server.log &

Wait for the "Listening on localhost:8220" log line.

5. Make a TLS connection and note the server cert fingerprint
6. Replace the cert/key files with a new pair (same CA)
7. Wait ~5 seconds for the debounce period, then check logs
You should see "TLS certificate reloaded successfully".

8. Make another TLS connection; the server should present the new cert
The fingerprint should differ from step 5.
9. Stop Fleet Server

    kill %1

Automated tests:
- `go test ./internal/pkg/reload/tls/...`: unit tests for CertReloader (valid/invalid/missing cert pairs, cert change reload, invalid cert keeps old, debounce timer reset, context cancellation)
- `go test -run Test_server_TLSCertReload ./internal/pkg/api/...`: end-to-end integration test

Design Checklist
Checklist
`./changelog/fragments` using the changelog tool

Related issues